Flexible GraphRAG - Incremental Update System
Automatically synchronizes vector, search, and graph indexes when documents change in monitored data sources.
Overview
The Incremental Update System provides real-time and periodic synchronization of your knowledge base. When documents are added, modified, or deleted in your data sources, the system automatically updates all indexes without manual intervention.
Key Benefits:
- Automatic Sync: Changes propagate to indexes automatically
- Real-time Updates: Event-driven detection (when supported) with latency of a few seconds
- Cost Efficient: Skip unnecessary reprocessing with content hashing
- Robust: Handles failures gracefully with retry logic and partial sync recovery
- Scalable: Supports multiple data sources simultaneously
Features
Intelligent Change Detection
- Ordinal-Based Versioning: Monotonic timestamp tracking (microsecond precision) ensures correct ordering
- Content Hash Optimization: Skip reprocessing when only timestamps change
- Per-Document Tracking: Prevents concurrent processing conflicts
- Multi-Target Coordination: Independent sync state for Vector, Search, and Graph indexes
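As an illustration of ordinal-based versioning, here is a minimal sketch of a clock that issues strictly increasing microsecond ordinals. The `OrdinalClock` class and its tie-breaking rule (bump by one microsecond when two events collide or the clock steps backwards) are assumptions for illustration, not the project's actual implementation.

```python
import time
from typing import Optional


class OrdinalClock:
    """Hypothetical helper: issues strictly increasing microsecond ordinals."""

    def __init__(self) -> None:
        self._last = 0

    def next(self, timestamp_s: Optional[float] = None) -> int:
        # Use the supplied timestamp (seconds since epoch) or the current time.
        ts = time.time() if timestamp_s is None else timestamp_s
        candidate = int(ts * 1_000_000)
        # If the candidate is not newer than the last ordinal issued,
        # bump by one microsecond so ordering stays strictly monotonic.
        self._last = max(candidate, self._last + 1)
        return self._last


clock = OrdinalClock()
first = clock.next(1700000000.0)
second = clock.next(1700000000.0)   # same timestamp as the first event
third = clock.next(1699999999.0)    # clock stepped backwards
```

With a scheme like this, two changes to the same document can always be ordered, even when their source timestamps are identical.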
Dual Update Mechanism
- Event-Driven Streams (when supported) - Real-time change notifications for instant updates
- Periodic Polling (configurable interval) - Comprehensive fallback that catches any missed changes
Supported Data Sources
| Source | Change Detection | Status |
|---|---|---|
| Filesystem | Real-time OS events (watchdog) | ✅ Production |
| Amazon S3 | SQS event notifications | ✅ Production |
| Alfresco | ActiveMQ events + Polling | ✅ Production |
| Google Drive | Changes API (polling) | ✅ Ready |
| Azure Blob Storage | Change feed iterator | ✅ Ready |
| Google Cloud Storage | Pub/Sub notifications | ✅ Ready |
| Box | Events API (polling) | ✅ Ready |
| Microsoft Graph (SharePoint/OneDrive) | Delta query (polling) | ✅ Ready |
Architecture
+-----------------+
| Data Sources    |  Filesystem, S3, Alfresco, Google Drive, etc.
+--------+--------+
         |  Changes detected
         v
+-----------------+
| Change          |  Event streams + Periodic polling
| Detectors       |
+--------+--------+
         |  Events: CREATE, UPDATE, DELETE
         v
+-----------------+
| Incremental     |  Deduplication, batching, ordering
| Update Engine   |
+--------+--------+
         |  Process changes
         v
+-----------------+
| Backend         |  Document processing (Docling/LlamaParse)
| Pipeline        |  Chunking, embedding generation
+--------+--------+
         |  Update indexes
         v
+-----------------+
| Vector Search   |  Qdrant (vectors) + Elasticsearch (text)
| Graph Indexes   |  Neo4j (knowledge graph)
+-----------------+
Quick Links
- Quick Start Guide - Get running in 5 minutes
- Setup Guide - Complete installation and configuration
- API Reference - REST API documentation
- S3 Setup Guide - Configure S3 with event notifications
How It Works
1. Monitor Data Sources
Each configured data source is monitored by a Change Detector that tracks:
- New files/documents (CREATE)
- Modified files/documents (UPDATE)
- Deleted files/documents (DELETE)
2. Detect Changes
Changes are detected through:
Event Streams (preferred):
- Filesystem: OS-level file system events
- S3: SQS notifications from S3 bucket events
- Alfresco: ActiveMQ message queue
Periodic Polling (fallback):
- List all files and compare with stored state
- Detect additions, modifications, and deletions
- Configurable interval (default: 5 minutes)
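The polling comparison can be sketched as a diff between the stored state and a fresh listing. This is a minimal sketch that assumes both sides are reduced to a mapping of path to modification ordinal; the function name `diff_listing` and that representation are hypothetical.

```python
from typing import Dict, List, Tuple


def diff_listing(stored: Dict[str, int], current: Dict[str, int]) -> List[Tuple[str, str]]:
    """Compare path -> ordinal maps and return sorted (event, path) pairs."""
    events: List[Tuple[str, str]] = []
    for path, ordinal in current.items():
        if path not in stored:
            events.append(("CREATE", path))      # present now, unknown before
        elif ordinal > stored[path]:
            events.append(("UPDATE", path))      # newer than the stored version
    for path in stored:
        if path not in current:
            events.append(("DELETE", path))      # known before, missing now
    return sorted(events)


stored_state = {"a.txt": 1, "b.txt": 5, "c.txt": 9}
fresh_listing = {"a.txt": 1, "b.txt": 7, "d.txt": 2}
changes = diff_listing(stored_state, fresh_listing)
```

Because the diff is computed against the full listing, it also picks up changes whose event notification was lost, which is what makes polling a useful fallback.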
3. Process Changes
The Incremental Update Engine:
1. Deduplicates events (same file changed multiple times)
2. Orders by ordinal (timestamp) to prevent race conditions
3. Checks content hash to skip unnecessary reprocessing
4. Routes to appropriate handler (ADD/UPDATE/DELETE)
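The first three steps can be sketched as a single planning function. This is an illustrative sketch only: the `ChangeEvent` dataclass, the `plan_batch` function, and the in-memory `known_hashes` mapping are assumptions and do not reflect the engine's real types.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class ChangeEvent:
    doc_id: str
    kind: str                      # "CREATE", "UPDATE", or "DELETE"
    ordinal: int                   # microsecond timestamp
    content_hash: Optional[str] = None


def plan_batch(events: Iterable[ChangeEvent], known_hashes: Dict[str, str]) -> List[ChangeEvent]:
    # 1. Deduplicate: keep only the newest event per document.
    latest: Dict[str, ChangeEvent] = {}
    for ev in events:
        prev = latest.get(ev.doc_id)
        if prev is None or ev.ordinal > prev.ordinal:
            latest[ev.doc_id] = ev
    # 2. Order by ordinal so older changes are applied first.
    ordered = sorted(latest.values(), key=lambda e: e.ordinal)
    # 3. Drop non-delete events whose content hash matches the stored one.
    plan = []
    for ev in ordered:
        unchanged = ev.content_hash is not None and known_hashes.get(ev.doc_id) == ev.content_hash
        if ev.kind != "DELETE" and unchanged:
            continue
        plan.append(ev)
    return plan


incoming = [
    ChangeEvent("a", "UPDATE", 3, "h1"),
    ChangeEvent("a", "UPDATE", 5, "h2"),   # supersedes the event above
    ChangeEvent("b", "DELETE", 4),
    ChangeEvent("c", "UPDATE", 1, "same"), # content unchanged, will be skipped
]
plan = plan_batch(incoming, known_hashes={"a": "h1", "c": "same"})
```

Each event left in the plan would then be routed to the handler matching its kind.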
4. Update Indexes
For ADD/UPDATE:
- Load document from source
- Process with DocumentProcessor (Docling/LlamaParse)
- Generate embeddings
- Update vector store (Qdrant)
- Update search index (Elasticsearch)
- Extract entities and relationships for graph (Neo4j) if enabled
For DELETE:
- Remove from vector store by document ID
- Remove from search index by document ID
- Remove nodes/relationships from graph by document ID
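Since a DELETE touches three stores by the same document ID, one way to handle partial failures is to fan out and record the outcome per target so that only the failed target needs a retry. The sketch below is hypothetical: the plain callables stand in for real Qdrant, Elasticsearch, and Neo4j client calls, and `delete_everywhere` is not a project function.

```python
from typing import Callable, Dict


def delete_everywhere(doc_id: str, targets: Dict[str, Callable[[str], object]]) -> Dict[str, bool]:
    """Call each target's delete function and report success per target."""
    results: Dict[str, bool] = {}
    for name, delete in targets.items():
        try:
            delete(doc_id)
            results[name] = True
        except Exception:
            # Record the failure instead of aborting, so the other
            # targets still get cleaned up and this one can be retried.
            results[name] = False
    return results


# In-memory stand-ins for the three indexes (assumptions for the example).
vector_store = {"doc-1": "embedding"}
search_index = {"doc-1": "text"}


def failing_graph_delete(doc_id: str) -> None:
    raise ConnectionError("graph backend unavailable")


outcome = delete_everywhere(
    "doc-1",
    {"vector": vector_store.pop, "search": search_index.pop, "graph": failing_graph_delete},
)
```

Keeping a per-target result mirrors the idea of independent sync state for the vector, search, and graph indexes.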
5. Track State
PostgreSQL stores sync state:
- datasource_config: Data source configurations and sync status
- document_state: Per-document tracking (content hash, ordinal, source ID)
Configuration
Environment Variables
Set in your .env file:
# PostgreSQL for incremental updates state management
POSTGRES_INCREMENTAL_URL=postgresql://user:pass@localhost:5433/flexible_graphrag_incremental
# Enable incremental updates
ENABLE_INCREMENTAL_UPDATES=true
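For reference, a backend could read these two variables roughly as follows. This is a sketch under assumptions: the function name `load_incremental_settings` is hypothetical, and treating only the string "true" (case-insensitive) as enabled is a guess about how the flag is parsed.

```python
import os
from typing import Dict, Mapping, Optional, Union


def load_incremental_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Union[bool, Optional[str]]]:
    """Read the incremental-update settings from an environment mapping."""
    # Assumption: only "true" (any case) enables the feature.
    enabled = env.get("ENABLE_INCREMENTAL_UPDATES", "false").strip().lower() == "true"
    url = env.get("POSTGRES_INCREMENTAL_URL")
    if enabled and not url:
        raise ValueError("POSTGRES_INCREMENTAL_URL must be set when incremental updates are enabled")
    return {"enabled": enabled, "postgres_url": url}


settings = load_incremental_settings({
    "ENABLE_INCREMENTAL_UPDATES": "true",
    "POSTGRES_INCREMENTAL_URL": "postgresql://user:pass@localhost:5433/flexible_graphrag_incremental",
})
```

Failing fast when the flag is on but the URL is missing surfaces misconfiguration at startup rather than at the first sync.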
Data Source Configuration
Configure via UI or REST API:
- Source Name: Human-readable identifier
- Source Type: filesystem, s3, alfresco, google_drive, etc.
- Connection Parameters: Source-specific (paths, credentials, buckets, etc.)
- Refresh Interval: Seconds between periodic scans (default: 300)
- Enable Change Stream: Use event-driven detection when available
- Skip Graph: Optionally skip knowledge graph extraction for faster syncs
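The fields above map naturally onto a small configuration object. The dataclass below is a sketch, not the project's model: only the 300-second refresh default comes from this document, while the defaults for `enable_change_stream` and `skip_graph` are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class DataSourceConfig:
    source_name: str                                   # human-readable identifier
    source_type: str                                   # e.g. "filesystem", "s3", "alfresco"
    connection_params: Dict[str, Any] = field(default_factory=dict)
    refresh_interval_seconds: int = 300                # documented default
    enable_change_stream: bool = True                  # assumed default
    skip_graph: bool = False                           # assumed default

    def __post_init__(self) -> None:
        # A non-positive interval would make periodic polling meaningless.
        if self.refresh_interval_seconds <= 0:
            raise ValueError("refresh_interval_seconds must be positive")


docs_source = DataSourceConfig(
    source_name="Team documents",
    source_type="filesystem",
    connection_params={"paths": ["/data/documents"]},
)
```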
Management
Via Web UI
- Navigate to "Processing" or "Data Sources" tab
- Enable sync when adding/editing a data source
- View sync status and last sync time
- Trigger manual sync on-demand
Via REST API
See API-REFERENCE.md for complete documentation.
Quick examples:
Enable sync during ingest:
POST /api/ingest
{
  "data_source": "filesystem",
  "paths": ["/data/documents"],
  "enable_sync": true,
  "skip_graph": true
}
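The same ingest request can be built from Python using only the standard library. In this sketch the base URL `http://localhost:8000` and the helper name `build_ingest_request` are assumptions; the path and JSON fields are taken from the example above.

```python
import json
import urllib.request
from typing import Iterable


def build_ingest_request(base_url: str, paths: Iterable[str],
                         enable_sync: bool = True, skip_graph: bool = False) -> urllib.request.Request:
    """Build (but do not send) a POST /api/ingest request for a filesystem source."""
    payload = {
        "data_source": "filesystem",
        "paths": list(paths),
        "enable_sync": enable_sync,
        "skip_graph": skip_graph,
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/ingest",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed backend address; adjust to wherever the API is actually served.
request = build_ingest_request("http://localhost:8000", ["/data/documents"], skip_graph=True)
# To send it: urllib.request.urlopen(request)
```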
Trigger manual sync:
Performance
Event-Driven vs Polling
| Metric | Event-Driven | Polling (5 min) |
|---|---|---|
| Latency | 1-5 seconds | 1-5 minutes |
| API Calls | ~1 per change | Full scan every interval |
| Cost | Very low | Moderate |
| Scalability | Excellent | Good |
Content Hashing
When a file's timestamp changes but content is identical:
- Without hashing: Full reprocessing (~10-60 seconds per document)
- With hashing: Update ordinal only (~10ms per document)
Result: about 1,000x or more faster for timestamp-only changes (10 seconds vs. 10 ms at the low end)
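The hash check itself is cheap to express. The sketch below assumes SHA-256 as the hash function and uses hypothetical names (`content_hash`, `choose_action`); the document does not state which algorithm the system uses.

```python
import hashlib
from typing import Optional


def content_hash(data: bytes) -> str:
    """Hex digest of the document bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def choose_action(stored_hash: Optional[str], data: bytes) -> str:
    """Decide whether a detected change needs the full pipeline."""
    if stored_hash is not None and stored_hash == content_hash(data):
        # Same bytes as last time: only the stored ordinal needs refreshing.
        return "update_ordinal_only"
    # New document or changed content: run parsing, embedding, and indexing.
    return "reprocess"


previous = content_hash(b"quarterly report v1")
touched_only = choose_action(previous, b"quarterly report v1")
edited = choose_action(previous, b"quarterly report v2")
never_seen = choose_action(None, b"quarterly report v1")
```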
Troubleshooting
System Not Detecting Changes
- Check backend logs for incremental system startup
- Verify datasource is marked as active in PostgreSQL
- Confirm enable_change_stream is true for event-driven sources
- Check source-specific requirements (SQS for S3, ActiveMQ for Alfresco, etc.)
Changes Detected But Not Applied
- Check document_state table for records
- Verify vector/search indexes are accessible
- Look for errors in backend processing logs
- Ensure correct credentials for data source
Duplicate Processing
- Verify document_state records have correct source_id
- Check for multiple detectors monitoring same location
- Ensure ordinal values are monotonically increasing
Performance Issues
- Increase refresh_interval if polling too frequently
- Enable skip_graph to skip knowledge graph extraction
- Increase batch sizes in engine configuration
- Use event-driven detection instead of polling
Database Schema
Two main tables in PostgreSQL:
datasource_config: Stores data source configurations
- config_id (UUID, primary key)
- source_name, source_type
- connection_params (JSON)
- is_active, sync_status
- refresh_interval_seconds
- enable_change_stream, skip_graph
document_state: Tracks processed documents
- doc_id (primary key, format: config_id:filename)
- config_id (foreign key)
- source_path (filename or path)
- content_hash (for change detection)
- ordinal (microsecond timestamp)
- source_id (cloud file ID for DELETE operations)
See schema.sql for complete schema definition.
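The `config_id:filename` format of doc_id can be built and parsed with two small helpers. These functions are hypothetical; splitting on the first colon is an assumption that relies on UUIDs never containing a colon, so filenames that do contain one still round-trip.

```python
from typing import Tuple


def make_doc_id(config_id: str, filename: str) -> str:
    """Join the data source's config_id and the filename into a doc_id."""
    return f"{config_id}:{filename}"


def split_doc_id(doc_id: str) -> Tuple[str, str]:
    """Split a doc_id back into (config_id, filename) at the first colon."""
    config_id, _, filename = doc_id.partition(":")
    return config_id, filename


example_id = make_doc_id("123e4567-e89b-12d3-a456-426614174000", "notes:v2.md")
round_trip = split_doc_id(example_id)
```

Prefixing with config_id keeps identical filenames from different data sources distinct in the document_state table.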
Security
- Store credentials in environment variables, not in datasource configs
- Use read-only access for data sources when possible
- Restrict PostgreSQL access to backend only
- Enable SSL for PostgreSQL connections in production
- Use IAM roles for cloud services (S3, GCS, Azure) instead of static credentials
License
See main project LICENSE file.