Flexible GraphRAG - Incremental Update System¶

Automatically synchronizes vector, search, and graph indexes when documents change in monitored data sources.

Overview¶

The Incremental Update System provides real-time and periodic synchronization of your knowledge base. When documents are added, modified, or deleted in your data sources, the system automatically updates all indexes without manual intervention.

Key Benefits: - Automatic Sync: Changes propagate to indexes automatically - Real-time Updates: Event-driven detection (when supported) for sub-second latency - Cost Efficient: Skip unnecessary reprocessing with content hashing - Robust: Handles failures gracefully with retry logic and partial sync recovery - Scalable: Supports multiple data sources simultaneously

Features¶

Intelligent Change Detection¶

Ordinal-Based Versioning: Monotonic timestamp tracking (microsecond precision) ensures correct ordering
Content Hash Optimization: Skip reprocessing when only timestamps change
Per-Document Tracking: Prevents concurrent processing conflicts
Multi-Target Coordination: Independent sync state for Vector, Search, and Graph indexes

Dual Update Mechanism¶

Event-Driven Streams (when supported) - Real-time change notifications for instant updates
Periodic Polling (configurable interval) - Comprehensive fallback that catches any missed changes

Supported Data Sources¶

Source	Change Detection	Status
Filesystem	Real-time OS events (watchdog)	✅ Production
Amazon S3	SQS event notifications	✅ Production
Alfresco	ActiveMQ events + Polling	✅ Production
Google Drive	Changes API (polling)	✅ Ready
Azure Blob Storage	Change feed iterator	✅ Ready
Google Cloud Storage	Pub/Sub notifications	✅ Ready
Box	Events API (polling)	✅ Ready
Microsoft Graph (SharePoint/OneDrive)	Delta query (polling)	✅ Ready

Architecture¶

+-----------------+
|  Data Sources   |  Filesystem, S3, Alfresco, Google Drive, etc.
+--------+--------+
         | Changes detected
         v
+-----------------+
|  Change         |  Event streams + Periodic polling
|  Detectors      |
+--------+--------+
         | Events: CREATE, UPDATE, DELETE
         v
+-----------------+
|  Incremental    |  Deduplication, batching, ordering
|  Update Engine  |
+--------+--------+
         | Process changes
         v
+-----------------+
|  Backend        |  Document processing (Docling/LlamaParse)
|  Pipeline       |  Chunking, embedding generation
+--------+--------+
         | Update indexes
         v
+-----------------+
|  Vector Search  |  Qdrant (vectors) + Elasticsearch (text)
|  Graph Indexes  |  Neo4j (knowledge graph)
+-----------------+

Quick Links¶

Quick Start Guide - Get running in 5 minutes
Setup Guide - Complete installation and configuration
API Reference - REST API documentation
S3 Setup Guide - Configure S3 with event notifications

How It Works¶

1. Monitor Data Sources¶

Each configured data source is monitored by a Change Detector that tracks: - New files/documents (CREATE) - Modified files/documents (UPDATE) - Deleted files/documents (DELETE)

2. Detect Changes¶

Changes are detected through:

Event Streams (preferred): - Filesystem: OS-level file system events - S3: SQS notifications from S3 bucket events - Alfresco: ActiveMQ message queue

Periodic Polling (fallback): - List all files and compare with stored state - Detect additions, modifications, and deletions - Configurable interval (default: 5 minutes)

3. Process Changes¶

The Incremental Update Engine: 1. Deduplicates events (same file changed multiple times) 2. Orders by ordinal (timestamp) to prevent race conditions 3. Checks content hash to skip unnecessary reprocessing 4. Routes to appropriate handler (ADD/UPDATE/DELETE)

4. Update Indexes¶

For ADD/UPDATE: - Load document from source - Process with DocumentProcessor (Docling/LlamaParse) - Generate embeddings - Update vector store (Qdrant) - Update search index (Elasticsearch) - Extract entities and relationships for graph (Neo4j) if enabled

For DELETE: - Remove from vector store by document ID - Remove from search index by document ID - Remove nodes/relationships from graph by document ID

5. Track State¶

PostgreSQL stores sync state: - datasource_config: Data source configurations and sync status - document_state: Per-document tracking (content hash, ordinal, source ID)

Configuration¶

Environment Variables¶

Set in your .env file:

# PostgreSQL for incremental updates state management
POSTGRES_INCREMENTAL_URL=postgresql://user:pass@localhost:5433/flexible_graphrag_incremental

# Enable incremental updates
ENABLE_INCREMENTAL_UPDATES=true

Data Source Configuration¶

Configure via UI or REST API:

Source Name: Human-readable identifier
Source Type: filesystem, s3, alfresco, google_drive, etc.
Connection Parameters: Source-specific (paths, credentials, buckets, etc.)
Refresh Interval: Seconds between periodic scans (default: 300)
Enable Change Stream: Use event-driven detection when available
Skip Graph: Optionally skip knowledge graph extraction for faster syncs

Management¶

Via Web UI¶

Navigate to "Processing" or "Data Sources" tab
Enable sync when adding/editing a data source
View sync status and last sync time
Trigger manual sync on-demand

Via REST API¶

See API-REFERENCE.md for complete documentation.

Quick examples:

Enable sync during ingest:

POST /api/ingest
{
  "data_source": "filesystem",
  "paths": ["/data/documents"],
  "enable_sync": true,
  "skip_graph": true
}

Trigger manual sync:

POST /api/incremental/sync/{config_id}

Performance¶

Event-Driven vs Polling¶

Metric	Event-Driven	Polling (5 min)
Latency	1-5 seconds	1-5 minutes
API Calls	~1 per change	Full scan every interval
Cost	Very low	Moderate
Scalability	Excellent	Good

Content Hashing¶

When a file's timestamp changes but content is identical: - Without hashing: Full reprocessing (~10-60 seconds per document) - With hashing: Update ordinal only (~10ms per document)

Result: 100-1000x faster for timestamp-only changes

Troubleshooting¶

System Not Detecting Changes¶

Check backend logs for incremental system startup
Verify datasource is marked as active in PostgreSQL
Confirm enable_change_stream is true for event-driven sources
Check source-specific requirements (SQS for S3, ActiveMQ for Alfresco, etc.)

Changes Detected But Not Applied¶

Check document_state table for records
Verify vector/search indexes are accessible
Look for errors in backend processing logs
Ensure correct credentials for data source

Duplicate Processing¶

Verify document_state records have correct source_id
Check for multiple detectors monitoring same location
Ensure ordinal values are monotonically increasing

Performance Issues¶

Reduce refresh_interval if polling too frequently
Enable skip_graph to skip knowledge graph extraction
Increase batch sizes in engine configuration
Use event-driven detection instead of polling

Database Schema¶

Two main tables in PostgreSQL:

datasource_config: Stores data source configurations - config_id (UUID, primary key) - source_name, source_type - connection_params (JSON) - is_active, sync_status - refresh_interval_seconds - enable_change_stream, skip_graph

document_state: Tracks processed documents - doc_id (primary key, format: config_id:filename) - config_id (foreign key) - source_path (filename or path) - content_hash (for change detection) - ordinal (microsecond timestamp) - source_id (cloud file ID for DELETE operations)

See schema.sql for complete schema definition.

Security¶

Store credentials in environment variables, not in datasource configs
Use read-only access for data sources when possible
Restrict PostgreSQL access to backend only
Enable SSL for PostgreSQL connections in production
Use IAM roles for cloud services (S3, GCS, Azure) instead of static credentials

License¶

See main project LICENSE file.