Incremental Updates - Setup Guide

Complete installation and configuration guide for the Incremental Update System.

Overview

The Incremental Update System automatically monitors data sources and keeps your vector, search, and graph indexes synchronized. This guide covers complete setup from scratch.

Prerequisites

  • Flexible GraphRAG backend installed and running
  • PostgreSQL 12+ (can reuse existing postgres-pgvector service)
  • Docker (recommended) or local PostgreSQL installation
  • Python 3.12+ with backend dependencies installed

Step 1: PostgreSQL Setup

The existing Flexible GraphRAG postgres-pgvector service can be reused to store incremental update state.

Option A: Docker (Recommended)

1. Verify Service Configuration

Check docker/docker-compose.yaml includes:

includes:
  - path: includes/postgres-pgvector.yaml

2. Start Services

cd docker
docker-compose -p flexible-graphrag up -d

3. Access pgAdmin

  • URL: http://localhost:5050
  • Email: admin@flexible-graphrag.com
  • Password: admin

4. Create Database

In pgAdmin:

  1. Right-click "Servers" → Create → Server
  2. Name: "FlexibleGraphRAG"
  3. Connection tab:
       • Host: postgres-pgvector (or localhost from outside Docker)
       • Port: 5432 (or 5433 when connecting via localhost from outside Docker)
       • Username: postgres
       • Password: password
  4. Save
  5. Right-click "Databases" → Create → Database
  6. Database name: flexible_graphrag_incremental
  7. Save

5. Run Schema

  1. Open database flexible_graphrag_incremental
  2. Tools → Query Tool
  3. Open file: incremental_updates/schema.sql
  4. Execute (F5)

Option B: Local PostgreSQL

1. Create Database

createdb flexible_graphrag_incremental

2. Run Schema

cd flexible-graphrag/incremental_updates
psql -d flexible_graphrag_incremental -f schema.sql

Step 2: Environment Configuration

Add to your .env file in the flexible-graphrag backend directory:

# === Incremental Updates Configuration ===

# Enable incremental updates
ENABLE_INCREMENTAL_UPDATES=true

# PostgreSQL connection for state management
POSTGRES_INCREMENTAL_URL=postgresql://postgres:password@localhost:5433/flexible_graphrag_incremental

# Alternative: Reuse main GraphRAG database (not recommended)
# POSTGRES_URL=postgresql://postgres:password@localhost:5433/flexible_graphrag_main

Connection String Format:

postgresql://[user]:[password]@[host]:[port]/[database]

Common Configurations:

Docker Compose (from inside backend container):

POSTGRES_INCREMENTAL_URL=postgresql://postgres:password@postgres-pgvector:5432/flexible_graphrag_incremental

Docker Compose (from host machine):

POSTGRES_INCREMENTAL_URL=postgresql://postgres:password@localhost:5433/flexible_graphrag_incremental

Local PostgreSQL:

POSTGRES_INCREMENTAL_URL=postgresql://postgres:password@localhost:5432/flexible_graphrag_incremental
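Before starting the backend, you can verify that the connection string resolves and the database exists. A minimal sketch using asyncpg (the library the backend's state manager already uses); it reads the same POSTGRES_INCREMENTAL_URL variable set above:

import asyncio
import os

import asyncpg

async def main() -> None:
    # Same variable the backend reads (set in .env above)
    dsn = os.environ["POSTGRES_INCREMENTAL_URL"]
    conn = await asyncpg.connect(dsn)
    try:
        version = await conn.fetchval("SELECT version()")
        print(f"Connected: {version}")
    finally:
        await conn.close()

asyncio.run(main())

If the database is missing, this fails with the same InvalidCatalogNameError described under Troubleshooting below.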

Step 3: Start Backend

The incremental update system starts automatically when the backend starts (if ENABLE_INCREMENTAL_UPDATES=true).

cd flexible-graphrag
python main.py

Expected Log Output:

INFO: Backend initialized
INFO: ============================================================
INFO:   Flexible GraphRAG Incremental Update System
INFO: ============================================================
INFO: Starting orchestrator...
INFO: No data sources configured for auto-sync
INFO: Auto-sync is ready - add datasources via UI or API when needed
INFO: Monitoring for configuration changes...

Step 4: Configure Data Sources

Via UI

For a new ingest with sync enabled:

  1. Open http://localhost:5000
  2. Navigate to Processing tab
  3. Select data source type (Filesystem, S3, Google Drive, etc.)
  4. Fill in source-specific fields
  5. Important: Check ✅ "Enable automatic sync"
  6. Set sync interval (default: 300 seconds)
  7. Optional: Check "Skip Graph" for faster syncs
  8. Click Process or Add Source

Result:

  • Documents are ingested
  • A datasource_config entry is created
  • A document_state record is created for each document
  • Monitoring starts automatically

Via REST API

See API-REFERENCE.md for complete API documentation.

Example - Enable sync during ingest:

POST /api/ingest
Content-Type: application/json

{
  "data_source": "filesystem",
  "paths": ["/data/documents"],
  "enable_sync": true,
  "skip_graph": true,
  "sync_config": {
    "source_name": "My Local Documents",
    "refresh_interval_seconds": 300,
    "enable_change_stream": true
  }
}
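The same request can be scripted from Python. A sketch assuming the requests package and that the backend serves /api at http://localhost:5000; adjust BASE_URL to your deployment:

import requests

BASE_URL = "http://localhost:5000"  # assumption: use wherever your backend serves /api

payload = {
    "data_source": "filesystem",
    "paths": ["/data/documents"],
    "enable_sync": True,
    "skip_graph": True,
    "sync_config": {
        "source_name": "My Local Documents",
        "refresh_interval_seconds": 300,
        "enable_change_stream": True,
    },
}

resp = requests.post(f"{BASE_URL}/api/ingest", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())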

Step 5: Verify Setup

Check Backend Logs

After configuring a data source, you should see:

INFO: New datasource detected: My Local Documents
INFO: Creating detector for My Local Documents (type: filesystem)...
INFO: Detector created successfully
INFO: Injected backend, state_manager, config_id, and skip_graph into detector
INFO: Started 1 auto-sync updater(s)

Check Database

View datasource configurations:

SELECT 
    config_id,
    source_name,
    source_type,
    is_active,
    sync_status,
    refresh_interval_seconds,
    enable_change_stream,
    last_sync_completed_at
FROM datasource_config;

View tracked documents:

SELECT 
    doc_id,
    source_path,
    ordinal,
    created_at
FROM document_state
ORDER BY created_at DESC
LIMIT 10;

Expected Results:

  • One row in datasource_config with is_active = true
  • One row per ingested document in document_state
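The same checks can be scripted. A minimal sketch with asyncpg, assuming POSTGRES_INCREMENTAL_URL is set as in Step 2:

import asyncio
import os

import asyncpg

async def main() -> None:
    conn = await asyncpg.connect(os.environ["POSTGRES_INCREMENTAL_URL"])
    try:
        active = await conn.fetchval(
            "SELECT COUNT(*) FROM datasource_config WHERE is_active = true"
        )
        tracked = await conn.fetchval("SELECT COUNT(*) FROM document_state")
        print(f"Active datasources: {active}, tracked documents: {tracked}")
    finally:
        await conn.close()

asyncio.run(main())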

Test Change Detection

For Filesystem:

  1. Add a new .txt file to the monitored directory
  2. Within seconds, check the logs for: EVENT: CREATE detected for filename.txt
  3. Verify the document appears in the vector store and search index
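To simulate a change, you can drop a test file into the monitored directory. A small sketch (/data/documents is the example path from Step 4; substitute your own):

import time
from pathlib import Path

# Assumption: matches the "paths" value configured for sync in Step 4
watched = Path("/data/documents")

test_file = watched / f"sync_test_{int(time.time())}.txt"
test_file.write_text("Hello from the incremental update test.")
print(f"Created {test_file} - watch the backend logs for a CREATE event.")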

For Cloud Sources (S3, Google Drive, etc.):

  1. Upload a file to the monitored location
  2. Wait for the sync interval (or check the event stream logs)
  3. Verify the document is processed and indexed

Data Source Specific Setup

Filesystem

Requirements:

  • Local directory with read access
  • watchdog package installed (included in requirements)

Configuration:

{
  "data_source": "filesystem",
  "paths": ["/path/to/documents"],
  "enable_sync": true,
  "sync_config": {
    "source_name": "Local Documents",
    "enable_change_stream": true  // Real-time OS events
  }
}

Features:

  • ✅ Real-time change detection (watchdog)
  • ✅ Recursive directory monitoring
  • ✅ Automatic file type detection

Amazon S3

Requirements:

  • AWS credentials with S3 read access
  • Bucket name and optional prefix
  • (Optional) SQS queue for event notifications

Configuration:

{
  "data_source": "s3",
  "s3_config": {
    "bucket_name": "my-bucket",
    "prefix": "documents/",
    "sqs_queue_url": "https://sqs.region.amazonaws.com/account/queue"
  },
  "enable_sync": true,
  "sync_config": {
    "source_name": "S3 Documents",
    "enable_change_stream": true  // Requires SQS
  }
}

Setup SQS Event Notifications: See S3-SETUP.md for detailed AWS configuration.

Google Drive

Requirements:

  • Google Cloud project with the Drive API enabled
  • Service account with Drive access
  • Folder ID to monitor

Configuration:

{
  "data_source": "google_drive",
  "google_drive_config": {
    "credentials": "{...service account JSON...}",
    "folder_id": "1ABC...XYZ"
  },
  "enable_sync": true,
  "sync_config": {
    "source_name": "Google Drive Docs",
    "refresh_interval_seconds": 60  // Polling only (no event stream)
  }
}

Alfresco

Requirements:

  • Alfresco server with ActiveMQ enabled
  • Alfresco credentials
  • Site/folder path to monitor

Configuration:

{
  "data_source": "alfresco",
  "alfresco_config": {
    "base_url": "http://alfresco:8080",
    "username": "admin",
    "password": "admin",
    "site_id": "my-site",
    "folder_path": "/documentLibrary/folder"
  },
  "enable_sync": true,
  "sync_config": {
    "source_name": "Alfresco Documents",
    "enable_change_stream": true  // Requires ActiveMQ
  }
}

Configuration Options

Refresh Interval

How often to check for changes (polling-based detection):

  • Default: 300 seconds (5 minutes)
  • Minimum: 10 seconds (be careful with API rate limits)
  • Maximum: 3600 seconds (1 hour)

Change for an existing datasource:

UPDATE datasource_config
SET refresh_interval_seconds = 60
WHERE source_name = 'My Documents';

Enable Change Stream

Use event-driven detection instead of polling:

  • Filesystem: Real-time OS events (watchdog) - instant detection
  • S3: SQS notifications - 1-5 second latency
  • Alfresco: ActiveMQ events - near real-time

Enable for an existing datasource:

UPDATE datasource_config
SET enable_change_stream = true
WHERE source_name = 'My Documents';

Restart the backend after changing this setting.

Skip Graph

Skip knowledge graph extraction for faster syncs:

UPDATE datasource_config 
SET skip_graph = true
WHERE source_name = 'My Documents';

Performance Impact:

  • With graph: ~10-60 seconds per document
  • Without graph: ~2-10 seconds per document

Monitoring and Maintenance

Check Sync Status

SELECT 
    source_name,
    sync_status,
    last_sync_completed_at,
    last_sync_error,
    last_sync_ordinal
FROM datasource_config
WHERE is_active = true;

Sync Status Values:

  • idle: Waiting for the next interval or event
  • syncing: Currently processing changes
  • error: Last sync failed (check last_sync_error)

Trigger Manual Sync

Force an immediate sync via REST API:

POST /api/incremental/sync/{config_id}

The call returns immediately; the sync runs in the background.
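The same trigger from Python. A sketch reusing the BASE_URL assumption from the earlier ingest example; config_id comes from the datasource_config table:

import requests

BASE_URL = "http://localhost:5000"  # assumption: adjust to your backend
config_id = "your-config-id"        # from: SELECT config_id FROM datasource_config;

resp = requests.post(f"{BASE_URL}/api/incremental/sync/{config_id}", timeout=30)
resp.raise_for_status()
print(resp.json())  # returns immediately; the sync continues in the background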

View Diagnostic Information

Use the queries in diagnostic_queries.sql:

-- Recent document changes
SELECT 
    doc_id,
    source_path,
    ordinal,
    content_hash,
    created_at,
    updated_at
FROM document_state
ORDER BY updated_at DESC
LIMIT 20;

-- Documents by data source
SELECT 
    dc.source_name,
    COUNT(ds.doc_id) as document_count,
    MAX(ds.updated_at) as last_update
FROM datasource_config dc
LEFT JOIN document_state ds ON dc.config_id = ds.config_id
GROUP BY dc.source_name;

Troubleshooting

Backend Not Starting

Error: asyncpg.exceptions.InvalidCatalogNameError: database "flexible_graphrag_incremental" does not exist

Solution: Create the database first (see Step 1).

No Changes Detected

Symptoms: Files added but not processed

Check:

  1. Is the datasource active? SELECT is_active FROM datasource_config;
  2. Do the backend logs show "Started N auto-sync updater(s)"?
  3. For event streams: Is enable_change_stream = true?
  4. For S3: Is the SQS queue configured and accessible?

Solutions:

  • Ensure is_active = true in the database
  • Restart the backend to pick up configuration changes
  • Check data source credentials and permissions
  • Verify the event stream setup (SQS, ActiveMQ, etc.)

Duplicate Processing

Symptoms: Same document indexed multiple times

Check:

  1. Is source_id populated in document_state?
  2. Are multiple datasources monitoring the same location?
  3. Do the detector initialization logs show "Populated known_file_ids"?

Solutions:

  • Re-ingest with enable_sync=true to populate source_id
  • Remove duplicate datasource configurations
  • Restart the backend to reset detector state

Performance Issues

Symptoms: Slow change detection or processing

Optimization:

  1. Enable event streams instead of polling
  2. Increase refresh_interval_seconds if polling
  3. Set skip_graph = true to skip graph extraction
  4. Check that the vector/search/graph services are responding quickly

Security Best Practices

  1. Credentials
       • Store credentials in environment variables, not in datasource configs
       • Use IAM roles for cloud services (S3, GCS, Azure)
       • Never commit credentials to version control

  2. Database Access
       • Restrict PostgreSQL access to the backend only
       • Use strong passwords
       • Enable SSL for production: ?sslmode=require

  3. Data Source Permissions
       • Use read-only access when possible
       • Limit scope to specific folders/buckets
       • Rotate credentials regularly

  4. Network Security
       • Run PostgreSQL on a private network
       • Use a VPN or SSH tunnel for remote access
       • Restrict database access with firewall rules

Backup and Recovery

Backup PostgreSQL Database

pg_dump -h localhost -p 5433 -U postgres flexible_graphrag_incremental > backup.sql

Restore from Backup

psql -h localhost -p 5433 -U postgres flexible_graphrag_incremental < backup.sql
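For scheduled backups, a small sketch that wraps the pg_dump command above and writes a timestamped file (host, port, and database match the example; pg_dump reads the password from PGPASSWORD or ~/.pgpass):

import subprocess
from datetime import datetime

outfile = f"flexible_graphrag_incremental_{datetime.now():%Y%m%d_%H%M%S}.sql"

# Same pg_dump invocation as above, redirected to a timestamped file
with open(outfile, "w") as f:
    subprocess.run(
        ["pg_dump", "-h", "localhost", "-p", "5433", "-U", "postgres",
         "flexible_graphrag_incremental"],
        stdout=f,
        check=True,
    )
print(f"Backup written to {outfile}")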

Reset and Re-sync

To start fresh:

  1. Stop backend
  2. Clear tables: TRUNCATE document_state, datasource_config CASCADE;
  3. Re-ingest with enable_sync=true
  4. Start backend

Advanced Configuration

Custom PostgreSQL Connection Pooling

Modify in backend code if needed:

# state_manager.py
self.pool = await asyncpg.create_pool(
    postgres_url,
    min_size=5,        # Minimum connections
    max_size=20,       # Maximum connections
    timeout=30,        # Connection timeout
    command_timeout=60 # Query timeout
)

Adjust Detector Behavior

Edit the detector configuration in the detectors/ folder to:

  • Change polling intervals
  • Modify batch sizes
  • Adjust error retry logic
  • Apply custom file filtering (see the sketch below)
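As an illustration of custom file filtering, a hypothetical predicate a detector could apply before emitting events; the function name and extension list are assumptions, not part of the shipped detectors:

from pathlib import Path

ALLOWED_EXTENSIONS = {".txt", ".md", ".pdf"}  # assumption: tune to your corpus

def should_process(path: str) -> bool:
    """Hypothetical filter: skip hidden files and unsupported extensions."""
    p = Path(path)
    if p.name.startswith("."):
        return False
    return p.suffix.lower() in ALLOWED_EXTENSIONS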

Next Steps

  • Quick Start - Test basic functionality
  • API Reference - Programmatic control
  • S3 Setup - Configure S3 event notifications
  • Diagnostic Queries - Explore diagnostic_queries.sql and datasource-queries.sql