Quick Start Guide

Get incremental updates running in 5 minutes!

Prerequisites

  • Flexible GraphRAG backend running
  • PostgreSQL database available
  • At least one data source configured (filesystem, S3, Google Drive, etc.)

Step 1: Database Setup (2 minutes)

Option A: Use Existing PostgreSQL

If you have the postgres-pgvector service running (from main Flexible GraphRAG setup):

  1. Create a new database in the same PostgreSQL instance:
     • Database name: flexible_graphrag_incremental
     • Port: same as your PostgreSQL instance (usually 5433)
  2. Run the schema:
     • Use the schema.sql file in the incremental_updates folder
     • Execute it against the new database
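The steps above can also be run from the command line with psql; the host, port, and user below are examples and should match your own setup:

```shell
# Create the incremental-updates database (adjust host/port/user as needed)
psql -h localhost -p 5433 -U postgres -c "CREATE DATABASE flexible_graphrag_incremental;"

# Apply the schema to the new database
psql -h localhost -p 5433 -U postgres -d flexible_graphrag_incremental -f incremental_updates/schema.sql
```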

Option B: Start the postgres-pgvector Service

The postgres-pgvector.yaml service provides everything:

  • PostgreSQL on port 5433
  • pgAdmin at http://localhost:5050

  1. Ensure it's uncommented in docker/docker-compose.yaml
  2. Start services: docker-compose up -d
  3. Access pgAdmin and create database flexible_graphrag_incremental
  4. Run schema.sql via pgAdmin query tool
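If you prefer the command line over pgAdmin, the same setup can be done through the container directly; the container name here is an assumption based on the service naming, so check `docker ps` for yours:

```shell
# Start the services defined in docker/docker-compose.yaml
docker-compose -f docker/docker-compose.yaml up -d

# Create the database inside the PostgreSQL container
# (container name "postgres-pgvector" is assumed)
docker exec -it postgres-pgvector psql -U postgres -c "CREATE DATABASE flexible_graphrag_incremental;"

# Apply the schema
docker exec -i postgres-pgvector psql -U postgres -d flexible_graphrag_incremental < incremental_updates/schema.sql
```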

Step 2: Configure Environment (1 minute)

Add to your .env file:

# PostgreSQL connection for incremental updates
POSTGRES_INCREMENTAL_URL=postgresql://postgres:password@localhost:5433/flexible_graphrag_incremental

# Enable incremental updates
ENABLE_INCREMENTAL_UPDATES=true

Note: Replace postgres:password with your actual credentials.
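To confirm the connection string works before starting the backend, a quick check with psql (assuming psql is installed locally and your credentials are substituted in):

```shell
# Should print a single row with "1" if the connection and database are good
psql "postgresql://postgres:password@localhost:5433/flexible_graphrag_incremental" -c "SELECT 1;"
```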

Step 3: Enable Sync via UI (2 minutes)

For New Data Source

  1. Open the web UI (http://localhost:5000)
  2. Go to Processing tab → Click data source type (e.g., "Filesystem", "S3", "Google Drive")
  3. Fill in source details:
     • Name: a descriptive name
     • Source-specific fields (path, bucket, folder ID, etc.)
  4. In the Enable Sync section:
     • ✅ Enable automatic sync
     • Choose a sync interval (default: 300 seconds)
     • Optional: enable "Skip Graph" for faster syncs
  5. Click Process or Add Source

For Existing Data Source

If you already have documents ingested and want to enable sync:

  1. Re-process the source through the UI with the "Enable Sync" checkbox selected
  2. This will:
     • Ingest documents (if not already present)
     • Create a datasource_config entry
     • Create document_state records
     • Start monitoring for changes

Step 4: Verify It's Working (1 minute)

Check Backend Logs

You should see:

INFO: Incremental Update System starting...
INFO: Found 1 active data source(s) configured for auto-sync
INFO: Started 1 auto-sync updater(s)
INFO: Monitoring for configuration changes...

Test Change Detection

For Filesystem:

  1. Add a new .txt file to your monitored folder
  2. Watch the logs - you should see: EVENT: CREATE detected for filename.txt
  3. Check the UI - the document count should increase
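One way to trigger a CREATE event from the command line, assuming /path/to/monitored is the folder you configured for the filesystem source:

```shell
# Drop a small text file into the monitored folder
echo "incremental update smoke test" > /path/to/monitored/sync-test.txt
```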

For S3:

  1. Upload a file to your monitored S3 bucket
  2. If SQS is configured: the change is detected in 1-5 seconds
  3. If polling only: the change is detected within the sync interval (5 minutes by default)
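With the AWS CLI, uploading a test file looks like this (the bucket name is a placeholder for your monitored bucket):

```shell
# Create a small test file and upload it to the monitored bucket
echo "incremental update smoke test" > sync-test.txt
aws s3 cp sync-test.txt s3://your-monitored-bucket/sync-test.txt
```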

For Google Drive:

  1. Upload a file to your monitored folder
  2. Within 60 seconds (the default polling interval), you should see: EVENT: CREATE detected for filename.txt

Verify in Database

Using pgAdmin or psql:

-- Check datasource configuration
SELECT source_name, source_type, sync_status, last_sync_completed_at 
FROM datasource_config;

-- Check tracked documents
SELECT doc_id, source_path, ordinal 
FROM document_state 
ORDER BY ordinal DESC 
LIMIT 10;

What Happens Next?

The system is now monitoring your data source and will automatically:

  1. Detect Changes
     • CREATE: new files added
     • UPDATE: existing files modified
     • DELETE: files removed
  2. Process Changes
     • Load document content
     • Process with DocumentProcessor (Docling/LlamaParse)
     • Generate embeddings
     • Update the vector store (Qdrant)
     • Update the search index (Elasticsearch)
     • Update the knowledge graph (Neo4j) if enabled
  3. Track State
     • Store document metadata in document_state
     • Update sync status in datasource_config
     • Skip reprocessing if only the timestamp changed (content-hash optimization)

Common Issues

No Changes Detected

Problem: Added file but nothing happens

Solutions:

  • Check that is_active = true in the datasource_config table
  • Verify the backend logs show "Started N auto-sync updater(s)"
  • For event-driven sources (S3, Filesystem), check that the event stream is enabled
  • Try a manual sync: POST /api/incremental/sync/{config_id}
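The manual sync endpoint can be called with curl; the host and port are placeholders for your backend address, and the config_id comes from your datasource_config table:

```shell
# Trigger an on-demand sync for one data source
# (replace the host/port with your backend and the UUID with your config_id)
curl -X POST http://localhost:5000/api/incremental/sync/your-config-id
```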

Documents Not Showing Up

Problem: Change detected but document not searchable

Solutions:

  • Check the backend logs for processing errors
  • Verify the vector store (Qdrant) and search index (Elasticsearch) are running
  • Ensure the data source credentials are correct
  • Check the document_state table - it should have an entry for the file

Duplicate Processing

Problem: Same document processed multiple times

Solutions:

  • Check that only one datasource config exists for the location
  • Verify that source_id is populated in document_state
  • Ensure detector initialization completed (look for "Populated known_file_ids" in the logs)

Next Steps

  • Setup Guide - Detailed configuration options
  • API Reference - REST API for programmatic control
  • S3 Setup - Configure S3 with SQS event notifications for real-time updates
  • Diagnostic Queries: Use diagnostic_queries.sql to inspect system state
  • Manual Sync: Trigger an on-demand sync via the API (UI support is planned)

Advanced Configuration

Adjust Sync Interval

A lower interval means faster change detection, at the cost of more frequent scans:

UPDATE datasource_config 
SET refresh_interval_seconds = 60  -- Check every 60 seconds
WHERE config_id = 'your-config-id';

Skip Knowledge Graph

For faster syncs, skip graph extraction:

UPDATE datasource_config 
SET skip_graph = true
WHERE config_id = 'your-config-id';

Then restart the backend.

Enable Event Stream

For sources that support it (S3 with SQS, Filesystem, Alfresco):

UPDATE datasource_config 
SET enable_change_stream = true
WHERE config_id = 'your-config-id';

This enables real-time change detection instead of periodic polling.

Monitoring

Check Sync Status

SELECT 
    source_name,
    sync_status,  -- 'idle', 'syncing', 'error'
    last_sync_completed_at,
    last_sync_error
FROM datasource_config;

View Recent Changes

SELECT 
    doc_id,
    source_path,
    ordinal,
    content_hash,
    created_at
FROM document_state
ORDER BY ordinal DESC
LIMIT 20;

Count Documents by Source

SELECT 
    dc.source_name,
    COUNT(*) as doc_count
FROM document_state ds
JOIN datasource_config dc ON ds.config_id = dc.config_id
GROUP BY dc.source_name;

Support

For issues or questions:

  1. Check the backend logs for error messages
  2. Review diagnostic_queries.sql for troubleshooting queries
  3. See the Setup Guide for detailed configuration
  4. Check GitHub issues for known problems and solutions