Knowledge Graph Extractors Guide¶

This document explains the different knowledge graph extractors available in Flexible GraphRAG and how to configure them.

📊 Overview¶

Flexible GraphRAG supports three different extraction methods, each optimized for different use cases:

SimpleLLMPathExtractor - Fast, flexible relationship extraction
SchemaLLMPathExtractor - Schema-guided structured extraction (with internal or custom schema)
DynamicLLMPathExtractor - Adaptive schema with flexible type discovery

⚙️ Extractor Configuration¶

Set the extractor type using the KG_EXTRACTOR_TYPE environment variable:

# Schema-based extraction (default, recommended)
KG_EXTRACTOR_TYPE=schema

# Simple path extraction (fastest)
KG_EXTRACTOR_TYPE=simple

# Dynamic extraction (most flexible)
KG_EXTRACTOR_TYPE=dynamic
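
As a rough illustration of how a loader might read this setting, the sketch below validates the value and falls back to the documented default of schema. The helper name resolve_extractor_type, and the trimming and lowercasing of the value, are assumptions made for illustration and not the project's actual code.

```python
import os

# The three values documented above; "schema" is the documented default.
VALID_EXTRACTOR_TYPES = ("schema", "simple", "dynamic")


def resolve_extractor_type(env=None):
    """Return the configured extractor type, falling back to 'schema'.

    `env` is any mapping of variable names to strings; it defaults to
    os.environ. Normalizing case and whitespace is an assumption here.
    """
    env = os.environ if env is None else env
    value = env.get("KG_EXTRACTOR_TYPE", "schema").strip().lower()
    if value not in VALID_EXTRACTOR_TYPES:
        raise ValueError(
            f"KG_EXTRACTOR_TYPE must be one of {VALID_EXTRACTOR_TYPES}, got {value!r}"
        )
    return value


# Passing a plain dict keeps the example independent of the real environment.
print(resolve_extractor_type({"KG_EXTRACTOR_TYPE": "simple"}))
```

Failing fast on an unknown value makes a typo in .env visible at startup instead of silently selecting a different extractor.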

🔧 Extractor Types¶

1. SimpleLLMPathExtractor¶

Configuration:

KG_EXTRACTOR_TYPE=simple

Characteristics:
- Fastest extraction method
- No schema constraints or validation
- Discovers entities and relationships naturally from text
- Best for: Quick exploration, unstructured content analysis

Use Cases:
- Rapid prototyping and testing
- Exploratory data analysis
- When you don't know the domain structure yet
- Processing diverse, unstructured content

Performance:
- ⚡ Fastest processing time
- 🔓 No schema overhead
- 🎯 Good for general-purpose extraction

2. SchemaLLMPathExtractor¶

Configuration:

KG_EXTRACTOR_TYPE=schema
SCHEMA_NAME=default     # Use internal schema (LlamaIndex builtin)
# OR
SCHEMA_NAME=sample      # Use project's SAMPLE_SCHEMA
# OR  
SCHEMA_NAME=custom_name # Use your custom schema

Characteristics:
- Structured extraction with validation
- Can use the internal schema (SCHEMA_NAME=default), the project's sample schema (SCHEMA_NAME=sample), or a custom schema
- Produces consistent, well-labeled entities and relationships
- Best for: Production systems, domain-specific extraction
- Default extractor type

Internal Schema (Recommended for Most Use Cases)¶

When you set SCHEMA_NAME=default, the extractor uses LlamaIndex's built-in internal schema:

10 Entity Types:
- PRODUCT - Products, services, offerings
- MARKET - Markets, industries, sectors
- TECHNOLOGY - Technologies, tools, frameworks
- EVENT - Events, incidents, milestones
- CONCEPT - Concepts, ideas, theories
- ORGANIZATION - Companies, institutions, groups
- PERSON - People, individuals, names
- LOCATION - Places, addresses, geographic entities
- TIME - Dates, times, periods
- MISCELLANEOUS - Anything that doesn't fit above

10 Relationship Types:
- USED_BY - Entity is used by another
- USED_FOR - Entity is used for a purpose
- LOCATED_IN - Entity is located in a place
- PART_OF - Entity is part of another
- WORKED_ON - Person/org worked on something
- HAS - Entity has/owns another
- IS_A - Entity is a type of another
- BORN_IN - Person born in location/time
- DIED_IN - Person died in location/time
- HAS_ALIAS - Entity has alternative name

27 Validation Rules (Triplet Patterns):

(PRODUCT, USED_BY, PRODUCT)
(PRODUCT, USED_FOR, MARKET)
(PRODUCT, HAS, TECHNOLOGY)
(MARKET, LOCATED_IN, LOCATION)
(MARKET, HAS, TECHNOLOGY)
(TECHNOLOGY, USED_BY, PRODUCT)
(TECHNOLOGY, USED_FOR, MARKET)
(TECHNOLOGY, LOCATED_IN, LOCATION)
(TECHNOLOGY, PART_OF, ORGANIZATION)
(TECHNOLOGY, IS_A, PRODUCT)
(EVENT, LOCATED_IN, LOCATION)
(EVENT, PART_OF, ORGANIZATION)
(CONCEPT, USED_BY, TECHNOLOGY)
(CONCEPT, USED_FOR, PRODUCT)
(ORGANIZATION, LOCATED_IN, LOCATION)
(ORGANIZATION, PART_OF, ORGANIZATION)
(ORGANIZATION, PART_OF, MARKET)
(PERSON, BORN_IN, LOCATION)
(PERSON, BORN_IN, TIME)
(PERSON, DIED_IN, LOCATION)
(PERSON, DIED_IN, TIME)
(PERSON, WORKED_ON, EVENT)
(PERSON, WORKED_ON, PRODUCT)
(PERSON, WORKED_ON, CONCEPT)
(PERSON, WORKED_ON, TECHNOLOGY)
(LOCATION, LOCATED_IN, LOCATION)
(LOCATION, PART_OF, LOCATION)
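
One way to read these patterns is as a whitelist of (subject type, relationship, object type) combinations. The sketch below copies four of the rules listed above into a set and checks candidate triplets against it. The function name is_allowed is hypothetical, and this is a simplified illustration of the idea, not LlamaIndex's own validation code.

```python
# A subset of the triplet patterns listed above, as (subject, relation, object).
ALLOWED_TRIPLETS = {
    ("PERSON", "BORN_IN", "LOCATION"),
    ("PERSON", "WORKED_ON", "TECHNOLOGY"),
    ("ORGANIZATION", "LOCATED_IN", "LOCATION"),
    ("TECHNOLOGY", "USED_BY", "PRODUCT"),
}


def is_allowed(subject_type, relation, object_type):
    """Return True if the typed triplet matches one of the allowed patterns."""
    return (subject_type, relation, object_type) in ALLOWED_TRIPLETS


# Direction matters: the same relation with swapped types is rejected.
print(is_allowed("PERSON", "BORN_IN", "LOCATION"))
print(is_allowed("LOCATION", "BORN_IN", "PERSON"))
```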

Example Results with Internal Schema:

Entities: 37 total
- ORGANIZATION (7)
- TECHNOLOGY (10)
- CONCEPT (2)
- EVENT (2)
- PERSON (1)
- PRODUCT (5)
- MARKET (3)
- Others (7)

Relationships: 65 total
- MENTIONS (35)
- HAS (6)
- USED_FOR (7)
- PART_OF (1)
- WORKED_ON (3)
- IS_A (1)
- Others (12)

Use Cases:
- General business/technology documents
- When you don't want to define a custom schema
- Flexible extraction with good type labeling
- Recommended starting point for most projects

3. DynamicLLMPathExtractor¶

Configuration:

KG_EXTRACTOR_TYPE=dynamic
SCHEMA_NAME=sample   # Optional: provide initial guidance
# OR
SCHEMA_NAME=default  # No initial guidance (uses internal schema)

Characteristics:
- Adaptive schema that evolves with content
- Discovers new entity and relationship types dynamically
- Can start with schema guidance or completely free-form
- Best for: Evolving domains, multi-domain content
- Note: With an Ollama LLM, it may produce only text chunk nodes and no entities

Use Cases:
- Content spanning multiple domains
- When schema needs to evolve over time
- Research and discovery projects
- Cross-domain knowledge extraction

Performance:
- 🧠 Most adaptive extraction
- 🔄 Adjusts to content
- ⏱️ Moderate processing time
- ⚠️ May be inconsistent with Ollama

🎯 Choosing the Right Extractor¶

Quick Decision Guide¶

1. Do you want the fastest extraction?
   - ✅ YES → Use KG_EXTRACTOR_TYPE=simple
   - Fastest processing
   - More extensive extraction (discovers everything naturally)
   - Entity type labels are basic (not well-categorized)
   - Good for: Testing, exploration, rapid prototyping

2. Is the built-in schema good enough for your content?
   - ✅ YES → Use KG_EXTRACTOR_TYPE=schema + SCHEMA_NAME=default ✅ RECOMMENDED
   - Uses LlamaIndex's internal schema (10 entity types, 10 relationship types)
   - Excellent type labeling for business/technology content
   - No configuration needed
   - Good for: Most projects, general business/tech documents

3. Is the project's sample schema good enough?
   - ✅ YES → Use KG_EXTRACTOR_TYPE=schema + SCHEMA_NAME=sample
   - Uses project's SAMPLE_SCHEMA (6 entity types, 10 relationship types)
   - Good type labeling for general content
   - No configuration needed
   - Good for: General-purpose projects

4. Do you need a custom schema for your specific domain?
   - ✅ YES → Use KG_EXTRACTOR_TYPE=schema + SCHEMA_NAME=your_custom_name
   - Define your schema in the config file (.env → SCHEMAS=[...])
   - Tailored entity and relationship types for your domain
   - Use strict: false for flexible extraction (recommended)
   - Use strict: true only for compliance/legal requirements
   - Good for: Domain-specific projects, production systems

5. Do you need adaptive, multi-domain extraction?
   - ✅ YES → Use KG_EXTRACTOR_TYPE=dynamic
   - Discovers and adapts entity types as it processes content
   - Most flexible, but may be inconsistent
   - Good for: Research, cross-domain content

🚀 Recommended Configurations¶

For Most Projects (Recommended)¶

KG_EXTRACTOR_TYPE=schema
SCHEMA_NAME=default
MAX_TRIPLETS_PER_CHUNK=20

- ✅ Uses internal schema with excellent type coverage
- ✅ No schema definition needed
- ✅ Great balance of speed and quality

For Domain-Specific Projects¶

Flexible Domain Extraction (Recommended)

KG_EXTRACTOR_TYPE=schema
SCHEMA_NAME=your_custom_schema
MAX_TRIPLETS_PER_CHUNK=20
# "strict": false allows discovery beyond the schema
SCHEMAS=[{
  "name": "your_custom_schema",
  "schema": {
    "entities": [...],
    "relations": [...],
    "strict": false
  }
}]

- ✅ Tailored to your specific domain
- ✅ Can discover additional entity types
- ✅ Best for evolving domains

Strict Domain Extraction (Compliance/Legal)

KG_EXTRACTOR_TYPE=schema
SCHEMA_NAME=your_custom_schema
MAX_TRIPLETS_PER_CHUNK=20
# "strict": true is a hard constraint, no discovery
SCHEMAS=[{
  "name": "your_custom_schema",
  "schema": {
    "entities": [...],
    "relations": [...],
    "strict": true
  }
}]

- ✅ Only extracts defined types
- ✅ Highly predictable and consistent
- ✅ Best for compliance, legal, regulatory
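
To make the shape of a SCHEMAS value concrete, the sketch below parses a JSON array of named schemas and picks one by name, the way SCHEMA_NAME selects an entry. The helper select_schema is hypothetical and assumes the value is plain JSON; how the project actually parses the variable is not shown in this guide. The example reuses the my_schema definition from the Strict Mode Configuration section.

```python
import json


def select_schema(schemas_json, schema_name):
    """Parse a SCHEMAS-style JSON array and return the schema with that name."""
    for entry in json.loads(schemas_json):
        if entry.get("name") == schema_name:
            return entry["schema"]
    raise KeyError(f"No schema named {schema_name!r}")


# Comment-free JSON: json.loads does not accept '#' annotations.
SCHEMAS_VALUE = """
[{
  "name": "my_schema",
  "schema": {
    "entities": ["PERSON", "ORGANIZATION", "TECHNOLOGY"],
    "relations": ["WORKS_FOR", "USES"],
    "strict": false
  }
}]
"""

schema = select_schema(SCHEMAS_VALUE, "my_schema")
print(schema["entities"], schema["strict"])
```

Note that JSON false becomes Python False after parsing, so a strict flag read this way can be passed on as a boolean.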

For Quick Testing¶

KG_EXTRACTOR_TYPE=simple
MAX_PATHS_PER_CHUNK=20

- ✅ Fastest processing
- ✅ Good for prototyping
- ✅ No configuration needed

For Research/Exploration¶

KG_EXTRACTOR_TYPE=dynamic
SCHEMA_NAME=default
MAX_TRIPLETS_PER_CHUNK=30

- ✅ Discovers new entity types
- ✅ Adapts to content
- ✅ Good for unknown domains

⚠️ Provider-Specific Behavior¶

Bedrock, Groq, and Fireworks¶

These providers have tool-calling limitations with SchemaLLMPathExtractor. The system automatically switches to DynamicLLMPathExtractor when you configure KG_EXTRACTOR_TYPE=schema:

# Your config:
KG_EXTRACTOR_TYPE=schema    # ← Automatically changed to 'dynamic' for Bedrock/Groq/Fireworks
SCHEMA_NAME=sample          # ← Used as initial ontology guidance for DynamicLLMPathExtractor

# Actual behavior:
# Automatically uses DynamicLLMPathExtractor with schema guidance

Log Output:

WARNING - Provider bedrock has SchemaLLMPathExtractor LlamaIndex issue
WARNING - Switching to DynamicLLMPathExtractor for reliable extraction
INFO - Using DynamicLLMPathExtractor for flexible relationship discovery
INFO - Providing initial ontology guidance to DynamicLLMPathExtractor

Affected Providers:
- Amazon Bedrock (all models)
- Groq (all models)
- Fireworks AI (all models)

Behavior:
- If you configure KG_EXTRACTOR_TYPE=schema, it automatically switches to dynamic
- If you configure KG_EXTRACTOR_TYPE=simple, it uses simple (no change)
- If you configure KG_EXTRACTOR_TYPE=dynamic, it uses dynamic (no change)
- Your schema configuration (SCHEMA_NAME=sample or custom) is used to provide initial ontology guidance to the dynamic extractor
- This ensures reliable extraction while still providing schema-guided structure

Why: These providers have known issues with LlamaIndex's tool-calling integration, causing extraction failures with schema-based extractors.
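
The switching rule described in this section can be summarized in a few lines. The sketch below is an illustration only: the function effective_extractor_type is hypothetical, and the provider identifiers "groq" and "fireworks" are assumed spellings ("bedrock" appears in the log output above).

```python
# Providers listed as affected; "groq" and "fireworks" are assumed identifiers.
FALLBACK_PROVIDERS = {"bedrock", "groq", "fireworks"}


def effective_extractor_type(provider, configured_type):
    """Return the extractor type that would actually run for a provider.

    Only the 'schema' setting is rewritten; 'simple' and 'dynamic' pass through.
    """
    if configured_type == "schema" and provider.lower() in FALLBACK_PROVIDERS:
        return "dynamic"
    return configured_type


for provider in ("bedrock", "openai"):
    print(provider, "->", effective_extractor_type(provider, "schema"))
```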

Recommendation: If you need structured extraction with these providers, consider:
1. Using OpenAI, Azure OpenAI, Google Gemini, Vertex AI, Anthropic Claude, or Ollama instead
2. Post-processing the SimpleLLMPathExtractor results
3. Switching to a supported provider for graph extraction

🔧 Advanced Configuration¶

Extraction Limits¶

Control how much the extractor processes per text chunk:

# For schema and dynamic extractors
MAX_TRIPLETS_PER_CHUNK=20    # Default: 20, Range: 1-100

# For simple extractor  
MAX_PATHS_PER_CHUNK=20       # Default: 20, Range: 1-100

Guidelines:
- Low (5-10): Very fast, may miss entities in dense content
- Medium (20-30): Balanced, works for most content ✅ Recommended
- High (50-100): Comprehensive, slower, for complex documents
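
A small sketch of reading one of these limits with the documented default of 20 and range of 1-100. The helper read_limit is hypothetical, and clamping out-of-range values (rather than rejecting them) is an assumption; the guide does not say how the project handles them.

```python
def read_limit(env, name, default=20, low=1, high=100):
    """Read an integer limit from a mapping and clamp it to [low, high].

    Missing or blank values fall back to `default`. Clamping is an
    assumption for this sketch, not documented project behavior.
    """
    raw = env.get(name)
    if raw is None or not raw.strip():
        return default
    return max(low, min(high, int(raw)))


settings = {"MAX_TRIPLETS_PER_CHUNK": "30", "MAX_PATHS_PER_CHUNK": "250"}
print(read_limit(settings, "MAX_TRIPLETS_PER_CHUNK"))
print(read_limit(settings, "MAX_PATHS_PER_CHUNK"))
```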

Combining Extractor with Schema¶

The relationship between KG_EXTRACTOR_TYPE and SCHEMA_NAME:

| KG_EXTRACTOR_TYPE | SCHEMA_NAME | Result |
|---|---|---|
| `simple` | (any) | SimpleLLMPathExtractor (schema ignored) |
| `schema` | `default` | SchemaLLMPathExtractor with internal schema ✅ |
| `schema` | `sample` | SchemaLLMPathExtractor with project's SAMPLE_SCHEMA |
| `schema` | `custom` | SchemaLLMPathExtractor with your custom schema |
| `dynamic` | `default` | DynamicLLMPathExtractor (no initial guidance) |
| `dynamic` | `sample` | DynamicLLMPathExtractor (with SAMPLE_SCHEMA guidance) |
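
The same combinations can be expressed as a lookup, which is handy for logging what a given configuration will do. The function describe_combination is a hypothetical helper; treating a custom schema name as guidance for the dynamic extractor follows the provider-specific behavior described earlier in this guide.

```python
EXTRACTOR_CLASSES = {
    "simple": "SimpleLLMPathExtractor",
    "schema": "SchemaLLMPathExtractor",
    "dynamic": "DynamicLLMPathExtractor",
}


def describe_combination(extractor_type, schema_name="default"):
    """Describe the result of a KG_EXTRACTOR_TYPE / SCHEMA_NAME pair."""
    cls = EXTRACTOR_CLASSES[extractor_type]
    if extractor_type == "simple":
        return f"{cls} (schema ignored)"
    if extractor_type == "schema":
        if schema_name == "default":
            return f"{cls} with internal schema"
        if schema_name == "sample":
            return f"{cls} with project's SAMPLE_SCHEMA"
        return f"{cls} with custom schema {schema_name!r}"
    # dynamic: 'default' means no initial guidance, anything else is guidance
    if schema_name == "default":
        return f"{cls} (no initial guidance)"
    return f"{cls} (with {schema_name} schema guidance)"


print(describe_combination("schema", "default"))
print(describe_combination("dynamic", "sample"))
```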

Strict Mode Configuration¶

When using SchemaLLMPathExtractor with a custom schema, the strict parameter controls how rigidly the schema is enforced:

strict: false (Recommended)

# "strict": false allows additional types
SCHEMAS=[{
  "name": "my_schema",
  "schema": {
    "entities": ["PERSON", "ORGANIZATION", "TECHNOLOGY"],
    "relations": ["WORKS_FOR", "USES"],
    "strict": false
  }
}]

Behavior:
- ✅ Extracts entities/relationships defined in schema
- ✅ Also extracts entities/relationships not in schema
- ✅ Schema provides guidance, not constraints
- ✅ LLM can discover new entity/relationship types
- ✅ More flexible and comprehensive extraction
- ⚠️ May include unexpected entity types

When to use:
- General-purpose extraction
- When you're not sure what all entity types might appear
- When documents may contain diverse content
- Recommended for most use cases ✅

Example Results:

Schema defines: PERSON, ORGANIZATION, TECHNOLOGY
Extracted: PERSON (5), ORGANIZATION (7), TECHNOLOGY (3), 
          LOCATION (2), EVENT (1), CONCEPT (4)  ← Additional types discovered!

strict: true (Restrictive)

# "strict": true only allows defined types
SCHEMAS=[{
  "name": "my_schema",
  "schema": {
    "entities": ["PERSON", "ORGANIZATION", "TECHNOLOGY"],
    "relations": ["WORKS_FOR", "USES"],
    "strict": true
  }
}]

Behavior:
- ✅ Extracts only entities/relationships defined in schema
- ❌ Ignores any entity/relationship not in schema
- ✅ Schema provides hard constraints
- ✅ LLM cannot discover new types
- ✅ Highly consistent, predictable output
- ⚠️ May miss important entities that don't fit schema

When to use:
- Highly controlled, domain-specific extraction
- When schema is comprehensive and well-defined
- When consistency is more important than completeness
- Production systems with strict requirements

Example Results:

Schema defines: PERSON, ORGANIZATION, TECHNOLOGY
Extracted: PERSON (5), ORGANIZATION (7), TECHNOLOGY (3)
          ← Locations, events, concepts ignored even if present in text

Strict Mode Comparison:

| Aspect | strict: false | strict: true |
|---|---|---|
| Schema role | Guidance | Hard constraint |
| Entity types | Schema + discovered | Schema only |
| Relationship types | Schema + discovered | Schema only |
| Flexibility | ⭐⭐⭐⭐ High | ⭐⭐ Low |
| Completeness | ⭐⭐⭐⭐ High | ⭐⭐⭐ Variable |
| Consistency | ⭐⭐⭐ Good | ⭐⭐⭐⭐ Excellent |
| Risk of missing data | ⭐ Low | ⭐⭐⭐ High |
| Best for | General use, exploration | Strict domains, compliance |
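
The practical difference between the two modes can be pictured as a filter over extracted entities. The sketch below is a simplified post-filter written for illustration: the function apply_strict and the entity names are hypothetical, and LlamaIndex enforces strictness through its own mechanism, not through this code.

```python
def apply_strict(entities, schema_entities, strict):
    """Keep all (name, label) pairs, or only schema-defined labels if strict."""
    if not strict:
        return list(entities)
    allowed = set(schema_entities)
    return [(name, label) for name, label in entities if label in allowed]


SCHEMA_ENTITIES = ["PERSON", "ORGANIZATION", "TECHNOLOGY"]

# Hypothetical extraction output, for illustration only.
extracted = [
    ("Jane Doe", "PERSON"),
    ("Acme Corp", "ORGANIZATION"),
    ("Berlin", "LOCATION"),
    ("Launch Day", "EVENT"),
]

print(apply_strict(extracted, SCHEMA_ENTITIES, strict=False))
print(apply_strict(extracted, SCHEMA_ENTITIES, strict=True))
```

With strict=True the LOCATION and EVENT entries are dropped, which mirrors the "ignored even if present in text" behavior shown in the example results.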

Internal Schema Note:

When using the internal schema (SCHEMA_NAME=default or not set), the strict parameter defaults to false because our implementation doesn't pass a strict parameter to LlamaIndex. This is a deliberate design choice: the internal schema then acts as guidance rather than a hard constraint, balancing structure with discovery.

Recommendation:

Start with strict: false for your custom schemas. Only use strict: true if:
1. You have a complete, well-tested schema for your domain
2. Consistency is critical (legal, compliance, regulatory documents)
3. You want to exclude entities outside your specific domain
4. You're willing to sacrifice completeness for predictability

📈 Performance Comparison¶

Based on cmispress.txt (2480 chars) extraction:

| Extractor | Processing Time | Entities | Relationships | Quality |
|---|---|---|---|---|
| Simple (Groq) | ~3.5s | 34 | 17 | ⭐⭐⭐ Good |
| Schema Internal (OpenAI) | ~26s | 37 | 65 | ⭐⭐⭐⭐ Excellent |
| Schema Default (OpenAI) | ~26s | 38 | 19 | ⭐⭐⭐⭐ Excellent |

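Notes:
- The simple extractor was about 7x faster (~3.5s vs ~26s) but extracted fewer relationships; it ran on Groq while the schema runs used OpenAI, so the timings also reflect the provider
- Schema extractors provide richer, more structured graphs
- The internal schema run produced far more relationships (65 vs 19)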

📚 Related Documentation¶

- Schema Configuration: docs/CONFIGURATION/SCHEMA-EXAMPLES.md - Custom schema examples
- LLM Testing: docs/LLM/LLM-TESTING-RESULTS.md - Provider compatibility
- Environment Setup: docs/GETTING-STARTED/ENVIRONMENT-CONFIGURATION.md - Complete configuration
- Performance Tuning: docs/ADVANCED/PERFORMANCE.md - Optimization tips

💡 Best Practices¶

Start with internal schema (SCHEMA_NAME=default + KG_EXTRACTOR_TYPE=schema)
Use strict: false by default - Only use strict: true for compliance/legal requirements
Test on small samples before processing large datasets
Monitor extraction quality by checking entity/relationship counts
Use simple extractor for rapid prototyping and testing
Create custom schemas only when internal schema doesn't fit your domain
Adjust extraction limits based on document complexity
Consider provider limitations when choosing extractors
Compare strict vs non-strict results before committing to strict mode
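
For the monitoring practice above, a per-label count is often enough to spot problems such as a run that produced no entities or one dominated by a single type. The sketch below assumes extraction results are available as (name, label) pairs; that shape, the helper summarize_labels, and the sample data are all hypothetical.

```python
from collections import Counter


def summarize_labels(items):
    """Count (name, label) pairs by label, most common first."""
    return Counter(label for _, label in items).most_common()


# Hypothetical extraction output, for illustration only.
entities = [
    ("Acme Corp", "ORGANIZATION"),
    ("Jane Doe", "PERSON"),
    ("Globex", "ORGANIZATION"),
    ("Python", "TECHNOLOGY"),
]

summary = summarize_labels(entities)
print(f"Entities: {len(entities)} total")
for label, count in summary:
    print(f"- {label} ({count})")
```

Running the same summary on strict and non-strict outputs gives a quick side-by-side view before committing to strict mode.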

🐛 Troubleshooting¶

Issue: No entities extracted¶

Solution: Check logs for extractor type and schema being used. Verify LLM provider compatibility.

Issue: Only chunk nodes, no entities¶

Solution: If you are using Bedrock, Groq, or Fireworks, the system automatically switches from the schema extractor to the dynamic extractor. Check the logs for the "Switching to DynamicLLMPathExtractor" message. If you still see issues, manually set KG_EXTRACTOR_TYPE=simple.

Issue: Wrong extractor being used¶

Solution: Verify KG_EXTRACTOR_TYPE in .env file. Check logs for "create_extractor called with: extractor_type='...'" message.

Issue: Poor entity type labels¶

Solution: Switch from simple to schema extractor with internal schema (SCHEMA_NAME=default + KG_EXTRACTOR_TYPE=schema).

Issue: Too few relationships extracted¶

Solution: Increase MAX_TRIPLETS_PER_CHUNK or MAX_PATHS_PER_CHUNK to 50-100.

Issue: Missing important entities¶

Solution:
- If using custom schema with strict: true, switch to strict: false
- Check if entities are outside your schema definition
- Consider using internal schema (SCHEMA_NAME=default) for broader coverage

Issue: Too many unexpected entity types¶

Solution:
- If using strict: false and need more control, create a comprehensive custom schema
- Or use strict: true to enforce hard constraints (but test first!)
- Review and refine your schema definition to include expected types

📝 Source Code Reference¶

The internal schema is defined in LlamaIndex:

venv/Lib/site-packages/llama_index/core/indices/property_graph/transformations/schema_llm.py
Lines 22-78: DEFAULT_ENTITIES, DEFAULT_RELATIONS, DEFAULT_VALIDATION_SCHEMA

This is the canonical source for the internal schema used when SCHEMA_NAME=default.
