OntoCast Workflow
This document describes the workflow of OntoCast's document processing pipeline.
Overview
The OntoCast workflow consists of several stages that transform input documents into structured knowledge:
- Document Conversion
    - Input documents are converted to markdown format
    - Supports various input formats (PDF, DOCX, TXT, MD)
- Text Chunking
    - Documents are split into manageable chunks
    - Chunks are processed sequentially
    - Head chunks are processed first to establish context
- Ontology Processing
    - Selection: Choose the appropriate ontology for the content
    - Extraction: Extract ontological concepts from the text using GraphUpdate operations
    - GraphUpdate: The LLM outputs structured SPARQL operations (insert/delete) instead of a full TTL serialization (see the sketch after this list)
    - Update Application: GraphUpdate operations are applied incrementally to the ontology graph
    - Sublimation: Refine and enhance the ontology
    - Criticism: Validate ontology structure and relationships
    - Versioning: Automatic semantic version increment based on changes (MAJOR/MINOR/PATCH)
    - Timestamp: Tracks the last update time with the `updated_at` field
- Fact Processing
    - Extraction: Extract factual information from the text using GraphUpdate operations
    - GraphUpdate: The LLM outputs structured SPARQL operations for fact updates
    - Update Application: GraphUpdate operations are applied incrementally to the facts graph
    - Criticism: Validate extracted facts
    - Aggregation: Combine facts from all chunks
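Both the ontology and fact stages rely on the same GraphUpdate mechanism: rather than regenerating a full Turtle serialization, the LLM emits compact SPARQL INSERT/DELETE operations that are applied to the working graph. A minimal sketch of applying such an operation with rdflib is shown below; the namespace and the update string are illustrative, not taken from OntoCast's code.

```python
from rdflib import Graph

# Working ontology graph (normally loaded from the current ontology version).
graph = Graph()
graph.parse(
    data="""
    @prefix ex: <http://example.org/onto#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    ex:Company a rdfs:Class .
    """,
    format="turtle",
)

# A GraphUpdate-style operation: the LLM returns SPARQL INSERT/DELETE
# fragments rather than a full TTL dump of the graph.
llm_update = """
PREFIX ex: <http://example.org/onto#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
INSERT DATA {
    ex:Startup a rdfs:Class ;
               rdfs:subClassOf ex:Company .
}
"""

# Applying the update mutates the graph in place, so only the delta
# travels through the LLM, not the whole serialized ontology.
graph.update(llm_update)
print(len(graph))  # 3 triples after the incremental update
```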
Detailed Flow
1. Document Input
- Accepts text or file input
- Converts to markdown format
- Preserves document structure
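As a rough picture of this step, a converter can dispatch on the file extension, passing markdown and plain text straight through and delegating binary formats to a dedicated converter. The function name and dispatch logic below are illustrative assumptions, not OntoCast's actual converter.

```python
from pathlib import Path

def to_markdown(path: Path) -> str:
    """Normalize an input document to markdown text (illustrative sketch)."""
    suffix = path.suffix.lower()
    if suffix in {".md", ".txt"}:
        # Plain text and markdown pass through directly.
        return path.read_text(encoding="utf-8")
    if suffix in {".pdf", ".docx"}:
        # Binary formats would be handed to a PDF/DOCX-to-markdown converter
        # before entering the pipeline.
        raise NotImplementedError(f"plug in a converter for {suffix} files")
    raise ValueError(f"unsupported input format: {suffix}")
```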
2. Text Processing
- Splits text into chunks
- Processes head chunks first
- Maintains context between chunks
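A simplified sketch of the chunking idea: the markdown is cut into roughly fixed-size chunks on paragraph boundaries, and the first few ("head") chunks are handled before the rest so that early context such as the title and abstract is available when later chunks are processed. The splitting strategy and the chunk size below are illustrative, not OntoCast's exact algorithm.

```python
def split_into_chunks(text: str, chunk_size: int = 2000) -> list[str]:
    """Split text into roughly fixed-size chunks on paragraph boundaries."""
    chunks: list[str] = []
    current = ""
    for paragraph in text.split("\n\n"):
        if current and len(current) + len(paragraph) > chunk_size:
            chunks.append(current)
            current = ""
        current = f"{current}\n\n{paragraph}" if current else paragraph
    if current:
        chunks.append(current)
    return chunks

# Illustrative input standing in for a converted markdown document.
sample = "\n\n".join(f"Paragraph {i}: " + "lorem ipsum " * 40 for i in range(20))
chunks = split_into_chunks(sample, chunk_size=1000)
head, tail = chunks[:3], chunks[3:]
print(len(head), len(tail))  # head chunks establish context before the rest are processed
```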
3. Ontology Management
- Selects relevant ontology
- Extracts new concepts using GraphUpdate operations (token-efficient)
- Applies incremental updates to ontology graph
- Validates relationships
- Refines structure
- Automatically increments version based on change analysis (MAJOR/MINOR/PATCH)
- Updates timestamp when ontology is modified
- Tracks version lineage with hash-based identifiers
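The versioning and timestamp steps can be pictured as a small helper that classifies the applied changes and bumps the semantic version accordingly. The classification rules below (deletions are breaking, additions are minor, everything else is a patch) are an illustrative assumption rather than OntoCast's exact policy.

```python
from datetime import datetime, timezone

def bump_version(version: str, deleted_triples: int, added_triples: int) -> str:
    """Increment MAJOR/MINOR/PATCH based on a simple change analysis (illustrative)."""
    major, minor, patch = (int(p) for p in version.split("."))
    if deleted_triples:  # removed axioms treated as breaking changes
        return f"{major + 1}.0.0"
    if added_triples:    # new classes/properties extend the ontology
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"

version = bump_version("1.4.2", deleted_triples=0, added_triples=5)  # -> "1.5.0"
updated_at = datetime.now(timezone.utc).isoformat()                  # updated_at field
print(version, updated_at)
```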
4. Fact Extraction
- Identifies entities
- Extracts relationships using GraphUpdate operations (token-efficient)
- Applies incremental updates to facts graph
- Validates facts
- Combines information from all chunks
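Aggregation amounts to taking the union of the per-chunk fact graphs into one document-level graph; with rdflib this is a graph merge that deduplicates identical triples. The namespace and triples below are made up for illustration.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/facts#")

def chunk_graph(subject: str, revenue: int) -> Graph:
    """Build a tiny per-chunk fact graph (illustrative)."""
    g = Graph()
    g.add((EX[subject], RDF.type, EX.Company))
    g.add((EX[subject], EX.revenue, Literal(revenue)))
    return g

# Facts extracted from individual chunks...
per_chunk = [chunk_graph("AcmeCorp", 120), chunk_graph("AcmeCorp", 120), chunk_graph("Globex", 80)]

# ...are aggregated into a single document-level graph; duplicate triples collapse.
facts = Graph()
for g in per_chunk:
    facts += g

print(len(facts))  # 4 distinct triples, even though 6 were added across chunks
```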
5. Output Generation
- Produces RDF graph
- Generates ontology with version and timestamp
- Provides extracted facts
- Reports budget usage (LLM calls, characters sent/received, triples generated)
- Logs budget summary at end of processing
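The budget report can be thought of as an accumulator that every LLM call and every applied update increments, logged once as a summary when processing ends. The field and method names below mirror the metrics listed above but are an illustrative sketch, not OntoCast's actual reporting structure.

```python
from dataclasses import dataclass

@dataclass
class BudgetReport:
    llm_calls: int = 0
    chars_sent: int = 0
    chars_received: int = 0
    triples_generated: int = 0

    def record_call(self, prompt: str, response: str) -> None:
        """Record one LLM round trip."""
        self.llm_calls += 1
        self.chars_sent += len(prompt)
        self.chars_received += len(response)

    def summary(self) -> str:
        return (
            f"LLM calls: {self.llm_calls}, "
            f"chars sent/received: {self.chars_sent}/{self.chars_received}, "
            f"triples generated: {self.triples_generated}"
        )

budget = BudgetReport()
budget.record_call("Extract facts from chunk 1 ...", "INSERT DATA { ... }")
budget.triples_generated += 12
print(budget.summary())  # logged once at the end of processing
```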
Configuration Options
The workflow can be configured through command-line parameters:
- `--head-chunks`: Number of chunks to process first
- `--max-visits`: Maximum visits per node
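For a custom driver script, the same two parameters could be exposed with argparse as sketched below; the default values are placeholders, not OntoCast's documented defaults.

```python
import argparse

parser = argparse.ArgumentParser(description="OntoCast workflow options (sketch)")
parser.add_argument("--head-chunks", type=int, default=3,
                    help="Number of chunks to process first")
parser.add_argument("--max-visits", type=int, default=5,
                    help="Maximum visits per node")

# Parse an empty argument list so the sketch runs standalone with the defaults.
args = parser.parse_args([])
print(args.head_chunks, args.max_visits)
```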
Best Practices
- Chunk Size
    - Keep chunks manageable
    - Consider context preservation
    - Balance detail against processing time
- Ontology Selection
    - Choose an appropriate ontology
    - Consider domain specificity
    - Allow for ontology evolution
    - Monitor version increments to track evolution
- Fact Validation
    - Validate extracted facts
    - Check for consistency
    - Handle contradictions (see the consistency-check sketch after this list)
- Resource Management
    - Monitor memory usage
    - Control processing time
    - Handle large documents
    - Review budget summaries to track LLM usage and costs
    - Use budget metrics to estimate processing costs for large documents
    - GraphUpdate operations significantly reduce token usage compared to full graph generation
    - Monitor triple generation metrics to understand graph growth
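One concrete way to surface contradictions before aggregation is to look for subjects that end up with more than one value for a property expected to be single-valued. The property and the single-valued assumption below are illustrative.

```python
from collections import defaultdict
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/facts#")

facts = Graph()
facts.add((EX.AcmeCorp, EX.foundedIn, Literal(1999)))
facts.add((EX.AcmeCorp, EX.foundedIn, Literal(2001)))  # contradictory value

# Flag subjects with multiple values for a property assumed to be single-valued.
values = defaultdict(set)
for subject, _, value in facts.triples((None, EX.foundedIn, None)):
    values[subject].add(value)

contradictions = {s: vs for s, vs in values.items() if len(vs) > 1}
print(contradictions)  # AcmeCorp maps to both 1999 and 2001
```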
Next Steps
- Check the API Reference