OntoCast Workflow

This document describes the workflow of OntoCast's document processing pipeline.

Overview

The OntoCast workflow consists of several stages that transform input documents into structured knowledge:

  1. Document Conversion
     • Input documents are converted to markdown format
     • Supports various input formats (PDF, DOCX, TXT, MD)

  2. Text Chunking
     • Documents are split into manageable chunks
     • Chunks are processed sequentially
     • Head chunks are processed first to establish context

  3. Ontology Processing
     • Selection: choose the appropriate ontology for the content
     • Extraction: extract ontological concepts from text using GraphUpdate operations
     • GraphUpdate: the LLM outputs structured SPARQL operations (insert/delete) instead of a full TTL graph
     • Update Application: GraphUpdate operations are applied incrementally to the ontology graph
     • Sublimation: refine and enhance the ontology
     • Criticism: validate ontology structure and relationships
     • Versioning: automatic semantic version increment based on changes (MAJOR/MINOR/PATCH)
     • Timestamp: last update time is tracked in the updated_at field

  4. Fact Processing
     • Extraction: extract factual information from text using GraphUpdate operations
     • GraphUpdate: the LLM outputs structured SPARQL operations for fact updates
     • Update Application: GraphUpdate operations are applied incrementally to the facts graph
     • Criticism: validate extracted facts
     • Aggregation: combine facts from all chunks
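The four stages above can be sketched as a minimal pipeline state passed through the stages in order. This is an illustrative outline only; the class, field, and function names are hypothetical, not OntoCast's actual types:

```python
# Hypothetical sketch of the OntoCast stage sequence; names are illustrative.
from dataclasses import dataclass, field


@dataclass
class PipelineState:
    markdown: str = ""
    chunks: list = field(default_factory=list)
    ontology_triples: set = field(default_factory=set)
    fact_triples: set = field(default_factory=set)


STAGES = [
    "convert",   # input document -> markdown
    "chunk",     # markdown -> ordered chunks (head chunks first)
    "ontology",  # select / extract / sublimate / criticize / version
    "facts",     # extract / criticize / aggregate
]


def run_pipeline(text: str, chunk_size: int = 512) -> PipelineState:
    state = PipelineState(markdown=text)
    state.chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    # The ontology and fact stages would run per chunk; elided in this sketch.
    return state
```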

Detailed Flow

1. Document Input

  • Accepts text or file input
  • Converts to markdown format
  • Preserves document structure
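A conversion step along these lines could dispatch on the source format; the function name and signature here are hypothetical, not OntoCast's actual converter API:

```python
# Illustrative format dispatch for the conversion step (hypothetical API).
def to_markdown(content: str, fmt: str) -> str:
    """Normalize an input document to markdown based on its source format."""
    fmt = fmt.lower().lstrip(".")
    if fmt in {"md", "txt"}:
        return content  # markdown / plain text passes through unchanged
    if fmt in {"pdf", "docx"}:
        # A real pipeline would delegate to a PDF/DOCX parser here.
        raise NotImplementedError(f"{fmt} conversion not sketched")
    raise ValueError(f"unsupported input format: {fmt}")
```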

2. Text Processing

  • Splits text into chunks
  • Processes head chunks first
  • Maintains context between chunks
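One common way to maintain context between chunks is to overlap adjacent ones; the sizes below are illustrative, not OntoCast's actual defaults:

```python
# Sketch of overlapping chunking for context preservation (illustrative sizes).
def split_into_chunks(text: str, size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into fixed-size chunks; each chunk repeats the tail of the
    previous one so context carries across chunk boundaries."""
    if size <= overlap:
        raise ValueError("size must exceed overlap")
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks
```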

3. Ontology Management

  • Selects relevant ontology
  • Extracts new concepts using GraphUpdate operations (token-efficient)
  • Applies incremental updates to ontology graph
  • Validates relationships
  • Refines structure
  • Automatically increments version based on change analysis (MAJOR/MINOR/PATCH)
  • Updates timestamp when ontology is modified
  • Tracks version lineage with hash-based identifiers
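The incremental-update and versioning steps can be modeled as a GraphUpdate holding explicit insert/delete triple sets, applied to the current graph, followed by a semantic-version bump. Both the class shape and the bump policy below are assumptions for illustration, not OntoCast's actual rules:

```python
# Hedged sketch: GraphUpdate as insert/delete triple sets plus a semver bump.
from dataclasses import dataclass, field


@dataclass
class GraphUpdate:
    insert: set = field(default_factory=set)  # triples to add
    delete: set = field(default_factory=set)  # triples to remove


def apply_update(graph: set, update: GraphUpdate) -> set:
    """Apply deletions first, then insertions, to the triple set."""
    return (graph - update.delete) | update.insert


def bump_version(version: str, update: GraphUpdate) -> str:
    """Illustrative policy (an assumption): deletions are breaking (MAJOR),
    insertions add capability (MINOR), an empty update is a PATCH touch."""
    major, minor, patch = map(int, version.split("."))
    if update.delete:
        return f"{major + 1}.0.0"
    if update.insert:
        return f"{major}.{minor + 1}.0"
    return f"{major}.{minor}.{patch + 1}"
```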

4. Fact Extraction

  • Identifies entities
  • Extracts relationships using GraphUpdate operations (token-efficient)
  • Applies incremental updates to facts graph
  • Validates facts
  • Combines information from all chunks
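The aggregation step amounts to merging per-chunk fact sets while deduplicating identical statements, roughly:

```python
# Minimal sketch of fact aggregation across chunks (illustrative, not the
# actual OntoCast implementation).
def aggregate_facts(per_chunk_facts: list) -> set:
    """Union per-chunk fact triples; duplicate statements collapse naturally
    because triples are set members."""
    combined = set()
    for facts in per_chunk_facts:
        combined |= facts
    return combined
```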

5. Output Generation

  • Produces RDF graph
  • Generates ontology with version and timestamp
  • Provides extracted facts
  • Reports budget usage (LLM calls, characters sent/received, triples generated)
  • Logs budget summary at end of processing
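A budget tracker for the reported metrics might look like the following; the class and method names are hypothetical, chosen only to mirror the fields listed above:

```python
# Hedged sketch of budget accounting (hypothetical names mirroring the
# documented metrics: LLM calls, characters sent/received, triples generated).
from dataclasses import dataclass


@dataclass
class Budget:
    llm_calls: int = 0
    chars_sent: int = 0
    chars_received: int = 0
    triples_generated: int = 0

    def record_call(self, prompt: str, response: str, new_triples: int = 0) -> None:
        self.llm_calls += 1
        self.chars_sent += len(prompt)
        self.chars_received += len(response)
        self.triples_generated += new_triples

    def summary(self) -> str:
        return (f"LLM calls: {self.llm_calls}, sent: {self.chars_sent} chars, "
                f"received: {self.chars_received} chars, "
                f"triples: {self.triples_generated}")
```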

Configuration Options

The workflow can be configured through command-line parameters:

  • --head-chunks: Number of chunks to process first
  • --max-visits: Maximum visits per node
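The two documented flags could be wired up with argparse roughly as follows; the defaults shown are illustrative assumptions, not OntoCast's actual values:

```python
# Sketch of the two documented CLI flags; defaults are illustrative.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="OntoCast workflow options")
    parser.add_argument("--head-chunks", type=int, default=2,
                        help="number of chunks to process first")
    parser.add_argument("--max-visits", type=int, default=3,
                        help="maximum visits per node")
    return parser
```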

Best Practices

  1. Chunk Size
     • Keep chunks manageable
     • Consider context preservation
     • Balance detail against processing time

  2. Ontology Selection
     • Choose an appropriate ontology
     • Consider domain specificity
     • Allow for ontology evolution
     • Monitor version increments to track evolution

  3. Fact Validation
     • Validate extracted facts
     • Check for consistency
     • Handle contradictions

  4. Resource Management
     • Monitor memory usage
     • Control processing time
     • Handle large documents
     • Review budget summaries to track LLM usage and costs
     • Use budget metrics to estimate processing costs for large documents
     • Prefer GraphUpdate operations: they significantly reduce token usage compared to full graph generation
     • Monitor triple-generation metrics to understand graph growth

Next Steps