# OntoCast

Agentic, ontology-assisted framework for semantic triple extraction from documents.

## Overview
OntoCast is a powerful framework that automatically extracts semantic triples from documents using an agentic approach. It combines ontology management with natural language processing to create structured knowledge from unstructured text.
## Key Features
- Ontology-Guided Extraction: Uses ontologies to guide the extraction process and ensure semantic consistency
- Entity Disambiguation: Resolves entity and property references across chunks
- Multi-Format Support: Handles various input formats including text, JSON, PDF, and Markdown
- Semantic Chunking: Intelligent text chunking based on semantic similarity
- MCP Compatibility: Fully compatible with the Model Context Protocol (MCP) specification, providing standardized endpoints for health checks, info, and document processing
- RDF Output: Generates standardized RDF/Turtle output
## Extraction Steps

1. **Document Processing**
    - Supports PDF, Markdown, and text documents
    - Automated text chunking and processing
2. **Automated Ontology Management**
    - Intelligent ontology selection and construction
    - Multi-stage validation and critique system
    - Ontology sublimation and refinement
3. **Knowledge Graph Integration**
    - RDF-based knowledge graph storage
    - Triple extraction for both ontologies and facts
    - Configurable workflow with visit limits
    - Chunk aggregation preserving fact lineage
## Installation
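The section gives no explicit install command; a minimal sketch, assuming a source checkout managed with `uv` (the repository URL is an assumption based on the documentation host):

```shell
# clone the repository and install its dependencies with uv (assumed layout)
git clone https://github.com/growgraph/ontocast.git
cd ontocast
uv sync
```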
## Configuration

Create a `.env` file with your OpenAI API key:
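For example (the `ESTIMATED_CHUNKS` entry is optional; see the NB section below):

```
# required: your OpenAI API key
OPENAI_API_KEY=sk-...
# optional: chunk estimate used to compute recursion_limit (defaults to 30)
ESTIMATED_CHUNKS=30
```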
## Running the Server
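A minimal invocation, assuming the `serve` entry point used in the filesystem-processing example below and the default port 8999:

```shell
# start the OntoCast server (listens on the default port 8999)
uv run serve \
    --ontology-directory ONTOLOGY_DIR \
    --working-directory WORKING_DIR
```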
### Process Endpoint

The `/process` endpoint accepts:

- `application/json`: JSON data
- `multipart/form-data`: File uploads

And returns:

- `application/json`: Processing results including:
    - Extracted facts in Turtle format
    - Generated ontology in Turtle format
    - Processing metadata
```shell
# Process a PDF file
curl -X POST http://url:port/process -F "file=@data/pdf/sample.pdf"

# Process a JSON file
curl -X POST http://url:port/process -F "file=@test2/sample.json"

# Process text content
curl -X POST http://localhost:8999/process \
    -H "Content-Type: application/json" \
    -d '{"text": "Your document text here"}'
```
### MCP Endpoints

OntoCast implements the following MCP-compatible endpoints:

- `GET /health`: Health check endpoint
- `GET /info`: Service information endpoint
- `POST /process`: Document processing endpoint
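The two read-only endpoints can be exercised with curl once the server is running (localhost and the default port 8999 are assumed):

```shell
# health check
curl http://localhost:8999/health

# service information
curl http://localhost:8999/info
```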
## Processing Filesystem Documents

```shell
uv run serve \
    --ontology-directory ONTOLOGY_DIR \
    --working-directory WORKING_DIR \
    --input-path DOCUMENT_DIR
```
## NB

- JSON documents are expected to contain their text in the `text` field
- `recursion_limit` is calculated as `max_visits * estimated_chunks`; the estimated number of chunks defaults to 30 or is fetched from `.env` (via `ESTIMATED_CHUNKS`)
- 8999 is used as the default port
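The recursion-limit arithmetic above can be sketched as follows (the `MAX_VISITS` value is hypothetical):

```shell
# recursion_limit = max_visits * estimated_chunks
MAX_VISITS=3          # hypothetical workflow setting
ESTIMATED_CHUNKS=30   # default when ESTIMATED_CHUNKS is not set in .env
echo $((MAX_VISITS * ESTIMATED_CHUNKS))  # prints 90
```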
## Docker

To build the Docker image:
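A hedged sketch, assuming a Dockerfile at the repository root and an arbitrary image tag:

```shell
# build the image (the tag name is an example)
docker build -t ontocast .

# run it, passing the .env configuration and exposing the default port
docker run --env-file .env -p 8999:8999 ontocast
```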
## Project Structure

```
src/
├── agent.py   # Main agent workflow implementation
├── onto.py    # Ontology and RDF graph handling
├── nodes/     # Individual workflow nodes
├── tools/     # Tool implementations
└── prompts/   # LLM prompts
```
## Workflow

The extraction follows a multi-stage workflow:

1. **Document Preparation**
    - [Optional] Convert to Markdown
    - Text chunking
2. **Ontology Processing**
    - Ontology selection
    - Text to ontology triples
    - Ontology critique
3. **Fact Extraction**
    - Text to facts
    - Facts critique
    - Ontology sublimation
4. **Chunk Normalization**
    - Chunk KG aggregation
    - Entity/Property disambiguation
5. **Storage**
    - Knowledge graph storage
## Documentation

Full documentation is available at [growgraph.github.io/ontocast](https://growgraph.github.io/ontocast).
## Roadmap

- Add a triple store for serialization/ontology management
- Replace graph-to-text conversion with a symbolic graph interface (agent tools for working with triples)
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments

- Uses RDFLib for semantic triple management
- Uses docling for PDF/PPTX conversion
- Uses OpenAI language models / open models served via Ollama for fact extraction
- Uses LangChain / LangGraph for agent orchestration