# OntoCast

Agentic ontology-assisted framework for semantic triple extraction

## Overview
OntoCast is a framework for extracting semantic triples (creating a Knowledge Graph) from documents using an agentic, ontology-driven approach. It combines ontology management, natural language processing, and knowledge graph serialization to turn unstructured text into structured, queryable data.
## Key Features
- Ontology-Guided Extraction: Ensures semantic consistency and co-evolves ontologies
- Entity Disambiguation: Resolves references across document chunks
- Multi-Format Support: Handles text, JSON, PDF, and Markdown
- Semantic Chunking: Splits text based on semantic similarity
- MCP Compatibility: Implements Model Context Protocol endpoints
- RDF Output: Produces standardized RDF/Turtle
- Triple Store Integration: Supports Neo4j (n10s) and Apache Fuseki
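The semantic-chunking idea above can be sketched as follows. This is a toy illustration using bag-of-words cosine similarity, not OntoCast's actual implementation (which relies on embeddings): a new chunk starts whenever similarity between adjacent sentences drops below a threshold.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two bag-of-words vectors
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    # Start a new chunk whenever similarity to the previous sentence drops
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        sim = cosine(Counter(prev.lower().split()), Counter(cur.lower().split()))
        if sim < threshold:
            chunks.append([cur])
        else:
            chunks[-1].append(cur)
    return chunks
```

A real implementation would replace the bag-of-words vectors with sentence embeddings, but the boundary-detection logic is the same.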
## Applications
OntoCast can be used for:
- Knowledge Graph Construction: Build domain-specific or general-purpose knowledge graphs from documents
- Semantic Search: Power search and retrieval with structured triples
- GraphRAG: Enable retrieval-augmented generation over knowledge graphs (e.g., with LLMs)
- Ontology Management: Automate ontology creation, validation, and refinement
- Data Integration: Unify data from diverse sources into a semantic graph
## Installation
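Installation steps are not spelled out on this page; a minimal sketch, assuming the package is published on PyPI under the name `ontocast`:

```shell
# Assumption: the package is available on PyPI as "ontocast";
# otherwise install from a local clone of the repository.
pip install ontocast
```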
## Configuration

### Environment Variables
Copy the example file and edit as needed:
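For example, assuming an `.env.example` file at the repository root:

```shell
cp .env.example .env
# then edit .env with your editor of choice
```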
Main options:
```bash
# LLM Configuration
# common
LLM_PROVIDER=openai          # or ollama
LLM_MODEL_NAME=gpt-4o-mini   # or an Ollama model name
LLM_TEMPERATURE=0.0

# openai
OPENAI_API_KEY=your_openai_api_key_here

# ollama
LLM_BASE_URL=

# Server
PORT=8999
RECURSION_LIMIT=1000
ESTIMATED_CHUNKS=30

# Optional: Triple Store Configuration (Fuseki preferred over Neo4j)
FUSEKI_URI=http://localhost:3032/test
FUSEKI_AUTH=admin/abc123-qwe
NEO4J_URI=bolt://localhost:7689
NEO4J_AUTH=neo4j/test!passfortesting
```
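A sketch of how such settings might be read at startup; the variable names come from the block above, and the fallback values mirror the example `.env` (they are not authoritative defaults of OntoCast itself):

```python
import os

# Read LLM and server settings from the environment, falling back to
# the values shown in the example .env block above.
llm_provider = os.getenv("LLM_PROVIDER", "openai")
llm_model = os.getenv("LLM_MODEL_NAME", "gpt-4o-mini")
llm_temperature = float(os.getenv("LLM_TEMPERATURE", "0.0"))
port = int(os.getenv("PORT", "8999"))
estimated_chunks = int(os.getenv("ESTIMATED_CHUNKS", "30"))
```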
### Triple Store Setup

OntoCast supports multiple triple store backends. When both Fuseki and Neo4j are configured, Fuseki is preferred.

- See Triple Store Setup for detailed Docker Compose instructions and sample `.env.example` files.
- Quick summary: copy and edit the provided `.env.example` in `docker/fuseki` or `docker/neo4j`, then run `docker compose --env-file .env <service> up -d` in the respective directory.
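The quick-summary steps, spelled out for the Fuseki backend; the service name `fuseki` is an assumption, so check the compose file in `docker/fuseki` for the actual service name:

```shell
cd docker/fuseki
cp .env.example .env                           # edit credentials as needed
docker compose --env-file .env fuseki up -d    # service name is an assumption
```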
## Running OntoCast Server

- `--clean` (optional): if set, the triple store (Neo4j or Fuseki) is initialized clean (all data deleted on startup). Warning: use with caution in production!
## API Usage

- `POST /process`: accepts `application/json` or file uploads (`multipart/form-data`).
- Returns: JSON with extracted facts (Turtle), ontology (Turtle), and processing metadata. Triples are also serialized to the configured triple store.
Examples:

```bash
# Process raw text
curl -X POST http://localhost:8999/process \
  -H "Content-Type: application/json" \
  -d '{"text": "Your document text here"}'

# Process a PDF file
curl -X POST http://url:port/process -F "file=@data/pdf/sample.pdf"

# Process a JSON file
curl -X POST http://url:port/process -F "file=@test2/sample.json"
```
## MCP Endpoints

- `GET /health`: health check
- `GET /info`: service info
- `POST /process`: document processing
## Filesystem Mode
If no triple store is configured, OntoCast stores ontologies and facts as Turtle files in the working directory.
## Notes

- JSON documents must contain a `text` field, e.g. `{"text": "Your document text here"}`
- `recursion_limit` is calculated as `max_visits * estimated_chunks` (`estimated_chunks` defaults to 30, or set via `.env`)
- Default port: 8999
## Docker
To build the OntoCast Docker image:
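A sketch of the build command, assuming a Dockerfile at the repository root; the image tag is arbitrary:

```shell
# Assumption: Dockerfile lives at the repository root
docker build -t ontocast .
```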
## Project Structure

```text
ontocast/
├── agent/       # Agent workflow and orchestration
├── cli/         # CLI utilities and server
├── prompt/      # LLM prompt templates
├── stategraph/  # State graph logic
├── tool/        # Triple store, chunking, and ontology tools
├── toolbox.py   # Toolbox for agent tools
├── onto.py      # Ontology and RDF graph handling
└── util.py      # Utilities
```

Top-level directories:

- `docker/` – Docker Compose and `.env.example` files for triple stores
- `data/` – Example data, ontologies, and test files
- `docs/` – Documentation and user guides
- `test/` – Test suite
## Workflow

The extraction follows a multi-stage workflow:

- Document Preparation
    - [Optional] Convert to Markdown
    - Text chunking
- Ontology Processing
    - Ontology selection
    - Text to ontology triples
    - Ontology critique
- Fact Extraction
    - Text to facts
    - Facts critique
    - Ontology sublimation
- Chunk Normalization
    - Chunk KG aggregation
    - Entity/Property Disambiguation
- Storage
    - Triple/KG serialization
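The stages above can be sketched as a simple pipeline. All function and type names here are hypothetical illustrations of the data flow, not OntoCast's actual API; the real system runs these stages as an agentic state graph with LLM calls and critique loops.

```python
from dataclasses import dataclass, field

@dataclass
class ChunkState:
    # Per-chunk state accumulated as the workflow runs
    text: str
    ontology_triples: list = field(default_factory=list)
    fact_triples: list = field(default_factory=list)

def prepare(document: str) -> list[ChunkState]:
    # Document Preparation: (optionally convert to Markdown, then) chunk text
    return [ChunkState(text=part) for part in document.split("\n\n") if part]

def process_ontology(chunk: ChunkState) -> ChunkState:
    # Ontology Processing: select ontology, extract ontology triples, critique
    chunk.ontology_triples.append(("ex:Document", "rdf:type", "owl:Class"))
    return chunk

def extract_facts(chunk: ChunkState) -> ChunkState:
    # Fact Extraction: text to facts, facts critique, ontology sublimation
    chunk.fact_triples.append(("ex:doc1", "ex:mentions", chunk.text[:20]))
    return chunk

def normalize(chunks: list[ChunkState]) -> list[tuple]:
    # Chunk Normalization: aggregate chunk KGs, disambiguate entities/properties
    return [t for c in chunks for t in c.ontology_triples + c.fact_triples]

def run(document: str) -> list[tuple]:
    chunks = [extract_facts(process_ontology(c)) for c in prepare(document)]
    # Storage: the aggregated triples would be serialized to the triple store
    return normalize(chunks)
```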
## Roadmap
- Add Jena Fuseki triple store for triple serialization
- Add Neo4j n10s for triple serialization
- Replace triple prompting with a tool for local graph retrieval
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments

- Uses RDFLib for semantic triple management
- Uses Docling for PDF/PPTX conversion
- Uses OpenAI language models / open models served via Ollama for fact extraction
- Uses LangChain/LangGraph for agent orchestration