# OntoCast

Agentic, ontology-assisted framework for semantic triple extraction from documents.

## Overview
OntoCast is a powerful framework that automatically extracts semantic triples from documents using an agentic approach. It combines ontology management with natural language processing to create structured knowledge from unstructured text.
## Key Features
- Ontology-Guided Extraction: Uses ontologies to guide the extraction process and ensure semantic consistency
- Entity Disambiguation: Resolves entity and property references across chunks
- Multi-Format Support: Handles various input formats including text, JSON, PDF, and Markdown
- Semantic Chunking: Intelligent text chunking based on semantic similarity
- MCP Compatibility: Fully compatible with the Model Context Protocol (MCP) specification, providing standardized endpoints for health checks, info, and document processing
- RDF Output: Generates standardized RDF/Turtle output
## Extraction Steps

1. **Document Processing**
    - Supports PDF, Markdown, and text documents
    - Automated text chunking and processing
2. **Automated Ontology Management**
    - Intelligent ontology selection and construction
    - Multi-stage validation and critique system
    - Ontology sublimation and refinement
3. **Knowledge Graph Integration**
    - RDF-based knowledge graph storage
    - Triple extraction for both ontologies and facts
    - Configurable workflow with visit limits
    - Chunk aggregation preserving fact lineage
## Installation
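The section gives no explicit install command; a minimal sketch, assuming a source checkout managed with `uv` (the repository URL is an assumption based on the documentation host):

```shell
# clone the repository and install its dependencies with uv (assumed layout)
git clone https://github.com/growgraph/ontocast.git
cd ontocast
uv sync
```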
## Configuration

Create a `.env` file with your OpenAI API key:
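For example (the `ESTIMATED_CHUNKS` entry is optional; see the NB section below):

```
# required: your OpenAI API key
OPENAI_API_KEY=sk-...
# optional: chunk estimate used to compute recursion_limit (defaults to 30)
ESTIMATED_CHUNKS=30
```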
## Running the Server
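A minimal invocation, assuming the `serve` entry point used in the filesystem-processing example below and the default port 8999:

```shell
# start the OntoCast server (listens on the default port 8999)
uv run serve \
    --ontology-directory ONTOLOGY_DIR \
    --working-directory WORKING_DIR
```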
### Process Endpoint

The `/process` endpoint accepts:

- `application/json`: JSON data
- `multipart/form-data`: File uploads

And returns:

- `application/json`: Processing results including:
    - Extracted facts in Turtle format
    - Generated ontology in Turtle format
    - Processing metadata
```shell
# Process a PDF file
curl -X POST http://url:port/process -F "file=@data/pdf/sample.pdf"

# Process a JSON file
curl -X POST http://url:port/process -F "file=@test2/sample.json"

# Process text content
curl -X POST http://localhost:8999/process \
    -H "Content-Type: application/json" \
    -d '{"text": "Your document text here"}'
```
### MCP Endpoints

OntoCast implements the following MCP-compatible endpoints:

- `GET /health`: Health check endpoint
- `GET /info`: Service information endpoint
- `POST /process`: Document processing endpoint
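The two read-only endpoints can be exercised with curl once the server is running (localhost and the default port 8999 are assumed):

```shell
# health check
curl http://localhost:8999/health

# service information
curl http://localhost:8999/info
```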
## Processing Filesystem Documents

```shell
uv run serve \
    --ontology-directory ONTOLOGY_DIR \
    --working-directory WORKING_DIR \
    --input-path DOCUMENT_DIR
```
## NB

- JSON documents are expected to contain their text in the `text` field
- `recursion_limit` is calculated as `max_visits * estimated_chunks`; the estimated number of chunks defaults to 30 or is fetched from `.env` (via `ESTIMATED_CHUNKS`)
- 8999 is used as the default port
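The recursion-limit arithmetic above can be sketched as follows (the `MAX_VISITS` value is hypothetical):

```shell
# recursion_limit = max_visits * estimated_chunks
MAX_VISITS=3          # hypothetical workflow setting
ESTIMATED_CHUNKS=30   # default when ESTIMATED_CHUNKS is not set in .env
echo $((MAX_VISITS * ESTIMATED_CHUNKS))  # prints 90
```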
## Docker

To build the Docker image:
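A hedged sketch, assuming a Dockerfile at the repository root and an arbitrary image tag:

```shell
# build the image (the tag name is an example)
docker build -t ontocast .

# run it, passing the .env configuration and exposing the default port
docker run --env-file .env -p 8999:8999 ontocast
```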
## Project Structure

```
src/
├── agent.py   # Main agent workflow implementation
├── onto.py    # Ontology and RDF graph handling
├── nodes/     # Individual workflow nodes
├── tools/     # Tool implementations
└── prompts/   # LLM prompts
```
## Workflow

The extraction follows a multi-stage workflow:

1. **Document Preparation**
    - [Optional] Convert to Markdown
    - Text chunking
2. **Ontology Processing**
    - Ontology selection
    - Text to ontology triples
    - Ontology critique
3. **Fact Extraction**
    - Text to facts
    - Facts critique
    - Ontology sublimation
4. **Chunk Normalization**
    - Chunk KG aggregation
    - Entity/Property disambiguation
5. **Storage**
    - Knowledge graph storage
## Documentation

Full documentation is available at [growgraph.github.io/ontocast](https://growgraph.github.io/ontocast).
## Roadmap

- Add a triple store for serialization/ontology management
- Replace graph-to-text conversion with a symbolic graph interface (agent tools for working with triples)
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Acknowledgments

- Uses RDFLib for semantic triple management
- Uses docling for PDF/PPTX conversion
- Uses OpenAI language models / open models served via Ollama for fact extraction
- Uses LangChain / LangGraph for agent orchestration