OntoCast Workflow
This document describes the document processing pipeline implemented in stategraph/create.py.
Overview
OntoCast transforms input documents into RDF ontology and facts graphs through a parallel map/reduce pipeline:
- Document conversion: PDF, DOCX, TXT, MD, or JSON → Markdown
- Semantic chunking: split into content units (optionally limited with `--head-chunks`)
- Ontology map/reduce (when `render_mode` includes ontology):
    - Per-unit context assembly (catalog selection or vector retrieval)
    - Render/critic loops with optional web evidence
    - Global normalize (provenance split) → optional consolidate → structural check → consistency critic
- Facts map/reduce (when `render_mode` includes facts):
    - Per-unit render/critic loops
    - Merge facts across units with entity disambiguation
- Serialize: write to triple store and return Turtle in the API response
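The stage order above can be sketched in a few lines of Python. This is a minimal illustration, not the code in stategraph/create.py: every function name is a hypothetical stand-in, units are processed sequentially for brevity, and it assumes that `render_mode` gating can be modelled as a substring test on the mode string.

```python
# Hypothetical stand-ins throughout; none of these names come from OntoCast.

def chunk(markdown: str) -> list[str]:
    """Stand-in for semantic chunking: one unit per blank-line-separated block."""
    return [part.strip() for part in markdown.split("\n\n") if part.strip()]

def render_ontology(unit: str) -> set[str]:
    """Toy per-unit ontology delta: one class named after the unit's first word."""
    return {"class:" + unit.split()[0]}

def render_facts(unit: str) -> list[tuple[str, str, int]]:
    """Toy per-unit facts: a single triple recording the unit length."""
    return [("cd:" + unit.split()[0], "ex:length", len(unit))]

def run_pipeline(markdown: str, render_mode: str = "ontology_and_facts") -> dict:
    units = chunk(markdown)
    result: dict = {}
    if "ontology" in render_mode:  # assumed gating rule
        deltas = [render_ontology(u) for u in units]        # map
        result["ontology"] = set().union(*deltas)            # reduce: merge deltas
    if "facts" in render_mode:
        per_unit = [render_facts(u) for u in units]          # map
        result["facts"] = [t for facts in per_unit for t in facts]  # reduce
    return result
```

Calling `run_pipeline(text, "ontology")` in this sketch skips the facts branch entirely, which mirrors how the facts map/reduce stage only runs when the render mode includes facts.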
Document-Level Graph
The LangGraph compiled by create_agent_graph() is rendered from the live workflow. Regenerate the rendered files after any graph change.
Outputs (under docs/assets/):
| File | Layout | Description |
|---|---|---|
| graph.png | Top-to-bottom | Full document pipeline (default) |
| graph.lr.png | Left-to-right | Same graph, landscape layout |
| graph.svg / graph.lr.svg | Vector | Scalable versions |
| graph.preview.png | Mermaid API | Small hand-drawn preview (optional) |
| graph.mmd | Mermaid source | Editable source at repo root |
Landscape layout (LR)
Nodes such as Update Ontology and Render Facts each run the per-unit atomic loop below (in parallel across content units).
Per-Unit Atomic Loop
Inside stategraph/atomic.py, each content unit runs an independent render → critic loop with optional web evidence. The same pattern applies to ontology (ontology_loop) and facts (facts_loop).
flowchart TD
START([Unit start]) --> CTX[Resolve ontology context]
CTX --> RLOOP{render attempt<br/>1 … max_visits}
RLOOP --> RENDER[Render GraphUpdate]
RENDER -->|success| FINAL{final render<br/>attempt?}
RENDER -->|fail| SEARCH_R{initiate_search?}
SEARCH_R -->|yes| EVID_R[Plan + fetch web evidence]
EVID_R --> RENDER2[Re-render]
RENDER2 -->|success| FINAL
RENDER2 -->|fail| RLOOP
SEARCH_R -->|no| RLOOP
FINAL -->|yes| DONE([Return unit state])
FINAL -->|no| CLOOP{critic attempt<br/>1 … max_visits}
CLOOP --> CRITIC[Criticise output]
CRITIC -->|success| DONE
CRITIC -->|fail| SEARCH_C{initiate_search?}
SEARCH_C -->|yes| EVID_C[Plan + fetch web evidence]
EVID_C --> CRITIC2[Re-criticise]
CRITIC2 -->|success| DONE
CRITIC2 -->|fail| CLOOP
SEARCH_C -->|no| CLOOP
CLOOP -->|exhausted| RLOOP
RLOOP -->|exhausted| FAIL([Return with failure])
Notes:
- The first render/critic pass always runs without web search; search runs only when the node sets `initiate_search`.
- On the last allowed render attempt, the critic is skipped (no further render would follow to act on the critique).
- `/process_unit` runs this loop on a single unit via `unit_pipeline.py` (no chunking or document-level reduce).

Implementation: stategraph/atomic.py.
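The control flow in the flowchart can be approximated as two nested bounded loops. The sketch below is an illustration under stated assumptions, not the code in stategraph/atomic.py: `StepResult`, `attempt`, and `atomic_loop` are hypothetical names, and render and critic steps are modelled as plain callables that optionally receive web evidence.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class StepResult:
    success: bool
    initiate_search: bool = False   # step asks for web evidence after a failure
    payload: Optional[str] = None   # rendered output (render steps only)

def attempt(step: Callable[[Optional[str]], StepResult],
            fetch_evidence: Callable[[], str]) -> StepResult:
    """Run a step without evidence first; retry once with evidence if requested."""
    result = step(None)
    if not result.success and result.initiate_search:
        result = step(fetch_evidence())
    return result

def atomic_loop(render, critic, fetch_evidence, max_visits: int = 1):
    """Return (ok, payload) following the render -> critic flow for one unit."""
    for render_attempt in range(1, max_visits + 1):
        rendered = attempt(render, fetch_evidence)
        if not rendered.success:
            continue                                  # next render attempt
        if render_attempt == max_visits:
            return True, rendered.payload             # final render: critic skipped
        for _ in range(max_visits):                   # critic attempts
            verdict = attempt(lambda ev: critic(rendered.payload, ev),
                              fetch_evidence)
            if verdict.success:
                return True, rendered.payload
        # critic budget exhausted: fall through to the next render attempt
    return False, None                                # render budget exhausted
```

With `max_visits=1` the critic is never called, because the first render attempt is also the final one.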
Stage Details
1. Document Input
- Accepts text, JSON (`text` field), or file uploads via `/process`
- Converts supported formats to Markdown while preserving structure
2. Chunking
- Semantic chunking splits the document into content units
- Units are processed in parallel up to `PARALLEL_WORKERS`
- Use `--head-chunks N` on the CLI to process only the first N units (testing)
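A bounded worker pool combined with a head limit might look like the sketch below. It is an assumption-laden illustration: `process_units` and its parameter names are hypothetical, and a thread pool is only one plausible way to cap concurrency at `PARALLEL_WORKERS`.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice

def process_units(units, worker, parallel_workers: int = 4, head_chunks=None):
    """Run `worker` on the first `head_chunks` units with bounded concurrency.

    head_chunks=None keeps every unit; results come back in input order.
    """
    selected = list(islice(units, head_chunks))
    with ThreadPoolExecutor(max_workers=parallel_workers) as pool:
        return list(pool.map(worker, selected))
```

Because `pool.map` preserves input order, downstream reduce steps can still associate each result with its source unit.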
3. Per-Unit Ontology Loop
Each content unit runs an independent ontology loop (stategraph/atomic.py):
- Context assembly: pick or retrieve ontology context for the unit:
    - LLM catalog selection (`selected_single_ontology`)
    - Qdrant vector ensemble (`selected_vector_search_ontology`)
    - Fixed catalog ontology (`fixed_single_ontology`)
- Render: the LLM emits `GraphUpdate` operations (Turtle or JSON-LD wire format)
- Critic: validate structure; retry up to `max_visits` (config or per-request override)
- External evidence (optional): web search on retry when the node requests it

See Ontology Context and User Instructions.
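The three context-assembly options lend themselves to a small dispatch table. The sketch below assumes the identifiers listed above are the accepted mode values; the handler functions, the `Catalog` shape, and the `choose` and `score` callables (standing in for the LLM call and the vector store) are all hypothetical.

```python
Catalog = dict[str, str]  # ontology IRI -> short description (assumed shape)

def fixed_single(unit: str, catalog: Catalog, fixed_iri: str) -> list[str]:
    """Always use one pre-configured ontology."""
    return [fixed_iri]

def selected_single(unit: str, catalog: Catalog, choose) -> list[str]:
    """`choose` stands in for the LLM call that picks one catalog entry."""
    return [choose(unit, catalog)]

def selected_vector_search(unit: str, catalog: Catalog, score, top_k: int = 2) -> list[str]:
    """`score` stands in for vector similarity from a store such as Qdrant."""
    ranked = sorted(catalog, key=lambda iri: score(unit, catalog[iri]), reverse=True)
    return ranked[:top_k]

HANDLERS = {
    "fixed_single_ontology": fixed_single,
    "selected_single_ontology": selected_single,
    "selected_vector_search_ontology": selected_vector_search,
}

def resolve_context(mode: str, unit: str, catalog: Catalog, **kwargs) -> list[str]:
    """Return the ontology IRIs that form the context for one content unit."""
    if mode not in HANDLERS:
        raise ValueError(f"unknown ontology context mode: {mode!r}")
    return HANDLERS[mode](unit, catalog, **kwargs)

def word_overlap(a: str, b: str) -> int:
    """Toy similarity: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))
```

For example, `resolve_context("selected_vector_search_ontology", unit, catalog, score=word_overlap, top_k=1)` returns the single best-matching catalog entry under the toy similarity.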
4. Ontology Reduce (Document Level)
After all units finish:
| Stage | Purpose |
|---|---|
| Normalize | Merge unit deltas; split RDF 1.2 provenance/reification into a side artifact |
| Consolidate (optional) | Single-pass refinement when ENABLE_ONTOLOGY_CONSOLIDATION=true |
| Structural check | Connectivity and schema validation |
| Consistency critic | Cross-unit ontology consistency |
Provenance triples (prov:, reification, chunk metadata) are kept in ontology_provenance_artifact, not in the working ontology graph passed to consolidation.
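The normalize step's provenance split can be pictured as partitioning merged triples into two sets. This sketch uses plain string tuples rather than an RDF library, and the predicate test (`prov:` prefix or `rdf:reifies`) is an assumed simplification of whatever rules the real normalizer applies.

```python
Triple = tuple[str, str, str]

def is_provenance(triple: Triple) -> bool:
    """Assumed rule: PROV predicates and RDF 1.2 reifier links are provenance."""
    _subject, predicate, _object = triple
    return predicate.startswith("prov:") or predicate == "rdf:reifies"

def normalize(unit_deltas: list[list[Triple]]) -> tuple[set[Triple], set[Triple]]:
    """Merge per-unit deltas, returning (working_graph, provenance_artifact)."""
    working: set[Triple] = set()
    provenance: set[Triple] = set()
    for delta in unit_deltas:
        for triple in delta:
            (provenance if is_provenance(triple) else working).add(triple)
    return working, provenance
```

Using sets means a class declared by several units appears once in the working graph, while the provenance set keeps the per-unit links out of what consolidation sees.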
5. Per-Unit Facts Loop
When facts rendering is enabled, each unit runs a facts loop (render → critic, with optional web evidence); the merge-facts step then applies cross-chunk entity disambiguation and aggregation.
Facts output uses the cd: namespace for text-derived instances; domain ontology IRIs are treated as read-only schema and pre-declared reference individuals (see Facts extraction model). The optional facts_user_instruction adds focus on top of these built-in guidelines.
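Cross-chunk disambiguation can be illustrated with a deliberately naive label-normalisation rule. The sketch below is not OntoCast's merge logic: `canonical_key` and `merge_facts` are hypothetical, and matching entities by normalised label is only one simple strategy chosen to show how facts from different units end up on one `cd:` instance.

```python
import re
from collections import defaultdict

def canonical_key(label: str) -> str:
    """Naive normalisation: lowercase, drop punctuation, collapse whitespace."""
    return re.sub(r"[^a-z0-9]+", " ", label.lower()).strip()

def merge_facts(units: list[list[tuple[str, str, str]]]) -> dict[str, set[tuple[str, str]]]:
    """Merge (label, predicate, object) facts from all units by canonical label."""
    merged: dict[str, set[tuple[str, str]]] = defaultdict(set)
    for unit_facts in units:
        for label, predicate, obj in unit_facts:
            iri = "cd:" + canonical_key(label).replace(" ", "_")
            merged[iri].add((predicate, obj))
    return dict(merged)
```

Here "Acme Corp." in one unit and "ACME corp" in another both map to `cd:acme_corp`, so their properties are aggregated on a single instance.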
6. Output
- Ontology and facts serialized to the configured triple store
- API returns Turtle (optionally with `strip_provenance=true` to omit reification scaffolding)
- Budget summary logged (LLM calls, characters, triple counts)
Configuration
| Setting / parameter | Effect |
|---|---|
| `RENDER_MODE` | `ontology`, `facts`, or `ontology_and_facts` |
| `PARALLEL_WORKERS` | Max concurrent unit workers |
| `MAX_VISITS` / `max_visits` | Render/critic retry budget per loop |
| `ENABLE_ONTOLOGY_CONSOLIDATION` | Optional post-normalization consolidation |
| `ONTOLOGY_CONTEXT_MODE` | How per-unit ontology context is sourced |
| `LLM_GRAPH_FORMAT` | `turtle` or `jsonld` LLM wire encoding |
| `--head-chunks` | CLI limit on units processed |
Full reference: Configuration System.
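As a rough picture of how these settings could be read and validated, the sketch below loads them from environment variables into a frozen dataclass. `WorkflowSettings` and `load_settings` are hypothetical names, and the defaults for `PARALLEL_WORKERS` and `LLM_GRAPH_FORMAT` are assumptions; the other defaults follow the ones mentioned under Best Practices.

```python
import os
from dataclasses import dataclass

RENDER_MODES = {"ontology", "facts", "ontology_and_facts"}
GRAPH_FORMATS = {"turtle", "jsonld"}

@dataclass(frozen=True)
class WorkflowSettings:
    render_mode: str
    parallel_workers: int
    max_visits: int
    enable_ontology_consolidation: bool
    llm_graph_format: str

def load_settings(env=None) -> WorkflowSettings:
    """Build settings from a mapping (defaults to os.environ), validating enums."""
    env = os.environ if env is None else env
    render_mode = env.get("RENDER_MODE", "ontology_and_facts")
    graph_format = env.get("LLM_GRAPH_FORMAT", "turtle")   # assumed default
    if render_mode not in RENDER_MODES:
        raise ValueError(f"invalid RENDER_MODE: {render_mode!r}")
    if graph_format not in GRAPH_FORMATS:
        raise ValueError(f"invalid LLM_GRAPH_FORMAT: {graph_format!r}")
    return WorkflowSettings(
        render_mode=render_mode,
        parallel_workers=int(env.get("PARALLEL_WORKERS", "4")),  # assumed default
        max_visits=int(env.get("MAX_VISITS", "1")),
        enable_ontology_consolidation=(
            env.get("ENABLE_ONTOLOGY_CONSOLIDATION", "false").lower() == "true"
        ),
        llm_graph_format=graph_format,
    )
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests without touching the process environment.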
Best Practices
- Start with defaults: `MAX_VISITS=1`, `ontology_and_facts`, consolidation off; tune after inspecting output.
- Use `--head-chunks` for large documents during development.
- Monitor budget summaries to estimate LLM cost at scale.
- Provide seed ontologies in `ONTOCAST_ONTOLOGY_DIRECTORY` for catalog selection modes.
- Enable vector mode only when Qdrant and embeddings are configured.
Next Steps
- Core Concepts: GraphUpdate, provenance, disambiguation
- API Endpoints: `/process`, `/process_unit`, parameters
- API Reference: `AgentState` and workflow types
