Entity Disambiguation and Aggregation
After per-unit facts extraction, OntoCast merges chunk-level graphs into a document-level facts graph with cross-chunk entity disambiguation.
Overview
The merge stage (tool/agg/aggregate.py):
- Collects facts graphs from all processed content units
- Clusters entity mentions using embeddings and symbolic compatibility
- Rewrites URIs to canonical identities
- Annotates merged triples with provenance where applicable
Ontology aggregation uses a similar embedding-based pipeline for anchor selection and URI rewriting during the document-level ontology reduce step.
Configuration
| Variable | Description | Default |
|---|---|---|
| `AGG_EMBEDDING_MODEL` | Sentence-transformers model for entity embeddings | `paraphrase-multilingual-MiniLM-L12-v2` |
| `AGG_SIMILARITY_THRESHOLD` | Cosine similarity threshold for DBSCAN clustering | `0.80` |
Lower thresholds merge more aggressively (fewer duplicate entities, higher false-merge risk). Raise the threshold when precision matters more than recall.
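The effect of the threshold can be illustrated with a toy example. The sketch below is not OntoCast code: the two-dimensional vectors, the mention names, and the `mergeable_pairs` helper are hypothetical, chosen only to show how lowering the cutoff admits more candidate merges.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical 2-D "embeddings"; real sentence-transformers vectors have
# hundreds of dimensions.
mentions = {
    "IBM": [1.0, 0.0],
    "I.B.M. Corp": [0.9, 0.3],
    "Intel": [0.5, 0.866],
}


def mergeable_pairs(vectors, threshold):
    """Return name pairs whose cosine similarity reaches the threshold."""
    names = sorted(vectors)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if cosine(vectors[a], vectors[b]) >= threshold
    ]


for threshold in (0.95, 0.80, 0.70):
    print(threshold, mergeable_pairs(mentions, threshold))
```

At the default of 0.80 only the two IBM variants pair up in this toy data; at 0.70 the second IBM variant also pairs with Intel, which is the kind of false merge a lower threshold risks.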
How Disambiguation Works
- Candidate extraction: entities from each unit's facts graph
- Embedding: dense vectors from `AGG_EMBEDDING_MODEL`
- Symbolic checks: labels, `skos:altName`, IRI compatibility; identical `URIRef` always compatible (e.g. the same ontology class appearing in predicted and ground-truth graphs clusters with score 1.0 even when labels are missing or embeddings disagree)
- Clustering: connected components over similarity + compatibility edges
- URI rewrite: merge graphs under canonical entity URIs
- Provenance: track which unit contributed each merged triple
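The steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the code in `tool/agg/aggregate.py`: the mention dictionaries, the label-overlap rule in `labels_compatible`, the choice of the lexicographically smallest URI as canonical, and the `ex:` identifiers are all hypothetical. Only the overall shape (similarity plus compatibility edges, connected components, URI rewrite, per-unit provenance) follows the description.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def normalize(labels):
    return {label.strip().lower() for label in labels}


def labels_compatible(m1, m2):
    # Hypothetical symbolic check: normalized label sets must overlap.
    return bool(normalize(m1["labels"]) & normalize(m2["labels"]))


def edge_score(m1, m2, threshold):
    # Identical URIs are always linked with score 1.0, even when labels
    # are missing or the embeddings disagree.
    if m1["uri"] == m2["uri"]:
        return 1.0
    sim = cosine(m1["vec"], m2["vec"])
    if sim >= threshold and labels_compatible(m1, m2):
        return sim
    return None


def cluster(mentions, threshold=0.80):
    """Map each URI to a canonical URI via connected components (union-find)."""
    parent = list(range(len(mentions)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(mentions)), 2):
        if edge_score(mentions[i], mentions[j], threshold) is not None:
            parent[find(i)] = find(j)

    groups = {}
    for i, mention in enumerate(mentions):
        groups.setdefault(find(i), set()).add(mention["uri"])
    # Assumption: the canonical URI is the lexicographically smallest one.
    return {uri: min(uris) for uris in groups.values() for uri in uris}


def merge(unit_triples, canonical):
    """Rewrite triples to canonical URIs and record contributing units."""
    provenance = {}
    for unit, triples in unit_triples.items():
        for triple in triples:
            merged = tuple(canonical.get(term, term) for term in triple)
            provenance.setdefault(merged, set()).add(unit)
    return provenance


mentions = [
    {"uri": "ex:chunk1/IBM", "labels": ["IBM", "International Business Machines"], "vec": [1.0, 0.0]},
    {"uri": "ex:chunk2/Intl_Business_Machines", "labels": ["International Business Machines"], "vec": [0.9, 0.3]},
    {"uri": "ex:chunk2/Intel", "labels": ["Intel"], "vec": [0.5, 0.866]},
    # Same URI as the first mention, no labels, very different embedding.
    {"uri": "ex:chunk1/IBM", "labels": [], "vec": [0.0, 1.0]},
]
unit_triples = {
    "chunk1": {("ex:chunk1/IBM", "ex:competesWith", "ex:chunk2/Intel")},
    "chunk2": {("ex:chunk2/Intl_Business_Machines", "ex:competesWith", "ex:chunk2/Intel")},
}

canonical = cluster(mentions)
merged = merge(unit_triples, canonical)
print(canonical)
print(merged)
```

In this toy data the two IBM variants collapse to one URI, the label-free mention joins them through the identical-URI rule, and the single merged triple carries both `chunk1` and `chunk2` as provenance. The last mention is close to Intel in embedding space, but the symbolic check blocks that edge.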
The standalone EntityAligner (tool/agg/entity_aligner.py) powers global alignment for the /match/entities API, which is intended for benchmark use. It relies on the same embedding approach and the same symbolic regimes (ontology_loose / ontology_strict).
Graph Matching API
For evaluation against ground truth, use the match endpoints (see API Endpoints):
- Align entities across multiple graphs globally
- Derive pairwise predicted↔GT mappings
- Compute triple and entity precision/recall/F1
Entity match payloads accept IRI strings or URIRef values; evaluation normalizes to URIRef for projection. Entity false positives/negatives count unmatched entities in each graph (set difference), so a shared ontology vocabulary IRI matched once is not also counted as an extra false positive on the other side.
The match-dirs CLI automates this for directory pairs of TTL files.
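The set-difference counting described in this section can be sketched as follows. This is an illustration, not the evaluation code behind the match endpoints: the `entity_scores` function, the plain-string IRIs, and the assumption of a one-to-one predicted-to-GT mapping are all hypothetical simplifications.

```python
def entity_scores(predicted, ground_truth, mapping):
    """Entity precision/recall/F1 from a one-to-one predicted -> GT mapping.

    False positives and false negatives are the unmatched entities on each
    side (set difference), so a matched pair is counted once as a true
    positive and never again as an error.
    """
    matched_pred = {
        p for p, g in mapping.items() if p in predicted and g in ground_truth
    }
    matched_gt = {mapping[p] for p in matched_pred}

    tp = len(matched_pred)
    fp = len(predicted - matched_pred)
    fn = len(ground_truth - matched_gt)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical IRIs; "ex:Person" is a shared ontology class present in both graphs.
predicted = {"ex:Person", "pred:alice", "pred:acme", "pred:spurious"}
ground_truth = {"ex:Person", "gt:alice", "gt:acme_corp", "gt:bob"}
mapping = {
    "ex:Person": "ex:Person",
    "pred:alice": "gt:alice",
    "pred:acme": "gt:acme_corp",
}

scores = entity_scores(predicted, ground_truth, mapping)
print(scores)
```

Here the shared `ex:Person` IRI contributes one true positive and nothing else; the only false positive is `pred:spurious` and the only false negative is `gt:bob`.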
Tuning Tips
- Inspect merge output before lowering `AGG_SIMILARITY_THRESHOLD`.
- Domain-specific embeddings: if you change `EMBEDDING_MODEL_NAME` for Qdrant, consider aligning `AGG_EMBEDDING_MODEL` for consistent geometry.
- Large documents: more units increase merge complexity; use `--head-chunks` while tuning.
Related
- Workflow — where merge fits in the pipeline
- Core Concepts — disambiguation overview
- Configuration — aggregation env vars