Entity Disambiguation and Aggregation

After per-unit facts extraction, OntoCast merges chunk-level graphs into a document-level facts graph with cross-chunk entity disambiguation.

Overview

The merge stage (tool/agg/aggregate.py) performs four steps:

Collects facts graphs from all processed content units
Clusters entity mentions using embeddings and symbolic compatibility
Rewrites URIs to canonical identities
Annotates merged triples with provenance where applicable

Ontology aggregation uses a similar embedding-based pipeline for anchor selection and URI rewriting during the document-level ontology reduce step.

Configuration

AGG_EMBEDDING_MODEL=paraphrase-multilingual-MiniLM-L12-v2
AGG_SIMILARITY_THRESHOLD=0.80

Variable	Description	Default
`AGG_EMBEDDING_MODEL`	Sentence-transformers model for entity embeddings	`paraphrase-multilingual-MiniLM-L12-v2`
`AGG_SIMILARITY_THRESHOLD`	Cosine similarity threshold for DBSCAN clustering	`0.80`

Lower thresholds merge more aggressively (fewer duplicate entities, higher false-merge risk). Raise the threshold when precision matters more than recall.
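To make the trade-off concrete, here is a minimal sketch of how a cosine threshold changes which mention pairs become merge candidates. The 2-D vectors and the `merge_candidates` helper are hypothetical and purely illustrative; real embeddings come from AGG_EMBEDDING_MODEL, and this is not OntoCast's implementation.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy 2-D "embeddings" (assumed values, chosen only to illustrate the effect).
mentions = {
    "Ada Lovelace": (1.0, 0.0),
    "A. Lovelace": (0.95, 0.3),
    "Lord Byron": (0.6, 0.8),
}


def merge_candidates(vectors, threshold):
    """Return every mention pair whose cosine similarity reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(vectors), 2)
        if cosine(vectors[a], vectors[b]) >= threshold
    ]


loose = merge_candidates(mentions, 0.80)   # two pairs pass, including a Byron link
strict = merge_candidates(mentions, 0.90)  # only the two Lovelace mentions pass
```

At 0.80 the "Lord Byron" vector is close enough to "A. Lovelace" to form an edge, so a transitive clustering step would pull all three mentions into one entity. At 0.90 only the two Lovelace mentions are linked.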

How Disambiguation Works

Candidate extraction: entities are collected from each unit's facts graph.
Embedding: dense vectors are computed with AGG_EMBEDDING_MODEL.
Symbolic checks: labels, skos:altLabel, and IRI compatibility. Identical URIRef values are always compatible (e.g. the same ontology class appearing in predicted and ground-truth graphs clusters with score 1.0 even when labels are missing or embeddings disagree).
Clustering: connected components over similarity and compatibility edges.
URI rewrite: graphs are merged under canonical entity URIs.
Provenance: the unit that contributed each merged triple is tracked.
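The steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the code in tool/agg/aggregate.py: the `cluster` and `rewrite` helpers, the toy vectors, the `ex:` IRIs, and the rule "canonical URI is the lexicographically smallest member" are all hypothetical, and the symbolic check is reduced to a stub.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cluster(entities, threshold, compatible):
    """Connected components via union-find; returns {uri: canonical_uri}.

    `entities` maps URI -> embedding. Identical URIs share one dict key,
    so they are trivially the same node.
    """
    parent = {uri: uri for uri in entities}

    def find(uri):
        while parent[uri] != uri:
            parent[uri] = parent[parent[uri]]  # path compression
            uri = parent[uri]
        return uri

    for a, b in combinations(sorted(entities), 2):
        if compatible(a, b) and cosine(entities[a], entities[b]) >= threshold:
            low, high = sorted((find(a), find(b)))
            parent[high] = low  # assumed rule: smallest URI becomes canonical
    return {uri: find(uri) for uri in entities}


def rewrite(triples_by_unit, canonical):
    """Rewrite subjects/objects to canonical URIs; record contributing units."""
    merged = {}
    for unit, triples in triples_by_unit.items():
        for s, p, o in triples:
            triple = (canonical.get(s, s), p, canonical.get(o, o))
            merged.setdefault(triple, set()).add(unit)
    return merged


def always_compatible(a, b):
    return True  # stub standing in for the label / IRI checks


entities = {
    "ex:chunk1/Ada": (1.0, 0.0),
    "ex:chunk2/Ada_Lovelace": (0.95, 0.3),
    "ex:chunk2/Byron": (0.0, 1.0),
}
triples_by_unit = {
    "unit-1": [("ex:chunk1/Ada", "ex:wrote", "ex:Notes")],
    "unit-2": [
        ("ex:chunk2/Ada_Lovelace", "ex:wrote", "ex:Notes"),
        ("ex:chunk2/Byron", "ex:parentOf", "ex:chunk2/Ada_Lovelace"),
    ],
}

canonical = cluster(entities, 0.80, always_compatible)
merged = rewrite(triples_by_unit, canonical)
```

After the rewrite, the two "wrote" triples collapse into one whose provenance set contains both units, while the Byron triple keeps its own subject and points at the canonical Ada URI.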

The standalone EntityAligner (tool/agg/entity_aligner.py) powers global alignment for the /match/entities API, which is intended for benchmarking. It relies on the same embedding and symbolic-regime concepts (ontology_loose / ontology_strict).

Graph Matching API

For evaluation against ground truth, use the match endpoints (see API Endpoints):

Align entities across multiple graphs globally
Derive pairwise predicted↔GT mappings
Compute triple, facts, and entity precision/recall/F1

Facts vs triple metrics: triple-level scores count typing and taxonomy (rdf:type, rdfs:subClassOf, …). Facts scores measure only instance-to-instance relations (e.g. book → character via an ontology property), excluding schema predicates and triples that touch class/concept nodes in subject or object position. Relation property IRIs in predicate position still count toward facts.
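The difference between the two scores can be illustrated with a small sketch. It assumes entities are already aligned, so triples can be compared by set intersection. The `facts_only` and `prf` helpers, the two-element schema predicate set, and the `ex:` IRIs are assumptions made for this example, not the evaluator's actual code.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_SUBCLASS_OF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
# Assumed, non-exhaustive list of schema predicates excluded from facts.
SCHEMA_PREDICATES = {RDF_TYPE, RDFS_SUBCLASS_OF}


def facts_only(triples, class_nodes):
    """Keep instance-to-instance triples: no schema predicate, no class node."""
    return {
        (s, p, o)
        for s, p, o in triples
        if p not in SCHEMA_PREDICATES
        and s not in class_nodes
        and o not in class_nodes
    }


def prf(predicted, gold):
    """Precision, recall, and F1 over two sets of triples."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


class_nodes = {"ex:Book"}
gold = {
    ("ex:hamlet", RDF_TYPE, "ex:Book"),
    ("ex:hamlet", "ex:hasCharacter", "ex:Hamlet"),
    ("ex:hamlet", "ex:hasCharacter", "ex:Ophelia"),
}
predicted = {
    ("ex:hamlet", RDF_TYPE, "ex:Book"),
    ("ex:hamlet", "ex:hasCharacter", "ex:Hamlet"),
}

triple_scores = prf(predicted, gold)
facts_scores = prf(facts_only(predicted, class_nodes),
                   facts_only(gold, class_nodes))
```

Here the correct rdf:type triple lifts triple-level recall to 2/3, while facts recall is 1/2 because only the hasCharacter relations are counted.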

Entity match payloads accept IRI strings or URIRef values; evaluation normalizes to URIRef for projection. Entity false positives/negatives count unmatched entities in each graph (set difference), so a shared ontology vocabulary IRI matched once is not also counted as an extra false positive on the other side.
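A minimal sketch of the set-difference counting described above, using plain strings in place of URIRef values. The `entity_counts` helper and the predicted-to-gold `mapping` dict are hypothetical shapes chosen for illustration; the actual payload format is documented with the API endpoints.

```python
def entity_counts(predicted_entities, gold_entities, mapping):
    """Return (true_positives, false_positives, false_negatives).

    `mapping` is an assumed dict of predicted IRI -> matched gold IRI.
    Unmatched entities on each side are found by set difference.
    """
    matched_pred = {
        p for p, g in mapping.items()
        if p in predicted_entities and g in gold_entities
    }
    matched_gold = {mapping[p] for p in matched_pred}
    false_pos = predicted_entities - matched_pred
    false_neg = gold_entities - matched_gold
    return len(matched_pred), len(false_pos), len(false_neg)


predicted_entities = {"ex:Book", "ex:p/Hamlet", "ex:p/Yorick"}
gold_entities = {"ex:Book", "ex:g/Hamlet", "ex:g/Ophelia"}
mapping = {
    "ex:Book": "ex:Book",          # shared vocabulary IRI, matched to itself
    "ex:p/Hamlet": "ex:g/Hamlet",
}

tp, fp, fn = entity_counts(predicted_entities, gold_entities, mapping)
```

The shared `ex:Book` IRI is counted once as a match and appears in neither difference, so only `ex:p/Yorick` is a false positive and only `ex:g/Ophelia` is a false negative.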

The match-dirs CLI automates this for directory pairs of TTL files.

Tuning Tips

Inspect merge output before lowering AGG_SIMILARITY_THRESHOLD.
Domain-specific embeddings — if you change EMBEDDING_MODEL_NAME for Qdrant, consider aligning AGG_EMBEDDING_MODEL for consistent geometry.
Large documents — more units increase merge complexity; use --head-chunks while tuning.

Workflow — where merge fits in the pipeline
Core Concepts — disambiguation overview
Configuration — aggregation env vars