Skip to content

ontocast.onto.embedding_policy

strip_provenance_triples_for_embedding(graph)

Return a copy of graph suitable for embedding generation.

Provenance/reification triples can dominate embedding neighborhoods and cause RAG retrieval to focus on document provenance rather than asserted facts. This helper removes a minimal set of provenance-related triples based on RDF predicate vocabulary.

Source code in ontocast/onto/embedding_policy.py
def strip_provenance_triples_for_embedding(graph: RDFGraph) -> RDFGraph:
    """Return a copy of *graph* suitable for embedding generation.

    Provenance/reification triples can dominate embedding neighborhoods and
    cause RAG retrieval to focus on document provenance rather than asserted
    facts. This helper removes a minimal set of provenance-related triples
    based on RDF predicate vocabulary.
    """
    result = RDFGraph()
    for prefix, namespace in graph.namespaces():
        if prefix:
            result.bind(prefix, namespace)

    prov_prefix = str(PROV)
    for s, p, o in graph:
        if not isinstance(p, URIRef):
            result.add((s, p, o))
            continue

        if p == DCTERMS.source:
            continue
        if p == RDF_REIFIES:
            continue
        if str(p).startswith(prov_prefix):
            continue

        result.add((s, p, o))

    return result