pelinker.linker_cluster_training

Training-frame cluster composition and provisional entity→cluster maps for :class:`~pelinker.model.Linker`.

cluster_composition_from_training_frame(training)

Aggregate mention counts per entity and per cluster from a fitted training frame.

Rows are weighted equally (each row is one mention). Proportions in :attr:`ClusterCompositionSnapshot.cluster_within_fraction` are relative to each cluster’s total mass; :attr:`ClusterCompositionSnapshot.cluster_fraction_of_property_mass` is relative to each entity’s global mass in this frame.

Source code in pelinker/linker_cluster_training.py
def cluster_composition_from_training_frame(
    training: pd.DataFrame,
) -> ClusterCompositionSnapshot:
    """
    Aggregate mention counts per ``entity`` and per ``cluster`` from a fitted training frame.

    Rows are weighted equally (each row is one mention). Proportions in
    :attr:`ClusterCompositionSnapshot.cluster_within_fraction` are relative to each cluster’s
    total mass; :attr:`ClusterCompositionSnapshot.cluster_fraction_of_property_mass` is
    relative to each entity's global mass in this frame.
    """
    if "entity" not in training.columns or "cluster" not in training.columns:
        raise ValueError("training frame must contain 'entity' and 'cluster' columns")
    work = training[["entity", "cluster"]].copy()
    work["entity"] = work["entity"].astype(str)
    global_vc = work["entity"].value_counts()
    global_property_mass = {str(k): int(v) for k, v in global_vc.items()}
    cluster_within: dict[int, dict[str, float]] = {}
    cluster_capture: dict[int, dict[str, float]] = {}
    for cid, grp in work.groupby("cluster", sort=True):
        c = int(cid)
        counts = grp["entity"].value_counts()
        total = int(counts.sum())
        if total == 0:
            continue
        cluster_within[c] = {
            str(p): float(counts[p]) / float(total) for p in counts.index
        }
        cap: dict[str, float] = {}
        for p, cnt in counts.items():
            gp = global_property_mass[str(p)]
            if gp > 0:
                cap[str(p)] = float(int(cnt)) / float(gp)
        cluster_capture[c] = cap
    return ClusterCompositionSnapshot(
        global_property_mass=global_property_mass,
        cluster_within_fraction=cluster_within,
        cluster_fraction_of_property_mass=cluster_capture,
    )
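
The two proportions differ only in their denominator, which a small worked example may make easier to see. The sketch below restates the arithmetic in plain Python rather than pandas; the entity names (`binds`, `inhibits`) and the toy mentions are made up for illustration and do not come from pelinker.

```python
from collections import Counter, defaultdict

# Toy mentions as (entity, cluster) pairs; each pair stands for one training row.
# Entity names and counts are illustrative assumptions.
mentions = [
    ("binds", 0), ("binds", 0), ("binds", 1),
    ("inhibits", 1), ("inhibits", 1), ("binds", -1),
]

# Global mass: number of mentions per entity across the whole frame.
global_mass = Counter(entity for entity, _ in mentions)

# Mention counts per entity inside each cluster (noise cluster -1 included).
per_cluster: dict[int, Counter] = defaultdict(Counter)
for entity, cluster in mentions:
    per_cluster[cluster][entity] += 1

# Within-cluster fraction: denominator is the cluster's total mass.
within = {
    c: {e: n / sum(cnt.values()) for e, n in cnt.items()}
    for c, cnt in per_cluster.items()
}

# Fraction of property mass: denominator is the entity's global mass.
capture = {
    c: {e: n / global_mass[e] for e, n in cnt.items()}
    for c, cnt in per_cluster.items()
}
```

Here cluster 1 holds one `binds` and two `inhibits` mentions, so `inhibits` makes up two thirds of the cluster while the cluster captures all of the `inhibits` mentions in the frame. Within-cluster fractions sum to 1 per cluster; capture fractions sum to 1 per entity across clusters.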

consensus_cluster_names(composition, *, uniform_width_tol=0.15, dominance_min_share=0.52, dominance_min_gap=0.12, noise_cluster_label='noise')

Derive a short human-readable name per cluster from within-cluster entity mixtures.

  • Single-property clusters use that property name.
  • Flat / near-uniform admixture uses hyphenated sorted property names.
  • Clear single dominant property uses that name; duplicate dominant names across clusters get _A, _B, … suffixes (stable order by cluster id).
  • Remaining mixed cases use hyphenated sorted names; collisions are disambiguated the same way.
  • Cluster -1 (HDBSCAN noise) is named noise_cluster_label unless overridden by callers.
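
The suffix rule in the list above can be sketched as follows. The function `disambiguate` is a hypothetical stand-in: the module's own `_disambiguate_consensus_names` is not shown on this page, so details such as how more than 26 duplicates are handled are assumptions.

```python
from collections import defaultdict
from string import ascii_uppercase


def disambiguate(raw: dict[int, str]) -> dict[int, str]:
    """Append _A, _B, ... to names shared by several clusters (hypothetical helper)."""
    # Group cluster ids by name, visiting ids in ascending order so that
    # suffix assignment is stable.
    by_name: dict[str, list[int]] = defaultdict(list)
    for cid in sorted(raw):
        by_name[raw[cid]].append(cid)

    out: dict[int, str] = {}
    for name, cids in by_name.items():
        if len(cids) == 1:
            out[cids[0]] = name  # unique names are left untouched
            continue
        for i, cid in enumerate(cids):
            # Assumption: fewer than 27 clusters share one name.
            out[cid] = f"{name}_{ascii_uppercase[i]}"
    return out


names = disambiguate({3: "binds", 0: "binds", 1: "binds-inhibits", -1: "noise"})
```

With this input, clusters 0 and 3 become `binds_A` and `binds_B`, while the unique names `binds-inhibits` and `noise` keep their form.
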
Source code in pelinker/linker_cluster_training.py
def consensus_cluster_names(
    composition: ClusterCompositionSnapshot,
    *,
    uniform_width_tol: float = 0.15,
    dominance_min_share: float = 0.52,
    dominance_min_gap: float = 0.12,
    noise_cluster_label: str = "noise",
) -> dict[int, str]:
    """
    Derive a short human-readable name per cluster from within-cluster entity mixtures.

    * Single-property clusters use that property name.
    * Flat / near-uniform admixture uses hyphenated sorted property names.
    * Clear single dominant property uses that name; duplicate dominant names across clusters
      get ``_A``, ``_B``, … suffixes (stable order by cluster id).
    * Remaining mixed cases use hyphenated sorted names; collisions are disambiguated the same way.
    * Cluster ``-1`` (HDBSCAN noise) is named ``noise_cluster_label`` unless overridden by callers.
    """
    raw: dict[int, str] = {}
    for cid, mass_frac in composition.cluster_within_fraction.items():
        if cid == -1:
            raw[cid] = noise_cluster_label
            continue
        if not mass_frac:
            raw[cid] = str(cid)
            continue
        k = len(mass_frac)
        width_tol = min(uniform_width_tol, 0.5 / float(max(k, 1)))
        if k == 1:
            base = next(iter(mass_frac))
        elif _is_near_uniform_mixture(mass_frac, width_tol=width_tol):
            base = _hyphen_join_properties(mass_frac)
        else:
            dom = _singular_dominant_property(
                mass_frac,
                min_share=dominance_min_share,
                min_gap=dominance_min_gap,
            )
            base = dom if dom is not None else _hyphen_join_properties(mass_frac)
        raw[cid] = base
    return _disambiguate_consensus_names(raw)
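
To make the thresholds concrete, the sketch below reproduces the effective width tolerance exactly as computed above and pairs it with one plausible reading of the dominance test. The helper `looks_dominant` is hypothetical: `_singular_dominant_property` is not shown here, and treating `min_gap` as the distance between the top share and the runner-up is an assumption.

```python
def effective_width_tol(k: int, uniform_width_tol: float = 0.15) -> float:
    # Same expression as in consensus_cluster_names: the tolerance shrinks
    # once a cluster mixes more than three entities.
    return min(uniform_width_tol, 0.5 / float(max(k, 1)))


def looks_dominant(
    mass_frac: dict[str, float],
    min_share: float = 0.52,
    min_gap: float = 0.12,
):
    """Assumed dominance rule: top share large enough and clear of the runner-up."""
    # Sort by descending share, then by name for a deterministic order.
    ranked = sorted(mass_frac.items(), key=lambda kv: (-kv[1], kv[0]))
    top_name, top = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    if top >= min_share and top - runner_up >= min_gap:
        return top_name
    return None


def hyphen_join(mass_frac: dict[str, float]) -> str:
    # Fallback name for mixed clusters: hyphenated sorted property names.
    return "-".join(sorted(mass_frac))
```

Under these assumptions a 70/30 mixture would be named after its leading entity, whereas a 55/45 mixture passes the share threshold but fails the gap and falls back to the hyphenated name.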

provisional_cluster_assignments_from_training_frame(labels_map, training)

Map each entity_id to a single cluster id for predict compatibility.

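Heuristic: take the modal training cluster among rows whose entity equals labels_map[entity_id], ignoring noise rows (cluster -1). Entities with no matching rows, or for which no modal cluster can be determined, are omitted from the result. Interpretation of clusters is otherwise left to downstream analysis (see Linker.training_cluster_frame).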

Source code in pelinker/linker_cluster_training.py
def provisional_cluster_assignments_from_training_frame(
    labels_map: dict[str, str],
    training: pd.DataFrame,
) -> dict[str, int]:
    """
    Map each ``entity_id`` to a single cluster id for ``predict`` compatibility.

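    Heuristic: take the modal training cluster among rows whose ``entity`` equals
    ``labels_map[entity_id]``, ignoring noise rows (cluster ``-1``). Entities with no
    matching rows, or for which no modal cluster can be determined, are omitted.
    Interpretation of clusters is otherwise left to downstream analysis
    (see ``Linker.training_cluster_frame``).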
    """
    out: dict[str, int] = {}
    if "entity" not in training.columns or "cluster" not in training.columns:
        return out
    for entity_id, label in labels_map.items():
        rows = training.loc[training["entity"] == label, "cluster"]
        if len(rows) == 0:
            continue
        mode = _modal_cluster_deterministic(rows.astype(int).tolist())
        if mode is None:
            continue
        out[str(entity_id)] = int(mode)
    return out
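
The modal-cluster step relies on `_modal_cluster_deterministic`, which is not shown on this page. A minimal sketch of what such a helper might do is given below; `modal_cluster` is a hypothetical name, and breaking ties in favour of the smallest cluster id is an assumption, not the library's documented behaviour.

```python
from collections import Counter
from typing import Optional


def modal_cluster(cluster_ids: list[int]) -> Optional[int]:
    """Most frequent non-noise cluster id, or None if only noise rows remain."""
    # Drop HDBSCAN noise rows (-1) before counting.
    counts = Counter(c for c in cluster_ids if c != -1)
    if not counts:
        return None
    # Highest count wins; ties go to the smallest id (assumed tie-break).
    return min(counts, key=lambda c: (-counts[c], c))
```

A `None` result corresponds to the `continue` branch above, which is why an entity seen only in noise rows would not appear in the returned map.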