pelinker.linker_cluster_training
Training-frame cluster composition and provisional entity→cluster maps for :class:`~pelinker.model.Linker`.
cluster_composition_from_training_frame(training)
Aggregate mention counts per entity and per cluster from a fitted training frame.
Rows are weighted equally (each row is one mention). Proportions in
:attr:`ClusterCompositionSnapshot.cluster_within_fraction` are relative to each cluster's
total mass; :attr:`ClusterCompositionSnapshot.cluster_fraction_of_property_mass` is
relative to each entity's global mass in this frame.
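The two normalizations can be illustrated with a minimal sketch. It is not the library's implementation: the helper name `composition_sketch` is hypothetical, and the training frame is assumed to reduce to an iterable of `(entity, cluster_id)` pairs, one per mention, since the real column names are not given here.

```python
from collections import Counter, defaultdict


def composition_sketch(rows):
    """Hypothetical helper: rows is an iterable of (entity, cluster_id) pairs."""
    # Each row counts as exactly one mention (equal weighting).
    pair_counts = Counter(rows)

    cluster_mass = Counter()  # total mentions per cluster
    entity_mass = Counter()   # total mentions per entity across the frame
    for (entity, cluster), n in pair_counts.items():
        cluster_mass[cluster] += n
        entity_mass[entity] += n

    within = defaultdict(dict)     # share of a cluster's mass held by each entity
    of_entity = defaultdict(dict)  # share of an entity's mass that lands in each cluster
    for (entity, cluster), n in pair_counts.items():
        within[cluster][entity] = n / cluster_mass[cluster]
        of_entity[cluster][entity] = n / entity_mass[entity]
    return dict(within), dict(of_entity)


rows = [("p1", 0), ("p1", 0), ("p2", 0), ("p1", 1)]
within, of_entity = composition_sketch(rows)
```

In this toy input, cluster 0 holds three mentions, so `within[0]` sums to 1 over its entities, while `of_entity[0]["p1"]` is the portion of all `p1` mentions that fall in cluster 0.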
Source code in pelinker/linker_cluster_training.py
consensus_cluster_names(composition, *, uniform_width_tol=0.15, dominance_min_share=0.52, dominance_min_gap=0.12, noise_cluster_label='noise')
Derive a short human-readable name per cluster from within-cluster entity mixtures.
- Single-property clusters use that property name.
- Flat / near-uniform admixture uses hyphenated sorted property names.
- Clear single dominant property uses that name; duplicate dominant names across clusters get `_A`, `_B`, … suffixes (stable order by cluster id).
- Remaining mixed cases use hyphenated sorted names; collisions are disambiguated the same way.
- Cluster `-1` (HDBSCAN noise) is named `noise_cluster_label` unless overridden by callers.
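The naming rules above can be sketched as follows. This is an illustration under stated assumptions, not the actual function: `name_clusters_sketch` is a hypothetical name, the input is assumed to be a mapping from cluster id to within-cluster entity shares, and the threshold semantics are guesses from the parameter names (`uniform_width_tol` as the maximum spread between largest and smallest share, `dominance_min_share` as the minimum top share, `dominance_min_gap` as the minimum lead over the runner-up). The sketch also applies the `_A`, `_B` suffixing to every colliding name, including single-property names, which the documentation does not state explicitly.

```python
import string


def name_clusters_sketch(within, *, uniform_width_tol=0.15, dominance_min_share=0.52,
                         dominance_min_gap=0.12, noise_cluster_label="noise"):
    """Hypothetical sketch: within maps cluster id -> {entity name: share}."""
    base = {}
    for cid in sorted(within):  # sorted ids keep suffix order stable
        if cid == -1:
            continue  # noise is named separately below
        shares = sorted(within[cid].items(), key=lambda kv: (-kv[1], kv[0]))
        hyphenated = "-".join(sorted(within[cid]))
        if len(shares) == 1:
            base[cid] = shares[0][0]  # single-property cluster
        elif shares[0][1] - shares[-1][1] <= uniform_width_tol:
            base[cid] = hyphenated    # flat / near-uniform admixture (assumed test)
        elif (shares[0][1] >= dominance_min_share
              and shares[0][1] - shares[1][1] >= dominance_min_gap):
            base[cid] = shares[0][0]  # clear single dominant property (assumed test)
        else:
            base[cid] = hyphenated    # remaining mixed cases

    # Group cluster ids by provisional name; dict order follows sorted cluster ids.
    by_name = {}
    for cid, name in base.items():
        by_name.setdefault(name, []).append(cid)

    names = {}
    for name, cids in by_name.items():
        if len(cids) == 1:
            names[cids[0]] = name
        else:
            # Sketch limitation: zip stops after 26 colliding clusters.
            for letter, cid in zip(string.ascii_uppercase, cids):
                names[cid] = f"{name}_{letter}"
    if -1 in within:
        names[-1] = noise_cluster_label
    return names


within = {
    -1: {"p1": 1.0},
    0: {"p1": 1.0},
    1: {"p1": 0.8, "p2": 0.2},
    2: {"p1": 0.5, "p2": 0.5},
    3: {"p1": 0.5, "p2": 0.3, "p3": 0.2},
}
names = name_clusters_sketch(within)
```

Here clusters 0 and 1 both resolve to `p1` and so receive suffixes, cluster 2 is near-uniform, and cluster 3 falls through to the mixed case because its top share is below the assumed dominance threshold.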
provisional_cluster_assignments_from_training_frame(labels_map, training)
Map each entity_id to a single cluster id for predict compatibility.
Heuristic: the modal training cluster among rows whose entity equals
labels_map[entity_id], ignoring noise rows (cluster -1). Interpretation of clusters is otherwise
left to downstream analysis (see Linker.training_cluster_frame).
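A minimal sketch of the modal-cluster heuristic is shown below. It is illustrative only: `modal_cluster_sketch` is a hypothetical name, the training frame is again assumed to reduce to `(entity, cluster_id)` pairs, and two behaviors are assumptions not confirmed by the documentation: ties are broken by the smallest cluster id, and entities with no non-noise rows are simply left out of the result.

```python
from collections import Counter


def modal_cluster_sketch(labels_map, rows):
    """Hypothetical sketch: labels_map maps entity_id -> entity label used in rows."""
    counts = {}
    for entity, cluster in rows:
        if cluster == -1:
            continue  # noise rows do not vote
        counts.setdefault(entity, Counter())[cluster] += 1

    assignments = {}
    for entity_id, label in labels_map.items():
        votes = counts.get(label)
        if votes:
            # Most frequent cluster; ties go to the smallest id (assumption).
            assignments[entity_id] = min(votes, key=lambda c: (-votes[c], c))
        # Entities seen only as noise, or not at all, are omitted (assumption).
    return assignments


labels_map = {"E1": "p1", "E2": "p2", "E3": "p3"}
rows = [("p1", 0), ("p1", 0), ("p1", 1), ("p2", -1), ("p2", 2), ("p3", -1)]
assignments = modal_cluster_sketch(labels_map, rows)
```

In this toy input, `E1` maps to cluster 0 (two of three mentions), `E2` maps to cluster 2 once its noise row is ignored, and `E3` receives no assignment because all of its rows are noise.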