pelinker.config¶
ClusterCompositionSnapshot
dataclass
¶
Mention-weighted mixture of KB property labels per HDBSCAN cluster after Linker.fit.
- global_property_mass: total mention count per property in the fitted corpus (the denominator for “fraction of that property’s mass” views).
- cluster_within_fraction: within each cluster, each property’s share of that cluster’s mention mass (sums to 1.0 per cluster).
- cluster_fraction_of_property_mass: for each cluster and property, mentions(cluster ∩ property) / global_property_mass[property], i.e. how much of that property’s corpus sits in this cluster. For a fixed property this sums to ≤ 1.0 across disjoint cluster rows, excluding double-counting issues from overlapping keys.
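The three views can be illustrated on a toy mention table. This is a minimal sketch rather than the pelinker implementation: the `mentions` list, the property names, and the plain-dict layout are assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical (cluster_id, property_label) pairs, one per mention.
mentions = [
    (0, "binds"), (0, "binds"), (0, "inhibits"),
    (1, "binds"), (1, "inhibits"), (1, "inhibits"), (1, "inhibits"),
]

# Total mention count per property across the whole corpus.
global_property_mass = Counter(prop for _, prop in mentions)

# Mention counts per (cluster, property) cell.
cell_counts = defaultdict(Counter)
for cluster, prop in mentions:
    cell_counts[cluster][prop] += 1

# Share of each property inside a cluster (each row sums to 1.0).
cluster_within_fraction = {
    c: {p: n / sum(counts.values()) for p, n in counts.items()}
    for c, counts in cell_counts.items()
}

# Share of each property's corpus-wide mass that falls in a cluster.
cluster_fraction_of_property_mass = {
    c: {p: n / global_property_mass[p] for p, n in counts.items()}
    for c, counts in cell_counts.items()
}
```

Because every mention here belongs to exactly one cluster, the second view also sums to 1.0 per property; with noise points excluded it would sum to less.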
Source code in pelinker/config.py
ClusteringOptimizationConfig
dataclass
¶
Configuration for clustering optimization grid search.
batch_size = 1000
class-attribute
instance-attribute
¶
Rows per batch when reading mention-level embedding parquet (not encoder batch size).
clustering_grid_step = 5
class-attribute
instance-attribute
¶
Step between consecutive min_cluster_size values on the grid (numpy.arange step).
grid_derivative_rel_tol = 0.12
class-attribute
instance-attribute
¶
|df/dx| below this times max|df/dx| counts as “derivative near zero” on the smoothed curve.
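One way to read this criterion, sketched with NumPy. The function name and the use of `numpy.gradient` for the finite-difference derivative are assumptions; the library may estimate the derivative differently.

```python
import numpy as np

def near_zero_derivative_mask(x, y_smooth, rel_tol=0.12):
    """Flag grid points where |df/dx| < rel_tol * max|df/dx|."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y_smooth, dtype=float)
    dy = np.gradient(y, x)          # finite-difference estimate of df/dx
    scale = np.max(np.abs(dy))
    if scale == 0:                  # flat curve: every point qualifies
        return np.ones_like(dy, dtype=bool)
    return np.abs(dy) < rel_tol * scale

# Example: the curve flattens at the two largest grid values.
mask = near_zero_derivative_mask([10, 15, 20, 25, 30], [0.0, 0.5, 0.9, 1.0, 1.0])
```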
grid_objective = 'dbcv_ari_mean_minmax'
class-attribute
instance-attribute
¶
Which scalar to optimize on the grid (single metric or pooled DBCV+ARI; see clustering_grid).
grid_plateau_fraction = 0.92
class-attribute
instance-attribute
¶
Plateau threshold on the smoothed curve: y_min + this * (y_max - y_min) (finite values only).
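The threshold formula can be sketched as follows. The function name is hypothetical, and how the library uses the threshold afterwards is not shown here.

```python
import numpy as np

def plateau_threshold(y_smooth, plateau_fraction=0.92):
    """Return y_min + plateau_fraction * (y_max - y_min) over finite values."""
    y = np.asarray(y_smooth, dtype=float)
    finite = y[np.isfinite(y)]      # drop NaN and +/-inf before taking min/max
    if finite.size == 0:
        return float("nan")
    y_min, y_max = finite.min(), finite.max()
    return float(y_min + plateau_fraction * (y_max - y_min))
```

With the default of 0.92, a smoothed curve spanning 0.0 to 1.0 gives a threshold of 0.92.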
grid_smooth_window = 3
class-attribute
instance-attribute
¶
Odd-length centered moving-average window for smoothing f(x). Even values are bumped up by one.
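A minimal sketch of a centered moving average with the even-to-odd bump. The edge handling (repeating the boundary values) is an assumption; the source does not say how the ends of the curve are treated.

```python
import numpy as np

def smooth(y, window=3):
    """Centered moving average; an even window is bumped up by one."""
    if window % 2 == 0:
        window += 1
    y = np.asarray(y, dtype=float)
    half = window // 2
    padded = np.pad(y, half, mode="edge")   # assumed edge handling
    kernel = np.ones(window) / window
    # "valid" on the padded signal returns exactly len(y) points.
    return np.convolve(padded, kernel, mode="valid")
```

Under this sketch, `smooth(y, window=2)` and `smooth(y, window=3)` give the same result.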
min_scale = None
class-attribute
instance-attribute
¶
Lower bound (inclusive) for the min_cluster_size grid.
When None, defaults to max(1, min_class_size // 2) (legacy behavior: half of
min_class_size). Set explicitly to decouple the grid start from mention-level
filtering (min_class_size).
n_embedding_batches = None
class-attribute
instance-attribute
¶
Cap parquet reads at this many batches (batch_size rows each); None = read all.
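The interaction between batch_size and n_embedding_batches can be sketched without touching parquet at all. Both helpers below are hypothetical; a real reader might obtain its batches from something like pyarrow's `ParquetFile.iter_batches(batch_size=...)`, which is an assumption about the implementation.

```python
from itertools import islice

def row_batches(rows, batch_size=1000):
    """Yield consecutive slices of at most batch_size rows."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]

def capped_batches(batches, n_embedding_batches=None):
    """Stop after n_embedding_batches batches; None reads everything."""
    return islice(batches, n_embedding_batches)

rows = list(range(2500))
all_batches = list(capped_batches(row_batches(rows), None))   # 3 batches
first_two = list(capped_batches(row_batches(rows), 2))        # 2000 rows
```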
negative_screener = field(default_factory=NegativeScreenerConfig)
class-attribute
instance-attribute
¶
Negative-class screening before PCA→UMAP (see NegativeScreenerConfig).
optimization_method = 'mean'
class-attribute
instance-attribute
¶
How to build the objective f(min_cluster_size) before smoothing (mean / lower_bound / weighted).
resolved_min_scale()
¶
Inclusive start of the min_cluster_size grid (HDBSCAN hyperparameter).
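How the grid start and step fit together, as a hedged sketch. The free functions mirror the documented default for min_scale, but `max_scale` (an exclusive upper bound) is a hypothetical parameter; the source does not say how the upper end of the grid is chosen.

```python
import numpy as np

def resolved_min_scale(min_scale, min_class_size):
    """Explicit min_scale wins; otherwise half of min_class_size, at least 1."""
    if min_scale is not None:
        return min_scale
    return max(1, min_class_size // 2)

def min_cluster_size_grid(min_scale, min_class_size, max_scale, step=5):
    """Candidate min_cluster_size values; max_scale is assumed and exclusive."""
    start = resolved_min_scale(min_scale, min_class_size)
    return np.arange(start, max_scale, step)

grid = min_cluster_size_grid(None, min_class_size=20, max_scale=30)
```

With min_class_size = 20 and the default step of 5, the grid starts at 10 and advances in steps of 5.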
EmbeddingModelMetadata
dataclass
¶
Describes which embedding backbones/layers produced the model (saved with the Linker).
EmbeddingSourceSpec
dataclass
¶
One backbone + layer selection (e.g. for a single encoder or one branch of a fused model).
EmbeddingTrainingConfig
dataclass
¶
Inputs and runtime settings used only while embedding the corpus (not part of model identity).
encoder_batch_size = 200
class-attribute
instance-attribute
¶
How many table rows are encoded per transformer forward pass; lower if GPU memory is tight.
input_buffer_rows = 1000
class-attribute
instance-attribute
¶
Rows read per pandas.read_csv(..., chunksize=...) pass over the text table (I/O buffer only).
max_input_buffers = None
class-attribute
instance-attribute
¶
If set, stop after this many text-table read passes (each up to input_buffer_rows rows).
negative_label = NEGATIVE_LABEL
class-attribute
instance-attribute
¶
Entity label to use for synthetic negative rows.
negative_seed = 13
class-attribute
instance-attribute
¶
Optional random seed for deterministic negative sampling.
negatives_per_positive = 0.0
class-attribute
instance-attribute
¶
Number of random negative mentions to sample per positive mention.
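A sketch of how a per-positive ratio and a fixed seed could combine into a reproducible sample. The function, the rounding rule, and the cap at the number of available candidates are all assumptions, not the library's documented behavior.

```python
import random

def sample_negatives(candidates, n_positive, negatives_per_positive=0.0, seed=13):
    """Draw round(n_positive * ratio) candidates with a seeded RNG."""
    n_negative = min(len(candidates), round(n_positive * negatives_per_positive))
    rng = random.Random(seed)       # same seed, same sample
    return rng.sample(candidates, n_negative)

pool = list(range(100))             # hypothetical candidate mentions
picked = sample_negatives(pool, n_positive=10, negatives_per_positive=0.5)
```

Under this reading, the default ratio of 0.0 yields no negatives.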
KBConfig
dataclass
¶
Metadata for the knowledge base packaged with a fitted Linker.
entity_count = None
class-attribute
instance-attribute
¶
If None at construction time, set from the vocabulary size after fit.
LinkerFitConfig
dataclass
¶
Parquet read, mention filters, and screener settings for Linker.fit (pelinker.model.Linker.fit).
min_class_size = 20
class-attribute
instance-attribute
¶
Minimum mention rows per KB entity before training (negative label exempt).
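The filter described above can be sketched on a flat list of labels. The helper name and the value of `NEGATIVE_LABEL` are hypothetical; only the rule itself (a minimum row count, with the negative label exempt) comes from the description.

```python
from collections import Counter

NEGATIVE_LABEL = "__negative__"     # hypothetical value for illustration

def keep_indices(labels, min_class_size=20, negative_label=NEGATIVE_LABEL):
    """Indices of mention rows whose entity has enough rows, or is negative."""
    counts = Counter(labels)
    return [
        i for i, label in enumerate(labels)
        if label == negative_label or counts[label] >= min_class_size
    ]

labels = ["a", "a", "a", "b", NEGATIVE_LABEL]
kept = keep_indices(labels, min_class_size=2)
```

Here entity "b" has a single row and is dropped, while the lone negative row survives.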
ManifoldOovScreenerConfig
dataclass
¶
3D (residual, Mahalanobis, spectral entropy) OOV score model; predict-time gate only.
dt_max_depth_candidates = (None, 4, 8)
class-attribute
instance-attribute
¶
None means unrestricted depth (sklearn default).
NegativeScreenerConfig
dataclass
¶
Binary LDA/SVM screen for negative_label vs KB mentions before PCA→UMAP.
kind = 'lda'
class-attribute
instance-attribute
¶
Estimator persisted on pelinker.model.Linker (as Linker.screener).
TransformConfig
dataclass
¶
Configuration for the embedding transformation pipeline.
pca_components = 50
class-attribute
instance-attribute
¶
Number of principal components to keep after PCA reduction.
umap_components = 4
class-attribute
instance-attribute
¶
Number of UMAP dimensions for clustering (typically 3-5).
umap_metric = 'cosine'
class-attribute
instance-attribute
¶
Distance metric for UMAP (default: 'cosine').
umap_viz_components = 3
class-attribute
instance-attribute
¶
Number of UMAP dimensions for visualization (default: 3).
umap_viz_metric = 'cosine'
class-attribute
instance-attribute
¶
Distance metric for visualization UMAP (default: 'cosine').
__post_init__()
¶
Validate configuration parameters.
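The source does not list which checks `__post_init__` performs. The stand-in dataclass below reuses the documented field names and defaults and shows one plausible rule (positive integer component counts); the class name and the rule itself are assumptions.

```python
from dataclasses import dataclass

@dataclass
class TransformConfigSketch:
    """Illustrative stand-in, not the real pelinker TransformConfig."""
    pca_components: int = 50
    umap_components: int = 4
    umap_metric: str = "cosine"
    umap_viz_components: int = 3
    umap_viz_metric: str = "cosine"

    def __post_init__(self):
        # Assumed rule: every component count must be a positive integer.
        for name in ("pca_components", "umap_components", "umap_viz_components"):
            value = getattr(self, name)
            if not isinstance(value, int) or value < 1:
                raise ValueError(f"{name} must be a positive integer, got {value!r}")

cfg = TransformConfigSketch(umap_components=5)
```

Validating in `__post_init__` means a bad value fails at construction time rather than deep inside the PCA→UMAP pipeline.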