pelinker.clustering_grid
HDBSCAN min_cluster_size grid evaluation, cross-sample aggregation, and smooth optimum selection.
AggregatedGridPoint
dataclass
One grid value of min_cluster_size with aggregated metrics across samples.
Source code in pelinker/clustering_grid.py
AggregatedGridReport
dataclass
Typed aggregation of per-sample grid metrics; points are sorted by min_cluster_size.
ScalarMetricAggregate
dataclass
Mean, dispersion, and sample count for one metric at a single grid point.
SmoothedGridOptimumResult
dataclass
Diagnostics for solve_optimal_min_cluster_size_from_aggregated.
score_mean_at_chosen / score_std_at_chosen refer to the raw objective (before
smoothing) at the chosen grid point.
aggregate_grid_metrics(all_metrics_dfs)
Aggregate grid evaluation metrics across multiple samples into a typed report.
Per min_cluster_size we keep DBCV mean, std, and count (so uncertainty is not
discarded). ICM and cluster count are aggregated as means for diagnostics.
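The aggregation described above can be sketched with a pandas group-by. This is an illustration only, not the module's implementation: the function name `aggregate_dbcv` and the output column names are hypothetical, and it returns a plain DataFrame rather than the typed report. The input column names follow the table that `evaluate_cluster_size_grid` is documented to return.

```python
import pandas as pd


def aggregate_dbcv(all_metrics_dfs):
    """Pool per-sample grid tables and summarise them per min_cluster_size.

    Hypothetical helper: keeps mean, std and count for DBCV, and only the
    mean for the diagnostic columns.
    """
    pooled = pd.concat(all_metrics_dfs, ignore_index=True)
    summary = pooled.groupby("min_cluster_size", sort=True).agg(
        dbcv_mean=("dbcv", "mean"),
        dbcv_std=("dbcv", "std"),      # sample std (ddof=1); NaN with one sample
        dbcv_count=("dbcv", "count"),  # number of non-NaN observations
        icm_mean=("icm", "mean"),
        n_clusters_mean=("n_clusters", "mean"),
    )
    return summary.reset_index()


# Two toy per-sample tables evaluated on the same grid.
sample_a = pd.DataFrame(
    {"min_cluster_size": [5, 10], "dbcv": [0.30, 0.50],
     "icm": [0.20, 0.10], "n_clusters": [12, 6]}
)
sample_b = pd.DataFrame(
    {"min_cluster_size": [5, 10], "dbcv": [0.40, 0.70],
     "icm": [0.30, 0.10], "n_clusters": [10, 8]}
)
summary = aggregate_dbcv([sample_a, sample_b])
```

Keeping the count next to the mean and standard deviation is what allows a later step to compute a standard error per grid point.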
aggregated_grid_report_to_dataframe(report)
Export the typed report as a table for use in notebooks (typed report → table), in a lossless, round-trip style.
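To show what a lossless export means in practice, here is a minimal sketch using stand-in types. `MetricSummary` and `GridPoint` are hypothetical simplifications of `ScalarMetricAggregate` and `AggregatedGridPoint`; their field names, and both helper functions, are assumptions made for this example and do not come from the module.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass(frozen=True)
class MetricSummary:  # hypothetical stand-in for ScalarMetricAggregate
    mean: float
    std: float
    n: int


@dataclass(frozen=True)
class GridPoint:  # hypothetical stand-in for AggregatedGridPoint
    min_cluster_size: int
    dbcv: MetricSummary


def points_to_dataframe(points):
    """Flatten typed points into one row per grid value, sorted by size."""
    rows = [
        {
            "min_cluster_size": p.min_cluster_size,
            "dbcv_mean": p.dbcv.mean,
            "dbcv_std": p.dbcv.std,
            "dbcv_n": p.dbcv.n,
        }
        for p in sorted(points, key=lambda p: p.min_cluster_size)
    ]
    return pd.DataFrame(rows)


def dataframe_to_points(df):
    """Rebuild the typed points from the flat table."""
    return [
        GridPoint(
            min_cluster_size=int(row.min_cluster_size),
            dbcv=MetricSummary(float(row.dbcv_mean), float(row.dbcv_std), int(row.dbcv_n)),
        )
        for row in df.itertuples(index=False)
    ]


points = [
    GridPoint(10, MetricSummary(0.60, 0.05, 4)),
    GridPoint(5, MetricSummary(0.35, 0.07, 4)),
]
table = points_to_dataframe(points)
restored = dataframe_to_points(table)
```

The export is lossless here because every field of every point maps to exactly one column, so the typed points can be rebuilt from the table.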
cosine_similarity_std(tensor, max_pairs=200000, random_seed=13)
Calculate the standard deviation of pairwise cosine similarities for a tensor of shape (n_b, dim_emb).
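A minimal NumPy sketch of this computation follows. It is not the module's code: the name `cosine_similarity_std_sketch` is hypothetical, and the subsampling strategy is an assumption (random ordered pairs drawn with replacement once the number of distinct pairs exceeds `max_pairs`). So is returning 0.0 when fewer than two rows are given.

```python
import numpy as np


def cosine_similarity_std_sketch(x, max_pairs=200_000, random_seed=13):
    """Std of pairwise cosine similarities between rows of x, shape (n_b, dim_emb)."""
    x = np.asarray(x, dtype=float)
    n = x.shape[0]
    if n < 2:
        return 0.0  # assumption: no pairs means no dispersion
    # Normalise rows so that a dot product equals the cosine similarity.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.clip(norms, 1e-12, None)
    total_pairs = n * (n - 1) // 2
    if total_pairs <= max_pairs:
        # Small input: use every distinct pair (i < j) exactly once.
        i, j = np.triu_indices(n, k=1)
    else:
        # Large input: sample random pairs with i != j to bound the cost.
        rng = np.random.default_rng(random_seed)
        i = rng.integers(0, n, size=max_pairs)
        j = rng.integers(0, n - 1, size=max_pairs)
        j = j + (j >= i)  # shift past i so the diagonal is never drawn
    sims = np.einsum("ij,ij->i", unit[i], unit[j])
    return float(sims.std())


# Three 2-d embeddings: pairwise similarities are 0, 1 and 0.
spread = cosine_similarity_std_sketch([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])
```

Capping the number of pairs keeps the cost bounded for large inputs, at the price of an estimate that depends on the seed.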
evaluate_cluster_size_grid(dfr2, umap_columns, sizes, max_pairs_per_cluster=200000)
Evaluate clustering metrics on a grid of min_cluster_size values.
Uses DBCV (Density-Based Clustering Validation) and, when entity is present,
adjusted Rand index vs. entity codes (noise label -1 excluded).
Returns:
| Type | Description |
|---|---|
| DataFrame | DataFrame with columns: min_cluster_size, icm, n_clusters, dbcv, ari |
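The ARI column compares cluster labels with entity codes after dropping noise points. The sketch below shows that masking step together with a from-scratch adjusted Rand index, so it runs with the standard library alone. The function name `ari_excluding_noise` is hypothetical, and the module may well rely on a library routine instead; the source does not say. Returning NaN when fewer than two points remain is also an assumption.

```python
from collections import Counter
from math import comb


def ari_excluding_noise(cluster_labels, entity_codes, noise_label=-1):
    """Adjusted Rand index between clusters and entities, ignoring noise points."""
    pairs = [(c, e) for c, e in zip(cluster_labels, entity_codes) if c != noise_label]
    n = len(pairs)
    if n < 2:
        return float("nan")  # assumption: undefined with fewer than two points
    # Pair counts from the contingency table and from its two margins.
    sum_joint = sum(comb(v, 2) for v in Counter(pairs).values())
    sum_clusters = sum(comb(v, 2) for v in Counter(c for c, _ in pairs).values())
    sum_entities = sum(comb(v, 2) for v in Counter(e for _, e in pairs).values())
    expected = sum_clusters * sum_entities / comb(n, 2)
    max_index = 0.5 * (sum_clusters + sum_entities)
    if max_index == expected:
        return 1.0  # both partitions are trivial and identical
    return (sum_joint - expected) / (max_index - expected)


# The noise point (-1) is dropped, leaving two clusters that match the entities.
perfect = ari_excluding_noise([0, 0, 1, 1, -1], ["a", "a", "b", "b", "a"])
# Clusters that cut across the entities score below zero.
crossed = ari_excluding_noise([0, 1, 0, 1], ["a", "a", "b", "b"])
```

Excluding the noise label keeps points that HDBSCAN left unassigned from being counted as one large spurious cluster.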
solve_optimal_min_cluster_size_from_aggregated(report, *, objective='dbcv', method='mean', uncertainty_penalty=1.0, smooth_window=3, plateau_fraction=0.92, derivative_rel_tol=0.12, precision_weighted_smooth=None)
Choose min_cluster_size from aggregated noisy grid scores.
Builds the score f(x) from objective (a single metric or pooled DBCV+ARI) and, optionally,
method (mean / lower_bound / weighted). f is then smoothed with a centered moving average,
and the solver prefers the leftmost x at which the smoothed curve is near the top of its range,
i.e. f ≥ y_min + plateau_fraction · (y_max - y_min) on the smoothed curve with
|df/dx| small. If no point qualifies, it falls back to the argmax of the smoothed curve.
precision_weighted_smooth defaults to True for lower_bound and weighted,
and False for mean.