Run scripts and CLIs
This page is the high-level map for training, serving, batch linking, and offline analysis. Detailed flags, tables, and preprocessing steps live in the repository’s run/README.md (kept next to the scripts).
Environment
From the repository root, use uv so dependencies match uv.lock:
Documentation site builds (optional):
Packaged commands
| Command | Module | Role |
|---|---|---|
| uv run pelinker-fit | pelinker.cli.fit | Corpus embedding (optional) + Linker.fit → serialized artifact (.gz). Hydra overrides; defaults in pelinker/conf/fit.yaml. |
| uv run pelinker-serves | pelinker.cli.server | FastAPI server: /health, /info, /model, /link, /link/debug. Defaults in pelinker/conf/server.yaml. |
| uv run pelinker-link-files | pelinker.cli.link_files | Batch Linker.predict on UTF-8 files or JSON documents; optional JSON report and mention-level anomaly dump for OOV workflows. |
Each command can also be invoked as a module: uv run python -m pelinker.cli.fit, and likewise for pelinker.cli.server and pelinker.cli.link_files.
Batch linking (pelinker-link-files)
- Model: -m/--model is the path to the linker dump (Linker.load rules; .gz is resolved like elsewhere).
- Threshold: --thr-score plays the same role as the server’s score threshold.
- Outputs: -o/--output writes the full prediction JSON; --dump-mention-anomaly PATH writes per-mention rows with PCA residual / Mahalanobis-style metrics (the extension selects .parquet, .csv, or .jsonl).
- Extras: --include-anomaly-metrics and --kb-validation mirror server/debug style fields on entities; --use-gpu enables CUDA when available.
Plain text files are one document per file; JSON inputs support text plus optional ground_truth hits (see --help on the module).
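Because the dump format follows the file extension, downstream code can dispatch on the suffix in the same way. The following is a minimal sketch of such a reader, not part of pelinker: the function name load_mention_dump is hypothetical, and the column names in the dump are not documented here, so the sketch returns rows as plain dictionaries without assuming any schema. The .parquet branch assumes pandas (with a Parquet engine such as pyarrow) is installed.

```python
import csv
import json
from pathlib import Path


def load_mention_dump(path):
    """Read a mention-level dump into a list of dicts, picking the reader from the suffix.

    Hypothetical helper: it only mirrors the documented rule that the file
    extension (.parquet, .csv, .jsonl) selects the output format.
    """
    path = Path(path)
    suffix = path.suffix.lower()

    if suffix == ".csv":
        # csv.DictReader yields every value as a string; cast numeric columns yourself.
        with path.open(newline="", encoding="utf-8") as fh:
            return list(csv.DictReader(fh))

    if suffix == ".jsonl":
        # One JSON object per line; blank lines are skipped.
        with path.open(encoding="utf-8") as fh:
            return [json.loads(line) for line in fh if line.strip()]

    if suffix == ".parquet":
        # Optional dependency, imported lazily so CSV/JSONL work without it.
        import pandas as pd

        return pd.read_parquet(path).to_dict(orient="records")

    raise ValueError(f"unsupported dump extension: {suffix!r}")
```

Returning plain dictionaries keeps the sketch independent of the actual column set; for real analysis, loading into a DataFrame is likely more convenient.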
OOV and anomaly figures
- Fit a model and retain the clustering report from training (see fit reporting / clustering_quality checkpoints in code).
- Run pelinker-link-files with --dump-mention-anomaly to produce an OOV-oriented mention table.
- Run run/analysis/oov_analysis.py with --fit-report, --oov-csv, and --out-dir to generate PDF figures (marginals, ROC/PR, decision boundary sweeps, alignment with the negative screener). The script docstring lists the full argument set.
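The last two steps share one file: the CSV written by --dump-mention-anomaly is the one passed to --oov-csv. A small sketch that assembles both command lines makes that hand-off explicit. It only prints the commands; all paths are hypothetical placeholders, the fit report file name and format are not specified on this page, and passing input documents as trailing positional arguments is an assumption to check against --help.

```python
import shlex

# Hypothetical paths; none of these names come from the repository.
MODEL = "artifacts/linker.gz"
FIT_REPORT = "artifacts/fit_report.json"
INPUTS = ["corpus/doc_a.txt", "corpus/doc_b.txt"]
OOV_CSV = "out/mention_anomaly.csv"  # the .csv suffix selects CSV output
FIG_DIR = "out/figures"


def link_files_cmd(model, inputs, dump_path):
    """Step 2: batch linking that also writes the mention-level anomaly dump."""
    cmd = [
        "uv", "run", "pelinker-link-files",
        "--model", model,
        "--dump-mention-anomaly", dump_path,
    ]
    # Assumption: input files are given as trailing positional arguments.
    cmd += list(inputs)
    return cmd


def oov_analysis_cmd(fit_report, oov_csv, out_dir):
    """Step 3: figure generation from the fit report and the OOV dump."""
    return [
        "uv", "run", "python", "run/analysis/oov_analysis.py",
        "--fit-report", fit_report,
        "--oov-csv", oov_csv,
        "--out-dir", out_dir,
    ]


# Print shell-quoted commands instead of executing them.
for cmd in (
    link_files_cmd(MODEL, INPUTS, OOV_CSV),
    oov_analysis_cmd(FIT_REPORT, OOV_CSV, FIG_DIR),
):
    print(shlex.join(cmd))
```

To execute rather than print, each list could be handed to subprocess.run(cmd, check=True) from the repository root.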
run/ directory (scripts)
| Area | Contents |
|---|---|
| Root | embed_kb_corpus.py, test_server.py, loop.embed.kb.corpus.sh, loop.fit.sh |
| preprocessing/ | GO / RO property extraction and merge → synthesis KB CSVs |
| analysis/ | clustering_quality.py (embedding grid + metrics), select_diverse_entities.py, oov_analysis.py (fit report + OOV dump → figures), replot_dbcv_ari_scatter.py (PNG from an existing results_grid_per_sample.csv) |
| obsolete/ | Deprecated experiments (not maintained) |
Always invoke scripts with uv run python … (see project rules) so the locked environment is used.
See also
- Vector representations: encoder + spaCy window path (texts_to_vrep).
- API Reference: generated module pages (pelinker.model, pelinker.analysis, …).