pelinker.util¶
build_expression_container(cm, expression_lists_per_chunk, word_grouping)
¶
Merge per-chunk expressions and embedding rows into one holder per document.
For each document index, concatenates embedding tensors for all its chunks (in chunk order) and concatenates expression lists the same way, so downstream code can match lemmas and spans at document level.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cm` | `ChunkMapper` | Chunk mapper with … | required |
| `expression_lists_per_chunk` | `list[list[Expression]]` | Parallel to … | required |
| `word_grouping` | `WordGrouping` | Which `WordGrouping` … | required |
Returns:

| Name | Type | Description |
|---|---|---|
| | `ExpressionHolderBatch` | An `ExpressionHolderBatch` … |
Source code in pelinker/util.py
embed_texts(phrases, tokenizer, model, layers, nlp)
¶
Embed a list of text phrases using texts_to_vrep.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `phrases` | `list[str]` | List of text phrases to embed | required |
| `tokenizer` | | Tokenizer for the model | required |
| `model` | | Model for embedding | required |
| `layers` | `str \| list[int]` | Layer specification | required |
| `nlp` | `Language` | spaCy pipeline | required |
Returns:

| Type | Description |
|---|---|
| `list[Tensor]` | List of embedding tensors, one per phrase |
Source code in pelinker/util.py
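A minimal usage sketch (illustrative only): the checkpoint and spaCy pipeline names below are assumptions, not requirements of pelinker; any BERT-style encoder with hidden states and any loaded spaCy `Language` should fit the signature above.

```python
import spacy
from transformers import AutoModel, AutoTokenizer

from pelinker.util import embed_texts

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")
nlp = spacy.load("en_core_web_sm")                              # illustrative pipeline

phrases = ["protein kinase activity", "regulation of transcription"]
# layers="12" selects the last two hidden layers ([-2, -1]); see normalize_layers_spec.
vectors = embed_texts(phrases, tokenizer, model, layers="12", nlp=nlp)
print(len(vectors))  # one embedding tensor per phrase
```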
expand_config_path(path)
¶
Expand environment variables and ~ in config or CLI path strings.
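A small usage sketch; the environment variable and paths are illustrative.

```python
import os

from pelinker.util import expand_config_path

os.environ["PELINKER_DATA"] = "/data/pelinker"        # illustrative variable
print(expand_config_path("$PELINKER_DATA/models"))    # e.g. /data/pelinker/models
print(expand_config_path("~/pelinker/config.yaml"))   # e.g. /home/<user>/pelinker/config.yaml
```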
extract_and_embed_mentions(entities, data, pmids, tokenizer, model, nlp, layers, batch_size, word_modes=(WordGrouping.W1, WordGrouping.W2, WordGrouping.W3), negatives_per_positive=0.0, negative_label=NEGATIVE_LABEL, random_seed=None, negative_random_state=None, on_encoder_batch=None)
¶
Returns a list of dicts instead of a DataFrame, for better memory management and consistent schema handling.
Negative rows are sampled with `numpy.random.RandomState`. Pass
`negative_random_state` to reuse one RNG across several calls (e.g. successive
read buffers); otherwise `random_seed` builds a fresh `RandomState` per call.
If `on_encoder_batch` is set, it is invoked after each encoder mini-batch with
`(batch_index_0based, n_batches, n_mention_rows_accumulated)`.
Source code in pelinker/util.py
extract_ordered_mention_tensors(report_batch, *, keep=None)
¶
Pool rows for each expression in W1→W2→W3 order, optionally filtering expressions.
Source code in pelinker/util.py
get_word_boundaries(text)
¶
Return word boundaries in `text`.
Source code in pelinker/util.py
keep_expression_for_prediction(expr)
¶
Whether to keep a sliding-window mention for `pelinker.model.Linker.predict`.
Drops any window that contains punctuation (spaCy `pos_ == "PUNCT"`). Drops
windows whose tokens are all stop words; keeps windows that mix content and
function words (e.g. "type of").
Source code in pelinker/util.py
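A standalone sketch of the documented rules, written against plain spaCy spans rather than the library's `Expression` type; the helper name and pipeline below are illustrative.

```python
import spacy

def keep_window(tokens) -> bool:
    """Sketch of the documented filter: no punctuation, not all stop words."""
    if any(t.pos_ == "PUNCT" for t in tokens):
        return False
    if all(t.is_stop for t in tokens):
        return False
    return True

nlp = spacy.load("en_core_web_sm")
doc = nlp("a type of kinase.")
print(keep_window(doc[1:3]))  # "type of"  -> True  (mixes content and function words)
print(keep_window(doc[2:3]))  # "of"       -> False (all stop words)
print(keep_window(doc[3:5]))  # "kinase ." -> False (contains punctuation)
```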
map_spans_to_spans_basic(words_boundaries, token_boundaries)
¶
Map each word character span to tokenizer subword indices.
Both spans use half-open intervals [start, end) in character offsets. A
subword token belongs to a word span iff the two intervals have positive overlap.
Word spans may overlap (sliding W2/W3 windows). A single forward pointer over tokens is incorrect in that case; each word span is matched independently.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `words_boundaries` | | Sequence of … | required |
| `token_boundaries` | | Sequence of … | required |
Returns:

| Type | Description |
|---|---|
| `dict[tuple[int, int], list[int]]` | Dict mapping each word character span to the list of overlapping subword token indices. |
Source code in pelinker/util.py
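A self-contained sketch of the overlap rule described above (half-open character intervals, each word span matched independently); the function name and offsets are illustrative, not the library implementation.

```python
def word_spans_to_token_indices(word_spans, token_spans):
    """Collect, for each word span, the subword token indices with positive overlap."""
    return {
        (ws, we): [
            j for j, (ts, te) in enumerate(token_spans)
            if min(we, te) > max(ws, ts)  # positive overlap of [start, end) intervals
        ]
        for ws, we in word_spans
    }

# "New York" with subwords ["New", "Yo", "##rk"]; character offsets are illustrative.
word_spans = [(0, 3), (4, 8), (0, 8)]   # "New", "York", and the sliding W2 window "New York"
token_spans = [(0, 3), (4, 6), (6, 8)]
print(word_spans_to_token_indices(word_spans, token_spans))
# {(0, 3): [0], (4, 8): [1, 2], (0, 8): [0, 1, 2]}
```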
map_words_to_tokens(text_token_spans, text_word_spans)
¶
Given text token and word spans,
words : [...(start_pos_i, end_pos_i)...]
tokens : [...(start_pos_i, end_pos_i)...]
we define word -> token spans, i.e. a mapping of which tokens belong to which word group:
if there is a positive overlap between word i and token j spans, token j is considered to belong to word i's group.
Return a list of word -> token bounds : [...(start_token_i, end_token_i)...] and a refreshed list of word spans.
Source code in pelinker/util.py
map_words_to_tokens_list(text_token_spans_list, text_word_spans_list)
¶
Take a batch of token spans and a batch of word spans;
return a batch of word-to-token maps and an updated batch of word spans (words that cannot be mapped to tokens are excluded).
Source code in pelinker/util.py
normalize_layers_spec(layers_spec, *, n_hidden_states=None)
¶
Parse and validate indices for the stacked hidden_states tensor.
String form follows the same convention as the historical `str2layers`: each
digit is a distinct layer counted from the end, e.g. `"1"` → `[-1]`,
`"12"` → `[-2, -1]`. Commas in the string are ignored.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `layers_spec` | `str \| list[int]` | Digit-only string or list of negative indices (HF convention). | required |
| `n_hidden_states` | `int \| None` | If set (length of the first dim of the stacked hidden states), indices must satisfy … | None |
Returns:

| Type | Description |
|---|---|
| `list[int]` | Sorted unique negative layer indices. |
Raises:

| Type | Description |
|---|---|
| `ValueError` | Empty spec, positive indices, … |
Source code in pelinker/util.py
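A self-contained sketch of the string convention described above (digits count from the last layer, commas ignored, result sorted and de-duplicated); the real function additionally validates against `n_hidden_states`.

```python
def parse_layers(layers_spec):
    """Sketch of the documented parsing rules for a digit-only layer spec."""
    if isinstance(layers_spec, str):
        digits = [c for c in layers_spec if c != ","]
        if not digits or not all(c.isdigit() for c in digits):
            raise ValueError(f"invalid layer spec: {layers_spec!r}")
        layers = [-int(c) for c in digits]
    else:
        layers = list(layers_spec)
    if any(i >= 0 for i in layers):
        raise ValueError("layer indices must be negative (HF convention)")
    return sorted(set(layers))

print(parse_layers("1"))       # [-1]
print(parse_layers("1,2,4"))   # [-4, -2, -1]
print(parse_layers([-3, -1]))  # [-3, -1]
```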
process_text(batched_texts, tokenizer, model, *, keep_hidden_states_on_device=False)
¶
Encode all text chunks in one forward pass and build a `pelinker.onto.ChunkMapper`.
Flattens `batched_texts` (each inner list is one logical document split into
length-limited chunks), runs `text_to_tokens_embeddings` once, and packages the
tensors with bookkeeping to map chunk indices back to (document_index,
chunk_index_within_document) and cumulative character offsets within each document.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `batched_texts` | `list[list[str]]` | … | required |
| `tokenizer` | | Hugging Face tokenizer for … | required |
| `model` | | Hugging Face model used with hidden states. | required |
| `keep_hidden_states_on_device` | `bool` | If True, leave activations on the model device (see `text_to_tokens_embeddings`). | False |
Returns:

| Name | Type | Description |
|---|---|---|
| | `ChunkMapper` | A `ChunkMapper` … chunks in flattened order (same order as …). |
Source code in pelinker/util.py
render_elementary_tensor_table(chunk_mapper, text_word_spans, layers)
¶
Map spaCy word spans to tokenizer tokens and fill `chunk_mapper.tt_expressions`.
Updates `chunk_mapper` in place: converts character-level word spans to token
index ranges, builds the global mapping table, then computes per-span embedding
rows via `tt_normalize` and stores them in `chunk_mapper.tt_expressions`
(one tensor per chunk, rows aligned with surviving word spans).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `chunk_mapper` | `ChunkMapper` | Mapper with encoder hidden states already in … | required |
| `text_word_spans` | `list[list[tuple[int, int]]]` | For each chunk, a list of … | required |
| `layers` | `list[int]` | Layer indices forwarded to `tt_normalize`. | required |
Returns:

| Type | Description |
|---|---|
| `None` | None; mutates `chunk_mapper` in place. |
Source code in pelinker/util.py
split_text_into_batches(text, max_length)
¶
Split a single string into chunks no longer than max_length characters.
Uses a regex that prefers breaking after whitespace near the limit; a chunk may
reach max_length when no earlier break exists.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Full document or segment to split. | required |
| `max_length` | `int` | Maximum characters per chunk (typically tokenizer/model limit). | required |
Returns:

| Type | Description |
|---|---|
| `list[str]` | Non-empty string segments whose concatenation recovers `text` (… regex edge cases on pathological input). |
Source code in pelinker/util.py
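A self-contained sketch of the described behaviour (prefer a break right after whitespace near the limit, otherwise cut at `max_length`); the library uses a regex, this sketch a plain loop.

```python
def split_on_whitespace(text, max_length):
    """Greedy character chunking that prefers to break right after a space."""
    chunks = []
    while len(text) > max_length:
        cut = text.rfind(" ", 1, max_length)
        cut = cut + 1 if cut != -1 else max_length  # keep the space with the left chunk
        chunks.append(text[:cut])
        text = text[cut:]
    if text:
        chunks.append(text)
    return chunks

parts = split_on_whitespace("alpha beta gamma delta", max_length=11)
print(parts)           # ['alpha beta ', 'gamma delta']
print("".join(parts))  # concatenation recovers the original string
```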
split_text_into_token_budget(text, tokenizer, max_tokens)
¶
Split text so each segment encodes to at most max_tokens subword tokens.
Uses a longest-prefix binary search per segment (by character offset), then prefers breaking at the last space still within the token budget. Avoids relying on a fixed character cap, which can exceed the model's tokenizer limit.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Full document string for one logical chunking pass. | required |
| `tokenizer` | | Hugging Face tokenizer (…) | required |
| `max_tokens` | `int` | Maximum subword count per segment (typically …) | required |
Returns:

| Type | Description |
|---|---|
| `list[str]` | Segments whose concatenation equals `text` exactly (no dropped characters). |
Raises:

| Type | Description |
|---|---|
| `ValueError` | If … |
Source code in pelinker/util.py
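A self-contained sketch of the described strategy, assuming a Hugging Face tokenizer: binary-search the longest character prefix that still encodes to at most `max_tokens` subwords, then prefer the last space inside that prefix. The helper name and checkpoint are illustrative.

```python
from transformers import AutoTokenizer

def split_by_token_budget(text, tokenizer, max_tokens):
    """Sketch: longest prefix under the token budget, preferring a space break."""
    def n_tokens(s):
        return len(tokenizer(s, add_special_tokens=False)["input_ids"])

    segments = []
    while text:
        lo, hi = 1, len(text)
        while lo < hi:                          # binary search on character offset
            mid = (lo + hi + 1) // 2
            if n_tokens(text[:mid]) <= max_tokens:
                lo = mid
            else:
                hi = mid - 1
        cut = lo
        if cut < len(text):                     # prefer breaking at the last space
            space = text.rfind(" ", 1, cut)
            if space != -1:
                cut = space + 1
        segments.append(text[:cut])
        text = text[cut:]
    return segments

tok = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
sample = "Long documents are split into token-bounded segments."
parts = split_by_token_budget(sample, tok, 6)
print(parts, "".join(parts) == sample)
```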
str2layers(layers_spec)
¶
Parse layer specification; same rules as `normalize_layers_spec`.
text_to_tokens_embeddings(texts, tokenizer, model, *, keep_hidden_states_on_device=False)
¶
Run the transformer encoder and return hidden states plus character spans per token.
Encodes texts without special tokens, pads to a batch, runs model with
output_hidden_states=True, and zeroes padded positions using the attention mask.
By default hidden states are moved to CPU; set keep_hidden_states_on_device=True
to keep them on the model device (lower host RAM, higher GPU memory use).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `list[str]` | Batch of strings to encode (one chunk per row after batching). | required |
| `tokenizer` | | A Hugging Face … | required |
| `model` | | A Hugging Face … | required |
| `keep_hidden_states_on_device` | `bool` | If False (default), stack hidden states on CPU. | False |
Returns:

| Type | Description |
|---|---|
| | Tuple of stacked hidden states and, per input text, lists of character span pairs from the tokenizer. |
Source code in pelinker/util.py
texts_to_vrep(texts, tokenizer, model, layers_spec, word_modes, nlp, max_length=MAX_LENGTH, *, chunk_by_token_budget=True, keep_hidden_states_on_device=False)
¶
Turn raw texts into encoder-based vector representations for sliding word windows.
Pipeline (high level):

- Split each document into chunks (token-budget by default; see `chunk_by_token_budget`).
- Encode all chunks in one transformer forward pass (`process_text`).
- For each chunk, tokenize with spaCy once (`text_to_tokens`) and build sliding windows of `w` tokens per `pelinker.onto.WordGrouping` (`token_list_with_window`).
- Map word character spans to tokenizer token ranges and pool layer activations (`render_elementary_tensor_table` → `tt_normalize`).
- Drop expressions whose start character was lost when mapping words to tokens, then merge chunks per document (`build_expression_container`).

The same encoder activations are reused for every `word_modes` entry; only the
spaCy windows and pooling targets differ. Each pass over `word_modes` calls
`render_elementary_tensor_table`, so fields on `chunk_mapper` such as
`tt_expressions` and `text_word_spans_list` reflect only the last grouping;
use the `pelinker.onto.ReportBatch` slots for per-mode tensors.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `texts` | `list[str]` | One string per document (logical item in the returned batch). | required |
| `tokenizer` | | Hugging Face tokenizer for … | required |
| `model` | | Transformer model with hidden states (not sentence-transformers …). | required |
| `layers_spec` | `str \| list[int]` | Layer selection; string digits (…). | required |
| `word_modes` | `list[WordGrouping]` | For each mode, build … | required |
| `nlp` | `Language` | Loaded spaCy pipeline (…). | required |
| `max_length` | `int` | When … | MAX_LENGTH |
| `chunk_by_token_budget` | `bool` | If True (default), split with `split_text_into_token_budget`; … | True |
| `keep_hidden_states_on_device` | `bool` | If True, keep stacked hidden states on the model device (saves host RAM; requires GPU memory for large batches). | False |
Returns:

| Type | Description |
|---|---|
| `ReportBatch` | … expression holders with pooled embeddings. |
Source code in pelinker/util.py
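An end-to-end usage sketch; the checkpoint and pipeline names are illustrative, and only the documented parameters are passed.

```python
import spacy
from transformers import AutoModel, AutoTokenizer

from pelinker.onto import WordGrouping
from pelinker.util import texts_to_vrep

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # illustrative checkpoint
model = AutoModel.from_pretrained("bert-base-uncased")
nlp = spacy.load("en_core_web_sm")

report_batch = texts_to_vrep(
    ["Insulin regulates glucose uptake in muscle tissue."],
    tokenizer,
    model,
    layers_spec="12",                                # last two hidden layers
    word_modes=[WordGrouping.W1, WordGrouping.W2],   # 1- and 2-word sliding windows
    nlp=nlp,
)
```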
token_list_with_window(tokens, window, itext=None, ichunk=None)
¶
Build every contiguous `window`-token slice as a `pelinker.onto.Expression`.
Each expression stores the participating `pelinker.onto.SimplifiedToken`
objects and, after `__post_init__`, character bounds `a`/`b` for the span.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `tokens` | `list[SimplifiedToken]` | spaCy-derived tokens for one chunk. | required |
| `window` | `WordGrouping` | Window size via … | required |
| `itext` | | Document index in the outer batch (optional metadata on expressions). | None |
| `ichunk` | | Chunk index within the document (optional metadata). | None |
Returns:

| Type | Description |
|---|---|
| `list[Expression]` | Length … |
Source code in pelinker/util.py
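A self-contained sketch of the sliding-window idea over a plain list; the real function wraps each slice in an `Expression` carrying `SimplifiedToken`s and character bounds.

```python
def sliding_windows(tokens, window):
    """Every contiguous slice of `window` tokens, in order."""
    return [tokens[i:i + window] for i in range(len(tokens) - window + 1)]

tokens = ["insulin", "receptor", "binding", "activity"]
print(sliding_windows(tokens, 2))
# [['insulin', 'receptor'], ['receptor', 'binding'], ['binding', 'activity']]
```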
tt_aggregate_normalize(tt, ls)
¶
Aggregate `tt` (incoming dims `n_layers x nb x n_tokens x n_emb`) over the layers given by `ls`; returns a tensor of shape `nb x n_emb`.
Source code in pelinker/util.py
tt_normalize(cm, layers)
¶
Average selected layers, then average token vectors per word span to form word vectors.
Indexes cm.tensor with layers (layer indices on the first dimension),
averages across those layers and batch/time to get per-token embeddings, then for
each chunk takes contiguous token ranges from cm.token_word_spans_list and
averages those token vectors (mean pooling) to produce one vector per word span.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `cm` | `ChunkMapper` | Chunk mapper with … | required |
| `layers` | | Layer indices (typically negative indices into …). | required |
Returns:

| Type | Description |
|---|---|
| `list[list[Tensor]]` | Outer list is per chunk; inner list has one tensor per word span in that chunk, in order. |
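A self-contained sketch of the pooling described above (average the selected layers, then mean-pool token vectors over each contiguous token range); tensor shapes and names are illustrative.

```python
import torch

def pool_word_spans(hidden_states, layers, token_word_spans):
    """hidden_states: (n_layers, n_tokens, n_emb); spans: [(start_tok, end_tok), ...]."""
    per_token = hidden_states[layers].mean(dim=0)       # average the selected layers
    return [per_token[a:b].mean(dim=0) for a, b in token_word_spans]

hidden_states = torch.randn(13, 6, 8)                   # e.g. embeddings + 12 layers, 6 tokens
vectors = pool_word_spans(hidden_states, [-2, -1], [(0, 2), (2, 5)])
print([tuple(v.shape) for v in vectors])                # [(8,), (8,)]
```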