ontocast.tool.agg.normalizer
Entity normalization for disambiguation.
This module handles the preparation of entities for embedding-based disambiguation. It creates normalized string representations r(e) that include:

- Normalized form of the entity URI
- Semantic neighbors (types, properties)
EntityNormalizer
Normalizes entities and creates string representations for embedding.
This class is responsible for transforming entity URIs into normalized string representations that can be embedded and compared.
Source code in ontocast/tool/agg/normalizer.py
__init__(ontology_namespaces=None)
Initialize the entity normalizer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ontology_namespaces` | `set[str] \| None` | Set of namespace URIs that identify ontology entities. Entities from these namespaces are preferred as representatives. | `None` |
create_representation(entity, graph)
Create a normalized representation r(e) for an entity.
This combines the normalized form with semantic neighbors to create a rich representation suitable for embedding.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `entity` | `URIRef` | Entity URI | required |
| `graph` | `RDFGraph` | RDF graph containing the entity | required |
Returns:
| Type | Description |
|---|---|
| `EntityRepresentation` | EntityRepresentation containing r(e) and metadata |
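
As a rough illustration of how a normalized form and its semantic neighbors might be combined into one string r(e), here is a minimal sketch. The function name `build_representation_sketch`, the `" | "` separator, and the `types:` / `properties:` prefixes are all assumptions made for this example; the actual layout of r(e) produced by `create_representation` is not specified on this page.

```python
from __future__ import annotations


def build_representation_sketch(
    normal_form: str,
    type_names: list[str],
    property_names: list[str],
) -> str:
    """Join a normalized name with its neighbors (hypothetical layout)."""
    parts = [normal_form]
    # Sorting makes the string independent of the order in which
    # neighbors were read from the graph.
    if type_names:
        parts.append("types: " + ", ".join(sorted(type_names)))
    if property_names:
        parts.append("properties: " + ", ".join(sorted(property_names)))
    return " | ".join(parts)
```

Keeping the neighbors in a stable order is one way to ensure that the same entity always yields the same text, and hence the same embedding input.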
create_representations_batch(entities, graphs)
Create representations for multiple entities.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `entities` | `list[URIRef]` | List of entity URIs | required |
| `graphs` | `dict[URIRef, RDFGraph]` | Mapping from entity to its source graph | required |
Returns:
| Type | Description |
|---|---|
| `dict[URIRef, EntityRepresentation]` | Dictionary mapping entity URIs to their representations |
extract_entity_context(entity, graph)
Extract semantic context for an entity from the graph.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `entity` | `URIRef` | Entity to extract context for | required |
| `graph` | `RDFGraph` | RDF graph containing the entity | required |
Returns:
| Type | Description |
|---|---|
| `tuple[list[URIRef], list[URIRef], list[str]]` | Tuple of (types, properties, labels) |
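
To make the shape of the returned tuple concrete, the sketch below collects types, properties, and labels for one subject from a list of plain `(subject, predicate, object)` string triples. It is not the library's implementation: it uses plain strings instead of `URIRef` and `RDFGraph` so that it runs without dependencies, and it assumes that types come from `rdf:type`, labels from `rdfs:label`, and properties from every other predicate on the entity as subject.

```python
from __future__ import annotations

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def extract_context_sketch(
    entity: str,
    triples: list[tuple[str, str, str]],
) -> tuple[list[str], list[str], list[str]]:
    """Return (types, properties, labels) for `entity` (assumed semantics)."""
    types: list[str] = []
    properties: list[str] = []
    labels: list[str] = []
    for subject, predicate, obj in triples:
        if subject != entity:
            continue  # only triples where the entity is the subject
        if predicate == RDF_TYPE:
            types.append(obj)
        elif predicate == RDFS_LABEL:
            labels.append(obj)
        elif predicate not in properties:
            properties.append(predicate)  # record each property once
    return types, properties, labels
```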
is_ontology_entity(entity)
Check if an entity belongs to an ontology namespace.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `entity` | `URIRef` | Entity URI to check | required |
Returns:
| Type | Description |
|---|---|
| `bool` | True if entity is from an ontology namespace |
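
One plausible reading of "belongs to an ontology namespace" is a prefix test against the configured namespace URIs. The sketch below assumes exactly that; the standalone function `is_ontology_entity_sketch` and its explicit `ontology_namespaces` argument are illustrative stand-ins for the method and the value passed to `__init__`.

```python
from __future__ import annotations


def is_ontology_entity_sketch(
    entity: str,
    ontology_namespaces: set[str] | None = None,
) -> bool:
    """Prefix check against known namespaces (assumed behavior)."""
    # Treat a missing namespace set as "no ontology namespaces configured".
    namespaces = ontology_namespaces or set()
    return any(entity.startswith(ns) for ns in namespaces)
```

With this reading, a normalizer created without namespaces would never flag an entity as an ontology entity.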
normalize_string(text)
Normalize a string: lowercase, remove diacritics, clean special chars.
CamelCase is split so that it yields the same logical tokens as snake_case (e.g. 'PLRedShift' -> 'pl red shift').
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `text` | `str` | Input string to normalize | required |
Returns:
| Type | Description |
|---|---|
| `str` | Normalized string suitable for comparison |
Examples:
- `'PLRedShift'` -> `'pl red shift'`
- `'PL_red_shift_value'` -> `'pl red shift value'`
- `'Café'` -> `'cafe'`
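
The documented behavior (lowercasing, diacritic removal, CamelCase splitting, special-character cleanup) can be approximated with the standard library alone. The following is a minimal sketch that reproduces the three documented examples; the name `normalize_string_sketch` and the specific regular expressions are assumptions, and the real method may treat edge cases differently.

```python
import re
import unicodedata


def normalize_string_sketch(text: str) -> str:
    """Approximate the documented normalization (not the library's code)."""
    # Decompose accented characters and drop the combining marks (é -> e).
    decomposed = unicodedata.normalize("NFKD", text)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Split an acronym from a following capitalized word: "PLRed" -> "PL Red".
    plain = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", plain)
    # Split a lowercase letter or digit from a following capital: "dS" -> "d S".
    plain = re.sub(r"([a-z0-9])([A-Z])", r"\1 \2", plain)
    # Replace every run of non-alphanumeric characters with a single space.
    plain = re.sub(r"[^A-Za-z0-9]+", " ", plain)
    # Lowercase and collapse any leftover whitespace.
    return " ".join(plain.lower().split())
```

Splitting CamelCase before lowercasing matters here, because the case boundaries are the only signal that distinguishes the tokens.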
normalize_uri(uri)
Extract and normalize the local part of a URI.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `uri` | `URIRef` | URI to normalize | required |
Returns:
| Type | Description |
|---|---|
| `str` | Normalized local name |
Examples:
- `'http://example.org/PLRedShift'` -> `'pl red shift'`
- `'http://example.org/PL_red_shift_value'` -> `'pl red shift value'`
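
The first half of this operation, extracting the local part, might look like the sketch below. It assumes the local name is whatever follows the last `#` or `/` in the URI, which matches the examples above but is not confirmed as the library's rule; `local_name_sketch` is a hypothetical helper, and its result would still need to go through string normalization to yield the documented output.

```python
def local_name_sketch(uri: str) -> str:
    """Return the text after the last '#' or '/' (assumed splitting rule)."""
    # rfind returns -1 when the character is absent, so a URI with
    # neither separator is returned unchanged.
    cut = max(uri.rfind("#"), uri.rfind("/"))
    return uri[cut + 1:]
```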
EntityRepresentation
dataclass
Normalized representation of an entity for embedding.
Attributes:
| Name | Type | Description |
|---|---|---|
| `entity` | `URIRef` | Original entity URI |
| `normal_form` | `str` | Normalized string (lowercase, no diacritics, etc.) |
| `types` | `list[URIRef]` | List of type URIs for this entity |
| `properties` | `list[URIRef]` | List of property URIs used with this entity |
| `labels` | `list[str]` | List of labels found for this entity |
| `representation` | `str` | Combined string representation r(e) for embedding |
| `is_ontology_entity` | `bool` | Whether this entity is from an ontology namespace |
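
For orientation, a dataclass with the attributes listed above could be declared as in the following sketch. The class name `EntityRepresentationSketch` is hypothetical, URIs are typed as plain `str` instead of `URIRef` to keep the example dependency-free, and the absence of default values is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class EntityRepresentationSketch:
    """Mirror of the documented attributes (illustrative only)."""

    entity: str               # original entity URI
    normal_form: str          # normalized local name
    types: list[str]          # type URIs
    properties: list[str]     # property URIs used with the entity
    labels: list[str]         # labels found for the entity
    representation: str       # combined string r(e) for embedding
    is_ontology_entity: bool  # True if from an ontology namespace
```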