Core components¶
Reference for logical schema pieces, ingestion runtime, actors, and transforms.
Schema¶
The Schema is the single source of truth for the labeled property graph (LPG) structure. It encapsulates:
- Vertex and edge definitions with optional type information
- Identity and physical index configurations
- DB profile defaults and DB-aware projection settings
- Automatic schema inference from normalized PostgreSQL databases (3NF with PK/FK) or from OWL/RDFS ontologies
IngestionModel¶
The IngestionModel is the source of truth for ingestion runtime behavior. It encapsulates:
- Resource mappings and actor pipelines
- Reusable named transforms
- Runtime initialization against the core schema (`finish_init(schema.core_schema)`)
Manifest-level sanitization¶
When targeting stricter engines (notably TigerGraph), identifier normalization is handled at the manifest boundary:
- The implementation lives in `graflo.architecture.evolution` as `SanitizeOp`/`apply_sanitize` (reserved words, `DatabaseProfile` storage names, TigerGraph per-relation identity alignment, and coordinated ingestion rewrites).
- `Sanitizer.sanitize_manifest(manifest)` is the ergonomic wrapper: it builds the evolution op list for the configured `DBType` and applies it in place (same public API as before).
- `GraphEngine.infer_manifest(...)` runs `Sanitizer` on the assembled `GraphManifest` before returning, so PostgreSQL inference through the engine stays target-flavor-safe.
- `SQLInferenceManager` (`infer_artifacts`, `infer_complete_schema`, …) does not sanitize; it keeps source column names in resources so you can compose a manifest and then call `Sanitizer` once at the boundary (or rely on `infer_manifest` when using `GraphEngine`).
Manifest/schema renaming¶
When you need to rename vertex types, edge relations, or ingestion resource names in bulk, use the built-in rename APIs:
- `GraphManifest.rename_entities(vertices=..., edges=..., resources=...)`
- `Schema.rename_entities(vertices=..., edges=...)`
Each rename argument accepts either:
- a mapping (`dict[str, str]`) for explicit substitutions, or
- a callable (`Callable[[str], str]`) for programmatic transforms (prefix/suffix/camelize, etc.).
```python
from graflo.architecture.contract import GraphManifest
from graflo.util.transform import camel_to_snake

manifest = GraphManifest.from_dict(payload)
renamed = manifest.rename_entities(
    vertices={"Person": "author", "Organization": "institution"},
    edges=lambda relation: f"rel_{camel_to_snake(relation)}",
    resources=lambda name: f"src_{name}",
)
```
GraphManifest.rename_entities(...) updates all relevant references consistently:
- schema vertex names + edge endpoints/relations
- ingestion pipelines (`vertex`, `edge`/`create_edge`, nested `descend`, router mappings)
- resource edge selectors (`infer_edge_only`/`infer_edge_except`) and `extra_weights`
- bindings resource references (`connectors[].resource_name`, `resource_connector[].resource`)
This API is meant for deterministic contract refactors and complements (not replaces) DB-specific sanitization.
Vertex¶
A Vertex describes a vertex type and its logical identity. It supports:
- Single or compound identity fields (e.g., `["first_name", "last_name"]` instead of `"full_name"`)
- Property definitions with optional type information
  - Fields can be specified as strings (backward compatible) or typed `Field` objects
  - Supported types: `INT`, `FLOAT`, `BOOL`, `STRING`, `DATETIME`
  - Type information enables better validation and database-specific optimizations
- Duplicate property declarations are normalized by field name
  - Duplicates with the same type merge into one field
  - If one duplicate is typed and the other is untyped, the typed definition wins
  - Conflicting non-null types for the same field name are rejected
- Filtering conditions
- Optional blank vertex configuration
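The duplicate-normalization rules can be sketched as a small merge function. This is an illustration of the rules stated above, not graflo's actual implementation; `normalize_props` and the `(name, type)` tuple representation are hypothetical.

```python
def normalize_props(props):
    """Merge duplicate (name, type) property declarations by field name.

    Rules per the docs: same-type duplicates merge into one field, a typed
    declaration wins over an untyped (None) one, and two conflicting
    non-null types for the same name are rejected.
    """
    merged = {}
    for name, ftype in props:
        if name not in merged:
            merged[name] = ftype
        elif merged[name] is None:
            merged[name] = ftype  # typed definition wins over untyped
        elif ftype is not None and ftype != merged[name]:
            raise ValueError(f"conflicting types for field {name!r}")
        # otherwise: same type again, or a new untyped duplicate -> keep
    return merged

# Untyped then typed "age" declarations collapse to the typed one:
print(normalize_props([("age", None), ("age", "INT"), ("name", "STRING")]))
# → {'age': 'INT', 'name': 'STRING'}
```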
Identity defaults are strict at the schema level:
- `VertexConfig.identity_from_all_properties: true` enables the compatibility fallback: a vertex with no explicit identity uses all of its property names as identity.
- `VertexConfig.identity_from_all_properties: false` (the default) disables that fallback, so each vertex must declare its identity fields explicitly.
Edge¶
An Edge describes an edge type and its logical identities. It allows:
- Optional uniqueness semantics through `identities` (multiple candidate keys are allowed)
- `properties`: relationship payload (names and optional types), same accepted forms as vertex properties (strings, `Field`, or dicts with at least `name`)
- Optional static `relation` label (e.g. a Neo4j relationship type) when it is not derived at ingest time
Ingestion-only controls (`relation_field`, `relation_from_key`, `match_source`, `match_target`, vertex-sourced edge payload) live on `EdgeActor` steps and `EdgeDerivation`, not on the logical Edge model.
Edge properties and configuration¶
Basic logical fields¶
- `source`: Source vertex name (required)
- `target`: Target vertex name (required)
- `identities`: Logical identity keys for the edge (each key can induce uniqueness)
- `properties`: Declared relationship attributes (typed or untyped)
Neo4j, Memgraph, FalkorDB — relationship MERGE keys: Writers match source and target nodes on vertex identity, then MERGE the relationship. Which relationship properties participate in that MERGE (so multiple edges between the same two vertices do not collapse) is derived as follows: use the first identities key, keep only tokens that refer to relationship payload (skip source and target; the relation token becomes the relation property on the relationship where used). If that produces no fields—e.g. identities is empty—the writer falls back to all names in Edge.properties. Declare identities when the full property list is a superset of what should define edge uniqueness.
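That derivation can be sketched roughly as follows. This is a simplified illustration of the rule described above, under the assumption that each identities key is a list of field tokens; `merge_fields` is a hypothetical helper, not the writers' actual code.

```python
def merge_fields(identities, properties, source, target):
    """Pick relationship MERGE keys (simplified sketch).

    Use the first identities key, drop source/target tokens (those match
    nodes, not the relationship), and fall back to all declared edge
    properties when nothing payload-like remains.
    """
    if identities:
        first_key = identities[0]
        fields = [tok for tok in first_key if tok not in (source, target)]
        if fields:
            return fields
    return list(properties)  # fallback: every name in Edge.properties

# No identities declared: all properties participate in the MERGE.
print(merge_fields([], ["date", "score"], "person", "company"))
# → ['date', 'score']

# With an identities key, only payload tokens survive the filter.
print(merge_fields([["person", "company", "date"]], ["date", "score"],
                   "person", "company"))
# → ['date']
```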
Relationship type at ingest time¶
- `relation` on the logical edge: static relationship type when applicable
- `relation_field` on an edge actor step: column/field holding dynamic relationship type values (CSV/tabular; see Example 3)
- `relation_from_key` on an edge actor step: use JSON object keys as relationship types (nested JSON; see Example 4)
Payload from vertices at ingest time¶
Vertex fields that should appear on edges are configured via edge actor options (e.g. vertex_weights, maps), not via a weights block on the logical Edge. DB layers may still use an internal WeightConfig built from Edge.properties for backends that need it.
Edge behavior control¶
- Edge physical variants should be modeled with `schema.db_profile.edge_specs[*].purpose` (YAML) / `db_profile.edge_specs[*].purpose` (in code). `Edge.aux` is no longer a behavior switch.

DB-only physical edge metadata (including `purpose`) is configured under `schema.db_profile.edge_specs`, not on `Edge`.
Matching and filtering (ingestion)¶
- `match_source` / `match_target` / `match`: edge actor options for branch selection when building edges from hierarchical documents
Advanced logical configuration¶
- `type`: Edge type (DIRECT or INDIRECT)
- `by`: Vertex name for indirect edges
- DB-specific edge storage/type names are resolved from `schema.db_profile` through DB-aware wrappers (`EdgeConfigDBAware`), not stored on `Edge`.
When to use what¶
relation_field (Example 3):
- Set on the `source`/`target` edge step in the resource pipeline when relationship types live in a column (e.g. `company_a, company_b, relation, date`).
relation_from_key (Example 4):
- Set on the edge step for nested JSON where keys imply relationship types.
properties on the logical edge:
- Declare every relationship attribute you want in the schema (dates, scores, metadata).
- Typed example: `properties: [{name: date, type: DATETIME}, {name: confidence_score, type: FLOAT}]`
- String list: `properties: [date, confidence_score]`
match_source / match_target:
- Edge actor options when multiple branches feed the same vertex types; use to restrict which branches participate in an edge.
DataSource & DataSourceRegistry¶
An AbstractDataSource subclass defines where data comes from and how it is retrieved. Each carries a DataSourceType. The DataSourceRegistry maps data sources to Resources by name.
| DataSourceType | Adapter | Sources |
|---|---|---|
| FILE | FileDataSource | JSON, JSONL, CSV/TSV, Parquet files |
| SPARQL | RdfFileDataSource | Turtle (.ttl), RDF/XML (.rdf), N3 (.n3), JSON-LD files — parsed via rdflib |
| SPARQL | SparqlEndpointDataSource | Remote SPARQL endpoints (e.g. Apache Fuseki) queried via SPARQLWrapper |
| API | APIDataSource | REST API endpoints with pagination, authentication, and retry logic |
| SQL | SQLDataSource | SQL databases via SQLAlchemy with parameterised queries |
| IN_MEMORY | InMemoryDataSource | Python objects (lists, DataFrames) already in memory |
Data sources handle retrieval only. They bind to Resources by name via the DataSourceRegistry, so the same Resource can ingest data from multiple sources without modification.
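The name-based binding can be pictured with a toy registry. `MiniRegistry` below is purely illustrative; graflo's real `DataSourceRegistry` API may differ.

```python
from collections import defaultdict

class MiniRegistry:
    """Toy registry: many data sources can bind to one Resource name."""

    def __init__(self):
        self._by_resource = defaultdict(list)

    def bind(self, resource_name, source):
        # The Resource itself never changes; only the binding does.
        self._by_resource[resource_name].append(source)

    def sources_for(self, resource_name):
        return list(self._by_resource[resource_name])

reg = MiniRegistry()
reg.bind("works", "file:works.jsonl")          # same Resource,
reg.bind("works", "sql:SELECT * FROM works")   # two different origins
print(reg.sources_for("works"))
# → ['file:works.jsonl', 'sql:SELECT * FROM works']
```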
Resource¶
A Resource is the central abstraction that bridges data sources and the graph schema. Each Resource defines a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements:
- How data structures map to vertices and edges
- What transformations to apply
- The actor pipeline for processing documents
Because DataSources bind to Resources by name, the same transformation logic applies regardless of whether data arrives from a file, an API, a SQL table, or a SPARQL endpoint.
Resource-level edge inference controls:
- `infer_edges`: Global toggle for inferred edge emission during assembly (default: `true`).
- `infer_edge_only`: Allow-list of inferred edges (source, target, optional relation).
- `infer_edge_except`: Deny-list of inferred edges (source, target, optional relation).
- `infer_edge_only` and `infer_edge_except` are mutually exclusive and validated against declared schema edges.
- These controls apply to inferred edges only; explicit edge actors in the pipeline are still emitted.
- Auto-exclusion: When a resource pipeline contains any `EdgeActor` for edges of type `(source, target)`, `(source, target, None)` is automatically added to `infer_edge_except` for that resource, so inferred edges do not duplicate edges produced by explicit edge actors.
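The auto-exclusion rule reduces to a small set operation. The helper below is a hypothetical sketch of that behavior, not graflo's implementation.

```python
def apply_auto_exclusion(explicit_edge_pairs, infer_edge_except):
    """For every (source, target) covered by an explicit EdgeActor, add
    (source, target, None) to the deny-list so inferred edges do not
    duplicate explicitly produced ones (a sketch of the documented rule)."""
    deny = set(infer_edge_except)
    for source, target in explicit_edge_pairs:
        deny.add((source, target, None))
    return deny

deny = apply_auto_exclusion(
    explicit_edge_pairs=[("person", "company")],   # pipeline has an EdgeActor
    infer_edge_except=[("person", "city", None)],  # user-declared deny-list
)
print(sorted(deny))
# → [('person', 'city', None), ('person', 'company', None)]
```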
Actor¶
An Actor describes how the current level of the document is mapped/transformed to property graph vertices and edges. There are five actor types:

- `DescendActor`: Navigates to the next level in the hierarchy. Supports:
  - `key`: Process a specific key in a dictionary
  - `any_key`: Process all keys in a dictionary (useful when you want to handle multiple keys dynamically)
- `TransformActor`: Applies data transformations
- `VertexActor`: Creates vertices from the current level. Key options:
  - `role` (optional): named accumulator slot. When set, the vertex is stored at `lindex.extend((role, 0))` instead of bare `lindex`, so multiple vertices of the same type in one row (e.g. `role: self`, `role: parent`, `role: child`) occupy distinct slots and can be addressed individually by a downstream edge step.
  - `from`: rename map `{vertex_field: doc_field}`. Only mismatched column names need listing; remaining vertex schema properties are absorbed from the doc automatically (passthrough).
  - `keep_fields`: restrict passthrough to this field subset. Use on role-vertex steps to prevent shared row columns from leaking into placeholder vertices that only carry an ID.
- `EdgeActor`: Creates edges between vertices. Operates in three modes:
  - Static mode (`from`/`to` set on both sides): vertex types declared at config time.
  - Dynamic / mixed mode (at least one of `source_type_field`/`target_type_field`/`source_role`/`target_role` set): vertex types resolved at extraction time by looking up accumulator slots. `source_role`/`target_role` are ergonomic aliases for `source_type_field`/`target_type_field` — the slot lookup is identical whether the slot was populated by `vertex` + `role` or `vertex_router` + `role` (with the router role inferred from `type_field` when omitted).
  - Multi-link mode (`links` list set): each item in `links` emits one edge intent per row. Use when one flat row encodes multiple distinct relationship types (e.g. `is_child_of` and `is_parent_of` from the same row).
- `VertexRouterActor`: Routes documents to the correct `VertexActor` based on a type field read from the document at runtime. Vertices are stored at `lindex.extend((role, 0))`; when `role` is omitted it is inferred from `type_field`. An optional router-level `from` provides a default `{vertex_field: doc_field}` projection; `vertex_from_map` overrides it per resolved vertex type. Use when the vertex type varies per row; for a fixed vertex type with role-distinct slots, use `vertex` + `role` instead.
```mermaid
flowchart TB
    subgraph actors [Actor Types]
        D[DescendActor]
        T[TransformActor]
        V["VertexActor\n(optional role)"]
        E["EdgeActor\n(static · dynamic · multi-link)"]
        VR["VertexRouterActor\n(type from doc)"]
    end
    Doc[Document] --> D
    Doc --> T
    Doc --> V
    Doc --> E
    Doc --> VR
    V -.->|"role='r'\n→ store at lindex.(r,0)"| slot_r["acc_vertex slot (r,0)"]
    VR -.->|"role='r' (or inferred from type_field)\n→ store at lindex.(r,0)"| slot_tf["acc_vertex slot (r,0)"]
    E -.->|"source_role='r' or\nsource_type_field='tf'\n→ scan acc_vertex at slot"| slot_r
    E -.->|"links: [...]"| multi["N edge intents per row"]
```
Accumulator slots: vertex+role vs vertex_router¶
Both mechanisms write vertices to a named sub-slot of the current LocationIndex. A downstream dynamic EdgeActor scans acc_vertex for data at the same slot path.
| Mechanism | When the vertex type is... | Slot name comes from... |
|---|---|---|
| `vertex: T, role: r` | static (known at schema design time) | `role` value |
| `vertex_router` with `type_field: tf` (optional `role: r`) | dynamic (read from a doc column at runtime) | `role` (`type_field` when `role` is omitted) |
The EdgeActor vocabulary matches:
| Slot populated by | Edge config field |
|---|---|
| `vertex` + `role` | `source_role` / `target_role` |
| `vertex_router` | `source_type_field` / `target_type_field` (or `source_role` / `target_role`) |
Both pairs are equivalent at runtime — they name the same path segment in acc_vertex.
Dynamic Edge Scenario Matrix¶
| Vertex types | Relation | Config pattern |
|---|---|---|
| Both static | Static | from: server, to: database, relation: uses |
| Both static | Dynamic from field | from: server, to: database, relation_field: rt |
| Both static | Dynamic from key | from: server, to: database, relation_from_key: true |
| Both dynamic (router) | Static | source_role: src, target_role: tgt, relation: uses |
| Both dynamic (router) | Dynamic from field | source_role: src, target_role: tgt, relation_field: rt |
| Both role-slot | Static | source_role: self, target_role: parent, relation: is_child_of |
| Mixed (static + dynamic) | Dynamic | from: person, target_role: tgt, relation_field: rt |
| Mixed (dynamic + static) | Dynamic | source_role: src, to: institution, relation_field: rt |
| Multiple relations from one row | Static per link | links: [{source_role: self, target_role: parent, relation: is_child_of}, ...] |
source_type_field / target_type_field (or source_role / target_role) must equal the accumulator slot segment of the upstream VertexRouterActor — role (inferred from type_field when omitted). For a static vertex step, source_role / target_role must equal that step’s role.
Type Safety Controls¶
When dynamic edge types are used, a row may encounter a (source_type, target_type) pair not pre-declared in the schema edge_config. By default (strict_edge_types: false) this pair is registered at runtime. For strictly-typed databases that require DDL before writes, set:
```yaml
edge:
  source_type_field: S
  target_type_field: T
  strict_edge_types: true  # skip rows whose resolved pair is not pre-declared
```
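The strict-versus-permissive behavior can be sketched as follows. `resolve_rows` is a hypothetical illustration of the documented semantics, not the actual ingestion code.

```python
def resolve_rows(rows, declared_pairs, strict_edge_types=False):
    """Sketch of dynamic edge typing.

    Each row carries its resolved (source_type, target_type). In strict
    mode, undeclared pairs are skipped (for DDL-first engines); in
    permissive mode (the default), they are registered at runtime.
    """
    declared = set(declared_pairs)
    emitted, skipped = [], []
    for src_t, tgt_t in rows:
        if (src_t, tgt_t) not in declared:
            if strict_edge_types:
                skipped.append((src_t, tgt_t))  # engine requires DDL first
                continue
            declared.add((src_t, tgt_t))        # register pair at runtime
        emitted.append((src_t, tgt_t))
    return emitted, skipped

rows = [("server", "database"), ("server", "rack")]
print(resolve_rows(rows, [("server", "database")], strict_edge_types=True))
# → ([('server', 'database')], [('server', 'rack')])
```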
Location-scoped observations, transforms, and routers¶
Ingestion pipelines walk nested JSON (or list-shaped branches). At each step, actors receive:
- A `LocationIndex` — a path into the document (which list index, which object key, and so on).
- An observation slice — usually a `dict` that is the current fragment of the document for that path (for example the element produced by a `DescendActor` iteration). Tabular sources are the special case where the top-level slice is one flat object per record.
Transform output is not written back onto that slice automatically. TransformActor appends a TransformPayload to ExtractionContext.transform_buffer[location] for the same LocationIndex it was invoked with. Later actors at that location can consume those named fields.
VertexActor with role stores the vertex at lindex.extend((role, 0)) using the configured role string as the slot segment. Extraction reads from an effective observation built from the current doc slice plus same-location transform buffer values (transform values override raw doc values on conflicts). Field greediness is controlled explicitly via extraction_scope: full (default) keeps passthrough behavior for remaining schema properties, while mapped_only extracts only fields explicitly mapped in from. In full, keep_fields restricts passthrough to a subset and helps prevent unrelated row columns from leaking into placeholder role vertices that only need an ID. A downstream edge step references the slot via source_role / target_role.
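A condensed sketch of that extraction flow, assuming the behavior described above (the function and its parameter shapes are illustrative, not graflo's API):

```python
def effective_observation(doc_slice, transform_values, from_map=None,
                          keep_fields=None, extraction_scope="full",
                          schema_props=()):
    """Build the effective observation and extract vertex fields.

    Transform-buffer values override raw doc values; `from` renames
    mismatched columns; `mapped_only` suppresses passthrough; in `full`
    scope, `keep_fields` restricts which remaining schema properties
    pass through.
    """
    obs = {**doc_slice, **transform_values}  # transforms win on conflict
    from_map = from_map or {}
    out = {vf: obs[df] for vf, df in from_map.items() if df in obs}
    if extraction_scope == "full":           # passthrough remaining props
        for prop in schema_props:
            if prop in out:
                continue
            if keep_fields is not None and prop not in keep_fields:
                continue                     # blocked by keep_fields
            if prop in obs:
                out[prop] = obs[prop]
    return out

doc = {"pid": "p1", "name": "Ada", "extra": 1}
print(effective_observation(doc, {"name": "ADA"},
                            from_map={"_key": "pid"},
                            keep_fields={"name"},
                            schema_props=("name", "extra")))
# → {'_key': 'p1', 'name': 'ADA'}
```

Note how `extra` never leaks through: `keep_fields` limits passthrough to `name`, and the transform-buffer value overrides the raw `name`.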
VertexRep carries only the extracted vertex document. Row-level merged observation state used for edge relation/weight derivation lives in ExtractionContext.obs_buffer and is looked up by LocationIndex (with parent lookup for nested scopes).
VertexRouterActor builds an effective observation by merging the current dict slice with all TransformPayload entries at that LocationIndex. Routing fields (type_field, optional from / vertex_from_map, optional keep_fields, optional extraction_scope) are read from this merged view — the same dict is passed to the lazily created VertexActor (no separate rename/slice layer). The vertex is accumulated at lindex.extend((role, 0)), where role is inferred from type_field when omitted. A downstream dynamic EdgeActor finds it by setting source_role / target_role (or source_type_field / target_type_field) to that same slot segment.
Dynamic EdgeActor (slot mode) also merges the doc with the transform buffer before reading relation_field; this ensures that values produced by upstream transforms (e.g. canonicalized relation names) are available at edge construction time.
Multi-link EdgeActor (when links is set) delegates to one sub-actor per link. Each sub-actor performs a full single-intent edge resolution; the results accumulate into the same ExtractionContext. The links field is mutually exclusive with all top-level source/target fields on the same step.
Scoping: transform_buffer is keyed only by the exact LocationIndex. A transform at a parent path does not appear in the buffer for a child path, and vice versa. That keeps parent/child branches separate.
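Modeling the `LocationIndex` as a tuple path makes the exact-key scoping concrete (a sketch; the real buffer structure is internal to graflo):

```python
# transform_buffer keyed by the exact LocationIndex path: parent, child,
# and sibling paths never see each other's payloads.
transform_buffer = {
    ("authors", 0): {"name_norm": "ada lovelace"},  # one child location
    (): {"doc_id": "D1"},                           # the root location
}

def payloads_at(location):
    """Exact-key lookup only: no parent/child inheritance."""
    return transform_buffer.get(tuple(location), {})

print(payloads_at(("authors", 0)))  # → {'name_norm': 'ada lovelace'}
print(payloads_at(("authors", 1)))  # sibling path sees nothing → {}
print(payloads_at(()))              # → {'doc_id': 'D1'}
```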
Descend behavior: When DescendActor expands a collection, inner actors see sub_doc (one child value) per iteration — not the full parent object — unless you denormalize parent fields onto each child or structure the pipeline so the router runs at a level where the slice already contains what you need.
Future discussion (not implemented): Opt-in inheritance of specific fields from a parent LocationIndex (or a parent observation stack) could simplify parent–child edges without duplicating data on every child; that would be an explicit configuration surface to avoid breaking the default isolation above.
Transform¶
A Transform defines data transforms, from renaming and type-casting to
arbitrary Python functions. The transform system is built on two layers:
For a dedicated guide covering all transform use cases and configuration
options (inline/local usage, reusable use references, multi-field
strategies, and key transforms), see Transforms.
- ProtoTransform — the raw function wrapper. It holds `module`, `foo` (the function name), and `params`. Its `apply()` method invokes the function without caring about where the inputs come from or how the outputs are packaged.
- Transform — wraps a ProtoTransform with input extraction, output formatting, field mapping, and optional dressing.
Output modes¶
A Transform can produce output in three ways:

- Direct output (`output`) — the function returns one or more values that map 1:1 to output field names:

  ```yaml
  - foo: parse_date_ibes
    module: graflo.util.transform
    input: [ANNDATS, ANNTIMS]
    output: [datetime_announce]
  ```

  The function takes two arguments and returns a single string; the string is placed into the `datetime_announce` field.

- Field mapping (`map`) — pure renaming with no function.

- Dressed output (`dress`) — the function returns a single scalar, and the result is packaged together with the input field name into a dict. This is useful for pivoting wide columns into key/value rows:

  ```yaml
  - foo: round_str
    module: graflo.util.transform
    params:
      ndigits: 3
    input:
      - Open
    dress:
      key: name
      value: value
  ```

  Given a document `{Open: "6.430062..."}`, this produces `{name: "Open", value: 6.43}`. The `dress` dict has two roles:

  - `key` — the output field that receives the input field name (here `"Open"`)
  - `value` — the output field that receives the function result (here `6.43`)

You can also use `dress` as a shorthand without a callable when you only want to pivot one field into key/value form. Given `{vol: 0.123}`, this produces `{type: "vol", value: 0.123}`.

This cleanly separates what function to apply (ProtoTransform) from how to present the result (dressing).
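The dress packaging can be sketched in a few lines. `dressed_output` is a hypothetical illustration of the documented behavior, and `round_str` below is a stand-in with assumed behavior, not the actual `graflo.util.transform.round_str`.

```python
def dressed_output(doc, input_field, fn, dress):
    """Run the callable on one input field, then package the field name
    and the scalar result under the configured dress keys."""
    result = fn(doc[input_field])
    return {dress["key"]: input_field, dress["value"]: result}

def round_str(s, ndigits=3):
    # Stand-in for graflo.util.transform.round_str (assumed behavior):
    # parse a numeric string and round it.
    return round(float(s), ndigits)

print(dressed_output({"Open": "6.430062"}, "Open", round_str,
                     {"key": "name", "value": "value"}))
# → {'name': 'Open', 'value': 6.43}
```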
Key transforms¶
Transforms can also target document keys (not values) using `transform.call.target: keys`. Key mode uses implicit per-key execution and a selector under `call.keys`:

- `mode: all` — apply to all keys
- `mode: include` — apply only to listed keys
- `mode: exclude` — apply to all keys except listed keys
Example: normalize all keys to snake case:
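A minimal config for this might look as follows — it reuses the `camel_to_snake` helper from `graflo.util.transform` shown earlier in this page; the exact pairing of that function with key mode here is a plausible sketch, not a verified recipe:

```yaml
- transform:
    call:
      module: graflo.util.transform
      foo: camel_to_snake
      target: keys
      keys:
        mode: all
```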
Example: strip raw_ only from selected keys:
```yaml
- transform:
    call:
      module: graflo.util.transform
      foo: remove_prefix
      params: {prefix: "raw_"}
      target: keys
      keys:
        mode: include
        names: [raw_id, raw_label]
```
Grouped value transforms¶
For repeated tuple-style value calls, use an explicit `input_groups` in `transform.call`:
```yaml
- transform:
    call:
      module: my_pkg.transforms
      foo: join_name
      input_groups:
        - [fname_parent, lname_parent]
        - [fname_child, lname_child]
      output: [parent_name, child_name]
```
This executes one function call per group with deterministic output mapping.
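The execution semantics can be sketched as a loop over groups with a zipped output mapping. `run_input_groups` is illustrative, and `join_name` below is a stand-in with assumed behavior for the `my_pkg.transforms.join_name` referenced in the config.

```python
def run_input_groups(doc, fn, input_groups, output):
    """One function call per group; results map to `output` in order,
    giving the deterministic output mapping described above."""
    results = [fn(*[doc[field] for field in group]) for group in input_groups]
    return dict(zip(output, results))

def join_name(first, last):
    # Stand-in for my_pkg.transforms.join_name (assumed behavior).
    return f"{first} {last}"

doc = {"fname_parent": "Ada", "lname_parent": "Lovelace",
       "fname_child": "Byron", "lname_child": "King"}
print(run_input_groups(doc, join_name,
                       [["fname_parent", "lname_parent"],
                        ["fname_child", "lname_child"]],
                       ["parent_name", "child_name"]))
# → {'parent_name': 'Ada Lovelace', 'child_name': 'Byron King'}
```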
```mermaid
flowchart LR
    Doc["Input Document"] -->|"extract input fields"| Proto["ProtoTransform.apply()"]
    Proto -->|"dress is set"| Dressed["{dress.key: input_key,<br/>dress.value: result}"]
    Proto -->|"output is set"| Direct["zip(output, result)"]
    Proto -->|"map only"| Mapped["{new_key: old_value}"]
```
Schema-level transforms¶
Transforms are declared as a list under ingestion_model.transforms and
referenced from resource steps via transform.call.use. This keeps ordering
explicit and allows reuse across multiple pipelines:
```yaml
transforms:
  - name: keep_suffix_id
    foo: split_keep_part
    module: graflo.util.transform
    params: { sep: "/", keep: -1 }
    input: [id]
    output: [_key]

resources:
  - name: works
    apply:
      - transform:
          call:
            use: keep_suffix_id  # references the transform above
            input: [doi]         # override input for this usage
      - vertex: work
```
Transform steps are executed in the order they appear in apply.