Core components

Reference for logical schema pieces, ingestion runtime, actors, and transforms.

Schema

The Schema is the single source of truth for the LPG structure. It encapsulates:

  • Vertex and edge definitions with optional type information
  • Identity and physical index configurations
  • DB profile defaults and DB-aware projection settings
  • Automatic schema inference from normalized PostgreSQL databases (3NF with PK/FK) or from OWL/RDFS ontologies

IngestionModel

The IngestionModel is the source of truth for ingestion runtime behavior. It encapsulates:

  • Resource mappings and actor pipelines
  • Reusable named transforms
  • Runtime initialization against the core schema (finish_init(schema.core_schema))

Manifest-level sanitization

When targeting stricter engines (notably TigerGraph), identifier normalization is handled at the manifest boundary:

  • Implementation lives in graflo.architecture.evolution as SanitizeOp / apply_sanitize (reserved words, DatabaseProfile storage names, TigerGraph per-relation identity alignment, and coordinated ingestion rewrites).
  • Sanitizer.sanitize_manifest(manifest) is the ergonomic wrapper: it builds the evolution op list for the configured DBType and applies it in place (same public API as before).
  • GraphEngine.infer_manifest(...) runs Sanitizer on the assembled GraphManifest before returning, so PostgreSQL inference through the engine stays target-flavor-safe.
  • SQLInferenceManager (infer_artifacts, infer_complete_schema, …) does not sanitize; it keeps source column names in resources so you can compose a manifest and then call Sanitizer once at the boundary (or rely on infer_manifest when using GraphEngine).
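The boundary-time sanitization idea can be sketched in plain Python. This is illustrative only: the real implementation lives in graflo.architecture.evolution (SanitizeOp / apply_sanitize), and the reserved-word list below is invented for the example.

```python
# Minimal sketch of boundary-time identifier sanitization (illustrative only;
# graflo's real logic also handles DatabaseProfile storage names, identity
# alignment, and coordinated ingestion rewrites).
RESERVED = {"vertex", "edge", "graph"}  # hypothetical engine reserved words


def sanitize_identifier(name: str) -> str:
    """Rename identifiers that collide with engine reserved words."""
    return f"{name}_v" if name.lower() in RESERVED else name


def sanitize_names(names: list[str]) -> list[str]:
    """Apply the rename once, at the manifest boundary."""
    return [sanitize_identifier(n) for n in names]
```

The key design point survives even in this toy form: sources keep their original column names, and renaming happens exactly once, at the boundary.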

Manifest/schema renaming

When you need to rename vertex types, edge relations, or ingestion resource names in bulk, use the built-in rename APIs:

  • GraphManifest.rename_entities(vertices=..., edges=..., resources=...)
  • Schema.rename_entities(vertices=..., edges=...)

Each rename argument accepts either:

  • a mapping (dict[str, str]) for explicit substitutions, or
  • a callable (Callable[[str], str]) for programmatic transforms (prefix/suffix/camelize, etc.).

For example:
from graflo.architecture.contract import GraphManifest
from graflo.util.transform import camel_to_snake

manifest = GraphManifest.from_dict(payload)
renamed = manifest.rename_entities(
    vertices={"Person": "author", "Organization": "institution"},
    edges=lambda relation: f"rel_{camel_to_snake(relation)}",
    resources=lambda name: f"src_{name}",
)

GraphManifest.rename_entities(...) updates all relevant references consistently:

  • schema vertex names + edge endpoints/relations
  • ingestion pipelines (vertex, edge/create_edge, nested descend, router mappings)
  • resource edge selectors (infer_edge_only / infer_edge_except) and extra_weights
  • bindings resource references (connectors[].resource_name, resource_connector[].resource)

This API is meant for deterministic contract refactors and complements (not replaces) DB-specific sanitization.
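The dict-or-callable contract above can be sketched as a single normalization step. This is not graflo's internal code, just a minimal illustration of the accepted forms:

```python
# Illustrative sketch: collapse a rename argument (mapping, callable, or None)
# into one lookup function, mirroring the contract described above.
from typing import Callable, Optional, Union


def as_renamer(
    arg: Optional[Union[dict, Callable[[str], str]]]
) -> Callable[[str], str]:
    """A mapping renames only the listed names; a callable is applied to
    every name; None leaves names unchanged."""
    if arg is None:
        return lambda name: name
    if callable(arg):
        return arg
    return lambda name: arg.get(name, name)


rename = as_renamer({"Person": "author"})
# rename("Person") -> "author"; unmapped names pass through unchanged
```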

Vertex

A Vertex describes vertices and their logical identity. It supports:

  • Single or compound identity fields (e.g., ["first_name", "last_name"] instead of "full_name")
  • Property definitions with optional type information:
      • Fields can be specified as strings (backward compatible) or as typed Field objects
      • Supported types: INT, FLOAT, BOOL, STRING, DATETIME
      • Type information enables better validation and database-specific optimizations
  • Duplicate property declarations are normalized by field name:
      • Same-type duplicates merge into one field
      • If one duplicate is typed and the other is untyped, the typed definition wins
      • Conflicting non-null types for the same field name are rejected
  • Filtering conditions
  • Optional blank vertex configuration

Identity defaults are strict at the schema level:

  • VertexConfig.identity_from_all_properties: true: a vertex with no explicit identity falls back to using all of its property names as identity.
  • VertexConfig.identity_from_all_properties: false (default): disables that compatibility fallback, so vertices must declare identity explicitly.
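A hypothetical schema fragment illustrating the strict default (option and field names come from this page; the exact YAML shape may differ):

```yaml
# Hypothetical fragment; exact YAML shape is an assumption for illustration.
vertex_config:
  identity_from_all_properties: false   # default: identity must be explicit
vertices:
  - name: person
    identity: [first_name, last_name]   # explicit compound identity
    properties: [first_name, last_name, age]
```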

Edge

An Edge describes edges and their logical identities. It allows:

  • Optional uniqueness semantics through identities (multiple candidate keys are allowed)
  • properties: relationship payload (names and optional types), same accepted forms as vertex properties (strings, Field, or dicts with at least name)
  • Optional static relation label (e.g. Neo4j relationship type) when it is not derived at ingest time

Ingestion-only controls (relation_field, relation_from_key, match_source, match_target, vertex-sourced edge payload) live on EdgeActor steps and EdgeDerivation, not on the logical Edge model.

Edge properties and configuration

Basic logical fields

  • source: Source vertex name (required)
  • target: Target vertex name (required)
  • identities: Logical identity keys for the edge (each key can induce uniqueness)
  • properties: Declared relationship attributes (typed or untyped)

Neo4j, Memgraph, FalkorDB — relationship MERGE keys: writers match source and target nodes on vertex identity, then MERGE the relationship. The relationship properties that participate in that MERGE (so multiple edges between the same two vertices do not collapse) are derived as follows:

  • Take the first identities key.
  • Keep only tokens that refer to relationship payload: skip source and target; where the relation token is used, it becomes the relation property on the relationship.
  • If that produces no fields (e.g. identities is empty), fall back to all names in Edge.properties.

Declare identities when the full property list is a superset of what should define edge uniqueness.
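The derivation rule can be sketched in a few lines of plain Python. This is an illustration of the rule as described above, not graflo's actual writer code, and it omits the relation-token special case:

```python
def relationship_merge_fields(identities, properties):
    """Sketch of the MERGE-key derivation: take the first identities key,
    drop tokens that refer to the source/target vertices, and fall back to
    all declared properties when nothing remains."""
    structural = {"source", "target"}
    if identities:
        payload = [tok for tok in identities[0] if tok not in structural]
        if payload:
            return payload
    return list(properties)


# With identities declared, only the payload token participates in MERGE:
# relationship_merge_fields([["source", "target", "date"]], ["date", "score"])
# -> ["date"]
```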

Relationship type at ingest time

  • relation on the logical edge: static relationship type when applicable
  • relation_field on an edge actor step: column/field holding dynamic relationship type values (CSV/tabular; see Example 3)
  • relation_from_key on an edge actor step: use JSON object keys as relationship types (nested JSON; see Example 4)

Payload from vertices at ingest time

Vertex fields that should appear on edges are configured via edge actor options (e.g. vertex_weights, maps), not via a weights block on the logical Edge. DB layers may still use an internal WeightConfig built from Edge.properties for backends that need it.
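A hypothetical edge step pulling a vertex field onto the edge payload. The option name vertex_weights comes from this page, but its exact YAML shape is assumed for illustration:

```yaml
# Hypothetical step; the shape of vertex_weights is an assumption.
- edge:
    from: person
    to: institution
    vertex_weights:
      person: [seniority]   # copy this vertex field onto the edge payload
```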

Edge behavior control

  • Edge physical variants should be modeled with schema.db_profile.edge_specs[*].purpose (YAML) / db_profile.edge_specs[*].purpose (in code).
  • Edge.aux is no longer a behavior switch.

DB-only physical edge metadata (including purpose) is configured under schema.db_profile.edge_specs, not on Edge.

Matching and filtering (ingestion)

  • match_source / match_target / match: edge actor options for branch selection when building edges from hierarchical documents

Advanced logical configuration

  • type: Edge type (DIRECT or INDIRECT)
  • by: Vertex name for indirect edges
  • DB-specific edge storage/type names are resolved from schema.db_profile through DB-aware wrappers (EdgeConfigDBAware), not stored on Edge.

When to use what

relation_field (Example 3):

  • Set on the source / target edge step in the resource pipeline when relationship types live in a column (e.g. company_a, company_b, relation, date).

relation_from_key (Example 4):

  • Set on the edge step for nested JSON where keys imply relationship types.

properties on the logical edge:

  • Declare every relationship attribute you want in the schema (dates, scores, metadata).
  • Typed example: properties: [{name: date, type: DATETIME}, {name: confidence_score, type: FLOAT}]
  • String list: properties: [date, confidence_score]

match_source / match_target:

  • Edge actor options when multiple branches feed the same vertex types; use to restrict which branches participate in an edge.
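The two dynamic-relation options might look like this in a resource pipeline. Field names follow the scenario matrix later on this page; column names are invented and the exact step shape may differ:

```yaml
# relation_field: the relationship type lives in a column of a flat row,
# e.g. {company_a: ..., company_b: ..., relation: supplies, date: ...}
- edge:
    from: company_a
    to: company_b
    relation_field: relation

# relation_from_key: nested JSON object keys imply the relationship type
- edge:
    from: server
    to: database
    relation_from_key: true
```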

DataSource & DataSourceRegistry

An AbstractDataSource subclass defines where data comes from and how it is retrieved. Each carries a DataSourceType. The DataSourceRegistry maps data sources to Resources by name.

| DataSourceType | Adapter | Sources |
| --- | --- | --- |
| FILE | FileDataSource | JSON, JSONL, CSV/TSV, Parquet files |
| SPARQL | RdfFileDataSource | Turtle (.ttl), RDF/XML (.rdf), N3 (.n3), JSON-LD files — parsed via rdflib |
| SPARQL | SparqlEndpointDataSource | Remote SPARQL endpoints (e.g. Apache Fuseki) queried via SPARQLWrapper |
| API | APIDataSource | REST API endpoints with pagination, authentication, and retry logic |
| SQL | SQLDataSource | SQL databases via SQLAlchemy with parameterised queries |
| IN_MEMORY | InMemoryDataSource | Python objects (lists, DataFrames) already in memory |

Data sources handle retrieval only. They bind to Resources by name via the DataSourceRegistry, so the same Resource can ingest data from multiple sources without modification.

Resource

A Resource is the central abstraction that bridges data sources and the graph schema. Each Resource defines a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements:

  • How data structures map to vertices and edges
  • What transformations to apply
  • The actor pipeline for processing documents

Because DataSources bind to Resources by name, the same transformation logic applies regardless of whether data arrives from a file, an API, a SQL table, or a SPARQL endpoint.

Resource-level edge inference controls:

  • infer_edges: Global toggle for inferred edge emission during assembly (default: true).
  • infer_edge_only: Allow-list of inferred edges (source, target, optional relation).
  • infer_edge_except: Deny-list of inferred edges (source, target, optional relation).
  • infer_edge_only and infer_edge_except are mutually exclusive and validated against declared schema edges.
  • These controls apply to inferred edges only; explicit edge actors in the pipeline are still emitted.
  • Auto-exclusion: When a resource pipeline contains any EdgeActor for edges of type (source, target), (source, target, None) is automatically added to infer_edge_except for that resource, so inferred edges do not duplicate edges produced by explicit edge actors.
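A hypothetical resource fragment using the allow-list form. Control names come from this page; the exact YAML shape is assumed:

```yaml
# Hypothetical fragment; exact YAML shape is an assumption for illustration.
resources:
  - name: works
    infer_edges: true
    infer_edge_only:
      - {source: person, target: work}                      # relation optional
      - {source: work, target: venue, relation: published_in}
```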

Actor

An Actor describes how the current level of the document should be mapped/transformed to the property graph vertices and edges. There are five actor types:

  • DescendActor: Navigates to the next level in the hierarchy. Supports:
      • key: Process a specific key in a dictionary
      • any_key: Process all keys in a dictionary (useful when you want to handle multiple keys dynamically)
  • TransformActor: Applies data transformations
  • VertexActor: Creates vertices from the current level. Key options:
      • role (optional): Named accumulator slot. When set, the vertex is stored at lindex.extend((role, 0)) instead of bare lindex, so multiple vertices of the same type in one row (e.g. role: self, role: parent, role: child) occupy distinct slots and can be addressed individually by a downstream edge step.
      • from: Rename map {vertex_field: doc_field}. Only mismatched column names need listing; remaining vertex schema properties are absorbed from the doc automatically (passthrough).
      • keep_fields: Restrict passthrough to this field subset. Use on role-vertex steps to prevent shared row columns from leaking into placeholder vertices that only carry an ID.
  • EdgeActor: Creates edges between vertices. Operates in three modes:
      • Static mode (from/to set on both sides): vertex types declared at config time.
      • Dynamic / mixed mode (at least one of source_type_field / target_type_field / source_role / target_role set): vertex types resolved at extraction time by looking up accumulator slots. source_role / target_role are ergonomic aliases for source_type_field / target_type_field — the slot lookup is identical whether the slot was populated by vertex+role or vertex_router+role (with router role inferred from type_field when omitted).
      • Multi-link mode (links list set): each item in links emits one edge intent per row. Use when one flat row encodes multiple distinct relationship types (e.g. is_child_of and is_parent_of from the same row).
  • VertexRouterActor: Routes documents to the correct VertexActor based on a type field read from the document at runtime. Vertices are stored at lindex.extend((role, 0)); when role is omitted it is inferred from type_field. Optional router-level from provides a default {vertex_field: doc_field} projection; vertex_from_map overrides per resolved vertex type. Use when the vertex type varies per row; for a fixed vertex type with role-distinct slots, use vertex+role instead.

flowchart TB
    subgraph actors [Actor Types]
        D[DescendActor]
        T[TransformActor]
        V["VertexActor\n(optional role)"]
        E["EdgeActor\n(static · dynamic · multi-link)"]
        VR["VertexRouterActor\n(type from doc)"]
    end
    Doc[Document] --> D
    Doc --> T
    Doc --> V
    Doc --> E
    Doc --> VR
    V -.->|"role='r'\n→ store at lindex.(r,0)"| slot_r["acc_vertex slot (r,0)"]
    VR -.->|"role='r' (or inferred from type_field)\n→ store at lindex.(r,0)"| slot_tf["acc_vertex slot (r,0)"]
    E -.->|"source_role='r' or\nsource_type_field='tf'\n→ scan acc_vertex at slot"| slot_r
    E -.->|"links: [...]"| multi["N edge intents per row"]

Accumulator slots: vertex+role vs vertex_router

Both mechanisms write vertices to a named sub-slot of the current LocationIndex. A downstream dynamic EdgeActor scans acc_vertex for data at the same slot path.

| Mechanism | When the vertex type is... | Slot name comes from... |
| --- | --- | --- |
| vertex: T, role: r | static (known at schema design time) | role value |
| vertex_router: type_field: tf (optional role: r) | dynamic (read from a doc column at runtime) | role (type_field when role is omitted) |

The EdgeActor vocabulary matches:

| Slot populated by | Edge config field |
| --- | --- |
| vertex+role | source_role / target_role |
| vertex_router | source_type_field / target_type_field (or source_role / target_role) |

Both pairs are equivalent at runtime — they name the same path segment in acc_vertex.
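The vertex+role side of this equivalence might look like the following pipeline: two role-distinct vertices of the same type per row, joined by a dynamic edge step. Column names (child_id, parent_id) are invented and the exact step shape is assumed:

```yaml
# Hypothetical pipeline fragment; field names follow the scenario matrix below.
- vertex: person
  role: self
  from: {id: child_id}
  keep_fields: [id]        # keep shared row columns out of the placeholder
- vertex: person
  role: parent
  from: {id: parent_id}
  keep_fields: [id]
- edge:
    source_role: self      # same slot segment as the upstream role values
    target_role: parent
    relation: is_child_of
```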

Dynamic Edge Scenario Matrix

| Vertex types | Relation | Config pattern |
| --- | --- | --- |
| Both static | Static | from: server, to: database, relation: uses |
| Both static | Dynamic from field | from: server, to: database, relation_field: rt |
| Both static | Dynamic from key | from: server, to: database, relation_from_key: true |
| Both dynamic (router) | Static | source_role: src, target_role: tgt, relation: uses |
| Both dynamic (router) | Dynamic from field | source_role: src, target_role: tgt, relation_field: rt |
| Both role-slot | Static | source_role: self, target_role: parent, relation: is_child_of |
| Mixed (static + dynamic) | Dynamic | from: person, target_role: tgt, relation_field: rt |
| Mixed (dynamic + static) | Dynamic | source_role: src, to: institution, relation_field: rt |
| Multiple relations from one row | Static per link | links: [{source_role: self, target_role: parent, relation: is_child_of}, ...] |

source_type_field / target_type_field (or source_role / target_role) must equal the accumulator slot segment of the upstream VertexRouterActor's role (inferred from type_field when omitted). For a static vertex step, source_role / target_role must equal that step's role.

Type Safety Controls

When dynamic edge types are used, a row may encounter a (source_type, target_type) pair not pre-declared in the schema edge_config. By default (strict_edge_types: false) this pair is registered at runtime. For strictly-typed databases that require DDL before writes, set:

edge:
  source_type_field: S
  target_type_field: T
  strict_edge_types: true   # skip rows whose resolved pair is not pre-declared

Location-scoped observations, transforms, and routers

Ingestion pipelines walk nested JSON (or list-shaped branches). At each step, actors receive:

  • A LocationIndex — a path into the document (which list index, which object key, and so on).
  • An observation slice — usually a dict that is the current fragment of the document for that path (for example the element produced by a DescendActor iteration). Tabular sources are the special case where the top-level slice is one flat object per record.

Transform output is not written back onto that slice automatically. TransformActor appends a TransformPayload to ExtractionContext.transform_buffer[location] for the same LocationIndex it was invoked with. Later actors at that location can consume those named fields.

VertexActor with role stores the vertex at lindex.extend((role, 0)) using the configured role string as the slot segment. Extraction reads from an effective observation built from the current doc slice plus same-location transform buffer values (transform values override raw doc values on conflicts). Field greediness is controlled explicitly via extraction_scope: full (default) keeps passthrough behavior for remaining schema properties, while mapped_only extracts only fields explicitly mapped in from. In full, keep_fields restricts passthrough to a subset and helps prevent unrelated row columns from leaking into placeholder role vertices that only need an ID. A downstream edge step references the slot via source_role / target_role.

VertexRep carries only the extracted vertex document. Row-level merged observation state used for edge relation/weight derivation lives in ExtractionContext.obs_buffer and is looked up by LocationIndex (with parent lookup for nested scopes).

VertexRouterActor builds an effective observation by merging the current dict slice with all TransformPayload entries at that LocationIndex. Routing fields (type_field, optional from / vertex_from_map, optional keep_fields, optional extraction_scope) are read from this merged view — the same dict is passed to the lazily created VertexActor (no separate rename/slice layer). The vertex is accumulated at lindex.extend((role, 0)), where role is inferred from type_field when omitted. A downstream dynamic EdgeActor finds it by setting source_role / target_role (or source_type_field / target_type_field) to that same slot segment.

Dynamic EdgeActor (slot mode) also merges the doc with the transform buffer before reading relation_field; this ensures that values produced by upstream transforms (e.g. canonicalized relation names) are available at edge construction time.

Multi-link EdgeActor (when links is set) delegates to one sub-actor per link. Each sub-actor performs a full single-intent edge resolution; the results accumulate into the same ExtractionContext. The links field is mutually exclusive with all top-level source/target fields on the same step.

Scoping: transform_buffer is keyed only by the exact LocationIndex. A transform at a parent path does not appear in the buffer for a child path, and vice versa. That keeps parent/child branches separate.
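The exact-key scoping can be sketched with a plain dict keyed by location path. This is an illustration of the rule, not graflo's ExtractionContext:

```python
# Minimal sketch: transform payloads are keyed by the exact location path,
# so a payload stored at a parent path is invisible at a child path.
from collections import defaultdict

transform_buffer = defaultdict(list)  # location tuple -> list of payload dicts


def store(location, payload):
    """Record a transform payload at one exact location."""
    transform_buffer[location].append(payload)


def effective_observation(location, doc_slice):
    """Merge the raw doc slice with same-location transform payloads.
    Transform values override raw doc values on conflicts."""
    merged = dict(doc_slice)
    for payload in transform_buffer[location]:
        merged.update(payload)
    return merged
```

A payload stored at ("records", 0) affects only actors at that exact path; a child path such as ("records", 0, "children", 0) sees nothing from it.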

Descend behavior: When DescendActor expands a collection, inner actors see sub_doc (one child value) per iteration — not the full parent object — unless you denormalize parent fields onto each child or structure the pipeline so the router runs at a level where the slice already contains what you need.

Future discussion (not implemented): Opt-in inheritance of specific fields from a parent LocationIndex (or a parent observation stack) could simplify parent–child edges without duplicating data on every child; that would be an explicit configuration surface to avoid breaking the default isolation above.

Transform

A Transform defines data transforms, from renaming and type-casting to arbitrary Python functions. The transform system is built on two layers:

For a dedicated guide covering all transform use cases and configuration options (inline/local usage, reusable use references, multi-field strategies, and key transforms), see Transforms.

  • ProtoTransform — the raw function wrapper. It holds module, foo (function name), and params. Its apply() method invokes the function without caring about where the inputs come from or how the outputs are packaged.
  • Transform — wraps a ProtoTransform with input extraction, output formatting, field mapping, and optional dressing.

Output modes

A Transform can produce output in three ways:

  1. Direct output (output) — the function returns one or more values that map 1:1 to output field names:

    - foo: parse_date_ibes
      module: graflo.util.transform
      input: [ANNDATS, ANNTIMS]
      output: [datetime_announce]
    

    The function takes two arguments and returns a single string; the string is placed into the datetime_announce field.

  2. Field mapping (map) — pure renaming with no function:

    - map:
        Date: t_obs
    
  3. Dressed output (dress) — the function returns a single scalar, and the result is packaged together with the input field name into a dict. This is useful for pivoting wide columns into key/value rows:

    - foo: round_str
      module: graflo.util.transform
      params:
        ndigits: 3
      input:
      - Open
      dress:
        key: name
        value: value
    

    Given a document {Open: "6.430062..."}, this produces {name: "Open", value: 6.43}. The dress dict has two roles:

    • key — the output field that receives the input field name (here "Open")
    • value — the output field that receives the function result (here 6.43)

    You can also use dress as a shorthand without a callable when you only want to pivot one field into key/value form:

    - transform:
        call:
          input: [vol]
          dress:
            key: type
            value: value
    

    Given {vol: 0.123}, this produces {type: "vol", value: 0.123}.

    This cleanly separates what function to apply (ProtoTransform) from how to present the result (dressing).

Key transforms

Transforms can also target document keys (not values) using transform.call.target: keys. Key mode uses implicit per-key execution and a selector under call.keys:

  • mode: all — apply to all keys
  • mode: include — apply only to listed keys
  • mode: exclude — apply to all keys except listed keys

Example: normalize all keys to snake case:

- transform:
    call:
      module: graflo.util.transform
      foo: camel_to_snake
      target: keys
      keys:
        mode: all

Example: strip raw_ only from selected keys:

- transform:
    call:
      module: graflo.util.transform
      foo: remove_prefix
      params: {prefix: "raw_"}
      target: keys
      keys:
        mode: include
        names: [raw_id, raw_label]

Grouped value transforms

For repeated tuple-style value calls, use explicit input_groups in transform.call:

- transform:
    call:
      module: my_pkg.transforms
      foo: join_name
      input_groups:
        - [fname_parent, lname_parent]
        - [fname_child, lname_child]
      output: [parent_name, child_name]

This executes one function call per group with deterministic output mapping.
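The group-to-output mapping can be sketched in plain Python. This is illustrative, not graflo's transform engine; join_name and the field names are invented:

```python
def apply_input_groups(doc, foo, input_groups, output):
    """Sketch: call foo once per input group, then zip the results to the
    declared output names in order."""
    results = [foo(*(doc[field] for field in group)) for group in input_groups]
    return dict(zip(output, results))


def join_name(first, last):
    """Hypothetical transform function."""
    return f"{first} {last}"


row = {"fname_parent": "Ada", "lname_parent": "Lovelace",
       "fname_child": "Alan", "lname_child": "Turing"}
# apply_input_groups(row, join_name,
#                    [["fname_parent", "lname_parent"],
#                     ["fname_child", "lname_child"]],
#                    ["parent_name", "child_name"])
# -> {"parent_name": "Ada Lovelace", "child_name": "Alan Turing"}
```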

flowchart LR
    Doc["Input Document"] -->|"extract input fields"| Proto["ProtoTransform.apply()"]
    Proto -->|"dress is set"| Dressed["{dress.key: input_key,<br/>dress.value: result}"]
    Proto -->|"output is set"| Direct["zip(output, result)"]
    Proto -->|"map only"| Mapped["{new_key: old_value}"]

Schema-level transforms

Transforms are declared as a list under ingestion_model.transforms and referenced from resource steps via transform.call.use. This keeps ordering explicit and allows reuse across multiple pipelines:

transforms:
  - name: keep_suffix_id
    foo: split_keep_part
    module: graflo.util.transform
    params: { sep: "/", keep: -1 }
    input: [id]
    output: [_key]

resources:
- name: works
  apply:
  - transform:
      call:
        use: keep_suffix_id      # references the transform above
        input: [doi]             # override input for this usage
  - vertex: work

Transform steps are executed in the order they appear in apply.