Creating a Schema

This guide explains how to define a graflo Schema: the central configuration that describes your graph structure (vertices and edges), how data is transformed (resources and actors), and optional metadata. The content is structured so that both developers and automated agents can follow the same principles.

Principles

  1. Schema is the single source of truth for the graph: vertex types, edge types, indexes, and the mapping from raw data to vertices/edges.
  2. All schema configs are Pydantic models (ConfigBaseModel). You can load from YAML or dicts; validation runs at load time.
  3. Resources define data pipelines: each resource has a unique resource_name and an apply (or pipeline) list of actor steps. Data sources (files, APIs, SQL) are bound to resources by name elsewhere (e.g. Patterns).
  4. Order of definition is flexible in YAML: general, vertex_config, edge_config, resources, and transforms can appear in any order. References (e.g. vertex names in edges or in apply) must refer to names defined in the same schema.

Schema structure

A Schema has five top-level parts:

Section         Required   Description
general         Yes        Schema name and optional version.
vertex_config   Yes        Vertex types and their fields, indexes, filters.
edge_config     Yes        Edge types (source, target, weights, indexes).
resources       No         List of resources: data pipelines (apply/pipeline) that map data to vertices and edges.
transforms      No         Named transform functions used by resources.
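
Putting these together, a schema file is a single YAML document with up to five top-level keys. A minimal skeleton (the contents of each section are described below; the empty placeholders only illustrate the layout):

general:
  name: my_graph
vertex_config:
  vertices: []          # vertex type definitions
edge_config:
  edges: []             # edge type definitions
resources: []           # optional: data pipelines
transforms: {}          # optional: named transform functions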

general (SchemaMetadata)

Identifies the schema. Used for versioning and as fallback graph/schema name when the database config does not set one.

general:
  name: my_graph          # required
  version: "1.0"          # optional
  • name: Required. Identifier for the schema (e.g. graph or database name).
  • version: Optional. Semantic or custom version string.

vertex_config

Defines vertex types: their fields, indexes, and optional filters. Each vertex type has a unique name and is referenced by that name in edges and in resources.

Structure

vertex_config:
  vertices:
    - name: person
      fields: [id, name, age]
      indexes:
        - fields: [id]
    - name: department
      fields: [name]
      indexes:
        - fields: [name]
  blank_vertices: []       # optional: vertex names allowed without explicit data
  force_types: {}           # optional: vertex -> list of field type names
  db_flavor: ARANGO         # optional: ARANGO | NEO4J | TIGERGRAPH

Vertex fields

  • name: Required. Vertex type name (e.g. person, department). Must be unique.
  • fields: List of field definitions (see the sketch after this list). Each item can be:
      • A string (field name, type inferred or omitted).
      • A dict with name and optional type: {"name": "created_at", "type": "DATETIME"}.
      • For TigerGraph or typed backends, use types: INT, UINT, FLOAT, DOUBLE, BOOL, STRING, DATETIME.
  • indexes: List of index definitions. If empty, a single primary index on all fields is created. Each index can specify fields and optionally unique: true/false.
  • filters: Optional list of filter expressions for querying this vertex.
  • dbname: Optional. Database-specific name (e.g. collection/table). Defaults to name if not set.
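
A vertex definition combining these options might look like this (a sketch; the created_at field, the unique flag, and the people dbname are illustrative):

vertex_config:
  vertices:
    - name: person
      dbname: people                        # optional database-specific collection/table name
      fields:
        - id                                # plain field name
        - name
        - {name: age, type: INT}            # typed field (TigerGraph / typed backends)
        - {name: created_at, type: DATETIME}
      indexes:
        - fields: [id]
          unique: true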

VertexConfig-level options

  • blank_vertices: Vertex names that may be created without explicit row data (e.g. placeholders). Each must exist in vertices.
  • force_types: Override mapping from vertex name to list of field type names for inference.
  • db_flavor: Database flavor used for schema/index generation: ARANGO, NEO4J, or TIGERGRAPH.
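
These options combine with the vertex list, for example (a sketch; the force_types entry assumes one type name per field in field order, which is not spelled out above):

vertex_config:
  vertices:
    - name: person
      fields: [id, name, age]
    - name: department
      fields: [name]
  blank_vertices: [department]        # department vertices may be created without row data
  force_types:
    person: [STRING, STRING, INT]     # assumption: type names follow the field order id, name, age
  db_flavor: NEO4J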

edge_config

Defines edge types: source and target vertex types, relation name, weights, and indexes.

Structure

edge_config:
  edges:
    - source: person
      target: department
      # optional: relation, match_source, match_target, weights, indexes, etc.

Edge fields

  • source, target: Required. Vertex type names (must exist in vertex_config.vertices).
  • relation: Optional. Relationship/edge type name (especially relevant for Neo4j). For ArangoDB it can be used as a weight.
  • relation_field: Optional. Field name that stores or reads the relation type (e.g. for TigerGraph).
  • relation_from_key: Optional. If true, derive relation from the location key during ingestion (e.g. JSON key).
  • match_source, match_target: Optional. Fields used to match source/target vertices when creating edges.
  • weights: Optional. Weight/attribute configuration (see the sketch after this list):
      • direct: List of field names or typed fields to attach directly to the edge (e.g. ["date", "weight"] or [{"name": "date", "type": "DATETIME"}]).
      • vertices: List of vertex-based weight definitions.
  • indexes (or index): Optional. List of index definitions for the edge.
  • purpose: Optional. Extra label for utility edges between the same vertex types.
  • type: Optional. DIRECT (default) or INDIRECT.
  • aux: Optional. If true, edge is created in DB but not used by graflo ingestion.
  • by: Optional. For INDIRECT edges: vertex type name used to define the edge.
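
A fuller edge definition might look like this (a sketch; works_in, email, and since are illustrative relation/field names):

edge_config:
  edges:
    - source: person
      target: department
      relation: works_in                       # edge type name (used by Neo4j)
      match_source: email                      # match source person vertices on this field
      match_target: name                       # match target department vertices on this field
      weights:
        direct:
          - {name: since, type: DATETIME}      # attribute attached directly to the edge
      indexes:
        - fields: [since]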

resources (focus)

Resources define how each data stream is turned into vertices and edges. Each resource has a unique resource_name (used by Patterns / DataSourceRegistry to bind files, APIs, or SQL to this pipeline) and an apply (or pipeline) list of actor steps. Steps are executed in order; the pipeline can branch with descend steps.

Resource-level fields

  • resource_name: Required. Unique identifier (e.g. table or file name). Used when mapping data sources to this resource.
  • apply (or pipeline): Required. List of actor steps (see below).
  • encoding: Optional. Character encoding (default UTF_8).
  • merge_collections: Optional. List of collection names to merge when writing.
  • extra_weights: Optional. Additional edge weight configs for this resource.
  • types: Optional. Field name → Python type expression for casting during ingestion (e.g. {"age": "int"}, {"amount": "float"}, {"created_at": "datetime"}). Useful when input is string-only (CSV, JSON) and you need numeric or date values.
  • edge_greedy: Optional. If true (default), emit edges as soon as source/target exist; if false, wait for explicit targets.
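
Combined, a resource entry can look like this (a sketch; people, age, and created_at are illustrative names):

resources:
  - resource_name: people            # must be unique across the schema
    types:
      age: int                       # cast string input to int during ingestion
      created_at: datetime           # cast to datetime
    edge_greedy: true                # default: emit edges as soon as source/target exist
    apply:
      - vertex: person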

Actor steps in apply / pipeline

Each step is a dict. You can write steps in shorthand (e.g. vertex: person) or with an explicit type (vertex, transform, edge, descend). The system recognizes:

  1. Vertex step — create vertices of a given type from the current document level:

    - vertex: person
    
    Optional: keep_fields: [id, name].

  2. Transform step — rename fields, change shape, or apply a named transform; optionally send result to a vertex:

    - map:
        person: name
        person_id: id
      target_vertex: department   # or to_vertex
    
    Or use a named transform (defined in transforms):
    - name: keep_suffix_id
      params: { sep: "/", keep: -1 }
      input: [id]
      output: [_key]
    

  3. Edge step — create edges between two vertex types:

    - source: person
      target: department
    
    Or:
    - edge:
        from: person
        to: department
    
    You can add edge-specific weights, indexes, etc. in the step when needed.

  4. Descend step — go into a nested key and run a sub-pipeline (or process all keys with any_key):

    - key: referenced_works
      apply:
        - vertex: work
        - source: work
          target: work
    
    Or with any_key to iterate over all keys:
    - any_key: true
      apply: [...]
    

Rules for resources (for agents)

  • Unique names: Every resource_name in the schema must be unique.
  • References: All vertex names in apply (e.g. vertex: person, source/target, target_vertex) must exist in vertex_config.vertices. All edge relationships implied by source/target should exist in edge_config.edges (or be compatible).
  • Order: Steps run in sequence. Typically you create vertices before creating edges that reference them; use transform to reshape data and descend to handle nested structures.
  • Transforms: If a step uses name: <transform_name>, that name must exist in transforms (see below).
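
A resource that follows these rules might look like this (a sketch; work, referenced_works, and keep_suffix_id are illustrative and would have to exist in vertex_config, edge_config, and transforms respectively):

resources:
  - resource_name: works                  # unique within the schema
    apply:
      - name: keep_suffix_id              # named transform; must be defined under transforms
        input: [id]
        output: [_key]
      - vertex: work                      # create vertices before the edges that reference them
      - key: referenced_works             # descend into a nested key
        apply:
          - vertex: work
          - source: work                  # edge step; a work -> work edge should be defined in edge_config
            target: work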

transforms

Optional dictionary of named transforms used by resources. Keys are transform names; values are configs (e.g. foo, module, params, input, output).

transforms:
  keep_suffix_id:
    foo: split_keep_part
    module: graflo.util.transform
    params: { sep: "/", keep: -1 }
    input: [id]
    output: [_key]

Resources refer to them with name: keep_suffix_id (and optional params, input, output overrides) in a transform step.

Loading a schema

All schema configs are Pydantic models. You can load a Schema from a dict or YAML:

from graflo import Schema
from suthing import FileHandle

# Load the YAML file into a dict (the YAML root must be the schema dict),
# then validate it into a Schema
data = FileHandle.load("schema.yaml")
schema = Schema.model_validate(data)

# Or use the explicit constructor
schema = Schema.from_dict(data)

After loading, the schema runs finish_init(), which resolves transform names, initializes edges, builds resource pipelines, and populates the internal resource name map. If you modify resources programmatically, call schema.finish_init() again so that fetch_resource(name) and ingestion use the updated pipelines.

Minimal full example

general:
  name: hr

vertex_config:
  vertices:
    - name: person
      fields: [id, name, age]
      indexes:
        - fields: [id]
    - name: department
      fields: [name]
      indexes:
        - fields: [name]

edge_config:
  edges:
    - source: person
      target: department

resources:
  - resource_name: people
    apply:
      - vertex: person
  - resource_name: departments
    apply:
      - map:
          person: name
          person_id: id
      - target_vertex: department
        map:
          department: name

This defines two vertex types (person, department), one edge type (person → department), and two resources: people (each row → one person vertex) and departments (transform + department vertices). Data sources are attached to these resources by name (e.g. via Patterns or DataSourceRegistry) as shown in the Quick Start.

See also