Creating a Schema¶
This guide explains how to define a graflo Schema: the central configuration that describes your graph structure (vertices and edges), how data is transformed (resources and actors), and optional metadata. The content is structured so that both developers and automated agents can follow the same principles.
Principles¶
- Schema is the single source of truth for the graph: vertex types, edge types, indexes, and the mapping from raw data to vertices/edges.
- All schema configs are Pydantic models (`ConfigBaseModel`). You can load from YAML or dicts; validation runs at load time.
- Resources define data pipelines: each resource has a unique `resource_name` and an `apply` (or `pipeline`) list of actor steps. Data sources (files, APIs, SQL) are bound to resources by name elsewhere (e.g. `Patterns`).
- Order of definition is flexible in YAML: `general`, `vertex_config`, `edge_config`, `resources`, and `transforms` can appear in any order. References (e.g. vertex names in edges or in `apply`) must refer to names defined in the same schema.
Schema structure¶
A Schema has five top-level parts:
| Section | Required | Description |
|---|---|---|
| `general` | Yes | Schema name and optional version. |
| `vertex_config` | Yes | Vertex types and their fields, indexes, filters. |
| `edge_config` | Yes | Edge types (source, target, weights, indexes). |
| `resources` | No | List of resources: data pipelines (`apply`/`pipeline`) that map data to vertices and edges. |
| `transforms` | No | Named transform functions used by resources. |
general (SchemaMetadata)¶
Identifies the schema. Used for versioning and as the fallback graph/schema name when the database config does not set one.
- `name`: Required. Identifier for the schema (e.g. graph or database name).
- `version`: Optional. Semantic or custom version string.
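For example (the version value is illustrative):

general:
  name: hr
  version: "0.1.0"   # optional; any semantic or custom version string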
vertex_config¶
Defines vertex types: their fields, indexes, and optional filters. Each vertex type has a unique name and is referenced by that name in edges and in resources.
Structure¶
vertex_config:
  vertices:
    - name: person
      fields: [id, name, age]
      indexes:
        - fields: [id]
    - name: department
      fields: [name]
      indexes:
        - fields: [name]
  blank_vertices: []   # optional: vertex names allowed without explicit data
  force_types: {}      # optional: vertex -> list of field type names
  db_flavor: ARANGO    # optional: ARANGO | NEO4J | TIGERGRAPH
Vertex fields¶
- `name`: Required. Vertex type name (e.g. `person`, `department`). Must be unique.
- `fields`: List of field definitions. Each item can be:
    - A string (field name, type inferred or omitted).
    - A dict with `name` and optional `type`: `{"name": "created_at", "type": "DATETIME"}`.
    - For TigerGraph or typed backends, use types: `INT`, `UINT`, `FLOAT`, `DOUBLE`, `BOOL`, `STRING`, `DATETIME`.
- `indexes`: List of index definitions. If empty, a single primary index on all fields is created. Each index can specify `fields` and optionally `unique: true/false`.
- `filters`: Optional list of filter expressions for querying this vertex.
- `dbname`: Optional. Database-specific name (e.g. collection/table). Defaults to `name` if not set.
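For instance, a vertex that mixes string and typed field definitions with a unique index; the `created_at` field and the `dbname` override are illustrative, not required:

vertex_config:
  vertices:
    - name: person
      fields:
        - id
        - name
        - { name: created_at, type: DATETIME }
      indexes:
        - fields: [id]
          unique: true
      dbname: people   # illustrative: store this vertex type in a "people" collection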
VertexConfig-level options¶
- `blank_vertices`: Vertex names that may be created without explicit row data (e.g. placeholders). Each must exist in `vertices`.
- `force_types`: Override mapping from vertex name to list of field type names for inference.
- `db_flavor`: Database flavor used for schema/index generation: `ARANGO`, `NEO4J`, or `TIGERGRAPH`.
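A sketch with these options filled in (the values are illustrative; note that every name in `blank_vertices` must also appear under `vertices`):

vertex_config:
  vertices:
    - name: person
      fields: [id, name]
    - name: department
      fields: [name]
  blank_vertices: [department]     # departments may be created as placeholders
  force_types:
    person: [STRING, STRING]       # illustrative override of the inferred field types
  db_flavor: NEO4J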
edge_config¶
Defines edge types: source and target vertex types, relation name, weights, and indexes.
Structure¶
edge_config:
  edges:
    - source: person
      target: department
      # optional: relation, match_source, match_target, weights, indexes, etc.
Edge fields¶
- `source`, `target`: Required. Vertex type names (must exist in `vertex_config.vertices`).
- `relation`: Optional. Relationship/edge type name (especially for Neo4j). For ArangoDB it can be used as a weight.
- `relation_field`: Optional. Field name that stores or reads the relation type (e.g. for TigerGraph).
- `relation_from_key`: Optional. If true, derive the relation from the location key during ingestion (e.g. a JSON key).
- `match_source`, `match_target`: Optional. Fields used to match source/target vertices when creating edges.
- `weights`: Optional. Weight/attribute configuration:
    - `direct`: List of field names or typed fields to attach directly to the edge (e.g. `["date", "weight"]` or `[{"name": "date", "type": "DATETIME"}]`).
    - `vertices`: List of vertex-based weight definitions.
- `indexes` (or `index`): Optional. List of index definitions for the edge.
- `purpose`: Optional. Extra label to distinguish utility edges between the same vertex types.
- `type`: Optional. `DIRECT` (default) or `INDIRECT`.
- `aux`: Optional. If true, the edge is created in the DB but not used by graflo ingestion.
- `by`: Optional. For `INDIRECT` edges: vertex type name used to define the edge.
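Putting several of these together, a sketch of a richer edge definition. The `WORKS_IN` relation name, the match fields, and the `date` weight are illustrative; whether match fields and index fields take exactly this shape should be checked against the API Reference:

edge_config:
  edges:
    - source: person
      target: department
      relation: WORKS_IN            # illustrative relation name (used by Neo4j)
      match_source: [id]            # match persons by their id field
      match_target: [name]          # match departments by name
      weights:
        direct:
          - { name: date, type: DATETIME }
      indexes:
        - fields: [date]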
resources (focus)¶
Resources define how each data stream is turned into vertices and edges. Each resource has a unique resource_name (used by Patterns / DataSourceRegistry to bind files, APIs, or SQL to this pipeline) and an apply (or pipeline) list of actor steps. Steps are executed in order; the pipeline can branch with descend steps.
Resource-level fields¶
- `resource_name`: Required. Unique identifier (e.g. table or file name). Used when mapping data sources to this resource.
- `apply` (or `pipeline`): Required. List of actor steps (see below).
- `encoding`: Optional. Character encoding (default `UTF_8`).
- `merge_collections`: Optional. List of collection names to merge when writing.
- `extra_weights`: Optional. Additional edge weight configs for this resource.
- `types`: Optional. Mapping of field name → Python type expression for casting during ingestion (e.g. `{"age": "int"}`, `{"amount": "float"}`, `{"created_at": "datetime"}`). Useful when input is string-only (CSV, JSON) and you need numeric or date values.
- `edge_greedy`: Optional. If true (default), emit edges as soon as source/target exist; if false, wait for explicit targets.
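For example, a resource that casts string input during ingestion and keeps the default greedy edge behaviour (the field names are illustrative):

resources:
  - resource_name: people
    encoding: UTF_8
    types:
      age: int              # cast CSV strings to integers
      hired_at: datetime    # parse date strings
    edge_greedy: true
    apply:
      - vertex: person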
Actor steps in apply / pipeline¶
Each step is a dict. You can write steps in shorthand (e.g. vertex: person) or with an explicit type (vertex, transform, edge, descend). The system recognizes:
- Vertex step — create vertices of a given type from the current document level. Optionally restrict the stored fields with `keep_fields: [id, name]` (all four step kinds are combined in the sketch below).
- Transform step — rename fields, change shape, or apply a named transform (defined in `transforms`); optionally send the result to a vertex via `target_vertex`.
- Edge step — create edges between two vertex types. Edge-specific `weights`, `indexes`, etc. can be added in the step when needed.
- Descend step — go into a nested key and run a sub-pipeline, or use `any_key` to iterate over all keys.
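A minimal combined sketch. The `vertex`, `keep_fields`, `map`, `target_vertex`, and `name` keys are taken from this guide; the exact shape of the edge step (`source`/`target` given directly in the step) and of the descend step (a `key` plus a nested `apply`, or `any_key`) is an assumption here, so check the API Reference for the authoritative step fields:

resources:
  - resource_name: example_stream     # illustrative; not part of the full example below
    apply:
      # Vertex step: one person vertex per input document, keeping two fields
      - vertex: person
        keep_fields: [id, name]
      # Transform step: rename a raw field and send the result to a vertex
      - map:
          dept: name
        target_vertex: department
      # Transform step using a named transform from the transforms section
      - name: keep_suffix_id
      # Edge step (assumed shape): connect the two vertex types
      - source: person
        target: department
      # Descend step (assumed shape): recurse into a nested key with a sub-pipeline
      - key: employees                # or any_key: true to process every key
        apply:
          - vertex: person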
Rules for resources (for agents)¶
- Unique names: Every `resource_name` in the schema must be unique.
- References: All vertex names in `apply` (e.g. `vertex: person`, `source`/`target`, `target_vertex`) must exist in `vertex_config.vertices`. All edge relationships implied by `source`/`target` should exist in `edge_config.edges` (or be compatible).
- Order: Steps run in sequence. Typically you create vertices before creating edges that reference them; use transform steps to reshape data and descend steps to handle nested structures.
- Transforms: If a step uses `name: <transform_name>`, that name must exist in `transforms` (see below).
transforms¶
Optional dictionary of named transforms used by resources. Keys are transform names; values are configs (e.g. foo, module, params, input, output).
transforms:
  keep_suffix_id:
    foo: split_keep_part
    module: graflo.util.transform
    params: { sep: "/", keep: -1 }
    input: [id]
    output: [_key]
Resources refer to them with name: keep_suffix_id (and optional params, input, output overrides) in a transform step.
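For example, a resource step that applies the named transform above; the per-step `params` override is illustrative:

resources:
  - resource_name: people
    apply:
      - name: keep_suffix_id              # named transform from the transforms section
        params: { sep: "-", keep: -1 }    # optional override of the transform's params
      - vertex: person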
Loading a schema¶
All schema configs are Pydantic models. You can load a Schema from a dict or YAML:
from graflo import Schema
from suthing import FileHandle
# Load the YAML into a dict, then validate it as a Schema
schema = Schema.model_validate(FileHandle.load("schema.yaml"))

# Or use the explicit constructor
schema = Schema.from_dict(FileHandle.load("schema.yaml"))

# Two-step form, if your YAML root is the schema dict itself
data = FileHandle.load("schema.yaml")
schema = Schema.model_validate(data)
After loading, the schema runs finish_init(), which resolves transform names, initializes edges, builds resource pipelines, and populates the internal resource name map. If you modify resources programmatically, call schema.finish_init() again so that fetch_resource(name) and ingestion use the updated pipelines.
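A short sketch of that workflow, assuming the loaded Schema exposes its resources as a list of models with a resource_name attribute (as in the YAML structure above):

from graflo import Schema
from suthing import FileHandle

schema = Schema.model_validate(FileHandle.load("schema.yaml"))

# Hypothetical programmatic change: rename the first resource
schema.resources[0].resource_name = "people_v2"

# Rebuild transform names, pipelines, and the internal resource name map
schema.finish_init()

# The resource is now resolvable under its new name
resource = schema.fetch_resource("people_v2")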
Minimal full example¶
general:
  name: hr
vertex_config:
  vertices:
    - name: person
      fields: [id, name, age]
      indexes:
        - fields: [id]
    - name: department
      fields: [name]
      indexes:
        - fields: [name]
edge_config:
  edges:
    - source: person
      target: department
resources:
  - resource_name: people
    apply:
      - vertex: person
  - resource_name: departments
    apply:
      - map:
          person: name
          person_id: id
      - target_vertex: department
        map:
          department: name
This defines two vertex types (person, department), one edge type (person → department), and two resources: people (each row → one person vertex) and departments (transform + department vertices). Data sources are attached to these resources by name (e.g. via Patterns or DataSourceRegistry) as shown in the Quick Start.
See also¶
- Concepts — Schema and constituents, for a higher-level overview.
- Quick Start for loading a schema and running ingestion.
- API Reference — architecture for Pydantic model details.