GraFlo — Graph Schema & Transformation Language (GSTL)
GraFlo is a manifest-driven schema and ingestion layer for labeled property graphs (LPGs).
Write a GraphManifest (YAML or Python) once — it defines vertices, edges, typed properties,
identities, and DB profile — then infer, validate, migrate, and load into any supported graph engine.
It is a Python package and Graph Schema & Transformation Language (GSTL). GraphEngine covers schema inference, migrations, DDL, and ingest; Caster focuses on batching records into a GraphContainer and DBWriter.
What you get
- One pipeline, several graph databases — The same manifest targets ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, NebulaGraph, PostgreSQL (relational vertex + junction edge tables), or a GraFlo file backend on disk; `DatabaseProfile` and DB-aware types absorb naming, defaults, and indexing differences.
- Explicit identities — Vertex identity fields and indexes back upserts so reloads merge on keys instead of blindly duplicating nodes.
- Reusable ingestion — `ResourceConfig` actor pipelines (including vertex / vertex_router / edge steps) bind to files, SQL, SPARQL/RDF, APIs, or in-memory batches via `Bindings` and the `DataSourceRegistry`. A single flat row can populate multiple same-type vertices in distinct named slots (`role`) and emit multiple edges in one `edge: links` step. Per-resource `tolerate_transform_errors` (default on) keeps ingestion moving when an individual transform step fails.
- Schema as the contract — `GraphManifest` is the single source of truth: vertex/edge definitions, typed properties, identity fields, and DB profile are validated at `finish_init` time, not at write time. Schema migrations are first-class (`graflo migrate_schema`).
- Manifest as linked data — Export and restore manifests as RDF via the GraFlo ontology (`manifest-to-rdf` / `rdf-to-manifest` CLI, `graflo.rdf` API).
- Manifest-first sanitization — `Sanitizer` (backed by `graflo.architecture.evolution` `SanitizeOp`) normalizes schema identifiers (reserved words, TigerGraph relation/index constraints) and synchronizes related ingestion mappings via `sanitize_manifest(GraphManifest)`. `GraphEngine.infer_manifest(...)` applies it automatically; the lower-level `SQLInferenceManager` does not, so sanitize the manifest yourself when assembling contracts outside the engine.
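The "Explicit identities" point above comes down to keyed upserts. The following is a minimal sketch of that idea in plain Python, not GraFlo's implementation: the `upsert` helper, the in-memory `store` dict, and the sample records are all hypothetical stand-ins for a real backend.

```python
def upsert(store, vertex_type, identity_fields, doc):
    """Merge doc into the store entry that shares its identity key."""
    # The key combines the vertex type with the values of the identity fields.
    key = (vertex_type, tuple(doc[name] for name in identity_fields))
    store.setdefault(key, {}).update(doc)


store = {}
first_load = [{"id": "u1", "name": "Alice"}, {"id": "u2", "name": "Bob"}]
reload = [{"id": "u1", "name": "Alice", "email": "alice@example.org"}]

# Loading the same identity twice merges into one entry instead of duplicating it.
for doc in first_load + reload:
    upsert(store, "person", ("id",), doc)
```

After both loads the store still holds two `person` entries, and the reloaded record has contributed its extra `email` field to the existing one.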
What’s in the manifest
- `schema` — `Schema`: metadata, `core_schema` (vertices, edges, typed `properties`, identities), and `db_profile` (`DatabaseProfile`: target flavor, storage names, secondary indexes, TigerGraph `default_property_values`, …).
- `ingestion_model` — `IngestionModel`: named `resources` (actor sequences: descend, transform, vertex, edge, …) and a registry of reusable `transforms`.
- `bindings` — Connectors (`FileConnector`, `TableConnector`, `SparqlConnector`, `APIConnector`) plus `resource_connector` wiring. Optional `connector_connection` maps connectors to `conn_proxy` labels so YAML stays secret-free; a runtime `ConnectionProvider` supplies credentials. REST APIs declare pagination on `APIConnector` — see API connector and pagination.
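To make the three top-level blocks concrete, here is a manifest-shaped Python dict. Only the names `schema`, `ingestion_model`, `bindings`, `core_schema`, `db_profile`, `resources`, `transforms`, `resource_connector`, and `connector_connection` come from the list above. Every other nested key and value (the vertex layout, `db_flavor`, `pipeline`, the `people.csv` path) is a hypothetical placeholder and should not be read as GraFlo's actual manifest grammar.

```python
# Hypothetical manifest skeleton; nested field names are illustrative only.
manifest = {
    "schema": {
        "metadata": {"name": "people_graph"},
        "core_schema": {
            "vertices": [
                {"name": "person", "properties": ["id", "name"], "identity": ["id"]}
            ],
            "edges": [{"source": "person", "target": "person", "relation": "knows"}],
        },
        "db_profile": {"db_flavor": "neo4j"},
    },
    "ingestion_model": {
        "resources": [{"name": "people", "pipeline": [{"vertex": "person"}]}],
        "transforms": [],
    },
    "bindings": {
        "connectors": [{"name": "people_file", "path": "people.csv"}],
        "resource_connector": [{"resource": "people", "connector": "people_file"}],
        "connector_connection": [],
    },
}

EXPECTED_BLOCKS = ("schema", "ingestion_model", "bindings")


def missing_blocks(doc):
    """Return the expected top-level blocks that a manifest-like dict lacks."""
    return [key for key in EXPECTED_BLOCKS if key not in doc]
```

A check like `missing_blocks(manifest)` is only a shape test; the validation described in this page happens inside GraFlo at `finish_init` time.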
Runtime path
- Source instance — Batches from a `DataSourceType` adapter (`FileDataSource`, `SQLDataSource`, `SparqlEndpointDataSource`, `APIDataSource`, …).
- Resource (actors) — Maps records to graph elements against the logical schema (validated during `IngestionModel.finish_init` / pipeline execution).
- `GraphContainer` — Intermediate, database-agnostic vertex/edge batches.
- DB-aware projection — `Schema.resolve_db_aware()` plus `VertexConfigDBAware` / `EdgeConfigDBAware` for the active `DBType`.
- Graph DB — `DBWriter` + `ConnectionManager` and the backend-specific `Connection` implementation.
| Piece | Role | Code |
|---|---|---|
| Logical graph schema | Manifest schema: vertex/edge definitions, identities, typed properties, DB profile. Constrains pipeline output and projection; not a separate queue between steps. | `Schema`, `VertexConfig`, `EdgeConfig` (under `core_schema`). |
| Source instance | Concrete input: file, SQL table, SPARQL endpoint, API payload, in-memory rows. | `AbstractDataSource` + `DataSourceType`. |
| Resource | Ordered actors; resources are looked up by name when sources are registered. | `ResourceConfig` in `IngestionModel`; `ResourceRuntime` at cast time. |
| Covariant graph (`GraphContainer`) | Batches of vertices/edges before load. | `GraphContainer`. |
| DB-aware projection | Physical names, defaults, indexes for the target. | `Schema.resolve_db_aware()`, `VertexConfigDBAware`, `EdgeConfigDBAware`. |
| Graph DB | Target LPG; each `DBType` has its own connector, orchestrated the same way. | `ConnectionManager`, `DBWriter`, per-backend `Connection`. |
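The runtime path above can be pictured as a chain of small stages. The sketch below is a toy model in plain Python and uses none of GraFlo's classes: `ToyContainer`, `source_batches`, `employee_resource`, `project`, and `ListWriter` are hypothetical names that stand in for the source adapter, the resource, `GraphContainer`, the DB-aware projection, and the writer respectively.

```python
from dataclasses import dataclass, field


@dataclass
class ToyContainer:
    """Stand-in for a database-agnostic batch of vertices and edges."""
    vertices: list = field(default_factory=list)
    edges: list = field(default_factory=list)


def source_batches(rows, batch_size):
    """Source instance: yield the input rows in fixed-size batches."""
    for start in range(0, len(rows), batch_size):
        yield rows[start:start + batch_size]


def employee_resource(batch):
    """Resource: map each record to logical vertices and edges."""
    out = ToyContainer()
    for row in batch:
        out.vertices.append(("person", {"id": row["id"], "name": row["name"]}))
        if row.get("manager_id") is not None:
            out.edges.append(
                ("person", row["id"], "reports_to", "person", row["manager_id"])
            )
    return out


def project(container, physical_names):
    """DB-aware projection: swap logical labels for physical names."""
    return ToyContainer(
        vertices=[(physical_names[label], props) for label, props in container.vertices],
        edges=[
            (physical_names[src], src_id, relation, physical_names[dst], dst_id)
            for src, src_id, relation, dst, dst_id in container.edges
        ],
    )


class ListWriter:
    """Graph DB stand-in: collects whatever it is asked to write."""

    def __init__(self):
        self.vertices = []
        self.edges = []

    def write(self, container):
        self.vertices.extend(container.vertices)
        self.edges.extend(container.edges)


def run_pipeline(rows, batch_size, physical_names, writer):
    for batch in source_batches(rows, batch_size):
        writer.write(project(employee_resource(batch), physical_names))


rows = [
    {"id": 1, "name": "Alice", "manager_id": None},
    {"id": 2, "name": "Bob", "manager_id": 1},
    {"id": 3, "name": "Carol", "manager_id": 1},
]
writer = ListWriter()
run_pipeline(rows, batch_size=2, physical_names={"person": "Person"}, writer=writer)
```

The point of the separation is that only `project` and the writer know anything about the target database; the resource stage works purely against logical names.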
Supported source types (`DataSourceType`)
| DataSourceType | Adapter | DataSource | Schema inference |
|---|---|---|---|
| `FILE` — CSV / JSON / JSONL / Parquet | `FileConnector` | `FileDataSource` | manual |
| `SQL` — relational tables (docs focus on PostgreSQL; other engines via SQLAlchemy where supported) | `TableConnector` | `SQLDataSource` | automatic for PostgreSQL-style 3NF (PK/FK heuristics) |
| `SPARQL` — RDF files (`.ttl`, `.rdf`, `.n3`) | `SparqlConnector` | `RdfFileDataSource` | automatic (OWL/RDFS ontology) |
| `SPARQL` — SPARQL endpoints (Fuseki, …) | `SparqlConnector` | `SparqlEndpointDataSource` | automatic (OWL/RDFS ontology) |
| `API` — REST APIs | `APIConnector` | `APIDataSource` | manual |
| `IN_MEMORY` — list / DataFrame | — | `InMemoryDataSource` | manual |
Supported targets
The engines listed in What you get are the supported output DBType values in graflo.onto (including PostgreSQL as a relational graph store). Each backend uses its own Connection implementation under the shared ConnectionManager / DBWriter / GraphEngine flow.
Graph sources (introspection and bulk export) are supported on Neo4j, ArangoDB, and the GraFlo file backend via GraphEngine.migrate_graph() and ConnectionManager.graph_export_flavors(). See Graph export and migration.
Core Concepts
Labeled Property Graphs
GraFlo targets the LPG model:
- Vertices — nodes with typed properties (manifest key: `properties`) and logical identity keys for upserts.
- Duplicate vertex property definitions are merged by name; conflicting typed duplicates are rejected.
- Identity fallback from all properties is opt-in via `VertexConfig.identity_from_all_properties` and is disabled by default.
- Edges — relationships between vertices (`directed: true` by default); set `directed: false` for logically undirected kinds. Relationship attributes are declared as `properties` on the logical edge (same list-of-names-or-`Field` shape as vertices). TigerGraph can project undirected edges to `UNDIRECTED EDGE` DDL or pair directed edges via `db_profile.edge_specs[*].reverse_edge` (`WITH REVERSE_EDGE`).
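The rule that duplicate vertex property definitions merge by name, while conflicting typed duplicates are rejected, can be sketched as follows. This is an illustration rather than GraFlo's code: `merge_properties` is a hypothetical helper, properties are modelled as `(name, type_or_None)` pairs, and letting a typed definition fill in for an untyped duplicate is an assumption the text does not spell out.

```python
from typing import Optional


def merge_properties(props: list) -> dict:
    """Merge (name, declared_type) pairs by name, rejecting type conflicts."""
    merged: dict = {}
    for name, declared in props:
        known: Optional[str] = merged.get(name)
        if known is not None and declared is not None and known != declared:
            raise ValueError(
                f"property {name!r} declared as both {known} and {declared}"
            )
        # Assumption: a typed definition wins over an untyped duplicate.
        merged[name] = known if known is not None else declared
    return merged


merged = merge_properties(
    [("id", "STRING"), ("age", None), ("age", "INT"), ("id", "STRING")]
)
```

With the input above, `merged` keeps one entry per name, and a later call that declares `age` as both `INT` and `FLOAT` raises `ValueError`.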
Schema
The Schema is the single source of truth for graph structure (not for ingestion transforms):
- Vertex definitions — vertex types, `properties` (optionally typed: `INT`, `FLOAT`, `STRING`, `DATETIME`, `BOOL`), identity, and filters. Secondary indexes and physical naming live under `schema.db_profile` (`DatabaseProfile`: e.g. `vertex_indexes`, `edge_specs`; see Backend indexes).
- Edge definitions — source/target (and optional `relation`), optional `directed` (default `true`), `properties` for relationship payload, and optional `identities` for parallel-edge / MERGE semantics.
- Schema inference — generate schemas from PostgreSQL 3NF databases (PK/FK heuristics) or from OWL/RDFS ontologies.
Resources and transforms are part of IngestionModel, not Schema.
IngestionModel
IngestionModel defines how source records are transformed into graph entities:
- Resources — reusable actor pipelines that map raw records to vertices and edges.
- Transforms — reusable named transforms referenced by resource steps.
Resource
A Resource is the central abstraction that bridges data sources and the graph schema. Each Resource defines a reusable pipeline of actors (descend, transform, vertex, edge) that maps raw records to graph elements. Data sources bind to Resources by name via the DataSourceRegistry, so the same transformation logic applies regardless of whether data arrives from a file, an API, or a SPARQL endpoint.
- `vertex` step `role`: assign a named accumulator slot to a static-type vertex step so multiple vertices of the same type in one flat row occupy distinct slots (e.g. `role: self`, `role: parent`, `role: child` all as `person`). Use `extraction_scope: mapped_only` to extract only explicit `from` mappings, or keep the default `extraction_scope: full` and use `keep_fields` to restrict passthrough.
- `edge` step `links`: declare multiple edge intents in one step — each list item emits one edge per row with its own `source_role` / `target_role` (or `source_type_field` / `target_type_field`) and `relation`.
- `source_role` / `target_role` on `edge` steps: role-first aliases for `source_type_field` / `target_type_field` when the slot was populated by a `vertex` + `role` or `vertex_router` + `role` step.
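The role and links mechanics above can be illustrated with a toy caster. This sketch borrows the key names `role`, `from`, `source_role`, `target_role`, and `relation` from the text, but the step dictionaries, the `cast_row` function, and the sample row are hypothetical; it behaves like the `mapped_only` case in that only explicitly mapped columns are extracted.

```python
row = {
    "id": "p1", "name": "Alice",
    "parent_id": "p0", "parent_name": "Bob",
    "child_id": "p2", "child_name": "Carol",
}

# Three vertex steps of the same type, each filling its own named slot.
vertex_steps = [
    {"type": "person", "role": "self", "from": {"id": "id", "name": "name"}},
    {"type": "person", "role": "parent", "from": {"id": "parent_id", "name": "parent_name"}},
    {"type": "person", "role": "child", "from": {"id": "child_id", "name": "child_name"}},
]

# One edge step with several link intents, each naming its source and target slot.
links = [
    {"source_role": "parent", "target_role": "self", "relation": "parent_of"},
    {"source_role": "self", "target_role": "child", "relation": "parent_of"},
]


def cast_row(row, vertex_steps, links):
    """Fill one slot per role, then emit one edge per link for this row."""
    slots = {}
    for step in vertex_steps:
        slots[step["role"]] = {
            vertex_field: row[column] for vertex_field, column in step["from"].items()
        }
    edges = [
        (slots[link["source_role"]]["id"], link["relation"], slots[link["target_role"]]["id"])
        for link in links
    ]
    return slots, edges


slots, edges = cast_row(row, vertex_steps, links)
```

One flat row therefore yields three `person` vertices and two `parent_of` edges without any ambiguity about which `person` each edge end refers to.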
For wide rows with many empty or null columns, drop_trivial_input_fields (default false) removes only top-level keys whose value is null or "" before the pipeline runs. The filter is shallow: nested dicts and lists are not walked, and empty {} / [] values are kept because they are not null or "". 0 and false are kept.
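The shallow filter described above is easy to state in code. The function below is a sketch of the stated semantics only, with a hypothetical name, and is not taken from GraFlo's source.

```python
def drop_trivial_top_level(row: dict) -> dict:
    """Drop top-level keys whose value is None or the empty string."""
    # Nested containers are not inspected, and 0, False, {} and [] survive.
    return {key: value for key, value in row.items() if value is not None and value != ""}


wide_row = {
    "a": None, "b": "", "c": 0, "d": False,
    "e": {}, "f": [], "g": {"inner": None}, "h": "x",
}
cleaned = drop_trivial_top_level(wide_row)
```

Here `cleaned` loses only `a` and `b`; the nested `None` under `g` stays because the filter never walks into nested values.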
For TigerGraph, optional attribute defaults belong in the covariant physical layer: schema.db_profile.default_property_values maps logical vertex/edge properties to YAML literals that GraFlo turns into GSQL DEFAULT clauses when defining the graph schema (same idea as CREATE VERTEX Sensor (id STRING PRIMARY KEY, reading FLOAT DEFAULT -1.0) in the TigerGraph schema reference).
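As a rough picture of how a defaults mapping could become a `DEFAULT` clause, the sketch below renders the `Sensor` statement quoted above from a small field list. The `render_create_vertex` function and its argument shapes are hypothetical and do not reflect how GraFlo generates GSQL. Only numeric defaults are handled, because the quoting rules for other literal types are not covered here.

```python
def format_numeric_default(value):
    """Render an int or float default; other literal types are out of scope."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        raise TypeError(f"unsupported default literal in this sketch: {value!r}")
    return str(value)


def render_create_vertex(name, fields, primary_key, defaults):
    """Build a CREATE VERTEX statement from (field, type) pairs and defaults."""
    parts = []
    for field_name, gsql_type in fields:
        part = f"{field_name} {gsql_type}"
        if field_name == primary_key:
            part += " PRIMARY KEY"
        elif field_name in defaults:
            part += f" DEFAULT {format_numeric_default(defaults[field_name])}"
        parts.append(part)
    return f"CREATE VERTEX {name} ({', '.join(parts)})"


ddl = render_create_vertex(
    "Sensor",
    fields=[("id", "STRING"), ("reading", "FLOAT")],
    primary_key="id",
    defaults={"reading": -1.0},
)
```

The resulting string matches the `Sensor` example from the TigerGraph schema reference cited in the paragraph above.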
DataSourceRegistry
The DataSourceRegistry manages AbstractDataSource adapters, each carrying a DataSourceType:
| `DataSourceType` | Adapter | Sources |
|---|---|---|
| `FILE` | `FileDataSource` | CSV, JSON, JSONL, Parquet files |
| `SQL` | `SQLDataSource` | PostgreSQL and other SQL databases via SQLAlchemy |
| `SPARQL` | `RdfFileDataSource` | Turtle/RDF/N3/JSON-LD files via rdflib |
| `SPARQL` | `SparqlEndpointDataSource` | Remote SPARQL endpoints (e.g. Apache Fuseki) via SPARQLWrapper |
| `API` | `APIConnector` / `APIDataSource` | REST APIs via `bindings`; offset / page / cursor pagination, env wiring |
| `IN_MEMORY` | `InMemoryDataSource` | Python objects (lists, DataFrames) |
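Binding sources to resources by name means the same mapping runs no matter where the records came from. The toy registry below illustrates that idea only; `ToyRegistry`, its methods, and `people_resource` are hypothetical and are not the `DataSourceRegistry` API.

```python
from collections import defaultdict


class ToyRegistry:
    """Keeps (source_type, records) pairs under the resource name they feed."""

    def __init__(self):
        self._sources = defaultdict(list)

    def register(self, resource_name, source_type, records):
        self._sources[resource_name].append((source_type, records))

    def sources_for(self, resource_name):
        return list(self._sources.get(resource_name, []))


def people_resource(record):
    """One mapping shared by every source bound to the 'people' resource."""
    return ("person", {"id": str(record["id"]), "name": record["name"].strip()})


registry = ToyRegistry()
registry.register("people", "FILE", [{"id": 1, "name": " Alice "}])
registry.register("people", "API", [{"id": "2", "name": "Bob"}])

vertices = [
    people_resource(record)
    for _source_type, records in registry.sources_for("people")
    for record in records
]
```

Both the file-style and the API-style record pass through `people_resource`, so they come out in the same normalized shape.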
GraphEngine
GraphEngine orchestrates end-to-end operations: schema/manifest inference, schema definition in the target database, connector creation from data sources, and data ingestion.
For PostgreSQL workflows, `infer_manifest(...)` returns a full manifest contract (`schema` + `ingestion_model` + `bindings`) and runs the `Sanitizer` for the target `DBType` on that manifest before returning.
Graph-source workflows (infer_schema_from_graph, migrate_graph) introspect Neo4j, ArangoDB, or a file backend and load into any supported target — including PostgreSQL as a relational graph store. See Graph export and migration.
More capabilities
- GraFlo ontology (manifest RDF) — Publish and query manifests as linked data: OWL vocabulary at `https://ontology.growgraph.dev/graflo` (v1.0.0), plus `manifest-to-rdf` / `rdf-to-manifest` CLI and `graflo.rdf` serializers. See GraFlo ontology.
- SPARQL & RDF — Endpoints and RDF files; optional OWL/RDFS domain schema inference (`rdflib`, `SPARQLWrapper` in the default install).
- Schema inference — From PostgreSQL-style 3NF layouts (PK/FK heuristics) or from OWL/RDFS (`owl:Class` → vertices, `owl:ObjectProperty` → edges, `owl:DatatypeProperty` → vertex fields). See Example 5.
- Graph export & migration — Introspect Neo4j or ArangoDB, export to a chunked file backend (`GraFloBackendConfig`), ingest manifest resources to disk, or migrate graph→graph / graph→PostgreSQL. See Graph export and migration and Example 13.
- Schema migrations — Plan and apply guarded schema deltas (`migrate_schema` console script → `graflo.cli.migrate_schema`; library in `graflo.migrate`). Compare `from`/`to` schemas before execution to preview deltas and blocked high-risk operations. See Concepts — Schema Migration.
- Typed `properties` — Optional field types (`INT`, `FLOAT`, `STRING`, `DATETIME`, `BOOL`) on vertices and edges.
- Batching & concurrency — Configurable batch sizes (`IngestionParams.batch_size`), bounded prefetch of upcoming batches (`IngestionParams.batch_prefetch`), worker counts (`IngestionParams.n_cores`), and DB write concurrency (`IngestionParams.max_concurrent_db_ops` / `DBWriter`).
- Ingestion scope filters — Optional subsets via `IngestionParams.resources`, `IngestionParams.connectors` (connector name or hash), and `IngestionParams.vertices`. `resources` and `connectors` intersect when both are set.
- Advanced filtering — Server-side filtering (e.g. TigerGraph REST++ API), client-side filter expressions, and `SelectSpec` for declarative SQL view/filter control before data reaches Resources.
- Blank vertices — Intermediate nodes for complex relationship modelling.
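The batching and bounded-prefetch behaviour listed under "Batching & concurrency" can be pictured with the standard library alone. This is a generic sketch of the technique, not GraFlo's implementation: `batched` and `prefetched` are hypothetical helpers, and their `batch_size` and `prefetch` arguments only mirror the roles that `IngestionParams.batch_size` and `IngestionParams.batch_prefetch` are described as playing.

```python
import queue
import threading
from itertools import islice


def batched(rows, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(rows)
    while batch := list(islice(iterator, batch_size)):
        yield batch


def prefetched(batches, prefetch):
    """Read ahead on a background thread, holding at most `prefetch` batches."""
    buffer = queue.Queue(maxsize=prefetch)
    done = object()  # sentinel marking the end of the stream

    def producer():
        try:
            for batch in batches:
                buffer.put(batch)  # blocks while the buffer is full
        finally:
            buffer.put(done)

    threading.Thread(target=producer, daemon=True).start()
    while (item := buffer.get()) is not done:
        yield item


result = list(prefetched(batched(range(7), batch_size=3), prefetch=2))
```

The bounded queue is what keeps memory use predictable: the producer stalls once `prefetch` batches are waiting, so a slow writer cannot cause the reader to buffer the whole source. A production version would also need to propagate producer exceptions to the consumer, which this sketch leaves out.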
Quick Links
- Installation
- Quick Start Guide
- Graph export and migration
- Concepts (architecture diagrams)
- GraFlo ontology — manifest ↔ RDF
- Concepts — Schema Migration
- Concepts — Comparing Two Schemas
- API Reference
- Examples
Note: Mermaid diagrams are kept in section pages (for example `concepts/`) rather than on this landing page.
Use Cases
- Data Migration — Transform relational data into LPG structures. Infer schemas from PostgreSQL 3NF databases and migrate data directly. Export or migrate between graph databases (Neo4j, ArangoDB) or into PostgreSQL relational graph tables.
- RDF-to-LPG — Read RDF triples from files or SPARQL endpoints, auto-infer schemas from OWL ontologies, and ingest into ArangoDB, Neo4j, etc.
- Knowledge Graphs — Build knowledge representations from heterogeneous sources (SQL, files, APIs, RDF/SPARQL).
- Data Integration — Combine multiple data sources into a unified labeled property graph.
- Graph Views — Create graph views of existing PostgreSQL databases without schema changes.
Requirements
- Python 3.11 or higher (3.11 and 3.12 officially supported)
- A graph database (ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, or NebulaGraph) as target, or PostgreSQL for relational graph storage
- Optional: PostgreSQL for SQL data sources and 3NF schema inference
- Optional extras (see Installation): `dev` (tests and typing), `docs` (MkDocs), `plot` (`plot_manifest` via `pygraphviz`; system Graphviz required)
- Full dependency list in `pyproject.toml`
Contributing
We welcome contributions! Please check out our Contributing Guide for details on how to get started.