# Concepts
GraFlo is a Graph Schema Transformation Language (GSTL) for Labeled Property Graphs (LPG). As a domain-specific language (DSL), it separates graph schema definition from data-source binding and database targeting, enabling a single declarative specification to drive ingestion across heterogeneous sources and databases while keeping transformation logic portable across vendors.
## System Overview
The GraFlo pipeline transforms data in six stages, with the manifest acting as the contract boundary:
```mermaid
%%{ init: {
  "theme": "base",
  "themeVariables": {
    "primaryColor": "#90CAF9",
    "primaryTextColor": "#111111",
    "primaryBorderColor": "#1E88E5",
    "lineColor": "#546E7A",
    "secondaryColor": "#A5D6A7",
    "tertiaryColor": "#CE93D8"
  }
} }%%
flowchart LR
    MF["<b>GraphManifest</b><br/>schema + ingestion_model + bindings"]
    SI["<b>Source Instance</b><br/>File · SQL · SPARQL · API"]
    R["<b>Resource</b><br/>Actor Pipeline"]
    EX["<b>Extraction</b><br/>Observations + Edge Intents"]
    AS["<b>Assembly</b><br/>Graph Entity Materialization"]
    GS["<b>Schema (logical)</b><br/>Vertex/Edge Definitions<br/>Identities · DB Profile"]
    IM["<b>IngestionModel</b><br/>Resources · Transforms"]
    BD["<b>Bindings</b><br/>Resource -> Data Source mapping"]
    GC["<b>GraphContainer</b><br/>Database-Independent Representation"]
    DB["<b>Graph DB (LPG)</b><br/>ArangoDB · Neo4j · TigerGraph · Others"]
    MF --> GS
    MF --> IM
    MF --> BD
    SI --> R --> EX --> AS --> GC --> DB
    IM -. configures .-> R
    GS -. constrains .-> AS
    BD -. routes sources .-> R
```
- Source Instance — a concrete data artifact (a file, a table, a SPARQL endpoint), wrapped by an `AbstractDataSource` with a `DataSourceType` (`FILE`, `SQL`, `SPARQL`, `API`, `IN_MEMORY`).
- Resource — a reusable transformation pipeline (actor steps: descend, transform, vertex, edge) that maps raw records to graph elements. Data sources bind to Resources by name via the `DataSourceRegistry`.
- GraphManifest — the canonical top-level contract that composes `schema`, `ingestion_model`, and `bindings`. High-level contract evolution (removing/merging vertex types while keeping ingestion aligned) is described in Manifest evolution.
- Schema — the declarative logical graph model (`Schema`): vertex/edge definitions, identities, typed `properties`, and the DB profile.
- IngestionModel — reusable resources and transforms used to map records into graph entities.
- Bindings — a named list of `FileConnector`/`TableConnector`/`SparqlConnector` entries plus `resource_connector` (many rows per resource are allowed: resource → 0..n connectors) and optional `connector_connection` (connector name or hash → `conn_proxy`, for runtime `ConnectionProvider` resolution without secrets in the manifest). An optional `staging_proxy` maps logical staging profile names to `conn_proxy` keys for TigerGraph bulk S3 upload (credentials come via `S3GeneralizedConnConfig`, not YAML). Staging is separate from ingestion connectors; see Object storage (S3 staging). Each connector exposes a bound source modality (`BoundSourceKind`: file, SQL table, SPARQL) for dispatch, distinct from the abstract ingestion Resource. See TigerGraph bulk load.
- Database-Independent Graph Representation — a `GraphContainer` of vertices and edges, independent of any target database.
- Graph DB — the target LPG store (ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, NebulaGraph).
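Put together, a manifest composes these three parts. The sketch below is hypothetical: the top-level sections (`schema`, `ingestion_model`, `bindings`) and names such as `resource_connector` come from the concepts above, but the nested key names and shapes are illustrative assumptions, not verbatim GraFlo manifest syntax.

```yaml
# Illustrative sketch only: top-level sections mirror the concepts above;
# nested key names are assumed, not verbatim GraFlo manifest syntax.
schema:
  vertices:
    - name: person
      identity: [id]
      properties:
        id: string
        name: string
  edges:
    - name: knows
      source: person
      target: person
ingestion_model:
  resources:
    - name: people            # reusable actor pipeline
      actors:
        - vertex: person
bindings:
  connectors:
    - name: people_csv        # a FileConnector-style binding
      path: data/people.csv
  resource_connector:
    - resource: people        # resource -> 0..n connectors
      connector: people_csv
```

Because the schema, the ingestion model, and the bindings are separate sections, the same schema and resource pipeline can be re-pointed at a different source by editing only `bindings`.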
## Data flow detail
The diagram below shows how different source instances (files, SQL tables, RDF/SPARQL) flow through the `DataSourceRegistry` into the shared Resource pipeline.
```mermaid
flowchart LR
    subgraph sources [Data Sources]
        TTL["*.ttl / *.rdf files"]
        Fuseki["SPARQL Endpoint<br/>(Fuseki)"]
        Files["CSV / JSON files"]
        PG["PostgreSQL"]
    end
    subgraph bindings [Bindings]
        FP[FileConnector]
        TP[TableConnector]
        SP[SparqlConnector]
    end
    subgraph datasources [DataSource Layer]
        subgraph rdfFamily ["RdfDataSource (abstract)"]
            RdfDS[RdfFileDataSource]
            SparqlDS[SparqlEndpointDataSource]
        end
        FileDS[FileDataSource]
        SQLDS[SQLDataSource]
    end
    subgraph pipeline [Shared Pipeline]
        Sch[Schema]
        Res[Resource Pipeline]
        Ex[Extraction Phase]
        Asm[Assembly Phase]
        GC[GraphContainer]
        DBW[DBWriter]
    end
    TTL --> SP --> RdfDS --> Res
    Fuseki --> SP --> SparqlDS --> Res
    Files --> FP --> FileDS --> Res
    PG --> TP --> SQLDS --> Res
    Sch --> Res
    Sch --> Asm
    Res --> Ex --> Asm --> GC --> DBW
```
- Bindings (`FileConnector`, `TableConnector`, `SparqlConnector`) describe where data comes from (file paths, SQL tables, SPARQL endpoints). Multiple connectors may attach to the same ingestion resource name; optional `connector_connection` entries assign each SQL/SPARQL connector a `conn_proxy` by connector `name` or `hash` (not by resource name). The `ConnectionProvider` turns that label into real connection config at runtime, so manifests stay credential-free.
- DataSources (`AbstractDataSource` subclasses) handle how to read data in batches. Each carries a `DataSourceType` and is registered in the `DataSourceRegistry`.
- Resources define what to extract — each `Resource` is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements. Optional `drop_trivial_input_fields: true` removes top-level keys whose value is `null` or `""` before actors run (shallow only; `0` and `false` stay). TigerGraph physical defaults for missing attributes belong in `schema.db_profile.default_property_values` (GSQL `DEFAULT` at DDL time), not in the covariant `GraphContainer` assembly path.
- GraphContainer (covariant graph representation) collects the resulting vertices and edges in a database-independent format.
- DBWriter pushes the graph data into the target LPG store (ArangoDB, Neo4j, TigerGraph, FalkorDB, Memgraph, NebulaGraph).
- Document cast errors — when a single source document fails inside a resource, `IngestionParams.on_doc_error` chooses skip vs fail-the-batch; optional gzip JSONL persistence uses `doc_error_sink_path` (CLI `ingest --doc-error-sink`). Details: Document cast errors and doc error sink.
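The shallow `drop_trivial_input_fields` filtering described above can be sketched in plain Python. This is a hypothetical re-implementation of the documented semantics for illustration, not GraFlo's actual code:

```python
def drop_trivial_input_fields(record: dict) -> dict:
    """Remove top-level keys whose value is None or "" (shallow only).

    Falsy-but-meaningful values such as 0 and False are kept, and trivial
    values nested below the top level are left untouched.
    """
    return {k: v for k, v in record.items() if v is not None and v != ""}

row = {"id": 7, "name": "", "active": False, "score": 0, "note": None,
       "nested": {"x": None}}  # nested trivial value survives (shallow)
print(drop_trivial_input_fields(row))
# → {'id': 7, 'active': False, 'score': 0, 'nested': {'x': None}}
```

Running the filter before the actor steps means vertex and edge actors never see empty-string or null top-level fields, while legitimate zeros and booleans still reach the graph.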
## Minimal canonical config contract
GraFlo serializes configuration models in a minimal canonical form by default:
- fields equal to their defaults are omitted;
- `None` values are omitted;
- aliases and normalized DSL shapes are used.
This is intentional for lightweight manifests and LLM-oriented workflows.
The guaranteed invariant is a semantic, idempotent canonical round-trip
(parse -> minimal dump -> parse), not preservation of the authored text style.
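As an illustration of this contract (the field names below are hypothetical), an authored fragment and its minimal canonical dump might relate as follows:

```yaml
# Authored form (hypothetical field names):
resource:
  name: people
  drop_trivial_input_fields: false  # equals the default -> omitted on dump
  description: null                 # None -> omitted on dump
---
# Minimal canonical dump of the same model:
resource:
  name: people
```

Both documents parse to the same model, so dumping and re-parsing is idempotent even though the authored text (comments, explicit defaults) is not preserved.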
## In depth
The overview above is continued in dedicated pages (formerly a single long document):
- Architecture diagrams — class-level Mermaid views of `GraphEngine`, `Schema`/`IngestionModel`, `Caster`, and DataSources vs Resources
- Core components — schema, ingestion, edges, DataSources, resources, actors, location scoping, transforms
- Features, migration, and practices — product features, the `migrate_schema` CLI, performance notes, best practices
Focused topics: Transforms, Table connector views, Backend indexes, Ingestion doc errors, Object storage (S3 staging), Manifest evolution.