# Features, migration, and practices

Product capabilities, the `migrate_schema` workflow, performance levers, and authoring best practices.
## Key Features

### Schema & Abstraction

- Declarative LPG schema — `Schema` defines vertices, edges, identity rules, and edge `properties` in YAML or Python; the single source of truth for graph structure. Transforms/resources are defined in `IngestionModel`.
- Database abstraction — one logical schema, multiple backends; each target uses its own `Connection` type behind `ConnectionManager`/`DBWriter`, with DB-specific behavior applied in DB-aware projection (`Schema.resolve_db_aware(...)`, `VertexConfigDBAware`, `EdgeConfigDBAware`).
- Resource abstraction — each `Resource` is a reusable actor pipeline that maps raw records to graph elements, decoupled from data retrieval.
- DataSourceRegistry — pluggable `AbstractDataSource` adapters (`FILE`, `SQL`, `API`, `SPARQL`, `IN_MEMORY`) bound to Resources by name.
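As a rough sketch, a declarative schema of this kind might look as follows in YAML. The keys below are illustrative assumptions, not graflo's confirmed syntax; consult the schema reference for the exact field names:

```yaml
# Hypothetical keys for illustration only; verify against the Schema docs.
vertices:
  - name: person
    fields:
      - {name: person_id, type: INT}
      - {name: full_name, type: STRING}
    identity: [person_id]
  - name: company
    fields:
      - {name: company_id, type: INT}
    identity: [company_id]
edges:
  - source: person
    target: company
    relation: WORKS_AT
    properties:
      - {name: since, type: DATETIME}
```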
### Schema Features

- Flexible Identity + Indexing — logical identity plus DB-specific secondary indexes (`schema.db_profile.vertex_indexes`, `edge_specs`, …).
- Typed properties — optional type information on vertex and edge `properties` (INT, FLOAT, STRING, DATETIME, BOOL).
- Hierarchical Edge Definition — define edges at any level of nested documents (via resource edge steps and actors).
- Relationship payload — logical edges declare `properties`; additional payload from vertices or row shape is wired in edge actors (`vertex_weights`, maps, etc.) with optional types.
- Blank Vertices — create intermediate vertices for complex relationships.
- Actor Pipeline — process documents through a sequence of specialised actors (descend, transform, vertex, edge).
- Reusable Transforms — define and reference transformations by name across Resources.
- Vertex Filtering — filter vertices based on custom conditions.
- PostgreSQL Schema Inference — infer schemas from normalised PostgreSQL databases (3NF) with PK/FK constraints.
- RDF / OWL Schema Inference — infer schemas from OWL/RDFS ontologies: `owl:Class` → vertices, `owl:ObjectProperty` → edges, `owl:DatatypeProperty` → vertex properties.
- SelectSpec — declarative SQL view on top of `TableConnector` (`view` field): `kind="type_lookup"` for polymorphic relation rows joined to type lookup table(s), or `kind="select"` for a full `from`/`joins`/`select`/`where`. See Table connector views and SelectSpec.
## Schema Migration (v1)

- Read-only planning first — use `migrate_schema plan --from-schema-path ... --to-schema-path ...` to generate a deterministic operation plan before any writes.
- Risk-gated execution — v1 executes only low-risk additive operations by default and blocks high-risk/destructive operations.
- Backend scope — execution adapters are currently focused on ArangoDB and Neo4j; other backends are plan-first until adapter coverage is added.
- History and idempotency — applied revisions are tracked in a migration manifest (`.graflo/migrations.json`) with revision + schema hash checks.
- Operational commands — `plan`, `apply`, `status`, and `history` are exposed through the `migrate_schema` CLI entrypoint.
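The risk-gating idea can be pictured with a small standalone sketch. This is not graflo's implementation (its operation and risk types are internal); it only illustrates partitioning a plan into runnable and blocked operations:

```python
# Illustrative only: shows the gating concept, not graflo's planner.
from dataclasses import dataclass


@dataclass
class Operation:
    description: str
    risk: str  # "low" or "high"


def gate_plan(operations, allow_high_risk=False):
    """Split a plan into operations to execute now and operations to block."""
    runnable, blocked = [], []
    for op in operations:
        if op.risk == "low" or allow_high_risk:
            runnable.append(op)
        else:
            blocked.append(op)
    return runnable, blocked


plan = [
    Operation("add vertex type 'device'", "low"),
    Operation("add index on person.email", "low"),
    Operation("drop property person.ssn", "high"),
]
runnable, blocked = gate_plan(plan)
```

With the defaults, only the two additive operations would run; the destructive drop stays blocked until explicitly approved.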
### Comparing Two Schemas

When you compare schemas, treat it like comparing two building blueprints:

- `--from-schema-path` is the current building blueprint.
- `--to-schema-path` is the target building blueprint.
- `migrate_schema plan` is the architectural diff report that tells you what must be added, changed, or removed to get from current to target.
Another useful analogy is `git diff`, but for graph structure:
- Additive changes (new vertex type, new edge, new property, new index) are similar to adding code in a backward-compatible way.
- Destructive changes (removing properties/types, identity shifts) are similar to breaking API changes: they often require explicit migration steps, data sweeps, or rollouts.
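Under that analogy, a structural diff can be sketched as set comparisons over element names. This is a deliberate simplification; the real planner also compares identity rules, indexes, and property types:

```python
# Toy structural diff between two schema descriptions, represented as
# plain dicts mapping element kind -> set of element names.
def diff_schemas(current, target):
    additive, destructive = [], []
    for kind in sorted(set(current) | set(target)):
        cur = current.get(kind, set())
        tgt = target.get(kind, set())
        additive += [f"add {kind} {name}" for name in sorted(tgt - cur)]
        destructive += [f"remove {kind} {name}" for name in sorted(cur - tgt)]
    return additive, destructive


v1 = {"vertex": {"person", "company"}, "edge": {"works_at"}}
v2 = {"vertex": {"person", "company", "device"}, "edge": {"works_at", "owns"}}
additive, destructive = diff_schemas(v1, v2)
```

Here everything in the delta is additive, which is exactly the kind of plan v1 executes by default.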
Practical comparison checklist:

- Run `plan` first and review operations grouped by risk.
- Confirm identity changes explicitly (identity shifts are high-impact).
- Validate whether each blocked operation needs a manual script, staged rollout, or explicit high-risk approval.
- Use `apply --dry-run` before any real apply.
Example:

```bash
uv run migrate_schema plan \
  --from-schema-path schema_v1.yaml \
  --to-schema-path schema_v2.yaml \
  --output-format json
```
How to read the output:

- `operations`: runnable operations under the current risk policy (v1 defaults to a low-risk subset).
- `blocked_operations`: operations intentionally withheld for safety.
- `warnings`: policy and compatibility notes you should resolve before execution.
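With JSON output, the plan can be checked in CI before anyone runs `apply`. A minimal sketch, assuming a JSON shape derived from the field names above (verify the actual structure against your graflo version's output):

```python
import json

# Assumed plan shape for illustration; the real JSON layout may differ.
plan_json = """
{
  "operations": [{"kind": "add_index", "target": "person.email"}],
  "blocked_operations": [{"kind": "drop_property", "target": "person.ssn"}],
  "warnings": ["property person.ssn exists in current schema only"]
}
"""

plan = json.loads(plan_json)


def safe_to_auto_apply(plan):
    """A plan is safe for unattended apply only if nothing is blocked
    and no warnings remain unresolved."""
    return not plan["blocked_operations"] and not plan["warnings"]


safe = safe_to_auto_apply(plan)
```

A CI job could fail the build when `safe` is false, forcing a human review of the blocked operations.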
### Migration Command Examples

```bash
# Plan changes between two schema versions
uv run migrate_schema plan \
  --from-schema-path schema_v1.yaml \
  --to-schema-path schema_v2.yaml

# Dry-run apply to inspect backend actions
uv run migrate_schema apply \
  --from-schema-path schema_v1.yaml \
  --to-schema-path schema_v2.yaml \
  --db-config-path db.yaml \
  --revision 0001_additive_updates \
  --dry-run

# Persist migration history after real execution
uv run migrate_schema apply \
  --from-schema-path schema_v1.yaml \
  --to-schema-path schema_v2.yaml \
  --db-config-path db.yaml \
  --revision 0001_additive_updates \
  --no-dry-run

# Inspect migration state
uv run migrate_schema status
uv run migrate_schema history
```
### Why This Helps
Schema comparison gives you a predictable transition path between versions. Instead of discovering incompatibilities during ingestion, you see structural deltas in advance, gate risky steps, and execute a controlled rollout.
## Performance Optimization

- Batch Processing: process large datasets in configurable batches (`IngestionParams.batch_size` on `Caster`/`GraphEngine`).
- Batch Prefetch: while one batch is cast and written, `Caster.process_data_source` can prefetch up to `IngestionParams.batch_prefetch` additional batches from `AbstractDataSource.iter_batches` (bounded memory, overlapped I/O).
- Parallel Execution: utilize multiple cores for faster processing (`n_cores` parameter of `Caster`).
- Efficient Resource Handling: optimized processing of both table and tree-like data.
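The prefetch lever can be pictured as a bounded producer/consumer queue: a background reader stays at most `batch_prefetch` batches ahead of the consumer, so memory stays bounded while I/O overlaps with processing. A generic sketch of that pattern (not graflo's internals):

```python
import queue
import threading


def prefetched(batches, prefetch=2):
    """Yield batches while a background thread reads ahead, keeping at
    most `prefetch` undelivered batches buffered in memory."""
    q = queue.Queue(maxsize=prefetch)
    _done = object()

    def reader():
        for batch in batches:
            q.put(batch)  # blocks when the queue is full -> bounded memory
        q.put(_done)

    threading.Thread(target=reader, daemon=True).start()
    while (item := q.get()) is not _done:
        yield item


result = list(prefetched(iter([[1, 2], [3, 4], [5]]), prefetch=1))
```

The blocking `put` on a size-limited queue is what enforces the memory bound; the consumer never sees more than `prefetch` batches of backlog.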
## Best Practices

- Use compound identity fields for natural keys, and `schema.db_profile` secondary indexes for query performance.
- Leverage blank vertices for complex relationship modeling.
- Define reusable transforms in `ingestion_model.transforms` and reference them from resource steps.
- Configure appropriate batch sizes based on your data volume.
- Enable parallel processing for large datasets.
- Choose the right relationship attribute based on your data format:
    - `relation_field` on an edge actor step — relation from a column/field.
    - `relation_from_key` on an edge actor step — relation from JSON keys.
    - `relation` on the logical edge — static relationship name when applicable.
- Use logical edge `properties` (and edge-actor payload options) for temporal or quantitative relationship attributes.
- Specify types when the target DB requires them (e.g., TigerGraph).
- Use typed `Field` objects or dicts with a `type` key for better validation.
- Leverage key matching (`match_source`, `match_target`) on edge steps for complex matching scenarios.
- Use PostgreSQL schema inference for automatic schema generation from normalized databases (3NF) with proper PK/FK constraints.
- Use RDF/OWL schema inference (`infer_schema_from_rdf`) when ingesting data from SPARQL endpoints or `.ttl` files with a well-defined ontology.
- Specify property types for better validation and database-specific optimizations, especially when targeting TigerGraph.
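The typed-property advice can be illustrated with a minimal validator sketch. graflo's own `Field` validation is richer; this only shows why a declared `type` key catches bad values early, using the type names listed above:

```python
# Minimal illustration: check a property value against a declared type
# from the documented set (INT, FLOAT, STRING, DATETIME, BOOL).
from datetime import datetime

TYPE_CHECKS = {
    "INT": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "FLOAT": lambda v: isinstance(v, float),
    "STRING": lambda v: isinstance(v, str),
    "DATETIME": lambda v: isinstance(v, datetime),
    "BOOL": lambda v: isinstance(v, bool),
}


def validate(field, value):
    """Check a {'name': ..., 'type': ...} field dict against a value."""
    return TYPE_CHECKS[field["type"]](value)


ok = validate({"name": "age", "type": "INT"}, 42)
bad = validate({"name": "age", "type": "INT"}, True)  # bool is not an INT here
```

Catching a mistyped value at cast time is much cheaper than discovering it as a load error in a strictly typed backend such as TigerGraph.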