# Architecture diagrams
Class-level Mermaid views of orchestration (GraphEngine), logical schema vs ingestion (Schema, IngestionModel), the Caster pipeline, and how DataSources relate to Resources.
## Class Diagrams
### GraphEngine orchestration
GraphEngine is the top-level orchestrator that coordinates schema inference,
connector creation, schema definition, and data ingestion. The diagram below shows
how it delegates to specialised components.
```mermaid
classDiagram
    direction TB
    class GraphEngine {
        +target_db_flavor: DBType
        +resource_mapper: ResourceMapper
        +introspect(postgres_config) SchemaIntrospectionResult
        +infer_manifest(postgres_config) GraphManifest
        +create_bindings(postgres_config, ...) Bindings
        +infer_schema_from_rdf(source) tuple~Schema, IngestionModel~
        +create_bindings_from_rdf(source) Bindings
        +define_schema(manifest, target_db_config)
        +define_and_ingest(manifest, target_db_config, ...)
        +ingest(manifest, target_db_config, ...)
    }
    class SQLInferenceManager {
        +conn: PostgresConnection
        +target_db_flavor: DBType
        +introspect(schema_name) SchemaIntrospectionResult
        +infer_artifacts(schema_name) SQLInferenceArtifacts
        +infer_complete_schema(schema_name) tuple~Schema, IngestionModel~
    }
    class Sanitizer {
        +db_flavor: DBType
        +sanitize_manifest(manifest) GraphManifest
    }
    class ResourceMapper {
        +create_bindings_from_postgres(conn, ...) Bindings
    }
    class Caster {
        +schema: Schema
        +ingestion_model: IngestionModel
        +ingestion_params: IngestionParams
        +ingest(target_db_config, bindings, ...)
    }
    class ConnectionManager {
        +connection_config: DBConfig
        +init_db(schema, recreate_schema)
        +clear_data(schema)
    }
    class Schema {
        «see Schema diagram»
    }
    class GraphManifest {
        +schema: Schema?
        +ingestion_model: IngestionModel?
        +bindings: Bindings?
        +finish_init()
    }
    class Bindings {
        +connectors: list~ResourceConnector~
        +resource_connector: list~ResourceConnectorBinding~
        +connector_connection: list~ConnectorConnectionBinding~
        +get_connectors_for_resource(name) list
        +get_conn_proxy_for_connector(connector) str?
        +bind_connector_to_conn_proxy(connector, conn_proxy)
    }
    class DBConfig {
        <<abstract>>
        +uri: str
        +effective_schema: str?
        +connection_type: DBType
    }
    GraphEngine --> SQLInferenceManager : creates for introspect / infer_artifacts
    GraphEngine --> ResourceMapper : resource_mapper
    GraphEngine --> Sanitizer : infer_manifest() applies target flavor
    GraphEngine --> Caster : creates for ingest
    GraphEngine --> ConnectionManager : creates for define_schema
    GraphEngine ..> GraphManifest : produces / consumes
    GraphEngine ..> Bindings : produces / consumes
    GraphEngine ..> DBConfig : target_db_config
```
`SQLInferenceManager` performs introspection and schema/resource inference only (no `Sanitizer`). Use `GraphEngine.infer_manifest`, or call `Sanitizer.sanitize_manifest` on a composed `GraphManifest`, when you need target-DB normalization.
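The delegation pattern in the diagram can be reduced to a toy sketch: the engine owns no inference or sanitization logic itself; it only wires components together. All classes below are simplified stand-ins for illustration, not the real API.

```python
class ToyInferenceManager:
    """Stand-in for SQLInferenceManager: yields a raw schema + ingestion model."""

    def infer_complete_schema(self, schema_name):
        return {"vertices": ["User Account"]}, {"resources": [schema_name]}


class ToySanitizer:
    """Stand-in for Sanitizer: normalizes names for the target DB flavor."""

    def sanitize_manifest(self, manifest):
        manifest["schema"]["vertices"] = [
            v.lower().replace(" ", "_") for v in manifest["schema"]["vertices"]
        ]
        return manifest


class ToyGraphEngine:
    """Orchestrator only: delegates inference, then applies sanitization."""

    def __init__(self, target_db_flavor):
        self.target_db_flavor = target_db_flavor

    def infer_manifest(self, schema_name):
        schema, ingestion_model = ToyInferenceManager().infer_complete_schema(schema_name)
        manifest = {"schema": schema, "ingestion_model": ingestion_model}
        # sanitize_manifest is applied here, mirroring GraphEngine.infer_manifest
        return ToySanitizer().sanitize_manifest(manifest)


manifest = ToyGraphEngine("neo4j").infer_manifest("public")
```

The point of the split is testability: inference can be exercised against a database fixture without any target-DB naming rules, and sanitization can be unit-tested on a composed manifest.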
### Schema architecture
Schema and IngestionModel split logical graph structure from ingestion
runtime pipelines. The diagram below shows their constituent parts and
relationships.
```mermaid
classDiagram
    direction TB
    class Schema {
        +metadata: GraphMetadata
        +core_schema: CoreSchema
        +db_profile: DatabaseProfile
        +finish_init()
        +remove_disconnected_vertices()
        +resolve_db_aware(db_flavor?) SchemaDBAware
    }
    class CoreSchema {
        +vertex_config: VertexConfig
        +edge_config: EdgeConfig
        +finish_init()
    }
    class IngestionModel {
        +resources: list~Resource~
        +transforms: list~ProtoTransform~
        +finish_init(core_schema)
        +fetch_resource(name) Resource
    }
    class GraphMetadata {
        +name: str
        +version: str?
        +description: str?
    }
    class VertexConfig {
        +vertices: list~Vertex~
        +blank_vertices: list~Vertex~
    }
    class Vertex {
        +name: str
        +identity: list~str~
        +properties: list~Field~
        +filters: FilterExpression?
    }
    class Field {
        +name: str
        +type: FieldType?
    }
    class EdgeConfig {
        +edges: list~Edge~
        +extra_edges: list~Edge~
    }
    class Edge {
        +source: str
        +target: str
        +identities: list~list~str~~
        +properties: list~Field~
        +relation: str?
        +filters: FilterExpression?
    }
    class Resource {
        +name: str
        +root: ActorWrapper
        +executor: ActorExecutor
        +finish_init(vertex_config, edge_config, transforms)
    }
    class ActorWrapper {
        +actor: Actor
        +children: list~ActorWrapper~
    }
    note for ActorWrapper "Recursive tree: each<br />child is an ActorWrapper"
    class ActorExecutor {
        +extract(doc) ExtractionContext
        +assemble(extraction_ctx) dict
        +assemble_result(extraction_ctx) GraphAssemblyResult
    }
    class Actor {
        <<abstract>>
    }
    class VertexActor
    class EdgeActor
    class VertexRouterActor
    class TransformActor
    class DescendActor
    class ProtoTransform {
        +name: str
    }
    class ExtractionContext {
        +acc_vertex: map
        +transform_buffer: map
        +obs_buffer: map
        +edge_intents: list~EdgeIntent~
    }
    class AssemblyContext {
        +extraction: ExtractionContext
        +acc_global: map
    }
    class VertexObservation
    class TransformObservation
    class EdgeIntent
    class ProvenancePath
    class GraphAssemblyResult
    class FilterExpression {
        +kind: leaf | composite
        +from_dict(data) FilterExpression
    }
    Schema *-- GraphMetadata : metadata
    Schema *-- CoreSchema : core_schema
    CoreSchema *-- VertexConfig : vertex_config
    CoreSchema *-- EdgeConfig : edge_config
    IngestionModel *-- "0..*" Resource : resources
    IngestionModel *-- "0..*" ProtoTransform : transforms
    VertexConfig *-- "0..*" Vertex : vertices
    Vertex *-- "0..*" Field : properties
    Vertex --> FilterExpression : filters
    EdgeConfig *-- "0..*" Edge : edges
    Edge *-- "0..*" Field : properties
    Edge --> FilterExpression : filters
    Resource *-- ActorWrapper : root
    Resource *-- ActorExecutor : runtime orchestration
    ActorWrapper --> Actor : actor
    ActorExecutor ..> ExtractionContext : produces
    ActorExecutor ..> AssemblyContext : consumes
    ExtractionContext o-- VertexObservation
    ExtractionContext o-- TransformObservation
    ExtractionContext o-- EdgeIntent
    EdgeIntent --> ProvenancePath
    ActorExecutor ..> GraphAssemblyResult : produces
    Actor <|-- VertexActor
    Actor <|-- EdgeActor
    Actor <|-- VertexRouterActor
    Actor <|-- TransformActor
    Actor <|-- DescendActor
```
Runtime detail: resource processing now uses an explicit two-phase flow
(ExtractionContext -> AssemblyContext). Extraction records typed artifacts
(VertexObservation, TransformObservation, EdgeIntent), and assembly turns
those artifacts into graph entities. Orchestration is owned by
ActorExecutor, while ActorWrapper remains focused on actor tree behavior.
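As a rough illustration of this two-phase flow, the sketch below separates extraction (recording typed artifacts) from assembly (turning artifacts into graph entities). The names mirror the diagram, but the classes and the example record shape are simplified stand-ins, not the real API.

```python
from dataclasses import dataclass, field


@dataclass
class VertexObservation:
    """Typed artifact recorded during extraction (phase 1)."""
    label: str
    identity: dict


@dataclass
class EdgeIntent:
    """Deferred edge: recorded at extraction time, resolved at assembly time."""
    source: str
    target: str
    relation: str


@dataclass
class ExtractionContext:
    """Phase-1 output: artifacts only, no graph entities yet."""
    observations: list = field(default_factory=list)
    edge_intents: list = field(default_factory=list)


def extract(doc: dict) -> ExtractionContext:
    """Phase 1: walk the record and record typed artifacts."""
    ctx = ExtractionContext()
    ctx.observations.append(VertexObservation("Person", {"id": doc["id"]}))
    for friend in doc.get("friends", []):
        ctx.observations.append(VertexObservation("Person", {"id": friend}))
        ctx.edge_intents.append(EdgeIntent(doc["id"], friend, "KNOWS"))
    return ctx


def assemble(ctx: ExtractionContext) -> dict:
    """Phase 2: turn recorded artifacts into concrete graph entities."""
    return {
        "vertices": [(o.label, o.identity) for o in ctx.observations],
        "edges": [(e.source, e.relation, e.target) for e in ctx.edge_intents],
    }


graph = assemble(extract({"id": "a", "friends": ["b"]}))
```

Keeping edges as intents until assembly is what lets extraction stay a pure, per-record pass while assembly resolves cross-record concerns.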
### Logical schema vs DB-aware projection
GraFlo now keeps logical graph modeling separate from DB materialization:
- `Vertex`, `Edge`, `VertexConfig`, and `EdgeConfig` are logical and backend-agnostic.
- DB-specific naming/defaults/index projection is resolved through `VertexConfigDBAware` and `EdgeConfigDBAware`.
- The resolver entrypoint is `Schema.resolve_db_aware(...)`, used by DB write/connector stages.
```mermaid
flowchart TD
    schema[LogicalSchema]
    vcfg[VertexConfigLogical]
    ecfg[EdgeConfigLogical]
    dbfeat[DatabaseProfile]
    resolver[DbAwareConfigResolver]
    vdb[VertexConfigDBAware]
    edb[EdgeConfigDBAware]
    caster[CasterAndResources]
    dbwriter[DBWriterAndBindings]
    schema --> vcfg
    schema --> ecfg
    schema --> caster
    schema --> resolver
    dbfeat --> resolver
    resolver --> vdb
    resolver --> edb
    vdb --> dbwriter
    edb --> dbwriter
```
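The resolution step can be sketched as a small function: logical vertex configs stay backend-agnostic, and backend naming rules are applied only at resolution time. The `label_prefix` / `default_index` profile keys below are invented for illustration and do not reflect the actual `DatabaseProfile` fields.

```python
def resolve_db_aware(vertex_config: list, db_profile: dict) -> list:
    """Toy resolver: project logical vertex configs into DB-aware ones.

    Mirrors the flow above: a DatabaseProfile-like dict supplies backend
    naming and index defaults; the logical config is never mutated.
    """
    prefix = db_profile.get("label_prefix", "")
    default_index = db_profile.get("default_index")
    return [
        {
            # backend-specific naming applied only here
            "label": prefix + v["name"],
            # fall back to the profile's default index when none is declared
            "indexes": v.get("indexes") or ([default_index] if default_index else []),
        }
        for v in vertex_config
    ]


logical = [{"name": "person"}, {"name": "company", "indexes": ["vat_id"]}]
profile = {"label_prefix": "g_", "default_index": "id"}
db_aware = resolve_db_aware(logical, profile)
```

Because the projection is a pure function of (logical config, profile), the same `Schema` can be resolved for several target databases without copying it.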
### Caster ingestion pipeline
Caster is the ingestion workhorse. It builds a DataSourceRegistry via
RegistryBuilder, casts each batch of source data into a GraphContainer,
and hands that container to DBWriter which pushes vertices and edges to the
target database through ConnectionManager.
```mermaid
classDiagram
    direction LR
    class Caster {
        +schema: Schema
        +ingestion_model: IngestionModel
        +ingestion_params: IngestionParams
        +ingest(target_db_config, bindings, ...)
        +cast_normal_resource(data, resource_name) GraphContainer
        +process_batch(batch, resource_name, conn_conf)
        +process_data_source(data_source, ...)
        +ingest_data_sources(registry, conn_conf, ...)
    }
    class IngestionParams {
        +clear_data: bool
        +n_cores: int
        +resources: list~str~?
        +vertices: list~str~?
        +batch_size: int
        +batch_prefetch: int
        +max_items: int?
        +dry: bool
        +datetime_after: str?
        +datetime_before: str?
        +datetime_column: str?
    }
    class RegistryBuilder {
        +schema: Schema
        +build(bindings, ingestion_params) DataSourceRegistry
    }
    class DataSourceRegistry {
        +register(data_source, resource_name)
        +get_data_sources(resource_name) list~AbstractDataSource~
    }
    class DBWriter {
        +schema: Schema
        +dry: bool
        +max_concurrent: int
        +write(gc, conn_conf, resource_name)
    }
    class GraphContainer {
        +vertices: dict
        +edges: dict
        +from_docs_list(docs) GraphContainer
    }
    class ConnectionManager {
        +connection_config: DBConfig
        +upsert_docs_batch(...)
        +insert_edges_batch(...)
    }
    class AbstractDataSource {
        <<abstract>>
        +resource_name: str?
        +iter_batches(batch_size, limit)
    }
    Caster --> IngestionParams : ingestion_params
    Caster --> RegistryBuilder : creates
    RegistryBuilder --> DataSourceRegistry : builds
    Caster --> DBWriter : creates per batch
    Caster ..> GraphContainer : produces
    DBWriter ..> GraphContainer : consumes
    DBWriter --> ConnectionManager : opens connections
    DataSourceRegistry o-- "0..*" AbstractDataSource : contains
```
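Stripped to its essentials, the batch loop above looks like the following. Everything here is a simplified stand-in (the registry is a plain dict, `cast_batch` replaces the real actor pipeline, and a `dry` flag stands in for the `DBWriter`/`ConnectionManager` hand-off).

```python
class ToyDataSource:
    """Stand-in for AbstractDataSource: yields fixed-size batches of rows."""

    def __init__(self, rows):
        self.rows = rows

    def iter_batches(self, batch_size):
        for i in range(0, len(self.rows), batch_size):
            yield self.rows[i:i + batch_size]


def cast_batch(batch):
    """Stand-in for Caster.cast_normal_resource: rows -> GraphContainer-like dict."""
    return {"vertices": [{"id": r["id"]} for r in batch], "edges": []}


def ingest(registry, resource_name, batch_size=2, dry=True):
    """Stand-in for Caster.ingest: pull batches, cast, hand off to a writer."""
    written = []
    for source in registry.get(resource_name, []):
        for batch in source.iter_batches(batch_size):
            container = cast_batch(batch)
            if not dry:
                # a real DBWriter would upsert vertices/edges via ConnectionManager
                raise NotImplementedError("no database in this sketch")
            written.append(container)
    return written


registry = {"users": [ToyDataSource([{"id": 1}, {"id": 2}, {"id": 3}])]}
containers = ingest(registry, "users")
```

Note that each batch becomes its own container: memory use is bounded by `batch_size`, and the writer never sees the whole source at once.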
### DataSources vs Resources
These are the two key abstractions that decouple data retrieval from graph transformation:
- **DataSources** (`AbstractDataSource` subclasses) — handle *where and how* data is read. Each carries a `DataSourceType` (`FILE`, `SQL`, `SPARQL`, `API`, `IN_MEMORY`). Many DataSources can bind to the same Resource by name via the `DataSourceRegistry`.
- **Resources** (`Resource`) — handle *what the data becomes* in the LPG. Each Resource is a reusable actor pipeline (descend → transform → vertex → edge) that maps raw records to graph elements. Because DataSources bind to Resources by name, the same transformation logic applies regardless of whether data arrives from a file, an API, or a SPARQL endpoint.
- Optional `drop_trivial_input_fields` (default `false` on the model): when `true`, each record is preprocessed by dropping top-level keys whose value is `null` or the empty string `""` before actors run. This trims sparse wide rows (many unused columns) without extra transforms; nested dicts and lists are not walked.
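The preprocessing described in the last bullet can be sketched as a plain function. This is a minimal illustration of the documented semantics, not the library's actual implementation:

```python
def drop_trivial_input_fields(record: dict) -> dict:
    """Drop top-level keys whose value is None or the empty string.

    Mirrors the documented `drop_trivial_input_fields` behavior: only the
    top level is inspected; nested dicts and lists are left untouched.
    """
    return {k: v for k, v in record.items() if v is not None and v != ""}


row = {"id": 7, "name": "ada", "nickname": "", "meta": None, "tags": {"a": ""}}
cleaned = drop_trivial_input_fields(row)
# "nickname" and "meta" are dropped; the nested {"a": ""} survives as-is
```

Values like `0` or `false` are kept: only `null` and `""` count as trivial, so legitimately falsy data is never discarded.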