graflo.data_source.base
Base classes for data source abstraction.
This module defines the abstract base class and types for all data sources. Data sources handle data retrieval from various sources (files, APIs, databases) and provide a unified interface for batch iteration.
AbstractDataSource dataclass
Bases: BaseDataclass, ABC
Abstract base class for all data sources.
Data sources handle data retrieval from various sources and provide a unified interface for batch iteration. They are separate from Resources, which handle data transformation. Many DataSources can map to the same Resource.
Attributes:
| Name | Type | Description |
|---|---|---|
| source_type | DataSourceType | Type of the data source |
| resource_name | str \| None | Name of the resource this data source maps to (set externally via DataSourceRegistry) |
Source code in graflo/data_source/base.py
resource_name property writable
Get the resource name this data source maps to.
Returns:
| Type | Description |
|---|---|
| str \| None | Resource name or None if not set |
__iter__()
Make data source iterable, yielding individual items.
Yields:
| Type | Description |
|---|---|
| dict | Individual documents |
__post_init__()
iter_batches(batch_size=1000, limit=None) abstractmethod
Iterate over data in batches.
This method yields batches of documents (dictionaries) from the data source. Each batch is a list of dictionaries representing the data items.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| batch_size | int | Number of items per batch | 1000 |
| limit | int \| None | Maximum number of items to retrieve (None for no limit) | None |
Yields:
| Type | Description |
|---|---|
| list[dict] | Batches of documents as dictionaries |
Raises:
| Type | Description |
|---|---|
| NotImplementedError | Must be implemented by subclasses |
DataSourceType
Bases: BaseEnum
Types of data sources supported by the system.
- FILE: File-based data sources (JSON, JSONL, CSV/TSV)
- API: REST API data sources
- SQL: SQL database data sources
- IN_MEMORY: In-memory data sources (lists, DataFrames)