graflo.data_source.file
File-based data source implementations.
This module provides data source implementations for file-based data, including JSON, JSONL, Parquet, and CSV/TSV files. It integrates with the existing chunker logic for efficient batch processing.
FileDataSource (dataclass)
Bases: AbstractDataSource
Base class for file-based data sources.
This class provides a common interface for file-based data sources, integrating with the existing chunker system for batch processing.
Attributes:
| Name | Type | Description |
|---|---|---|
| `path` | `Path \| str` | Path to the file |
| `file_type` | `str \| None` | Type of file (json, jsonl, table) |
| `encoding` | `EncodingType` | File encoding (default: UTF_8) |
Source code in graflo/data_source/file.py
__post_init__()
iter_batches(batch_size=1000, limit=None)
Iterate over file data in batches.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `batch_size` | `int` | Number of items per batch | `1000` |
| `limit` | `int \| None` | Maximum number of items to retrieve | `None` |
Yields:
| Type | Description |
|---|---|
| `list[dict]` | Batches of documents as dictionaries |
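The interaction between `batch_size` and `limit` can be illustrated with a minimal standard-library sketch. This is not graflo's implementation: `iter_batches_sketch` is a hypothetical helper, and it assumes that `limit` caps the total number of items across all batches, as the parameter description suggests.

```python
from __future__ import annotations

from itertools import islice
from typing import Any, Iterable, Iterator


def iter_batches_sketch(
    items: Iterable[dict[str, Any]],
    batch_size: int = 1000,
    limit: int | None = None,
) -> Iterator[list[dict[str, Any]]]:
    """Hypothetical helper: yield lists of at most `batch_size` items.

    Assumption: `limit` caps the total number of items across all batches.
    """
    # islice with stop=None passes every item through unchanged.
    iterator = islice(items, limit)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


docs = [{"id": i} for i in range(7)]
batches = list(iter_batches_sketch(docs, batch_size=3, limit=5))
# Five items are consumed in total, with at most three per batch.
```

Under these assumptions the final batch may be shorter than `batch_size`, so consumers should not rely on a fixed batch length.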
JsonFileDataSource (dataclass)
Bases: FileDataSource
Data source for JSON files.
JSON files are expected to contain hierarchical data structures, similar to REST API responses. The chunker handles nested structures and converts them to dictionaries.
Attributes:
| Name | Type | Description |
|---|---|---|
| `path` | `Path \| str` | Path to the JSON file |
| `encoding` | `EncodingType` | File encoding (default: UTF_8) |
JsonlFileDataSource (dataclass)
Bases: FileDataSource
Data source for JSONL (JSON Lines) files.
JSONL files contain one JSON object per line, making them suitable for streaming and batch processing.
Attributes:
| Name | Type | Description |
|---|---|---|
| `path` | `Path \| str` | Path to the JSONL file |
| `encoding` | `EncodingType` | File encoding (default: UTF_8) |
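To show why the one-object-per-line layout suits streaming, here is a minimal sketch that reads a JSONL file lazily with the standard library only. `read_jsonl_sketch` is a hypothetical helper and not part of graflo, which routes reading through its own chunker; skipping blank lines is likewise an assumption of this sketch.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl_sketch(path, encoding="utf-8"):
    """Hypothetical helper: lazily yield one dict per non-empty line."""
    with open(path, encoding=encoding) as handle:
        for line in handle:
            line = line.strip()
            if line:  # assumption: blank lines are skipped
                yield json.loads(line)


# Write a small sample file so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text(
        '{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n',
        encoding="utf-8",
    )
    records = list(read_jsonl_sketch(sample))
```

Because each line is parsed independently, the file never has to be held in memory as a whole.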
ParquetFileDataSource (dataclass)
Bases: FileDataSource
Data source for Parquet files.
Parquet is a columnar storage format; Parquet files are read using pandas. Each row becomes a dictionary with column names as keys.
Attributes:
| Name | Type | Description |
|---|---|---|
| `path` | `Path \| str` | Path to the Parquet file |
TableFileDataSource (dataclass)
Bases: FileDataSource
Data source for CSV/TSV files.
Each row of a table file becomes a dictionary keyed by the column headers.
Attributes:
| Name | Type | Description |
|---|---|---|
| `path` | `Path \| str` | Path to the CSV/TSV file |
| `encoding` | `EncodingType` | File encoding (default: UTF_8) |
| `sep` | `str` | Field separator (default: ',') |
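The row-to-dictionary conversion and the role of `sep` can be sketched with Python's `csv` module. This is an illustration under stated assumptions, not graflo's code: `read_table_sketch` is a hypothetical helper, and graflo's actual reader may parse or type values differently.

```python
import csv
import io


def read_table_sketch(text, sep=","):
    """Hypothetical helper: parse delimited text into a list of row dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=sep)
    return [dict(row) for row in reader]


# The same helper handles CSV and TSV by changing only the separator.
csv_rows = read_table_sketch("id,name\n1,a\n2,b\n")
tsv_rows = read_table_sketch("id\tname\n1\ta\n", sep="\t")
# Values stay strings in this sketch; no type inference is performed.
```

For a TSV file, the equivalent setting on the data source would presumably be `sep="\t"`.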