pelinker.io.reader¶
Unified reader interface for reading large files in chunks. Supports Feather, Parquet, and CSV/TSV formats.
read_batches(file_path, batch_size=1000, file_type=None, **kwargs) ¶
Read large files in batches, supporting Feather, Parquet, and CSV/TSV formats.
Automatically detects file type from extension if not provided.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `file_path` | `str` | Path to the file to read. | *required* |
| `batch_size` | `int` | Number of rows per batch. | `1000` |
| `file_type` | `Optional[str]` | Optional file type override (`'feather'`, `'parquet'`, `'csv'`). If `None`, auto-detects from the file extension. | `None` |
| `**kwargs` | | Additional arguments passed to the format-specific reader: for CSV, `sep`, `header`, etc. (`pandas.read_csv` arguments); for Parquet, `columns` (list of column names to read). | `{}` |
Yields:
| Type | Description |
|---|---|
| `DataFrame` | Batches of data as pandas DataFrames. |
Examples:
>>> # Read feather file
>>> for batch in read_batches("data.feather", batch_size=5000):
... process(batch)
>>> # Read parquet file
>>> for batch in read_batches("data.parquet", batch_size=10000):
... process(batch)
>>> # Read CSV file with custom separator
>>> for batch in read_batches("data.csv", batch_size=2000, sep=";"):
... process(batch)
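For intuition, the CSV branch of a reader like this can be approximated with pandas' built-in chunked reading. The sketch below is illustrative only, not `pelinker.io.reader`'s actual implementation; `read_csv_batches` is a hypothetical helper name:

```python
import io

import pandas as pd


def read_csv_batches(file_path, batch_size=1000, **kwargs):
    """Yield pandas DataFrames of at most batch_size rows from a CSV source."""
    # With chunksize set, pandas streams the file and returns an iterator
    # of DataFrames, so only one batch is held in memory at a time.
    yield from pd.read_csv(file_path, chunksize=batch_size, **kwargs)


# In-memory buffer standing in for a semicolon-delimited file on disk.
buf = io.StringIO("a;b\n" + "\n".join(f"{i};{i * 2}" for i in range(5)))
sizes = [len(batch) for batch in read_csv_batches(buf, batch_size=2, sep=";")]
print(sizes)  # three batches: 2 + 2 + 1 rows
```

Extra keyword arguments such as `sep` pass straight through to `pandas.read_csv`, which mirrors how `read_batches` forwards `**kwargs` to the format-specific reader.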