https://github.com/tangledpath/csv-batcher
A utility that splits a large CSV into smaller ones and uses multiprocessing to process the resulting CSVs in parallel.
- Host: GitHub
- URL: https://github.com/tangledpath/csv-batcher
- Owner: tangledpath
- Created: 2024-02-23T00:35:59.000Z (over 2 years ago)
- Default Branch: master
- Last Pushed: 2024-06-25T22:24:55.000Z (almost 2 years ago)
- Last Synced: 2025-03-09T20:33:57.405Z (about 1 year ago)
- Language: Python
- Size: 1.1 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 2
# csv-batcher
## [Vertical scaling](https://en.wikipedia.org/wiki/Scalability#Vertical_or_scale_up)
A lightweight, Python-based, multi-process CSV batcher suitable for use with pandas DataFrames, as a standalone tool, or with other tools that deal with large CSV files (or files that require timely processing).
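To make the batching idea concrete, here is a minimal sketch of the splitting step using only the standard library. It is not csv-batcher's implementation; the function name `split_csv`, its parameters, and the chunk file naming are assumptions made for illustration. Each chunk repeats the header row so that it can be read on its own by a worker process.

```python
import csv
import itertools
import tempfile
from pathlib import Path


def split_csv(source, chunk_lines, out_dir):
    """Split `source` into files of at most `chunk_lines` data rows each.

    Hypothetical helper, not part of csv-batcher. Every chunk repeats the
    header row. Returns the chunk paths in order.
    """
    out_dir = Path(out_dir)
    chunk_paths = []
    with open(source, newline="") as src:
        reader = csv.reader(src)
        header = next(reader, None)
        if header is None:  # empty input file: nothing to split
            return chunk_paths
        for index in itertools.count():
            # Pull the next batch of rows without loading the whole file.
            rows = list(itertools.islice(reader, chunk_lines))
            if not rows:
                break
            chunk_path = out_dir / f"chunk_{index:04d}.csv"
            with open(chunk_path, "w", newline="") as dst:
                writer = csv.writer(dst)
                writer.writerow(header)
                writer.writerows(rows)
            chunk_paths.append(chunk_path)
    return chunk_paths


# Demo: 25 data rows split into chunks of 10 rows.
work_dir = Path(tempfile.mkdtemp())
sample = work_dir / "sample.csv"
with open(sample, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "amount"])
    writer.writerows([i, i * 2] for i in range(25))

chunks = split_csv(sample, chunk_lines=10, out_dir=work_dir)
# Data rows per chunk (line count minus the header line).
chunk_sizes = [len(p.read_text().splitlines()) - 1 for p in chunks]
print(chunk_sizes)
```

Once the file is split this way, each chunk path can be handed to a separate worker, which is the part a process pool takes care of.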
## Installation
pip install csv-batcher
## GitHub
https://github.com/tangledpath/csv-batcher
## Documentation
https://tangledpath.github.io/csv-batcher/csv_batcher.html
## Further exercises
* Possibly implement pooling with Celery (for use in Django apps, etc.), which would enable [horizontal scaling](https://en.wikipedia.org/wiki/Scalability#Horizontal_or_scale_out).
## Usage
The arguments passed to the callback function are controlled by the
`callback_with` argument given when creating the pooler, using one of the
`CallbackWith` enum values:
### As dataframe row
```python
from csv_batcher.csv_pooler import CSVPooler, CallbackWith


# Callback function passed to pooler; accepts a dataframe row
# as a pandas Series (via apply)
def process_dataframe_row(row):
    return row.iloc[0]


# The __main__ guard keeps this setup from re-running when worker
# processes import the module (needed on platforms that spawn workers).
if __name__ == "__main__":
    pooler = CSVPooler(
        "5mSalesRecords.csv",
        process_dataframe_row,
        callback_with=CallbackWith.DATAFRAME_ROW,
        pool_size=16,
    )
    for processed_batch in pooler.process():
        print(processed_batch)
```
### As dataframe
```python
from csv_batcher.csv_pooler import CSVPooler, CallbackWith


# Used from process_dataframe's apply:
def process_dataframe_row(row):
    return row.iloc[0]


# Callback function passed to pooler; accepts a dataframe:
def process_dataframe(df):
    foo = df.apply(process_dataframe_row, axis=1)
    # Or do something more complicated....
    return len(df)


# The __main__ guard keeps this setup from re-running when worker
# processes import the module (needed on platforms that spawn workers).
if __name__ == "__main__":
    pooler = CSVPooler(
        "5mSalesRecords.csv",
        process_dataframe,
        callback_with=CallbackWith.DATAFRAME,
        pool_size=16,
    )
    for processed_batch in pooler.process():
        print(processed_batch)
```
### As CSV filename
```python
import pandas as pd

from csv_batcher.csv_pooler import CSVPooler, CallbackWith


# Used from process_csv_filename's apply:
def process_dataframe_row(row):
    return row.iloc[0]


# Callback function passed to pooler; accepts the path of a CSV chunk file:
def process_csv_filename(csv_chunk_filename):
    # print("processing ", csv_chunk_filename)
    df = pd.read_csv(csv_chunk_filename, skipinitialspace=True, index_col=None)
    foo = df.apply(process_dataframe_row, axis=1)
    return len(df)


# The __main__ guard keeps this setup from re-running when worker
# processes import the module (needed on platforms that spawn workers).
if __name__ == "__main__":
    pooler = CSVPooler(
        "5mSalesRecords.csv",
        process_csv_filename,
        callback_with=CallbackWith.CSV_FILENAME,
        chunk_lines=10000,
        pool_size=16,
    )
    for processed_batch in pooler.process():
        print(processed_batch)
```
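The three callback styles above differ only in what the callback receives for each chunk. As a rough, standard-library-only sketch of that dispatch (not csv-batcher's actual code), the example below uses a hypothetical `CallbackMode` enum and `run_callback` helper, with a list of dicts standing in for a pandas DataFrame:

```python
import csv
import tempfile
from enum import Enum, auto
from pathlib import Path


class CallbackMode(Enum):
    """Hypothetical stand-in for the three callback styles shown above."""

    CSV_FILENAME = auto()  # callback receives the chunk's path
    ROWS = auto()          # callback receives all rows of the chunk at once
    ROW = auto()           # callback receives one row at a time


def run_callback(chunk_path, callback, mode):
    """Invoke `callback` on one chunk file according to `mode`."""
    if mode is CallbackMode.CSV_FILENAME:
        return callback(chunk_path)
    with open(chunk_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if mode is CallbackMode.ROWS:
        return callback(rows)
    # ROW mode: one call per row, results collected in order.
    return [callback(row) for row in rows]


# Demo on a single three-row chunk file.
chunk = Path(tempfile.mkdtemp()) / "chunk.csv"
chunk.write_text("id,amount\n1,10\n2,20\n3,30\n")

by_row = run_callback(chunk, lambda row: int(row["amount"]), CallbackMode.ROW)
by_rows = run_callback(chunk, len, CallbackMode.ROWS)
by_name = run_callback(chunk, lambda path: Path(path).name, CallbackMode.CSV_FILENAME)
print(by_row, by_rows, by_name)
```

Passing only the filename leaves parsing entirely to the callback, while the row-level style suits per-record transforms.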
## Development
### Linting
```bash
ruff check . # Find linting errors
ruff check . --fix # Auto-fix linting errors (where possible)
```
### Documentation
```bash
# Shows in browser
poetry run pdoc csv_batcher
# Generates to ./docs
poetry run pdoc csv_batcher -o ./docs
# OR (recommended)
bin/build.sh
```
### Testing
```bash
clear; pytest
```
### Publishing
```bash
poetry publish --build -u __token__ -p "$PYPI_TOKEN"
# OR (recommended)
bin/publish.sh
```