https://github.com/tangledpath/csv-batcher

A utility to split a large CSV into smaller ones, and uses multiprocessing to process the CSVs in parallel.
Host: GitHub
URL: https://github.com/tangledpath/csv-batcher
Owner: tangledpath
Created: 2024-02-23T00:35:59.000Z (over 2 years ago)
Default Branch: master
Last Pushed: 2024-06-25T22:24:55.000Z (about 2 years ago)
Last Synced: 2025-03-09T20:33:57.405Z (over 1 year ago)
Language: Python
Size: 1.1 MB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 2
Metadata Files:
- Readme: README.md

# csv-batcher

## [Vertical scaling](https://en.wikipedia.org/wiki/Scalability#Vertical_or_scale_up)

A lightweight, Python-based, multi-process CSV batcher. It can be used with pandas DataFrames, as a standalone tool, or alongside other tools that deal with large CSV files (or files that require timely processing).
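
To illustrate the general split-then-pool idea described above, here is a minimal standard-library sketch. It is not csv-batcher's implementation; the names `split_csv`, `count_rows`, and `process_in_parallel` are hypothetical and exist only for this example.

```python
import csv
import os
import tempfile
from multiprocessing import Pool


def split_csv(path, chunk_lines, out_dir):
    """Split `path` into chunk files of at most `chunk_lines` data rows each.

    Every chunk repeats the header row. Returns the chunk filenames in order.
    """
    chunk_paths = []
    with open(path, newline="") as src:
        reader = csv.reader(src)
        header = next(reader)
        writer, out = None, None
        for i, row in enumerate(reader):
            if i % chunk_lines == 0:
                # Start a new chunk file, closing the previous one first.
                if out:
                    out.close()
                chunk_path = os.path.join(out_dir, f"chunk_{len(chunk_paths)}.csv")
                out = open(chunk_path, "w", newline="")
                writer = csv.writer(out)
                writer.writerow(header)
                chunk_paths.append(chunk_path)
            writer.writerow(row)
        if out:
            out.close()
    return chunk_paths


def count_rows(chunk_path):
    """Example per-chunk callback: count data rows (header excluded)."""
    with open(chunk_path, newline="") as f:
        return sum(1 for _ in csv.reader(f)) - 1


def process_in_parallel(path, callback, chunk_lines=10000, pool_size=4):
    """Split the CSV, then map `callback` over the chunks in worker processes."""
    with tempfile.TemporaryDirectory() as tmp:
        chunks = split_csv(path, chunk_lines, tmp)
        with Pool(pool_size) as pool:
            return pool.map(callback, chunks)


if __name__ == "__main__":
    # Build a small throwaway CSV so the sketch runs without external data.
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.csv")
        with open(sample, "w", newline="") as f:
            w = csv.writer(f)
            w.writerow(["id", "amount"])
            w.writerows([i, i * 10] for i in range(25))
        print(process_in_parallel(sample, count_rows, chunk_lines=10, pool_size=2))
```

The callback has to be a module-level function so that it can be pickled and sent to the worker processes, and the `if __name__ == "__main__":` guard keeps workers from re-running the driver code on platforms that spawn rather than fork.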

## Installation

pip install csv-batcher

## GitHub

https://github.com/tangledpath/csv-batcher

## Documentation

https://tangledpath.github.io/csv-batcher/csv_batcher.html

## Further exercises

* Possibly implement pooling with Celery (for use in Django apps, etc.), which would enable [horizontal scaling](https://en.wikipedia.org/wiki/Scalability#Horizontal_or_scale_out).

## Usage

The arguments sent to the callback function can be controlled by creating the pooler with `callback_with` set to one of the `CallbackWith` enum values:

### As dataframe row

```python
from csv_batcher.csv_pooler import CSVPooler, CallbackWith

# Callback function passed to pooler; accepts a dataframe row
#   as a pandas Series (via apply)
def process_dataframe_row(row):
    return row.iloc[0]

pooler = CSVPooler(
    "5mSalesRecords.csv",
    process_dataframe_row,
    callback_with=CallbackWith.DATAFRAME_ROW,
    pool_size=16
)

for processed_batch in pooler.process():
    print(processed_batch)
```

### As dataframe

```python
from csv_batcher.csv_pooler import CSVPooler, CallbackWith

# Used from process_dataframe's apply:
def process_dataframe_row(row):
    return row.iloc[0]

# Callback function passed to pooler; accepts a dataframe:
def process_dataframe(df):
    foo = df.apply(process_dataframe_row, axis=1)
    # Or do something more complicated....
    return len(df)

pooler = CSVPooler(
    "5mSalesRecords.csv",
    process_dataframe,
    callback_with=CallbackWith.DATAFRAME,
    pool_size=16
)

for processed_batch in pooler.process():
    print(processed_batch)
```

### As CSV filename

```python
import pandas as pd
from csv_batcher.csv_pooler import CSVPooler, CallbackWith

# Used from process_csv_filename's apply:
def process_dataframe_row(row):
    return row.iloc[0]

# Callback function passed to pooler; accepts the filename of a CSV chunk:
def process_csv_filename(csv_chunk_filename):
    # print("processing ", csv_chunk_filename)
    df = pd.read_csv(csv_chunk_filename, skipinitialspace=True, index_col=None)
    foo = df.apply(process_dataframe_row, axis=1)
    return len(df)

pooler = CSVPooler(
    "5mSalesRecords.csv",
    process_csv_filename,
    callback_with=CallbackWith.CSV_FILENAME,
    chunk_lines=10000,
    pool_size=16
)

for processed_batch in pooler.process():
    print(processed_batch)
```
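
The three modes above differ only in what the callback receives. The following standard-library sketch illustrates that kind of enum-driven dispatch for a single chunk file. It is an assumption-laden illustration, not the library's code: `PassAs` and `dispatch` are hypothetical names, and plain row dicts stand in for pandas objects.

```python
import csv
import os
import tempfile
from enum import Enum, auto


class PassAs(Enum):
    """Hypothetical stand-in for the library's CallbackWith enum."""
    FILENAME = auto()  # callback(chunk_filename)
    ROWS = auto()      # callback(list_of_row_dicts), loosely like a dataframe
    ROW = auto()       # callback(row_dict), once per row


def dispatch(chunk_path, callback, pass_as):
    """Invoke `callback` on one chunk file in the form selected by `pass_as`."""
    if pass_as is PassAs.FILENAME:
        return callback(chunk_path)
    with open(chunk_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if pass_as is PassAs.ROWS:
        return callback(rows)
    # PassAs.ROW: call once per row and collect the results.
    return [callback(row) for row in rows]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "chunk.csv")
        with open(path, "w", newline="") as f:
            f.write("id,amount\n1,10\n2,20\n")
        print(dispatch(path, os.path.basename, PassAs.FILENAME))  # chunk.csv
        print(dispatch(path, len, PassAs.ROWS))                   # 2
        print(dispatch(path, lambda r: r["amount"], PassAs.ROW))  # ['10', '20']
```

Passing only the filename keeps parsing inside the worker, which can be preferable when the callback needs its own `read_csv` options, as in the CSV filename example.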

## Development

### Linting

```bash
ruff check . # Find linting errors
ruff check . --fix # Auto-fix linting errors (where possible)
```

### Documentation

```bash
# Shows in browser
poetry run pdoc csv_batcher

# Generates to ./docs
poetry run pdoc csv_batcher -o ./docs

# OR (recommended)
bin/build.sh
```

### Testing

```bash
clear; pytest
```

### Publishing

```bash
poetry publish --build -u __token__ -p $PYPI_TOKEN

# OR (recommended)
bin/publish.sh
```
