https://github.com/mmore500/joinem

CLI for fast, flexible concatenation of tabular data using polars

Host: GitHub
URL: https://github.com/mmore500/joinem
Owner: mmore500
License: mit
Created: 2024-02-19T17:20:44.000Z (over 1 year ago)
Default Branch: master
Last Pushed: 2025-04-28T04:32:55.000Z (30 days ago)
Last Synced: 2025-04-30T10:12:57.725Z (28 days ago)
Language: Python
Homepage:
Size: 168 KB
Stars: 16
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
- Citation: CITATION.cff

[![PyPi](https://img.shields.io/pypi/v/joinem.svg?)](https://pypi.python.org/pypi/joinem)
[![CI](https://github.com/mmore500/joinem/actions/workflows/ci.yaml/badge.svg)](https://github.com/mmore500/joinem/actions)
[![GitHub stars](https://img.shields.io/github/stars/mmore500/joinem.svg?style=round-square&logo=github&label=Stars&logoColor=white)](https://github.com/mmore500/joinem)

[![DOI](https://zenodo.org/badge/760045369.svg)](https://zenodo.org/doi/10.5281/zenodo.10701182)

**_joinem_** provides a CLI for fast, flexible concatenation of tabular data using [polars](https://pola.rs/).

- Free software: MIT license

- Repository: <https://github.com/mmore500/joinem>

- Documentation: 

## Install

`python3 -m pip install joinem`

## Features

- Lazily streams I/O to expeditiously handle numerous large files.

- Supports CSV and parquet input files.

    - Due to current polars limitations, JSON and feather files are not supported.

    - Input formats may be mixed.

- Supports output to CSV, JSON, parquet, and feather file types.

- Allows mismatched columns and/or empty data files with `--how diagonal` and `--how diagonal_relaxed`.

- Provides a progress bar with `--progress`.

- Adds programmatically generated columns to output.

## Example Usage

Pass input filenames via stdin, one filename per line.

```
find path/to/*.parquet path/to/*.csv | python3 -m joinem out.parquet
```
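
For scripted pipelines, the same one-filename-per-line protocol can be driven from Python. The following is only a sketch: the helper names `stdin_payload` and `run_joinem` are hypothetical (they are not part of joinem), and it assumes joinem is installed for the interpreter that runs the script.

```python
import subprocess
import sys


def stdin_payload(paths):
    """Format paths the way joinem reads them on stdin: one filename per line."""
    return "".join(f"{p}\n" for p in paths)


def run_joinem(paths, output_file, *extra_args):
    """Hypothetical wrapper: pipe the filenames to `python -m joinem`.

    Assumes joinem is importable by the current interpreter.
    """
    return subprocess.run(
        [sys.executable, "-m", "joinem", str(output_file), *extra_args],
        input=stdin_payload(paths),
        text=True,
        check=True,  # raise CalledProcessError if joinem exits non-zero
    )
```

A call such as `run_joinem(["a.csv", "b.csv"], "out.parquet", "--progress")` would then mirror the shell pipeline above.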

Output file type is inferred from the extension of the output file name.

Supported output types are feather, JSON, parquet, and CSV.

```
find . -name '*.parquet' | python3 -m joinem out.json
```

If file columns may mismatch, use `--how diagonal`.

```
find path/to/ -name '*.csv' | python3 -m joinem out.csv --how diagonal
```

If some files may be empty, use `--how diagonal_relaxed`.
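
To make the diagonal strategy concrete, here is a plain-Python sketch of its semantics: the output takes the union of all input columns, and cells that a given file lacks are filled with nulls. This illustrates the idea only; it is not how polars or joinem implements concatenation, and the function name `diagonal_concat` is hypothetical. In polars, `diagonal_relaxed` additionally coerces mismatched column dtypes to a common supertype, which this sketch does not model.

```python
def diagonal_concat(tables):
    """Concatenate lists of row dicts, taking the union of all columns.

    Columns keep first-seen order; cells missing from a row become None.
    """
    columns = []
    for rows in tables:
        for row in rows:
            for name in row:
                if name not in columns:
                    columns.append(name)
    merged = [
        {name: row.get(name) for name in columns}
        for rows in tables
        for row in rows
    ]
    return columns, merged


file_a = [{"id": 1, "x": 0.5}]
file_b = [{"id": 2, "y": "foo"}]  # has "y" but no "x"
file_c = []  # an empty file contributes no rows
columns, merged = diagonal_concat([file_a, file_b, file_c])
```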

To run via Singularity/Apptainer,

```
ls -1 *.csv | singularity run docker://ghcr.io/mmore500/joinem out.feather
```

## Advanced Usage

Add literal value column to output.

```
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.lit(2).alias("two")'
```

Cast a column to categorical in the output, shrink dtypes, and tune compression.

```
ls -1 *.csv | python3 -m joinem out.pqt \
  --with-column 'pl.col("uuid").cast(pl.Categorical)' --string-cache \
  --shrink-dtypes \
  --write-kwarg 'compression_level=10' --write-kwarg 'compression="zstd"'
```

Alias an existing column in the output.

```
ls -1 *.csv | python3 -m joinem out.csv --with-column 'pl.col("a").alias("a2")'
```

Apply regex on source datafile paths to create new column in output.

```
ls -1 path/to/*.csv | python3 -m joinem out.csv \
  --with-column 'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv", r"${1}").alias("filename stem")'
```
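
To see what this pattern extracts, the sketch below applies the same regular expression with Python's standard `re` module. Two assumptions apply: polars uses Rust regex syntax, where `${1}` names the first capture group, whereas Python's `re` spells that `\1`; and the helper name `filename_stem` is hypothetical.

```python
import re

# Same pattern as in the --with-column expression above.
STEM_PATTERN = r".*?([^/]*)\.csv"


def filename_stem(filepath):
    """Replace the first match with capture group 1, the name before '.csv'."""
    # count=1 mirrors polars' str.replace, which replaces only the first match.
    return re.sub(STEM_PATTERN, r"\1", filepath, count=1)


stem = filename_stem("path/to/data_01.csv")
```

For simple POSIX-style paths like the one above, the result matches `pathlib.PurePosixPath(filepath).stem`.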

Read data from stdin and write data to stdout.

```
cat foo.csv | python3 -m joinem "/dev/stdout" --stdin \
  --output-filetype csv --input-filetype csv
```

Write to parquet via stdout using `pv` to display progress, cast "myValue" column to categorical, and use lz4 for parquet compression.

```
ls -1 input/*.pqt | python3 -m joinem "/dev/stdout" --output-filetype pqt \
  --with-column 'pl.col("myValue").cast(pl.Categorical)' \
  --write-kwarg 'compression="lz4"' \
  | pv > concat.pqt
```

## API

```
usage: __main__.py [-h] [--version] [--progress] [--stdin] [--drop DROP] [--eager-read] [--eager-write]
                   [--filter FILTERS] [--head HEAD] [--tail TAIL] [--sample SAMPLE] [--seed SEED]
                   [--with-column WITH_COLUMNS] [--shrink-dtypes] [--string-cache]
                   [--how {vertical,vertical_relaxed,diagonal,diagonal_relaxed,horizontal,align,align_full,align_inner,align_left,align_right}]
                   [--input-filetype INPUT_FILETYPE] [--output-filetype OUTPUT_FILETYPE] [--read-kwarg READ_KWARGS]
                   [--write-kwarg WRITE_KWARGS]
                   output_file

CLI for fast, flexible concatenation of tabular data using Polars.

positional arguments:
  output_file           Output file name

options:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --progress            Show progress bar
  --stdin               Read data from stdin
  --drop DROP           Columns to drop.
  --eager-read          Use read_* instead of scan_*. Can improve performance
                        in some cases.
  --eager-write         Use write_* instead of sink_*. Can improve performance
                        in some cases.
  --filter FILTERS      Expression to be evaluated and passed to polars DataFrame.filter.
                        Example: 'pl.col("thing") == 0'
  --with-column WITH_COLUMNS
                        Expression to be evaluated to add a column, with access
                        to each datafile's filepath as `filepath` and polars
                        as `pl`. Example:
                        'pl.lit(filepath).str.replace(r".*?([^/]*)\.csv",
                        r"${1}").alias("filename stem")'
  --shrink-dtypes       Shrink numeric columns to the minimal required datatype.
  --string-cache        Enable Polars global string cache.
  --how {vertical,vertical_relaxed,diagonal,diagonal_relaxed,horizontal,align,align_full,align_inner,align_left,align_right}
                        How to concatenate frames. See  for more information.
  --input-filetype INPUT_FILETYPE
                        Filetype of input. Otherwise, inferred. Example: csv,
                        parquet, json, feather
  --output-filetype OUTPUT_FILETYPE
                        Filetype of output. Otherwise, inferred. Example: csv,
                        parquet
  --read-kwarg READ_KWARGS
                        Additional keyword arguments to pass to pl.read_* or
                        pl.scan_* call(s). Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'infer_schema_length=None'
  --write-kwarg WRITE_KWARGS
                        Additional keyword arguments to pass to pl.write_* or
                        pl.sink_* call. Provide as 'key=value'. Specify
                        multiple kwargs by using this flag multiple times.
                        Arguments will be evaluated as Python expressions.
                        Example: 'compression="lz4"'

Provide input filepaths via stdin. Example: find path/to/ -name '*.csv' |
python3 -m joinem out.csv
```
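
The `key=value` convention for `--read-kwarg` and `--write-kwarg` explains why string values need their own inner quotes, as in `'compression="lz4"'`: the value is evaluated as Python, so a bare `lz4` would be read as a name rather than a string. The sketch below illustrates that convention. It is not joinem's actual parser: the function names are hypothetical, and it uses `ast.literal_eval`, which accepts only literals and is therefore stricter than evaluating arbitrary Python expressions.

```python
import ast


def parse_kwarg(spec):
    """Split 'key=value' at the first '=' and evaluate the value as a literal."""
    key, sep, value = spec.partition("=")
    if not sep or not key.strip():
        raise ValueError(f"expected 'key=value', got {spec!r}")
    # Assumption: literal_eval stands in for full expression evaluation.
    return key.strip(), ast.literal_eval(value.strip())


def parse_kwargs(specs):
    """Collect repeated flag values into one keyword-argument dict."""
    return dict(parse_kwarg(spec) for spec in specs)


write_kwargs = parse_kwargs(['compression_level=10', 'compression="zstd"'])
read_kwargs = parse_kwargs(["infer_schema_length=None"])
```

Here `compression_level` becomes the integer `10`, `compression` the string `"zstd"`, and `infer_schema_length` the value `None`.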

## Citing

If *joinem* contributes to a scholarly work, please cite it as

> Matthew Andres Moreno. (2024). mmore500/joinem. Zenodo. https://doi.org/10.5281/zenodo.10701182

```bibtex
@software{moreno2024joinem,
  author = {Matthew Andres Moreno},
  title = {mmore500/joinem},
  month = feb,
  year = 2024,
  publisher = {Zenodo},
  doi = {10.5281/zenodo.10701182},
  url = {https://doi.org/10.5281/zenodo.10701182}
}
```

And don't forget to leave a [star on GitHub](https://github.com/mmore500/joinem/stargazers)!
