https://github.com/cldellow/csv2parquet

Convert a CSV to a parquet file.
apache-arrow apache-parquet csv parquet

Host: GitHub
URL: https://github.com/cldellow/csv2parquet
Owner: cldellow
License: apache-2.0
Created: 2018-03-26T01:22:23.000Z (over 8 years ago)
Default Branch: master
Last Pushed: 2022-12-08T06:34:48.000Z (over 3 years ago)
Last Synced: 2024-12-16T04:13:16.772Z (over 1 year ago)
Topics: apache-arrow, apache-parquet, csv, parquet
Language: Python
Homepage:
Size: 97.7 KB
Stars: 64
Watchers: 4
Forks: 14
Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

# csv2parquet

[![Build Status](https://travis-ci.org/cldellow/csv2parquet.svg?branch=master)](https://travis-ci.org/cldellow/csv2parquet)

[![codecov](https://codecov.io/gh/cldellow/csv2parquet/branch/master/graph/badge.svg)](https://codecov.io/gh/cldellow/csv2parquet)

Convert a CSV to a parquet file. You may also find [sqlite-parquet-vtable](https://github.com/cldellow/sqlite-parquet-vtable) or [parquet-metadata](https://github.com/cldellow/parquet-metadata) useful.

## Installing

If you just want to use the tool:

```
sudo pip install pyarrow csv2parquet
```

If you want to clone the repo and work on the tool, install its dependencies via pipenv:

```
pipenv install
```

## Usage

Run `csv2parquet` on an input file to create a Parquet file. Both CSV and TSV inputs are supported.

```
usage: csv2parquet [-h] [-n ROWS] [-r ROW_GROUP_SIZE] [-o OUTPUT] [-c CODEC]
                   [-i INCLUDE [INCLUDE ...] | -x EXCLUDE [EXCLUDE ...]]
                   [-R RENAME [RENAME ...]] [-t TYPE [TYPE ...]]
                   csv_file

positional arguments:
  csv_file              input file, can be CSV or TSV

optional arguments:
  -h, --help            show this help message and exit
  -n ROWS, --rows ROWS  The number of rows to include, useful for testing.
  -r ROW_GROUP_SIZE, --row-group-size ROW_GROUP_SIZE
                        The number of rows per row group.
  -o OUTPUT, --output OUTPUT
                        The parquet file
  -c CODEC, --codec CODEC
                        The compression codec to use (brotli, gzip, snappy,
                        zstd, none)
  -i INCLUDE [INCLUDE ...], --include INCLUDE [INCLUDE ...]
                        Include the given columns (by index or name)
  -x EXCLUDE [EXCLUDE ...], --exclude EXCLUDE [EXCLUDE ...]
                        Exclude the given columns (by index or name)
  -R RENAME [RENAME ...], --rename RENAME [RENAME ...]
                        Rename a column. Specify the column to be renamed and
                        its new name, eg: 0=age or person_age=age
  -t TYPE [TYPE ...], --type TYPE [TYPE ...]
                        Parse a column as a given type. Specify the column and
                        its type, eg: 0=bool? or person_age=int8. Parse errors
                        are fatal unless the type is followed by a question
                        mark. Valid types are string (default), base64, bool,
                        float32, float64, int8, int16, int32, int64, timestamp
```
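
As a rough illustration of how the flags above combine, the following Python sketch assembles a command line from them. The helper `build_csv2parquet_command`, the file names `people.csv` and `people.parquet`, and the column names `person_age` and `is_active` are all hypothetical; only the flags themselves come from the usage text. The sketch builds and prints the command without running it, and it places the input file first as a precaution so that multi-value options such as `-t` cannot take it as one of their values.

```python
import shlex


def build_csv2parquet_command(csv_file, output=None, codec=None, rows=None,
                              exclude=None, types=None):
    """Assemble an argv list for csv2parquet (hypothetical helper).

    Only flags listed in the usage text are emitted.
    """
    # Input file first, so multi-value options (-x, -t) cannot absorb it.
    argv = ["csv2parquet", csv_file]
    if output is not None:
        argv += ["-o", output]
    if codec is not None:
        argv += ["-c", codec]
    if rows is not None:
        argv += ["-n", str(rows)]
    if exclude:
        # Columns may be given by index or by name.
        argv += ["-x", *(str(col) for col in exclude)]
    if types:
        # Each entry becomes column=type; a trailing "?" on the type
        # makes parse errors for that column non-fatal.
        argv += ["-t", *(f"{col}={typ}" for col, typ in types.items())]
    return argv


argv = build_csv2parquet_command(
    "people.csv",              # hypothetical input file
    output="people.parquet",   # hypothetical output file
    codec="zstd",
    rows=1000,                 # small sample, useful for testing
    types={"person_age": "int8", "is_active": "bool?"},
)

# shlex.join quotes arguments for a POSIX shell; the "?" in "bool?" is
# quoted so the shell does not treat it as a glob character.
command = shlex.join(argv)
print(command)
```

With the tool installed, the `argv` list could be passed to `subprocess.run(argv, check=True)` instead of being printed. When typing the command by hand, quoting `'is_active=bool?'` is a sensible habit for the same globbing reason.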

## Testing

```
pylint csv2parquet
pytest
```
