https://github.com/cldellow/csv2parquet
Convert a CSV to a parquet file.
- Host: GitHub
- URL: https://github.com/cldellow/csv2parquet
- Owner: cldellow
- License: apache-2.0
- Created: 2018-03-26T01:22:23.000Z (almost 7 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T06:34:48.000Z (about 2 years ago)
- Last Synced: 2024-12-16T04:13:16.772Z (about 2 months ago)
- Topics: apache-arrow, apache-parquet, csv, parquet
- Language: Python
- Homepage:
- Size: 97.7 KB
- Stars: 64
- Watchers: 4
- Forks: 14
- Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
# csv2parquet
[![Build Status](https://travis-ci.org/cldellow/csv2parquet.svg?branch=master)](https://travis-ci.org/cldellow/csv2parquet)
[![codecov](https://codecov.io/gh/cldellow/csv2parquet/branch/master/graph/badge.svg)](https://codecov.io/gh/cldellow/csv2parquet)

Convert a CSV to a parquet file. You may also find [sqlite-parquet-vtable](https://github.com/cldellow/sqlite-parquet-vtable) or
[parquet-metadata](https://github.com/cldellow/parquet-metadata) useful.

## Installing
If you just want to use the tool:
```
sudo pip install pyarrow csv2parquet
```

If you want to clone the repo and work on the tool, install its dependencies via pipenv:
```
pipenv install
```

## Usage
Next, create some Parquet files. The tool supports CSV and TSV files.
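
As a rough illustration, the Python sketch below assembles one possible command line from the flags listed in the help text that follows. The helper `build_csv2parquet_args`, the file `people.csv`, and the column name `person_age` are made up for this example and are not part of the tool. The input file is placed first on the assumption that the multi-value options (`-t`, `-R`, `-i`, `-x`) would otherwise swallow a trailing file name.

```python
import shlex


def build_csv2parquet_args(csv_file, rows=None, output=None, codec=None, types=()):
    """Assemble an argv list for csv2parquet (hypothetical helper, not part of the tool)."""
    # The input file goes first: -t accepts several values, so a file name
    # placed after it could be read as one more TYPE value.
    args = ["csv2parquet", csv_file]
    if rows is not None:
        args += ["-n", str(rows)]      # limit the number of rows, useful for testing
    if output is not None:
        args += ["-o", output]         # path of the parquet file to write
    if codec is not None:
        args += ["-c", codec]          # brotli, gzip, snappy, zstd or none
    if types:
        args += ["-t", *types]         # COLUMN=TYPE specs, by index or name
    return args


if __name__ == "__main__":
    # people.csv and person_age are illustrative names only.
    cmd = build_csv2parquet_args(
        "people.csv",
        rows=1000,
        output="people.parquet",
        codec="snappy",
        types=["0=bool?", "person_age=int8"],
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

When typing such a command by hand, quote any type spec that ends in `?` (for example `'0=bool?'`), because most shells treat `?` as a wildcard character.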
```
usage: csv2parquet [-h] [-n ROWS] [-r ROW_GROUP_SIZE] [-o OUTPUT] [-c CODEC]
                   [-i INCLUDE [INCLUDE ...] | -x EXCLUDE [EXCLUDE ...]]
                   [-R RENAME [RENAME ...]] [-t TYPE [TYPE ...]]
                   csv_file

positional arguments:
  csv_file              input file, can be CSV or TSV

optional arguments:
  -h, --help            show this help message and exit
  -n ROWS, --rows ROWS  The number of rows to include, useful for testing.
  -r ROW_GROUP_SIZE, --row-group-size ROW_GROUP_SIZE
                        The number of rows per row group.
  -o OUTPUT, --output OUTPUT
                        The parquet file
  -c CODEC, --codec CODEC
                        The compression codec to use (brotli, gzip, snappy,
                        zstd, none)
  -i INCLUDE [INCLUDE ...], --include INCLUDE [INCLUDE ...]
                        Include the given columns (by index or name)
  -x EXCLUDE [EXCLUDE ...], --exclude EXCLUDE [EXCLUDE ...]
                        Exclude the given columns (by index or name)
  -R RENAME [RENAME ...], --rename RENAME [RENAME ...]
                        Rename a column. Specify the column to be renamed and
                        its new name, eg: 0=age or person_age=age
  -t TYPE [TYPE ...], --type TYPE [TYPE ...]
                        Parse a column as a given type. Specify the column and
                        its type, eg: 0=bool? or person_age=int8. Parse errors
                        are fatal unless the type is followed by a question
                        mark. Valid types are string (default), base64, bool,
                        float32, float64, int8, int16, int32, int64, timestamp
```

## Testing
```
pylint csv2parquet
pytest
```
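
## Example: reading a type spec

The `-t` option under Usage takes specs of the form `COLUMN=TYPE`, with an optional trailing `?` that makes parse errors non-fatal. The sketch below shows one way to read that grammar. It is not the tool's own code: `TypeSpec`, `parse_type_spec`, and `convert` are hypothetical names, and two behaviours are assumptions, namely that an all-digit column is treated as an index and that a failed lenient parse yields a null value.

```python
from typing import Callable, NamedTuple, Optional, Union

# Type names as listed in the help text for -t.
VALID_TYPES = {
    "string", "base64", "bool", "float32", "float64",
    "int8", "int16", "int32", "int64", "timestamp",
}


class TypeSpec(NamedTuple):
    column: Union[int, str]  # column index or column name
    type_name: str           # one of VALID_TYPES
    lenient: bool            # True when the spec ended with "?"


def parse_type_spec(spec: str) -> TypeSpec:
    """Split a spec such as '0=bool?' or 'person_age=int8' into its parts."""
    # Split on the last "=" so that a column name containing "=" still works
    # (an assumption; type names never contain "=").
    column, sep, type_name = spec.rpartition("=")
    if not sep or not column:
        raise ValueError(f"expected COLUMN=TYPE, got {spec!r}")
    lenient = type_name.endswith("?")
    if lenient:
        type_name = type_name[:-1]
    if type_name not in VALID_TYPES:
        raise ValueError(f"unknown type {type_name!r}")
    # Assumption: an all-digit column refers to a column index.
    col: Union[int, str] = int(column) if column.isdecimal() else column
    return TypeSpec(col, type_name, lenient)


def convert(value: str, parser: Callable[[str], object], lenient: bool) -> Optional[object]:
    """Apply parser to value; swallow the error only when lenient is set."""
    try:
        return parser(value)
    except ValueError:
        if lenient:
            return None  # assumption: a failed lenient parse becomes a null
        raise


if __name__ == "__main__":
    print(parse_type_spec("0=bool?"))
    print(parse_type_spec("person_age=int8"))
    print(convert("not a number", int, lenient=True))
```

With a strict spec such as `person_age=int8`, the same bad value would raise instead of returning `None`, which matches the "parse errors are fatal" wording in the help text.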