https://github.com/loiclefevre/parquet-manager
- Host: GitHub
- URL: https://github.com/loiclefevre/parquet-manager
- Owner: loiclefevre
- License: apache-2.0
- Created: 2019-04-15T14:40:53.000Z
- Default Branch: master
- Last Pushed: 2019-04-17T15:47:32.000Z
- Last Synced: 2025-01-13T11:33:23.250Z
- Topics: converter, csv, parquet
- Language: Java
- Size: 14.6 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# parquet-manager
Helps create Apache Parquet files from CSV without a full-blown Hadoop cluster deployment. This looks like an ETL phase :)
On Windows, you'll need to install a Hadoop client; see this thread for help: https://stackoverflow.com/questions/19620642/failed-to-locate-the-winutils-binary-in-the-hadoop-binary-path
You'll also have to define the HADOOP_HOME environment variable to make it work.
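To check the Windows prerequisites before running the tool, a small pre-flight check along these lines may help. It is only a sketch and not part of parquet-manager: the class `HadoopHomeCheck` and its methods are hypothetical names, and it assumes the Hadoop client looks for `winutils.exe` under `%HADOOP_HOME%\bin`, which is what the thread linked above discusses.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not shipped with parquet-manager.
class HadoopHomeCheck {

    // Location where the Hadoop client is assumed to look for winutils.exe.
    static Path winutilsPath(String hadoopHome) {
        return Paths.get(hadoopHome, "bin", "winutils.exe");
    }

    // Returns null when the setup looks fine, otherwise a description of the problem.
    static String diagnose(String hadoopHome, boolean isWindows) {
        if (!isWindows) {
            return null; // the winutils requirement only applies on Windows
        }
        if (hadoopHome == null || hadoopHome.trim().isEmpty()) {
            return "HADOOP_HOME is not set";
        }
        Path winutils = winutilsPath(hadoopHome);
        if (!Files.isRegularFile(winutils)) {
            return "winutils.exe not found at " + winutils;
        }
        return null;
    }

    public static void main(String[] args) {
        boolean isWindows = System.getProperty("os.name").toLowerCase().startsWith("windows");
        String problem = diagnose(System.getenv("HADOOP_HOME"), isWindows);
        System.out.println(problem == null ? "Hadoop client setup looks OK" : problem);
    }
}
```

Running `main` on a machine without `HADOOP_HOME` only reports a problem on Windows; on other systems the check is skipped.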
Usage:
```Bash
# Syntax: parquet-manager [schema file] [data file in CSV format] [output file in Parquet format]
./parquet-manager.sh chicagocrimes.avsc chicagocrimes.csv chicagocrimes.parquet
```
You may download the sample open data from here: https://data.cityofchicago.org/api/views/ijzp-q8t2/rows.csv?accessType=DOWNLOAD
(Portal: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2)
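The schema file passed as the first argument is an Avro schema (`.avsc`). If you don't have one yet, a draft can be generated from the CSV header line, as in the sketch below. This is a hedged illustration rather than something parquet-manager provides: `AvscDraft` and its methods are hypothetical, every column is declared as a nullable string, the header is assumed to contain no quoted commas, and duplicate column names are not handled. Whether the tool matches columns by name or by position, and which types it supports, is not stated in this README, so review the result by hand.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not shipped with parquet-manager.
class AvscDraft {

    // Avro names must match [A-Za-z_][A-Za-z0-9_]*, so other characters become '_'.
    static String toAvroName(String column) {
        String name = column.trim().replaceAll("[^A-Za-z0-9_]", "_");
        if (name.isEmpty() || Character.isDigit(name.charAt(0))) {
            name = "_" + name;
        }
        return name;
    }

    // Builds a draft record schema in which every column is a nullable string.
    static String draftSchema(String recordName, String headerLine) {
        List<String> fields = new ArrayList<>();
        for (String column : headerLine.split(",")) {
            fields.add("    {\"name\": \"" + toAvroName(column)
                    + "\", \"type\": [\"null\", \"string\"], \"default\": null}");
        }
        return "{\n  \"type\": \"record\",\n  \"name\": \"" + toAvroName(recordName)
                + "\",\n  \"fields\": [\n" + String.join(",\n", fields) + "\n  ]\n}";
    }

    public static void main(String[] args) {
        // Illustrative header; replace it with the first line of your CSV file.
        System.out.println(draftSchema("ChicagoCrime", "ID,Case Number,Date"));
    }
}
```

Redirect the output to a file such as `chicagocrimes.avsc`, then tighten the types (for example `long` or `double`) where the data allows it.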