https://github.com/hannesmuehleisen/miniparquet

Library to read a subset of Parquet files

Host: GitHub
URL: https://github.com/hannesmuehleisen/miniparquet
Owner: hannes
License: other
Archived: true
Created: 2019-08-06T16:03:34.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2020-02-13T19:59:08.000Z (over 4 years ago)
Last Synced: 2024-05-20T06:09:31.001Z (6 months ago)
Topics: cpp, cpp11, dependency-free, parquet, parquet-cpp, parquet-files
Language: C++
Homepage:
Size: 485 KB
Stars: 43
Watchers: 5
Forks: 7
Open Issues: 6
Metadata Files:
- Readme: README.md
- License: LICENSE


# miniparquet

[![Travis](https://api.travis-ci.org/hannesmuehleisen/miniparquet.svg?branch=master)](https://travis-ci.org/hannesmuehleisen/miniparquet)

[![CRAN status](https://www.r-pkg.org/badges/version/miniparquet)](https://cran.r-project.org/package=miniparquet)

[![](http://cranlogs.r-pkg.org/badges/miniparquet)](https://dgrtwo.shinyapps.io/cranview/)

`miniparquet` is a reader for a common subset of Parquet files. miniparquet only supports rectangular-shaped data structures (no nested tables) and only the Snappy compression scheme. miniparquet has no (zero, none, 0) [external dependencies](https://research.swtch.com/deps) and is very lightweight. It compiles in seconds to a binary size of under 1 MB. 

## Installation

miniparquet comes as a C++ library, a Python package and an R package. Install the R package like so:

`devtools::install_github("hannesmuehleisen/miniparquet")` 

The C++ library can be built by typing `make`.

The Python package is installed using `python setup.py install`.

## Usage

Use the R package like so: `df <- miniparquet::parquet_read("example.parquet")` 

Folders of similar-structured Parquet files (e.g. produced by Spark) can be read like this: 

`df <- data.table::rbindlist(lapply(Sys.glob("some-folder/part-*.parquet"), miniparquet::parquet_read))`

If you find a file that should be supported but isn't, please open an issue here with a link to the file. 

Use the Python package like so: `miniparquet.read('example.parquet')`. You can convert the result to a pandas DataFrame like so: `pandas.DataFrame.from_dict(miniparquet.read('example.parquet'))`
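
The folder-reading idiom shown for R can be approximated in Python. The following is a minimal sketch, not part of the package: `read_parquet_folder` is a hypothetical helper, and it assumes that the reader returns a dict mapping column names to sequences of values, which is what the `from_dict` call suggests for `miniparquet.read`. The reader is passed in as a parameter so the helper itself depends only on the standard library.

```python
import glob
from typing import Callable, Dict, List, Optional, Sequence


def read_parquet_folder(
    pattern: str,
    reader: Callable[[str], Dict[str, Sequence]],
) -> Dict[str, List]:
    """Read every file matching `pattern` and concatenate the columns.

    `reader` maps a file path to a dict of column name -> values.
    That return shape is an assumption about miniparquet.read.
    """
    combined: Optional[Dict[str, List]] = None
    # Sort the paths so the row order is reproducible across runs.
    for path in sorted(glob.glob(pattern)):
        columns = reader(path)
        if combined is None:
            # The first file defines the expected set of columns.
            combined = {name: list(values) for name, values in columns.items()}
            continue
        if set(columns) != set(combined):
            raise ValueError(f"{path} has different columns than earlier files")
        for name, values in columns.items():
            combined[name].extend(values)
    return combined if combined is not None else {}


# Intended use (requires the miniparquet Python package):
#   import miniparquet
#   data = read_parquet_folder("some-folder/part-*.parquet", miniparquet.read)
```

The resulting dict can be passed to `pandas.DataFrame.from_dict` in the same way as the output of a single `miniparquet.read` call.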

## Performance

`miniparquet` is quite fast: on my laptop (i7-4578U) it can read compressed Parquet files at over 200 MB/s using only a single thread. Previously, there was a comparison with the arrow package here, but it appears that those results were caused by a bug that has since been fixed.
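
To check the read rate on your own hardware, a rough timing helper along these lines may be enough. This is a sketch under stated assumptions, not the method used to obtain the figure above: `read_throughput_mb_s` is a hypothetical name, throughput is computed from the on-disk (compressed) file size, and the best of a few repeats is reported to reduce noise from caching and other processes.

```python
import os
import time
from typing import Callable


def read_throughput_mb_s(
    path: str,
    reader: Callable[[str], object],
    repeats: int = 3,
) -> float:
    """Return the best observed read rate in MB/s (1 MB = 10**6 bytes).

    `reader` is any callable that takes a file path, for example
    miniparquet.read. The size used is the on-disk file size.
    """
    size_mb = os.path.getsize(path) / 1e6
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        reader(path)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero elapsed time on very coarse clocks.
    return size_mb / best if best > 0 else float("inf")


# Intended use (requires the miniparquet Python package):
#   import miniparquet
#   print(read_throughput_mb_s("example.parquet", miniparquet.read))
```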