https://github.com/mcaceresb/stata-parquet
Read and write parquet files from Stata
- Host: GitHub
- URL: https://github.com/mcaceresb/stata-parquet
- Owner: mcaceresb
- License: mit
- Created: 2018-10-30T22:18:16.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2023-10-22T21:35:24.000Z (about 2 years ago)
- Last Synced: 2025-04-03T18:52:55.988Z (8 months ago)
- Topics: arrow, parquet, stata
- Language: C++
- Size: 542 KB
- Stars: 23
- Watchers: 1
- Forks: 6
- Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: changelog.md
- License: LICENSE
README
stata-parquet
=============
Read and write parquet files from Stata (Linux/Unix only).
This package uses the [Apache Arrow](https://github.com/apache/arrow)
C++ library to read and write parquet files from Stata via plugins. It
is currently only available for Stata on Unix (Linux).
`version 0.6.5 22Oct2023` | [Installation](#installation) | [Usage](#usage) | [Examples](#examples)
Installation
------------
You first need to install:
- The Apache Arrow C++ library.
- The GNU Compiler Collection.
- The Boost C++ libraries.
- Google's logging library (google-glog).
### Installation with Conda
First, install Google's logging library: `libgoogle-glog-dev` on Ubuntu, `google-glog` on Arch (you may have to link `libglog.so` to `libglog.so.0`), and so on. Then the only tested way to install this software is via `conda` (see [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) for installation instructions; the most recent plugin build and tests used Miniconda3 for Python 3.8, version `23.3.1`):
```bash
# Clone the repository and set up the Conda build environment
git clone https://github.com/mcaceresb/stata-parquet
cd stata-parquet
conda env create -f environment.yml
conda activate stata-parquet

# Build the plugin against the Conda-provided toolchain and libraries
make SPI=3.0 GCC=${CONDA_PREFIX}/bin/x86_64-conda_cos6-linux-gnu-g++ UFLAGS=-std=c++11 INCLUDE=${CONDA_PREFIX}/include LIBS=${CONDA_PREFIX}/lib all

# Install the package into Stata from the local build directory
stata -b "net install parquet, from(${PWD}/build) replace"
rm -f stata.log
```
Note: If you have Stata 14.0 or earlier, use `SPI=2.0` instead.
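For example, assuming the same Conda setup as above, the build step for Stata 14.0 or earlier would change only the `SPI` value:

```bash
# Same make invocation as above, but targeting the Stata Plugin Interface 2.0
make SPI=2.0 GCC=${CONDA_PREFIX}/bin/x86_64-conda_cos6-linux-gnu-g++ UFLAGS=-std=c++11 INCLUDE=${CONDA_PREFIX}/include LIBS=${CONDA_PREFIX}/lib all
```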
Warning: The plugin uses a possibly dated version of parquet (specifically `parquet-cpp` version `1.5.1` and `arrow-cpp` version `0.14.1`).
Usage
-----
### Usage with Conda
Activate the Conda environment with
```bash
conda activate stata-parquet
```
Then be sure to start Stata via
```bash
LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:$LD_LIBRARY_PATH xstata
```
Alternatively, add the following line to your `~/.bashrc` so you do not
have to set `LD_LIBRARY_PATH` every time (make sure to replace
`${CONDA_PREFIX}` with the absolute path it expands to):
```bash
export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:$LD_LIBRARY_PATH
```
Then just start Stata with
```bash
xstata
```
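If you prefer a single command, a small wrapper script can combine these steps. The sketch below is a hypothetical convenience script, not part of this package:

```bash
#!/usr/bin/env bash
# Hypothetical launcher: activate the Conda environment, expose its libraries,
# and start the Stata GUI. Adjust the environment name and Stata binary as needed.
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate stata-parquet
export LD_LIBRARY_PATH="${CONDA_PREFIX}/lib:${LD_LIBRARY_PATH}"
exec xstata "$@"
```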
### Examples
`parquet save` and `parquet use` will save and load datasets in Parquet
format, respectively. `parquet desc` will describe the contents of a
parquet dataset. For example:
```stata
* Save the auto dataset in Parquet format, describe it, and read it back
sysuse auto, clear
parquet save auto.parquet, replace
parquet desc auto.parquet
parquet use auto.parquet, clear
desc

* Read a subset of columns and rows; save a subset of columns with in/if
parquet use price make gear_ratio using auto.parquet, clear in(10/20)
parquet save gear_ratio make using auto.parquet in 5/6 if price > 5000, replace
```
Note that the `if` clause is not supported by `parquet use` (a workaround
sketch is given below). To test that the plugin works as expected, run
`do build/parquet_tests.do` from Stata. To also test that the plugin
correctly reads `hive`-format datasets, run

```bash
conda install -n stata-parquet pandas numpy fastparquet
conda activate stata-parquet
```

Then, from Stata, run `do build/parquet_tests.do python`.
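Since `parquet use` has no `if` option, a simple workaround is to restrict rows with `in()` where possible, or to load the file and then subset with a regular `if` in Stata. A minimal sketch, reusing the `auto.parquet` file created above:

```stata
* Load the full file (or an in() range), then drop unwanted rows afterwards
parquet use auto.parquet, clear
keep if price > 5000
```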
Limitations
-----------
- Writing `strL` variables is not yet supported.
- Reading binary ByteArray data is not supported, only strings.
- `Int96` variables are not supported, as they have no direct Stata counterpart.
- Maximum string widths are not generally stored in `.parquet` files (as
  far as I can tell). The default behavior is to scan string columns
  to find the longest string, but this can be time-intensive. Adjust this
  behavior via `strscan()` and `strbuffer()`; see the sketch after this list.
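A hedged sketch of that last point, assuming `strscan()` takes the number of rows to scan and `strbuffer()` a fallback string width (check the package help file for the exact syntax):

```stata
* Hypothetical usage: scan only the first 1,000 rows for string widths and
* fall back to a 64-character buffer otherwise (option semantics assumed)
parquet use strings.parquet, clear strscan(1000) strbuffer(64)
```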
TODO
----
Some features that ought to be implemented:
- [ ] Option `skip` for columns that are in non-readable formats?
- [X] Write regular missing values (high-level only).
Some features that might not be implementable, but the user should be
warned about them:
- [X] Extended missing values (user gets a warning).
- [ ] `strL` variables
- [ ] Variable formats
- [ ] Variable labels
- [ ] Value labels
- [ ] Dataset notes
- [ ] Variable characteristics
- [ ] ByteArray or FixedLenByteArray with binary data.
Improve:
- [ ] Boolean format to/from Stata.
- [ ] Best way to transpose from column order to row order.
License
-------
`stata-parquet` is [MIT-licensed](https://github.com/mcaceresb/stata-parquet/blob/master/LICENSE).