https://github.com/mcaceresb/stata-parquet

Read and write parquet files from Stata

Host: GitHub
URL: https://github.com/mcaceresb/stata-parquet
Owner: mcaceresb
License: mit
Created: 2018-10-30T22:18:16.000Z (about 7 years ago)
Default Branch: master
Last Pushed: 2023-10-22T21:35:24.000Z (about 2 years ago)
Last Synced: 2025-04-03T18:52:55.988Z (8 months ago)
Topics: arrow, parquet, stata
Language: C++
Size: 542 KB
Stars: 23
Watchers: 1
Forks: 6
Open Issues: 8
Metadata Files:
- Readme: README.md
- Changelog: changelog.md
- License: LICENSE

stata-parquet
=============

Read and write parquet files from Stata (Linux/Unix only).

This package uses the [Apache Arrow](https://github.com/apache/arrow)
C++ library to read and write parquet files from Stata using plugins.
Currently this package is only available in Stata for Unix (Linux).

`version 0.6.5 22Oct2023` | [Installation](#installation) | [Usage](#usage) | [Examples](#examples)

Installation
------------

You first need to install:

- The Apache Arrow C++ library.
- The GNU Compiler Collection.
- The Boost C++ libraries.
- Google's logging library (google-glog).

### Installation with Conda

First, install Google's logging library: `libgoogle-glog-dev` in Ubuntu, `google-glog` in Arch (you may have to link `libglog.so` to `libglog.so.0`), and so on. Then the only tested way to install this software is via `conda` (see [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html) for installation instructions; the most recent plugin installation and tests were conducted using Miniconda3 for Python 3.8, version `23.3.1`):

```bash
git clone https://github.com/mcaceresb/stata-parquet
cd stata-parquet
conda env create -f environment.yml
conda activate stata-parquet
make SPI=3.0 GCC=${CONDA_PREFIX}/bin/x86_64-conda_cos6-linux-gnu-g++ UFLAGS=-std=c++11 INCLUDE=${CONDA_PREFIX}/include LIBS=${CONDA_PREFIX}/lib all
stata -b "net install parquet, from(${PWD}/build) replace"
rm -f stata.log
```

Note: If you have Stata 14.0 or earlier you will want to use `SPI=2.0` instead.
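
As a rough illustration of the note above, the choice of `SPI` value could be scripted. This is only a sketch: the helper name `spi_for_stata_version` is hypothetical and not part of this package, and it assumes you pass the Stata version as a plain number such as `14.0` or `17`.

```bash
# Hypothetical helper (not part of stata-parquet): pick the SPI value
# to pass to make, following the note that Stata 14.0 or earlier
# needs SPI=2.0 while later versions use SPI=3.0.
spi_for_stata_version() {
    # awk does the floating-point comparison that POSIX sh lacks.
    awk -v v="$1" 'BEGIN { if (v + 0 <= 14.0) print "2.0"; else print "3.0" }'
}

spi_for_stata_version 14.0   # prints 2.0
spi_for_stata_version 17     # prints 3.0
```

The result could then be passed to the build, e.g. `make SPI=$(spi_for_stata_version 17) ...` with the remaining arguments as in the commands above.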

Warning: The plugin uses a possibly dated version of parquet (specifically `parquet-cpp` version `1.5.1` and `arrow-cpp` version `0.14.1`).

Usage
-----

### Usage with Conda

Activate the Conda environment with

```
conda activate stata-parquet
```

Then be sure to start Stata via

```bash
LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:$LD_LIBRARY_PATH xstata
```

Alternatively, you could add the following line to your `~/.bashrc` to not have
to enter the `LD_LIBRARY_PATH` every time (make sure to replace
`${CONDA_PREFIX}` with the absolute path it represents):

```bash
export LD_LIBRARY_PATH=${CONDA_PREFIX}/lib:$LD_LIBRARY_PATH
```

Then just start Stata with

```
xstata
```
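
One caveat with the commands above: if `LD_LIBRARY_PATH` is initially empty, `${CONDA_PREFIX}/lib:$LD_LIBRARY_PATH` expands to a value with a trailing colon, and the dynamic loader treats an empty entry as the current directory. A minimal sketch of a way around this, using a hypothetical helper `prepend_path` (not part of this package):

```bash
# Hypothetical helper: prepend a directory to a colon-separated search
# path without leaving an empty entry when the existing path is empty.
prepend_path() {
    dir=$1
    current=$2
    # ${current:+:$current} expands to ":$current" only if current is non-empty.
    printf '%s%s\n' "$dir" "${current:+:$current}"
}

prepend_path /opt/conda/envs/stata-parquet/lib ""          # no trailing colon
prepend_path /opt/conda/envs/stata-parquet/lib /usr/lib    # joined with a colon
```

With this in place, Stata could be started via `LD_LIBRARY_PATH=$(prepend_path "${CONDA_PREFIX}/lib" "${LD_LIBRARY_PATH:-}") xstata`. The `/opt/conda/...` path is a placeholder for illustration, not a path the package requires.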

### Examples

`parquet save` and `parquet use` will save and load datasets in Parquet
format, respectively. `parquet desc` will describe the contents of a
parquet dataset. For example:

```stata
sysuse auto, clear
parquet save auto.parquet, replace
parquet desc auto.parquet
parquet use auto.parquet, clear
desc
parquet use price make gear_ratio using auto.parquet, clear in(10/20)
parquet save gear_ratio make using auto.parquet in 5/6 if price > 5000, replace
```

Note that the `if` clause is not supported by `parquet use`. To test that the
plugin works as expected, run `do build/parquet_tests.do` from Stata. To
also test that the plugin correctly reads `hive` format datasets, run

```
conda install -n stata-parquet pandas numpy fastparquet
conda activate stata-parquet
```

Then, from Stata, run `do build/parquet_tests.do python`.
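
The test do-file can also be run non-interactively, in the same way the installation step calls `stata -b`. In batch mode Stata writes its output to a log file named after the do-file (here `parquet_tests.log` in the working directory), so one has to inspect the log to see whether anything failed. A minimal sketch, assuming errors show up in the log as a return-code line such as `r(198);`; the helper name `stata_log_has_error` is hypothetical:

```bash
# Hypothetical helper: succeed (exit status 0) if a Stata log contains
# an error return-code line such as "r(111);".
stata_log_has_error() {
    grep -Eq '^r\([0-9]+\);' "$1"
}

# Self-contained demonstration on a made-up log file.
demo_log=$(mktemp)
printf 'variable foo not found\nr(111);\n' > "$demo_log"
if stata_log_has_error "$demo_log"; then
    echo "errors found"
else
    echo "no errors found"
fi
rm -f "$demo_log"
```

In practice this might follow `stata -b do build/parquet_tests.do`, checking `parquet_tests.log` instead of the made-up file.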

Limitations
-----------

- Writing `strL` variables is not yet supported.
- Reading binary ByteArray data is not supported, only strings.
- `Int96` variables are not supported, as the type has no direct Stata counterpart.
- Maximum string widths are not generally stored in `.parquet` files (as
  far as I can tell). The default behavior is to scan string columns
  to get the largest string, which can be time-intensive. Adjust this
  behavior via `strscan()` and `strbuffer()`.

TODO
----

Some features that ought to be implemented:

- [ ] Option `skip` for columns that are in non-readable formats?

- [X] Write regular missing values (high-level only).

Some features that might not be implementable, but the user should be
warned about them:

- [X] Extended missing values (user gets a warning).

- [ ] `strL` variables

- [ ] Variable formats

- [ ] Variable labels

- [ ] Value labels

- [ ] Dataset notes

- [ ] Variable characteristics

- [ ] ByteArray or FixedLenByteArray with binary data.

Improve:

- [ ] Boolean format to/from Stata.

- [ ] Best way to transpose from column order to row order.

License
-------

`stata-parquet` is [MIT-licensed](https://github.com/mcaceresb/stata-parquet/blob/master/LICENSE).
