# WASM Parquet [![npm version](https://img.shields.io/npm/v/parquet-wasm.svg)](https://www.npmjs.com/package/parquet-wasm)

WebAssembly bindings to read and write the [Apache Parquet](https://parquet.apache.org/) format to and from [Apache Arrow](https://arrow.apache.org/) using the Rust [`parquet`](https://crates.io/crates/parquet) and [`arrow`](https://crates.io/crates/arrow) crates.

This is designed to be used alongside a JavaScript Arrow implementation, such as the canonical [JS Arrow library](https://arrow.apache.org/docs/js/).

Including read and write support and all compression codecs, the brotli-compressed WASM bundle is 1.2 MB. Refer to [custom builds](#custom-builds) for how to build a smaller bundle. A minimal read-only bundle without compression support can be as small as 456 KB brotli-compressed.

## Install

`parquet-wasm` is published to NPM. Install with

```
yarn add parquet-wasm
```

or

```
npm install parquet-wasm
```

## API

Parquet-wasm has both a synchronous and asynchronous API. The sync API is simpler but requires fetching the entire Parquet buffer in advance, which is often prohibitive.

### Sync API

Refer to these functions (a usage sketch follows the list):

- [`readParquet`](https://kylebarron.dev/parquet-wasm/functions/esm_parquet_wasm.readParquet.html): Read a Parquet file synchronously.
- [`readSchema`](https://kylebarron.dev/parquet-wasm/functions/esm_parquet_wasm.readSchema.html): Read an Arrow schema from a Parquet file synchronously.
- [`writeParquet`](https://kylebarron.dev/parquet-wasm/functions/esm_parquet_wasm.writeParquet.html): Write a Parquet file synchronously.
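
For example, a minimal sketch of the sync read path; the URL here is a placeholder, and the result of `readSchema` is used only opaquely:

```js
import wasmInit, { readParquet, readSchema } from "parquet-wasm";

await wasmInit();

// The sync API needs the complete Parquet buffer up front
const resp = await fetch("https://example.com/file.parquet");
const parquetBytes = new Uint8Array(await resp.arrayBuffer());

// Inspect the Arrow schema without decoding the data
const wasmSchema = readSchema(parquetBytes);

// Decode the file into an Arrow table held in WebAssembly memory
const wasmTable = readParquet(parquetBytes);
```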

### Async API

- [`readParquetStream`](https://kylebarron.dev/parquet-wasm/functions/esm_parquet_wasm.readParquetStream.html): Create a [ReadableStream](https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream) that emits Arrow RecordBatches from a Parquet file.
- [`ParquetFile`](https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.ParquetFile.html): A class for reading portions of a remote Parquet file. Use [`fromUrl`](https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.ParquetFile.html#fromUrl) to construct from a remote URL or [`fromFile`](https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.ParquetFile.html#fromFile) to construct from a [`File`](https://developer.mozilla.org/en-US/docs/Web/API/File) handle. Note that when you're done using this class, you'll need to call [`free`](https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.ParquetFile.html#free) to release any memory held by the ParquetFile instance itself. See the sketch below.
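
A minimal sketch of both async entry points, assuming a placeholder URL whose server supports HTTP range requests (and, for the streaming loop, a runtime where `ReadableStream` is async iterable):

```js
import wasmInit, { ParquetFile, readParquetStream } from "parquet-wasm";

await wasmInit();

// Read via range requests without downloading the whole file first
const file = await ParquetFile.fromUrl("https://example.com/file.parquet");
const wasmTable = await file.read();
file.free(); // release the memory held by the ParquetFile itself

// Or stream Arrow RecordBatches as they're decoded
const stream = await readParquetStream("https://example.com/file.parquet");
for await (const wasmRecordBatch of stream) {
  // process each batch here
}
```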

Both sync and async functions return or accept a [`Table`](https://kylebarron.dev/parquet-wasm/classes/bundler_parquet_wasm.Table.html) class, an Arrow table in WebAssembly memory. Refer to its documentation for moving data into/out of WebAssembly.

## Entry Points

| Entry point | Description | Documentation |
| ------------------------------------------------------------------------- | ------------------------------------------------------- | -------------------- |
| `parquet-wasm`, `parquet-wasm/esm`, or `parquet-wasm/esm/parquet_wasm.js` | ESM, to be used directly from the Web as an ES Module | [Link][esm-docs] |
| `parquet-wasm/bundler` | "Bundler" build, to be used in bundlers such as Webpack | [Link][bundler-docs] |
| `parquet-wasm/node` | Node build, to be used with synchronous `require` in NodeJS | [Link][node-docs] |

[bundler-docs]: https://kylebarron.dev/parquet-wasm/modules/bundler_parquet_wasm.html
[node-docs]: https://kylebarron.dev/parquet-wasm/modules/node_parquet_wasm.html
[esm-docs]: https://kylebarron.dev/parquet-wasm/modules/esm_parquet_wasm.html

### ESM

The `esm` entry point is the primary entry point. It is the default export from `parquet-wasm`, and is also accessible at `parquet-wasm/esm` and `parquet-wasm/esm/parquet_wasm.js` (for symmetric imports [directly from a browser](#using-directly-from-a-browser)).

**Note that when using the `esm` bundles, you must manually initialize the WebAssembly module before using any APIs**. Otherwise, you'll get an error `TypeError: Cannot read properties of undefined`. There are multiple ways to initialize the WebAssembly code:

#### Asynchronous initialization

The primary way to initialize is by awaiting the default export.

```js
import wasmInit, {readParquet} from "parquet-wasm";

await wasmInit();
```

Called with no arguments, this will attempt to fetch a file named `parquet_wasm_bg.wasm` from the same location as the `parquet-wasm` JavaScript module (i.e. the equivalent of `new URL('parquet_wasm_bg.wasm', import.meta.url)`).

Note that you can also pass in a custom URL if you want to host the `.wasm` file on your own servers.

```js
import wasmInit, {readParquet} from "parquet-wasm";

// Update this version to match the version you're using.
const wasmUrl = "https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/esm/parquet_wasm_bg.wasm";
await wasmInit(wasmUrl);
```

#### Synchronous initialization

The `initSync` named export allows initializing synchronously from an existing buffer:

```js
import {initSync, readParquet} from "parquet-wasm";

// The contents of esm/parquet_wasm_bg.wasm in an ArrayBuffer
const wasmBuffer = new ArrayBuffer(...);

// Initialize the Wasm synchronously
initSync(wasmBuffer);
```

Async initialization should be preferred over downloading the Wasm buffer and then initializing it synchronously, as [`WebAssembly.instantiateStreaming`](https://developer.mozilla.org/en-US/docs/WebAssembly/JavaScript_interface/instantiateStreaming_static) is the most efficient way to both download and initialize Wasm code.
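
If you need a customized request for a self-hosted `.wasm` asset, you can usually keep the streaming path by passing the in-flight `fetch` itself to `wasmInit`; recent wasm-bindgen init functions accept a `Response` (or a promise of one) in addition to a URL. A sketch, with a hypothetical asset path:

```js
import wasmInit from "parquet-wasm";

// Passing the fetch promise (rather than awaiting the bytes and calling
// initSync) lets WebAssembly.instantiateStreaming compile while downloading.
await wasmInit(fetch("/assets/parquet_wasm_bg.wasm"));
```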

### Bundler

The `bundler` entry point doesn't require manual initialization of the WebAssembly blob, but needs setup with whatever bundler you're using. [Refer to the Rust Wasm documentation for more info](https://rustwasm.github.io/docs/wasm-bindgen/reference/deployment.html#bundlers).
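
As one example, a minimal sketch of a Webpack 5 configuration; the exact option is an assumption about your setup, so consult the linked docs for your bundler:

```js
// webpack.config.js (sketch for Webpack 5)
module.exports = {
  experiments: {
    // Allow importing the .wasm module emitted by the bundler build
    asyncWebAssembly: true,
  },
};
```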

### Node

The `node` entry point can be loaded synchronously from Node.

```js
const {readParquet} = require("parquet-wasm");

const wasmTable = readParquet(...);
```

### Using directly from a browser

You can load the `esm/parquet_wasm.js` file directly from a CDN:

```js
const parquet = await import(
"https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/esm/+esm"
)
await parquet.default();

const wasmTable = parquet.readParquet(...);
```

This specific endpoint will minify the ESM before you receive it.

### Debug functions

These functions are not present in normal builds to cut down on bundle size. To create a custom build, see [Custom Builds](#custom-builds) below.

#### `setPanicHook`

`setPanicHook(): void`

Sets [`console_error_panic_hook`](https://github.com/rustwasm/console_error_panic_hook) in Rust, which provides better debugging of panics by having more informative `console.error` messages. Initialize this first if you're getting errors such as `RuntimeError: Unreachable executed`.

The WASM bundle must be compiled with the `console_error_panic_hook` feature for this function to exist.
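
A minimal sketch of enabling it in such a custom build, using the `esm`-style initialization from above:

```js
import wasmInit, { setPanicHook } from "parquet-wasm";

await wasmInit();

// Only exists in builds compiled with the `debug` feature; panics will now
// log informative messages via console.error instead of "Unreachable executed"
setPanicHook();
```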

## Example

```js
import * as arrow from "apache-arrow";
import initWasm, {
  Compression,
  readParquet,
  Table,
  writeParquet,
  WriterPropertiesBuilder,
} from "parquet-wasm";

// Instantiate the WebAssembly context
await initWasm();

// Create Arrow Table in JS
const LENGTH = 2000;
const rainAmounts = Float32Array.from({ length: LENGTH }, () =>
  Number((Math.random() * 20).toFixed(1))
);

const rainDates = Array.from(
  { length: LENGTH },
  (_, i) => new Date(Date.now() - 1000 * 60 * 60 * 24 * i)
);

const rainfall = arrow.tableFromArrays({
  precipitation: rainAmounts,
  date: rainDates,
});

// Write Arrow Table to Parquet

// wasmTable is an Arrow table in WebAssembly memory
const wasmTable = Table.fromIPCStream(arrow.tableToIPC(rainfall, "stream"));
const writerProperties = new WriterPropertiesBuilder()
  .setCompression(Compression.ZSTD)
  .build();
const parquetUint8Array = writeParquet(wasmTable, writerProperties);

// Read Parquet buffer back to Arrow Table
// arrowWasmTable is an Arrow table in WebAssembly memory
const arrowWasmTable = readParquet(parquetUint8Array);

// table is now an Arrow table in JS memory
const table = arrow.tableFromIPC(arrowWasmTable.intoIPCStream());
console.log(table.schema.toString());
// Schema<{ 0: precipitation: Float32, 1: date: Date64 }>
```

### Published examples

(These may use older versions of the library with a different API).

- [GeoParquet on the Web (Observable)](https://observablehq.com/@kylebarron/geoparquet-on-the-web)
- [Hello, Parquet-WASM (Observable)](https://observablehq.com/@bmschmidt/hello-parquet-wasm)

## Performance considerations

Tl;dr: When you have a `Table` object (resulting from `readParquet`), try the new
[`Table.intoFFI`](https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.Table.html#intoFFI)
API to move it to JavaScript memory. This API is less well tested than the [`Table.intoIPCStream`](https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.Table.html#intoIPCStream) API, but should be
faster and have **much** less memory overhead (by a factor of 2). If you hit any bugs, please
[create a reproducible issue](https://github.com/kylebarron/parquet-wasm/issues/new).

Under the hood, `parquet-wasm` first decodes a Parquet file into Arrow _in WebAssembly memory_. But
then that WebAssembly memory needs to be copied into JavaScript for use by Arrow JS. The "normal"
conversion APIs (e.g. `Table.intoIPCStream`) use the [Arrow IPC
format](https://arrow.apache.org/docs/python/ipc.html) to get the data back to JavaScript. But this
requires another memory copy _inside WebAssembly_ to assemble the various arrays into a single
buffer to be copied back to JS.

Instead, the new `Table.intoFFI` API uses Arrow's [C Data
Interface](https://arrow.apache.org/docs/format/CDataInterface.html) to be able to copy or view
Arrow arrays from within WebAssembly memory without any serialization.

Note that this approach uses the [`arrow-js-ffi`](https://github.com/kylebarron/arrow-js-ffi)
library to parse the Arrow C Data Interface definitions. This library has not yet been tested in
production, so it may have bugs!

I wrote an [interactive blog
post](https://observablehq.com/@kylebarron/zero-copy-apache-arrow-with-webassembly) on this approach
and the Arrow C Data Interface if you want to read more!

### Example

```ts
import * as arrow from "apache-arrow";
import { parseTable } from "arrow-js-ffi";
import initWasm, { wasmMemory, readParquet } from "parquet-wasm";

// Instantiate the WebAssembly context
await initWasm();

// A reference to the WebAssembly memory object.
const WASM_MEMORY = wasmMemory();

const resp = await fetch("https://example.com/file.parquet");
const parquetUint8Array = new Uint8Array(await resp.arrayBuffer());
const wasmArrowTable = readParquet(parquetUint8Array).intoFFI();

// Arrow JS table that was directly copied from Wasm memory
const table: arrow.Table = parseTable(
  WASM_MEMORY.buffer,
  wasmArrowTable.arrayAddrs(),
  wasmArrowTable.schemaAddr()
);

// VERY IMPORTANT! You must call `drop` on the Wasm table object when you're done using it
// to release the Wasm memory.
// Note that any access to the pointers in this table is undefined behavior after this call.
// Calling any `wasmArrowTable` method will error.
wasmArrowTable.drop();
```

## Compression support

The Parquet specification permits several compression codecs. This library currently supports:

- [x] Uncompressed
- [x] Snappy
- [x] Gzip
- [x] Brotli
- [x] ZSTD
- [x] LZ4_RAW
- [ ] LZ4 (deprecated)

LZ4 support in Parquet is a bit messy. As described [here](https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/Compression.md), there are _two_ LZ4 compression options in Parquet (as of version 2.9.0). The original version `LZ4` is now deprecated; it used an undocumented framing scheme which made interoperability difficult. The specification now reads:

> It is strongly suggested that implementors of Parquet writers deprecate this compression codec in their user-facing APIs, and advise users to switch to the newer, interoperable `LZ4_RAW` codec.

It's currently unknown how widespread ecosystem support for `LZ4_RAW` is. As of v7, `pyarrow` writes `LZ4_RAW` by default and presumably reads it as well.
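
If you're writing Parquet with this library, a sketch of opting into the interoperable codec (assuming the `Compression` enum exposes an `LZ4_RAW` member, matching the support list above; `wasmTable` is a `Table` as in the example earlier):

```js
import { Compression, WriterPropertiesBuilder, writeParquet } from "parquet-wasm";

// Prefer the interoperable LZ4_RAW codec over the deprecated LZ4 one
const writerProperties = new WriterPropertiesBuilder()
  .setCompression(Compression.LZ4_RAW)
  .build();
const parquetBytes = writeParquet(wasmTable, writerProperties);
```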

## Custom builds

In some cases, you may know ahead of time that your Parquet files will only include a single compression codec, say Snappy, or even no compression at all. In these cases, you may want to create a custom build of `parquet-wasm` to keep bundle size at a minimum. If you install the Rust toolchain and `wasm-pack` (see [Development](DEVELOP.md)), you can create a custom build with only the compression codecs you require.

The minimum supported Rust version in this project is 1.60. To upgrade your toolchain, use `rustup update stable`.

### Example custom builds

Reader-only bundle with Snappy compression:

```
wasm-pack build --no-default-features --features snappy --features reader
```

Writer-only bundle with no compression support, targeting Node:

```
wasm-pack build --target nodejs --no-default-features --features writer
```

Bundle with reader and writer support, targeting Node, using `arrow` and `parquet` crates with all their supported compressions, with `console_error_panic_hook` enabled:

```bash
wasm-pack build \
  --target nodejs \
  --no-default-features \
  --features reader \
  --features writer \
  --features all_compressions \
  --features debug
# Or, since the default features already include several of these, a shorter equivalent:
wasm-pack build --target nodejs --features debug
```

Refer to the [`wasm-pack` documentation](https://rustwasm.github.io/docs/wasm-pack/commands/build.html) for more info on flags such as `--release`, `--dev`, and `--target`, and to the [Cargo documentation](https://doc.rust-lang.org/cargo/reference/features.html) for more info on how to use features.

### Available features

By default, `all_compressions`, `reader`, `writer`, and `async` features are enabled. Use `--no-default-features` to remove these defaults.

- `reader`: Activate read support.
- `writer`: Activate write support.
- `async`: Activate asynchronous read support.
- `all_compressions`: Activate all supported compressions.
- `brotli`: Activate Brotli compression.
- `gzip`: Activate Gzip compression.
- `snappy`: Activate Snappy compression.
- `zstd`: Activate ZSTD compression.
- `lz4`: Activate LZ4_RAW compression.
- `debug`: Expose the `setPanicHook` function for better error messages for Rust panics.

## Node <20

On Node versions before 20, you'll have to [polyfill the Web Cryptography API](https://docs.rs/getrandom/latest/getrandom/#nodejs-es-module-support).
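
A sketch of such a polyfill for CommonJS, following the approach described in the linked `getrandom` docs:

```js
// Run this before loading parquet-wasm on Node versions before 20
const { webcrypto } = require("node:crypto");
if (!globalThis.crypto) globalThis.crypto = webcrypto;
```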

## Future work

- [ ] Example of pushdown predicate filtering, to download only chunks that match a specific condition
- [ ] Column filtering, to download only certain columns
- [ ] More tests

## Acknowledgements

A starting point of my work came from @my-liminal-space's [`read-parquet-browser`](https://github.com/my-liminal-space/read-parquet-browser) (which is also dual licensed MIT and Apache 2).

@domoritz's [`arrow-wasm`](https://github.com/domoritz/arrow-wasm) was a very helpful reference for bootstrapping Rust-WASM bindings.