{"id":18174818,"url":"https://github.com/kylebarron/parquet-wasm","last_synced_at":"2025-05-14T11:11:24.046Z","repository":{"id":37935448,"uuid":"464268599","full_name":"kylebarron/parquet-wasm","owner":"kylebarron","description":"Rust-based WebAssembly bindings to read and write Apache Parquet data","archived":false,"fork":false,"pushed_at":"2025-05-12T21:04:31.000Z","size":2798,"stargazers_count":583,"open_issues_count":24,"forks_count":20,"subscribers_count":6,"default_branch":"main","last_synced_at":"2025-05-12T23:53:24.185Z","etag":null,"topics":["apache-arrow","apache-parquet","arrow","javascript","parquet","rust","wasm","webassembly"],"latest_commit_sha":null,"homepage":"https://kylebarron.dev/parquet-wasm/","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kylebarron.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE_APACHE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2022-02-27T22:25:25.000Z","updated_at":"2025-05-12T11:53:45.000Z","dependencies_parsed_at":"2024-02-08T04:34:47.930Z","dependency_job_id":"dea127ed-7409-4e86-bfa8-ba9e62dea574","html_url":"https://github.com/kylebarron/parquet-wasm","commit_stats":{"total_commits":343,"total_committers":8,"mean_commits":42.875,"dds":0.5160349854227405,"last_synced_commit":"ef361e7662a8c0c2b401f21325545348d74f88cb"},"previous_names":[],"tags_count":11,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kylebarron%2Fparquet-wasm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kylebarron%2F
parquet-wasm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kylebarron%2Fparquet-wasm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kylebarron%2Fparquet-wasm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kylebarron","download_url":"https://codeload.github.com/kylebarron/parquet-wasm/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253843185,"owners_count":21972870,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache-arrow","apache-parquet","arrow","javascript","parquet","rust","wasm","webassembly"],"created_at":"2024-11-02T16:07:51.898Z","updated_at":"2025-05-14T11:11:19.034Z","avatar_url":"https://github.com/kylebarron.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# WASM Parquet [![npm version](https://img.shields.io/npm/v/parquet-wasm.svg)](https://www.npmjs.com/package/parquet-wasm)\n\nWebAssembly bindings to read and write the [Apache Parquet](https://parquet.apache.org/) format to and from [Apache Arrow](https://arrow.apache.org/) using the Rust [`parquet`](https://crates.io/crates/parquet) and [`arrow`](https://crates.io/crates/arrow) crates.\n\nThis is designed to be used alongside a JavaScript Arrow implementation, such as the canonical [JS Arrow library](https://arrow.apache.org/docs/js/).\n\nIncluding read and write support and all compression codecs, the brotli-compressed WASM bundle is 1.2 MB. 
Refer to [custom builds](#custom-builds) for how to build a smaller bundle. A minimal read-only bundle without compression support can be as small as 456 KB brotli-compressed.\n\n## Install\n\n`parquet-wasm` is published to NPM. Install with\n\n```\nyarn add parquet-wasm\n```\n\nor\n\n```\nnpm install parquet-wasm\n```\n\n## API\n\nParquet-wasm has both a synchronous and asynchronous API. The sync API is simpler but requires fetching the entire Parquet buffer in advance, which is often prohibitive.\n\n### Sync API\n\nRefer to these functions:\n\n- [`readParquet`](https://kylebarron.dev/parquet-wasm/functions/esm_parquet_wasm.readParquet.html): Read a Parquet file synchronously.\n- [`readSchema`](https://kylebarron.dev/parquet-wasm/functions/esm_parquet_wasm.readSchema.html): Read an Arrow schema from a Parquet file synchronously.\n- [`writeParquet`](https://kylebarron.dev/parquet-wasm/functions/esm_parquet_wasm.writeParquet.html): Write a Parquet file synchronously.\n\n### Async API\n\n- [`readParquetStream`](https://kylebarron.dev/parquet-wasm/functions/esm_parquet_wasm.readParquetStream.html): Create a [ReadableStream](https://developer.mozilla.org/en-US/docs/Web/API/ReadableStream) that emits Arrow RecordBatches from a Parquet file.\n- [`ParquetFile`](https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.ParquetFile.html): A class for reading portions of a remote Parquet file. Use [`fromUrl`](https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.ParquetFile.html#fromUrl) to construct from a remote URL or [`fromFile`](https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.ParquetFile.html#fromFile) to construct from a [`File`](https://developer.mozilla.org/en-US/docs/Web/API/File) handle. 
Note that when you're done using this class, you'll need to call [`free`](https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.ParquetFile.html#free) to release any memory held by the ParquetFile instance itself.\n\n\nBoth sync and async functions return or accept a [`Table`](https://kylebarron.dev/parquet-wasm/classes/bundler_parquet_wasm.Table.html) class, an Arrow table in WebAssembly memory. Refer to its documentation for moving data into/out of WebAssembly.\n\n## Entry Points\n\n\n| Entry point                                                               | Description                                             | Documentation        |\n| ------------------------------------------------------------------------- | ------------------------------------------------------- | -------------------- |\n| `parquet-wasm`, `parquet-wasm/esm`, or `parquet-wasm/esm/parquet_wasm.js` | ESM, to be used directly from the Web as an ES Module   | [Link][esm-docs]     |\n| `parquet-wasm/bundler`                                                    | \"Bundler\" build, to be used in bundlers such as Webpack | [Link][bundler-docs] |\n| `parquet-wasm/node`                                                       | Node build, to be used with synchronous `require` in NodeJS         | [Link][node-docs]    |\n\n[bundler-docs]: https://kylebarron.dev/parquet-wasm/modules/bundler_parquet_wasm.html\n[node-docs]: https://kylebarron.dev/parquet-wasm/modules/node_parquet_wasm.html\n[esm-docs]: https://kylebarron.dev/parquet-wasm/modules/esm_parquet_wasm.html\n\n### ESM\n\nThe `esm` entry point is the primary entry point. It is the default export from `parquet-wasm`, and is also accessible at `parquet-wasm/esm` and `parquet-wasm/esm/parquet_wasm.js` (for symmetric imports [directly from a browser](#using-directly-from-a-browser)).\n\n**Note that when using the `esm` bundles, you must manually initialize the WebAssembly module before using any APIs**. 
Otherwise, you'll get an error `TypeError: Cannot read properties of undefined`. There are multiple ways to initialize the WebAssembly code:\n\n#### Asynchronous initialization\n\nThe primary way to initialize is by awaiting the default export.\n\n```js\nimport wasmInit, {readParquet} from \"parquet-wasm\";\n\nawait wasmInit();\n```\n\nWithout any parameter, this will try to fetch a file named `'parquet_wasm_bg.wasm'` at the same location as `parquet-wasm`. (E.g. this snippet `input = new URL('parquet_wasm_bg.wasm', import.meta.url);`).\n\nNote that you can also pass in a custom URL if you want to host the `.wasm` file on your own servers.\n\n```js\nimport wasmInit, {readParquet} from \"parquet-wasm\";\n\n// Update this version to match the version you're using.\nconst wasmUrl = \"https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/esm/parquet_wasm_bg.wasm\";\nawait wasmInit(wasmUrl);\n```\n\n#### Synchronous initialization\n\nThe `initSync` named export allows for synchronous initialization from a buffer that already holds the Wasm binary:\n\n```js\nimport {initSync, readParquet} from \"parquet-wasm\";\n\n// The contents of esm/parquet_wasm_bg.wasm in an ArrayBuffer\nconst wasmBuffer = new ArrayBuffer(...);\n\n// Initialize the Wasm synchronously\ninitSync(wasmBuffer)\n```\n\nAsync initialization should be preferred over downloading the Wasm buffer and then initializing it synchronously, as [`WebAssembly.instantiateStreaming`](https://developer.mozilla.org/en-US/docs/WebAssembly/JavaScript_interface/instantiateStreaming_static) is the most efficient way to both download and initialize Wasm code.\n\n### Bundler\n\nThe `bundler` entry point doesn't require manual initialization of the WebAssembly blob, but needs setup with whatever bundler you're using. 
[Refer to the Rust Wasm documentation for more info](https://rustwasm.github.io/docs/wasm-bindgen/reference/deployment.html#bundlers).\n\n### Node\n\nThe `node` entry point can be loaded synchronously from Node.\n\n```js\nconst {readParquet} = require(\"parquet-wasm\");\n\nconst wasmTable = readParquet(...);\n```\n\n### Using directly from a browser\n\nYou can load the `esm/parquet_wasm.js` file directly from a CDN\n\n```js\nconst parquet = await import(\n  \"https://cdn.jsdelivr.net/npm/parquet-wasm@0.6.0/esm/+esm\"\n)\nawait parquet.default();\n\nconst wasmTable = parquet.readParquet(...);\n```\n\nThis specific endpoint will minify the ESM before you receive it.\n\n### Debug functions\n\nThese functions are not present in normal builds to cut down on bundle size. To create a custom build, see [Custom Builds](#custom-builds) below.\n\n#### `setPanicHook`\n\n`setPanicHook(): void`\n\nSets [`console_error_panic_hook`](https://github.com/rustwasm/console_error_panic_hook) in Rust, which provides better debugging of panics by having more informative `console.error` messages. 
Initialize this first if you're getting errors such as `RuntimeError: Unreachable executed`.\n\nThe WASM bundle must be compiled with the `console_error_panic_hook` feature for this function to exist.\n\n## Example\n\n```js\nimport * as arrow from \"apache-arrow\";\nimport initWasm, {\n  Compression,\n  readParquet,\n  Table,\n  writeParquet,\n  WriterPropertiesBuilder,\n} from \"parquet-wasm\";\n\n// Instantiate the WebAssembly context\nawait initWasm();\n\n// Create Arrow Table in JS\nconst LENGTH = 2000;\nconst rainAmounts = Float32Array.from({ length: LENGTH }, () =\u003e\n  Number((Math.random() * 20).toFixed(1))\n);\n\nconst rainDates = Array.from(\n  { length: LENGTH },\n  (_, i) =\u003e new Date(Date.now() - 1000 * 60 * 60 * 24 * i)\n);\n\nconst rainfall = arrow.tableFromArrays({\n  precipitation: rainAmounts,\n  date: rainDates,\n});\n\n// Write Arrow Table to Parquet\n\n// wasmTable is an Arrow table in WebAssembly memory\nconst wasmTable = Table.fromIPCStream(arrow.tableToIPC(rainfall, \"stream\"));\nconst writerProperties = new WriterPropertiesBuilder()\n  .setCompression(Compression.ZSTD)\n  .build();\nconst parquetUint8Array = writeParquet(wasmTable, writerProperties);\n\n// Read Parquet buffer back to Arrow Table\n// arrowWasmTable is an Arrow table in WebAssembly memory\nconst arrowWasmTable = readParquet(parquetUint8Array);\n\n// table is now an Arrow table in JS memory\nconst table = arrow.tableFromIPC(arrowWasmTable.intoIPCStream());\nconsole.log(table.schema.toString());\n// Schema\u003c{ 0: precipitation: Float32, 1: date: Date64\u003cMILLISECOND\u003e }\u003e\n```\n\n### Published examples\n\n(These may use older versions of the library with a different API).\n\n- [GeoParquet on the Web (Observable)](https://observablehq.com/@kylebarron/geoparquet-on-the-web)\n- [Hello, Parquet-WASM (Observable)](https://observablehq.com/@bmschmidt/hello-parquet-wasm)\n\n## Performance considerations\n\nTl;dr: When you have a `Table` object (resulting from 
`readParquet`), try the new\n[`Table.intoFFI`](https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.Table.html#intoFFI)\nAPI to move it to JavaScript memory. This API is less well tested than the [`Table.intoIPCStream`](https://kylebarron.dev/parquet-wasm/classes/esm_parquet_wasm.Table.html#intoIPCStream) API, but should be\nfaster and have **much** less memory overhead (by a factor of 2). If you hit any bugs, please\n[create a reproducible issue](https://github.com/kylebarron/parquet-wasm/issues/new).\n\nUnder the hood, `parquet-wasm` first decodes a Parquet file into Arrow _in WebAssembly memory_. But\nthen that WebAssembly memory needs to be copied into JavaScript for use by Arrow JS. The \"normal\"\nconversion APIs (e.g. `Table.intoIPCStream`) use the [Arrow IPC\nformat](https://arrow.apache.org/docs/python/ipc.html) to get the data back to JavaScript. But this\nrequires another memory copy _inside WebAssembly_ to assemble the various arrays into a single\nbuffer to be copied back to JS.\n\nInstead, the new `Table.intoFFI` API uses Arrow's [C Data\nInterface](https://arrow.apache.org/docs/format/CDataInterface.html) to be able to copy or view\nArrow arrays from within WebAssembly memory without any serialization.\n\nNote that this approach uses the [`arrow-js-ffi`](https://github.com/kylebarron/arrow-js-ffi)\nlibrary to parse the Arrow C Data Interface definitions. 
This library has not yet been tested in\nproduction, so it may have bugs!\n\nI wrote an [interactive blog\npost](https://observablehq.com/@kylebarron/zero-copy-apache-arrow-with-webassembly) on this approach\nand the Arrow C Data Interface if you want to read more!\n\n### Example\n\n```js\nimport * as arrow from \"apache-arrow\";\nimport { parseTable } from \"arrow-js-ffi\";\nimport initWasm, { wasmMemory, readParquet } from \"parquet-wasm\";\n\n// Instantiate the WebAssembly context\nawait initWasm();\n\n// A reference to the WebAssembly memory object.\nconst WASM_MEMORY = wasmMemory();\n\nconst resp = await fetch(\"https://example.com/file.parquet\");\nconst parquetUint8Array = new Uint8Array(await resp.arrayBuffer());\nconst wasmArrowTable = readParquet(parquetUint8Array).intoFFI();\n\n// Arrow JS table (an `arrow.Table`) that was directly copied from Wasm memory\nconst table = parseTable(\n  WASM_MEMORY.buffer,\n  wasmArrowTable.arrayAddrs(),\n  wasmArrowTable.schemaAddr()\n);\n\n// VERY IMPORTANT! You must call `drop` on the Wasm table object when you're done using it\n// to release the Wasm memory.\n// Note that any access to the pointers in this table is undefined behavior after this call.\n// Calling any `wasmArrowTable` method will error.\nwasmArrowTable.drop();\n```\n\n## Compression support\n\nThe Parquet specification permits several compression codecs. This library currently supports:\n\n- [x] Uncompressed\n- [x] Snappy\n- [x] Gzip\n- [x] Brotli\n- [x] ZSTD\n- [x] LZ4_RAW\n- [ ] LZ4 (deprecated)\n\nLZ4 support in Parquet is a bit messy. As described [here](https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/Compression.md), there are _two_ LZ4 compression options in Parquet (as of version 2.9.0). The original version `LZ4` is now deprecated; it used an undocumented framing scheme which made interoperability difficult. 
The specification now reads:\n\n\u003e It is strongly suggested that implementors of Parquet writers deprecate this compression codec in their user-facing APIs, and advise users to switch to the newer, interoperable `LZ4_RAW` codec.\n\nIt's currently unknown how widespread the ecosystem support is for `LZ4_RAW`. As of `pyarrow` v7, it now writes `LZ4_RAW` by default and presumably has read support for it as well.\n\n## Custom builds\n\nIn some cases, you may know ahead of time that your Parquet files will only include a single compression codec, say Snappy, or even no compression at all. In these cases, you may want to create a custom build of `parquet-wasm` to keep bundle size at a minimum. If you install the Rust toolchain and `wasm-pack` (see [Development](DEVELOP.md)), you can create a custom build with only the compression codecs you require.\n\nThe minimum supported Rust version in this project is 1.60. To upgrade your toolchain, use `rustup update stable`.\n\n### Example custom builds\n\nReader-only bundle with Snappy compression:\n\n```\nwasm-pack build --no-default-features --features snappy --features reader\n```\n\nWriter-only bundle with no compression support, targeting Node:\n\n```\nwasm-pack build --target nodejs --no-default-features --features writer\n```\n\nBundle with reader and writer support, targeting Node, using `arrow` and `parquet` crates with all their supported compressions, with `console_error_panic_hook` enabled:\n\n```bash\nwasm-pack build \\\n  --target nodejs \\\n  --no-default-features \\\n  --features reader \\\n  --features writer \\\n  --features all_compressions \\\n  --features debug\n# Or, given the fact that the default feature includes several of these features, a shorter version:\nwasm-pack build --target nodejs --features debug\n```\n\nRefer to the [`wasm-pack` documentation](https://rustwasm.github.io/docs/wasm-pack/commands/build.html) for more info on flags such as `--release`, `--dev`, `target`, and to the [Cargo 
documentation](https://doc.rust-lang.org/cargo/reference/features.html) for more info on how to use features.\n\n### Available features\n\nBy default, `all_compressions`, `reader`, `writer`, and `async` features are enabled. Use `--no-default-features` to remove these defaults.\n\n- `reader`: Activate read support.\n- `writer`: Activate write support.\n- `async`: Activate asynchronous read support.\n- `all_compressions`: Activate all supported compressions.\n- `brotli`: Activate Brotli compression.\n- `gzip`: Activate Gzip compression.\n- `snappy`: Activate Snappy compression.\n- `zstd`: Activate ZSTD compression.\n- `lz4`: Activate LZ4_RAW compression.\n- `debug`: Expose the `setPanicHook` function for better error messages for Rust panics.\n\n## Node \u003c20\n\nOn Node versions before 20, you'll have to [polyfill the Web Cryptography API](https://docs.rs/getrandom/latest/getrandom/#nodejs-es-module-support).\n\n## Future work\n\n- [ ] Example of pushdown predicate filtering, to download only chunks that match a specific condition\n- [ ] Column filtering, to download only certain columns\n- [ ] More tests\n\n## Acknowledgements\n\nA starting point of my work came from @my-liminal-space's [`read-parquet-browser`](https://github.com/my-liminal-space/read-parquet-browser) (which is also dual licensed MIT and Apache 2).\n\n@domoritz's [`arrow-wasm`](https://github.com/domoritz/arrow-wasm) was a very helpful reference for bootstrapping Rust-WASM bindings.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkylebarron%2Fparquet-wasm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkylebarron%2Fparquet-wasm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkylebarron%2Fparquet-wasm/lists"}