{"id":20964551,"url":"https://github.com/hyparam/hyparquet","last_synced_at":"2025-05-15T01:08:12.222Z","repository":{"id":214671051,"uuid":"737060203","full_name":"hyparam/hyparquet","owner":"hyparam","description":"parquet file parser for javascript","archived":false,"fork":false,"pushed_at":"2025-05-04T03:53:04.000Z","size":4335,"stargazers_count":423,"open_issues_count":10,"forks_count":13,"subscribers_count":9,"default_branch":"master","last_synced_at":"2025-05-12T11:02:56.658Z","etag":null,"topics":["hyparquet","hyperparam","javascript","js","parquet","parquetjs","parser","snappy","thrift"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hyparam.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-12-29T17:30:35.000Z","updated_at":"2025-05-12T09:18:46.000Z","dependencies_parsed_at":"2023-12-29T22:25:15.845Z","dependency_job_id":"53e9fbb7-dee9-474b-a17f-750e8a2eb8c3","html_url":"https://github.com/hyparam/hyparquet","commit_stats":null,"previous_names":["platypii/hyparquet","hyparam/hyparquet"],"tags_count":80,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fhyparquet","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fhyparquet/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fhyparquet/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hyparam%2Fhyparquet/manifests"
,"owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hyparam","download_url":"https://codeload.github.com/hyparam/hyparquet/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254227581,"owners_count":22035664,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hyparquet","hyperparam","javascript","js","parquet","parquetjs","parser","snappy","thrift"],"created_at":"2024-11-19T02:56:02.810Z","updated_at":"2025-05-15T01:08:07.215Z","avatar_url":"https://github.com/hyparam.png","language":"JavaScript","funding_links":[],"categories":["javascript","js","JavaScript"],"sub_categories":[],"readme":"# hyparquet\n\n![hyparquet parakeet](hyparquet.jpg)\n\n[![npm](https://img.shields.io/npm/v/hyparquet)](https://www.npmjs.com/package/hyparquet)\n[![minzipped](https://img.shields.io/bundlephobia/minzip/hyparquet)](https://www.npmjs.com/package/hyparquet)\n[![workflow status](https://github.com/hyparam/hyparquet/actions/workflows/ci.yml/badge.svg)](https://github.com/hyparam/hyparquet/actions)\n[![mit license](https://img.shields.io/badge/License-MIT-orange.svg)](https://opensource.org/licenses/MIT)\n![coverage](https://img.shields.io/badge/Coverage-96-darkred)\n[![dependencies](https://img.shields.io/badge/Dependencies-0-blueviolet)](https://www.npmjs.com/package/hyparquet?activeTab=dependencies)\n\nDependency free since 2023!\n\n## What is hyparquet?\n\n**Hyparquet** is a lightweight, dependency-free, pure JavaScript library for parsing [Apache Parquet](https://parquet.apache.org) files. 
Apache Parquet is a popular columnar storage format that is widely used in data engineering, data science, and machine learning applications for efficiently storing and processing large datasets.\n\nHyparquet aims to be the world's most compliant parquet parser. And it runs in the browser.\n\n## Parquet Viewer\n\n**Try hyparquet online**: Drag and drop your parquet file onto [hyperparam.app](https://hyperparam.app) to view it directly in your browser. This service is powered by hyparquet's in-browser capabilities.\n\n[![hyperparam parquet viewer](./hyperparam.png)](https://hyperparam.app/)\n\n## Features\n\n1. **Browser-native**: Built to work seamlessly in the browser, opening up new possibilities for web-based data applications and visualizations.\n2. **Performant**: Designed to efficiently process large datasets by only loading the required data, making it suitable for big data and machine learning applications.\n3. **TypeScript**: Includes TypeScript definitions.\n4. **Dependency-free**: Hyparquet has zero dependencies, making it lightweight and easy to use in any JavaScript project. Only 9.7kb min.gz!\n5. **Highly Compliant:** Supports all parquet encodings, compression codecs, and can open more parquet files than any other library.\n\n## Why hyparquet?\n\nParquet is widely used in data engineering and data science for its efficient storage and processing of large datasets. What if you could use parquet files directly in the browser, without needing a server or backend infrastructure? 
That's what hyparquet enables.\n\nExisting JavaScript-based parquet readers (like [parquetjs](https://github.com/ironSource/parquetjs)) are no longer actively maintained, may not support streaming or in-browser processing efficiently, and often rely on dependencies that can inflate your bundle size.\nHyparquet is actively maintained and designed with modern web usage in mind.\n\n## Demo\n\nCheck out a minimal parquet viewer demo that shows how to integrate hyparquet into a react web application using [HighTable](https://github.com/hyparam/hightable).\n\n - **Live Demo**: [https://hyparam.github.io/demos/hyparquet/](https://hyparam.github.io/demos/hyparquet/)\n - **Demo Source Code**: [https://github.com/hyparam/demos/tree/master/hyparquet](https://github.com/hyparam/demos/tree/master/hyparquet)\n\n## Quick Start\n\n### Node.js Example\n\nTo read the contents of a local parquet file in a node.js environment use `asyncBufferFromFile`:\n\n```javascript\nconst { asyncBufferFromFile, parquetReadObjects } = await import('hyparquet')\n\nconst file = await asyncBufferFromFile(filename)\nconst data = await parquetReadObjects({ file })\n```\n\nNote: hyparquet is published as an ES module, so dynamic `import()` may be required on the command line.\n\n### Browser Example\n\nIn the browser use `asyncBufferFromUrl` to wrap a url for reading asynchronously over the network.\nIt is recommended that you filter by row and column to limit fetch size:\n\n```javascript\nconst { asyncBufferFromUrl, parquetReadObjects } = await import('https://cdn.jsdelivr.net/npm/hyparquet/src/hyparquet.min.js')\n\nconst url = 'https://hyperparam-public.s3.amazonaws.com/bunnies.parquet'\nconst file = await asyncBufferFromUrl({ url }) // wrap url for async fetching\nconst data = await parquetReadObjects({\n  file,\n  columns: ['Breed Name', 'Lifespan'],\n  rowStart: 10,\n  rowEnd: 20,\n})\n```\n\n## Parquet Writing\n\nTo create parquet files from javascript, check out the 
[hyparquet-writer](https://github.com/hyparam/hyparquet-writer) package.\n\n## Advanced Usage\n\n### Reading Metadata\n\nYou can read just the metadata, including schema and data statistics, using the `parquetMetadataAsync` function.\nTo load parquet metadata in the browser from a remote server:\n\n```javascript\nimport { asyncBufferFromUrl, parquetMetadataAsync, parquetSchema } from 'hyparquet'\n\nconst file = await asyncBufferFromUrl({ url })\nconst metadata = await parquetMetadataAsync(file)\n// Get total number of rows (convert bigint to number)\nconst numRows = Number(metadata.num_rows)\n// Get nested table schema\nconst schema = parquetSchema(metadata)\n// Get top-level column header names\nconst columnNames = schema.children.map(e =\u003e e.element.name)\n```\n\nYou can also read the metadata synchronously using `parquetMetadata` if you have an array buffer with the parquet footer:\n\n```javascript\nimport { parquetMetadata } from 'hyparquet'\n\nconst metadata = parquetMetadata(arrayBuffer)\n```\n\n### AsyncBuffer\n\nHyparquet requires an argument `file` of type `AsyncBuffer`. 
An `AsyncBuffer` is similar to a js `ArrayBuffer` but the `slice` method can return async `Promise\u003cArrayBuffer\u003e`.\n\n```typescript\ntype Awaitable\u003cT\u003e = T | Promise\u003cT\u003e\ninterface AsyncBuffer {\n  byteLength: number\n  slice(start: number, end?: number): Awaitable\u003cArrayBuffer\u003e\n}\n```\n\nIn most cases, you should probably use `asyncBufferFromUrl` or `asyncBufferFromFile` to create an `AsyncBuffer` for hyparquet.\n\n#### asyncBufferFromFile\n\nIf you are in a local node.js environment, use `asyncBufferFromFile` to wrap a local file as an `AsyncBuffer`:\n\n```typescript\nconst file: AsyncBuffer = await asyncBufferFromFile('local.parquet')\nconst data = await parquetReadObjects({ file })\n```\n\n#### asyncBufferFromUrl\n\nIf you want to read a parquet file remotely over http, use `asyncBufferFromUrl` to wrap an http url as an `AsyncBuffer` using http range requests.\n\n - Pass `requestInit` option to provide additional fetch headers for authentication (optional)\n - Pass `byteLength` if you know the file size to save a round trip HEAD request (optional)\n\n```typescript\nconst url = 'https://s3.hyperparam.app/wiki_en.parquet'\nconst requestInit = { headers: { Authorization: 'Bearer my_token' } }\nconst byteLength = 415958713\nconst file: AsyncBuffer = await asyncBufferFromUrl({ url, requestInit, byteLength })\nconst data = await parquetReadObjects({ file })\n```\n\n#### ArrayBuffer\n\nYou can provide an `ArrayBuffer` anywhere that an `AsyncBuffer` is expected. This is useful if you already have the entire parquet file in memory.\n\n#### Custom AsyncBuffer\n\nYou can implement your own `AsyncBuffer` to create a virtual file that can be read asynchronously by hyparquet.\n\n### parquetRead vs parquetReadObjects\n\n#### parquetReadObjects\n\n`parquetReadObjects` is a convenience wrapper around `parquetRead` that returns the complete rows as `Promise\u003cRecord\u003cstring, any\u003e[]\u003e`. 
This is the simplest way to read parquet files.\n\n```typescript\nparquetReadObjects({ file }): Promise\u003cRecord\u003cstring, any\u003e[]\u003e\n```\n\n#### parquetRead\n\n`parquetRead` is the \"base\" function for reading parquet files.\nIt returns a `Promise\u003cvoid\u003e` that resolves when the file has been read or rejected if an error occurs.\nData is returned via `onComplete` or `onChunk` or `onPage` callbacks passed as arguments.\n\nThe reason for this design is that parquet is a column-oriented format, and returning data in row-oriented format requires transposing the column data. This is an expensive operation in javascript. If you don't pass in an `onComplete` argument to `parquetRead`, hyparquet will skip this transpose step and save memory.\n\n### Chunk Streaming\n\nThe `onChunk` callback returns column-oriented data as it is ready. `onChunk` will always return top-level columns, including structs, assembled as a single column. This may require waiting for multiple sub-columns to all load before assembly can occur.\n\nThe `onPage` callback returns column-oriented page data as it is ready. `onPage` will NOT assemble struct columns and will always return individual sub-column data. Note that `onPage` _will_ assemble nested lists.\n\nIn some cases, `onPage` can return data sooner than `onChunk`.\n\n```typescript\ninterface ColumnData {\n  columnName: string\n  columnData: ArrayLike\u003cany\u003e\n  rowStart: number\n  rowEnd: number\n}\nawait parquetRead({\n  file,\n  onChunk(chunk: ColumnData) {\n    console.log('chunk', chunk)\n  },\n  onPage(chunk: ColumnData) {\n    console.log('page', chunk)\n  },\n})\n```\n\n### Returned row format\n\nBy default, the `onComplete` function returns an **array** of values for each row: `[value]`. 
If you would prefer each row to be an **object**:  `{ columnName: value }`, set the option `rowFormat` to `'object'`.\n\n```javascript\nimport { parquetRead } from 'hyparquet'\n\nawait parquetRead({\n  file,\n  rowFormat: 'object',\n  onComplete: data =\u003e console.log(data),\n})\n```\n\nThe `parquetReadObjects` function defaults to `rowFormat: 'object'`.\n\n## Supported Parquet Files\n\nThe parquet format is known to be a sprawling format which includes options for a wide array of compression schemes, encoding types, and data structures.\nHyparquet supports all parquet encodings: plain, dictionary, rle, bit packed, delta, etc.\n\n**Hyparquet is the most compliant parquet parser on earth** — hyparquet can open more files than pyarrow, rust, and duckdb.\n\n## Compression\n\nBy default, hyparquet supports uncompressed and snappy-compressed parquet files.\nTo support the full range of parquet compression codecs (gzip, brotli, zstd, etc), use the [hyparquet-compressors](https://github.com/hyparam/hyparquet-compressors) package.\n\n| Codec         | hyparquet | with hyparquet-compressors |\n|---------------|-----------|----------------------------|\n| Uncompressed  | ✅        | ✅                         |\n| Snappy        | ✅        | ✅                         |\n| GZip          | ❌        | ✅                         |\n| LZO           | ❌        | ✅                         |\n| Brotli        | ❌        | ✅                         |\n| LZ4           | ❌        | ✅                         |\n| ZSTD          | ❌        | ✅                         |\n| LZ4_RAW       | ❌        | ✅                         |\n\n### hysnappy\n\nFor faster snappy decompression, try [hysnappy](https://github.com/hyparam/hysnappy), which uses WASM for a 40% speed boost on large parquet files.\n\n### hyparquet-compressors\n\nYou can include support for ALL parquet `compressors` plus hysnappy using the [hyparquet-compressors](https://github.com/hyparam/hyparquet-compressors) 
package.\n\n```javascript\nimport { asyncBufferFromFile, parquetReadObjects } from 'hyparquet'\nimport { compressors } from 'hyparquet-compressors'\n\nconst file = await asyncBufferFromFile(filename)\nconst data = await parquetReadObjects({ file, compressors })\n```\n\n## References\n\n - https://github.com/apache/parquet-format\n - https://github.com/apache/parquet-testing\n - https://github.com/apache/thrift\n - https://github.com/apache/arrow\n - https://github.com/dask/fastparquet\n - https://github.com/duckdb/duckdb\n - https://github.com/google/snappy\n - https://github.com/hyparam/hightable\n - https://github.com/hyparam/hysnappy\n - https://github.com/hyparam/hyparquet-compressors\n - https://github.com/ironSource/parquetjs\n - https://github.com/zhipeng-jia/snappyjs\n\n## Contributions\n\nContributions are welcome!\nIf you have suggestions, bug reports, or feature requests, please open an issue or submit a pull request.\n\nHyparquet development is supported by an open-source grant from Hugging Face :hugs:\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyparam%2Fhyparquet","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhyparam%2Fhyparquet","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhyparam%2Fhyparquet/lists"}