{"id":19878610,"url":"https://github.com/ironsource/parquetjs","last_synced_at":"2025-05-15T15:06:07.765Z","repository":{"id":25720025,"uuid":"89837021","full_name":"ironSource/parquetjs","owner":"ironSource","description":"fully asynchronous, pure JavaScript implementation of the Parquet file format","archived":false,"fork":false,"pushed_at":"2024-05-20T17:04:01.000Z","size":252,"stargazers_count":362,"open_issues_count":82,"forks_count":175,"subscribers_count":15,"default_branch":"master","last_synced_at":"2025-05-12T07:54:33.740Z","etag":null,"topics":["javascript","nodejs","parquet"],"latest_commit_sha":null,"homepage":"","language":"JavaScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ironSource.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-04-30T07:43:22.000Z","updated_at":"2025-04-01T15:29:50.000Z","dependencies_parsed_at":"2024-05-20T18:57:30.519Z","dependency_job_id":null,"html_url":"https://github.com/ironSource/parquetjs","commit_stats":{"total_commits":265,"total_committers":15,"mean_commits":"17.666666666666668","dds":0.539622641509434,"last_synced_commit":"d4a9e44413f7e7bf3be96c64d33e946e860f235c"},"previous_names":[],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ironSource%2Fparquetjs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ironSource%2Fparquetjs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ironSource%2Fparquetjs/releases","manifests_url":"https://r
epos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ironSource%2Fparquetjs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ironSource","download_url":"https://codeload.github.com/ironSource/parquetjs/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254364270,"owners_count":22058878,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["javascript","nodejs","parquet"],"created_at":"2024-11-12T17:06:07.118Z","updated_at":"2025-05-15T15:06:07.744Z","avatar_url":"https://github.com/ironSource.png","language":"JavaScript","readme":"# ANNOUNCEMENT: repository changed ownership\nThis repository has changed ownership to the personal account of its maintainer, [Yaniv Kessler](https://github.com/kessler)\n\nThe package on the npm registry has also been transferred to [Yaniv Kessler](https://www.npmjs.com/~kessler)\n\n# CURRENT STATUS: INACTIVE\nThis project requires a major overhaul, as well as sorting through and handling dozens of issues and PRs.\nPlease contact me if you're up for the task.\n\n# parquet.js\n\nfully asynchronous, pure node.js implementation of the Parquet file format\n\n[![Build Status](https://travis-ci.org/ironSource/parquetjs.png?branch=master)](http://travis-ci.org/ironSource/parquetjs)\n[![License: MIT](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)\n[![npm version](https://badge.fury.io/js/parquetjs.svg)](https://badge.fury.io/js/parquetjs)\n\nThis package contains a fully asynchronous, pure JavaScript implementation of\nthe [Parquet](https://parquet.apache.org/) 
file format. The implementation conforms with the\n[Parquet specification](https://github.com/apache/parquet-format) and is tested\nfor compatibility with Apache's Java [reference implementation](https://github.com/apache/parquet-mr).\n\n**What is Parquet?**: Parquet is a column-oriented file format; it allows you to\nwrite a large amount of structured data to a file, compress it and then read parts\nof it back out efficiently. The Parquet format is based on [Google's Dremel paper](https://www.google.co.nz/url?sa=t\u0026rct=j\u0026q=\u0026esrc=s\u0026source=web\u0026cd=2\u0026cad=rja\u0026uact=8\u0026ved=0ahUKEwj_tJelpv3UAhUCm5QKHfJODhUQFggsMAE\u0026url=http%3A%2F%2Fwww.vldb.org%2Fpvldb%2Fvldb2010%2Fpapers%2FR29.pdf\u0026usg=AFQjCNGyMk3_JltVZjMahP6LPmqMzYdCkw).\n\n\nInstallation\n------------\n\nTo use parquet.js with node.js, install it using npm:\n\n```\n  $ npm install parquetjs\n```\n\n_parquet.js requires node.js \u003e= 8_\n\n\nUsage: Writing files\n--------------------\n\nOnce you have installed the parquet.js library, you can import it as a single\nmodule:\n\n``` js\nvar parquet = require('parquetjs');\n```\n\nParquet files have a strict schema, similar to tables in a SQL database. So,\nin order to produce a Parquet file we first need to declare a new schema. Here\nis a simple example that shows how to instantiate a `ParquetSchema` object:\n\n``` js\n// declare a schema for the `fruits` table\nvar schema = new parquet.ParquetSchema({\n  name: { type: 'UTF8' },\n  quantity: { type: 'INT64' },\n  price: { type: 'DOUBLE' },\n  date: { type: 'TIMESTAMP_MILLIS' },\n  in_stock: { type: 'BOOLEAN' }\n});\n```\n\nNote that the Parquet schema supports nesting, so you can store complex, arbitrarily\nnested records into a single row (more on that later) while still maintaining good\ncompression.\n\nOnce we have a schema, we can create a `ParquetWriter` object. The writer will\ntake input rows as JSON objects, convert them to the Parquet format and store\nthem on disk. 
\n\n``` js\n// create a new ParquetWriter that writes to `fruits.parquet`\nvar writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');\n\n// append a few rows to the file\nawait writer.appendRow({name: 'apples', quantity: 10, price: 2.5, date: new Date(), in_stock: true});\nawait writer.appendRow({name: 'oranges', quantity: 10, price: 2.5, date: new Date(), in_stock: true});\n```\n\nOnce we are finished adding rows to the file, we have to tell the writer object\nto flush the metadata to disk and close the file by calling the `close()` method:\n\n``` js\nawait writer.close();\n```\n\nUsage: Reading files\n--------------------\n\nA parquet reader allows retrieving the rows from a parquet file in order.\nThe basic usage is to create a reader and then retrieve a cursor/iterator\nwhich allows you to consume row after row until all rows have been read.\n\nYou may open more than one cursor and use them concurrently. All cursors become\ninvalid once close() is called on\nthe reader object.\n\n``` js\n// create a new ParquetReader that reads from `fruits.parquet`\nlet reader = await parquet.ParquetReader.openFile('fruits.parquet');\n\n// create a new cursor\nlet cursor = reader.getCursor();\n\n// read all records from the file and print them\nlet record = null;\nwhile (record = await cursor.next()) {\n  console.log(record);\n}\n```\n\nWhen creating a cursor, you can optionally request that only a subset of the\ncolumns should be read from disk. 
For example:\n\n``` js\n// create a new cursor that will only return the `name` and `price` columns\nlet cursor = reader.getCursor(['name', 'price']);\n```\n\nIt is important that you call close() after you are finished reading the file to\navoid leaking file descriptors.\n\n``` js\nawait reader.close();\n```\n\nEncodings\n---------\n\nInternally, the Parquet format stores values from each field as consecutive\narrays which can be compressed/encoded using a number of schemes.\n\n#### Plain Encoding (PLAIN)\n\nThe simplest encoding scheme is the PLAIN encoding. It simply stores the\nvalues as they are without any compression. The PLAIN encoding is currently\nthe default for all types except `BOOLEAN`:\n\n``` js\nvar schema = new parquet.ParquetSchema({\n  name: { type: 'UTF8', encoding: 'PLAIN' },\n});\n```\n\n#### Run Length Encoding (RLE)\n\nThe Parquet hybrid run length and bitpacking encoding allows runs of\nnumbers to be compressed very efficiently. Note that the RLE encoding can only be used in\ncombination with the `BOOLEAN`, `INT32` and `INT64` types. The RLE encoding\nrequires an additional `bitWidth` parameter that contains the maximum number of\nbits required to store the largest value of the field.\n\n``` js\nvar schema = new parquet.ParquetSchema({\n  age: { type: 'UINT_32', encoding: 'RLE', bitWidth: 7 },\n});\n```\n\n\nOptional Fields\n---------------\n\nBy default, all fields are required to be present in each row. 
You can also mark\na field as 'optional' which will let you store rows with that field missing:\n\n``` js\nvar schema = new parquet.ParquetSchema({\n  name: { type: 'UTF8' },\n  quantity: { type: 'INT64', optional: true },\n});\n\nvar writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');\nawait writer.appendRow({name: 'apples', quantity: 10 });\nawait writer.appendRow({name: 'banana' }); // not in stock\n```\n\n\nNested Rows \u0026 Arrays\n--------------------\n\nParquet supports nested schemas that allow you to store rows that have a more\ncomplex structure than a simple tuple of scalar values. To declare a schema\nwith a nested field, omit the `type` in the column definition and add a `fields`\nlist instead:\n\nConsider this example, which allows us to store a more advanced \"fruits\" table\nwhere each row contains a name, a list of colours and a list of \"stock\" objects. \n\n``` js\n// advanced fruits table\nvar schema = new parquet.ParquetSchema({\n  name: { type: 'UTF8' },\n  colours: { type: 'UTF8', repeated: true },\n  stock: {\n    repeated: true,\n    fields: {\n      price: { type: 'DOUBLE' },\n      quantity: { type: 'INT64' },\n    }\n  }\n});\n\n// the above schema allows us to store the following rows:\nvar writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');\n\nawait writer.appendRow({\n  name: 'banana',\n  colours: ['yellow'],\n  stock: [\n    { price: 2.45, quantity: 16 },\n    { price: 2.60, quantity: 420 }\n  ]\n});\n\nawait writer.appendRow({\n  name: 'apple',\n  colours: ['red', 'green'],\n  stock: [\n    { price: 1.20, quantity: 42 },\n    { price: 1.30, quantity: 230 }\n  ]\n});\n\nawait writer.close();\n\n// reading nested rows with a list of explicit columns\nlet reader = await parquet.ParquetReader.openFile('fruits.parquet');\n\nlet cursor = reader.getCursor([['name'], ['stock', 'price']]);\nlet record = null;\nwhile (record = await cursor.next()) {\n  console.log(record);\n}\n\nawait 
reader.close();\n```\n\nIt might not be obvious why one would want to implement or use such a feature when\nthe same can, in principle, be achieved by serializing the record using JSON\n(or a similar scheme) and then storing it in a UTF8 field:\n\nPutting aside the philosophical discussion on the merits of strict typing,\nknowing about the structure and subtypes of all records (globally) means we do not\nhave to duplicate this metadata (i.e. the field names) for every record. On top\nof that, knowing about the type of a field allows us to compress the remaining\ndata more efficiently.\n\n\nList of Supported Types \u0026 Encodings\n-----------------------------------\n\nWe aim to be feature-complete and add new features as they are added to the\nParquet specification; this is the list of currently implemented data types and\nencodings:\n\n\u003ctable\u003e\n  \u003ctr\u003e\u003cth\u003eLogical Type\u003c/th\u003e\u003cth\u003ePrimitive Type\u003c/th\u003e\u003cth\u003eEncodings\u003c/th\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eUTF8\u003c/td\u003e\u003ctd\u003eBYTE_ARRAY\u003c/td\u003e\u003ctd\u003ePLAIN\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eJSON\u003c/td\u003e\u003ctd\u003eBYTE_ARRAY\u003c/td\u003e\u003ctd\u003ePLAIN\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eBSON\u003c/td\u003e\u003ctd\u003eBYTE_ARRAY\u003c/td\u003e\u003ctd\u003ePLAIN\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eBYTE_ARRAY\u003c/td\u003e\u003ctd\u003eBYTE_ARRAY\u003c/td\u003e\u003ctd\u003ePLAIN\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eTIME_MILLIS\u003c/td\u003e\u003ctd\u003eINT32\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eTIME_MICROS\u003c/td\u003e\u003ctd\u003eINT64\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eTIMESTAMP_MILLIS\u003c/td\u003e\u003ctd\u003eINT64\u003c/td\u003e\u003ctd\u003ePLAIN, 
RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eTIMESTAMP_MICROS\u003c/td\u003e\u003ctd\u003eINT64\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eBOOLEAN\u003c/td\u003e\u003ctd\u003eBOOLEAN\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eFLOAT\u003c/td\u003e\u003ctd\u003eFLOAT\u003c/td\u003e\u003ctd\u003ePLAIN\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eDOUBLE\u003c/td\u003e\u003ctd\u003eDOUBLE\u003c/td\u003e\u003ctd\u003ePLAIN\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eINT32\u003c/td\u003e\u003ctd\u003eINT32\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eINT64\u003c/td\u003e\u003ctd\u003eINT64\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eINT96\u003c/td\u003e\u003ctd\u003eINT96\u003c/td\u003e\u003ctd\u003ePLAIN\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eINT_8\u003c/td\u003e\u003ctd\u003eINT32\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eINT_16\u003c/td\u003e\u003ctd\u003eINT32\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eINT_32\u003c/td\u003e\u003ctd\u003eINT32\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eINT_64\u003c/td\u003e\u003ctd\u003eINT64\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eUINT_8\u003c/td\u003e\u003ctd\u003eINT32\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eUINT_16\u003c/td\u003e\u003ctd\u003eINT32\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eUINT_32\u003c/td\u003e\u003ctd\u003eINT32\u003c/td\u003e\u003ctd\u003ePLAIN, 
RLE\u003c/td\u003e\u003c/tr\u003e\n  \u003ctr\u003e\u003ctd\u003eUINT_64\u003c/td\u003e\u003ctd\u003eINT64\u003c/td\u003e\u003ctd\u003ePLAIN, RLE\u003c/td\u003e\u003c/tr\u003e\n\u003c/table\u003e\n\n\nBuffering \u0026 Row Group Size\n--------------------------\n\nWhen writing a Parquet file, the `ParquetWriter` will buffer rows in memory\nuntil a row group is complete (or `close()` is called) and then write out the row\ngroup to disk.\n\nThe size of a row group is configurable by the user and controls the maximum\nnumber of rows that are buffered in memory at any given time as well as the number\nof rows that are co-located on disk:\n\n``` js\nvar writer = await parquet.ParquetWriter.openFile(schema, 'fruits.parquet');\nwriter.setRowGroupSize(8192);\n```\n\n\nDependencies\n-------------\n\nParquet uses [thrift](https://thrift.apache.org/) to encode the schema and other\nmetadata, but the actual data does not use thrift.\n\nContributions\n-------------\nPlease make sure you sign the [contributor license agreement](https://github.com/ironSource/cla) in order for us to be able to accept your contribution. We thank you very much!\n\n\nLicense\n-------\n\nCopyright (c) 2017-2019 ironSource Ltd.\n\nPermission is hereby granted, free of charge, to any person obtaining a copy of\nthis software and associated documentation files (the \"Software\"), to deal in the\nSoftware without restriction, including without limitation the rights to use,\ncopy, modify, merge, publish, distribute, sublicense, and/or sell copies of the\nSoftware, and to permit persons to whom the Software is furnished to do so,\nsubject to the following conditions:\n\nThe above copyright notice and this permission notice shall be included in all\ncopies or substantial portions of the Software.\n\nTHE SOFTWARE IS PROVIDED \"AS IS\", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED,\nINCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A\nPARTICULAR PURPOSE AND NONINFRINGEMENT. 
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT\nHOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION\nOF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE\nSOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.\n\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fironsource%2Fparquetjs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fironsource%2Fparquetjs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fironsource%2Fparquetjs/lists"}