{"id":13545744,"url":"https://github.com/apache/parquet-format","last_synced_at":"2025-05-14T08:05:10.996Z","repository":{"id":17795477,"uuid":"20675635","full_name":"apache/parquet-format","owner":"apache","description":"Apache Parquet Format","archived":false,"fork":false,"pushed_at":"2025-04-18T02:16:55.000Z","size":1369,"stargazers_count":1941,"open_issues_count":65,"forks_count":440,"subscribers_count":68,"default_branch":"master","last_synced_at":"2025-05-07T07:13:34.449Z","etag":null,"topics":["apache","parquet","parquet-format"],"latest_commit_sha":null,"homepage":"https://parquet.apache.org/","language":"Thrift","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apache.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2014-06-10T07:00:07.000Z","updated_at":"2025-05-05T00:24:15.000Z","dependencies_parsed_at":"2023-02-11T22:01:15.228Z","dependency_job_id":"742c6828-a2a3-4025-ac15-3d41b3ca5e1d","html_url":"https://github.com/apache/parquet-format","commit_stats":{"total_commits":351,"total_committers":83,"mean_commits":4.228915662650603,"dds":0.8176638176638177,"last_synced_commit":"d784f11f4485e64fdeaa614e0bde125f5132093d"},"previous_names":[],"tags_count":23,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fparquet-format","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fparquet-format/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fparqu
et-format/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fparquet-format/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apache","download_url":"https://codeload.github.com/apache/parquet-format/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":253900615,"owners_count":21981273,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","parquet","parquet-format"],"created_at":"2024-08-01T11:01:13.353Z","updated_at":"2025-05-14T08:05:10.975Z","avatar_url":"https://github.com/apache.png","language":"Thrift","readme":"\u003c!--\n  - Licensed to the Apache Software Foundation (ASF) under one\n  - or more contributor license agreements.  See the NOTICE file\n  - distributed with this work for additional information\n  - regarding copyright ownership.  The ASF licenses this file\n  - to you under the Apache License, Version 2.0 (the\n  - \"License\"); you may not use this file except in compliance\n  - with the License.  You may obtain a copy of the License at\n  -\n  -   http://www.apache.org/licenses/LICENSE-2.0\n  -\n  - Unless required by applicable law or agreed to in writing,\n  - software distributed under the License is distributed on an\n  - \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n  - KIND, either express or implied.  
See the License for the\n  - specific language governing permissions and limitations\n  - under the License.\n  --\u003e\n\n# Parquet [![Build Status](https://github.com/apache/parquet-format/actions/workflows/test.yml/badge.svg)](https://github.com/apache/parquet-format/actions)\n\nThis repository contains the specification for [Apache Parquet] and\n[Apache Thrift] definitions to read and write Parquet metadata.\n\nApache Parquet is an open source, column-oriented data file format\ndesigned for efficient data storage and retrieval. It provides high\nperformance compression and encoding schemes to handle complex data in\nbulk and is supported in many programming languages and analytics\ntools.\n\n[Apache Parquet]: https://parquet.apache.org\n[Apache Thrift]: https://thrift.apache.org\n\n## Motivation\n\nWe created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.\n\nParquet is built from the ground up with complex nested data structures in mind, and uses the [record shredding and assembly algorithm](https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper) described in the Dremel paper. We believe this approach is superior to simple flattening of nested name spaces.\n\nParquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level, and is future-proofed to allow adding more encodings as they are invented and implemented.\n\nParquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. 
We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult-to-set-up dependencies.\n\n## Modules\n\nThe [parquet-format] project contains format specifications and Thrift definitions of metadata required to properly read Parquet files.\n\nThe [parquet-java] project contains multiple sub-modules, which implement the core components of reading and writing a nested, column-oriented data stream, map this core onto the parquet format, and provide Hadoop Input/Output Formats, Pig loaders, and other java-based utilities for interacting with Parquet.\n\nThe [parquet-testing] project contains a set of files that can be used to verify that implementations in different languages can read and write each other's files.\n\n[parquet-format]: https://github.com/apache/parquet-format\n[parquet-java]: https://github.com/apache/parquet-java\n[parquet-testing]: https://github.com/apache/parquet-testing\n\n## Building\n\nJava resources can be built using `mvn package`. The current stable version should always be available from Maven Central.\n\nC++ Thrift resources can be generated via `make`.\n\nThrift can also be code-generated into any other Thrift-supported language.\n\n## Glossary\n  - Block (HDFS block): This means a block in HDFS and the meaning is\n    unchanged for describing this file format.  The file format is\n    designed to work well on top of HDFS.\n\n  - File: An HDFS file that must include the metadata for the file.\n    It does not need to actually contain the data.\n\n  - Row group: A logical horizontal partitioning of the data into rows.\n    There is no physical structure that is guaranteed for a row group.\n    A row group consists of a column chunk for each column in the dataset.\n\n  - Column chunk: A chunk of the data for a particular column.  
They live\n    in a particular row group and are guaranteed to be contiguous in the file.\n\n  - Page: Column chunks are divided up into pages.  A page is conceptually\n    an indivisible unit (in terms of compression and encoding).  There can\n    be multiple page types which are interleaved in a column chunk.\n\nHierarchically, a file consists of one or more row groups.  A row group\ncontains exactly one column chunk per column.  Column chunks contain one or\nmore pages.\n\n## Unit of parallelization\n  - MapReduce - File/Row Group\n  - IO - Column chunk\n  - Encoding/Compression - Page\n\n## File format\nThis file and the [Thrift definition](src/main/thrift/parquet.thrift) should be read together to understand the format.\n\n    4-byte magic number \"PAR1\"\n    \u003cColumn 1 Chunk 1\u003e\n    \u003cColumn 2 Chunk 1\u003e\n    ...\n    \u003cColumn N Chunk 1\u003e\n    \u003cColumn 1 Chunk 2\u003e\n    \u003cColumn 2 Chunk 2\u003e\n    ...\n    \u003cColumn N Chunk 2\u003e\n    ...\n    \u003cColumn 1 Chunk M\u003e\n    \u003cColumn 2 Chunk M\u003e\n    ...\n    \u003cColumn N Chunk M\u003e\n    File Metadata\n    4-byte length in bytes of file metadata (little endian)\n    4-byte magic number \"PAR1\"\n\nIn the above example, there are N columns in this table, split into M row\ngroups.  The file metadata contains the\nstart locations of all the column chunks.  More details on what is contained in the metadata can be found\nin the Thrift definition.\n\nFile Metadata is written after the data to allow for single pass writing.\n\nReaders are expected to first read the file metadata to find all the column\nchunks they are interested in.  The column chunks should then be read sequentially.\n\n ![File Layout](https://raw.github.com/apache/parquet-format/master/doc/images/FileLayout.gif)\n\n## Metadata\nThere are two types of metadata: file metadata and page header metadata.  
All Thrift structures\nare serialized using the TCompactProtocol.\n\n ![Metadata diagram](https://github.com/apache/parquet-format/raw/master/doc/images/FileFormat.gif)\n\n## Types\nThe types supported by the file format are intended to be as minimal as possible,\nwith a focus on how the types affect on-disk storage.  For example, 16-bit ints\nare not explicitly supported in the storage format since they are covered by\n32-bit ints with an efficient encoding.  This reduces the complexity of implementing\nreaders and writers for the format.  The types are:\n  - BOOLEAN: 1 bit boolean\n  - INT32: 32 bit signed ints\n  - INT64: 64 bit signed ints\n  - INT96: 96 bit signed ints\n  - FLOAT: IEEE 32-bit floating point values\n  - DOUBLE: IEEE 64-bit floating point values\n  - BYTE_ARRAY: arbitrarily long byte arrays\n  - FIXED_LEN_BYTE_ARRAY: fixed length byte arrays\n\n### Logical Types\nLogical types are used to extend the types that parquet can be used to store,\nby specifying how the primitive types should be interpreted. This keeps the set\nof primitive types to a minimum and reuses parquet's efficient encodings. For\nexample, strings are stored with the primitive type BYTE_ARRAY with a STRING\nannotation. These annotations define how to further decode and interpret the data.\nAnnotations are stored as `LogicalType` fields in the file metadata and are\ndocumented in [LogicalTypes.md][logical-types].\n\n[logical-types]: LogicalTypes.md\n\n### Sort Order\nParquet stores min/max statistics at several levels (such as Column Chunk,\nColumn Index, and Data Page). These statistics are written according to a sort order,\nwhich is defined for each column in the file footer. Parquet supports common\nsort orders for logical and primitive types. The details are documented in the\n[Thrift definition](src/main/thrift/parquet.thrift) in the `ColumnOrder` union.\n\n## Nested Encoding\nTo encode nested columns, Parquet uses the Dremel encoding with definition and\nrepetition levels.  
Definition levels specify how many optional fields in the\npath for the column are defined.  Repetition levels specify at which repeated field\nin the path the value is repeated.  The max definition and repetition levels can\nbe computed from the schema (i.e. how much nesting there is).  This defines the\nmaximum number of bits required to store the levels (levels are defined for all\nvalues in the column).\n\nTwo encodings for the levels are supported: BIT_PACKED and RLE. Only RLE is now used, as it supersedes BIT_PACKED.\n\n## Nulls\nNullity is encoded in the definition levels (which are run-length encoded).  NULL values\nare not encoded in the data.  For example, in a non-nested schema, a column with 1000 NULLs\nwould be encoded with run-length encoding (0, 1000 times) for the definition levels and\nnothing else.\n\n## Data Pages\nFor data pages, the 3 pieces of information are encoded back to back, after the page\nheader. No padding is allowed in the data page.\nIn order we have:\n 1. repetition levels data\n 1. definition levels data\n 1. encoded values\n\nThe value of `uncompressed_page_size` specified in the header is for all the 3 pieces combined.\n\nThe encoded values for the data page are always required.  The definition and repetition levels\nare optional, based on the schema definition.  If the column is not nested (i.e.\nthe path to the column has length 1), we do not encode the repetition levels (they would\nalways have the value 1).  
For data that is required, the definition levels are\nskipped (if encoded, they will always have the value of the max definition level).\n\nFor example, in the case where the column is non-nested and required, the data in the\npage is only the encoded values.\n\nThe supported encodings are described in [Encodings.md](https://github.com/apache/parquet-format/blob/master/Encodings.md).\n\nThe supported compression codecs are described in\n[Compression.md](https://github.com/apache/parquet-format/blob/master/Compression.md).\n\n## Column chunks\nColumn chunks are composed of pages written back to back.  The pages share a common\nheader and readers can skip over pages they are not interested in.  The data for the\npage follows the header and can be compressed and/or encoded.  The compression and\nencoding are specified in the page metadata.\n\nA column chunk might be partly or completely dictionary encoded. This means that\ndictionary indexes are saved in the data pages instead of the actual values. The\nactual values are stored in the dictionary page. See details in\n[Encodings.md](https://github.com/apache/parquet-format/blob/master/Encodings.md#dictionary-encoding-plain_dictionary--2-and-rle_dictionary--8).\nThe dictionary page must be placed at the first position of the column chunk. At\nmost one dictionary page can be placed in a column chunk.\n\nAdditionally, files can contain an optional column index to allow readers to\nskip pages more efficiently. See [PageIndex.md](PageIndex.md) for details and\nthe reasoning behind adding these to the format.\n\n## Checksumming\nPages of all kinds can be individually checksummed. This allows disabling of checksums\nat the HDFS file level, to better support single row lookups. Checksums are calculated\nusing the standard CRC32 algorithm - as used in e.g. GZip - on the serialized binary\nrepresentation of a page (not including the page header itself).\n\n## Error recovery\nIf the file metadata is corrupt, the file is lost.  
If the column metadata is corrupt,\nthat column chunk is lost (but column chunks for this column in other row groups are\nokay).  If a page header is corrupt, the remaining pages in that chunk are lost.  If\nthe data within a page is corrupt, that page is lost.  The file will be more\nresilient to corruption with smaller row groups.\n\nPotential extension: With smaller row groups, the biggest issue is placing the file\nmetadata at the end.  If an error happens while writing the file metadata, all the\ndata written will be unreadable.  This can be fixed by writing the file metadata\nevery Nth row group.\nEach file metadata would be cumulative and include all the row groups written so\nfar.  Combining this with the strategy used for RC or Avro files using sync markers,\na reader could recover partially written files.\n\n## Separating metadata and column data\nThe format is explicitly designed to separate the metadata from the data.  This\nallows splitting columns into multiple files, as well as having a single metadata\nfile reference multiple parquet files.\n\n## Configurations\n- Row group size: Larger row groups allow for larger column chunks, which makes it\npossible to do larger sequential IO.  Larger groups also require more buffering in\nthe write path (or a two pass write).  We recommend large row groups (512MB - 1GB).\nSince an entire row group might need to be read, we want it to completely fit on\none HDFS block.  Therefore, HDFS block sizes should also be set to be larger.  An\noptimized read setup would be: 1GB row groups, 1GB HDFS block size, 1 HDFS block\nper HDFS file.\n- Data page size: Data pages should be considered indivisible so smaller data pages\nallow for more fine grained reading (e.g. single row lookup).  Larger page sizes\nincur less space overhead (fewer page headers) and potentially less parsing overhead\n(processing headers).  Note: for sequential scans, it is not expected to read a page\nat a time; this is not the IO chunk.  
We recommend 8KB for page sizes.\n\n## Extensibility\nThere are many places in the format for compatible extensions:\n- File Version: The file metadata contains a version.\n- Encodings: Encodings are specified by enum and more can be added in the future.\n- Page types: Additional page types can be added and safely skipped.\n\n### [Binary Protocol Extensions](BinaryProtocolExtensions.md)\n\nParquet Thrift IDL reserves field-id `32767` of every Thrift struct for extensions.\nThe (Thrift) type of this field is always `binary`.\n\n## Contributing\nComment on the issue and/or contact [the parquet-dev mailing list](http://mail-archives.apache.org/mod_mbox/parquet-dev/) with your questions and ideas.\nChanges to this core format definition are proposed and discussed in depth on the mailing list. You may also be interested in contributing to the Parquet-Java subproject, which contains all the Java-side implementation and APIs. See the \"How To Contribute\" section of the [Parquet-Java project](https://github.com/apache/parquet-java#how-to-contribute)\n\n## Code of Conduct\n\nWe hold ourselves and the Parquet developer community to a code of conduct as described by [Twitter OSS](https://engineering.twitter.com/opensource): \u003chttps://github.com/twitter/code-of-conduct/blob/master/code-of-conduct.md\u003e.\n\n## License\nCopyright 2013 Twitter, Cloudera and other contributors.\n\nLicensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0\n","funding_links":[],"categories":["Thrift","File Formats","Resources","Capabilities"],"sub_categories":["Documentation","Storage"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapache%2Fparquet-format","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapache%2Fparquet-format","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapache%2Fparquet-format/lists"}