{"id":10144316,"url":"https://github.com/apache/parquet-java","last_synced_at":"2026-01-11T17:01:23.779Z","repository":{"id":17795478,"uuid":"20675636","full_name":"apache/parquet-java","owner":"apache","description":"Apache Parquet Java","archived":false,"fork":false,"pushed_at":"2026-01-08T15:21:37.000Z","size":15885,"stargazers_count":3009,"open_issues_count":710,"forks_count":1511,"subscribers_count":93,"default_branch":"master","last_synced_at":"2026-01-11T10:06:23.694Z","etag":null,"topics":["apache","parquet","parquet-java"],"latest_commit_sha":null,"homepage":"https://parquet.apache.org/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apache.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGES.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":"NOTICE","maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2014-06-10T07:00:07.000Z","updated_at":"2026-01-10T14:32:13.000Z","dependencies_parsed_at":"2025-12-06T09:06:46.908Z","dependency_job_id":null,"html_url":"https://github.com/apache/parquet-java","commit_stats":{"total_commits":2306,"total_committers":285,"mean_commits":8.091228070175438,"dds":0.7970511708586296,"last_synced_commit":"08a4e7e6279f8c3b8558bd294fee4489a96d0db1"},"previous_names":["apache/parquet-java","apache/parquet-mr"],"tags_count":99,"template":false,"template_full_name":null,"purl":"pkg:github/apache/parquet-java","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fparquet-java","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apac
he%2Fparquet-java/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fparquet-java/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fparquet-java/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apache","download_url":"https://codeload.github.com/apache/parquet-java/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apache%2Fparquet-java/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28314255,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-11T14:58:17.114Z","status":"ssl_error","status_checked_at":"2026-01-11T14:55:53.580Z","response_time":60,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","parquet","parquet-java"],"created_at":"2024-05-23T09:52:31.121Z","updated_at":"2026-01-11T17:01:23.736Z","avatar_url":"https://github.com/apache.png","language":"Java","readme":"\u003c!--\n  ~ Licensed to the Apache Software Foundation (ASF) under one\n  ~ or more contributor license agreements.  See the NOTICE file\n  ~ distributed with this work for additional information\n  ~ regarding copyright ownership.  The ASF licenses this file\n  ~ to you under the Apache License, Version 2.0 (the\n  ~ \"License\"); you may not use this file except in compliance\n  ~ with the License.  
You may obtain a copy of the License at\n  ~\n  ~   http://www.apache.org/licenses/LICENSE-2.0\n  ~\n  ~ Unless required by applicable law or agreed to in writing,\n  ~ software distributed under the License is distributed on an\n  ~ \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY\n  ~ KIND, either express or implied.  See the License for the\n  ~ specific language governing permissions and limitations\n  ~ under the License.\n  --\u003e\n\nParquet Java (formerly Parquet MR) [![CI Hadoop 3](https://github.com/apache/parquet-java/actions/workflows/ci-hadoop3.yml/badge.svg)](https://github.com/apache/parquet-java/actions/workflows/ci-hadoop3.yml)\n======\n\nThis repository contains a Java implementation of [Apache Parquet](https://parquet.apache.org/)\n\nApache Parquet is an open source, column-oriented data file format\ndesigned for efficient data storage and retrieval. It provides high\nperformance compression and encoding schemes to handle complex data in\nbulk and is supported in many programming languages and analytics\ntools.\n\nThe [parquet-format](https://github.com/apache/parquet-format)\nrepository contains the file format specification.\n\nParquet uses the [record shredding and assembly algorithm](https://github.com/julienledem/redelm/wiki/The-striping-and-assembly-algorithms-from-the-Dremel-paper) described in the Dremel paper to represent nested structures.\nYou can find additional details about the format and intended use cases in our [Hadoop Summit 2013 presentation](http://www.slideshare.net/julienledem/parquet-hadoop-summit-2013)\n\n## Building\n\nParquet-Java uses Maven to build and depends on the thrift compiler (protoc is now managed by maven plugin).\n\n### Install Thrift\n\nTo build and install the thrift compiler, run:\n\n```\nwget -nv https://archive.apache.org/dist/thrift/0.22.0/thrift-0.22.0.tar.gz\ntar xzf thrift-0.22.0.tar.gz\ncd thrift-0.22.0\nchmod +x ./configure\n./configure --disable-libs\nsudo make install -j\n```\n\nIf 
you're on OSX and use homebrew, you can instead install Thrift 0.22.0 with `brew` and ensure that it comes first in your `PATH`.\n\n```\nbrew install thrift\nexport PATH=\"/usr/local/opt/thrift@0.22.0/bin:$PATH\"\n```\n\n### Build Parquet with Maven\n\nOnce protobuf and thrift are available in your path, you can build the project by running:\n\n```\nLC_ALL=C ./mvnw clean install\n```\n\n## Features\n\nParquet is an active project, and new features are being added quickly. Here are a few features:\n\n* Type-specific encoding\n* Hive integration (deprecated)\n* Pig integration (deprecated)\n* Cascading integration (deprecated)\n* Crunch integration\n* Apache Arrow integration\n* Scrooge integration (deprecated)\n* Impala integration (non-nested)\n* Java Map/Reduce API\n* Native Avro support\n* Native Thrift support\n* Native Protocol Buffers support\n* Complex structure support\n* Run-length encoding (RLE)\n* Bit Packing\n* Adaptive dictionary encoding\n* Predicate pushdown\n* Column stats\n* Delta encoding\n* Index pages\n* Scala DSL (deprecated)\n* Java Vector API support (experimental)\n\n## Java Vector API support\n`The feature is experimental and is currently not part of the parquet distribution`.\n\nParquet-Java supports the Java Vector API to speed up reading. To enable this feature, you need:\n\n* Java 17+, 64-bit\n* A CPU that supports these instruction sets:\n  * avx512vbmi\n  * avx512_vbmi2\n* To build the jars: `./mvnw clean package -P vector-plugins`\n* For Apache Spark to enable this feature:\n  * Build parquet and replace the parquet-encoding-{VERSION}.jar in the spark jars folder\n  * Build parquet-encoding-vector and copy parquet-encoding-vector-{VERSION}.jar to the spark jars folder\n  * Edit spark class#VectorizedRleValuesReader, function#readNextGroup to refer to parquet class#ParquetReadRouter, function#readBatchUsing512Vector\n  * Build spark with maven and replace spark-sql_2.12-{VERSION}.jar in the spark jars folder\n\n## Map/Reduce 
integration\n\n[Input](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetInputFormat.java) and [Output](https://github.com/apache/parquet-java/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetOutputFormat.java) formats.\nNote that to use an Input or Output format, you need to implement a WriteSupport or ReadSupport class, which will implement the conversion of your object to and from a Parquet schema.\n\nWe've implemented this for 2 popular data formats to provide a clean migration path as well:\n\n### Thrift\n\nThrift integration is provided by the [parquet-thrift](https://github.com/apache/parquet-java/tree/master/parquet-thrift) sub-project.\n\n### Avro\n\nAvro conversion is implemented via the [parquet-avro](https://github.com/apache/parquet-java/tree/master/parquet-avro) sub-project.\n\n### Protobuf\n\nProtobuf conversion is implemented via the [parquet-protobuf](https://github.com/apache/parquet-java/tree/master/parquet-protobuf) sub-project.\n\n### Create your own objects\n\n* The ParquetOutputFormat can be provided a WriteSupport to write your own objects to an event based RecordConsumer.\n* The ParquetInputFormat can be provided a ReadSupport to materialize your own objects by implementing a RecordMaterializer\n\nSee the APIs:\n\n* [Record conversion API](https://github.com/apache/parquet-java/tree/master/parquet-column/src/main/java/org/apache/parquet/io/api)\n* [Hadoop API](https://github.com/apache/parquet-java/tree/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/api)\n\n## Hive integration\n\nHive integration is now deprecated within the Parquet project. 
It is now maintained by Apache Hive.\n\n## Build\n\nTo run the unit tests: `./mvnw test`\n\nTo build the jars: `./mvnw package`\n\nThe build runs in [GitHub Actions](https://github.com/apache/parquet-java/actions):\n[![Build Status](https://github.com/apache/parquet-java/workflows/Test/badge.svg)](https://github.com/apache/parquet-java/actions)\n\n## Add Parquet as a dependency in Maven\n\nThe current release is version `1.15.1`.\n\n```xml\n  \u003cdependencies\u003e\n    \u003cdependency\u003e\n      \u003cgroupId\u003eorg.apache.parquet\u003c/groupId\u003e\n      \u003cartifactId\u003eparquet-common\u003c/artifactId\u003e\n      \u003cversion\u003e1.15.1\u003c/version\u003e\n    \u003c/dependency\u003e\n    \u003cdependency\u003e\n      \u003cgroupId\u003eorg.apache.parquet\u003c/groupId\u003e\n      \u003cartifactId\u003eparquet-encoding\u003c/artifactId\u003e\n      \u003cversion\u003e1.15.1\u003c/version\u003e\n    \u003c/dependency\u003e\n    \u003cdependency\u003e\n      \u003cgroupId\u003eorg.apache.parquet\u003c/groupId\u003e\n      \u003cartifactId\u003eparquet-column\u003c/artifactId\u003e\n      \u003cversion\u003e1.15.1\u003c/version\u003e\n    \u003c/dependency\u003e\n    \u003cdependency\u003e\n      \u003cgroupId\u003eorg.apache.parquet\u003c/groupId\u003e\n      \u003cartifactId\u003eparquet-hadoop\u003c/artifactId\u003e\n      \u003cversion\u003e1.15.1\u003c/version\u003e\n    \u003c/dependency\u003e\n  \u003c/dependencies\u003e\n```\n\n### How To Contribute\n\nWe prefer to receive contributions in the form of GitHub pull requests. Please send pull requests against the [parquet-java](https://github.com/apache/parquet-java) Git repository. 
If you've previously forked Parquet from its old location, you will need to add a remote or update your origin remote to `https://github.com/apache/parquet-java.git`.\n\nIf you are looking for some ideas on what to contribute, check out [GitHub issues](https://github.com/apache/parquet-java/issues) for labeled [Good first issue](https://github.com/apache/parquet-java/issues?q=state%3Aopen%20label%3A%22Good%20first%20issue%22). Comment on the issue and/or contact [dev@parquet.apache.org](https://lists.apache.org/list.html?dev@parquet.apache.org) with your questions and ideas.\n\nIf you’d like to report a bug but don’t have time to fix it, you can still raise an [issue on GitHub](https://github.com/apache/parquet-java/issues/new/choose), or email the mailing list [dev@parquet.apache.org](https://lists.apache.org/list.html?dev@parquet.apache.org).\n\nTo contribute a patch:\n\n  1. Break your work into small, single-purpose patches if possible. It’s much harder to merge in a large change with a lot of disjoint features.\n  2. Create an issue for your patch on the [GitHub issues](https://github.com/apache/parquet-java/issues).\n  3. Submit the patch as a GitHub pull request against the master branch. For a tutorial, see the GitHub guides on forking a repo and sending a pull request. Prefix your pull request name with the issue (ex: https://github.com/apache/parquet-java/pull/3260).\n  4. Make sure that your code passes the unit tests. You can run the tests with `./mvnw test` in the root directory.\n  5. Add new unit tests for your code.\n\nWe tend to do fairly close readings of pull requests, and you may get a lot of comments. Some common issues that are not code structure related, but still important:\n\n  * Use 2 spaces for whitespace. Not tabs, not 4 spaces. The number of the spacing shall be 2.\n  * Give your operators some room. 
Not `a+b` but `a + b` and not `foo(int a,int b)` but `foo(int a, int b)`.\n  * Generally speaking, stick to the [Sun Java Code Conventions](http://www.oracle.com/technetwork/java/javase/documentation/codeconvtoc-136057.html)\n  * Make sure tests pass!\n\nThank you for getting involved!\n\n## Authors and contributors\n\n* [Contributors](https://github.com/apache/parquet-java/graphs/contributors)\n* [Committers](https://projects.apache.org/committee.html?parquet)\n\n## Code of Conduct\n\nWe hold ourselves and the Parquet developer community to two codes of conduct:\n\n  1. [The Apache Software Foundation Code of Conduct](https://www.apache.org/foundation/policies/conduct.html)\n  2. [The Twitter OSS Code of Conduct](https://github.com/twitter/code-of-conduct/blob/master/code-of-conduct.md)\n\n## Discussions\n\n* Mailing list: [dev@parquet.apache.org](https://lists.apache.org/list.html?dev@parquet.apache.org)\n* GitHub issues: [Issues](https://github.com/apache/parquet-java/issues)\n* Discussions also take place in GitHub pull requests\n\n## License\n\nLicensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0\n","funding_links":[],"categories":["Java","Data Storage Optimisation","其他_大数据","大数据","Libraries"],"sub_categories":["资源传输下载","Java"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapache%2Fparquet-java","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapache%2Fparquet-java","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapache%2Fparquet-java/lists"}