{"id":13571581,"url":"https://github.com/twitter/elephant-bird","last_synced_at":"2025-07-09T22:43:09.871Z","repository":{"id":849570,"uuid":"578435","full_name":"twitter/elephant-bird","owner":"twitter","description":"Twitter's collection of LZO and Protocol Buffer-related Hadoop, Pig, Hive, and HBase code.","archived":false,"fork":false,"pushed_at":"2023-04-10T11:31:57.000Z","size":62273,"stargazers_count":1136,"open_issues_count":89,"forks_count":385,"subscribers_count":186,"default_branch":"master","last_synced_at":"2025-07-05T11:18:58.466Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/twitter.png","metadata":{"files":{"readme":"Readme.md","changelog":"Changes.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2010-03-25T01:49:40.000Z","updated_at":"2025-06-10T20:52:01.000Z","dependencies_parsed_at":"2023-07-06T11:46:46.481Z","dependency_job_id":null,"html_url":"https://github.com/twitter/elephant-bird","commit_stats":{"total_commits":997,"total_committers":76,"mean_commits":"13.118421052631579","dds":0.6288866599799399,"last_synced_commit":"3ae48b10bc56b2d66de45739ef7d6aad821c06e0"},"previous_names":[],"tags_count":42,"template":false,"template_full_name":null,"purl":"pkg:github/twitter/elephant-bird","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Felephant-bird","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Felephant-bird/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Felephant-bird/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/rep
ositories/twitter%2Felephant-bird/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/twitter","download_url":"https://codeload.github.com/twitter/elephant-bird/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/twitter%2Felephant-bird/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264504616,"owners_count":23618831,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-08-01T14:01:03.493Z","updated_at":"2025-07-09T22:43:09.259Z","avatar_url":"https://github.com/twitter.png","language":"Java","funding_links":[],"categories":["Java","Service Programming","II. Databases, search engines, big data and machine learning","Libraries and Tools","大数据"],"sub_categories":["7. Big data"],"readme":"# Elephant Bird [![Build Status](https://secure.travis-ci.org/twitter/elephant-bird.png)](https://travis-ci.org/twitter/elephant-bird)\n\n## About\n\nElephant Bird is Twitter's open source library of [LZO](https://github.com/twitter/hadoop-lzo), [Thrift](https://thrift.apache.org/), and/or [Protocol Buffer](https://code.google.com/p/protobuf)-related [Hadoop](https://hadoop.apache.org) InputFormats, OutputFormats, Writables, [Pig](https://pig.apache.org/) LoadFuncs, [Hive](https://hadoop.apache.org/hive) SerDe, [HBase](https://hadoop.apache.org/hbase) miscellanea, etc. 
The majority of these are in production at Twitter running over data every day.\n\nJoin the conversation about Elephant-Bird on the [developer mailing list](https://groups.google.com/forum/?fromgroups#!forum/elephantbird-dev).\n\n## License\n\n[Apache License, Version 2.0](https://apache.org/licenses/LICENSE-2.0).\n\n## Quickstart\n\n1. Make sure you have [Protocol Buffers](https://code.google.com/apis/protocolbuffers/) installed. Please see the **Version compatibility** section below.\n1. Make sure you have [Apache Thrift](https://thrift.apache.org) installed. Please see the **Version compatibility** section below.\n1. Get the code: `git clone git://github.com/twitter/elephant-bird.git`\n1. Build the jar: `mvn package`\n1. Explore what's available: `mvn javadoc:javadoc`\n\nNote: For any of the LZO-based code, make sure that the native LZO libraries are on your `java.library.path`. Generally this is done by setting `JAVA_LIBRARY_PATH` in `pig-env.sh` or `hadoop-env.sh`. You can also add lines like\n\n```\nPIG_OPTS=-Djava.library.path=/path/to/my/libgplcompression/dir\n```\n\nto `pig-env.sh`. See the instructions for [Hadoop-LZO](https://www.github.com/kevinweil/hadoop-lzo) for more details.\n\nThere are a few simple examples that use the input formats. Note how the Protocol Buffer and Thrift\nclasses are passed to input formats through configuration.\n\n## Maven repository\n\nElephant Bird release artifacts are published to the [Sonatype OSS](https://oss.sonatype.org/) [releases repository](https://oss.sonatype.org/content/repositories/releases/) and promoted from there to [Maven Central](https://search.maven.org/). From time to time we may also deploy snapshot releases to the Sonatype OSS [snapshots repository](https://oss.sonatype.org/content/repositories/snapshots/).\n\n## Version compatibility\n\n1. Hadoop 20.2x, 1.x, 2.x\n1. Pig 0.8+\n1. Protocol Buffers 2.5.0, 2.4.1, 2.3.0 (the default build version is 2.4.1; it can be changed with `-Dprotobuf.version=2.3.0`)\n1. 
Hive 0.7 (with HIVE-1616)\n1. Thrift 0.5.0, 0.6.0, 0.7.0; versions greater than 0.9 are provided via the `thrift9` Maven profile\n1. Mahout 0.6\n1. Cascading2 (as the API is evolving, see libraries.properties for the currently supported version)\n1. Crunch 0.8.1+\n\n### Runtime Dependencies\n\nElephant-Bird defines the majority of its dependencies in Maven [provided scope](https://maven.apache.org/guides/introduction/introduction-to-dependency-mechanism.html#Dependency_Scope).\nAs a result, these dependencies are not pulled in transitively by Elephant-Bird modules. Please see the [wiki page](https://github.com/kevinweil/elephant-bird/wiki/Build-and-Runtime-Dependencies) for more information.\n\n## Contents\n\n### Hadoop Input and Output Formats\n\nElephant-Bird provides input and output formats for working with a variety of plaintext formats stored in LZO compressed files.\n\n* JSON data\n* Line-based data (TextInputFormat but for LZO)\n* [W3C logs](https://www.w3.org/TR/WD-logfile.html)\n\nAdditionally, protocol buffers and thrift messages can be stored in a variety of file formats.\n\n* Block-based, into generic bytes\n* Line-based, base64 encoded\n* SequenceFile\n* RCFile\n\n### Hadoop API wrappers\n\nHadoop provides two API implementations: the old-style `org.apache.hadoop.mapred` and new-style `org.apache.hadoop.mapreduce` packages. 
Elephant-Bird provides wrapper classes that allow unmodified usage of `mapreduce` input and output formats in contexts where the `mapred` interface is required.\n\nFor more information, see [DeprecatedInputFormatWrapper.java](https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapred/input/DeprecatedInputFormatWrapper.java) and [DeprecatedOutputFormatWrapper.java](https://github.com/kevinweil/elephant-bird/blob/master/core/src/main/java/com/twitter/elephantbird/mapred/output/DeprecatedOutputFormatWrapper.java).\n\n\n### Hadoop 2.x Support\n\nPublished Elephant-Bird packages are tested with both Hadoop 1.x and 2.x.\n\n### Hadoop Writables\n* Elephant-Bird provides protocol buffer and thrift writables for directly working with these formats in map-reduce jobs.\n\n### Pig Support\n\nLoaders and storers are available for the input and output formats listed above. Additionally, Pig-specific features include:\n\n* JSON loader (including nested structures)\n* Regex-based loader\n* A converter interface for turning Tuples into Writables and vice versa\n* Converter implementations for generic Writables, Thrift, Protobufs, and other specialized classes, such as [Apache Mahout](https://mahout.apache.org/)'s [VectorWritable](https://svn.apache.org/repos/asf/mahout/trunk/core/src/main/java/org/apache/mahout/math/VectorWritable.java).\n\n### Hive Support\n\nElephant-Bird provides Hive support for reading thrift and protocol buffers. For more information, see [How to use Elephant Bird with Hive](https://github.com/kevinweil/elephant-bird/wiki/How-to-use-Elephant-Bird-with-Hive).\n\n### Lucene Integration\n\nElephant-Bird provides Hadoop Input/Output Formats and Pig Load/Store Funcs for creating and searching Lucene indexes. 
See [Elephant Bird Lucene](https://github.com/kevinweil/elephant-bird/wiki/Elephant-Bird-Lucene).\n\n### Utilities\n* Counters in Pig\n* Protocol Buffer utilities\n* Thrift utilities\n* Conversions from Protocol Buffers and Thrift messages to Pig tuples\n* Conversions from Thrift to Protocol Buffer's `DynamicMessage`\n* Reading and writing block-based Protocol Buffer format (see `ProtobufBlockWriter`)\n\n### Protocol Buffer and Thrift compiler dependencies\n\nElephant Bird requires the Protocol Buffer compiler at build time, as generated\nclasses are used internally. The Thrift compiler is required to generate classes used in tests.\nAs these are native-code tools, they must be installed on the build\nmachine (Java library dependencies are pulled from Maven repositories during the build).\n\n## Working with Thrift and Protocol Buffers in Hadoop\n\nWe provide InputFormats, OutputFormats, Pig Load / Store functions, Hive SerDes,\nand Writables for working with Thrift and Google Protocol Buffers.\nWe haven't written up the docs yet, but look at `ProtobufMRExample.java`, `ThriftMRExample.java`, `people_phone_number_count.pig`, and `people_phone_number_count_thrift.pig` under the `examples` directory for reflection-based dynamic usage.\nWe also provide utilities for generating Protobuf-specific Loaders, Input/Output Formats, etc., if for some reason you want to avoid\nthe dynamic bits.\n\n## Hadoop SequenceFiles and Pig\n\nReading and writing Hadoop SequenceFiles with Pig is supported via the classes\n[SequenceFileLoader](https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/load/SequenceFileLoader.java)\nand\n[SequenceFileStorage](https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java). 
These\nclasses make use of a\n[WritableConverter](https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/util/WritableConverter.java)\ninterface, allowing pluggable conversion of key and value instances to and from\nPig data types.\n\nHere's a short example: Suppose you have `SequenceFile\u003cText, LongWritable\u003e` data\nsitting beneath path `input`. We can load that data with the following Pig\nscript:\n\n```\nREGISTER '/path/to/elephant-bird.jar';\n\n%declare SEQFILE_LOADER 'com.twitter.elephantbird.pig.load.SequenceFileLoader';\n%declare TEXT_CONVERTER 'com.twitter.elephantbird.pig.util.TextConverter';\n%declare LONG_CONVERTER 'com.twitter.elephantbird.pig.util.LongWritableConverter';\n\npairs = LOAD 'input' USING $SEQFILE_LOADER (\n  '-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'\n) AS (key: chararray, value: long);\n```\n\nTo store `{key: chararray, value: long}` data as `SequenceFile\u003cText, LongWritable\u003e`, the following may be used:\n\n```\n%declare SEQFILE_STORAGE 'com.twitter.elephantbird.pig.store.SequenceFileStorage';\n\nSTORE pairs INTO 'output' USING $SEQFILE_STORAGE (\n  '-c $TEXT_CONVERTER', '-c $LONG_CONVERTER'\n);\n```\n\nFor details, please see Javadocs in the following classes:\n* [SequenceFileLoader](https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/load/SequenceFileLoader.java)\n* [SequenceFileStorage](https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/store/SequenceFileStorage.java)\n* [WritableConverter](https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/util/WritableConverter.java)\n* [GenericWritableConverter](https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/util/GenericWritableConverter.java)\n* 
[AbstractWritableConverter](https://github.com/kevinweil/elephant-bird/blob/master/pig/src/main/java/com/twitter/elephantbird/pig/util/AbstractWritableConverter.java)\n\n## How To Contribute\n\nBug fixes, features, and documentation improvements are welcome! Please fork the project and send us a pull request on github.\n\nEach new release since 2.1.3 has a *tag*. The latest version on master is what we are actively running on Twitter's hadoop clusters daily, over hundreds of terabytes of data.\n\n## Contributors\n\nMajor contributors are listed below. Lots of others have helped too, thanks to all of them!\nSee git logs for credits.\n\n* Kevin Weil ([@kevinweil](https://twitter.com/kevinweil))\n* Dmitriy Ryaboy ([@squarecog](https://twitter.com/squarecog))\n* Raghu Angadi ([@raghuangadi](https://twitter.com/raghuangadi))\n* Andy Schlaikjer ([@sagemintblue](https://twitter.com/sagemintblue))\n* Travis Crawford ([@tc](https://twitter.com/tc))\n* Johan Oskarsson ([@skr](https://twitter.com/skr))\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftwitter%2Felephant-bird","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftwitter%2Felephant-bird","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftwitter%2Felephant-bird/lists"}