{"id":18408956,"url":"https://github.com/msgpack/msgpack-hadoop","last_synced_at":"2025-04-07T09:33:23.606Z","repository":{"id":1477401,"uuid":"1720767","full_name":"msgpack/msgpack-hadoop","owner":"msgpack","description":"MessagePack-Hadoop integration provides an efficient schema-free data representation for Hadoop and Hive.","archived":false,"fork":false,"pushed_at":"2011-05-18T02:27:17.000Z","size":6594,"stargazers_count":34,"open_issues_count":2,"forks_count":15,"subscribers_count":5,"default_branch":"master","last_synced_at":"2025-03-22T16:45:20.564Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"http://msgpack.org/","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/msgpack.png","metadata":{"files":{"readme":"README","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2011-05-09T02:13:57.000Z","updated_at":"2023-10-05T23:39:52.000Z","dependencies_parsed_at":"2022-07-22T06:01:58.737Z","dependency_job_id":null,"html_url":"https://github.com/msgpack/msgpack-hadoop","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msgpack%2Fmsgpack-hadoop","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msgpack%2Fmsgpack-hadoop/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msgpack%2Fmsgpack-hadoop/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msgpack%2Fmsgpack-hadoop/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/msgpack","download_url":"https://codeload.github.com/msgpack/msgpack-hadoop/tar.gz/re
fs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247626658,"owners_count":20969348,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-06T03:22:49.456Z","updated_at":"2025-04-07T09:33:23.356Z","avatar_url":"https://github.com/msgpack.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"MessagePack-Hadoop Integration\n========================================\n\nThis package contains the bridge layer between MessagePack (http://msgpack.org)\nand the Hadoop (http://hadoop.apache.org/) family of projects.\n\nIt lets you run MapReduce (MR) jobs on MessagePack-formatted data and issue\nHive queries over it.\n\nThe MessagePack-Hive adapter enables SQL-based ad hoc queries over *nested*,\n*unstructured* input data (like JSON, but binary-encoded). Of course, each query\nis executed on the MapReduce framework!\n\nHere is a sample MessagePack-Hive query, which counts user records per URL.\n\n\u003e CREATE EXTERNAL TABLE IF NOT EXISTS mpbin (v string) \\\n  ROW FORMAT DELIMITED FIELDS TERMINATED BY '@'  LINES TERMINATED BY '\\n' \\\n  LOCATION '/path/to/hdfs/';\n\n\u003e SELECT url, COUNT(1) \\\n  FROM mpbin LATERAL VIEW msgpack_map(v, 'user', 'url') m AS user, url \\\n  GROUP BY url;\n\nRequired Setup\n========================================\n\nPlease set up a Hadoop + Hive system. A Local, Pseudo-Distributed, or\nDistributed environment is OK.\n\nHive Getting Started\n========================================\n\n1. 
locate jars\n\nPut these jars in the $HIVE_HOME/lib/ directory.\n\n* msgpack-hadoop-$version.jar\n* msgpack-$version.jar\n* javassist-$version.jar\n\n2. exec hive shell\n\nPlease execute the following command.\n\n$ hive --auxpath $HIVE_HOME/lib/msgpack-hadoop-$version.jar,$HIVE_HOME/lib/msgpack-$version.jar,$HIVE_HOME/lib/javassist-$version.jar\n\nYou can skip the --auxpath option once you add this property to your hive-site.xml.\n\n\u003cproperty\u003e\n  \u003cname\u003ehive.aux.jars.path\u003c/name\u003e\n  \u003cvalue\u003e$HIVE_HOME/lib/msgpack-hadoop-$version.jar,$HIVE_HOME/lib/msgpack-$version.jar,$HIVE_HOME/lib/javassist-$version.jar\u003c/value\u003e\n\u003c/property\u003e\n\n3. add jar and load custom UDTF function\n\nThis step is required in every Hive session.\n\nhive\u003e add jar $HIVE_HOME/lib/msgpack-hadoop-$version.jar;\nhive\u003e add jar $HIVE_HOME/lib/msgpack-$version.jar;\nhive\u003e add jar $HIVE_HOME/lib/javassist-$version.jar;\nhive\u003e CREATE TEMPORARY FUNCTION msgpack_map AS 'org.msgpack.hadoop.hive.udf.GenericUDTFMessagePackMap';\n\n4. create external table\n\nCreate an external table that points to the data directory.\n\nhive\u003e CREATE EXTERNAL TABLE IF NOT EXISTS mp_table (v string) \\\n      ROW FORMAT DELIMITED FIELDS TERMINATED BY '@'  LINES TERMINATED BY '\\n' \\\n      LOCATION '/path/to/hdfs/';\n\n5. execute the query\n\nFinally, execute the SELECT query over the input data.\n\nThe input msgpack data is unstructured and nested. Therefore, you need to \"map\"\nthe MessagePack structure to Hive field names. You can map the fields with\nthe msgpack_map() UDTF, and name them with the \"AS\" clause.\n\nhive\u003e SELECT url, COUNT(1) \\\n      FROM mp_table LATERAL VIEW msgpack_map(v, 'user', 'url') m AS user, url \\\n      GROUP BY url;\n\nCaveats\n========================================\n\nCurrently, MessagePackInputFormat is unsplittable. 
Therefore, you need to\nmanually *shred* the data into small pieces.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsgpack%2Fmsgpack-hadoop","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmsgpack%2Fmsgpack-hadoop","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsgpack%2Fmsgpack-hadoop/lists"}