{"id":18304046,"url":"https://github.com/googleclouddataproc/hive-bigquery-storage-handler","last_synced_at":"2026-03-16T16:08:21.813Z","repository":{"id":40460912,"uuid":"208860486","full_name":"GoogleCloudDataproc/hive-bigquery-storage-handler","owner":"GoogleCloudDataproc","description":"Hive Storage Handler for interoperability between BigQuery and Apache Hive","archived":false,"fork":false,"pushed_at":"2025-01-29T19:22:16.000Z","size":77,"stargazers_count":19,"open_issues_count":11,"forks_count":11,"subscribers_count":30,"default_branch":"master","last_synced_at":"2025-07-02T03:05:41.019Z","etag":null,"topics":["apache","bigquery","gcp","google","hadoop","hive"],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/GoogleCloudDataproc.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-09-16T17:41:05.000Z","updated_at":"2023-08-02T01:49:08.000Z","dependencies_parsed_at":"2025-04-05T15:41:36.647Z","dependency_job_id":null,"html_url":"https://github.com/GoogleCloudDataproc/hive-bigquery-storage-handler","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/GoogleCloudDataproc/hive-bigquery-storage-handler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fhive-bigquery-storage-handler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fhive-bigquery-storage-handler/tags","releases_url":"https://repos.ecosys
te.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fhive-bigquery-storage-handler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fhive-bigquery-storage-handler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/GoogleCloudDataproc","download_url":"https://codeload.github.com/GoogleCloudDataproc/hive-bigquery-storage-handler/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/GoogleCloudDataproc%2Fhive-bigquery-storage-handler/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265692300,"owners_count":23812194,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["apache","bigquery","gcp","google","hadoop","hive"],"created_at":"2024-11-05T15:27:39.122Z","updated_at":"2026-03-16T16:08:16.774Z","avatar_url":"https://github.com/GoogleCloudDataproc.png","language":"Java","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Hive-BigQuery StorageHandler  [No Longer Maintained]\n  \nThis is a Hive StorageHandler plugin that enables Hive to interact with BigQuery. It allows you to keep\nyour existing pipelines while moving to BigQuery. It uses the high-throughput \n[BigQuery Storage API](https://cloud.google.com/bigquery/docs/reference/storage/) to read data and\n uses the BigQuery API to write data.  \n   \n The following steps are performed on a Dataproc cluster in Google Cloud Platform. 
If you need to run on your own cluster, you\n will need to set up the Google Cloud SDK and the Google Cloud Storage connector for Hadoop.  \n  \n## Getting the StorageHandler  \n  \n1. Check it out from GitHub.  \n2. Build it with the new Google [Hadoop BigQuery Connector](https://cloud.google.com/dataproc/docs/concepts/connectors/bigquery)   \n``` shell  \ngit clone https://github.com/GoogleCloudPlatform/hive-bigquery-storage-handler  \ncd hive-bigquery-storage-handler  \nmvn clean install  \n```  \n3. Deploy hive-bigquery-storage-handler-1.0-shaded.jar   \n  \n## Using the StorageHandler to access BigQuery  \n  \n1. Enable the BigQuery Storage API. Follow [these instructions](https://cloud.google.com/bigquery/docs/reference/storage/#enabling_the_api) and check the [pricing details](https://cloud.google.com/bigquery/pricing#storage-api). \n       \n2. Copy the compiled JAR to a Google Cloud Storage bucket that can be accessed by your Hive cluster.  \n  \n3. Open the Hive CLI and load the JAR as shown below:  \n  \n```shell  \nhive\u003e add jar gs://\u003cJar location\u003e/hive-bigquery-storage-handler-1.0-shaded.jar;  \n```  \n  \n4. Verify that the JAR loaded successfully:  \n  \n```shell  \nhive\u003e list jars;  \n```  \n  \nAt this point you can use Hive just as you did before.  
\n  \n### Creating BigQuery tables  \nIf you already have a BigQuery table, here is how you can define a Hive table that refers to it:  \n```sql  \nCREATE TABLE bq_test (word_count bigint, word string)  \n STORED BY \n 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler' \n TBLPROPERTIES ( \n 'bq.dataset'='\u003cBigQuery dataset name\u003e', \n 'bq.table'='\u003cBigQuery table name\u003e', \n 'mapred.bq.project.id'='\u003cYour Project ID\u003e', \n 'mapred.bq.temp.gcs.path'='gs://\u003cBucket name\u003e/\u003cTemporary path\u003e', \n 'mapred.bq.gcs.bucket'='\u003cCloud Storage Bucket name\u003e' \n );\n```  \n\nYou will need to provide the following table properties:  \n  \n| Property | Value |  \n|--------|-----|  \n| bq.dataset | BigQuery dataset ID (optional if the Hive database name matches the BigQuery dataset name) |  \n| bq.table | BigQuery table name (optional if the Hive table name matches the BigQuery table name) |  \n| mapred.bq.project.id | Your project ID |  \n| mapred.bq.temp.gcs.path | Temporary file location in the GCS bucket |  \n| mapred.bq.gcs.bucket | Temporary GCS bucket name |  \n  \n### Data Type Mapping  \n  \n| BigQuery | Hive | Description | \n|--------|-----| ---------------|  \n| INTEGER | BIGINT | Signed 8-byte integer |  \n| FLOAT | DOUBLE | 8-byte double-precision floating-point number |  \n| DATE | DATE | Format is YYYY-[M]M-[D]D. The range of values supported for the DATE type is 0001-01-01 to 9999-12-31 |  \n| TIMESTAMP | TIMESTAMP | Represents an absolute point in time since the Unix epoch, with millisecond precision in Hive compared to microsecond precision in BigQuery. |  
\n| BOOLEAN | BOOLEAN | Boolean values are represented by the keywords TRUE and FALSE |  \n| STRING | STRING | Variable-length character data |  \n| BYTES | BINARY | Variable-length binary data |  \n| REPEATED | ARRAY | Represents repeated values |  \n| RECORD | STRUCT | Represents nested structures |  \n  \n### Filtering\nThe BigQuery Storage API supports column pruning and predicate filtering, so only the data you are interested in is read.\n\n#### Column Pruning\nSince BigQuery is [backed by a columnar datastore](https://cloud.google.com/blog/big-data/2016/04/inside-capacitor-bigquerys-next-generation-columnar-storage-format), it can efficiently stream data without reading all columns.\n\n#### Predicate Filtering\nThe Storage API supports arbitrary pushdown of predicate filters. To enable predicate pushdown, ensure \u003cb\u003ehive.optimize.ppd\u003c/b\u003e is set to true. \u003cbr/\u003e\nFilters on all primitive-type columns are pushed down to the storage layer, which improves read performance. Predicate pushdown is not supported on complex types such as arrays and structs. For example, filters like `address.city = \"Sunnyvale\"` will not be pushed down to BigQuery.\n  \n### Caveats  \n1. Ensure that the table exists in BigQuery and that column names are always lowercase.  \n2. A timestamp column in Hive is interpreted as timezone-less and stored as an offset from the Unix epoch with millisecond precision.  \n   To display it in a human-readable format, the from_unixtime UDF can be used as follows:   \n   ```sql   \n   from_unixtime(cast(cast(\u003ctimestampcolumn\u003e as bigint)/1000 as bigint), 'yyyy-MM-dd HH:mm:ss')      \n   ```  \n### Issues  \n1. Writing to BigQuery will fail when using Apache Tez as the execution engine. As a workaround, run ```set hive.execution.engine=mr``` to use MapReduce as the execution engine.\n2. 
The STRUCT type is not supported unless an Avro schema is explicitly specified using either the avro.schema.literal or the avro.schema.url table property.\n   The table below uses all supported types and defines the schema explicitly.  Note: if the table does not need a STRUCT column, specifying the schema is optional.\n   ```sql\n    CREATE TABLE dbname.alltypeswithSchema(currenttimestamp TIMESTAMP,currentdate DATE, userid BIGINT, sessionid STRING, skills Array\u003cString\u003e,\n      eventduration DOUBLE, eventcount BIGINT, is_latest BOOLEAN,keyset BINARY,addresses ARRAY\u003cSTRUCT\u003cstatus: STRING, street: STRING,city: STRING, state: STRING,zip: BIGINT\u003e\u003e )\n      STORED BY 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'\n      TBLPROPERTIES (\n       'bq.dataset'='bqdataset',\n       'bq.table'='bqtable',\n       'mapred.bq.project.id'='bqproject',\n       'mapred.bq.temp.gcs.path'='gs://bucketname/prefix',\n       'mapred.bq.gcs.bucket'='bucketname',\n       'avro.schema.literal'='{\"type\":\"record\",\"name\":\"alltypesnonnull\",\n           \"fields\":[{\"name\":\"currenttimestamp\",\"type\":[\"null\",{\"type\":\"long\",\"logicalType\":\"timestamp-micros\"}], \"default\" : null}\n                    ,{\"name\":\"currentdate\",\"type\":{\"type\":\"int\",\"logicalType\":\"date\"}, \"default\" : -1},{\"name\":\"userid\",\"type\":\"long\",\"doc\":\"User identifier.\", \"default\" : -1}\n                    ,{\"name\":\"sessionid\",\"type\":[\"null\",\"string\"], \"default\" : null},{\"name\":\"skills\",\"type\":[\"null\", {\"type\":\"array\",\"items\":\"string\"}], \"default\" : null}\n                    ,{\"name\":\"eventduration\",\"type\":[\"null\",\"double\"], \"default\" : null},{\"name\":\"eventcount\",\"type\":[\"null\",\"long\"], \"default\" : null}\n                    ,{\"name\":\"is_latest\",\"type\":[\"null\",\"boolean\"], \"default\" : null},{\"name\":\"keyset\",\"type\":[\"null\",\"bytes\"], \"default\" : null}\n                    
,{\"name\":\"addresses\",\"type\":[\"null\", {\"type\":\"array\",\n                       \"items\":{\"type\":\"record\",\"name\":\"__s_0\",\n                       \"fields\":[{\"name\":\"status\",\"type\":\"string\"},{\"name\":\"street\",\"type\":\"string\"},{\"name\":\"city\",\"type\":\"string\"},{\"name\":\"state\",\"type\":\"string\"},{\"name\":\"zip\",\"type\":\"long\"}]\n                       }}], \"default\" : null\n                     }\n                   ]\n           }'\n      );\n   ```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogleclouddataproc%2Fhive-bigquery-storage-handler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgoogleclouddataproc%2Fhive-bigquery-storage-handler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgoogleclouddataproc%2Fhive-bigquery-storage-handler/lists"}