https://github.com/vilcek/HiveKVStorageHandler2

Hive Storage Handler for Oracle NoSQL Database v2

HiveKVStorageHandler

About HiveKVStorageHandler:


This is an implementation of a Storage Handler to query data stored in Oracle NoSQL Database via Hive.


Note: This version works only with Oracle NoSQL Database v2.x.


Written by: Alexandre Vilcek ([email protected])




If you want to know what a Hive Storage Handler is, see the Hive Storage Handler documentation.




Current Limitations:



  • Supports only external non-native Hive tables.

  • Writing data to Oracle NoSQL Database is not supported yet.

  • Parsing of Hive SerDe properties is still very rudimentary: spaces between NoSQL DB key definitions in the key mapping properties of the Hive CREATE TABLE statement cause key names to be misinterpreted.

  • Column names and types specified in the Hive table definition are ignored; only the NoSQL DB Major and Minor Key mappings in the Hive CREATE TABLE statement define the column names.

  • A NoSQL DB Value for a given key is always interpreted as a string in the Hive table.
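
The whitespace limitation above is easy to trip over. As a minimal sketch, assuming the handler splits each mapping string on commas without trimming (an assumption about the implementation, not something this README confirms), the Python below shows how a single space changes a key name. `normalize_mapping` is a hypothetical helper, not part of the handler, for cleaning a mapping before it goes into the CREATE TABLE statement.

```python
def naive_split(mapping):
    """Split on commas only, keeping surrounding spaces (assumed handler behaviour)."""
    return mapping.split(",")


def normalize_mapping(mapping):
    """Hypothetical helper: strip spaces around each key name, rejoin with bare commas."""
    return ",".join(part.strip() for part in mapping.split(","))


if __name__ == "__main__":
    # The second key keeps its leading space and no longer equals "firstname".
    print(naive_split("lastname, firstname"))        # ['lastname', ' firstname']
    # The normalized form has no spaces between key definitions.
    print(normalize_mapping("lastname, firstname"))  # lastname,firstname
```

Running the mapping strings through a helper like this before writing the SERDEPROPERTIES clause avoids the problem regardless of how the handler actually parses them.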




Hive CREATE TABLE Syntax:


CREATE EXTERNAL TABLE <hive_table_name> (column_name column_type, column_name column_type, ...)

STORED BY 'org.vilcek.hive.kv.KVHiveStorageHandler'

WITH SERDEPROPERTIES ("kv.major.keys.mapping" = "<majorKey1,majorKey2,...>", "kv.minor.keys.mapping" = "<minorKey1,minorKey2,...>")

TBLPROPERTIES ("kv.host.port" = "<kvstore hostname>:<kvstore port number>", "kv.name" = "<kvstore name>");
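
When tables are created from scripts, the statement can be generated rather than typed by hand. The following is a sketch only: `build_create_table` and its parameters are hypothetical names introduced for illustration, and the function merely fills in the syntax shown above, joining key names with bare commas so that no spaces end up in the mappings.

```python
def build_create_table(table, columns, major_keys, minor_keys, host, port, store):
    """Render a CREATE EXTERNAL TABLE statement for the KV storage handler.

    columns is a list of (name, type) pairs; major_keys and minor_keys are
    lists of key names. All parameter names are illustrative.
    """
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    # Bare commas only: spaces between key definitions are misinterpreted.
    major = ",".join(k.strip() for k in major_keys)
    minor = ",".join(k.strip() for k in minor_keys)
    return (
        f"CREATE EXTERNAL TABLE {table} ({cols})\n"
        "STORED BY 'org.vilcek.hive.kv.KVHiveStorageHandler'\n"
        f'WITH SERDEPROPERTIES ("kv.major.keys.mapping" = "{major}", '
        f'"kv.minor.keys.mapping" = "{minor}")\n'
        f'TBLPROPERTIES ("kv.host.port" = "{host}:{port}", "kv.name" = "{store}");'
    )


if __name__ == "__main__":
    print(build_create_table(
        "nosqldbtest",
        [("lastname", "string"), ("firstname", "string"), ("birthdate", "string")],
        ["lastname", "firstname"],
        ["birthdate"],
        "localhost", 5000, "kvstore",
    ))
```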



Example:


Data stored in Oracle NoSQL Database:


/Smith/Bob/-/birthdate: 05/02/1975

/Smith/Bob/-/phonenumber: 1111-1111

/Smith/Bob/-/userid: 1

/Smith/Patricia/-/birthdate: 10/25/1967

/Smith/Patricia/-/phonenumber: 2222-2222

/Smith/Patricia/-/userid: 2

/Wong/Bill/-/birthdate: 03/10/1982

/Wong/Bill/-/phonenumber: 3333-3333

/Wong/Bill/-/userid: 3

Table definition and query in Hive:


hive> ADD JAR HiveKVStorageHandler.jar;


hive> CREATE EXTERNAL TABLE nosqldbtest (lastname string, firstname string, birthdate string, phonenumber string, userid string)

      STORED BY 'org.vilcek.hive.kv.KVHiveStorageHandler'

      WITH SERDEPROPERTIES ("kv.major.keys.mapping" = "lastname,firstname", "kv.minor.keys.mapping" = "birthdate,phonenumber,userID")

      TBLPROPERTIES ("kv.host.port" = "localhost:5000", "kv.name" = "kvstore");

hive> SELECT * FROM nosqldbtest;

OK

Smith     Patricia  10/25/1967  NULL       NULL

Smith     Patricia  NULL        2222-2222  NULL

Smith     Patricia  NULL        NULL       2

Smith     Bob       05/02/1975  NULL       NULL

Smith     Bob       NULL        1111-1111  NULL

Smith     Bob       NULL        NULL       1

Wong      Bill      03/10/1982  NULL       NULL

Wong      Bill      NULL        3333-3333  NULL

Wong      Bill      NULL        NULL       3
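
The output above has one row per key-value pair: the major key components fill the leading columns, and the value lands in the single column named by the pair's minor key, with NULL everywhere else. The Python below is a rough model of that mapping, not the handler's code. `kv_to_row` is a hypothetical name, the `/major/-/minor` path layout is taken from the example data, and minor key names are compared case-insensitively only because the example maps "userID" onto the stored key "userid".

```python
def kv_to_row(key_path, value, major_keys, minor_keys):
    """Model one key-value pair as one Hive row (None stands for NULL)."""
    # "/Smith/Bob/-/birthdate" -> major part "Smith/Bob", minor part "birthdate"
    major_part, _, minor_part = key_path.strip("/").partition("/-/")
    major_values = major_part.split("/")
    minor_name = minor_part.split("/")[0]

    # Leading columns come from the major key components, padded with NULLs.
    row = (major_values + [None] * len(major_keys))[: len(major_keys)]
    # Only the column matching this pair's minor key gets the value, as a string.
    row += [
        str(value) if name.lower() == minor_name.lower() else None
        for name in minor_keys
    ]
    return row


if __name__ == "__main__":
    majors = ["lastname", "firstname"]
    minors = ["birthdate", "phonenumber", "userID"]
    print(kv_to_row("/Smith/Bob/-/birthdate", "05/02/1975", majors, minors))
    # ['Smith', 'Bob', '05/02/1975', None, None]
```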

Note: Consider setting Hive's execution mode to local when working with small datasets. This avoids the overhead of launching MapReduce jobs, and in some cases the query will execute much faster. The best way to do that is to let Hive decide when to run jobs locally:


hive> set hive.exec.mode.local.auto=true;


hive> SELECT lastname, firstname, collect_set(birthdate)[0], collect_set(phonenumber)[0], collect_set(userid)[0]

      FROM nosqldbtest

      GROUP BY lastname, firstname;

OK

Smith     Bob       05/02/1975  1111-1111  1

Smith     Patricia  10/25/1967  2222-2222  2

Wong      Bill      03/10/1982  3333-3333  3
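
The GROUP BY query works because collect_set skips NULLs, so each group contributes exactly one value per minor-key column and `[0]` picks it. As an illustrative sketch in Python (`collapse_rows` is a hypothetical name, and it keeps the first non-NULL value, which matches the query only when each group has a single distinct value per column, as in this example), the same collapse looks like this:

```python
def collapse_rows(rows, n_group_cols):
    """Group rows by their first n_group_cols values; for every other column
    keep the first non-NULL (non-None) value seen in the group."""
    grouped = {}
    for row in rows:
        key = tuple(row[:n_group_cols])
        merged = grouped.setdefault(key, [None] * (len(row) - n_group_cols))
        for i, value in enumerate(row[n_group_cols:]):
            if merged[i] is None and value is not None:
                merged[i] = value
    # Sort by group key so the result order is deterministic.
    return [list(key) + grouped[key] for key in sorted(grouped)]


if __name__ == "__main__":
    sparse = [
        ["Smith", "Bob", "05/02/1975", None, None],
        ["Smith", "Bob", None, "1111-1111", None],
        ["Smith", "Bob", None, None, "1"],
    ]
    print(collapse_rows(sparse, 2))
    # [['Smith', 'Bob', '05/02/1975', '1111-1111', '1']]
```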