{"id":16600806,"url":"https://github.com/sonalgoyal/hiho","last_synced_at":"2025-03-21T13:32:42.405Z","repository":{"id":1181514,"uuid":"1081160","full_name":"sonalgoyal/hiho","owner":"sonalgoyal","description":"Hadoop Data Integration with various databases, ftp servers, salesforce. Incremental update, dedup, append, merge your data on Hadoop.","archived":false,"fork":false,"pushed_at":"2013-04-11T06:53:10.000Z","size":47010,"stargazers_count":91,"open_issues_count":5,"forks_count":32,"subscribers_count":11,"default_branch":"master","last_synced_at":"2025-03-18T01:51:30.641Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"www.nubetech.co/products","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sonalgoyal.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2010-11-15T06:39:45.000Z","updated_at":"2024-06-05T08:33:15.000Z","dependencies_parsed_at":"2022-08-16T12:30:10.893Z","dependency_job_id":null,"html_url":"https://github.com/sonalgoyal/hiho","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sonalgoyal%2Fhiho","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sonalgoyal%2Fhiho/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sonalgoyal%2Fhiho/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sonalgoyal%2Fhiho/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sonalgoyal","download_url":"https://codeload.github.com/sonalgoyal/hiho/tar.gz/refs/heads
/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244806183,"owners_count":20513396,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-12T00:15:45.453Z","updated_at":"2025-03-21T13:32:42.062Z","avatar_url":"https://github.com/sonalgoyal.png","language":"Java","readme":"# HIHO: Hadoop In, Hadoop Out. \n\n\u003e Hadoop Data Integration, deduplication, incremental update and more.  \n\nThis branch supports HIHO on Apache Hadoop 0.21.\n\n## Import from a database to HDFS\n\n**query based import**  \n\nJoin multiple tables, provide where conditions, and dynamically bind parameters to SQL queries to bring data into Hadoop. 
It is as simple as creating a config file and running the job.\n\n\tbin/hadoop jar hiho.jar co.nubetech.hiho.job.DBQueryInputJob -conf dbInputQueryDelimited.xml\n\nor\n \n\t${HIHO_HOME}/scripts/hiho import \n\t\t-jdbcDriver \u003cjdbcDriver\u003e \n\t\t-jdbcUrl \u003cjdbcUrl\u003e \n\t\t-jdbcUsername \u003cjdbcUsername\u003e \n\t\t-jdbcPassword \u003cjdbcPassword\u003e \n\t\t-inputQuery \u003cinputQuery\u003e \n\t\t-inputBoundingQuery \u003cinputBoundingQuery\u003e \n\t\t-outputPath \u003coutputPath\u003e \n\t\t-outputStrategy \u003coutputStrategy\u003e \n\t\t-delimiter \u003cdelimiter\u003e \n\t\t-numberOfMappers \u003cnumberOfMappers\u003e \n\t\t-inputOrderBy \u003cinputOrderBy\u003e \n \n**table based import**  \n\n\tbin/hadoop jar hiho.jar co.nubetech.hiho.job.DBQueryInputJob -conf dbInputTableDelimited.xml\n\nor \n\n\t${HIHO_HOME}/scripts/hiho import \n\t\t-jdbcDriver \u003cjdbcDriver\u003e \n\t\t-jdbcUrl \u003cjdbcUrl\u003e \n\t\t-jdbcUsername \u003cjdbcUsername\u003e \n\t\t-jdbcPassword \u003cjdbcPassword\u003e\n\t\t-outputPath \u003coutputPath\u003e \n\t\t-outputStrategy \u003coutputStrategy\u003e \n\t\t-delimiter \u003cdelimiter\u003e \n\t\t-numberOfMappers \u003cnumberOfMappers\u003e \n\t\t-inputOrderBy \u003cinputOrderBy\u003e \n\t\t-inputTableName \u003cinputTableName\u003e \n\t\t-inputFieldNames \u003cinputFieldNames\u003e\n\n**incremental import** by appending to an existing `HDFS` location so that all data is in one place.\nJust specify `isAppend = true` in the configuration and import. The import will be written to the existing HDFS folder.\n\n**configurable format for data import: delimited, avro** by specifying `mapreduce.jdbc.hiho.input.outputStrategy` as DELIMITED or AVRO.\n \n**Note:** \n\n1. Please specify the delimiter in double quotes, because some delimiters such as the semicolon ';' break the command otherwise. For example: `-delimiter \";\"`. 
If you are specifying * for `inputFieldNames`, put it in double quotes as well.  \n\n## Export to Databases\n\n**high performance MySQL loading using LOAD DATA INFILE**\n\n\t${HIHO_HOME}/scripts/hiho export mysql\n\t\t-inputPath \u003cinputPath\u003e \n\t\t-url \u003curl\u003e \n\t\t-userName \u003cuserName\u003e \n\t\t-password \u003cpassword\u003e \n\t\t-querySuffix  \u003cquerySuffix\u003e\n\n**high performance Oracle loading by creating external tables.** See [expert opinion](http://asktom.oracle.com/pls/asktom/f?p=100:11:0::::P11_QUESTION_ID:6611962171229)  \nFor information on external tables, check [here](http://download.oracle.com/docs/cd/B12037_01/server.101/b10825/et_concepts.htm)  \nOn the Oracle server\n\n1. Make a folder\n\n\t\tmkdir -p ageTest\n\n2. Create a directory through the Oracle Client (sqlplus) and grant it privileges.\n\n\t\tsqlplus\u003ecreate or replace directory age_ext as '/home/nube/age';\n\n3. Allow FTP to the Oracle server, then run\n\t\n\t\t${HIHO_HOME}/scripts/hiho export oracle \n\t\t\t-inputPath \u003cinputPath\u003e \n\t\t\t-oracleFtpAddress \u003coracleFtpAddress\u003e \n\t\t\t-oracleFtpPortNumber \u003coracleFtpPortNumber\u003e \n\t\t\t-oracleFtpUserName \u003coracleFtpUserName\u003e \n\t\t\t-oracleFtpPassword \u003coracleFtpPassword\u003e \n\t\t\t-oracleExternalTableDirectory \u003coracleExternalTableDirectory\u003e \n\t\t\t-driver \u003cdriver\u003e \n\t\t\t-url \u003curl\u003e \n\t\t\t-userName \u003cuserName\u003e \n\t\t\t-password \u003cpassword\u003e \n\t\t\t-externalTable \u003ccreateExternalTableQuery\u003e\n\n**custom loading and export to any database** by emitting your own `GenericDBWritables`. Check `DelimitedLoadMapper`\n\n## Export to SalesForce  \n\n**send computed map reduce results to Salesforce.**\n\nFor this, you need a developer account with the Bulk API enabled. 
You can join at http://developer.force.com/join\n \nIf you get the message:\n\n\u003e LoginFault [ApiFault exceptionCode='INVALID_LOGIN' exceptionMessage='Invalid username, password, security token; or user locked out.']\n\n\u003e \"Invalid username, password, security token; or user locked out. Are you at a new location? When accessing Salesforce--either via a desktop client or the API--from outside of your company’s trusted networks, you must add a security token to your password to log in. To receive a new security token, log in to [Salesforce](http://www.salesforce.com) and click Setup | My Personal Information | Reset Security Token.\"\n\nLog in and get the security token, then try\n\n\tsfUserName - name of the Salesforce account\n\tsfPassword - password and security token. The security token can be obtained by logging in to the Salesforce.com site and clicking Reset Security Token.\n\tsfObjectType - the Salesforce object to export\n\tsfHeaders - header describing the Salesforce object properties. For more information, refer to the Bulk API Developer's Guide.\n\n\t${HIHO_HOME}/scripts/hiho export saleforce \n\t\t-inputPath \u003cinputPath\u003e \n\t\t-sfUserName \u003csfUserName\u003e \n\t\t-sfPassword \u003csfPassword\u003e \n\t\t-sfObjectType \u003csfObjectType\u003e \n\t\t-sfHeaders \u003csfHeaders\u003e\n\n\n## Export results to an FTP Server\n\nUse the `co.nubetech.hiho.mapreduce.lib.output.FTPOutputFormat` directly in your job, just like `FileOutputFormat`. For usage, check `co.nubetech.hiho.job.ExportToFTPserver`. 
This job writes the output directly to an FTP server. It can be invoked as:\n\n\t${HIHO_HOME}/scripts/hiho export ftp \n\t\t-inputPath \u003cinputPath\u003e \n\t\t-outputPath \u003coutputPath\u003e \n\t\t-ftpUserName \u003cftpUserName\u003e \n\t\t-ftpAddress \u003cftpAddress\u003e \n\t\t-ftpPortNumper \u003cftpPortNumper\u003e \n\t\t-ftpPassword \u003cftpPassword\u003e\n\nWhere:\n\n\tftpUserName - FTP server login username\n\tftpAddress - FTP server address\n\tftpPortNumper - FTP port\n\tftpPassword - FTP server password\n\toutputPath - the location on the FTP server to which the output will be written; it should be a complete directory path, e.g. /home/sgoyal/output  \n\n\n## Export to Hive\nThis exports data from any other database to Hive. \nHive export can be done in two ways, query based and table based. The configurations needed are:\n \n\tmapreduce.jdbc.hiho.input.loadTo - defines the database into which you want to load your data from HDFS, e.g. hive\n\tmapreduce.jdbc.hiho.input.loadToPath - HIHO also generates a script of all queries; this defines where to store that script on your local system  \n\tmapreduce.jdbc.hiho.hive.driver - name of the Hive JDBC driver, e.g. org.apache.hadoop.hive.jdbc.HiveDriver\n\tmapreduce.jdbc.hiho.hive.url - Hive URL for the JDBC connection, e.g. jdbc:hive:// for embedded mode, jdbc:hive://localhost:10000/default for standalone mode\n\tmapreduce.jdbc.hiho.hive.usrName - user name for the JDBC connection \n\tmapreduce.jdbc.hiho.hive.password - password for the JDBC connection\n\tmapreduce.jdbc.hiho.hive.partitionBy - used when you want to create a partitioned Hive table. 
For example: country:string:us;name:string:jack (static partitions), country:string:us;name:string (one static and one dynamic partition), country:string (dynamic partition); currently only one dynamic partition is allowed.\n\t\t\t\t\t\t\t\t\t\t\tData can also be stored in a table for multiple partitions at a time; for that, the value is given as country:string:us,uk,aus, and you need to define three different queries or tables in their respective configurations \n\tmapreduce.jdbc.hiho.hive.ifNotExists - set true if you want to include the 'if not exists' clause in your create table query\n\tmapreduce.jdbc.hiho.hive.tableName - name of the Hive table you want to create\n\tmapreduce.jdbc.hiho.hive.sortedBy - can only be used if the clusteredBy configuration is defined; give the name of the column by which you want to sort your data\n\tmapreduce.jdbc.hiho.hive.clusteredBy - defines the name of the column by which you want to cluster your data and the number of buckets you want to create. 
For example: name:2\n \nExecution command for table based import:\n\n\tbin/hadoop jar ~/workspace/hiho/build/classes/hiho.jar co.nubetech.hiho.job.DBQueryInputJob -conf  ~/workspace/hiho/conf/dbInputTableDelimitedHive.xml\n\nor\n\n\t${HIHO_HOME}/scripts/hiho import \n\t\t-jdbcDriver \u003cjdbcDriver\u003e \n\t\t-jdbcUrl \u003cjdbcUrl\u003e \n\t\t-jdbcUsername \u003cjdbcUsername\u003e \n\t\t-jdbcPassword \u003cjdbcPassword\u003e \n\t\t-outputPath \u003coutputPath\u003e \n\t\t-outputStrategy \u003coutputStrategy\u003e \n\t\t-delimiter \u003cdelimiter\u003e \n\t\t-numberOfMappers \u003cnumberOfMappers\u003e \n\t\t-inputOrderBy \u003cinputOrderBy\u003e \n\t\t-inputTableName \u003cinputTableName\u003e \n\t\t-inputFieldNames \u003cinputFieldNames\u003e \n\t\t-inputLoadTo hive \n\t\t-inputLoadToPath \u003cinputLoadToPath\u003e \n\t\t-hiveDriver \u003chiveDriver\u003e  \n\t\t-hiveUrl \u003chiveUrl\u003e \n\t\t-hiveUsername \u003chiveUsername\u003e \n\t\t-hivePassword \u003chivePassword\u003e \n\t\t-hivePartitionBy \u003chivePartitionBy\u003e \n\t\t-hiveIfNotExists \u003chiveIfNotExists\u003e \n\t\t-hiveTableName \u003chiveTableName\u003e \n\t\t-hiveSortedBy \u003chiveSortedBy\u003e \n\t\t-hiveClusteredBy \u003chiveClusteredBy\u003e \n \nFor query based import:\n\n\tbin/hadoop jar ~/workspace/hiho/build/classes/hiho.jar co.nubetech.hiho.job.DBQueryInputJob -conf  ~/workspace/hiho/conf/dbInputQueryDelimitedHive.xml\n\nor\n\n\t${HIHO_HOME}/scripts/hiho import \n\t\t-jdbcDriver \u003cjdbcDriver\u003e \n\t\t-jdbcUrl \u003cjdbcUrl\u003e \n\t\t-jdbcUsername \u003cjdbcUsername\u003e \n\t\t-jdbcPassword \u003cjdbcPassword\u003e \n\t\t-outputPath \u003coutputPath\u003e \n\t\t-outputStrategy \u003coutputStrategy\u003e \n\t\t-delimiter \u003cdelimiter\u003e \n\t\t-numberOfMappers \u003cnumberOfMappers\u003e \n\t\t-inputOrderBy \u003cinputOrderBy\u003e \n\t\t-inputLoadTo hive \n\t\t-inputLoadToPath \u003cinputLoadToPath\u003e \n\t\t-hiveDriver \u003chiveDriver\u003e  \n\t\t-hiveUrl \u003chiveUrl\u003e 
\n\t\t-hiveUsername \u003chiveUsername\u003e \n\t\t-hivePassword \u003chivePassword\u003e \n\t\t-hivePartitionBy \u003chivePartitionBy\u003e \n\t\t-hiveIfNotExists \u003chiveIfNotExists\u003e \n\t\t-hiveTableName \u003chiveTableName\u003e \n\t\t-hiveSortedBy \u003chiveSortedBy\u003e \n\t\t-hiveClusteredBy \u003chiveClusteredBy\u003e\n\n**Notes:**  \n\n1. The Hive table name is mandatory when you are querying more than one query or table, that is, in the case of multiple partitions\n2. Please note that the sortedBy feature will not work until the clusteredBy feature is defined\n\n\n## Dedup details\n\n\tbin/hadoop jar ~/workspace/HIHO/deploy/hiho.jar co.nubetech.hiho.dedup.DedupJob -inputFormat \u003cinputFormat\u003e -dedupBy \u003c\"key\" or \"value\"\u003e -inputKeyClassName \u003cinputKeyClassName\u003e -inputValueClassName \u003cinputValueClassName\u003e -inputPath \u003cinputPath\u003e -outputPath \u003coutputPath\u003e -delimeter \u003cdelimeter\u003e -column \u003ccolumn\u003e -outputFormat \u003coutputFormat\u003e\n\nAlternatively, dedup can be executed by running the HadoopTransform script present in `$HIHO_HOME/scripts/`:\n\n\t${HIHO_HOME}/scripts/hiho dedup \n\t\t-inputFormat \u003cinputFormat\u003e \n\t\t-dedupBy \u003c\"key\" or \"value\"\u003e \n\t\t-inputKeyClassName \u003cinputKeyClassName\u003e \n\t\t-inputValueClassName \u003cinputValueClassName\u003e \n\t\t-inputPath \u003cinputPath\u003e \n\t\t-outputPath \u003coutputPath\u003e \n\t\t-delimeter \u003cdelimeter\u003e -column \u003ccolumn\u003e\n\n**Example For Deduplication with key:**  \n\nFor Sequence Files:\n \n\t${HIHO_HOME}/scripts/hiho dedup \n\t\t-inputFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \n\t\t-inputKeyClassName org.apache.hadoop.io.IntWritable \n\t\t-inputValueClassName org.apache.hadoop.io.Text \n\t\t-inputPath testData/dedup/inputForSeqTest \n\t\t-outputPath output -dedupBy key\n\nFor Delimited Text Files: \n\n\t${HIHO_HOME}/scripts/hiho dedup \n\t\t-inputFormat 
co.nubetech.hiho.dedup.DelimitedTextInputFormat \n\t\t-inputKeyClassName org.apache.hadoop.io.Text \n\t\t-inputValueClassName org.apache.hadoop.io.Text \n\t\t-inputPath testData/dedup/textFilesForTest \n\t\t-outputPath output -delimeter , \n\t\t-column 1 \n\t\t-dedupBy key\n\n**Example For Deduplication with value:**  \n\nFor Sequence Files: \n\n\t${HIHO_HOME}/scripts/hiho dedup \n\t\t-inputFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \n\t\t-inputKeyClassName org.apache.hadoop.io.IntWritable \n\t\t-inputValueClassName org.apache.hadoop.io.Text \n\t\t-inputPath testData/dedup/inputForSeqTest \n\t\t-outputPath output -dedupBy value\n\nFor Delimited Text Files: \n\n\t${HIHO_HOME}/scripts/hiho dedup \n\t\t-inputFormat co.nubetech.hiho.dedup.DelimitedTextInputFormat \n\t\t-inputKeyClassName org.apache.hadoop.io.Text \n\t\t-inputValueClassName org.apache.hadoop.io.Text \n\t\t-inputPath testData/dedup/textFilesForTest \n\t\t-outputPath output \n\t\t-dedupBy value\n\n## Merge details: \n\n\t${HIHO_HOME}/scripts/hiho merge \n\t\t-newPath \u003cnewPath\u003e \n\t\t-oldPath \u003coldPath\u003e \n\t\t-mergeBy \u003c\"key\" or \"value\"\u003e \n\t\t-outputPath \u003coutputPath\u003e \n\t\t-inputFormat \u003cinputFormat\u003e \n\t\t-inputKeyClassName \u003cinputKeyClassName\u003e \n\t\t-inputValueClassName \u003cinputValueClassName\u003e \n\t\t-outputFormat \u003coutputFormat\u003e\n\n**Example For Merge with key:**  \n\nFor Sequence Files:\n\n\t${HIHO_HOME}/scripts/hiho merge \n\t\t-newPath testData/merge/inputNew/input1.seq \n\t\t-oldPath testData/merge/inputOld/input2.seq \n\t\t-mergeBy key -outputPath output  \n\t\t-inputFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \n\t\t-inputKeyClassName org.apache.hadoop.io.IntWritable \n\t\t-inputValueClassName org.apache.hadoop.io.Text\n\nFor Delimited Text Files:\n\n\t${HIHO_HOME}/scripts/hiho merge \n\t\t-newPath testData/merge/inputNew/fileInNewPath.txt \n\t\t-oldPath 
testData/merge/inputOld/fileInOldPath.txt \n\t\t-mergeBy key \n\t\t-outputPath output \n\t\t-inputFormat co.nubetech.hiho.dedup.DelimitedTextInputFormat \n\t\t-inputKeyClassName org.apache.hadoop.io.Text \n\t\t-inputValueClassName org.apache.hadoop.io.Text\n\n**Example For Merge with value:**  \n\nFor Sequence Files:\n\n\t${HIHO_HOME}/scripts/hiho merge \n\t\t-newPath testData/merge/inputNew/input1.seq \n\t\t-oldPath testData/merge/inputOld/input2.seq \n\t\t-mergeBy value \n\t\t-outputPath output  \n\t\t-inputFormat org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat \n\t\t-inputKeyClassName org.apache.hadoop.io.IntWritable \n\t\t-inputValueClassName org.apache.hadoop.io.Text\n\nFor Delimited Text Files:\n\n\t${HIHO_HOME}/scripts/hiho merge \n\t\t-newPath testData/merge/inputNew/fileInNewPath.txt \n\t\t-oldPath testData/merge/inputOld/fileInOldPath.txt \n\t\t-mergeBy value \n\t\t-outputPath output \n\t\t-inputFormat co.nubetech.hiho.dedup.DelimitedTextInputFormat \n\t\t-inputKeyClassName org.apache.hadoop.io.Text \n\t\t-inputValueClassName org.apache.hadoop.io.Text\n\n## Export to DB:\n\n\tbin/hadoop jar deploy/hiho-0.4.0.jar co.nubetech.hiho.job.ExportToDB  \n\t\t-jdbcDriver \u003cjdbcDriverName\u003e  \n\t\t-jdbcUrl \u003cjdbcUrl\u003e  \n\t\t-jdbcUsername \u003cjdbcUserName\u003e  \n\t\t-jdbcPassword \u003cjdbcPassword\u003e \n\t\t-delimiter \u003cdelimiter\u003e \n\t\t-numberOfMappers \u003cnumberOfMappers\u003e \n\t\t-tableName \u003ctableName\u003e \n\t\t-columnNames \u003ccolumnNames\u003e \n\t\t-inputPath \u003cinputPath\u003e \nor\n\n\t${HIHO_HOME}/scripts/hiho export db \n\t\t-jdbcDriver \u003cjdbcDriverName\u003e  \n\t\t-jdbcUrl \u003cjdbcUrl\u003e  \n\t\t-jdbcUsername \u003cjdbcUserName\u003e  \n\t\t-jdbcPassword \u003cjdbcPassword\u003e \n\t\t-delimiter \u003cdelimiter\u003e \n\t\t-numberOfMappers \u003cnumberOfMappers\u003e \n\t\t-tableName \u003ctableName\u003e \n\t\t-columnNames \u003ccolumnNames\u003e \n\t\t-inputPath 
\u003cinputPath\u003e \n\n## New Features in this release\n- incremental import and introduction of AppendFileInputFormat\n- Oracle export\n- FTP Server integration\n- Salesforce\n- Support for Apache Hadoop 0.20\n- Support for Apache Hadoop 0.21\n- Generic dedup and merge\n\n## Other improvements\n- Ivy based build and dependency management\n- JUnit and Mockito based test cases \n\n**Note:** \n\n1. To run `TestExportToMySQLDB`, you need to add `hiho-0.4.0.jar`, all Hadoop and Hadoop lib jars, and `mysql-connector-java.jar`\n\t\tto the `classpath`.\n","funding_links":[],"categories":["Data Ingestion"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsonalgoyal%2Fhiho","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsonalgoyal%2Fhiho","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsonalgoyal%2Fhiho/lists"}