{"id":13798623,"url":"https://github.com/dharmeshkakadia/tpch-hdinsight","last_synced_at":"2026-02-05T20:36:19.040Z","repository":{"id":149040032,"uuid":"67905935","full_name":"dharmeshkakadia/tpch-hdinsight","owner":"dharmeshkakadia","description":"TPCH benchmark for various engines","archived":false,"fork":false,"pushed_at":"2017-09-17T03:07:53.000Z","size":117,"stargazers_count":6,"open_issues_count":0,"forks_count":8,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-05-19T03:22:55.730Z","etag":null,"topics":["benchmarking","hive","llap","presto","spark","tpch"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dharmeshkakadia.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2016-09-11T02:39:28.000Z","updated_at":"2021-07-31T01:19:20.000Z","dependencies_parsed_at":"2024-01-13T11:13:11.134Z","dependency_job_id":"4479501d-9464-42e1-9633-98a35f4154cb","html_url":"https://github.com/dharmeshkakadia/tpch-hdinsight","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dharmeshkakadia%2Ftpch-hdinsight","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dharmeshkakadia%2Ftpch-hdinsight/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dharmeshkakadia%2Ftpch-hdinsight/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dharmeshkakadia%2Ftpch-hdinsight/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/h
osts/GitHub/owners/dharmeshkakadia","download_url":"https://codeload.github.com/dharmeshkakadia/tpch-hdinsight/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":225183752,"owners_count":17434169,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["benchmarking","hive","llap","presto","spark","tpch"],"created_at":"2024-08-04T00:00:47.397Z","updated_at":"2026-02-05T20:36:19.026Z","avatar_url":"https://github.com/dharmeshkakadia.png","language":"Python","funding_links":[],"categories":["Benchmarks"],"sub_categories":["Unsorted"],"readme":"# tpch-datagen-as-hive-query\nThis is a set of UDFs and queries that you can use with Hive to run TPCH datagen in parallel on a Hadoop cluster. You can deploy to Azure using:\n\u003ca href=\"https://portal.azure.com/#create/Microsoft.Template/uri/https%3A%2F%2Fraw.githubusercontent.com%2Fdharmeshkakadia%2Ftpch-datagen-as-hive-query%2Fmaster%2Fazure%2Fazuredeploy.json\" target=\"_blank\"\u003e\n    \u003cimg src=\"http://azuredeploy.net/deploybutton.png\"/\u003e\n\u003c/a\u003e\n\n\n## How to use with Hive CLI\n1. Clone this repo.\n\n    ```shell\n    git clone https://github.com/dharmeshkakadia/tpch-datagen-as-hive-query/ \u0026\u0026 cd tpch-datagen-as-hive-query\n    ```\n2. 
Run TPCHDataGen.hql with the settings.hql file and set the required config variables.\n    ```shell\n    hive -i settings.hql -f TPCHDataGen.hql -hiveconf SCALE=10 -hiveconf PARTS=10 -hiveconf LOCATION=/HiveTPCH/ -hiveconf TPCHBIN=resources \n    ```\n    Here, `SCALE` is the scale factor for TPCH, \n    `PARTS` is the number of tasks to use for datagen (parallelization), \n    `LOCATION` is the directory where the data will be stored on HDFS, \n    `TPCHBIN` is where the resources are found. You can specify specific settings in the settings.hql file.\n\n3. Now you can create tables on the generated data.\n    ```shell\n    hive -i settings.hql -f ddl/createAllExternalTables.hql -hiveconf LOCATION=/HiveTPCH/ -hiveconf DBNAME=tpch\n    ```\n    Generate ORC tables and analyze them:\n    ```shell\n    hive -i settings.hql -f ddl/createAllORCTables.hql -hiveconf ORCDBNAME=tpch_orc -hiveconf SOURCE=tpch \n    hive -i settings.hql -f ddl/analyze.hql -hiveconf ORCDBNAME=tpch_orc \n    ```\n\n4. Run the queries!\n    ```shell\n    hive -database tpch_orc -i settings.hql -f queries/tpch_query1.hql \n    ```\n\n## How to use with Beeline CLI\n1. Clone this repo.\n\n    ```shell\n    git clone https://github.com/dharmeshkakadia/tpch-datagen-as-hive-query/ \u0026\u0026 cd tpch-datagen-as-hive-query\n    ```\n2. Upload the resources to DFS.\n    ```shell\n    hdfs dfs -copyFromLocal resources /tmp\n    ```\n\n3. 
Run TPCHDataGen.hql with the settings.hql file and set the required config variables.\n    ```shell\n    beeline -u \"jdbc:hive2://`hostname -f`:10001/;transportMode=http\" -n \"\" -p \"\" -i settings.hql -f TPCHDataGen.hql -hiveconf SCALE=10 -hiveconf PARTS=10 -hiveconf LOCATION=/HiveTPCH/ -hiveconf TPCHBIN=`grep -A 1 \"fs.defaultFS\" /etc/hadoop/conf/core-site.xml | grep -o \"wasb[^\u003c]*\"`/tmp/resources \n    ```\n    Here, `SCALE` is the scale factor for TPCH, \n    `PARTS` is the number of tasks to use for datagen (parallelization), \n    `LOCATION` is the directory where the data will be stored on HDFS, \n    `TPCHBIN` is where the resources were uploaded in step 2. You can specify specific settings in the settings.hql file.\n\n4. Now you can create tables on the generated data.\n    ```shell\n    beeline -u \"jdbc:hive2://`hostname -f`:10001/;transportMode=http\" -n \"\" -p \"\" -i settings.hql -f ddl/createAllExternalTables.hql -hiveconf LOCATION=/HiveTPCH/ -hiveconf DBNAME=tpch\n    ```\n    Generate ORC tables and analyze them:\n    ```shell\n    beeline -u \"jdbc:hive2://`hostname -f`:10001/;transportMode=http\" -n \"\" -p \"\" -i settings.hql -f ddl/createAllORCTables.hql -hiveconf ORCDBNAME=tpch_orc -hiveconf SOURCE=tpch \n    beeline -u \"jdbc:hive2://`hostname -f`:10001/;transportMode=http\" -n \"\" -p \"\" -i settings.hql -f ddl/analyze.hql -hiveconf ORCDBNAME=tpch_orc \n    ```\n\n5. 
Run the queries!\n    ```shell\n    beeline -u \"jdbc:hive2://`hostname -f`:10001/tpch_orc;transportMode=http\" -n \"\" -p \"\" -i settings.hql -f queries/tpch_query1.hql \n    ```\n\nIf you want to run all the queries 10 times and measure the time each run takes, you can use the following command:\n\n    for f in queries/*.sql; do for i in {1..10} ; do STARTTIME=\"`date +%s`\";  beeline -u \"jdbc:hive2://`hostname -f`:10001/tpch_orc;transportMode=http\" -i settings.hql -f $f  \u003e $f.run_$i.out 2\u003e\u00261 ; ENDTIME=\"`date +%s`\"; echo \"$f,$i,$STARTTIME,$ENDTIME,$(($ENDTIME-$STARTTIME))\" \u003e\u003e times_orc.csv; done; done;\n\n## FAQ\n\n1. Does it work with scale factor 1?\n\n    No. The parallel data generation assumes that scale \u003e 1. If you are just starting out, I would suggest you start with 10 and then move to the standard higher scale factors (100, 1000, 10000, ...).\n\n2. Do I have to specify PARTS=SCALE?\n\n    Yes.\n\n3. How do I keep my session from getting killed by network errors during a long-running benchmark?\n    \n   Use byobu. Type `byobu` to start a new session, then run the command. The session will still be there when you come back, even if your network connection breaks. \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdharmeshkakadia%2Ftpch-hdinsight","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdharmeshkakadia%2Ftpch-hdinsight","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdharmeshkakadia%2Ftpch-hdinsight/lists"}