Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
TPCH benchmark for various engines
https://github.com/dharmeshkakadia/tpch-hdinsight
benchmarking hive llap presto spark tpch
Last synced: 3 months ago
TPCH benchmark for various engines
- Host: GitHub
- URL: https://github.com/dharmeshkakadia/tpch-hdinsight
- Owner: dharmeshkakadia
- Created: 2016-09-11T02:39:28.000Z (about 8 years ago)
- Default Branch: master
- Last Pushed: 2017-09-17T03:07:53.000Z (about 7 years ago)
- Last Synced: 2024-05-19T03:22:55.730Z (6 months ago)
- Topics: benchmarking, hive, llap, presto, spark, tpch
- Language: Python
- Homepage:
- Size: 114 KB
- Stars: 6
- Watchers: 4
- Forks: 8
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-hive - TPCH
README
# tpch-datagen-as-hive-query
These are a set of UDFs and queries that you can use with Hive to run the TPCH data generator in parallel on a Hadoop cluster. You can deploy to Azure using the deployment link in the original repository.
## How to use with Hive CLI
1. Clone this repo.
```shell
git clone https://github.com/dharmeshkakadia/tpch-datagen-as-hive-query/ && cd tpch-datagen-as-hive-query
```
2. Run TPCHDataGen.hql with the settings.hql file, setting the required config variables:
```shell
hive -i settings.hql -f TPCHDataGen.hql -hiveconf SCALE=10 -hiveconf PARTS=10 -hiveconf LOCATION=/HiveTPCH/ -hiveconf TPCHBIN=resources
```
Here, `SCALE` is the TPCH scale factor,
`PARTS` is the number of tasks to use for data generation (the parallelization factor),
`LOCATION` is the directory where the data will be stored on HDFS, and
`TPCHBIN` is where the resources are found. You can specify additional settings in the settings.hql file (an illustrative sketch follows the next code block).
3. Now you can create tables on the generated data.
```shell
hive -i settings.hql -f ddl/createAllExternalTables.hql -hiveconf LOCATION=/HiveTPCH/ -hiveconf DBNAME=tpch
```
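The settings.hql file passed via `-i` holds session-level Hive settings. Purely as an illustration of the format (these particular values are assumptions, not necessarily what the repository's settings.hql contains), such a file is just a list of `set` statements:
```shell
# Illustrative only -- the repository ships its own settings.hql.
cat <<'EOF' > example-settings.hql
set hive.execution.engine=tez;
set hive.exec.parallel=true;
EOF
```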
Generate ORC tables and analyze them:
```shell
hive -i settings.hql -f ddl/createAllORCTables.hql -hiveconf ORCDBNAME=tpch_orc -hiveconf SOURCE=tpch
hive -i settings.hql -f ddl/analyze.hql -hiveconf ORCDBNAME=tpch_orc
```
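ddl/analyze.hql gathers table statistics so Hive's cost-based optimizer can plan the TPCH queries well. As a rough sketch of the kind of statement it runs (the single table here is illustrative; the repo's script covers all tables):
```shell
# Hypothetical one-table equivalent of what ddl/analyze.hql does per table
hive -e "USE tpch_orc; ANALYZE TABLE lineitem COMPUTE STATISTICS;"
```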
4. Run the queries!
```shell
hive -database tpch_orc -i settings.hql -f queries/tpch_query1.hql
```
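To run the full query suite with the Hive CLI rather than one file at a time, a simple loop works (the `queries/tpch_query*.hql` glob is inferred from the single-query example above):
```shell
# Run every TPCH query file in sequence against the ORC database
for f in queries/tpch_query*.hql; do
  hive -database tpch_orc -i settings.hql -f "$f"
done
```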
## How to use with Beeline CLI
1. Clone this repo.
```shell
git clone https://github.com/dharmeshkakadia/tpch-datagen-as-hive-query/ && cd tpch-datagen-as-hive-query
```
2. Upload the resources to DFS.
```shell
hdfs dfs -copyFromLocal resources /tmp
```
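You can sanity-check the upload before moving on (this check is an addition, not one of the original steps):
```shell
# Confirm the resources directory now exists in DFS
hdfs dfs -ls /tmp/resources
```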
3. Run TPCHDataGen.hql with the settings.hql file, setting the required config variables:
```shell
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f TPCHDataGen.hql -hiveconf SCALE=10 -hiveconf PARTS=10 -hiveconf LOCATION=/HiveTPCH/ -hiveconf TPCHBIN=`grep -A 1 "fs.defaultFS" /etc/hadoop/conf/core-site.xml | grep -o "wasb[^<]*"`/tmp/resources
```
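The backtick-quoted `grep` in the command above extracts the cluster's default filesystem URI from core-site.xml so that `TPCHBIN` points at the resources uploaded in step 2. On an HDInsight cluster backed by Azure Blob Storage, the expansion looks roughly like this (the account and container names below are made up):
```shell
# Prints something like: wasb://mycontainer@myaccount.blob.core.windows.net
grep -A 1 "fs.defaultFS" /etc/hadoop/conf/core-site.xml | grep -o "wasb[^<]*"
# ...so TPCHBIN becomes wasb://mycontainer@myaccount.blob.core.windows.net/tmp/resources
```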
Here, `SCALE` is the TPCH scale factor,
`PARTS` is the number of tasks to use for data generation (the parallelization factor),
`LOCATION` is the directory where the data will be stored on HDFS, and
`TPCHBIN` is where the resources were uploaded in step 2. You can specify additional settings in the settings.hql file.
4. Now you can create tables on the generated data.
```shell
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/createAllExternalTables.hql -hiveconf LOCATION=/HiveTPCH/ -hiveconf DBNAME=tpch
```
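As a quick sanity check (an addition, not an original step), you can list the external tables that were just created:
```shell
# List the TPCH tables in the tpch database
beeline -u "jdbc:hive2://`hostname -f`:10001/tpch;transportMode=http" -n "" -p "" -e "SHOW TABLES;"
```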
Generate ORC tables and analyze them:
```shell
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/createAllORCTables.hql -hiveconf ORCDBNAME=tpch_orc -hiveconf SOURCE=tpch
beeline -u "jdbc:hive2://`hostname -f`:10001/;transportMode=http" -n "" -p "" -i settings.hql -f ddl/analyze.hql -hiveconf ORCDBNAME=tpch_orc
```
5. Run the queries!
```shell
beeline -u "jdbc:hive2://`hostname -f`:10001/tpch_orc;transportMode=http" -n "" -p "" -i settings.hql -f queries/tpch_query1.hql
```
If you want to run each query 10 times and measure how long every run takes, you can use the following command:
```shell
for f in queries/*.sql; do
  for i in {1..10}; do
    STARTTIME="`date +%s`"
    beeline -u "jdbc:hive2://`hostname -f`:10001/tpch_orc;transportMode=http" -i settings.hql -f $f > $f.run_$i.out 2>&1
    ENDTIME="`date +%s`"
    echo "$f,$i,$STARTTIME,$ENDTIME,$(($ENDTIME-$STARTTIME))" >> times_orc.csv
  done
done
```
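The resulting times_orc.csv has the columns query file, run number, start epoch, end epoch, and elapsed seconds. To average the elapsed times per query (a convenience added here, not part of the repo):
```shell
# Average column 5 (elapsed seconds) per query file (column 1)
awk -F, '{ sum[$1] += $5; n[$1]++ } END { for (q in sum) printf "%s,%.1f\n", q, sum[q]/n[q] }' times_orc.csv
```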
## FAQ
1. Does it work with scale factor 1?
No. The parallel data generation assumes that scale > 1. If you are just starting out, I would suggest starting with 10 and then moving to the standard higher scale factors (100, 1000, 10000, ...).
2. Do I have to specify PARTS=SCALE?
Yes.
3. How do I avoid my session getting killed by network errors during a long-running benchmark?
Use byobu. Typing `byobu` starts a new session; run the benchmark command inside it, and the session will still be there when you come back, even if your network connection drops. A minimal sketch of that workflow follows.
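Key bindings in this sketch are byobu defaults:
```shell
byobu   # start (or reattach to) a byobu session
# ...launch the benchmark loop inside the session...
# Press F6 to detach; jobs keep running.
# After reconnecting over SSH, run `byobu` again to reattach.
```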