https://github.com/wrayx/cs346_database_coursework
https://github.com/wrayx/cs346_database_coursework
Last synced: 4 months ago
JSON representation
- Host: GitHub
- URL: https://github.com/wrayx/cs346_database_coursework
- Owner: wrayx
- Created: 2022-02-08T13:05:15.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2022-03-23T04:05:19.000Z (about 3 years ago)
- Last Synced: 2025-01-09T10:16:53.370Z (5 months ago)
- Language: Python
- Size: 8.05 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# CS346
## TODOs
Part 3 Java Map-Reduce:
- [x] Query 1.a
- [x] Query 1.b
- [x] Query 1.c
- [ ] Query 2 (Reduce Side Join)Part 4 Hive Query:
- [x] Hive schema
- [x] Query 1.a
- [x] Query 1.b
- [x] Query 1.c
- [ ] Query 2 (Reduce Side Join)Additionally:
- [ ] Comment software code
- [ ] EDA for csv files
- [ ] Performance analysesReport:
- [ ] ..## Tutorials
+ [MapReduce Program – Weather Data Analysis For Analyzing Hot And Cold Days
](https://www.geeksforgeeks.org/mapreduce-program-weather-data-analysis-for-analyzing-hot-and-cold-days/?ref=lbp)
+ [Apache Hive - Language Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
+ [Importing Data from Files into Hive Tables](https://www.informit.com/articles/article.aspx?p=2756471&seqNum=4)## Git
[Git cheat sheet](https://education.github.com/git-cheat-sheet-education.pdf)## Hadoop Commands
```bash
# every time you log in, you must run the following in order to set some important environment variables:
source cs346env.sh
# no longer required, command has been added to .bashrc# start the hadoop service
$HADOOP_HOME/sbin/start-all.sh# stop all services
$HADOOP_HOME/sbin/stop-all.sh# check that all daemons are up and running
jps# turning off safe mode
hdfs dfsadmin -safemode leave# running query 1 java map reduce
bin/hadoop jar your_package.jar main K start_date end_date input_file output_directory# hdfs listing files
hdfs dfs -ls input/1G/store# hdfs cat dat file first line
hdfs dfs -cat input/1G/store/store.dat>&1 | head -n 1# Compile java code
hadoop com.sun.tools.javac.Main WordCount.java# Create jar file
jar cf wc.jar WordCount*.class# run program
hadoop jar wc.jar WordCount input/wc output/wclabsheet# view results
hdfs dfs -ls output/wclabsheet
hdfs dfs -cat output/wclabsheet/part-r-00000# delete output
hdfs dfs -rm -r output/*# query 2 mapreduce output
ss_store_sk_4 475400665.40 9341467
ss_store_sk_10 476650853.94 9294113
ss_store_sk_11 000000000.00 9294113
ss_store_sk_5 000000000.00 9078805
ss_store_sk_6 000000000.00 9026222
ss_store_sk_7 479048569.12 8954883
ss_store_sk_3 000000000.00 7557959
ss_store_sk_8 479051954.37 6995995
ss_store_sk_9 000000000.00 6995995
ss_store_sk_2 477594514.78 5285950
ss_store_sk_1 475457349.02 5250760
ss_store_sk_12 000000000.00 5219562```
## Hive commands
```bashTry setting those properties to higher values.
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;# start Beeline client
$HIVE_HOME/bin/beeline -u jdbc:hive2://# lab3 sql command
SELECT colour, MAX(height*width) AS area FROM rectangles10m GROUP BY colour;
# to obtain:
+---------+-----------+
| colour | area |
+---------+-----------+
| blue | 99920007 |
| green | 99910008 |
| yellow | 99870012 |
| red | 99820077 |
+---------+-----------+
```