https://github.com/wrayx/cs346_database_coursework
- Host: GitHub
- URL: https://github.com/wrayx/cs346_database_coursework
- Owner: wrayx
- Created: 2022-02-08T13:05:15.000Z (almost 4 years ago)
- Default Branch: main
- Last Pushed: 2022-03-23T04:05:19.000Z (almost 4 years ago)
- Last Synced: 2025-01-09T10:16:53.370Z (12 months ago)
- Language: Python
- Size: 8.05 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# CS346
## TODOs
Part 3 Java Map-Reduce:
- [x] Query 1.a
- [x] Query 1.b
- [x] Query 1.c
- [ ] Query 2 (Reduce Side Join)
Part 4 Hive Query:
- [x] Hive schema
- [x] Query 1.a
- [x] Query 1.b
- [x] Query 1.c
- [ ] Query 2 (Reduce Side Join)
Additionally:
- [ ] Comment software code
- [ ] EDA for csv files
- [ ] Performance analyses
Report:
- [ ] ..
## Tutorials
+ [MapReduce Program – Weather Data Analysis For Analyzing Hot And Cold Days](https://www.geeksforgeeks.org/mapreduce-program-weather-data-analysis-for-analyzing-hot-and-cold-days/?ref=lbp)
+ [Apache Hive - Language Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual)
+ [Importing Data from Files into Hive Tables](https://www.informit.com/articles/article.aspx?p=2756471&seqNum=4)
## Git
[Git cheat sheet](https://education.github.com/git-cheat-sheet-education.pdf)
## Hadoop Commands
```bash
# set the environment variables that Hadoop and Hive need
# (no longer required at every login: this command has been added to .bashrc)
source cs346env.sh
# start the hadoop service
$HADOOP_HOME/sbin/start-all.sh
# stop all services
$HADOOP_HOME/sbin/stop-all.sh
# check that all daemons are up and running
jps
# turning off safe mode
hdfs dfsadmin -safemode leave
# running query 1 java map reduce
bin/hadoop jar your_package.jar main K start_date end_date input_file output_directory
# hdfs listing files
hdfs dfs -ls input/1G/store
# print the first line of a .dat file stored in HDFS
hdfs dfs -cat input/1G/store/store.dat | head -n 1
# Compile java code
hadoop com.sun.tools.javac.Main WordCount.java
# Create jar file
jar cf wc.jar WordCount*.class
# run program
hadoop jar wc.jar WordCount input/wc output/wclabsheet
# view results
hdfs dfs -ls output/wclabsheet
hdfs dfs -cat output/wclabsheet/part-r-00000
# delete output
hdfs dfs -rm -r output/*
# query 2 mapreduce output:
#   ss_store_sk_4   475400665.40  9341467
#   ss_store_sk_10  476650853.94  9294113
#   ss_store_sk_11  000000000.00  9294113
#   ss_store_sk_5   000000000.00  9078805
#   ss_store_sk_6   000000000.00  9026222
#   ss_store_sk_7   479048569.12  8954883
#   ss_store_sk_3   000000000.00  7557959
#   ss_store_sk_8   479051954.37  6995995
#   ss_store_sk_9   000000000.00  6995995
#   ss_store_sk_2   477594514.78  5285950
#   ss_store_sk_1   475457349.02  5250760
#   ss_store_sk_12  000000000.00  5219562
```
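
The compile, package, run, and view steps above can be chained into one helper. This is a minimal sketch and not part of the coursework code: the function names `run_cmd` and `build_and_run` and the `DRY_RUN` variable are hypothetical, the jar name `<ClassName>.jar` is an assumption, and reading `part-r-00000` assumes the job writes a single reducer output file, as in the WordCount example.

```bash
# Hypothetical helper (not from the coursework repo).
# Print the command when DRY_RUN=1, otherwise execute it.
run_cmd() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# usage: build_and_run <ClassName> <input_dir> <output_dir>
# The body runs in a subshell so its variables do not leak into the caller.
build_and_run() (
    class="$1"; input="$2"; output="$3"
    jar="${class}.jar"   # assumed jar name, e.g. WordCount.jar

    # compile, package, clear only this job's old output, run, then view results
    run_cmd hadoop com.sun.tools.javac.Main "${class}.java" &&
    run_cmd jar cf "$jar" "${class}"*.class &&
    run_cmd hdfs dfs -rm -r -f "$output" &&
    run_cmd hadoop jar "$jar" "$class" "$input" "$output" &&
    run_cmd hdfs dfs -cat "${output}/part-r-00000"
)
```

Setting `DRY_RUN=1` before calling `build_and_run WordCount input/wc output/wclabsheet` prints the five commands without touching the cluster, which is a cheap way to check the paths first. Removing only the job's own output directory is a deliberate choice here, since `hdfs dfs -rm -r output/*` also deletes the results of every other job.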
## Hive commands
```bash
# if Hive hits the dynamic-partition limit, try setting these properties to higher values (inside Beeline):
SET hive.exec.max.dynamic.partitions=100000;
SET hive.exec.max.dynamic.partitions.pernode=100000;
# start Beeline client
$HIVE_HOME/bin/beeline -u jdbc:hive2://
# lab3 sql command
SELECT colour, MAX(height*width) AS area FROM rectangles10m GROUP BY colour;
# to obtain:
# +---------+-----------+
# | colour  | area      |
# +---------+-----------+
# | blue    | 99920007  |
# | green   | 99910008  |
# | yellow  | 99870012  |
# | red     | 99820077  |
# +---------+-----------+
```
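
A single Hive statement can also be run without opening the interactive Beeline prompt. The sketch below is an illustration, not part of the coursework code: the function name `run_hive_query` is hypothetical, and it assumes `HIVE_HOME` is already set (for example by `cs346env.sh`). It passes the two dynamic-partition properties shown above through Beeline's `--hiveconf` option and the statement through `-e`.

```bash
# Hypothetical helper (not from the coursework repo): runs one HiveQL
# statement through Beeline in embedded mode and then exits.
# usage: run_hive_query "<HiveQL statement>"
run_hive_query() {
    "$HIVE_HOME/bin/beeline" -u jdbc:hive2:// \
        --hiveconf hive.exec.max.dynamic.partitions=100000 \
        --hiveconf hive.exec.max.dynamic.partitions.pernode=100000 \
        -e "$1"
}

# Example (the lab 3 query from above):
# run_hive_query "SELECT colour, MAX(height*width) AS area FROM rectangles10m GROUP BY colour;"
```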