https://github.com/valendrew/big-data-project


Host: GitHub
URL: https://github.com/valendrew/big-data-project
Owner: Valendrew
Created: 2024-01-07T21:13:28.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-01-21T12:03:25.000Z (over 2 years ago)
Last Synced: 2025-03-16T06:26:50.449Z (over 1 year ago)
Language: Jupyter Notebook
Size: 3.76 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# San Francisco Crime Classification - Big Data Project

## Description

This project predicts the **category of crimes** that occurred in the city of San Francisco, given the **time and location** of each report. The dataset contains nearly 12 years of crime reports from across all of San Francisco's neighborhoods.

The project is divided into two parts:
- The first one is the data analysis, where the data is cleaned and analyzed to extract useful information.
- The second one is the machine learning part, where the data is used to train a model that can predict the category of a crime given the time and the location.
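
As a rough illustration of how time and location could become model inputs, here is a minimal sketch. The timestamp format, the feature names, and the helper `time_location_features` are assumptions made for this example; they are not taken from the project's notebooks.

```python
from datetime import datetime


def time_location_features(timestamp: str, lon: float, lat: float) -> dict:
    """Turn one report's timestamp and coordinates into simple features.

    Assumes timestamps look like "YYYY-MM-DD HH:MM:SS" (hypothetical format).
    """
    ts = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {
        "year": ts.year,
        "month": ts.month,
        "hour": ts.hour,
        "day_of_week": ts.weekday(),      # Monday = 0 ... Sunday = 6
        "is_weekend": ts.weekday() >= 5,  # Saturday or Sunday
        "lon": lon,
        "lat": lat,
    }


# Made-up example report, used only to show the shape of the result
features = time_location_features("2015-05-13 23:53:00", -122.4258, 37.7745)
print(features)
```

On the cluster, the same kind of transformation would more likely be written with Spark DataFrame operations so that it runs in parallel across the worker nodes.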

The project is executed on a **Hadoop Cluster** with **Spark** and **Jupyter Notebook** installed. The cluster is created using **Vagrant** and **VirtualBox**.

This contribution is part of the **Text Mining-Big Data-Data Mining** course at the University of Bologna, in the Master's Degree in Artificial Intelligence.

## Structure

## Installation

### Vagrant

- Install [Vagrant](https://developer.hashicorp.com/vagrant/downloads)
- Install [Oracle VirtualBox](https://www.virtualbox.org/wiki/Downloads)
  - On Windows, first install the [Microsoft Visual C++ 2019 Redistributable](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170)
- Create a new virtual machine
  - `vagrant init box_name` to create a Vagrantfile; for example, `box_name` could be `ubuntu/jammy64`
  - `vagrant up` to start the virtual machine
  - `vagrant ssh` to SSH into the machine; in a multi-machine setup the machine name must be specified
  - `vagrant reload` to restart the machine and apply changes made to the Vagrantfile
  - `logout` (or `exit`) inside the SSH session to log out of the machine
  - `vagrant destroy` to stop and remove the machine
- Vagrant shares the host directory containing the Vagrantfile with the guest at `/vagrant`
  - Add a new [synced folder](https://developer.hashicorp.com/vagrant/docs/synced-folders/basic_usage)
- Add a provisioning script in the Vagrantfile
  - `config.vm.provision :shell, path: "filename.sh"`
- Forward a port from the host machine to the guest machine
  - `config.vm.network :forwarded_port, guest: guest_port, host: host_port`
  - Additional networking [documentation](https://developer.hashicorp.com/vagrant/docs/networking)

### Hadoop Cluster

- The master node maintains knowledge about the distributed file system, like the inode table on an ext3 filesystem, and schedules resource allocation; it hosts two daemons
  - The NameNode manages the distributed file system and knows where the stored data blocks are inside the cluster
  - The ResourceManager manages the YARN jobs and takes care of scheduling and executing processes on worker nodes
- Worker nodes store the actual data and provide processing power to run the jobs; they host two daemons
  - The DataNode manages the physical data stored on the node and reports to the NameNode
  - The NodeManager manages the execution of tasks on the node
- [HDFS commands](https://www.linode.com/docs/guides/how-to-install-and-set-up-hadoop-cluster/#run-and-monitor-hdfs)
  - Start HDFS: `start-dfs.sh`
    - On the master node: `NameNode`, `SecondaryNameNode`
    - On the worker nodes: `DataNode`
  - Stop HDFS: `stop-dfs.sh`
  - Format HDFS: `hdfs namenode -format`
    - Must be run only once, **before** starting HDFS for the first time
  - Create the user directory: `hdfs dfs -mkdir -p /user/vagrant`
    - The `vagrant` subdirectory must match the username
  - **Commands**:
    - Create a directory: `hdfs dfs -mkdir DIR`
    - Put a file: `hdfs dfs -put FILE DIR`
    - List the content of the directory: `hdfs dfs -ls DIR`
    - Output the content of the file: `hdfs dfs -cat DIR/FILE`
    - Status of the HDFS: `hdfs dfsadmin -report`
- [YARN commands](https://www.linode.com/docs/guides/how-to-install-and-set-up-hadoop-cluster/#run-yarn)
  - Start YARN: `start-yarn.sh`
    - On the master node: `ResourceManager`
    - On the worker nodes: `NodeManager`
  - Stop YARN: `stop-yarn.sh`
  - **Commands**:
    - Run a MapReduce job: `yarn jar JAR_FILE CLASS_NAME INPUT OUTPUT`
    - List the running nodes: `yarn node -list`
    - List the running applications: `yarn application -list`
- **MapReduce (`mapred`) commands**:
  - Start the job history server: `mapred --daemon start historyserver`
    - On the master node: `JobHistoryServer`
  - Stop the job history server: `mapred --daemon stop historyserver`
- **Test the cluster**:
```bash
hdfs dfs -mkdir -p /user/vagrant/books
hdfs dfs -put /vagrant/books /user/vagrant
yarn jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount "books/*" output
```
- **View results**: `hdfs dfs -cat output/*`
- **Web UI**:
  - HDFS Web UI: http://node-master:9870/
  - YARN Web UI: http://node-master:8088
  - MapReduce Web UI: http://node-master:19888
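
To make clearer what the `wordcount` test job computes, the following is a minimal single-process sketch of the map, shuffle, and reduce steps in plain Python. It is an illustration only, not Hadoop's implementation: the function names are made up for this example, and splitting on whitespace is an assumption of the sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every whitespace-separated token."""
    for line in lines:
        for word in line.split():
            yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for word, count in pairs:
        grouped[word].append(count)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the grouped counts to get one total per word."""
    return {word: sum(counts) for word, counts in grouped.items()}


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    return reduce_phase(shuffle(map_phase(lines)))


# Made-up input lines standing in for the contents of the books directory
counts = word_count(["the quick brown fox", "the lazy dog", "the end"])
print(counts)
```

On the cluster, the map and reduce steps run as separate tasks on the worker nodes, and the per-word totals are written to the `output` directory in HDFS.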

### Spark

- Create the spark-logs folder in the HDFS: `hdfs dfs -mkdir /spark-logs`
- Configuration file: `spark/conf/spark-defaults.conf`
- **History server**:
  - Start: `spark/sbin/start-history-server.sh`
    - On the master node: `HistoryServer`
  - Stop: `spark/sbin/stop-history-server.sh`
- **Test the cluster**:
```bash
spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 10
```
- `--master yarn` can be omitted since `spark.master` is set in the configuration file
- `--executor-memory` can be omitted since `spark.executor.memory` is set in the configuration file
- `--num-executors` can be omitted since `spark.executor.instances` is set in the configuration file
- **Web UI**:
  - Spark Web UI: http://node-master:4040
  - History Server Web UI: http://node-master:18080
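
The `SparkPi` test job estimates π by Monte Carlo sampling, and its trailing argument (`10` above) is the number of slices the work is split into. The sketch below reproduces the idea in plain Python, without Spark, to show what a successful run should roughly print. The function names, the fixed seed, and the sample count per slice are assumptions chosen for this example.

```python
import random


def count_hits(samples: int, rng: random.Random) -> int:
    """Count random points in the square [-1, 1] x [-1, 1] that fall inside the unit circle."""
    hits = 0
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(slices: int = 10, samples_per_slice: int = 10_000, seed: int = 42) -> float:
    """Sum the hits of every slice; the hit ratio approximates pi / 4."""
    rng = random.Random(seed)
    hits = sum(count_hits(samples_per_slice, rng) for _ in range(slices))
    return 4.0 * hits / (slices * samples_per_slice)


pi_estimate = estimate_pi()
print(f"Pi is roughly {pi_estimate}")
```

In the Spark version each slice is evaluated as a separate task on the executors and the partial counts are combined on the driver, so a value close to 3.14 in the job output suggests the cluster is scheduling work correctly.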

### Python

- Run Jupyter Notebook: `jupyter notebook --ip=0.0.0.0 --no-browser`
- Virtual environment:
  - Create the environment: `python3 -m venv pyspark_venv`
  - Activate the environment: `source pyspark_venv/bin/activate`
  - Pack the environment with venv-pack: `venv-pack -o pyspark_venv.tar.gz`
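- **Run the PySpark shell or submit an application**:
```bash
# Start an interactive PySpark shell and ship the packed environment with it
pyspark --archives pyspark_venv.tar.gz#pyspark_venv
# Submit a Python application (here app.py) in client mode
$SPARK_HOME/bin/spark-submit --deploy-mode client app.py
```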

### Misc

- Command to view the running Java processes, such as the Hadoop and Spark daemons: `jps`
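
Checking `jps` output by eye on every node gets tedious, so a small helper along these lines could compare it against the daemons listed in the sections above. This is a sketch: the role names `master` and `worker`, the function names, and the sample output (including its process IDs) are invented for illustration.

```python
from typing import Set

# Daemons expected per node role, taken from the Hadoop and Spark sections above
EXPECTED_DAEMONS = {
    "master": {
        "NameNode",
        "SecondaryNameNode",
        "ResourceManager",
        "JobHistoryServer",
        "HistoryServer",
    },
    "worker": {"DataNode", "NodeManager"},
}


def running_daemons(jps_output: str) -> Set[str]:
    """Parse lines of the form '<pid> <name>' and return the set of names."""
    names = set()
    for line in jps_output.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            names.add(parts[1])
    return names


def missing_daemons(jps_output: str, role: str) -> Set[str]:
    """Return the expected daemons for the role that do not appear in the output."""
    return EXPECTED_DAEMONS[role] - running_daemons(jps_output)


# Invented sample output; the process IDs are arbitrary
sample = """2145 NameNode
2367 SecondaryNameNode
2610 ResourceManager
3011 Jps
"""
missing = missing_daemons(sample, "master")
print(sorted(missing))
```

On a real node, the text could come from `subprocess.run(["jps"], capture_output=True, text=True).stdout` instead of the sample string.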
