{"id":21359307,"url":"https://github.com/valendrew/big-data-project","last_synced_at":"2025-08-28T06:05:43.727Z","repository":{"id":218361755,"uuid":"740216131","full_name":"Valendrew/big-data-project","owner":"Valendrew","description":null,"archived":false,"fork":false,"pushed_at":"2024-01-21T12:03:25.000Z","size":3946,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-16T06:26:50.449Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Valendrew.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-01-07T21:13:28.000Z","updated_at":"2024-01-21T12:05:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"789fbb75-8b52-4324-a5bc-15f8cd7635d9","html_url":"https://github.com/Valendrew/big-data-project","commit_stats":null,"previous_names":["valendrew/big-data-project"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Valendrew/big-data-project","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Valendrew%2Fbig-data-project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Valendrew%2Fbig-data-project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Valendrew%2Fbig-data-project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Valendrew%2Fbig-data-project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/ow
ners/Valendrew","download_url":"https://codeload.github.com/Valendrew/big-data-project/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Valendrew%2Fbig-data-project/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272451984,"owners_count":24937463,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-28T02:00:10.768Z","response_time":74,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-22T05:27:39.493Z","updated_at":"2025-08-28T06:05:43.710Z","avatar_url":"https://github.com/Valendrew.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# San Francisco Crime Classification - Big Data Project\n\n## Description\n\nThe goal of this project is to predict the **category of crimes** that occurred in the city of San Francisco. The dataset contains nearly 12 years of crime reports from across all of San Francisco's neighborhoods. 
Given **time and location**, the goal is to predict the category of the crime.\n\nThe project is divided into two parts:\n- The first one is the data analysis, where the data is cleaned and analyzed to extract useful information.\n- The second one is the machine learning part, where the data is used to train a model that can predict the category of a crime given the time and the location.\n\nThe project is executed on a **Hadoop Cluster** with **Spark** and **Jupyter Notebook** installed. The cluster is created using **Vagrant** and **VirtualBox**.\n\nThis contribution is part of the **Text Mining-Big Data-Data Mining** course at the University of Bologna, in the Master's Degree in Artificial Intelligence.\n\n## Structure\n\n```\n./\n|\n├── cluster/\n|   │\n|   ├── conf/\n|   │   └── configuration files for Hadoop and Spark\n|   │\n|   ├── data/\n|   │   └── dataset for the project\n|   │\n|   ├── downloads/\n|   │   └── downloaded files from the internet\n|   │\n|   ├── scripts/\n|   │   └── provisioning scripts for master and worker nodes\n|   │\n|   ├── utils/\n|   │   └── utility scripts for running services (HDFS, YARN, Spark)\n|   │\n|   └── VagrantFile\n|        └── configuration for the cluster\n|\n├── project.ipynb\n|    └── Jupyter Notebook for the project\n|\n├── README.md\n|    └── Explanation of the project\n|\n└── presentation.pdf\n     └── Presentation of the project\n```\n\n## Installation\n\n### Vagrant\n\n- Install [Vagrant](https://developer.hashicorp.com/vagrant/downloads)\n- Install [Oracle VirtualBox](https://www.virtualbox.org/wiki/Downloads)\n  - On Windows, first install the [Microsoft Visual C++ 2019 Redistributable](https://learn.microsoft.com/en-us/cpp/windows/latest-supported-vc-redist?view=msvc-170)\n- Create a new virtual machine\n  - `vagrant init box_name`, where `box_name` could be, for example, `ubuntu/jammy64`\n  - `vagrant up` to run the virtual machine\n  - `vagrant ssh` to SSH into the machine; in a multi-machine setup the machine name 
must be specified\n  - `vagrant reload` to reload the changes in the VagrantFile\n  - `logout` (or `exit`) inside the SSH session to log out from the machine\n  - `vagrant destroy` to stop and remove the machine\n- Vagrant shares the host directory containing the Vagrantfile with the guest at `/vagrant`\n  - Add a new [synced folder](https://developer.hashicorp.com/vagrant/docs/synced-folders/basic_usage)\n- Add a provisioning script in VagrantFile\n  - `config.vm.provision :shell, path: \"filename.sh\"`\n- Forward a port from the host machine to the guest machine\n  - `config.vm.network :forwarded_port, guest: guest_port, host: host_port`\n  - Additional networking [documentation](https://developer.hashicorp.com/vagrant/docs/networking)\n\n### Hadoop Cluster\n\n- The master node maintains knowledge about the distributed file system, like the inode table on an ext3 filesystem, and schedules resource allocation; it hosts two daemons:\n  - The NameNode manages the distributed file system and knows where data blocks are stored inside the cluster\n  - The ResourceManager manages the YARN jobs and takes care of scheduling and executing processes on worker nodes\n- Worker nodes store the actual data and provide processing power to run the jobs; they host two daemons:\n  - The DataNode manages the physical data stored on the node and reports to the NameNode\n  - The NodeManager manages execution of tasks on the node\n- [HDFS commands](https://www.linode.com/docs/guides/how-to-install-and-set-up-hadoop-cluster/#run-and-monitor-hdfs)\n  - Start HDFS: `start-dfs.sh`\n    - In master node: `NameNode, SecondaryNameNode`\n    - In worker node: `DataNode`\n  - Stop HDFS: `stop-dfs.sh`\n  - Format HDFS: `hdfs namenode -format`\n    - Must be run only once, **before** starting HDFS for the first time\n  - Create user directory: `hdfs dfs -mkdir -p /user/vagrant`\n    - The `vagrant` subdirectory must match the username\n  - **Commands**: \n    - Create a directory: `hdfs dfs -mkdir DIR`\n    - Put a file: 
`hdfs dfs -put FILE DIR`\n    - List the content of the directory: `hdfs dfs -ls DIR`\n    - Output the content of the file: `hdfs dfs -cat DIR/FILE`\n    - Status of the HDFS: `hdfs dfsadmin -report`\n- [YARN commands](https://www.linode.com/docs/guides/how-to-install-and-set-up-hadoop-cluster/#run-yarn)\n  - Run YARN: `start-yarn.sh`\n    - In master node: `ResourceManager`\n    - In worker node: `NodeManager`\n  - Stop YARN: `stop-yarn.sh`\n  - **Commands**: \n    - Run a MapReduce job: `yarn jar JAR_FILE CLASS_NAME INPUT OUTPUT`\n    - List the running nodes: `yarn node -list`\n    - List the running applications: `yarn application -list`\n- **MapReduce JobHistoryServer commands**:\n  - Start the history server: `mapred --daemon start historyserver`\n    - In master node: `JobHistoryServer`\n  - Stop the history server: `mapred --daemon stop historyserver`\n- **Test the cluster**: \n  ```bash\n  hdfs dfs -mkdir -p /user/vagrant/books\n  hdfs dfs -put /vagrant/books /user/vagrant\n  yarn jar hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-3.3.6.jar wordcount \"books/*\" output\n  ```\n  - **View results**: `hdfs dfs -cat output/*`\n- **Web UI**: \n  - HDFS Web UI: http://node-master:9870/\n  - YARN Web UI: http://node-master:8088\n  - MapReduce Web UI: http://node-master:19888\n\n### Spark\n\n- Create the spark-logs folder in the HDFS: `hdfs dfs -mkdir /spark-logs`\n- Configuration file: `spark/conf/spark-defaults.conf`\n- **History server**:\n  - Run: `spark/sbin/start-history-server.sh`\n    - In master node: `HistoryServer`\n  - Stop: `spark/sbin/stop-history-server.sh`\n- **Test the cluster**: \n  ```bash\n  spark-submit --deploy-mode client --class org.apache.spark.examples.SparkPi $SPARK_HOME/examples/jars/spark-examples_2.11-2.2.0.jar 10\n  ```\n    - `--master yarn` can be omitted since `spark.master` is set in the configuration file\n    - `--executor-memory` can be omitted since `spark.executor.memory` is set in the configuration file\n    - `--num-executors` can be omitted since 
`spark.executor.instances` is set in the configuration file\n- **Web UI**: \n  - Spark Web UI: http://node-master:4040\n  - History Server Web UI: http://node-master:18080\n\n### Python\n\n- Run Jupyter Notebook: `jupyter notebook --ip=0.0.0.0 --no-browser`\n- Virtual environment:\n  - Create environment: `python3 -m venv pyspark_venv`\n  - Activate environment: `source pyspark_venv/bin/activate`\n  - Pack the environment with venv-pack: `venv-pack -o pyspark_venv.tar.gz`\n- **Run the PySpark shell or submit an application**: \n  ```bash\n  pyspark --archives pyspark_venv.tar.gz#pyspark_venv\n  $SPARK_HOME/bin/spark-submit --deploy-mode client app.py\n  ```\n\n### Misc\n\n- Command to view Java processes: `jps`","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvalendrew%2Fbig-data-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvalendrew%2Fbig-data-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvalendrew%2Fbig-data-project/lists"}