{"id":18758885,"url":"https://github.com/baldidon/big-data-project","last_synced_at":"2025-12-02T09:30:15.227Z","repository":{"id":136898698,"uuid":"428435807","full_name":"baldidon/Big-Data-Project","owner":"baldidon","description":"spark-based program for sentiment-analysis ","archived":false,"fork":false,"pushed_at":"2022-12-09T15:54:03.000Z","size":178676,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2024-12-29T03:29:39.865Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Java","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/baldidon.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-11-15T22:06:43.000Z","updated_at":"2024-06-24T06:59:15.000Z","dependencies_parsed_at":null,"dependency_job_id":"bdcf7bcf-8c21-4075-bd2a-8c8416445396","html_url":"https://github.com/baldidon/Big-Data-Project","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baldidon%2FBig-Data-Project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baldidon%2FBig-Data-Project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baldidon%2FBig-Data-Project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/baldidon%2FBig-Data-Project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/baldidon","download_url":"https://codeload.github.
com/baldidon/Big-Data-Project/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239650813,"owners_count":19674814,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-07T17:48:12.359Z","updated_at":"2025-12-02T09:30:15.140Z","avatar_url":"https://github.com/baldidon.png","language":"Java","readme":"# Big Data Project\n\nA Spark-based application for tweet *sentiment analysis*, running inside a Hadoop cluster\n\n---\n#### Table Of Contents\n1. [Project Goals](https://github.com/baldidon/Big-Data-Project#project-goals-)\n2. [Cluster Setup](https://github.com/baldidon/Big-Data-Project#cluster-setup)\n3. [Usage](https://github.com/baldidon/Big-Data-Project#usage)\n---\n\n## Project goals \u003ca name=\"Project goals\"/\u003e\nWith this project, we wanted to explore some of the most popular software for *Big Data management*.\nIn detail, we used **Apache Hadoop** to build up a 3-node cluster (**with HDFS as the file system**) and ran **Apache Spark** on top of it with **MLlib**, a Spark library for designing machine learning models.\nFor the task (*given a tweet/phrase, decide whether it is a positive or negative comment*), we chose the **Naive Bayes classifier**: a good trade-off between simplicity and performance. It relies on a simple (and, for documents, quite unrealistic) hypothesis: the features (in this case, words) of a sample (in this case, a text/tweet) are *independent random variables*.
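\n\nAs a sketch, in our own notation (not taken from the project code): under the independence hypothesis the likelihood of a tweet given a class factorizes into per-word terms, so the classifier simply picks the class with the highest product:\n\n```latex\n\\hat{c} = \\arg\\max_{c \\in \\{\\text{pos}, \\text{neg}\\}} P(c) \\prod_{i=1}^{n} P(w_i \\mid c)\n```\n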
Although words in a text may be correlated, this model provides good performance!\n\nAs dataset, the [Μαριος Μιχαηλιδης KazAnova dataset](https://www.kaggle.com/kazanova/sentiment140) was a perfect fit: tons of labeled tweets (**1.6 million tweets!!**) to justify the distributed approach, with high usability.\nBelow is a snippet of the dataset:\n\n| Target | ids | date | flag | user | text |\n| --- | --- | --- | --- | --- | --- |\n|0|\"1467810369\"|\"Mon Apr 06 22:19:45 PDT 2009\"|\"NO_QUERY\"|\"_TheSpecialOne_\"|\"@switchfoot http://twitpic.com/2y1zl - Awww, that's a bummer.  You shoulda got David Carr of Third Day to do it. D\"|\n|0|\"1467810672\"|\"Mon Apr 06 22:19:49 PDT 2009\"|\"NO_QUERY\"|\"scotthamilton\"|\"is upset that he can't update his Facebook by texting it... and might cry as a result  School today also. Blah!\"|\n\u003c!--\n|0|\"1467810917\"|\"Mon Apr 06 22:19:53 PDT 2009\"|\"NO_QUERY\"|\"mattycus\"|\"@Kenichan I dived many times for the ball. Managed to save 50%  The rest go out of bounds\"|\n|0|\"1467811184\"|\"Mon Apr 06 22:19:57 PDT 2009\"|\"NO_QUERY\"|\"ElleCTF\"|\"my whole body feels itchy and like its on fire \"|\n--\u003e\n\n* target: the sentiment of the tweet (0 = negative, 1 = positive)\n* ids: the id of the tweet (e.g. 2087)\n* date: the date of the tweet (e.g. Sat May 16 23:58:44 UTC 2009)\n* flag: used for lyx queries; useless here\n* user: the user that tweeted\n* text: the text of the tweet\n\nFor the application, only the *Target* and *Text* columns are needed.\n\nThe cluster consists of 3 virtual machines running Ubuntu 20.04.3: 1 **MasterNode** (running the NameNode, SecondaryNameNode and ResourceManager processes, and submitting the Spark application) and 2 **WorkerNodes** (running tasks plus the DataNode and NodeManager processes). 
All 3 machines run on the same local network (same subnet of private addresses), so they can communicate through the local network and not over the internet!\n\n\n## Cluster setup\n\n## Requirements\n- [Apache Spark 3.0.3](https://spark.apache.org/releases/spark-release-3-0-3.html)\n- [Apache Hadoop 3.2.2](https://hadoop.apache.org/docs/r3.2.2/)\n- [Apache MLlib](https://spark.apache.org/mllib/)\n- Java 8 (We know, it's weird to use Java for an ML task :-) )\n\n## Setup Hadoop cluster\nFirst of all, it's recommended to create a new user on the OS for building up the cluster (in our case we defined the 'hadoopuser' user).\nThe first step is setting up Secure Shell (SSH) on all machines, to allow the MasterNode and WorkerNodes passwordless access to each other.\n\nExecute these commands:\n```bash\nsudo apt install ssh\nsudo apt install pdsh\n# add the following line to ~/.bashrc:\nexport PDSH_RCMD_TYPE=ssh\nssh-keygen -t rsa -P \"\"\ncat ~/.ssh/id_rsa.pub \u003e\u003e ~/.ssh/authorized_keys\n```\nIn the end, if everything went well, ```ssh localhost``` works without asking for a password.\n\nNext, Java and Hadoop must be installed. 
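\nAs a sketch, assuming the versions listed in the requirements and the /usr/local layout used later in this guide (the mirror URL is an assumption, not from the original instructions), the installation could look like:\n\n```bash\n# Java 8 (Ubuntu 20.04 package)\nsudo apt install openjdk-8-jdk\n# download and unpack Hadoop 3.2.2, then move it to /usr/local\nwget https://archive.apache.org/dist/hadoop/common/hadoop-3.2.2/hadoop-3.2.2.tar.gz\ntar -xzf hadoop-3.2.2.tar.gz\nsudo mv hadoop-3.2.2 /usr/local/hadoop\n```\n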
Then, add the following variable to ```path-to-hadoop/etc/hadoop/hadoop-env.sh``` (it is recommended to move the Hadoop folder into the /usr/local directory):\n\n```bash\nexport JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64/\n```\n\nthen add the following config to ```/etc/environment```:\n```bash\nPATH=\"/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/usr/local/hadoop/bin:/usr/local/hadoop/sbin\"\nJAVA_HOME=\"/usr/lib/jvm/java-8-openjdk-amd64/jre\"\n```\n\n---\n### Network setup\nOn each machine, open the ```/etc/hosts``` file with sudo and insert the IP address and hostname of every machine in the cluster, like this:\n\n```bash\nsudo nano /etc/hosts\n```\n\n```bash\n# ip address of machines and their hostnames\nxxx.xxx.xxx.xxx MasterNode\nxxx.xxx.xxx.xxx WorkerNode\nxxx.xxx.xxx.xxx WorkerNode2\n```\n\nIn this snippet, we assumed that the machines' hostnames are MasterNode, WorkerNode and WorkerNode2. To change a hostname, edit the ```/etc/hostname``` file.\n\nAfter that, we need to distribute a *public key* to all nodes of the cluster for SSH access. 
On the master machine, first execute the command to *generate* the key, then the commands to *secure-copy* it to all machines:\n\n```bash\nssh-keygen -t rsa\nssh-copy-id hadoopuser@MasterNode\nssh-copy-id hadoopuser@WorkerNode\nssh-copy-id hadoopuser@WorkerNode2\n```\n**PAY ATTENTION: replace hadoopuser with the right user of each machine, and the values after \"@\" with the correct hostnames!**\n\n---\n### Configure HDFS\n\nOn the master node, open the file ```path-to-hadoop/etc/hadoop/core-site.xml``` and add the following configuration:\n\n```xml\n\u003cconfiguration\u003e\n   \u003cproperty\u003e\n      \u003cname\u003efs.defaultFS\u003c/name\u003e\n      \u003cvalue\u003ehdfs://MasterNode:9000\u003c/value\u003e\n      \u003c!-- change MasterNode to the correct hostname --\u003e\n   \u003c/property\u003e\n\u003c/configuration\u003e\n```\nStill on the master node, open ```path-to-hadoop/etc/hadoop/hdfs-site.xml``` and add:\n\n```xml\n\u003cconfiguration\u003e\n   \u003cproperty\u003e\n      \u003cname\u003edfs.namenode.name.dir\u003c/name\u003e\n      \u003cvalue\u003epath-to-hadoop/data/nameNode\u003c/value\u003e\n   \u003c/property\u003e\n   \u003cproperty\u003e\n      \u003cname\u003edfs.datanode.data.dir\u003c/name\u003e\n      \u003cvalue\u003epath-to-hadoop/data/dataNode\u003c/value\u003e\n   \u003c/property\u003e\n   \u003cproperty\u003e\n      \u003cname\u003edfs.replication\u003c/name\u003e\n      \u003cvalue\u003e2\u003c/value\u003e\n   \u003c/property\u003e\n\u003c/configuration\u003e\n```\n\nThen open ```path-to-hadoop/etc/hadoop/workers``` and add the hostnames of the workers:\n\n```text\nWorkerNode\nWorkerNode2\n```\nDo the same to define the hostname of the master node, inside ```path-to-hadoop/etc/hadoop/master```:\n\n```text\nMasterNode\n```\n\n\nWe need to copy these configs to the workers; execute:\n\n```\nscp path-to-hadoop/etc/hadoop/* WorkerNode:path-to-hadoop/etc/hadoop/\nscp path-to-hadoop/etc/hadoop/* WorkerNode2:path-to-hadoop/etc/hadoop/\n```\n\nAfter that, the next step is 
to format the HDFS filesystem. Run:\n\n```\nsource /etc/environment\nhdfs namenode -format\n```\nand then\n\n```\nexport HADOOP_HOME=\"path-to-hadoop\"\nexport HADOOP_COMMON_HOME=$HADOOP_HOME\nexport HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop\nexport HADOOP_HDFS_HOME=$HADOOP_HOME\nexport HADOOP_MAPRED_HOME=$HADOOP_HOME\nexport HADOOP_YARN_HOME=$HADOOP_HOME\n\n$HADOOP_HOME/sbin/start-dfs.sh\n```\n\nAfter this procedure, execute ```jps``` on the workers: if the **DataNode** process is present, everything went well.\nOpen ```http://master-ip:9870``` to open the HDFS web panel.\n\n---\n### Setup YARN\nOn the master, execute:\n\n```\nexport HADOOP_HOME=\"path-to-hadoop\"\nexport HADOOP_COMMON_HOME=$HADOOP_HOME\nexport HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop\nexport HADOOP_HDFS_HOME=$HADOOP_HOME\nexport HADOOP_MAPRED_HOME=$HADOOP_HOME\nexport HADOOP_YARN_HOME=$HADOOP_HOME\n```\n\nNext, open ```$HADOOP_HOME/etc/hadoop/yarn-site.xml``` on both worker nodes (not on the master) and paste between the \u003cconfiguration\u003e tags:\n```xml\n   \u003cproperty\u003e\n      \u003cname\u003eyarn.resourcemanager.hostname\u003c/name\u003e\n      \u003cvalue\u003eMasterNode\u003c/value\u003e\n   \u003c/property\u003e\n```\nFinally, launch YARN on the master with:\n```\n   start-yarn.sh\n```\nand then open ```http://master-ip:8088/cluster``` to see the YARN web panel!\n\n### Configure Apache Spark\n\nIn this case, considering that Spark will only be the \"execution engine\" above Hadoop, the setup is much simpler than the Hadoop one.\nFirst, download Spark, then add the following variable to ```~/.bashrc``` (open it with ```nano ~/.bashrc```):\n\n```bash\nexport PATH=$PATH:/absolute-path-from-root-to-spark-folder/spark/bin\n```\n\nand then execute, to refresh the configuration:\n```bash\nsource ~/.bashrc\n```\n\nThe last step is to configure a variable inside the ```spark-env.sh``` file, which defines the environment for Spark. 
Change directory to the *conf* folder inside your Spark installation, then:\n\n```bash\ncd /folder-to-Spark/conf/\ncp ./spark-env.sh.template ./spark-env.sh\n\nnano ./spark-env.sh\n```\n\nInside *spark-env.sh*, simply add an export referring to ```$HADOOP_HOME```:\n\n```bash\nexport HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop\n```\n\nDone! Setup finished!\n\n\n\n---\n\n## Usage\n\nFirst of all, open the bin folder of this repo: it contains two scripts and a jar file. The ```run-train.sh``` script submits to Hadoop the training application for the classifier model and for all the transformation models (*necessary to make the data usable by the classifier*). After training, the application stores the models in HDFS, so the training is executed only once in a \"lifetime\".\n\n**Note: more info about the workflow is available in the [*project's paper*](docs/Tesina%20progetto%20big%20data%20management.pdf) inside the docs folder**\n\nAfter that, the ```run-test.sh``` script submits the test application: a command-line interaction where the user can write sentences and retrieve a sentiment prediction (with a verbose explanation of the transformations applied to the input)!\n\n\n![example interaction](img/shot.png)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaldidon%2Fbig-data-project","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fbaldidon%2Fbig-data-project","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fbaldidon%2Fbig-data-project/lists"}