https://github.com/mirzaim/hadoop-twitter-analysis
Hadoop MapReduce analysis of US Election 2020 Tweets.
https://github.com/mirzaim/hadoop-twitter-analysis
hadoop hdfs map-reduce tweet-analysis us-election-2020
Last synced: 4 months ago
JSON representation
Hadoop MapReduce analysis of US Election 2020 Tweets.
- Host: GitHub
- URL: https://github.com/mirzaim/hadoop-twitter-analysis
- Owner: mirzaim
- License: mit
- Created: 2024-09-15T23:08:30.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-09-15T23:08:32.000Z (9 months ago)
- Last Synced: 2025-01-09T05:18:43.290Z (5 months ago)
- Topics: hadoop, hdfs, map-reduce, tweet-analysis, us-election-2020
- Language: Java
- Homepage:
- Size: 3.91 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
README
# Hadoop Twitter Analysis
This project uses Hadoop and MapReduce to analyze the "US Election 2020 Tweets" dataset. It demonstrates the power of distributed computing for processing and analyzing big data related to the 2020 US presidential election.
## Dataset
The analysis is based on the "US Election 2020 Tweets" dataset, which can be found on Kaggle:
[US Election 2020 Tweets](https://www.kaggle.com/datasets/manchunhui/us-election-2020-tweets)## Project Overview
The project consists of three main MapReduce jobs:
1. **Tweet Like and Retweet Counter**: Calculates the total number of likes and retweets for tweets mentioning Joe Biden and Donald Trump.
2. **Tweet Country Location Analyzer**: Analyzes the percentage of tweets from various countries mentioning each candidate.
3. **Tweet Geo-Location Analyzer**: Uses geographical coordinates to determine tweet origins, focusing on tweets from the USA and France.
## Setup
### HDFS Setup
1. Create the necessary HDFS directories:
```bash
hdfs dfs -mkdir -p /user/hadoop/input
```2. Upload the input files to HDFS:
```bash
hdfs dfs -put input/*.csv /user/hadoop/input
```## Running the MapReduce Jobs
### 1. Tweet Like and Retweet Counter
```bash
hadoop jar hadoop-twitter-analysis.jar TweetCounterDriver /user/hadoop/input /user/hadoop/output/like_retweet_counts
```### 2. Tweet Country Location Analyzer
```bash
hadoop jar hadoop-twitter-analysis.jar TweetCountryLocationDriver /user/hadoop/input /user/hadoop/output/country_analysis
```### 3. Tweet Geo-Location Analyzer
```bash
hadoop jar hadoop-twitter-analysis.jar TweetGeoLocationDriver /user/hadoop/input /user/hadoop/output/geo_location_analysis
```## Output
The output of each job will be stored in the respective output directory in HDFS. You can view the results using:
```bash
hdfs dfs -cat /user/hadoop/output//*
```