Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/burhanahmed1/twitter-sentiment-analysis-using-pyspark
This repository contains a project that demonstrates how to perform sentiment analysis on Twitter data using Apache Spark, including data preprocessing, feature engineering, model training, and evaluation.
- Host: GitHub
- URL: https://github.com/burhanahmed1/twitter-sentiment-analysis-using-pyspark
- Owner: burhanahmed1
- License: mit
- Created: 2024-07-09T03:53:23.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-07-09T04:48:43.000Z (6 months ago)
- Last Synced: 2024-09-24T09:02:01.276Z (3 months ago)
- Topics: apache-spark, batch-gradient-descent, kmeans-clustering, knearest-neighbor-classification, machine-learning, matplotlib, nltk-python, numpy, pandas, pyspark, python, schotastic-gradient-descent, seaborn, sentiment-analysis, textblob-sentiment-analysis
- Language: Jupyter Notebook
- Homepage:
- Size: 1.34 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Twitter Sentiment Analysis
This repository contains a project that performs sentiment analysis on Twitter data using Apache Spark.
## Contents
- `Sentiment_Analysis.ipynb`: Jupyter Notebook containing the code for the sentiment analysis.
- `Sentiment.csv`: The dataset file containing the Twitter data and sentiment labels.
## Project Overview
This project demonstrates how to use Apache Spark for sentiment analysis on Twitter data. The steps covered in the project include:
1. **Data Loading**: Reading the dataset into a Spark DataFrame.
2. **Data Cleaning**: Preprocessing the data by handling missing values and performing necessary transformations.
3. **Feature Engineering**: Extracting features from the text data for model training.
4. **Model Training**: Training a machine learning model to classify the sentiment of tweets.
5. **Evaluation**: Evaluating the model's performance using appropriate metrics.
## Getting Started
### Prerequisites
- Apache Spark
- Jupyter Notebook
- Python
- Required Python libraries: `pandas`, `numpy`, `nltk`, `pyspark`
### Installation
1. Clone the repository:
```bash
git clone https://github.com/burhanahmed1/Twitter-Sentiment-Analysis-Using-PySpark.git
cd Twitter-Sentiment-Analysis-Using-PySpark
pip install -r requirements.txt
```
2. Install the required Python libraries:
```bash
pip install pandas numpy nltk pyspark
```
3. Start Jupyter Notebook:
```bash
jupyter notebook
```
4. Open `Sentiment_Analysis.ipynb` in Jupyter Notebook and run the cells to execute the project.
### Results
The project shows how Apache Spark can be used for sentiment analysis on large datasets. The classifiers reach accuracies between `0.62` and `0.75`, depending on the model:
- The accuracy of the sentiment model using **Logistic Regression** is `0.62`
- Root Mean Squared Error (RMSE) and Explained Variance (R²) using **Linear Regression** are `0.7331581773635055` and `0.07966124788395001` respectively.
- The accuracy of the sentiment model using **Batch Gradient Descent** is `0.73`
- The accuracy of the sentiment model using **Stochastic Gradient Descent** is `0.75`
### Visualizations
Confusion matrices are used to evaluate the performance of the sentiment classification model, and scatter plots are used to visualize the distribution of, and relationships between, features in the dataset.
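As a small illustration of this kind of evaluation, a confusion matrix and the accuracy derived from it can be computed with NumPy as sketched here. The function names and the toy labels are hypothetical and are not taken from the notebook.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, labels):
    """Count (true, predicted) pairs; rows are true labels, columns predictions."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = np.zeros((len(labels), len(labels)), dtype=int)
    for true_label, predicted_label in zip(y_true, y_pred):
        matrix[index[true_label], index[predicted_label]] += 1
    return matrix


def accuracy(matrix):
    """Share of correct predictions: the diagonal sum over the total count."""
    return float(np.trace(matrix) / matrix.sum())


# Toy example with made-up labels (not data from Sentiment.csv).
labels = ["neg", "neu", "pos"]
y_true = ["pos", "neg", "neg", "pos", "neu"]
y_pred = ["pos", "neg", "pos", "pos", "neu"]

matrix = confusion_matrix(y_true, y_pred, labels)
print(matrix.tolist())   # [[1, 0, 1], [0, 1, 0], [0, 0, 2]]
print(accuracy(matrix))  # 0.8
```

Off-diagonal cells show which classes get confused with each other; in the toy data, one negative tweet is predicted as positive. The matrix could then be drawn as a heatmap, for example with `seaborn.heatmap(matrix, annot=True, fmt="d")`.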
## Contributing
Contributions are welcome! If you have any ideas, suggestions, or improvements, feel free to open an issue or submit a pull request.
## License
This project is licensed under the MIT License.
## Acknowledgments
Thanks to the open-source community for providing valuable tools and libraries.