Sentiment Analysis using the DistilBERT Transformer from Hugging Face, with Airflow integrated for an end-to-end pipeline
- Host: GitHub
- URL: https://github.com/sathviknayak123/sentiment-anyalysis
- Owner: SathvikNayak123
- License: MIT
- Created: 2024-09-24T13:10:01.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2024-12-26T09:11:30.000Z (10 months ago)
- Last Synced: 2025-02-13T02:47:45.469Z (9 months ago)
- Topics: airflow, astronomer, distilbert, flask, huggingface-transformer, nlp, python, s3-bucket, selenium-webdriver, sentiment-analysis, tensorflow, web-scraping-python
- Language: Jupyter Notebook
- Homepage:
- Size: 17.7 MB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Metadata Files:
  - Readme: README.md
  - License: LICENSE
README
# Sentiment Analysis with DistilBERT
## Project Overview  
This project builds a sentiment analysis model that predicts the sentiment of customer reviews using **DistilBERT**. The pipeline scrapes reviews, preprocesses the data, fine-tunes DistilBERT, and serves predictions through a Flask application. The workflow is automated with **Apache Airflow**.
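
For orientation, here is a minimal sketch of how the scrape → preprocess → train stages can be wired together as an Airflow DAG. The task names, callables, and S3 keys are illustrative placeholders, not the repository's actual modules, and the decorator arguments assume a recent Airflow 2.x release (as shipped with Astronomer).

```python
# dags/sentiment_pipeline.py -- illustrative DAG; task bodies are placeholders,
# not the repository's actual modules.
from datetime import datetime

from airflow.decorators import dag, task


@dag(schedule=None, start_date=datetime(2024, 9, 1), catchup=False, tags=["sentiment"])
def sentiment_pipeline():

    @task
    def scrape_reviews() -> str:
        # Scrape Amazon reviews with Selenium and upload the raw CSV to S3.
        return "s3://<bucket>/raw/reviews.csv"  # hypothetical key

    @task
    def preprocess(raw_key: str) -> str:
        # Clean the text, encode ratings into sentiment labels, write back to S3.
        return "s3://<bucket>/processed/reviews.csv"

    @task
    def train(processed_key: str) -> str:
        # Tokenize with the DistilBERT tokenizer and fine-tune the classifier.
        return "artifacts/model"

    train(preprocess(scrape_reviews()))


sentiment_pipeline()
```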
---
## What is DistilBERT?
- DistilBERT is a smaller, faster, and more efficient version of BERT. It was developed through a process called knowledge distillation, where a smaller model (the student) is trained to reproduce the behavior of a larger model (the teacher) while retaining most of its performance. 
- DistilBERT retains 97% of BERT's performance on various NLP tasks while being 40% smaller and 60% faster.
  
- BERT consists of 12 transformer layers, an embedding layer and a prediction layer.
- During distillation, DistilBERT learns from BERT by mimicking its outputs (logits) and intermediate representations.
- As a result, DistilBERT has 6 transformer layers (half of BERT's 12) while maintaining similar functionality.
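
To make the "mimicking its outputs (logits)" idea concrete, below is a small, illustrative TensorFlow sketch of the soft-target (distillation) loss: the student's temperature-softened predictions are pushed toward the teacher's. In the DistilBERT paper this term is combined with the usual masked-language-modelling loss and a cosine embedding loss; this snippet is background illustration, not code from this repository.

```python
import tensorflow as tf


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    t_probs = tf.nn.softmax(teacher_logits / temperature, axis=-1)
    s_log_probs = tf.nn.log_softmax(student_logits / temperature, axis=-1)
    kl = tf.reduce_sum(t_probs * (tf.math.log(t_probs + 1e-9) - s_log_probs), axis=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return tf.reduce_mean(kl) * temperature ** 2
```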
  
---
## Features  
- **Data Collection**:  
  - Scraped 20,000+ Amazon reviews using **Selenium WebDriver**.  
  - Stored the collected data securely in an **AWS S3 bucket**.  
- **Data Preprocessing**:  
  - Cleaned and processed reviews and ratings (see the preprocessing sketch after this list):  
    - Removed non-English reviews.  
    - Removed stopwords, special characters, and extra spaces.  
    - Performed lemmatization and stemming to normalize the text.  
    - Encoded ratings from 0-5 stars into labels: negative (0), neutral (1), and positive (2).  
  - Stored the processed data in **S3** for further use.  
- **Data Tokenization and Preparation**:  
  - Tokenized reviews using **DistilBERT tokenizer**.  
  - Divided the dataset into **training**, **validation**, and **testing** sets.
- **Model Training**:  
  - Imported **DistilBERT** from **Hugging Face Transformers**.  
  - Fine-tuned the model on the dataset.  
  - Implemented **early stopping** to optimize training.  
  - Used **custom class weights** to handle class imbalance in the training dataset (see the fine-tuning sketch below this list).
- **Deployment**:  
  - Built a user-friendly **Flask application** for making predictions.
    
- **Workflow Automation**:  
  - Integrated **Apache Airflow** to automate the entire pipeline. 
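
As a rough illustration of the preprocessing bullets above, the sketch below uses NLTK for stopword removal and lemmatization plus a simple language check. The helper names, the rating thresholds, and the use of `langdetect` are assumptions made for illustration, not necessarily what the repository does.

```python
import re

import nltk
from langdetect import detect
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords")
nltk.download("wordnet")

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()


def is_english(text: str) -> bool:
    """Keep only reviews detected as English."""
    try:
        return detect(text) == "en"
    except Exception:  # langdetect raises on empty or ambiguous text
        return False


def clean_review(text: str) -> str:
    """Lowercase, strip special characters, drop stopwords, lemmatize."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = [LEMMATIZER.lemmatize(tok) for tok in text.split() if tok not in STOP_WORDS]
    return " ".join(tokens)  # joining also collapses extra whitespace


def encode_rating(stars: int) -> int:
    """Map 0-5 star ratings to negative (0) / neutral (1) / positive (2).

    The exact thresholds are illustrative.
    """
    if stars <= 2:
        return 0
    return 1 if stars == 3 else 2
```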
    
---
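A condensed, illustrative sketch of the tokenization and fine-tuning steps with class weights and early stopping follows. `train_texts` and `train_labels` are placeholders for the preprocessed reviews and their 0/1/2 labels, and the exact Keras/`transformers` calls may differ from the repository and across library versions.

```python
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight
from transformers import AutoTokenizer, TFAutoModelForSequenceClassification

# train_texts / train_labels: placeholders for cleaned reviews and 0/1/2 labels.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = TFAutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=3)

enc = tokenizer(list(train_texts), truncation=True, padding=True,
                max_length=128, return_tensors="np")

# "Balanced" (inverse-frequency) weights upweight the rarer classes so the loss
# is not dominated by positive reviews.
weights = compute_class_weight("balanced", classes=np.array([0, 1, 2]), y=train_labels)
class_weight = dict(enumerate(weights))

model.compile(optimizer=tf.keras.optimizers.Adam(2e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=2, restore_best_weights=True)

model.fit(dict(enc), np.array(train_labels), validation_split=0.1,
          epochs=5, batch_size=16, class_weight=class_weight, callbacks=[early_stop])
```
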
## Result
- **Accuracy**: Achieved an accuracy of 85% on the test dataset.
- **F1-score**: Achieved an F1-score of 0.9 for the positive class and 0.7 for the other classes on the test dataset.
- Enhanced multi-class classification performance with class-weighted training, improving the Precision-Recall AUC for minority classes (e.g., 'neutral') by over 30% and yielding more balanced predictions across all categories (see the evaluation snippet below).
  - Comparison plots (before vs. after class-weighted training) are included in the repository.
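
These metrics can be checked with scikit-learn along the following lines; `y_test` and `probs` stand in for the test labels and the model's softmax outputs and are not taken from the repository.

```python
import numpy as np
from sklearn.metrics import average_precision_score, classification_report

# y_test: gold labels (0/1/2) as a NumPy array; probs: (n_samples, 3) softmax outputs.
print(classification_report(y_test, probs.argmax(axis=-1),
                            target_names=["negative", "neutral", "positive"]))

# One-vs-rest Precision-Recall AUC (average precision) per class.
for cls, name in enumerate(["negative", "neutral", "positive"]):
    ap = average_precision_score((y_test == cls).astype(int), probs[:, cls])
    print(f"{name}: PR-AUC = {ap:.3f}")
```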
    
## Getting Started  
To get started with this project, follow these steps:
1. Clone the repository:  
   ```bash
   git clone https://github.com/SathvikNayak123/sentiment-anyalysis
   ```
2. Install the necessary dependencies:
    ```bash
    pip install -r requirements.txt
    ```
3. Add AWS S3 credentials in a `.env` file (see the example at the end of this README).
4. Run Airflow to execute the scraping and training pipeline:
    ```bash
    astro dev init
    astro dev start
    ```
5. Run the app for predictions:
    ```bash
    python app.py
    ```
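
The variable names the code expects are not listed in this README; a typical `.env` for S3 access might look like the following (names and values are placeholders, adjust them to the project's configuration):

```bash
AWS_ACCESS_KEY_ID=<your-access-key-id>
AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
AWS_DEFAULT_REGION=<your-region>
S3_BUCKET_NAME=<your-bucket-name>
```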