{"id":18146522,"url":"https://github.com/sathviknayak123/sentiment-anyalysis","last_synced_at":"2026-04-09T09:32:55.625Z","repository":{"id":260108059,"uuid":"862353263","full_name":"SathvikNayak123/sentiment-anyalysis","owner":"SathvikNayak123","description":"Sentiment Analysis using DistlBERT Transformer from HuggingFace. Also integrated Airflow for end-to-end pipeline","archived":false,"fork":false,"pushed_at":"2024-12-26T09:11:30.000Z","size":18602,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-13T02:47:45.469Z","etag":null,"topics":["airflow","astronomer","distilbert","flask","huggingface-transformer","nlp","python","s3-bucket","selenium-webdriver","sentiment-analysis","tensorflow","web-scraping-python"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/SathvikNayak123.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-24T13:10:01.000Z","updated_at":"2024-12-26T09:11:34.000Z","dependencies_parsed_at":null,"dependency_job_id":"79a0c60a-a10d-4131-a4f0-f3926e1529f4","html_url":"https://github.com/SathvikNayak123/sentiment-anyalysis","commit_stats":null,"previous_names":["sathviknayak123/sentiment-anyalysis-sagemaker"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SathvikNayak123%2Fsentiment-anyalysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SathvikNayak123%2F
sentiment-anyalysis/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SathvikNayak123%2Fsentiment-anyalysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/SathvikNayak123%2Fsentiment-anyalysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/SathvikNayak123","download_url":"https://codeload.github.com/SathvikNayak123/sentiment-anyalysis/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247550642,"owners_count":20956984,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","astronomer","distilbert","flask","huggingface-transformer","nlp","python","s3-bucket","selenium-webdriver","sentiment-analysis","tensorflow","web-scraping-python"],"created_at":"2024-11-01T21:08:03.146Z","updated_at":"2025-12-30T23:06:29.541Z","avatar_url":"https://github.com/SathvikNayak123.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Sentiment Analysis with DistilBERT\n\n## Project Overview  \nThis project focuses on building a sentiment analysis model to predict the sentiment of customer reviews using **DistilBERT**. The pipeline involves scraping reviews, preprocessing the data, training a fine-tuned DistilBERT model, and deploying it through a Flask application. The workflow is automated using **Apache Airflow**.\n\n---\n\n## What is DistilBERT?\n\n- DistilBERT is a smaller, faster, and more efficient version of BERT. 
It was developed through a process called knowledge distillation, where a smaller model (the student) is trained to reproduce the behavior of a larger model (the teacher) while retaining most of its performance. \n- DistilBERT retains 97% of BERT's performance on various NLP tasks while being 40% smaller and 60% faster.\n\n  ![DistilBERT](docs/0_06fQisdQnb_BPajl.png)\n\n- BERT-base consists of 12 transformer layers, an embedding layer, and a prediction layer.\n- During distillation, DistilBERT learns from BERT by mimicking its outputs (logits) and intermediate representations.\n- As a result, DistilBERT has 6 transformer layers (half of BERT's 12 layers) while maintaining similar functionality.\n\n  ![BERT\u0026DistilBERT](docs/The-DistilBERT-model-architecture-and-components.png)\n\n---\n\n## Features  \n- **Data Collection**:  \n  - Scraped 20,000+ Amazon reviews using **Selenium WebDriver**.  \n  - Stored the collected data securely in an **AWS S3 bucket**.  \n\n- **Data Preprocessing**:  \n  - Cleaned and processed reviews and ratings:  \n    - Removed non-English reviews.  \n    - Removed stopwords, special characters, and extra spaces.  \n    - Performed lemmatization and stemming to normalize text.\n    - Encoded ratings ranging from 0-5 stars as the labels negative (0), neutral (1), and positive (2).  \n  - Stored the processed data in **S3** for further use.  \n\n- **Data Tokenization and Preparation**:  \n  - Tokenized reviews using **DistilBERT tokenizer**.  \n  - Divided the dataset into **training**, **validation**, and **testing** sets.\n\n- **Model Training**:  \n  - Imported **DistilBERT** from **Hugging Face Transformers**.  \n  - Fine-tuned the model on the dataset.  \n  - Implemented **early stopping** to optimize training.  
\n  - Used **customized class weights** to handle class imbalance in the training dataset.\n\n- **Deployment**:  \n  - Built a user-friendly **Flask application** for making predictions.\n    ![Flask](docs/Screenshot%202024-09-29%20171650.png)\n\n- **Workflow Automation**:  \n  - Integrated **Apache Airflow** to automate the entire pipeline. \n    ![Airflow](docs/Screenshot%202024-11-10%20215335.png)\n\n---\n\n## Result\n\n- **Accuracy**: Achieved an accuracy of 85% on the test dataset.\n- **F1-score**: Achieved F1-scores of 0.9 for the positive class and 0.7 for the other classes on the test dataset.\n- Improved multi-class classification performance by implementing class-weighted training, which raised the Precision-Recall AUC for minority classes (e.g., 'neutral') by over 30% and resulted in more balanced and accurate predictions across all categories.\n  - Before: \n    ![Before](docs/output.png)\n  - After class-weighted training:\n    ![After](docs/download.png)\n\n## Getting Started  \n\nTo get started with this project, follow these steps:\n\n1. Clone the repository:  \n   ```bash\n   git clone https://github.com/SathvikNayak123/sentiment-anyalysis\n   ```\n2. Install the necessary dependencies:\n    ```bash\n    pip install -r requirements.txt\n    ```\n3. Add your AWS S3 credentials to a `.env` file.\n\n4. Run Airflow to execute the scraping and training pipeline:\n    ```bash\n    astro dev init\n    astro dev start\n    ```\n\n5. Run the app for prediction:\n    ```bash\n    python app.py\n    ```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsathviknayak123%2Fsentiment-anyalysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsathviknayak123%2Fsentiment-anyalysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsathviknayak123%2Fsentiment-anyalysis/lists"}