{"id":23447698,"url":"https://github.com/techn0man1ac/toxiccommentclassification","last_synced_at":"2025-08-18T19:36:30.927Z","repository":{"id":269356073,"uuid":"905918680","full_name":"techn0man1ac/ToxicCommentClassification","owner":"techn0man1ac","description":"This project aims to develop a model capable of identifying and classifying different levels of toxicity in comments, using the power of BERT(Bidirectional Encoder Representations from Transformers) for text analysis.","archived":false,"fork":false,"pushed_at":"2025-01-06T23:28:33.000Z","size":3972,"stargazers_count":6,"open_issues_count":5,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-06T23:37:13.697Z","etag":null,"topics":["analysis","bert-model","classifying","data-science","docker","machine-learning","python","streamlit","text-classification","transformers-models"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/techn0man1ac.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-19T19:36:41.000Z","updated_at":"2025-01-06T23:28:36.000Z","dependencies_parsed_at":null,"dependency_job_id":"612ef2ac-b87a-4e9b-99b0-234ef189229c","html_url":"https://github.com/techn0man1ac/ToxicCommentClassification","commit_stats":null,"previous_names":["techn0man1ac/toxiccommentclassification"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techn0man1ac%2FToxicCommentClassification","tags_url
":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techn0man1ac%2FToxicCommentClassification/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techn0man1ac%2FToxicCommentClassification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/techn0man1ac%2FToxicCommentClassification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/techn0man1ac","download_url":"https://codeload.github.com/techn0man1ac/ToxicCommentClassification/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":239024427,"owners_count":19569523,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analysis","bert-model","classifying","data-science","docker","machine-learning","python","streamlit","text-classification","transformers-models"],"created_at":"2024-12-23T21:19:25.387Z","updated_at":"2025-02-15T16:43:34.210Z","avatar_url":"https://github.com/techn0man1ac.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Toxic Comment Classification system by \"Team 16.6\" 🛡️\n\n![Team 16.6 logo](https://raw.githubusercontent.com/techn0man1ac/ToxicCommentClassification/refs/heads/main/frontend/imgs/team16_6_Logo.png)\n\nIn the modern era of social media, toxicity in online comments poses a significant challenge, creating a negative atmosphere for communication. 
From abuse to insults, toxic behavior discourages the free exchange of thoughts and ideas among users.\n\nThis project seeks to address this issue by developing a machine learning model to identify and classify varying levels of toxicity in comments. Leveraging the power of BERT (Bidirectional Encoder Representations from Transformers), this system aims to:\n\n- Analyze text for signs of toxicity\n- Classify toxicity levels effectively\n- Support moderators and users in fostering healthier and safer online communities\n\nBy implementing this technology, the project strives to make social media a more inclusive and positive space for interaction.\n\n# 🤝 Team\n\n`Team 16.6` means that each member contributed equally to the project ⚖.\n\nThe project was divided into tasks, which in turn were assigned to the following roles:\n\n`Design director` - [Polina Mamchur](https://github.com/polinamamchur)\n\n`Data science` - [Olena Mishchenko](https://github.com/Alena-Mishchenko)\n\n`Backend` - [Ivan Shkvyr](https://github.com/IvanShkvyr), [Oleksandr Kovalenko](https://github.com/AlexandrSergeevichKovalenko)\n\n`Frontend` - [Oleksii Yeromenko](https://github.com/oleksii-yer)\n\n`Team Lead` - [Serhii Trush](https://github.com/techn0man1ac)\n\n`Scrum Master` - [Oleksandr Kovalenko](https://github.com/AlexandrSergeevichKovalenko), [Polina Mamchur](https://github.com/polinamamchur)\n\n\n# 🎨 Design\n\nThe project started with design development. First, a [prototype of the user interface](https://github.com/techn0man1ac/ToxicCommentClassification/tree/main/design/Pages) was developed:\n\n![Design prototype](https://raw.githubusercontent.com/techn0man1ac/ToxicCommentClassification/refs/heads/main/design/Pages/Analysis%20Page.png)\n\nThe creative director's task was to create a visually appealing application and to prepare the presentation of the project to stakeholders. 
By focusing on both UI/UX and presentation design, the creative director was able to bring the team's vision to life and effectively communicate the [value of the project](https://github.com/techn0man1ac/ToxicCommentClassification/tree/main/design).\n\nA presentation on the project has been developed and can be viewed here (in PDF format):\n\nhttps://github.com/techn0man1ac/ToxicCommentClassification/blob/main/design/Presentation/ToxicCommentClassification.pdf\n\n# 🛠️ Technologies\n- 🖼️ [Figma](https://www.figma.com/): Online interface development and prototyping service with the ability to organize collaborative work\n- 🐍 [Python](https://www.python.org/): The application was developed in the [Python 3.11.8](https://www.python.org/downloads/release/python-3118/) programming language\n- 🤗 [Transformers](https://huggingface.co/docs/transformers/index): A library that provides access to BERT and other advanced machine learning models\n- 🔥 [PyTorch](https://pytorch.org/): A deep learning library with support for GPU computing\n- 📖 [BERT](https://en.wikipedia.org/wiki/BERT_(language_model)): A text analysis model used to produce contextualized word embeddings\n- ☁️ [Kaggle](https://www.kaggle.com/): To save time, we used cloud computing to train the models\n- 🌐 [Streamlit](https://streamlit.io/): We used the Streamlit package to develop the user interface in the frontend\n- 🐳 [Docker](https://www.docker.com/): A platform for building, deploying, and managing containerized applications\n\n\n# 🖥 Data Science\n\nThis part of the work covered dataset research and data processing for training the machine learning models. \n\n## 📊 Dataset (EDA)\n\nTo train the machine learning models, we used the [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge/) dataset. 
\nThe dataset labels six types of toxicity:\n- Toxic \n- Severe Toxic \n- Obscene\n- Threat\n- Insult\n- Identity Hate\n\nThe primary datasets (`train.csv`, `test.csv`, and `sample_submission.csv`) are loaded into [Pandas](https://pandas.pydata.org/) `DataFrames`. \nWe then performed [Exploratory Data Analysis](https://github.com/techn0man1ac/ToxicCommentClassification/blob/main/Data_science/data_science.ipynb) on the dataframes and obtained the following results:\n\n![Data toxic distribution](https://raw.githubusercontent.com/techn0man1ac/ToxicCommentClassification/refs/heads/main/IMGs/dataToxicDistribution.png)\n\nAs the data analysis shows, there is a `class imbalance` of roughly 1 to 10 (toxic/non-toxic). \n\nDistribution of classes:\n\n| Class                                   | Count       | Percentage      |\n|-----------------------------------------|-------------|-----------------|\n| Toxic                                   | 15,294      | 9.58%           |\n| Severe Toxic                            | 1,595       | 1.00%           |\n| Obscene                                 | 8,449       | 5.29%           |\n| Threat                                  | 478         | 0.30%           |\n| Insult                                  | 7,877       | 4.94%           |\n| Identity Hate                           | 1,405       | 0.88%           |\n| Non-toxic                               | **143,346** | **89.83%**      |\n| Total comments                          |   159,571   |                 |\n| Multiclass comments vs Total comments   |   18,873    | 11.8%           |\n\nThe table also shows that the data is multi-label: a single comment can belong to several categories at once.\n\nHere is a visualization of the data from the dataset research. 
The dataset as a bar graph:\n\n![Dataset in bar graph format](https://github.com/techn0man1ac/ToxicCommentClassification/blob/main/IMGs/dataSetGraphic0.png)\n\nThe graphs give basic information about the dataset, such as its size and column types. An imbalance like this tends to hurt the model's predictions, especially for the rare classes.\n\n## 📅 Data processing\n\n![Data processing visualization](https://raw.githubusercontent.com/techn0man1ac/ToxicCommentClassification/refs/heads/main/IMGs/dataProcessing.png)\n\nBecause the original dataset is imbalanced, which would hurt the accuracy of the machine learning models, we applied oversampling with the `resample` function from the [Sklearn](https://scikit-learn.org/) package: examples of the rare classes were duplicated so that each class carries more weight during training.\n\n| Class          | Original dataset | After processing | Percentage of original |\n|----------------|------------------|------------------|------------------------|\n| Toxic          | 15,294           | 40,216           | 262%                   |\n| Severe Toxic   | 1,595            | 16,889           | 1058%                  |\n| Obscene        | 8,449            | 38,009           | 449%                   |\n| Threat         | 478              | 16,829           | 3520%                  |\n| Insult         | 7,877            | 36,080           | 458%                   |\n| Identity Hate  | 1,405            | 19,744           | 1405%                  |\n| Non-toxic      | 143,346          | 143,346          | 100%                   |\n| Total          | 178,444          | 269,396          | 151%                   |\n\nAs a result, after processing, the toxic/non-toxic balance for each class was roughly 50/50. 
Thanks to this [data processing](https://github.com/techn0man1ac/ToxicCommentClassification/blob/main/Backend/Models/Model_0_bert-base-uncased/FOR_PROJECT_BERT_oversampling_with_optuna_correct_oversampling.ipynb), the models' classification accuracy increased by several percent.\n\n# ⚙️ Machine learning (Back End)\n\nTo solve the challenge, we chose three popular architectures: [BERT](https://github.com/techn0man1ac/ToxicCommentClassification/tree/main/Backend/Models/Model_0_bert-base-uncased), [DistilBERT](https://github.com/techn0man1ac/ToxicCommentClassification/tree/main/Backend/Models/Model_2_distilbert), and [ALBERT](https://github.com/techn0man1ac/ToxicCommentClassification/tree/main/Backend/Models/Model_1_albert). Each link leads to the source code used to train that model. \n\nHere is a visual representation of the main parameters of the models:\n\n![Model metrics comparison](https://raw.githubusercontent.com/techn0man1ac/ToxicCommentClassification/refs/heads/main/IMGs/modelMetricsComparison.png)\n\nHere is a detailed description of each of the machine learning models we trained:\n\n## BERT ֎\n\nThis project demonstrates toxic comment classification using the [bert-base-uncased](https://huggingface.co/google-bert/bert-base-uncased) model from the BERT family. We used [Optuna](https://optuna.org/) to automate hyperparameter selection.\n\n### 1. **Toxic Comment Classification with BERT**\n- Utilized the `bert-base-uncased` model with PyTorch for flexibility and ease of use.  \n- Seamlessly integrated with Hugging Face Transformers.  \n- Training accelerated by ~30x on a GPU compared with a CPU, efficiently handling BERT’s computational demands.\n\n### 2. **Dataset Balancing**\n- Addressed dataset imbalance (90% non-toxic, 10% toxic) using oversampling with `sklearn`.  \n- Ensured rare toxic categories received equal attention by balancing class distributions.  \n- Improved model performance in recognizing rare toxic classes.\n\n### 3. 
**Key Techniques**\n- **Tokenization**: Tokenized the preprocessed data with `BertTokenizer`.  \n- **Loss Function**: Used `BCEWithLogitsLoss` with class weights to emphasize rare classes.  \n- **Gradient Clipping**: Stabilized training with gradient clipping (`max_norm`).  \n- **Hyperparameter Tuning**: Tuned batch size, learning rate, and epochs using Optuna.  \n\n### 4. **Threshold Optimization**\n- Used `itertools.product` to find optimal thresholds for each class.  \n- Improved recall and F1-score (by 1-1.5%) for better multi-label classification.\n\n### 5. **Performance and Key Model Details**\n- **Validation Metrics**:  \n  - Accuracy: 0.95 ✅\n  - Precision: 0.97 ✅\n  - Recall: 0.96 ✅  \n\n- **Model Specifications**:  \n  - Vocabulary Size: 30,522  \n  - Hidden Size: 768  \n  - Attention Heads: 12  \n  - Hidden Layers: 12  \n  - Total Parameters: 110M  \n  - Maximum Sequence Length: 512 (128 tokens used in this project)\n  - Pre-trained Tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP).\n\n## DistilBERT ֎\n \nThis project demonstrates toxic comment classification using the [distilbert-base-uncased](https://huggingface.co/distilbert/distilbert-base-uncased) model, a lightweight and efficient version of BERT.\n\n### 1. Using PyTorch  \n- Selected for its flexibility, ease of use, and strong community support.  \n- Seamlessly integrated with Hugging Face Transformers.  \n\n### 2. Dataset Balancing  \n- Addressed dataset imbalance (90% non-toxic, 10% toxic) using `sklearn.utils.resample`.  \n- Applied stratified splitting for training and test datasets.  \n- Oversampled rare toxic classes, improving model recognition of all categories.  \n\n### 3. Key Techniques  \n- **Tokenization**: Preprocessed data with `DistilBertTokenizer`.  \n- **Loss Function**: Binary Cross-Entropy with Logits (`BCEWithLogitsLoss`).  \n- **Hyperparameter Tuning**: Optimized batch size (`16`), learning rate (`2e-5`), and epochs (`3`) with Optuna.  \n\n### 4. 
Accelerated Training  \n- Utilized GPU for training, achieving a ~30x speedup over CPU.\n\n### 5. Threshold Optimization  \n- Used `itertools.product` to determine optimal thresholds for each class.  \n- Improved recall and F1-score for multi-label classification. \n\n### 6. **Performance and Key Model Details** \n- **Validation Metrics**:   \n  - Accuracy: 0.92 ✅\n  - Precision: 0.79 ✅\n  - Recall: 0.78 ✅\n\n- **Model Specifications**:  \n  - Vocabulary Size: 30,522  \n  - Hidden Size: 768 \n  - Attention Heads: 12 \n  - Hidden Layers: 6  \n  - Total Parameters: 66M  \n  - Maximum Sequence Length: 512 (128 tokens used in this project)\n  - Pre-trained Tasks: Masked Language Modeling (MLM).\n\n## ALBERT ֎\n\nThis project demonstrates toxic comment classification using the [albert-base-v2](https://huggingface.co/albert/albert-base-v2) model, a lightweight and efficient version of BERT designed to reduce parameters while maintaining high performance.\n\n### 1. Using PyTorch  \n- Selected for its flexibility, ease of use, and strong community support.  \n- Seamlessly integrated with Hugging Face Transformers.  \n\n### 2. Dataset Balancing  \n- Addressed dataset imbalance (90% non-toxic, 10% toxic) using `sklearn.utils.resample`.  \n- Applied stratified splitting for training and test datasets.  \n- Oversampled rare toxic classes to improve model recognition of all categories.  \n\n### 3. Key Techniques  \n- **Tokenization**: Preprocessed data with `AlbertTokenizer`.  \n- **Loss Function**: Binary Cross-Entropy with Logits (`BCEWithLogitsLoss`).  \n- **Hyperparameter Tuning**: Optimized batch size (`8`), learning rate (`2e-5`), and epochs (`3`) using Optuna.  \n\n### 4. Accelerated Training  \n- Utilized GPU for training, achieving a ~30x speedup over CPU.  \n\n### 5. Threshold Optimization  \n- Used `itertools.product` to determine optimal thresholds for each class.  \n- Enhanced recall and F1-score for multi-label classification.  \n\n### 6. 
**Performance and Key Model Details**\n- **Validation Metrics**:  \n  - Accuracy: 0.92 ✅  \n  - Precision: 0.84 ✅ \n  - Recall: 0.69 ✅ \n\n- **Model Specifications**:   \n  - Vocabulary Size: 30,000 \n  - Hidden Size: 768  \n  - Attention Heads: 12 \n  - Hidden Layers: 12 \n  - Intermediate Size: 3072  \n  - Total Parameters: 11M  \n  - Maximum Sequence Length: 512 (128 tokens used in this project)\n  - Pre-trained Tasks: Masked Language Modeling (MLM) and Sentence Order Prediction (SOP).  \n\nWe used cloud computing on [Kaggle](https://www.kaggle.com/code/techn0man1ac/toxiccommentclassificationsystem/) to speed up model training.\n\n# 💻 How to install\n\nThere are two ways to install the application on your computer:\n\n## Simple 😎\n\nDownload [Docker](https://www.docker.com/) -\u003e Log in to your profile in the application -\u003e Open the Docker terminal (at the bottom of the application window) -\u003e Enter the command:\n\n```\ndocker pull techn0man1ac/toxiccommentclassificationsystem:latest\n```\n\nAfter that, all the necessary files will be downloaded from [DockerHub](https://hub.docker.com/repository/docker/techn0man1ac/toxiccommentclassificationsystem) -\u003e Go to the `Images` tab -\u003e Launch the image by clicking `Run` -\u003e Click `Optional settings` -\u003e Set the host port `8501`\n\n![Set host port 8501](https://raw.githubusercontent.com/techn0man1ac/ToxicCommentClassification/refs/heads/main/IMGs/dockerRun.png)\n\nOpen http://localhost:8501 in your browser.\n\n## Like a pro 💪\n\nThis way requires some experience with the command line, GitHub, and Docker.\n\n1. Clone the repository:\n\n```\ngit clone https://github.com/techn0man1ac/ToxicCommentClassification.git\n```\n\n2. 
Download the [model files from this link](https://drive.google.com/drive/folders/17kR8llTZ1yNig5xFjrnpLCKyRfQkIwz2?usp=sharing). After downloading the `albert`, `bert`, and `distilbert` directories, put them in the `frontend\\saved_models` directory, like this:\n\n![Catalog with models](https://raw.githubusercontent.com/techn0man1ac/ToxicCommentClassification/refs/heads/main/IMGs/modelsCatalogExample.png)\n\n3. Open a command line/terminal, navigate to the `ToxicCommentClassification` directory, and build and start the container with the command:\n\n```\ndocker-compose up\n```\n\n4. The application will then start, and a browser window will open at http://localhost:8501\n\nTo turn off the application, run the command:\n\n```\ndocker-compose down\n```\n\n# 🚀 How to use (Front End)\n\nAfter launching the application, you will see the project's home tab with a description of the application and the technologies used in it. The program looks like this when running:\n\n![Models test](https://raw.githubusercontent.com/techn0man1ac/ToxicCommentClassification/refs/heads/main/IMGs/appTests.png)\n\nThe application interface is intuitive and user-friendly.\n\nThe structure of the tabs is as follows:\n- `Home` - Here you can find a description of the app, the technologies used for its operation, the mission and vision of the project, and acknowledgments\n- `Team` - This tab lists the creators of the app, without whom it would not exist\n- `Metrics` - In this tab, you can choose one of the 3 models; once selected, the technical characteristics of that model are loaded\n- `Classification` - A tab where you can test the models.\n\n![Models test - That f@@ing awesome](https://raw.githubusercontent.com/techn0man1ac/ToxicCommentClassification/refs/heads/main/IMGs/appClassify.png)\n\nThe main elements of the interface:\n- `Choose your model` - A drop-down list where you can select one of 3 pre-trained machine learning 
models\n- `Enter your comment here` - In this field you can type text to test it for toxicity; if the text is found to be toxic, it is classified further\n- `Upload your text file` - Clicking here opens a dialog box for choosing a file in `txt` format (after a file is uploaded, the text in the text field is ignored)\n- `Display detailed toxicity` - A checkbox that displays a detailed classification by class if the model considers the text to be toxic\n\n![Classify tab interface](https://raw.githubusercontent.com/techn0man1ac/ToxicCommentClassification/refs/heads/main/IMGs/classifyInterface.png)\n\nThe application can also classify text files in the txt format.\n\nThe [app](https://github.com/techn0man1ac/ToxicCommentClassification/tree/main/frontend), written with Streamlit, provides a user-friendly interface for exploring and trying out the included BERT-based models for comment toxicity classification.\n\n# 🎯 Mission \n\nThe mission of our project is to create a reliable and accurate machine learning model that can effectively classify different levels of toxicity in online comments. We plan to use advanced technologies to analyze text and create a system that will help moderators and users create healthier and safer social media environments.\n\n# 🌟 Vision\n\nOur vision is to make online communication safe and comfortable for everyone. We want to build a system that not only detects toxic comments but also helps to understand the context and to reduce the number of such messages. 
We want to create a tool that will be used not only by moderators, but also by every user to provide a safe environment for the exchange of thoughts and ideas.\n\n# 📜 License\n\nThis project is a group work published under the [MIT license](https://github.com/techn0man1ac/ToxicCommentClassification/blob/main/LICENSE), and all project contributors are listed in the license text.\n\n# 👏 Acknowledgments\n\nThis project was developed by a team of professionals as a graduation project for the [GoIT](https://goit.global/) **Python Data Science and Machine Learning** course.\n\nThank you for exploring our project! Together, we can make online spaces healthier and more respectful.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftechn0man1ac%2Ftoxiccommentclassification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftechn0man1ac%2Ftoxiccommentclassification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftechn0man1ac%2Ftoxiccommentclassification/lists"}