{"id":22026239,"url":"https://github.com/prakhar-ff13/toxic-comments-classification","last_synced_at":"2025-05-07T10:15:29.153Z","repository":{"id":111693579,"uuid":"189018399","full_name":"Prakhar-FF13/Toxic-Comments-Classification","owner":"Prakhar-FF13","description":"Predict the toxicity rating of comment made by the user.","archived":false,"fork":false,"pushed_at":"2019-05-28T11:53:06.000Z","size":1077,"stargazers_count":43,"open_issues_count":0,"forks_count":33,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-05-07T10:15:10.325Z","etag":null,"topics":["data-analysis","data-science","data-visualization","deep-learning","kaggle-competition","kaggle-dataset","lstm","lstm-neural-networks","machine-learning","natural-language-processing","nlp","nlp-machine-learning","python3"],"latest_commit_sha":null,"homepage":null,"language":"HTML","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Prakhar-FF13.png","metadata":{"files":{"readme":"README.MD","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-05-28T11:50:31.000Z","updated_at":"2025-04-29T07:46:55.000Z","dependencies_parsed_at":"2023-03-27T16:03:44.874Z","dependency_job_id":null,"html_url":"https://github.com/Prakhar-FF13/Toxic-Comments-Classification","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Prakhar-FF13%2FToxic-Comments-Classification","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Prakhar-FF13%2FToxic-Comments-Classification/tags","releases_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Prakhar-FF13%2FToxic-Comments-Classification/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Prakhar-FF13%2FToxic-Comments-Classification/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Prakhar-FF13","download_url":"https://codeload.github.com/Prakhar-FF13/Toxic-Comments-Classification/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":252856558,"owners_count":21814858,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","data-science","data-visualization","deep-learning","kaggle-competition","kaggle-dataset","lstm","lstm-neural-networks","machine-learning","natural-language-processing","nlp","nlp-machine-learning","python3"],"created_at":"2024-11-30T07:25:57.514Z","updated_at":"2025-05-07T10:15:29.144Z","avatar_url":"https://github.com/Prakhar-FF13.png","language":"HTML","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Toxicity Classification:\n\n## 1. Business Problem:\n**Source:** https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification\n\n**Description:** https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview/description\n\n**Problem Statement:** Given a comment made by the user, predict the toxicity of the comment.\n\n\n## 2. 
Machine Learning Problem Formulation:\n\n### 2.1 Data: \n\n- Source: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/data\n- There is a single csv file for training and a single csv file for testing.\n- Columns in train data:\n\t- Comment_text: The comment as a string; this is the input used to predict toxicity.\n\t- target: The value to be predicted (between 0 and 1).\n\t- Data also has additional toxicity subtype attributes: (Model does not have to predict these)\n\t\t- severe_toxicity\n\t\t- obscene\n\t\t- threat\n\t\t- insult\n\t\t- identity_attack\n\t\t- sexual_explicit\n\t- Comment_text data also has identity attributes carved out from it, some of which are:\n\t\t- male\n\t\t- female\n\t\t- homosexual_gay_or_lesbian\n\t\t- christian\n\t\t- jewish\n\t\t- muslim\n\t\t- black\n\t\t- white\n\t\t- asian\n\t\t- latino\n\t\t- psychiatric_or_mental_illness\n\t- Apart from the above features, the train data also provides metadata from Jigsaw, such as:\n\t\t- toxicity_annotator_count\n\t\t- identity_annotator_count\n\t\t- article_id\n\t\t- funny\n\t\t- sad\n\t\t- wow\n\t\t- likes\n\t\t- disagree\n\t\t- publication_id\n\t\t- parent_id\n\t\t- created_date\n\n\n### 2.2 Example Datapoints and Labels:\n\n**Comment:** i'm a white woman in my late 60's and believe me, they are not too crazy about me either!!\n\n- Toxicity Labels: All 0.0\n- Identity Mention Labels: female: 1.0, white: 1.0 (all others 0.0)\n\n**Comment:** Why would you assume that the nurses in this story were women?\n\n- Toxicity Labels: All 0.0\n- Identity Mention Labels: female: 0.8 (all others 0.0)\n\n**Comment:** Continue to stand strong LGBT community. Yes, indeed, you'll overcome and you have.\n\n- Toxicity Labels: All 0.0\n- Identity Mention Labels: homosexual_gay_or_lesbian: 0.8, bisexual: 0.6, transgender: 0.3 (all others 0.0)\n\n\n### 2.3 Type of Machine Learning Problem:\nWe have to predict the toxicity level (the target attribute). 
The values range from 0 to 1 inclusive, so this is a regression problem. It can also be treated as a binary classification problem by labelling every value below 0.5 as non-toxic and every value above 0.5 as toxic.\n\n\n### 2.4 Performance Metric:\nThe competition uses ROC_AUC as the metric after converting the numeric target variable into a categorical variable with a threshold of 0.5: any comment above 0.5 is assumed to be toxic and any comment below it non-toxic. For our own training and evaluation we use the mean squared error (MSE).\nMore on evaluation: https://www.kaggle.com/c/jigsaw-unintended-bias-in-toxicity-classification/overview/evaluation\n\n### 2.5 Machine Learning Objectives and Constraints:\n\n**Objectives:** Predict the toxicity of a comment made by the user. (0 -\u003e not toxic, 1 -\u003e highest toxicity level)\n\n**Constraints:**\n\n- The model should predict the toxicity rating quickly.\n- Interpretability is not needed.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprakhar-ff13%2Ftoxic-comments-classification","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fprakhar-ff13%2Ftoxic-comments-classification","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fprakhar-ff13%2Ftoxic-comments-classification/lists"}