{"id":29510485,"url":"https://github.com/davidcamilo0710/hate_speech_analysis","last_synced_at":"2026-05-09T06:02:48.496Z","repository":{"id":304722977,"uuid":"1019734629","full_name":"davidcamilo0710/hate_speech_analysis","owner":"davidcamilo0710","description":"Hate speech detection using NLP for linguistic analysis and machine learning (XGBoost) for classification with Python and SpaCy.","archived":false,"fork":false,"pushed_at":"2025-07-14T19:57:22.000Z","size":7178,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-25T02:51:36.107Z","etag":null,"topics":["hate-speech-detection","linguistic-analysis","nlp","scikit-learn","spacy","xgboost"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davidcamilo0710.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-14T19:36:29.000Z","updated_at":"2025-07-14T20:35:41.000Z","dependencies_parsed_at":"2025-07-15T00:09:52.196Z","dependency_job_id":null,"html_url":"https://github.com/davidcamilo0710/hate_speech_analysis","commit_stats":null,"previous_names":["davidcamilo0710/hate_speech_analysis"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/davidcamilo0710/hate_speech_analysis","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidcamilo0710%2Fhate_speech_analysis","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidcamilo0710%2Fhate_speech_analysis/tags
","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidcamilo0710%2Fhate_speech_analysis/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidcamilo0710%2Fhate_speech_analysis/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/davidcamilo0710","download_url":"https://codeload.github.com/davidcamilo0710/hate_speech_analysis/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidcamilo0710%2Fhate_speech_analysis/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32809147,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-08T08:22:46.396Z","status":"online","status_checked_at":"2026-05-09T02:00:06.633Z","response_time":123,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hate-speech-detection","linguistic-analysis","nlp","scikit-learn","spacy","xgboost"],"created_at":"2025-07-16T09:01:01.231Z","updated_at":"2026-05-09T06:02:48.481Z","avatar_url":"https://github.com/davidcamilo0710.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Hate Speech Analysis and Classification with NLP and Machine Learning\n\n\u003cimg width=\"2816\" height=\"1536\" alt=\"hate\" src=\"https://github.com/user-attachments/assets/f1612164-38ee-4500-a0b4-ada08dadc1ad\" /\u003e\n\nThis repository contains a comprehensive 
project for the analysis and automatic detection of hate speech in text. The project is divided into two main parts:\n\n1.  **Linguistic Analysis (hate-analysis):** A deep exploratory analysis to identify the features and patterns that distinguish hate speech from non-hate speech.\n2.  **Machine Learning Classification (hate-classification):** The development and evaluation of supervised learning models to classify messages as \"Hate\" or \"Non-Hate\" based on the extracted features.\n\n---\n\n## Part 1: Linguistic Analysis and Feature Extraction\n\nIn this first phase, an exhaustive analysis was conducted on a corpus of **574,639 comments** to understand the properties of hate speech. **SpaCy** and the `es_core_news_md` model were used to process the text and extract linguistic features.\n\n### Key Findings\n\nThe comparative analysis between hate comments (2.14% of the total) and non-hate comments (97.86%) revealed significant differences:\n\n* **Message Length:** Hate comments are drastically shorter.\n    * **Average words (Hate):** 15.60\n    * **Average words (Non-Hate):** 107.14\n* **Message Structure:** Hate messages are more direct and contain fewer sentences.\n    * **Average sentences (Hate):** 1.55\n    * **Average sentences (Non-Hate):** 3.99\n* **Named Entity (NER) Usage:** Hate comments tend to be less specific and more generalized.\n    * **% of comments with NER (Hate):** 36.52%\n    * **% of comments with NER (Non-Hate):** 59.38%\n    * Specifically, only **17.59%** of hate messages mention a person (`PERSON`), compared to **28.77%** in non-hate messages.\n* **Lexicon Used:** There is a clear difference in vocabulary.\n    * **Hate Lemmas:** Insults (\"mierda\", \"puta\", \"asco\"), pejorative terms (\"gentuza\", \"miserable\"), and highly charged political words (\"gobierno\", \"fascista\", \"comunista\") are predominant.\n    * **Non-Hate Lemmas:** The focus is on informative and neutral topics (\"año\", \"persona\", \"caso\", \"vacuna\", 
\"salud\").\n* **Morphology:** It was observed that hate speech contains a higher proportion of nouns and adjectives in the **masculine plural** (18.42%) compared to the feminine plural (9.26%), suggesting a focus on male collectives.\n\n**Conclusion of Part 1:** The analysis demonstrated that there are quantifiable linguistic features (length, structure, lexicon, etc.) that act as strong indicators for differentiating hate speech, justifying their use in building a classification model.\n\n---\n\n## Part 2: Hate Speech Classification\n\nUsing the findings from the first part, a Machine Learning pipeline was built to classify messages. For this task, a balanced dataset of **10,000 comments** (50% Hate, 50% Non-Hate) with pre-extracted numerical features was used.\n\n### Models and Evaluation\n\nThree supervised classification algorithms were trained and compared:\n1.  **Random Forest Classifier**\n2.  **Support Vector Machine (SVM)**\n3.  **XGBoost (Extreme Gradient Boosting)**\n\nThe models were evaluated using key metrics such as **F1-Score**, **Precision**, **Recall**, and **AUC-ROC**, as it is crucial in hate speech detection to balance the correct identification of toxic messages (Recall) and the avoidance of false accusations (Precision).\n\n### Winning Model: XGBoost\n\nThe model with the best overall performance was **XGBoost**, trained on the original (non-standardized) features.\n\n* **F1-Score:** **0.9811**\n* **Accuracy:** 98.10%\n* **Precision:** 97.72%\n* **Recall:** 98.50%\n* **AUC-ROC:** 0.9971\n\nThis means the model is capable of **detecting 98.5% of all hate speech messages**, with **97.7% of its alerts being correct**. 
In practice, for every 1,000 hate speech messages, the model would miss only 15.\n\n## Results Visualization\n\nThe graphs from the classification notebook summarize the models' performance visually.\n\n#### Performance Comparison (F1-Score)\n\nThis graph shows that **XGBoost** and **Random Forest** achieve the best performance, and that data scaling (the Standardized variant) hurts SVM performance.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg width=\"8376\" height=\"3144\" alt=\"F1-Score Comparison\" src=\"https://github.com/user-attachments/assets/6c909d66-c34a-4cba-a833-502b496327da\" /\u003e\n\u003c/p\u003e\n\n#### ROC Curves\n\nThe ROC curves demonstrate the excellent ability of all models to distinguish between the two classes, with AUC values very close to 1.\n\n---\n\n## Code and Technologies\n\n* **Linguistic Analysis Notebook:** **[hate-analysis.ipynb](https://github.com/davidcamilo0710/hate_speech_analysis/blob/main/hate_analysis.ipynb)**\n* **Classification Notebook:** **[hate-classification.ipynb](https://github.com/davidcamilo0710/hate_speech_analysis/blob/main/hate_clasification.ipynb)**\n* **Language:** Python\n* **Core Libraries:**\n    * **NLP Analysis:** SpaCy, Pandas\n    * **Machine Learning:** Scikit-learn, XGBoost\n    * **Visualization:** Matplotlib, Seaborn\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidcamilo0710%2Fhate_speech_analysis","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidcamilo0710%2Fhate_speech_analysis","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidcamilo0710%2Fhate_speech_analysis/lists"}