{"id":21437812,"url":"https://github.com/ymorsi7/hatespeechnlp","last_synced_at":"2026-04-13T20:03:19.473Z","repository":{"id":63927259,"uuid":"568926923","full_name":"ymorsi7/HateSpeechNLP","owner":"ymorsi7","description":"Detecting and analyzing hate speech on videos relating to sexism on a right-wing platform (NLTK, scikit-learn, pandas).","archived":false,"fork":false,"pushed_at":"2022-12-02T15:02:14.000Z","size":8799,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-16T23:42:43.638Z","etag":null,"topics":["decision-tree-classifier","nlp","nlp-machine-learning","nltk-python","pandas","scikit-learn","tf-idf"],"latest_commit_sha":null,"homepage":"","language":"CSS","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ymorsi7.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-11-21T17:54:10.000Z","updated_at":"2024-10-16T12:42:33.000Z","dependencies_parsed_at":"2023-01-14T14:45:42.345Z","dependency_job_id":null,"html_url":"https://github.com/ymorsi7/HateSpeechNLP","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/ymorsi7/HateSpeechNLP","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymorsi7%2FHateSpeechNLP","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymorsi7%2FHateSpeechNLP/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymorsi7%2FHateSpeechNLP/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymorsi7%2FHateSpeechNLP/manifests","owner_url":"https://repos.ecosyste
.ms/api/v1/hosts/GitHub/owners/ymorsi7","download_url":"https://codeload.github.com/ymorsi7/HateSpeechNLP/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ymorsi7%2FHateSpeechNLP/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31768667,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-13T15:25:13.801Z","status":"ssl_error","status_checked_at":"2026-04-13T15:25:09.162Z","response_time":93,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["decision-tree-classifier","nlp","nlp-machine-learning","nltk-python","pandas","scikit-learn","tf-idf"],"created_at":"2024-11-23T00:29:27.253Z","updated_at":"2026-04-13T20:03:19.453Z","avatar_url":"https://github.com/ymorsi7.png","language":"CSS","funding_links":[],"categories":[],"sub_categories":[],"readme":"# NLP Hate Speech Detection\n\n## About\n\nThis is my first NLP project, where I try to detect hate speech on videos relating to sexism on the right-wing platform, Rumble.\n\nFor the website I created for this project, click [here](https://cute-rugelach-1cd3d3.netlify.app/).\n\nFor the video presentation, click [here](https://youtu.be/txlp-4BAxR8).\n\n\nThe following is a list of my initial thoughts:\n\n- NLP can be used to automatically analyze and structure text (quickly and cost-effectively).\n- Using TF-IDF vectorization, we’ll locate the most 
significant words in each comment section, and overall.\n- Subsequently, we will train a model to classify hate speech using data extracted from the site using a decision tree.\n\n\n\n\n## Table of Contents\n\u003cdetails open\u003e\n\u003csummary\u003eClick to Expand/Collapse\u003c/summary\u003e\n\n- [NLP Hate Speech Detection](#nlp-hate-speech-detection)\n  - [About](#about)\n  - [Table of Contents](#table-of-contents)\n  - [Scraping](#scraping)\n    - [Case Studies:](#case-studies)\n  - [Data Cleaning](#data-cleaning)\n  - [Term Frequency Inverse Document Frequency (TF-IDF)](#term-frequency-inverse-document-frequency-tf-idf)\n  - [TF-IDF Results](#tf-idf-results)\n      - [Video 1](#video-1)\n      - [Video 2](#video-2)\n      - [Video 3](#video-3)\n      - [Video 4](#video-4)\n      - [Video 5](#video-5)\n      - [Overall Top 15 Words](#overall-top-15-words)\n  - [Decision Tree Hate Speech Detection Model](#decision-tree-hate-speech-detection-model)\n  - [Hate Speech Detection Results](#hate-speech-detection-results)\n  - [Sentiment Analysis](#sentiment-analysis)\n  - [Conclusion](#conclusion)\n  - [Bonus: Word Clouds](#bonus-word-clouds)\n    - [Deep Categorization](#deep-categorization)\n    - [Text Clustering](#text-clustering)\n    - [IPTC Text Classification](#iptc-text-classification)\n    - [Topics Extraction](#topics-extraction)\n\u003c/details\u003e\n\n\u003chr\u003e\n\n## Scraping\n\nBefore scraping, I checked Rumble's site, which stated the following.\n\n\u003e \"Systematic retrieval of data or Content from the Rumble Service to create or compile, directly or indirectly, a collection, compilation, library, database or directory without prior written permission from Rumble is prohibited.\"\n\nBecause they specified that \"systematic\" retrieval of data is prohibited, I needed to scrape the data manually.\n\nAs someone who loves automation, this was a tough task, but it needed to be done.\n\nTo find sexist content on Rumble, I selected five videos 
relating to sexism on the platform:\n\n- [1] [\"Women Should Not Be in Combat Roles: Change My Mind\"](https://rumble.com/v1r74qk-women-should-not-be-in-combat-roles-change-my-mind.html)\n- [2] [\"The Problem With Modern Women\"](https://rumble.com/v1wqypw-the-problem-with-modern-women-w-layah-heilpern-jedediah-bila-live-episode-6.html)\n- [3] [\"Tucker Carlson Gives CNN Some Tips About Sexism in Hilarious Segment\"](https://rumble.com/vfjlp5-tucker-carlson-gives-cnn-some-tips-about-sexism-in-hilarious-segment.html)\n- [4] [\"WOMAN DEFENDS ANDREW TATE AND ARGUES WITH FEMINISTS AND TRANGENDERS\"](https://rumble.com/v1q566l-woman-defends-andrew-tate-and-argues-with-feminists-and-trangenders-must-wa.html)\n- [5] [\"Massive Feminist March Against Gender Violence in Rome\"](https://rumble.com/v1xflms-massive-feminist-march-against-gender-violence-in-rome.html)\n\nI decided to select the top 50 comments on each video, which totaled 250 comments.\n\nAn advantage of manually scraping the data is that I was able to notice certain sexist comments and analyze them.\n\n### Case Studies:\n\nEx.\n\u003e \"Have you noticed it’s all the weird fat looking ones that hate Tate.\"\n\n\u003e \"Those are some fine daughters they've been raising in Europe.\"\n\n\u003e \"Women are equipped to be h*es.\"\n\n\u003e \"Women today do not know what a woman is supposed to be, they are to busy trying to be like men.\"\n\n\u003cbr\u003e\nThe above comments show that there are different types of sexism, and that it is not always obvious. Such types seen above include employing derogatory terms, complimenting appearance, insulting appearance, and making generalizations.\n\n\u003chr\u003e\n\n## Data Cleaning\n\nAfter scraping the data, I needed to clean it. 
Thankfully, because I collected the data manually, it was already in a clean format, and all I needed to do was drop the unnecessary columns.\n\n\u003chr\u003e\n\n## Term Frequency Inverse Document Frequency (TF-IDF)\n\n\u003cbr\u003e\nBelow is how the TF (term frequency) is found:\n\n$TF = \\frac{(number\\ of\\ instances\\ of\\ word\\ in\\ document)}{(total\\ number\\ of\\ words\\ in\\ document)}$\n\n\u003cbr\u003e\n\nThe following is the method to calculate the IDF (inverse document frequency):\n\n$IDF = \\log\\left(\\frac{(number\\ of\\ documents\\ in\\ corpus)}{(number\\ of\\ documents\\ containing\\ term)}\\right)$\n\n\u003cbr\u003e\n\nThe TF-IDF is then calculated by multiplying the TF and IDF.\n\n\n\u003chr\u003e\n\nBelow is the function I used to find the top n words in a corpus. In our case, the corpus was the comments section of the selected videos.\n\nFirst, we create a TF-IDF vectorizer object. Then, we fit the vectorizer to the corpus. Finally, we build a dataframe of TF-IDF scores, sort it in descending order, and return the top n rows to the user.\n\n\n\n```python\nimport pandas as pd\nfrom sklearn.feature_extraction.text import TfidfVectorizer\n\n\ndef get_top_n_words(corpus, n=15):\n    '''\n    List the top n words in a vocabulary according to TF-IDF score in a text corpus.\n\n    Args:\n        corpus (list): a list of text documents.\n        n (int): number of top words to return (default 15, as used in the results below).\n    '''\n    assert isinstance(corpus, list), \"This must be a list!\"\n    assert isinstance(n, int), \"This must be an integer!\"\n\n    # Fit the vectorizer to the corpus; each row of the result is one document.\n    tfidf_vectorizer = TfidfVectorizer(use_idf=True)\n    tfidf_vectorizer_vectors = tfidf_vectorizer.fit_transform(corpus)\n\n    # Note: index 1 selects the second document in the corpus (index 0 is the first),\n    # so the scores below describe that document only.\n    selected_vector = tfidf_vectorizer_vectors[1]\n\n    # get_feature_names_out() replaces get_feature_names(), which was removed in scikit-learn 1.2.\n    df_tfidfvectorizer = pd.DataFrame(\n        selected_vector.T.toarray(),\n        index=tfidf_vectorizer.get_feature_names_out(),\n        columns=[\"tfidf\"],\n    )\n\n    commentsTF_IDF = df_tfidfvectorizer.sort_values(by=[\"tfidf\"], ascending=False)\n    return commentsTF_IDF.head(n)\n\n```\n\n## TF-IDF Results\n\nBelow are the top 15 words in each of the five videos.\n\n#### Video 
1\n\n```\n             tfidf\nmore      0.270853\nbe        0.241720\ninjured   0.230368\naffected  0.230368\nwould     0.214345\nthose     0.208332\ntheir     0.206575\nwoman     0.206575\nin        0.190632\nbeing     0.180569\neven      0.180569\nby        0.180569\nthan      0.155024\nit        0.150067\nmen       0.124610\n```\n\n#### Video 2\n```\n             tfidf\nmore      0.270853\nbe        0.241720\ninjured   0.230368\naffected  0.230368\nwould     0.214345\nthose     0.208332\ntheir     0.206575\nwoman     0.206575\nin        0.190632\nbeing     0.180569\neven      0.180569\nby        0.180569\nthan      0.155024\nit        0.150067\nmen       0.124610\n```\n\n#### Video 3\n```\n             tfidf\nmore      0.270853\nbe        0.241720\ninjured   0.230368\naffected  0.230368\nwould     0.214345\nthose     0.208332\ntheir     0.206575\nwoman     0.206575\nin        0.190632\nbeing     0.180569\neven      0.180569\nby        0.180569\nthan      0.155024\nit        0.150067\nmen       0.124610\n```\n\n#### Video 4\n```\n             tfidf\nmore      0.270853\nbe        0.241720\ninjured   0.230368\naffected  0.230368\nwould     0.214345\nthose     0.208332\ntheir     0.206575\nwoman     0.206575\nin        0.190632\nbeing     0.180569\neven      0.180569\nby        0.180569\nthan      0.155024\nit        0.150067\nmen       0.124610\n```\n\n#### Video 5\n```\n               tfidf\ntheir        0.37430\ntruckers     0.20114\nleaders      0.20114\nfreezing     0.20114\nefforts      0.20114\ngovt         0.20114\nottawa       0.20114\nbank         0.20114\ncanada       0.20114\nleast        0.20114\narresting    0.20114\nfundraising  0.20114\nisn          0.20114\naccounts     0.20114\nassociated   0.20114\n```\n\n#### Overall Top 15 Words\n```\n             tfidf\nbe        0.245173\ninjured   0.243024\naffected  0.243024\nmore      0.234565\nwould     0.223824\nwoman     0.211882\nin        0.198536\ntheir     0.189318\neven      0.185259\nthose     
0.185259\nbeing     0.175961\nthan      0.161941\nby        0.151469\ncombat    0.149216\nit        0.145994\n```\n\nThe TF-IDF analysis proved useful, as it shows the prevalence of words such as \"woman,\" \"injured,\" and \"affected\" in the comments. This suggests that users are engaged with their respective videos and have a lot to say about women.\n\n\n\u003chr\u003e\n\n\n## Decision Tree Hate Speech Detection Model\n\nTo train the hate speech detection model, I used Kaggle's [\"Hate Speech and Offensive Language Dataset\"](https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset).\n\nAfter importing NLTK, I loaded the labeled Kaggle data, preprocessed it, and split it, as done in [this tutorial](https://copyassignment.com/hate-speech-detection/).\n\nUsing scikit-learn's DecisionTreeClassifier, I trained the model on the training data and then evaluated it on the test data. The model achieved an accuracy of about 0.89, which is pretty good.\n\n```python\nbinClfr = []  # one entry per comment: 1 = offensive, 0 = not, 9 = unexpected label\nnumHate = 0  # a counter for the number of hate comments\nfor comment in commentsList:\n    inp = cv.transform([comment]).toarray()\n    pred = model.predict(inp)  # predict once and reuse the result\n    if pred[0] == 'Offensive Speech':\n        binClfr.append(1)  # add one if offensive\n        numHate += 1\n    elif pred[0] == 'No Hate and Offensive Speech':\n        binClfr.append(0)  # add zero if comment is not hate speech\n    else:\n        binClfr.append(9)  # add 9 if the output is neither label\n    print(pred)\n```\n\n## Hate Speech Detection Results\n```python\nprint(\"Fraction of hate speech comments: \" + str(numHate/len(binClfr)))\n```\nThe above code, meant to find the proportion of hate speech comments, yielded 0.26, meaning that the model classified just over one quarter of the comments on these videos as hate speech.\n\n\n\u003chr\u003e\n\n\n## Sentiment Analysis\n\nUsing 
[MeaningCloud](https://www.meaningcloud.com/), I ran sentiment analysis on the comments; the results can be found [here](https://github.com/ymorsi7/HateSpeechNLP/tree/main/csv/sentimentAnalysis.csv).\n\nThe two main numbers I observed were the irony and the agreement/disagreement proportions, but I paid more attention to the former. I found that over 97% of the comments were unironic, and that 35% of the comments showed disagreement.\n\nThe percentage of unironic comments suggests that the users are serious about their beliefs rather than just joking around, which is a rather common justification for hate speech.\n\n\u003chr\u003e\n\n\n## Conclusion\n\nThe NLP analysis shows that a significant percentage of comments on the videos relating to sexism on Rumble contain hate speech. The case studies above show that there are various types of sexism on the platform, but all in all, they come together to form a staggering 26% of the comments found on the videos, most of which are unironic.\n\n\n\u003chr\u003e\n\n\n## Bonus: Word Clouds\n\nUsing the Google Sheets extensions [MeaningCloud](https://www.meaningcloud.com/) and [ChartExpo](https://chartexpo.com/), I was able to create word clouds from extracted topics, deep categorization, text clustering, and IPTC text classification.\n\n### Deep Categorization\n\n![Deep Categorization](wordclouds/deepCategorization.png)\n\n### Text Clustering\n![Text Clustering](wordclouds/textClustering.png)\n\n### IPTC Text Classification\n![IPTC Text Classification](wordclouds/iptc.png)\n\n\n### Topics Extraction\n![Topics Extraction](wordclouds/topicExt.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fymorsi7%2Fhatespeechnlp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fymorsi7%2Fhatespeechnlp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fymorsi7%2Fhatespeechnlp/lists"}