https://github.com/ananyaarun/structure-based-hate-speech-detection
Building a structure-based hate-speech detection system using NLP tools and ML models as part of IRE 2020 Final Project.
- Host: GitHub
- URL: https://github.com/ananyaarun/structure-based-hate-speech-detection
- Owner: ananyaarun
- Created: 2020-10-17T15:18:05.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2020-11-21T16:24:40.000Z (about 4 years ago)
- Last Synced: 2024-11-09T09:53:11.397Z (2 months ago)
- Language: Jupyter Notebook
- Homepage:
- Size: 56.6 MB
- Stars: 3
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Structure-Based-Hate-Speech-Detection
## IRE Major Project

### Team 26
- Ananya Arun
- Vijayraj S
- Sumit Bhuin
- Virat Mishra

### Abstract
The main objective of our project is "structure-based hate speech detection". Traditional methods for hate speech detection use large amounts of training data to mine hateful structure, but because different terms appear with disproportionate frequency, they are prone to learning bias against specific objects, personalities or groups. The idea is to propose a method that takes the grammatical structure of the sentence into account when predicting hatefulness.
Hate speech is commonly defined as any communication that disparages a person or a group on the basis of some characteristic such as race, colour, ethnicity, gender, sexual orientation, nationality, religion, or other characteristics. With the rise of social media and user-generated content, detecting and classifying hate speech is becoming increasingly important. To automate hate-speech detection, we look at a system that uses the grammatical structure of the sentence as features to classify it as hate speech or not, in order to avoid bias towards certain named entities.
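One way to picture structure-based features is to count n-grams over the part-of-speech tags of a sentence and ignore the words themselves. The sketch below is illustrative only and is not taken from the project's notebooks: `pos_ngram_features` is a hypothetical helper, and the tags are hand-assigned Penn Treebank-style tags standing in for the output of a real tagger.

```python
from collections import Counter
from typing import List, Tuple


def pos_ngram_features(tagged: List[Tuple[str, str]], n: int = 2) -> Counter:
    """Count n-grams over the POS-tag sequence, ignoring the words.

    `tagged` is a list of (token, tag) pairs. The tags are assumed to
    come from an external POS tagger; this function does no tagging.
    """
    tags = [tag for _, tag in tagged]
    # Slide a window of length n over the tag sequence.
    return Counter(tuple(tags[i:i + n]) for i in range(len(tags) - n + 1))


# Hand-tagged toy sentences (assumption: tags assigned manually here).
# They differ only in the named entity, so the tag sequences match.
sentence_a = [("They", "PRP"), ("despise", "VBP"), ("GroupA", "NNP")]
sentence_b = [("They", "PRP"), ("despise", "VBP"), ("GroupB", "NNP")]

features_a = pos_ngram_features(sentence_a)
features_b = pos_ngram_features(sentence_b)
```

Because both sentences map to the same tag bigrams, a classifier trained on these features cannot tell `GroupA` from `GroupB`, which is the kind of entity-independence the abstract aims for.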
### Datasets
The first dataset is text extracted from Stormfront, a white supremacist forum (https://github.com/Vicomtech/hate-speech-dataset). It contains 10495 sentences, each labelled as hate or non-hate. The second dataset (https://github.com/t-davidson/hate-speech-and-offensive-language) was used in the paper Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. "Automated Hate Speech Detection and the Problem of Offensive Language." ICWSM. It consists of 25,297 tweets, each labelled as hate speech, offensive language, or neither.
For both datasets, we ran all the models on both the naive and the oversampled data. To include structure-based analysis, we also ran our models after stemming and/or POS tagging.
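The README does not say which oversampling method was used. As a minimal sketch, assuming plain random oversampling (duplicating minority-class examples until every class matches the largest one), the step could look like this. `random_oversample` is a hypothetical helper, not a function from the project.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def random_oversample(
    texts: Sequence[str], labels: Sequence[str], seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Duplicate randomly chosen minority-class examples until all
    classes have as many examples as the largest class."""
    rng = random.Random(seed)  # fixed seed for reproducibility

    # Group the examples by their label.
    by_label: Dict[str, List[str]] = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    if not by_label:
        return [], []

    target = max(len(items) for items in by_label.values())
    out_texts: List[str] = []
    out_labels: List[str] = []
    for label, items in by_label.items():
        # Sample with replacement to fill the gap up to the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_texts.extend(items + extra)
        out_labels.extend([label] * target)
    return out_texts, out_labels


# Toy, imbalanced data (placeholders, not drawn from either dataset).
texts = ["s1", "s2", "s3", "s4"]
labels = ["non-hate", "non-hate", "non-hate", "hate"]
balanced_texts, balanced_labels = random_oversample(texts, labels)
class_counts = Counter(balanced_labels)
```

Oversampling is normally applied to the training split only, so that duplicated examples do not leak into the evaluation data.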
### Models Implemented
We built baseline models for reference and analysis, which include:
- Naive Bayes
- SVM
- Logistic Regression
- Decision Trees
- N-grams

The DL models we have implemented are:
- LSTM
- CNN
- BERT
- Tree LSTM

Complete results and analysis are available in our Report.