Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

Awesome Lists | Featured Topics | Projects

https://github.com/gesiscss/sexism_custom_classifier

Custom classifiers to detect sexist language.
https://github.com/gesiscss/sexism_custom_classifier

bert natural-language-processing nlp sckit-learn sexism-detection

Last synced: about 2 months ago
JSON representation

Custom classifiers to detect sexist language.

Host: GitHub
URL: https://github.com/gesiscss/sexism_custom_classifier
Owner: gesiscss
License: mit
Created: 2020-08-13T10:29:34.000Z (over 4 years ago)
Default Branch: master
Last Pushed: 2021-02-11T11:36:22.000Z (almost 4 years ago)
Last Synced: 2023-03-04T03:38:43.412Z (almost 2 years ago)
Topics: bert, natural-language-processing, nlp, sckit-learn, sexism-detection
Language: Jupyter Notebook
Homepage:
Size: 32.2 MB
Stars: 5
Watchers: 13
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE

Awesome Lists containing this project

README

[![Binder](https://notebooks.gesis.org/binder/badge.svg)](https://notebooks.gesis.org/binder/v2/gh/gesiscss/sexism_custom_classifier/master?filepath=notebooks)

# Sexism Custom Classifier

## Introduction

Sexism custom classifier is implemented to detect sexism automatically in the field of natural language processing. In this work, the dataset from [Samory et al. (2020)](https://arxiv.org/abs/2004.12764) was used to train three models on top of four different feature sets (both individual and combinations). The aim of the experiments is to answer two research questions:

* What is the informativeness of different feature sets on the sexism detection task?
* What are the improvements on the sexism classifiers’ performance when different feature sets are introduced?

## Support Features

* Sentiment : The sentiment intensity provided by [VADER](http://eegilbert.org/papers/icwsm14.vader.hutto.pdf) sentiment analysis method.
* Word N-grams : Term Frequency/Inverse Frequency (TF-IDF) weights of word n-grams proviided by [scikit-learn](https://jmlr.csail.mit.edu/papers/v12/pedregosa11a.html).
* Type Dependency : The type dependency relationships provided by [Stanford Parser](https://nlp.stanford.edu/~wcmac/papers/td-lrec06.pdf). The parser can be dowloaded from [here](https://nlp.stanford.edu/software/lex-parser.shtml#Download).
* Document Embeddings : The document embeddings provided by [BERT](https://arxiv.org/abs/1810.04805) language representation model.

## Support Models

* Logistic Regression
* Convolutional Neural Network
* Support Vector Machine

## Experiments

This example code trains Logistic Regression on top of sentiment and uni-gram features. To replicate the results, run the scripts in the 'experiments/scripts.txt'

```shell
export PARAMS_FILE='experiments/params.json'
export HYPERPARAMS_FILE='experiments/hyperparams.json'

python run.py \
--params_file=$PARAMS_FILE \
--hyperparams_file=$HYPERPARAMS_FILE \
```

This example code obtains the pre-trained BERT embeddings.

```shell
export DATA_FILE=/path/data.csv

python run_bert_feature_extraction.py \
--data_file=$DATA_FILE \
```