https://github.com/ubisoft/ubisoft-laforge-toxplainerdataset

Dataset supporting "Unveiling Identity Biases in Toxicity Detection: A Game-Focused Dataset and Reactivity Analysis Approach"

Host: GitHub
URL: https://github.com/ubisoft/ubisoft-laforge-toxplainerdataset
Owner: ubisoft
License: other
Created: 2024-02-22T22:02:24.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-02-22T22:06:05.000Z (over 2 years ago)
Last Synced: 2025-07-28T04:17:04.006Z (11 months ago)
Size: 176 KB
Stars: 2
Watchers: 3
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: license.txt

# About this dataset

This dataset was created for two purposes:
1. To unveil identity biases in a toxicity detection model for written video game chat
2. To reflect the kinds of lines found in written game chat

The dataset contains a total of 16,008 lines, created from 22 sentence templates and a set of 46 identity-related terms.
A full description of the dataset creation method is available in the [EMNLP 2023 article](https://aclanthology.org/2023.emnlp-industry.26/).
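
As a rough illustration of how sentence templates and terms can combine into chat lines, here is a minimal sketch. The templates, the `<TAG1>`/`<TAG2>` placeholder syntax, the neutral terms, and the `fill_template` helper are all hypothetical and chosen only for the example; the actual templates, terms, and combination rules are described in the article.

```python
from itertools import product

# Hypothetical templates and terms, invented for illustration only.
# The real dataset uses 22 templates and 46 identity-related terms.
TEMPLATES = ["gg <TAG1>", "<TAG1> and <TAG2> carried the team"]
TERMS = ["healers", "snipers", "tanks"]


def fill_template(template, terms):
    """Return one row per way of filling the template's tags with terms."""
    # Count how many of the (assumed) tags appear in this template.
    n_tags = sum(f"<TAG{i}>" in template for i in (1, 2))
    rows = []
    for words in product(terms, repeat=n_tags):
        line = template
        for i, word in enumerate(words, start=1):
            line = line.replace(f"<TAG{i}>", word)
        rows.append({
            "chat_line": line,
            "template": template,
            "word1": words[0] if n_tags >= 1 else "",
            "word2": words[1] if n_tags == 2 else "",
        })
    return rows


rows = [row for t in TEMPLATES for row in fill_template(t, TERMS)]
print(len(rows))  # 3 one-tag lines + 9 two-tag lines = 12
```

This sketch fills every tag with every term, including pairs that repeat the same word. That is an assumption of the example, not a statement about how the published dataset was filtered.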

# Structure of the dataset

The dataset contains 10 columns:

- **chat_line**: synthetic chat line made from a sentence template and a term, or combination of terms, that may convey identity biases
- **template**: the sentence template used to create this chat line
- **word1**, **word2**: the words used to fill the tags in the sentence template
- **lem1**, **lem2**: the lemmatized forms of word1 and word2
- **cat1**, **cat2**: the categories associated with word1 and word2
- **manual_annotations**: toxicity annotations obtained from human annotators. Only 1,363 lines have a value in this column.
- **annotations**: the ground truth labels. These labels were obtained by propagating the manual annotations to the full dataset with a random forest algorithm trained on the 1,363 manually annotated lines.

In both the **manual_annotations** and **annotations** columns:

- 0 = non-toxic line
- 1 = toxic line
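
The following sketch shows one way to separate manually annotated lines from lines that carry only propagated labels. It assumes the data is available as CSV with the ten column names listed above and that lines without a manual annotation have an empty cell; neither assumption is confirmed by this README. The `SAMPLE` rows and the `load_rows` helper are invented for the example.

```python
import csv
import io

# Tiny in-memory stand-in for the dataset file; all rows are invented.
SAMPLE = """chat_line,template,word1,word2,lem1,lem2,cat1,cat2,manual_annotations,annotations
example line a,template a,w1,,l1,,c1,,0,0
example line b,template b,w2,w3,l2,l3,c2,c3,,1
example line c,template a,w4,,l4,,c1,,1,1
"""


def load_rows(handle):
    """Parse rows, turning label columns into ints (None when blank)."""
    rows = []
    for row in csv.DictReader(handle):
        manual = row["manual_annotations"].strip()
        # int(float(...)) accepts both "1" and "1.0" style cells.
        row["manual_annotations"] = int(float(manual)) if manual else None
        row["annotations"] = int(float(row["annotations"]))
        rows.append(row)
    return rows


rows = load_rows(io.StringIO(SAMPLE))

# Lines labeled by humans versus lines with propagated labels only.
manual = [r for r in rows if r["manual_annotations"] is not None]
propagated = [r for r in rows if r["manual_annotations"] is None]

# Share of toxic lines (label 1) according to the ground truth column.
toxic_share = sum(r["annotations"] for r in rows) / len(rows)

# Fraction of manually annotated lines whose two labels match.
agreement = sum(r["manual_annotations"] == r["annotations"] for r in manual) / len(manual)

print(len(manual), len(propagated), round(toxic_share, 2), agreement)
```

To run this against the real data, replace `io.StringIO(SAMPLE)` with an open file handle for the dataset file in the repository, adjusting the delimiter if the file is not comma-separated.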

# Cite this dataset

If you use this dataset, please cite the following paper:

Van Dorpe, J., Yang, Z., Grenon-Godbout, N., & Winterstein, G. (2023). Unveiling Identity Biases in Toxicity Detection: A Game-Focused Dataset and Reactivity Analysis Approach. In M. Wang & I. Zitouni (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track (pp. 263–274). Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.emnlp-industry.26
