https://github.com/Brand24-AI/mms_benchmark
The most extensive open massively multilingual corpus of datasets for training sentiment models. The corpus consists of 79 manually selected from over 350 datasets reported in the scientific literature based on strict quality criteria and covers 27 languages.
https://github.com/Brand24-AI/mms_benchmark
benchmarking data-quality dataset datasets leaderboard llm nlp sentiment-analysis sentiment-classification
Last synced: 2 months ago
JSON representation
The most extensive open massively multilingual corpus of datasets for training sentiment models. The corpus consists of 79 manually selected from over 350 datasets reported in the scientific literature based on strict quality criteria and covers 27 languages.
- Host: GitHub
- URL: https://github.com/Brand24-AI/mms_benchmark
- Owner: Brand24-AI
- License: other
- Created: 2023-06-09T10:36:24.000Z (about 3 years ago)
- Default Branch: main
- Last Pushed: 2023-11-14T11:02:37.000Z (over 2 years ago)
- Last Synced: 2024-01-29T18:15:27.331Z (over 2 years ago)
- Topics: benchmarking, data-quality, dataset, datasets, leaderboard, llm, nlp, sentiment-analysis, sentiment-classification
- Language: Jupyter Notebook
- Homepage: https://brand24-ai.github.io/mms_benchmark/
- Size: 2.6 MB
- Stars: 14
- Watchers: 0
- Forks: 0
- Open Issues: 2
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- awesome-sentiment-analysis - [github
README
MMS Dataset and Benchmark
================
Despite impressive advancements in multilingual corpora collection and
model training, developing large-scale deployments of multilingual
models still presents a significant challenge. This is particularly true
for language tasks that are culture-dependent. One such example is the
area of multilingual sentiment analysis, where affective markers can be
subtle and deeply ensconced in culture.
This work presents the most extensive open massively multilingual corpus
of datasets for training sentiment models. The corpus consists of 79
manually selected datasets from over 350 datasets reported in the
scientific literature based on strict quality criteria. The corpus
covers 27 languages representing 6 language families. Datasets can be
queried using several linguistic and functional features. In addition,
we present a multi-faceted sentiment classification benchmark
summarizing hundreds of experiments conducted on different base models,
training objectives, dataset collections, and fine-tuning strategies.
## Dataset
[Massively Multilingual Sentiment
Datasets](https://huggingface.co/datasets/Brand24/mms)
## Analysis and benchmarking
[HuggingFace Spaces with Analysis and
Benchmark](https://huggingface.co/spaces/Brand24/mms_benchmark)
## General statistics about the dataset
> It may take some time to download the dataset and generate train set
> inside HuggingFace dataset. Please be patient.
``` python
mms_dataset = datasets.load_dataset("Brand24/mms")
```
``` python
mms_dataset_df = mms_dataset["train"].to_pandas()
```
How many examples do we have?
``` python
mms_dataset.num_rows
```
{'train': 6164762}
## Features
We provide not only texts and sentiment labels but we assigned many
additional dimensions for datasets and languages, hence it is possible
to splice and dice them as you want and need.
``` python
mms_dataset["train"].features
```
{'_id': Value(dtype='int32', id=None),
'text': Value(dtype='string', id=None),
'label': ClassLabel(names=['negative', 'neutral', 'positive'], id=None),
'original_dataset': Value(dtype='string', id=None),
'domain': Value(dtype='string', id=None),
'language': Value(dtype='string', id=None),
'Family': Value(dtype='string', id=None),
'Genus': Value(dtype='string', id=None),
'Definite articles': Value(dtype='string', id=None),
'Indefinite articles': Value(dtype='string', id=None),
'Number of cases': Value(dtype='string', id=None),
'Order of subject, object, verb': Value(dtype='string', id=None),
'Negative morphemes': Value(dtype='string', id=None),
'Polar questions': Value(dtype='string', id=None),
'Position of negative word wrt SOV': Value(dtype='string', id=None),
'Prefixing vs suffixing': Value(dtype='string', id=None),
'Coding of nominal plurality': Value(dtype='string', id=None),
'Grammatical genders': Value(dtype='string', id=None),
'cleanlab_self_confidence': Value(dtype='float32', id=None)}
### Example
``` python
mms_dataset["train"][2001000]
```
{'_id': 2001000,
'text': 'I was a tomboy and this has such great memories for me. They fit exactly how I remember, PERFECTLY!!',
'label': 2,
'original_dataset': 'en_amazon',
'domain': 'reviews',
'language': 'en',
'Family': 'Indo-European',
'Genus': 'Germanic',
'Definite articles': 'definite word distinct from demonstrative',
'Indefinite articles': 'indefinite word distinct from one',
'Number of cases': '2',
'Order of subject, object, verb': 'SVO',
'Negative morphemes': 'negative particle',
'Polar questions': 'interrogative word order',
'Position of negative word wrt SOV': 'SNegVO',
'Prefixing vs suffixing': 'strongly suffixing',
'Coding of nominal plurality': 'plural suffix',
'Grammatical genders': 'no grammatical gender',
'cleanlab_self_confidence': 0.9978116750717163}
### Classes
``` python
labels = mms_dataset["train"].features["label"].names
labels
```
['negative', 'neutral', 'positive']
``` python
mms_dataset_df["label_name"] = mms_dataset_df["label"].apply(lambda x: labels[x])
```
### Classes distribution
``` python
labels_stats_df = pd.DataFrame(mms_dataset_df.label_name.value_counts())
labels_stats_df["percentage"] = (labels_stats_df["label_name"] / labels_stats_df["label_name"].sum()).round(3)
labels_stats_df
```
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
label_name
percentage
positive
3494478
0.567
neutral
1341354
0.218
negative
1328930
0.216
## Sentiment orientation for each language
``` python
cols = ['language', 'label_name']
mms_dataset_df[cols].value_counts().to_frame().reset_index().rename(columns={0: 'count'}).sort_values(by=cols, ascending=True)
```
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
language
label_name
count
7
ar
negative
138899
4
ar
neutral
192774
1
ar
positive
600402
53
bg
negative
13930
41
bg
neutral
28657
...
...
...
...
62
ur
neutral
8585
67
ur
positive
5836
9
zh
negative
117967
21
zh
neutral
69016
6
zh
positive
144719
81 rows × 3 columns
## Per language
``` python
cols = ['language']
mms_dataset_df[cols].value_counts().to_frame().reset_index().rename(columns={0: 'count'}).sort_values(by=cols, ascending=True)
```
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
language
count
1
ar
932075
15
bg
62150
20
bs
36183
8
cs
196287
4
de
315887
0
en
2330486
2
es
418712
23
fa
13525
6
fr
210631
25
he
8619
22
hi
16999
12
hr
77594
16
hu
56682
24
it
12065
7
ja
209780
26
lv
5790
5
pl
236688
9
pt
157834
11
ru
110930
17
sk
56623
10
sl
113543
18
sq
44284
13
sr
76368
19
sv
41346
14
th
72319
21
ur
19660
3
zh
331702
## Example of filtering datasets
### Choose only Polish
``` python
pl = mms_dataset.filter(lambda row: row['language'] == 'pl')
```
Filter: 0%| | 0/6164762 [00:00, ? examples/s]
``` python
pl["train"].to_pandas().sample(5)
```
.dataframe tbody tr th:only-of-type {
vertical-align: middle;
}
.dataframe tbody tr th {
vertical-align: top;
}
.dataframe thead th {
text-align: right;
}
_id
text
label
original_dataset
domain
language
Family
Genus
Definite articles
Indefinite articles
Number of cases
Order of subject, object, verb
Negative morphemes
Polar questions
Position of negative word wrt SOV
Prefixing vs suffixing
Coding of nominal plurality
Grammatical genders
cleanlab_self_confidence
215921
5119386
Typujcie jaki dziś będzie wynik St.Pats - Legi...
2
pl_twitter_sentiment
social_media
pl
Indo-European
Slavic
no article
no article
6-7
SVO
negative particle
question particle
SNegVO
strongly suffixing
plural suffix
masculine, feminine, neuter
0.589098
86525
4989990
@KaczmarSF Przyjemne ciarki mam, gdy patrzę na...
2
pl_twitter_sentiment
social_media
pl
Indo-European
Slavic
no article
no article
6-7
SVO
negative particle
question particle
SNegVO
strongly suffixing
plural suffix
masculine, feminine, neuter
0.950756
66031
4969496
szkoda bylo czasu i kasy .
0
pl_polemo
reviews
pl
Indo-European
Slavic
no article
no article
6-7
SVO
negative particle
question particle
SNegVO
strongly suffixing
plural suffix
masculine, feminine, neuter
0.940540
137768
5041233
@shinyvalentine mam ja w dupie lecz bylo to kr...
0
pl_twitter_sentiment
social_media
pl
Indo-European
Slavic
no article
no article
6-7
SVO
negative particle
question particle
SNegVO
strongly suffixing
plural suffix
masculine, feminine, neuter
0.220028
118766
5022231
@itiNieWracaj pokazują to gdzieś?
2
pl_twitter_sentiment
social_media
pl
Indo-European
Slavic
no article
no article
6-7
SVO
negative particle
question particle
SNegVO
strongly suffixing
plural suffix
masculine, feminine, neuter
0.139179
## Use cases
### Case 1
Thus, when training a sentiment classifier using our dataset, one may
download different facets of the collection. For instance, one can
download all datasets in `Slavic` languages in which polar questions are
formed using the interrogative word order or download all datasets from
the `Afro-Asiatic` language family with no morphological case-making.
``` python
slavic = mms_dataset.filter(lambda row: row["Genus"] == "Slavic" and row["Polar questions"] == "interrogative word order")
```
Filter: 0%| | 0/6164762 [00:00, ? examples/s]
``` python
slavic
```
DatasetDict({
train: Dataset({
features: ['_id', 'text', 'label', 'original_dataset', 'domain', 'language', 'Family', 'Genus', 'Definite articles', 'Indefinite articles', 'Number of cases', 'Order of subject, object, verb', 'Negative morphemes', 'Polar questions', 'Position of negative word wrt SOV', 'Prefixing vs suffixing', 'Coding of nominal plurality', 'Grammatical genders', 'cleanlab_self_confidence'],
num_rows: 252910
})
})
### Case 2
``` python
afro_asiatic = mms_dataset.filter(lambda row: row["Family"] == "Afro-Asiatic" and row["Number of cases"] == "no morphological case-making")
```
Filter: 0%| | 0/6164762 [00:00, ? examples/s]
``` python
afro_asiatic
```
DatasetDict({
train: Dataset({
features: ['_id', 'text', 'label', 'original_dataset', 'domain', 'language', 'Family', 'Genus', 'Definite articles', 'Indefinite articles', 'Number of cases', 'Order of subject, object, verb', 'Negative morphemes', 'Polar questions', 'Position of negative word wrt SOV', 'Prefixing vs suffixing', 'Coding of nominal plurality', 'Grammatical genders', 'cleanlab_self_confidence'],
num_rows: 8619
})
})
## Dataset Curators
The corpus was put together by
- [@laugustyniak](https://www.linkedin.com/in/lukaszaugustyniak/)
- [@swozniak](https://www.linkedin.com/in/wscode/)
- [@mgruza](https://www.linkedin.com/in/marcin-gruza-276b2512b/)
- [@pgramacki](https://www.linkedin.com/in/piotrgramacki/)
- [@krajda](https://www.linkedin.com/in/krzysztof-rajda/)
- [@mmorzy](https://www.linkedin.com/in/mikolajmorzy/)
- [@tkajdanowicz](https://www.linkedin.com/in/kajdanowicz/)
## Citation
``` bibtex
@misc{augustyniak2023massively,
title={Massively Multilingual Corpus of Sentiment Datasets and Multi-faceted Sentiment Classification Benchmark},
author={Łukasz Augustyniak and Szymon Woźniak and Marcin Gruza and Piotr Gramacki and Krzysztof Rajda and Mikołaj Morzy and Tomasz Kajdanowicz},
year={2023},
eprint={2306.07902},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
```
## Acknowledgements
- BRAND24 - https://brand24.com
- CLARIN-PL-Biz - https://clarin.biz
## Licensing Information
These data are released under this licensing scheme. We do not own any
text from which these data and datasets have been extracted.
We license the actual packaging of these data under the
Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)
https://creativecommons.org/licenses/by-nc/4.0/
This work is published from Poland.
Should you consider that our data contains material that is owned by you
and should, therefore not be reproduced here, please: \* Clearly
identify yourself with detailed contact data such as an address,
telephone number, or email address at which you can be contacted. \*
Clearly identify the copyrighted work claimed to be infringed. \*
Clearly identify the material claimed to be infringing and the
information reasonably sufficient to allow us to locate the material.
We will comply with legitimate requests by removing the affected sources
from the next release of the corpus.