https://github.com/plandes/edusenti

# EduSenti: Education Review Sentiment in Albanian

[![PyPI][pypi-badge]][pypi-link]
[![Python 3.10][python3100-badge]][python3100-link]
[![Python 3.11][python311-badge]][python311-link]

Pretraining and sentiment corpora of student-to-instructor reviews, and their
analysis, in Albanian. This repository contains the code base for the paper
[RoBERTa Low Resource Fine Tuning for Sentiment Analysis in Albanian]. To
reproduce the results, see the paper [reproduction repository]. If you use our
model or API, please [cite](#citation) our paper.

## Table of Contents

- [Obtaining](#obtaining)
- [Usage](#usage)
- [API](#api)
- [Models](#models)
- [Differences from the Paper Repository](#differences-from-the-paper-repository)
- [Documentation](#documentation)
- [Changelog](#changelog)
- [Citation](#citation)
- [License](#license)

## Obtaining

The library can be installed with pip from the [pypi] repository:
```bash
pip3 install zensols.edusenti
```

The [models](#models) are downloaded on the first use of the command-line or
API.

## Usage

Command line:
```bash
$ edusenti predict sq.txt
(+):
(-):
(+):
...
```

Use the `csv` action to write all predictions to a comma-delimited file (see
`edusenti --help`).

## API

```python
>>> from zensols.edusenti import (
...     ApplicationFactory, Application, SentimentFeatureDocument
... )
>>> app: Application = ApplicationFactory.get_application()
>>> doc: SentimentFeatureDocument
>>> for doc in app.predict(['Kjo gjendje ka vazhduar edhe në kohën e provimeve']):
...     print(f'sentence: {doc.text}')
...     print(f'prediction: {doc.pred}')
...     print(f'logits: {doc.softmax_logit}')
...
sentence: Kjo gjendje ka vazhduar edhe në kohën e provimeve
prediction: +
logits: {'+': 0.70292175, '-': 0.17432323, 'n': 0.12275504}
```
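The label-to-probability mapping printed above can be post-processed with plain Python. The following is a minimal sketch and not part of the `zensols.edusenti` API: `top_label` and `label_counts` are hypothetical helper names, and the only assumption taken from the example output is that each prediction carries a label string and a dictionary of per-label probabilities.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def top_label(probs: Dict[str, float]) -> Tuple[str, float]:
    """Return the label with the highest probability and that probability."""
    label = max(probs, key=probs.get)
    return label, probs[label]


def label_counts(labels: Iterable[str]) -> Counter:
    """Count how often each sentiment label occurs."""
    return Counter(labels)


# values copied from the example output above
probs = {'+': 0.70292175, '-': 0.17432323, 'n': 0.12275504}
label, confidence = top_label(probs)
print(f'{label} ({confidence:.2f})')

# hypothetical labels, e.g. collected with [doc.pred for doc in app.predict(...)]
print(label_counts(['+', '-', '+', 'n']))
```

Keeping the helpers independent of the library objects means they can be tried without downloading a model.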

## Models

The [models] are downloaded the first time the API is used. To change the
model (by default `xlm-roberta-base` is used) on the command-line, use
`--override esi_default.model_namel=xlm-roberta-large`. You can also create a
`~/.edusentirc` file with the following:

```ini
[esi_default]
model_namel = xlm-roberta-large
```
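The same override can also be generated from Python with the standard library `configparser` module. This is a minimal sketch: the section and option names are copied verbatim from the snippet above, the helper name `write_model_override` is hypothetical, and the example writes to a temporary directory rather than to `~/.edusentirc` so that it does not touch an existing configuration.

```python
import configparser
import tempfile
from pathlib import Path


def write_model_override(path: Path, model: str) -> None:
    """Write an ``esi_default`` section that selects ``model``."""
    config = configparser.ConfigParser()
    # option name copied verbatim from the README snippet
    config['esi_default'] = {'model_namel': model}
    with open(path, 'w') as f:
        config.write(f)


# write to a scratch file; use Path.home() / '.edusentirc' to apply it for real
with tempfile.TemporaryDirectory() as tmp:
    rc_path = Path(tmp) / '.edusentirc'
    write_model_override(rc_path, 'xlm-roberta-large')
    content = rc_path.read_text()
    print(content)
```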

The performance of the trained and validated models on the test set is shown
below.

| Model | F1 | Precision | Recall |
|:--------------------|-----:|----------:|-------:|
| `xlm-roberta-base` | 78.1 | 80.7 | 79.7 |
| `xlm-roberta-large` | 83.5 | 84.9 | 84.7 |

However, the distributed models were trained on the training and test sets
combined. The validation metrics of those trained models are available on the
command line with `edusenti info`.

## Differences from the Paper Repository

The paper [reproduction repository] has quite a few differences, mostly around
reproducibility. However, this repository is designed to be a package used for
research that applies the model. To reproduce the results of the paper, please
refer to the [reproduction repository]. To use the best performing model
(XLM-RoBERTa Large) from that paper, use this repository.

The primary difference is that this repository has significantly better
performance in Albanian: F1 climbed from 71.9 to 83.5 (see [models](#models)).
However, this repository has no English sentiment model, since that model was
only used for comparing methods.
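To put the quoted improvement in perspective, the arithmetic can be worked out as below. This is only an illustration using the two F1 scores stated above; because the dataset was also re-split, the two numbers come from different test sets and the comparison is indicative rather than exact.

```python
# F1 scores quoted in the text above
paper_f1 = 71.9
package_f1 = 83.5

# gain in F1 points and relative to the paper's score
absolute_gain = round(package_f1 - paper_f1, 1)
relative_gain = round(100 * (package_f1 - paper_f1) / paper_f1, 1)

print(f'absolute gain: {absolute_gain} points')
print(f'relative gain: {relative_gain}%')
```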

Changes include:

* Python was upgraded from 3.9.9 to 3.11.6
* PyTorch was upgraded from 1.12.1 to 2.1.1
* HuggingFace transformers was upgraded from 4.19 to 4.35
* [zensols.deepnlp] was upgraded from 1.8 to 1.13
* The dataset was re-split and stratified
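A local environment can be compared against the versions in the list above with the standard library `importlib.metadata` module. This is a minimal sketch under stated assumptions: the distribution names `torch`, `transformers` and `zensols.deepnlp` are assumed to be the PyPI names of the listed packages, the helper names are hypothetical, and only the major and minor version numbers are compared.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# versions reported in the list above, keyed by (assumed) distribution name
LISTED = {
    'torch': (2, 1),
    'transformers': (4, 35),
    'zensols.deepnlp': (1, 13),
}


def major_minor(version: str) -> Tuple[int, int]:
    """Parse the leading ``major.minor`` of a version string such as 2.1.1."""
    match = re.match(r'(\d+)(?:\.(\d+))?', version)
    if match is None:
        raise ValueError(f'unparsable version: {version}')
    return int(match.group(1)), int(match.group(2) or 0)


def installed(dist: str) -> Optional[Tuple[int, int]]:
    """Return the installed major/minor version, or None if not installed."""
    try:
        return major_minor(metadata.version(dist))
    except metadata.PackageNotFoundError:
        return None


for dist, listed in LISTED.items():
    found = installed(dist)
    if found is None:
        status = 'not installed'
    elif found >= listed:
        status = 'ok'
    else:
        status = 'older than listed'
    print(f'{dist}: {found} ({status})')
```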

## Documentation

See the [full documentation](https://plandes.github.io/edusenti/index.html).
The [API reference](https://plandes.github.io/edusenti/api.html) is also
available.

## Changelog

An extensive changelog is available [here](CHANGELOG.md).

## Citation

If you use this project in your research, please use the following BibTeX entry:

```bibtex
@inproceedings{nuci-etal-2024-roberta-low,
    title = "{R}o{BERT}a Low Resource Fine Tuning for Sentiment Analysis in {A}lbanian",
    author = "Nuci, Krenare Pireva and
      Landes, Paul and
      Di Eugenio, Barbara",
    editor = "Calzolari, Nicoletta and
      Kan, Min-Yen and
      Hoste, Veronique and
      Lenci, Alessandro and
      Sakti, Sakriani and
      Xue, Nianwen",
    booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
    month = may,
    year = "2024",
    address = "Torino, Italy",
    publisher = "ELRA and ICCL",
    url = "https://aclanthology.org/2024.lrec-main.1233",
    pages = "14146--14151"
}
```

## License

[MIT License](LICENSE.md)

Copyright (c) 2023 - 2024 Paul Landes and Krenare Pireva Nuci

[pypi]: https://pypi.org/project/zensols.edusenti/
[pypi-link]: https://pypi.python.org/pypi/zensols.edusenti
[pypi-badge]: https://img.shields.io/pypi/v/zensols.edusenti.svg
[python3100-badge]: https://img.shields.io/badge/python-3.10-blue.svg
[python3100-link]: https://www.python.org/downloads/release/python-3100
[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg
[python311-link]: https://www.python.org/downloads/release/python-3110

[RoBERTa Low Resource Fine Tuning for Sentiment Analysis in Albanian]: https://aclanthology.org/2024.lrec-main.1233.pdf
[reproduction repository]: https://github.com/uic-nlp-lab/edusenti
[models]: https://zenodo.org/records/10795173
[zensols.deepnlp]: https://github.com/plandes/deepnlp