Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/plandes/edusenti
EduSenti: Education Review Sentiment in Albanian
https://github.com/plandes/edusenti
ai albanian albanian-language model natural-language-processing paper sentiment-analysis
Last synced: 6 days ago
JSON representation
EduSenti: Education Review Sentiment in Albanian
- Host: GitHub
- URL: https://github.com/plandes/edusenti
- Owner: plandes
- License: mit
- Created: 2024-03-08T04:16:41.000Z (8 months ago)
- Default Branch: master
- Last Pushed: 2024-05-20T03:47:59.000Z (6 months ago)
- Last Synced: 2024-09-26T09:26:06.285Z (about 2 months ago)
- Topics: ai, albanian, albanian-language, model, natural-language-processing, paper, sentiment-analysis
- Language: Python
- Homepage:
- Size: 194 KB
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE.md
Awesome Lists containing this project
README
# EduSenti: Education Review Sentiment in Albanian
[![PyPI][pypi-badge]][pypi-link]
[![Python 3.10][python3100-badge]][python3100-link]
[![Python 3.11][python311-badge]][python311-link]Pretraining and sentiment student to instructor review corpora and analysis in
Albanian. This repository contains the code base to be used for the paper
[RoBERTa Low Resource Fine Tuning for Sentiment Analysis in Albanian]. To
reproduce the results, see the paper [reproduction repository]. If you use our
model or API, please [cite](#citation) our paper.## Table of Contents
- [Obtaining](#obtaining)
- [Usage](#usage)
- [API](#api)
- [Models](#models)
- [Differences from the Paper Repository](#differences-from-the-paper-repository)
- [Documentation](#documentation)
- [Changelog](#changelog)
- [Citation](#citation)
- [License](#license)## Obtaining
The library can be installed with pip from the [pypi] repository:
```bash
pip3 install zensols.edusenti
```The [models](#models) are downloaded on the first use of the command-line or
API.## Usage
Command line:
```bash
$ edusenti predict sq.txt
(+):
(-):
(+):
...
```Use the `csv` action to write all predictions to a comma-delimited file (use
`edusent --help`).## API
```python
>>> from zensols.edusenti import (
>>> ApplicationFactory, Application, SentimentFeatureDocument
>>> )
>>> app: Application = ApplicationFactory.get_application()
>>> doc: SentimentFeatureDocument
>>> for doc in app.predict(['Kjo gjendje ka vazhduar edhe në kohën e provimeve']):
>>> print(f'sentence: {doc.text}')
>>> print(f'prediction: {doc.pred}')
>>> print(f'prediction: {doc.softmax_logit}')sentence: Kjo gjendje ka vazhduar edhe në kohën e provimeve
prediction: +
logits: {'+': 0.70292175, '-': 0.17432323, 'n': 0.12275504}
```## Models
The [models] are downloaded the first time the API is used. To change the
model (by default `xlm-roberta-base` is used) on the command-line, use
`--override esi_default.model_namel=xlm-roberta-large`. You can also create a
`~/.edusentirc` file with the following:```ini
[esi_default]
model_namel = xlm-roberta-large
```Performance of the models on the test set when trained and validated are below.
| Model | F1 | Precision | Recall |
|:--------------------|-----:|----------:|-------:|
| `xlm-roberta-base` | 78.1 | 80.7 | 79.7 |
| `xlm-roberta-large` | 83.5 | 84.9 | 84.7 |However, the distributed models were trained on the training and test sets
combined. The validation metrics of those trained models are available on the
command line with `edusenti info`.## Differences from the Paper Repository
The paper [reproduction repository] has quite a few differences, mostly around
reproducibility. However, this repository is designed to be a package used for
research that applies the model. To reproduce the results of the paper, please
refer to the [reproduction repository]. To use the best performing model
(XLM-RoBERTa Large) from that paper, then use this repository.The primary difference is this repo has significantly better performance in
Albanian, which climbed from from F1 71.9 to 83.5 (see [models](#models)).
However, this repository has no English sentiment model since it was only used
for comparing methods.Changes include:
* Python was upgraded from 3.9.9 to 3.11.6
* PyTorch was upgraded from 1.12.1 to 2.1.1
* HuggingFace transformers was upgraded from 4.19 to 4.35
* [zensols.deepnlp] was upgraded from 1.8 to 1.13
* The dataset was re-split and stratified.## Documentation
See the [full documentation](https://plandes.github.io/edusenti/index.html).
The [API reference](https://plandes.github.io/edusenti/api.html) is also
available.## Changelog
An extensive changelog is available [here](CHANGELOG.md).
## Citation
If you use this project in your research please use the following BibTeX entry:
```bibtex
@inproceedings{nuci-etal-2024-roberta-low,
title = "{R}o{BERT}a Low Resource Fine Tuning for Sentiment Analysis in {A}lbanian",
author = "Nuci, Krenare Pireva and
Landes, Paul and
Di Eugenio, Barbara",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italy",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.1233",
pages = "14146--14151"
}
```## License
[MIT License](LICENSE.md)
Copyright (c) 2023 - 2024 Paul Landes and Krenare Pireva Nuci
[pypi]: https://pypi.org/project/zensols.edusenti/
[pypi-link]: https://pypi.python.org/pypi/zensols.edusenti
[pypi-badge]: https://img.shields.io/pypi/v/zensols.edusenti.svg
[python3100-badge]: https://img.shields.io/badge/python-3.10-blue.svg
[python3100-link]: https://www.python.org/downloads/release/python-3100
[python311-badge]: https://img.shields.io/badge/python-3.11-blue.svg
[python311-link]: https://www.python.org/downloads/release/python-3110[RoBERTa Low Resource Fine Tuning for Sentiment Analysis in Albanian]: https://aclanthology.org/2024.lrec-main.1233.pdf
[reproduction repository]: https://github.com/uic-nlp-lab/edusenti
[models]: https://zenodo.org/records/10795173
[zensols.deepnlp]: https://github.com/plandes/deepnlp