https://github.com/AlessandroGianfelici/awesome-italian

# Awesome Italian
A list of awesome NLP resources for the Italian language.

## Corpora
### Sentiment Analysis
* [Sentipolc2016](http://www.di.unito.it/~tutreeb/sentipolc-evalita16/data.html) - Dataset for the Evalita Sentipolc competition, 2016 edition.

* [Absita2018](http://sag.art.uniroma2.it/absita/data/) - Booking-crawled dataset for the Evalita Absita competition, 2018 edition.

* [Italian review dataset](https://github.com/AlessandroGianfelici/italian_reviews_dataset) - Trustpilot-crawled dataset with 146,910 reviews.

* [Happy Parents](https://github.com/mirkolai/Happy-Parents) - Annotated datasets of parent-to-parent and parent-to-child dialogues.

* [Italian Sentiment Analysis](https://github.com/nicolaCirillo/italian-sentiment-analysis) - Smartphone review dataset.

* [Distributional Polarity Lexicon](http://sag.art.uniroma2.it/demo-software/distributional-polarity-lexicon/) - Annotated dataset of sentiment polarity for short (i.e. few words) expressions.

* [SentiML](http://corpus.leeds.ac.uk/marilena/SentiML/) - a collection of documents annotated to identify sentiment at the sentence level.

* [Sentic](https://sentic.net/downloads/) - multi-lingual sentiment analysis dataset.

* [TWITA](http://valeriobasile.github.io/twita/downloads.html) - dataset of Italian tweets.

### Hate speech recognition
* [HaSpeeDe](https://github.com/msang/haspeede) - Dataset for the Evalita Hate Speech Detection competition, 2018 and 2020 editions.

* [IHSC](https://github.com/msang/hate-speech-corpus) - Twitter corpus built with the aim of representing and analyzing hate speech against some minority groups in Italy.

* [WhatsApp Dataset](https://github.com/dhfbk/WhatsApp-Dataset) - WhatsApp dataset for studying cyberbullying among Italian students aged 12-13, collected in the context of the CREEP EIT project.

### Irony detection
* [Irony and Tweets](https://github.com/Jihen-Karoui/French-Italian-and-English-Corpora) - labeled dataset of ironic tweets in several languages.

* [IronITA 2018](http://www.di.unito.it/~tutreeb/ironita-evalita18/data.html) - dataset for the IronITA (Irony Detection in Italian Tweets) competition, organised within Evalita 2018.

### Part of speech tagging
* [PoS-Tagging Evalita 2009](http://medialab.di.unipi.it/evalita/) - Annotated PoS tagging dataset for the Evalita 2009 competition.

### Named Entity Recognition
* [I-CAB](https://ontotext.fbk.eu/icab.html/) - Corpus of annotated articles from "L'Adige" for NER tasks.

* [PAISA](https://www.corpusitaliano.it/) - Corpus of annotated articles scraped from the web.

* [itWaC](https://wacky.sslmit.unibo.it/doku.php?id=corpora) - a 2 billion word corpus constructed from the Web by limiting the crawl to the .it domain, using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds.

### Linguistic Complexity
* [Italian Complexity Dataset](http://www.italianlp.it/resources/corpus-of-sentences-rated-with-human-complexity-judgments/) - 1,123 Italian sentences rated by humans with a judgment of complexity.

### Parallel corpora
* [Europarl](https://www.statmt.org/europarl/) - parallel sentences in Italian and English from the European Parliament.

* [PaCCSS-IT](http://www.italianlp.it/resources/paccss-it-parallel-corpus-of-complex-simple-sentences-for-italian/) - Parallel Corpus of Complex-Simple Sentences for ITalian.

### Spoken language corpora
* [kiparla](http://kiparla.it/il-corpus/) - The largest corpus of spoken Italian available so far (for research purposes only).

### Word collections
* [paroleitaliane](https://github.com/napolux/paroleitaliane) - Lists of Italian words on different topics and from several sources.
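
Word collections like this are usually consumed as plain lookup sets. Below is a minimal sketch of that, assuming a UTF-8 text file with one word per line; the file layout and the helper names (`normalize`, `load_words`, `is_known`) are illustrative assumptions, not something documented by the linked repository.

```python
from __future__ import annotations

import unicodedata
from pathlib import Path


def normalize(word: str) -> str:
    """Strip whitespace, normalize to NFC and casefold the word.

    NFC matters for Italian because accented letters such as "é" can be
    encoded either as one code point or as "e" plus a combining accent.
    """
    return unicodedata.normalize("NFC", word.strip()).casefold()


def load_words(path: str | Path) -> set[str]:
    """Read a word list (assumed: UTF-8, one word per line) into a set."""
    text = Path(path).read_text(encoding="utf-8")
    # Skip blank lines; duplicates collapse automatically in the set.
    return {normalize(line) for line in text.splitlines() if line.strip()}


def is_known(word: str, vocabulary: set[str]) -> bool:
    """Case-insensitive, accent-encoding-insensitive membership test."""
    return normalize(word) in vocabulary
```

With a hypothetical local file `parole.txt`, `vocabulary = load_words("parole.txt")` builds the set once, and `is_known("Perché", vocabulary)` then checks a word regardless of its case or of how its accent is encoded.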

## Models
### Sentiment Analysis
* [SentITA](https://github.com/NicGian/SentITA/) - a Bidirectional LSTM-CNN that operates at the word level for sentiment polarity classification.

* [Feel-IT](https://github.com/MilaNLProc/feel-it/) - a BERT-based sentiment and emotion classifier for Italian.

### Text summarization
* [multilang-summarizer](https://pypi.org/project/multilang-summarizer/) - A multilingual text summarization model partially supported by the National Council of Science and Technology (CONACYT) of Mexico.

### Language Models
* [UmBERTo](https://github.com/musixmatchresearch/umberto/) - a RoBERTa-based Language Model trained on large Italian corpora.

## Useful libraries
### Only Italian
* [italian-dictionary](https://pypi.org/project/italian-dictionary/) - a Python library to retrieve the meaning of Italian lemmas.

### Multilingual (also supporting Italian)
* [spaCy](https://spacy.io/) - a general-purpose Python NLP library.
* [NLTK](https://www.nltk.org/) - the Natural Language Toolkit library.

## Contributing
If you know of any other awesome Italian language resource that is not yet included in this list, feel free to fork this repo and open a pull request or, if you prefer, open an issue [here](https://github.com/AlessandroGianfelici/awesome-italian/issues).