awesome-italian

A list of awesome NLP resources for the Italian language.
https://github.com/AlessandroGianfelici/awesome-italian

Corpora
- Sentiment Analysis
  - Italian review dataset - Trustpilot-crawled dataset with 146,910 reviews.
  - Happy Parents - Annotated datasets of parent-to-parent and parent-to-child dialogues.
  - Italian Sentiment Analysis - Smartphone review dataset.
  - Sentipolc2016 - Dataset for the Evalita Sentipolc competition, 2016 edition.
  - Absita2018 - Booking-crawled dataset for the Evalita Absita competition, 2018 edition.
  - Distributional Polarity Lexicon - Annotated dataset of sentiment polarity for short expressions (i.e. a few words).
  - SentiML - a collection of documents annotated to identify sentiment at the sentence level.
  - Sentic - multi-lingual sentiment analysis dataset.
  - TWITA - dataset of Italian tweets.
- Hate speech recognition
  - HaSpeeDe - Dataset for the Evalita Hate Speech Detection competition, 2018 and 2020 editions.
  - IHSC - Twitter corpus built with the aim of representing and analyzing hate speech against some minority groups in Italy.
  - WhatsApp Dataset - WhatsApp dataset to study cyberbullying among Italian students aged 12-13 in the context of the CREEP EIT project.
- Irony detection
  - Irony and Tweets - labeled dataset of ironic tweets in several languages.
  - IronITA 2018 - dataset for the IronITA (Irony Detection in Italian Tweets) competition, organised within Evalita 2018.
- Word collections
  - paroleitaliane - Lists of Italian words about different topics and from several sources.
- Part of speech tagging
  - PoS-Tagging Evalita 2009 - Annotated PoS tagging dataset for the Evalita 2009 competition.
- Named Entity Recognition
  - I-CAB - Corpus of annotated articles from "L'Adige" for NER tasks.
  - PAISA - Corpus of annotated articles scraped from the web.
  - itWaC - a 2 billion word corpus constructed from the Web limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds.
- Linguistic Complexity
  - Italian Complexity Dataset - 1,123 Italian sentences, each rated by humans for complexity.
- Parallel corpora
  - Europarl - parallel sentences between Italian and English from the European Parliament.
  - PaCCSS-IT - Parallel Corpus of Complex-Simple Sentences for ITalian.
- Spoken language corpora
  - kiparla - The largest corpus of spoken Italian available so far (for research purposes only).
Models
- Sentiment Analysis
  - SentITA - a Bidirectional LSTM-CNN that operates at word level for sentiment polarity classification.
  - Feel-IT - a BERT-based sentiment and emotion classifier for Italian.
- Language Models
  - UmBERTo - a RoBERTa-based Language Model trained on large Italian corpora.
- Text summarization
  - multilang-summarizer - A multilingual text summarization model partially supported by the National Council of Science and Technology (CONACYT) of Mexico.
Useful libraries
- Only Italian
  - italian-dictionary - a Python library to retrieve the meaning of Italian lemmas
- Multilingual (also supporting Italian)
  - spaCy - a general-purpose Python NLP library
  - NLTK - Natural Language ToolKit library
