Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-italian

A list of awesome NLP resources for the Italian language.
https://github.com/AlessandroGianfelici/awesome-italian

Last synced: about 4 hours ago

  • Corpora

    • Sentiment Analysis

      • Italian review dataset - Trustpilot-crawled dataset with 146,910 reviews.
      • Happy Parents - Annotated datasets of parent-to-parent and parent-to-child dialogues.
      • Italian Sentiment Analysis - Smartphone review dataset.
      • Sentipolc2016 - Dataset for the Evalita Sentipolc competition, ed.2016.
      • Absita2018 - Booking-crawled dataset for the Evalita Absita competition, ed.2018.
      • Distributional Polarity Lexicon - Annotated dataset of sentiment polarity for short expressions (i.e., a few words).
      • SentiML - a collection of documents annotated to identify sentiment at the sentence level.
      • Sentic - multi-lingual sentiment analysis dataset.
      • TWITA - dataset of Italian tweets.
    • Hate speech recognition

      • HaSpeeDe - Dataset for the Evalita Hate Speech Detection competition, ed.2018 and 2020.
      • IHSC - Twitter corpus built with the aim of representing and analyzing hate speech against some minority groups in Italy.
      • WhatsApp Dataset - WhatsApp dataset for studying cyberbullying among Italian students aged 12-13, collected in the context of the CREEP EIT project.
    • Irony detection

      • Irony and Tweets - labeled dataset of ironic tweets in several languages.
      • IronITA 2018 - dataset for the IronITA (Irony Detection in Italian Tweets) competition, organised within Evalita 2018.
    • Word collections

      • paroleitaliane - Lists of Italian words on different topics and from several sources.
    • Part of speech tagging

    • Named Entity Recognition

      • I-CAB - Corpus of annotated articles from "L'Adige" for NER tasks.
      • PAISA - Corpus of annotated articles scraped from the web.
      • itWaC - a 2 billion word corpus constructed from the Web limiting the crawl to the .it domain and using medium-frequency words from the Repubblica corpus and basic Italian vocabulary lists as seeds.
    • Linguistic Complexity

    • Parallel corpora

      • Europarl - parallel sentences in Italian and English from the European Parliament.
      • PaCCSS-IT - Parallel Corpus of Complex-Simple Sentences for ITalian.
    • Spoken language corpora

      • kiparla - The largest corpus of spoken Italian available so far (for research purposes only).
  • Models

    • Sentiment Analysis

      • SentITA - a Bidirectional LSTM-CNN that operates at word level for sentiment polarity classification.
      • Feel-IT - a BERT-based sentiment and emotion classifier for Italian.
    • Language Models

      • UmBERTo - a RoBERTa-based Language Model trained on large Italian Corpora.
    • Text summarization

      • multilang-summarizer - A multilingual text summarization model partially supported by the National Council of Science and Technology (CONACYT) of Mexico.
  • Useful libraries

    • Only Italian

    • Multilingual (supporting also Italian)

      • spaCy - a general-purpose NLP library for Python
      • NLTK - Natural Language ToolKit library