Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ebonnal/annotweet
Sentiment Analysis project on tweets.
- Host: GitHub
- URL: https://github.com/ebonnal/annotweet
- Owner: ebonnal
- Created: 2019-02-27T09:13:30.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-04-29T15:43:26.000Z (almost 6 years ago)
- Last Synced: 2025-01-13T12:51:00.465Z (about 1 month ago)
- Topics: classification, nlp, nlp-machine-learning, scala, sentiment-analysis, spark, spark-ml, tweets, twitter
- Language: Scala
- Homepage:
- Size: 8.75 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
# Annotweet
## Overview
It is a **Sentiment Analysis** project on tweets. The project structure (leveraging simple Builder and Factory patterns) aims at making it easy to create predictive models out of a fully configurable **Spark ML pipeline** (choose the pre-processings, transformations, and classification algorithm), as sketched below.
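A minimal sketch of that builder idea, using hypothetical names (`PipelineBuilder`, `withStage`, `build`) rather than the project's actual API:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import scala.collection.mutable.ArrayBuffer

// Minimal sketch: accumulate freely chosen stages (pre-processings,
// transformations, classifier) and assemble them into one Spark ML Pipeline.
class PipelineBuilder {
  private val stages = ArrayBuffer.empty[PipelineStage]

  // add any Spark ML stage (tokenizer, vectorizer, classifier, ...)
  def withStage(stage: PipelineStage): PipelineBuilder = {
    stages += stage
    this
  }

  def build(): Pipeline = new Pipeline().setStages(stages.toArray)
}
```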
We also have a convenient class for **tweet extraction** built on *twitter4j* facilities.
We made our own annotations on an extracted dataset.

## Built with
- Scala 2.11.8
- Spark 2.4.0
- Java 8

See details on how to use the trained model in `Guide_Utilisation.pdf` and the project report in `Compte_Rendu.pdf` (in French).
# Report: Sentiment Analysis Approaches to Tweets about EM
## Extraction
Our extraction code is based on the twitter4j API and is contained in our project's `com.enzobnl.annotweet.extraction` package.
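A minimal sketch of such an extraction with twitter4j (the query, language, and output handling are assumptions; credentials are read from the usual `twitter4j.properties`):

```scala
import twitter4j.{Query, TwitterFactory}
import scala.collection.JavaConverters._

// Search the Twitter API for French tweets matching a keyword and print them.
object ExtractionSketch extends App {
  val twitter = TwitterFactory.getSingleton        // credentials from twitter4j.properties
  val query   = new Query("Macron").lang("fr").count(100)
  val tweets  = twitter.search(query).getTweets.asScala
  tweets.foreach(t => println(s"${t.getId}\t${t.getText}"))
}
```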
## Annotation
We each annotated the tweets on our own, then built an automatic consensus based on rules.
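The rules themselves are not detailed here; as a hedged sketch, a strict-majority consensus (an assumption, not the project's actual rule set) could look like this:

```scala
// Keep a tweet's label only when a strict majority of annotators agree on it.
def consensus(labels: Seq[String]): Option[String] = {
  val (label, count) = labels.groupBy(identity).mapValues(_.size).maxBy(_._2)
  if (count * 2 > labels.size) Some(label) else None // no strict majority: discard
}

// consensus(Seq("POS", "POS", "NEG")) == Some("POS")
// consensus(Seq("POS", "NEG"))        == None
```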
## Technologies used

Spark ML in Scala.

## Tokenization
This is the part in which we invested the most time.
Here is a preview of the implemented steps, applied to an example tweet:

|Step|Result|
|--|--|
|Original|"Macron il a une bonne testas,on dirait un Makroudh! Mais lui,il a #rien de bon:( #aie https://t.co/FiOiho7"|
|Smileys|"Macron il a une bonne testas,on dirait un Makroudh! Mais lui,il a #rien de bon **sadsmiley** #aie https://t.co/FiOiho7"|
|URLs|"Macron il a une bonne testas,on dirait un Makroudh! Mais lui,il a #rien de bon sadsmiley #aie **http FiOiho7**"|
|Punctuation|"Macron il a une bonne testas on dirait un Makroudh **!** Mais lui il a #rien de bon sadsmiley #aie http FiOiho7"|
|Split|["Macron", "il", "a", "une", "bonne", "testas", "on", "dirait", "un", "Makroudh", "!", "Mais", "lui", "il", "a", "#rien", "de", "bon", "sadsmiley", "#aie", "http", "FiOiho7"]|
|"But"/"Mais" filter|["lui", "il", "a", "#rien", "de", "bon", "sadsmiley", "#aie", "http", "FiOiho7"]|
|Filler removal|["lui", "il", "a", "#rien", "bon", "sadsmiley", "#aie", "http", "FiOiho7"]|
|Hashtags|["lui", "il", "a", **"rien"**, "bon", "sadsmiley", "#aie", "http", "FiOiho7", **"#rien"**]|
|Word pairs|["luiil", "ila", "arien", "rienbon", "bonsadsmiley", "lui", "il", "a", "rien", "bon", "sadsmiley", "#aie", "http", "FiOiho7", "#rien"]|

Word pairs are especially effective when the rest of the system (vectorization and/or classification) does not take the context of words into account, for example TF-IDF + Logistic Regression.
Each step was kept because it improved accuracy. However, we could have spent more time understanding the influence of each step on the results in order to refine them: which errors does each step remove, and which does it add?
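As a rough sketch (not the project's actual pipeline stages, and with an assumed filler-word list), the steps above could be chained like this:

```scala
// Sketch of the tokenization chain shown in the table above.
val fillers = Set("de", "du", "le", "la", "les", "un", "une") // assumed filler list

def tokenize(tweet: String): Seq[String] = {
  val t = tweet
    .replaceAll(""":\)""", " happysmiley ")               // smileys -> markers
    .replaceAll(""":\(""", " sadsmiley ")
    .replaceAll("""https?://t\.co/(\w+)""", " http $1 ")  // URLs -> "http" + id
    .replaceAll("""[,.;:]""", " ")                        // drop punctuation...
    .replaceAll("!", " ! ")                               // ...but keep "!" as a token
  val words = t.trim.split("""\s+""").toSeq
  // keep only what follows the last "Mais" ("But"), which carries the opinion
  val afterBut = words.lastIndexWhere(_.equalsIgnoreCase("mais")) match {
    case -1 => words
    case i  => words.drop(i + 1)
  }
  val kept = afterBut.filterNot(w => fillers(w.toLowerCase)) // filler removal
  // hashtags: add the bare word, keep the original hashtag tokens at the end
  val withTags = kept.map(_.stripPrefix("#")) ++ kept.filter(_.startsWith("#"))
  val pairs = withTags.sliding(2).collect { case Seq(a, b) => a + b }.toSeq
  pairs ++ withTags                                          // word pairs first
}
```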
## Vectorization
For vectorization we initially used **TF-IDF**, then drastically improved our performance, by 7 points, by switching to **word2vec**.
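For reference, a minimal word2vec stage in Spark ML looks like this (column names and hyperparameter values are assumptions):

```scala
import org.apache.spark.ml.feature.Word2Vec

// Embed each tweet's token sequence into a single dense vector.
val word2Vec = new Word2Vec()
  .setInputCol("tokens")   // assumed column holding the token arrays
  .setOutputCol("features")
  .setVectorSize(100)      // assumed embedding size
  .setMinCount(2)          // assumed minimum word frequency
```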
## Classification

For the choice of the classification algorithm, we tested (ordered from best to worst):
1. Gradient-Boosted Trees (GBT, best)
2. Logistic Regression
3. Khiops (Orange proprietary implementation of the MODL approach)
4. Random Forest

We could not tune our GBT as much as we would have liked (only the number of iterations and the learning rate were optimized).
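In Spark ML those two hyperparameters correspond to `maxIter` (number of boosting iterations) and `stepSize` (learning rate); a minimal sketch, with assumed values:

```scala
import org.apache.spark.ml.classification.GBTClassifier

// Gradient-Boosted Trees classifier; only the two tuned hyperparameters are set.
val gbt = new GBTClassifier()
  .setLabelCol("label")    // assumed label column
  .setFeaturesCol("features")
  .setMaxIter(50)          // number of iterations (assumed value)
  .setStepSize(0.1)        // learning rate (assumed value)
```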
## Model Selection Methodology
Whether for tokenization modifications, vectorization solutions, or classification choices, we evaluated our complete models at each step using cross-validation (number of folds = 10).
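A minimal sketch of that evaluation with Spark ML's `CrossValidator`, reusing the `word2Vec` and `gbt` stages sketched above (the dataset name and columns are assumptions):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// 10-fold cross-validation of the whole pipeline, scored on accuracy.
val pipeline = new Pipeline().setStages(Array(word2Vec, gbt))
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("accuracy")
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(new ParamGridBuilder().build()) // no grid: evaluate as-is
  .setNumFolds(10)                                       // "number of folds = 10"
val cvModel = cv.fit(dataset) // dataset: DataFrame with "tokens" and "label" columns
```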
## Results

On our cross-validation we get an accuracy of **68% ± 2%** with the previously presented tokenization + word2vec + GBT.
Special mention goes to the time spent on tokenization, which gained us between 2 and 7 points of accuracy depending on the vectorization and classification used downstream, the largest improvement being observed when it is followed by TF-IDF + Logistic Regression.