Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ebonnal/annotweet
Sentiment Analysis project on tweets.
- Host: GitHub
- URL: https://github.com/ebonnal/annotweet
- Owner: ebonnal
- Created: 2019-02-27T09:13:30.000Z (almost 6 years ago)
- Default Branch: master
- Last Pushed: 2019-04-29T15:43:26.000Z (almost 6 years ago)
- Last Synced: 2025-01-13T12:51:00.465Z (about 1 month ago)
- Topics: classification, nlp, nlp-machine-learning, scala, sentiment-analysis, spark, spark-ml, tweets, twitter
- Language: Scala
- Homepage:
- Size: 8.75 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 3
Metadata Files:
- Readme: README.md
README
# Annotweet
## Overview
It is a **Sentiment Analysis** project on tweets. The project structure (leveraging simple Builder and Factory patterns) aims at making it easy to create predictive models out of a fully configurable **Spark ML pipeline** (choose the pre-processings, transformations, and classification algorithm), as sketched below.
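A minimal sketch of that builder idea, using hypothetical names (`PipelineBuilder`, `withStage`, `build`) rather than the project's actual API:

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import scala.collection.mutable.ArrayBuffer

// Minimal sketch: accumulate freely chosen stages (pre-processings,
// transformations, classifier) and assemble them into one Spark ML Pipeline.
class PipelineBuilder {
  private val stages = ArrayBuffer.empty[PipelineStage]

  // add any Spark ML stage (tokenizer, vectorizer, classifier, ...)
  def withStage(stage: PipelineStage): PipelineBuilder = {
    stages += stage
    this
  }

  def build(): Pipeline = new Pipeline().setStages(stages.toArray)
}
```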
We also have a convenient class for **tweet extraction** built on *twitter4j* facilities.
We made our own annotations on an extracted dataset.

## Built with
- Scala 2.11.8
- Spark 2.4.0
- Java 8

See details on how to use the trained model in `Guide_Utilisation.pdf` and the project report in `Compte_Rendu.pdf` (in French).
# Report: Sentiment Analysis Approaches to Tweets about EM
## Extraction
Our extraction code is based on the twitter4j API and is contained in our project's `com.enzobnl.annotweet.extraction` package.
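A minimal sketch of such an extraction with twitter4j (the query, language, and output handling are assumptions; credentials are read from the usual `twitter4j.properties`):

```scala
import twitter4j.{Query, TwitterFactory}
import scala.collection.JavaConverters._

// Search the Twitter API for French tweets matching a keyword and print them.
object ExtractionSketch extends App {
  val twitter = TwitterFactory.getSingleton        // credentials from twitter4j.properties
  val query   = new Query("Macron").lang("fr").count(100)
  val tweets  = twitter.search(query).getTweets.asScala
  tweets.foreach(t => println(s"${t.getId}\t${t.getText}"))
}
```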
## Annotation
We each annotated the tweets on our own, then built an automatic consensus based on rules.
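The rules themselves are not detailed here; as a hedged sketch, a strict-majority consensus (an assumption, not the project's actual rule set) could look like this:

```scala
// Keep a tweet's label only when a strict majority of annotators agree on it.
def consensus(labels: Seq[String]): Option[String] = {
  val (label, count) = labels.groupBy(identity).mapValues(_.size).maxBy(_._2)
  if (count * 2 > labels.size) Some(label) else None // no strict majority: discard
}

// consensus(Seq("POS", "POS", "NEG")) == Some("POS")
// consensus(Seq("POS", "NEG"))        == None
```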
## Technologies used

Spark ML in Scala.

## Tokenization
This is the part in which we invested the most time.
Here is a preview of the implemented steps, applied to an example tweet:

|Step|Result|
|--|--|
|Original|"Macron il a une bonne testas,on dirait un Makroudh! Mais lui,il a #rien de bon:( #aie https://t.co/FiOiho7"|
|Smileys|"Macron il a une bonne testas,on dirait un Makroudh! Mais lui,il a #rien de bon **sadsmiley** #aie https://t.co/FiOiho7"|
|URLs|"Macron il a une bonne testas,on dirait un Makroudh! Mais lui,il a #rien de bon sadsmiley #aie **http FiOiho7**"|
|Punctuation|"Macron il a une bonne testas on dirait un Makroudh **!** Mais lui il a #rien de bon sadsmiley #aie http FiOiho7"|
|Split|["Macron", "il", "a", "une", "bonne", "testas", "on", "dirait", "un", "Makroudh", "!", "Mais", "lui", "il", "a", "#rien", "de", "bon", "sadsmiley", "#aie", "http", "FiOiho7"]|
|"But"/"Mais" filter|["lui", "il", "a", "#rien", "de", "bon", "sadsmiley", "#aie", "http", "FiOiho7"]|
|Filler removal|["lui", "il", "a", "#rien", "bon", "sadsmiley", "#aie", "http", "FiOiho7"]|
|Hashtags|["lui", "il", "a", **"rien"**, "bon", "sadsmiley", "#aie", "http", "FiOiho7", **"#rien"**]|
|Word pairs|["luiil", "ila", "arien", "rienbon", "bonsadsmiley", "lui", "il", "a", "rien", "bon", "sadsmiley", "#aie", "http", "FiOiho7", "#rien"]|

Word pairs are especially effective when the rest of the system (vectorization and/or classification) does not take the context of words into account, for example TF-IDF + Logistic Regression.
Each step was kept because it improved accuracy. However, we could have spent more time understanding the influence of each step on the results in order to refine them: which errors does each step remove, and which does it add?
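As a rough sketch (not the project's actual pipeline stages, and with an assumed filler-word list), the steps above could be chained like this:

```scala
// Sketch of the tokenization chain shown in the table above.
val fillers = Set("de", "du", "le", "la", "les", "un", "une") // assumed filler list

def tokenize(tweet: String): Seq[String] = {
  val t = tweet
    .replaceAll(""":\)""", " happysmiley ")               // smileys -> markers
    .replaceAll(""":\(""", " sadsmiley ")
    .replaceAll("""https?://t\.co/(\w+)""", " http $1 ")  // URLs -> "http" + id
    .replaceAll("""[,.;:]""", " ")                        // drop punctuation...
    .replaceAll("!", " ! ")                               // ...but keep "!" as a token
  val words = t.trim.split("""\s+""").toSeq
  // keep only what follows the last "Mais" ("But"), which carries the opinion
  val afterBut = words.lastIndexWhere(_.equalsIgnoreCase("mais")) match {
    case -1 => words
    case i  => words.drop(i + 1)
  }
  val kept = afterBut.filterNot(w => fillers(w.toLowerCase)) // filler removal
  // hashtags: add the bare word, keep the original hashtag tokens at the end
  val withTags = kept.map(_.stripPrefix("#")) ++ kept.filter(_.startsWith("#"))
  val pairs = withTags.sliding(2).collect { case Seq(a, b) => a + b }.toSeq
  pairs ++ withTags                                          // word pairs first
}
```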
## Vectorization
For vectorization we initially used **TF-IDF**, then drastically improved our performance, by 7 points, by switching to **word2vec**.
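For reference, a minimal word2vec stage in Spark ML looks like this (column names and hyperparameter values are assumptions):

```scala
import org.apache.spark.ml.feature.Word2Vec

// Embed each tweet's token sequence into a single dense vector.
val word2Vec = new Word2Vec()
  .setInputCol("tokens")   // assumed column holding the token arrays
  .setOutputCol("features")
  .setVectorSize(100)      // assumed embedding size
  .setMinCount(2)          // assumed minimum word frequency
```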
## Classification

For the choice of the classification algorithm, we tested (ordered from best to worst):
1. Gradient-Boosted Trees (GBT, best)
2. Logistic Regression
3. Khiops (Orange proprietary implementation of the MODL approach)
4. Random Forest

We could not tune our GBT as much as we would have liked (only the number of iterations and the learning rate were optimized).
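In Spark ML those two hyperparameters correspond to `maxIter` (number of boosting iterations) and `stepSize` (learning rate); a minimal sketch, with assumed values:

```scala
import org.apache.spark.ml.classification.GBTClassifier

// Gradient-Boosted Trees classifier; only the two tuned hyperparameters are set.
val gbt = new GBTClassifier()
  .setLabelCol("label")    // assumed label column
  .setFeaturesCol("features")
  .setMaxIter(50)          // number of iterations (assumed value)
  .setStepSize(0.1)        // learning rate (assumed value)
```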
## Model Selection Methodology
Whether for tokenization modifications, vectorization solutions, or classification choices, we evaluated our complete models at each step using cross-validation (number of folds = 10).
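A minimal sketch of that evaluation with Spark ML's `CrossValidator`, reusing the `word2Vec` and `gbt` stages sketched above (the dataset name and columns are assumptions):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

// 10-fold cross-validation of the whole pipeline, scored on accuracy.
val pipeline = new Pipeline().setStages(Array(word2Vec, gbt))
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setMetricName("accuracy")
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(new ParamGridBuilder().build()) // no grid: evaluate as-is
  .setNumFolds(10)                                       // "number of folds = 10"
val cvModel = cv.fit(dataset) // dataset: DataFrame with "tokens" and "label" columns
```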
## Results

On our cross-validation we get an accuracy of **68% ± 2%** with the previously presented tokenization + word2vec + GBT.
Special mention goes to the time spent on tokenization, which gained us between 2 and 7 points of accuracy depending on the vectorization and classification used downstream, the largest improvement being observed when it is followed by TF-IDF + Logistic Regression.