https://github.com/coreysutphin/semeval-spanglish-2019
This repo contains the data, scripts, and results from the SemEval 2019 Sentimix Spanglish challenge.
- Host: GitHub
- URL: https://github.com/coreysutphin/semeval-spanglish-2019
- Owner: CoreySutphin
- Created: 2019-09-24T20:06:55.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2022-12-08T06:59:13.000Z (about 2 years ago)
- Last Synced: 2024-10-12T18:33:02.672Z (3 months ago)
- Language: Python
- Size: 4.28 MB
- Stars: 2
- Watchers: 2
- Forks: 0
- Open Issues: 10
Metadata Files:
- Readme: README.md
README
# semeval-spanglish-2019
This repo contains the data, scripts, and results from the SemEval 2019/2020 Sentimix Spanglish challenge.
## Sentiment Analysis on Code-Mixed Tweets
The challenge deals with the problem of identifying the sentiment of a set of 'Spanglish' tweets, where both English and Spanish are used in a single tweet. Misspellings, new words, mixed grammar, and the short length of tweets make this task difficult, and currently prevailing methods of using pre-trained contextual word embeddings may not be as effective. Vocabulary and embedding sizes will have to be large to accommodate both languages, and the odds of running into an out-of-vocabulary word are very high.
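To make the out-of-vocabulary concern concrete, here is a minimal sketch that measures how many tokens of a code-mixed tweet are missing from a vocabulary. The toy vocabularies, the example tweet, and the `tokenize` and `oov_rate` helpers are hypothetical and made up for illustration; they are not taken from the challenge data or from this repo's scripts.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def oov_rate(tokens, vocab):
    """Return the fraction of tokens that do not appear in vocab."""
    if not tokens:
        return 0.0
    missing = [t for t in tokens if t not in vocab]
    return len(missing) / len(tokens)


# Toy vocabularies (hypothetical, far smaller than any real embedding vocabulary).
english_vocab = {"i", "love", "this", "so", "much", "the", "party"}
spanish_vocab = {"me", "encanta", "la", "fiesta", "pero", "estoy", "cansado"}

# Made-up code-mixed tweet with an elongated, non-standard spelling ("soooo").
tweet = "I love la fiesta pero estoy soooo cansado"
tokens = tokenize(tweet)

print("English only:", oov_rate(tokens, english_vocab))
print("Spanish only:", oov_rate(tokens, spanish_vocab))
print("Combined:    ", oov_rate(tokens, english_vocab | spanish_vocab))
```

In this toy case a single-language vocabulary misses a large share of the tweet, and even the combined vocabulary misses the elongated spelling, which is the kind of gap that motivates character-level input.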
We ran experiments using one-hot encoded character embeddings and concatenated Spanish + English word embeddings.
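The two input representations can be sketched roughly as follows. This is a minimal illustration in plain Python, not the repo's implementation: the alphabet, the maximum length, the toy embedding tables, the function names, and the zero-vector fallback for unknown words are all assumptions.

```python
# Assumed character inventory; the alphabet actually used may differ.
ALPHABET = "abcdefghijklmnopqrstuvwxyzáéíóúñü0123456789 #@!?"
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_chars(text, max_len=16):
    """Encode text as a max_len x len(ALPHABET) matrix of 0/1 rows.

    Characters outside ALPHABET and padding positions become all-zero rows.
    """
    rows = []
    for ch in text.lower()[:max_len]:
        row = [0] * len(ALPHABET)
        idx = CHAR_INDEX.get(ch)
        if idx is not None:
            row[idx] = 1
        rows.append(row)
    while len(rows) < max_len:
        rows.append([0] * len(ALPHABET))
    return rows


def concat_embedding(word, es_table, en_table, es_dim, en_dim):
    """Concatenate the Spanish and English vectors for a word.

    A word missing from one table contributes zeros for that half (an assumption).
    """
    es_vec = es_table.get(word, [0.0] * es_dim)
    en_vec = en_table.get(word, [0.0] * en_dim)
    return es_vec + en_vec


# Toy 2-dimensional embedding tables (hypothetical values).
es_table = {"fiesta": [0.5, -0.25]}
en_table = {"party": [0.75, 0.5], "fiesta": [0.25, 0.0]}

matrix = one_hot_chars("¡Hola!")
vector = concat_embedding("fiesta", es_table, en_table, 2, 2)
print(len(matrix), len(matrix[0]))
print(vector)
```

With concatenation, the input dimension is the sum of the two embedding sizes, and a word known in only one language still receives a partially informative vector. The character-level encoding has no vocabulary lookup at all, so misspellings and new words are still represented.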
### Spanish + English Word Embeddings
3 CNN layers, Max Pooling, 2 Dense Layers for classification

| Precision | Recall | F1 |
| --------- | ------ | ------ |
| 0.61 | 0.3348 | 0.4323 |

### One-Hot Encoded Character Embeddings
1 CNN layer, Max Pooling, 2 Dense Layers for classification

| Precision | Recall | F1 |
| --------- | ------ | ---- |
| 0.42 | 0.56 | 0.41 |

## Roles
Corey Sutphin - Preprocessing scripts; models using Spanish word embeddings, English word embeddings, and the two concatenated.
Cove Soyars - Preprocessing scripts, bash script, model using one-hot encoded character embeddings with a CNN.
Nick Rodriguez - Preprocessing scripts, BiLSTM model.