https://github.com/coreysutphin/semeval-spanglish-2019

This repo contains the data, scripts, and results from the SemEval 2019 Sentimix Spanglish challenge.

Host: GitHub
URL: https://github.com/coreysutphin/semeval-spanglish-2019
Owner: CoreySutphin
Created: 2019-09-24T20:06:55.000Z (over 5 years ago)
Default Branch: master
Last Pushed: 2022-12-08T06:59:13.000Z (about 2 years ago)
Last Synced: 2024-10-12T18:33:02.672Z (3 months ago)
Language: Python
Size: 4.28 MB
Stars: 2
Watchers: 2
Forks: 0
Open Issues: 10
Metadata Files:
- Readme: README.md

# semeval-spanglish-2019

This repo contains the data, scripts, and results from the SemEval 2019/2020 Sentimix Spanglish challenge.

## Sentiment Analysis on Code-Mixed Tweets

The challenge is to identify the sentiment of 'Spanglish' tweets, in which English and Spanish are mixed within a single tweet. Misspellings, new words, mixed grammar, and the short length of tweets make this task difficult, and the currently prevailing approach of using pre-trained contextual word embeddings may be less effective here. Vocabularies and embedding tables have to be large to cover both languages, and the odds of running into an out-of-vocabulary word are very high.
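
To make the out-of-vocabulary problem concrete, here is a minimal sketch that measures the share of tokens in a tweet that a vocabulary does not cover. The function name `oov_rate`, the toy vocabularies, and the example tweet are all made up for illustration; they are not taken from this repo's data or scripts.

```python
def oov_rate(tokens, vocabulary):
    """Return the fraction of tokens that are missing from the vocabulary."""
    if not tokens:
        return 0.0
    missing = [tok for tok in tokens if tok.lower() not in vocabulary]
    return len(missing) / len(tokens)


# Hypothetical toy vocabularies; real embedding vocabularies are far larger.
english_vocab = {"i", "love", "this", "so", "much"}
spanish_vocab = {"me", "encanta", "esto", "mucho", "pero"}

# Made-up code-mixed tweet with an elongated, misspelled token.
tweet = "I love esto muchooo".split()

rate_en = oov_rate(tweet, english_vocab)                    # English vocabulary only
rate_es = oov_rate(tweet, spanish_vocab)                    # Spanish vocabulary only
rate_both = oov_rate(tweet, english_vocab | spanish_vocab)  # union of both
```

Even with both vocabularies combined, the elongated token stays uncovered, which is the kind of gap that character-level input is meant to sidestep.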

We have run experiments using one-hot encoded character embeddings and concatenated Spanish + English word embeddings.
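
One way the concatenated Spanish + English representation could work is sketched below: each token is looked up in both embedding tables and the two vectors are joined end to end. This is an illustration under stated assumptions, not the repo's actual code. The helper `concat_embedding`, the 3-dimensional toy vectors, and the zero-vector fallback for missing tokens are all hypothetical.

```python
def concat_embedding(token, en_vectors, es_vectors, en_dim, es_dim):
    """Look up a token in both tables and join the two vectors.

    A token missing from one table contributes a zero vector for that
    language (an assumed fallback), so the result always has length
    en_dim + es_dim.
    """
    key = token.lower()
    en_part = en_vectors.get(key, [0.0] * en_dim)
    es_part = es_vectors.get(key, [0.0] * es_dim)
    return en_part + es_part


# Hypothetical 3-dimensional tables; pre-trained vectors are much wider.
EN_DIM, ES_DIM = 3, 3
english_vectors = {"love": [0.1, 0.2, 0.3]}
spanish_vectors = {"amor": [0.4, 0.5, 0.6]}

vec_love = concat_embedding("love", english_vectors, spanish_vectors, EN_DIM, ES_DIM)
vec_amor = concat_embedding("Amor", english_vectors, spanish_vectors, EN_DIM, ES_DIM)
vec_unknown = concat_embedding("muchooo", english_vectors, spanish_vectors, EN_DIM, ES_DIM)
```

A token unknown to both tables maps to an all-zero vector in this sketch, so it carries no signal into the classifier.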

### Spanish + English Word Embeddings

3 CNN layers, Max Pooling, 2 Dense Layers for classification

| Precision | Recall | F1 |
| --------- | ------ | ------ |
| 0.61 | 0.3348 | 0.4323 |
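
As a quick arithmetic check, the F1 value in this table is the harmonic mean of the reported precision and recall. The helper `f1_score` below is a hypothetical name written for this example, not a function from the repo. Scores that are averaged per class need not satisfy this identity.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall copied from the table above.
f1 = f1_score(0.61, 0.3348)
print(round(f1, 4))  # 0.4323
```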

### One-Hot Encoded Character Embeddings

1 CNN layer, Max Pooling, 2 Dense Layers for classification

| Precision | Recall | F1 |
| --------- | ------ | ---- |
| 0.42 | 0.56 | 0.41 |
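
For readers unfamiliar with character-level input, one-hot encoding a tweet could look roughly like the sketch below. The alphabet, the `MAX_LEN` cap, and the function name `one_hot_encode` are assumptions made for illustration; the repo's own preprocessing may differ.

```python
import string

# Assumed alphabet: lowercase ASCII letters, digits, common Spanish
# accented characters, and the space character.
ALPHABET = string.ascii_lowercase + string.digits + "áéíóúñü" + " "
CHAR_TO_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}
MAX_LEN = 280  # assumed cap on tweet length, in characters


def one_hot_encode(text, max_len=MAX_LEN):
    """Return a max_len x len(ALPHABET) matrix of 0/1 values.

    Row i has a single 1 in the column of the i-th character. Characters
    outside the alphabet and padding positions stay all-zero rows.
    """
    matrix = [[0] * len(ALPHABET) for _ in range(max_len)]
    for position, ch in enumerate(text.lower()[:max_len]):
        index = CHAR_TO_INDEX.get(ch)
        if index is not None:
            matrix[position][index] = 1
    return matrix


encoded = one_hot_encode("Hola ñ!", max_len=10)
```

Because every character maps to a fixed column, misspelled or invented words still produce a usable input, at the cost of much longer sequences than word-level input.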

## Roles

- Corey Sutphin: preprocessing scripts; models using Spanish word embeddings, English word embeddings, and the two concatenated.
- Cove Soyars: preprocessing scripts; bash script; model using one-hot encoded character embeddings with a CNN.
- Nick Rodriguez: preprocessing scripts; BiLSTM model.