https://github.com/sagorbrur/codeswitch
CodeSwitch is an NLP tool for language identification, POS tagging, named entity recognition, and sentiment analysis of code-mixed data.
code-mixed code-switching codeswitch hindi-english huggingface language-identification ner nlp pos pos-tagging sentiment-analysis spanish-english transformers
- Host: GitHub
- URL: https://github.com/sagorbrur/codeswitch
- Owner: sagorbrur
- License: mit
- Created: 2020-08-22T07:12:26.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2020-11-02T12:55:48.000Z (almost 5 years ago)
- Last Synced: 2025-07-28T21:51:12.493Z (3 months ago)
- Topics: code-mixed, code-switching, codeswitch, hindi-english, huggingface, language-identification, ner, nlp, pos, pos-tagging, sentiment-analysis, spanish-english, transformers
- Language: Jupyter Notebook
- Homepage: https://codeswitch.readthedocs.io
- Size: 23.4 KB
- Stars: 35
- Watchers: 3
- Forks: 6
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Code Switch
[Documentation](https://codeswitch.readthedocs.io/en/latest/?badge=latest) | [PyPI](https://pypi.org/project/codeswitch/) | [Notebook](https://github.com/sagorbrur/codeswitch/blob/master/notebook/codeswitch.ipynb) | [Downloads](https://pepy.tech/project/codeswitch)

**CodeSwitch** is an NLP tool for language identification, POS tagging, named entity recognition, and sentiment analysis of code-mixed data.
## Supported Code-Mixed Language
We used the [LinCE](https://ritual.uh.edu/lince/home) dataset to train **multilingual BERT** models with huggingface [transformers](https://github.com/huggingface/transformers). `LinCE` covers four code-mixed language pairs; we used three of them: `spanish-english`, `hindi-english` and `nepali-english`. We hope to train and add other languages and tasks in the future.

* Spanish-English(spa-eng)
* Hindi-English(hin-eng)
* Nepali-English(nep-eng)

### Language Code
* `spa-eng` for spanish-english
* `hin-eng` for hindi-english
* `nep-eng` for nepali-english

## Installation
```
pip install codeswitch
```
## Dependency
* pytorch >=1.6.0

## Training Details
* All three sequence tagging models (lid, ner, pos) were trained with huggingface [token classification](https://github.com/huggingface/transformers/tree/master/examples/token-classification)
* The sentiment analysis model was trained with huggingface [text classification](https://github.com/huggingface/transformers/tree/master/examples/text-classification)
* You can find every model and its evaluation results [here](https://huggingface.co/sagorsarker)

## Features & Supported Language
* Language Identification
  - spanish-english
  - hindi-english
  - nepali-english
* POS
  - spanish-english
  - hindi-english
* NER
  - spanish-english
  - hindi-english
* Sentiment Analysis
  - spanish-english

## Language Identification
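Language identification is the only task listed for all three language pairs. The feature list above can be encoded as a small lookup so that an unsupported task/language pair is rejected before any model is loaded. This is a minimal sketch and not part of the `codeswitch` package: the task keys (`lid`, `pos`, `ner`, `sentiment`) and the `check_supported` helper are hypothetical names chosen for illustration.

```python
from typing import Dict, FrozenSet

# Task -> supported language codes, transcribed from the feature list above.
# The task keys are hypothetical; codeswitch itself does not define them.
SUPPORTED: Dict[str, FrozenSet[str]] = {
    "lid": frozenset({"spa-eng", "hin-eng", "nep-eng"}),
    "pos": frozenset({"spa-eng", "hin-eng"}),
    "ner": frozenset({"spa-eng", "hin-eng"}),
    "sentiment": frozenset({"spa-eng"}),
}


def check_supported(task: str, language: str) -> None:
    """Raise ValueError if the task/language pair is not in the feature list."""
    if task not in SUPPORTED:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(SUPPORTED)}")
    if language not in SUPPORTED[task]:
        raise ValueError(
            f"{task!r} does not support {language!r}; "
            f"supported: {sorted(SUPPORTED[task])}"
        )
```

For example, `check_supported("pos", "nep-eng")` raises a `ValueError`, while `check_supported("lid", "nep-eng")` returns without error.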
```py
from codeswitch.codeswitch import LanguageIdentification
lid = LanguageIdentification('spa-eng')
# for hindi-english use 'hin-eng',
# for nepali-english use 'nep-eng'
text = "" # your code-mixed sentence
result = lid.identify(text)
print(result)
```

## POS Tagging
```py
from codeswitch.codeswitch import POS
pos = POS('spa-eng')
# for hindi-english use 'hin-eng'
text = "" # your mixed sentence
result = pos.tag(text)
print(result)
```
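When tagging many sentences, it is usually worth constructing the tagger once and reusing it. The helper below is a minimal sketch of that pattern, not part of `codeswitch`: `tag_sentences` is a hypothetical name, and it only assumes that the tagger exposes a one-argument callable such as `pos.tag`.

```python
from typing import Any, Callable, Dict, Iterable


def tag_sentences(tag_fn: Callable[[str], Any], sentences: Iterable[str]) -> Dict[str, Any]:
    """Apply one tagging callable to many sentences.

    Blank sentences are skipped and repeated sentences are tagged only once.
    """
    results: Dict[str, Any] = {}
    for sentence in sentences:
        cleaned = sentence.strip()
        if not cleaned:
            continue  # nothing to tag
        if cleaned not in results:
            results[cleaned] = tag_fn(cleaned)
    return results
```

With the example above, this would be called as `tag_sentences(pos.tag, my_sentences)`, where `my_sentences` is your own list of code-mixed strings.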
## NER Tagging
```py
from codeswitch.codeswitch import NER
ner = NER('spa-eng')
# for hindi-english use 'hin-eng'
text = "" # your mixed sentence
result = ner.tag(text)
print(result)
```
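Token-level NER output is often easier to use once the tokens are grouped into entity spans. The sketch below rests on assumptions this README does not confirm: that `ner.tag` returns a list of dicts with `word` and `entity` keys, that labels follow the BIO scheme (`B-LOC`, `I-LOC`, `O`), and that subword pieces are marked with a leading `##`. Check the actual output before relying on it; `merge_wordpieces` and `group_entities` are hypothetical helpers.

```python
from typing import Dict, List, Optional, Tuple


def merge_wordpieces(tokens: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Join '##' continuation pieces onto the preceding token (assumed WordPiece format)."""
    merged: List[Dict[str, str]] = []
    for tok in tokens:
        word = tok["word"]
        if word.startswith("##") and merged:
            merged[-1]["word"] += word[2:]
        else:
            merged.append({"word": word, "entity": tok["entity"]})
    return merged


def group_entities(tokens: List[Dict[str, str]]) -> List[Tuple[str, Optional[str]]]:
    """Group BIO-labelled tokens into (text, entity_type) spans."""
    spans: List[Tuple[str, Optional[str]]] = []
    words: List[str] = []
    current: Optional[str] = None

    def flush() -> None:
        nonlocal words, current
        if words:
            spans.append((" ".join(words), current))
        words, current = [], None

    for tok in merge_wordpieces(tokens):
        prefix, _, etype = tok["entity"].partition("-")
        if prefix == "I" and etype == current:
            words.append(tok["word"])  # continue the open span
        elif prefix in ("B", "I"):
            flush()  # close any open span, then start a new one
            words, current = [tok["word"]], etype
        else:
            flush()  # "O" or an unrecognised label ends the span
    flush()
    return spans
```

Under these assumptions, `group_entities(ner.tag(text))` would yield pairs of entity text and entity type.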
## Sentiment Analysis
```py
from codeswitch.codeswitch import SentimentAnalysis
sa = SentimentAnalysis('spa-eng')
sentence = "El perro le ladraba a La Gatita .. .. lol #teamlagatita en las playas de Key Biscayne este Memorial day"
result = sa.analyze(sentence)
print(result)
# [{'label': 'LABEL_1', 'score': 0.9587041735649109}]
```
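The result is a list of dicts with generic `LABEL_n` names and a score. A small helper can pick the highest-scoring entry and translate the label with a mapping you supply. This is a minimal sketch; `top_label` is a hypothetical helper, and this README does not say which sentiment each `LABEL_n` stands for, so take the mapping from the model's evaluation page rather than guessing.

```python
from typing import Dict, List, Optional, Tuple


def top_label(
    result: List[Dict[str, object]],
    label_names: Optional[Dict[str, str]] = None,
) -> Tuple[str, float]:
    """Return (label, score) for the highest-scoring entry in a sentiment result.

    label_names optionally maps raw labels such as 'LABEL_1' to readable names;
    labels missing from the mapping are returned unchanged.
    """
    if not result:
        raise ValueError("empty sentiment result")
    best = max(result, key=lambda entry: float(entry["score"]))
    raw = str(best["label"])
    return (label_names or {}).get(raw, raw), float(best["score"])
```

With the example above, `top_label(result)` returns the raw label and its score; pass `label_names` once you have confirmed what each label means.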
## Acknowledgement
* [LinCE](https://ritual.uh.edu/lince/home)
* [BERT](https://arxiv.org/abs/1810.04805)
* [huggingface](https://github.com/huggingface)