{"id":20334145,"url":"https://github.com/javadr/pytorch-detect-code-switching","last_synced_at":"2025-04-11T21:51:08.506Z","repository":{"id":45784204,"uuid":"513434822","full_name":"javadr/PyTorch-Detect-Code-Switching","owner":"javadr","description":"Implementation of a deep learning model (BiLSTM) to detect code-switching","archived":false,"fork":false,"pushed_at":"2025-01-27T22:24:02.000Z","size":10063,"stargazers_count":6,"open_issues_count":0,"forks_count":0,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-03-25T17:49:29.483Z","etag":null,"topics":["bilstm","character-based-model","code-switching","language-detection","natural-language-processing","nlp","paper-implementations","pytorch"],"latest_commit_sha":null,"homepage":"https://javadr-pytorch-detect-code-switching-codeapp-wmvbur.streamlit.app/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/javadr.png","metadata":{"files":{"readme":"README.md","changelog":"Changelog.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-07-13T08:10:18.000Z","updated_at":"2025-01-27T22:24:06.000Z","dependencies_parsed_at":"2024-07-13T10:42:35.943Z","dependency_job_id":"c584d006-7be3-4e1f-a2a2-31b481a69a54","html_url":"https://github.com/javadr/PyTorch-Detect-Code-Switching","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/javadr%2FPyTorch-Detect-Code-Switching","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/javadr%2FPyTorch-Detect-Code-Switching/tags","releas
es_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/javadr%2FPyTorch-Detect-Code-Switching/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/javadr%2FPyTorch-Detect-Code-Switching/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/javadr","download_url":"https://codeload.github.com/javadr/PyTorch-Detect-Code-Switching/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248487733,"owners_count":21112188,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bilstm","character-based-model","code-switching","language-detection","natural-language-processing","nlp","paper-implementations","pytorch"],"created_at":"2024-11-14T20:36:06.458Z","updated_at":"2025-04-11T21:51:08.494Z","avatar_url":"https://github.com/javadr.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PyTorch-Detect-Code-Switching\n\n## [TL;DR ![Streamlit App](https://static.streamlit.io/badges/streamlit_badge_black_white.svg)](https://pytorch-detect-code-switching-code.streamlit.app/)\nThe PyTorch-Detect-Code-Switching repository implements a BiLSTM model for detecting code-switching (language alternation in text). Trained on a dataset of annotated tweets (English, Spanish, and other), the model achieves over 90% precision, recall, and F1 score on the validation set. It is intended for multilingual NLP research. \n\n## Task Description\nMuch current NLP research deals with multilingual content. 
Thus, a first step for many NLP tasks, such as question answering, is to accurately identify the languages in a text. This repository implements the idea from the paper [A Neural Model for Language Identification in Code-Switched Tweets](https://homes.cs.washington.edu/~nasmith/papers/jaech+mulcaire+hathi+ostendorf+smith.lics16.pdf).\n\n## Data\n\nhttp://www.care4lang.seas.gwu.edu/cs2/call.html\n\nThe data is a collection of tweets, provided as three files for the training set and three for the validation set:\n\n\n* `offsets_mod.tsv`:\n```\ntweet_id, user_id, start, end, gold label\n```\n\n* `tweets.tsv`:\n```\ntweet_id, user_id, tweet text\n```\n\n* `data.tsv`:\n```\ntweet_id, user_id, start, end, token, gold label\n```\n\nEach gold label is one of three values:\n\n* en\n* es\n* other\n\n### Data Analysis\n\n* As the following table shows, the labels are imbalanced in both the training and test (dev) sets. `English` tokens make up about 50% of the training data, whereas `Spanish` tokens are the most frequent in the test set.\n\n    | label | train | dev |\n    | --- | --- | --- |\n    | `en` | **46042** | 3028 |\n    | `es` | 25563 | **4185** |\n    | `other` | 20257 | 2370 |\n    | sum | 91862 | 9583 |\n\n* The training set contains `7400` tweets and the test set contains `832`. The tweets in the two sets come from two disjoint groups of users. 
The training set contains tweets from 6 users and the test set contains tweets from 8 users.\n\n    | # | train user id | dev user id |\n    | :---: | :---: | :---: |\n    | 1 | 1160520883 | 156036283 |\n    | 2 | 1520815188 | 21327323 |\n    | 3 | 1651154684 | 270181505 |\n    | 4 | 169403434 | 28261811 |\n    | 5 | 304724626 | 364263613 |\n    | 6 | 336199483 | 382890691 |\n    | 7 |  | 418322879 |\n    | 8 |  | 76523773 |\n* Distribution of unique tokens and characters:\n\n    | | unique tokens | unique tokens (lower case) | unique characters |\n    | :---: | :---: | :---: | :---: |\n    | train | 14366 | 12220 | 50 |\n    | dev | 2771 | 2559 | 28 |\n* The distribution of token lengths is depicted below. The plots were produced with the following shell one-liner:\n    ```bash\n    cut -f5 train_data.tsv|awk '{print length}'|sort -n |uniq -c|awk -F\" \" '{print $NF\" \" $(NF-1)}'|R --slave -e 'x \u003c- scan(file=\"stdin\", quiet=TRUE,  what=list(numeric(), numeric())); png(\"Histogram of tokens length-train.png\");plot(x[[1]],x[[2]], xlab=\"length\", ylab=\"frequency\", main=\"Train\");'\n    ```\n    \u003cimg src=\"./Results/images/Histogram%20of%20token%20length-train.png\" alt=\"token length distribution in training set\" width=\"45%\"/\u003e\n    \u003cimg src=\"./Results/images/Histogram%20of%20token%20length-dev.png\" alt=\"token length distribution in dev set\" width=\"45%\"/\u003e\n\n    Both data sets have a similar distribution of token lengths, with a slight shift. Both also contain several outliers, since users tend to repeat characters on social media. The average token lengths for the training and test sets are `3.93` and `4.11`, respectively. 
The following command computes the total number of characters (the sum of length × count over all token lengths); dividing that total by the number of tokens gives the average:\n    ```bash\n    cut -f5 train_data.tsv|awk '{print length}'|sort -n |uniq -c|awk -F\" \" '{print $NF\" \" $(NF-1)}'|tr \" \" \"*\"|paste -sd+|bc -l\n    ```\n\n### Preprocessing\n* Some rows in `[train|dev]_data.tsv` contain a `\"`, which makes `pandas.read_csv` treat the following lines as part of a quoted field until it reaches another `\"`. To avoid this, I set the `quotechar` option to `'\\0'` (the NUL character) in `pandas.read_csv`.\n* To make sure that the NUL character does not itself occur in those files, I checked them with the following commands:\n    ```bash\n    grep -Pa '\\x00' data/train_data.tsv\n    grep -Pa '\\x00' data/dev_data.tsv\n    ```\n* Another solution to the quoting issue is to set the `quoting` option to `3`, which means `QUOTE_NONE`.\n* As mentioned in the paper, the data contains many long and repetitive character sequences such as “hahahaha...”. To deal with these, runs of a repeated character are shortened and each token is truncated to at most 20 characters. Note that the regex below replaces any run of five or more identical characters with four.\n    ```python\n    df['token'] = df['token'].apply(lambda t: re.sub(r'(.)\\1{4,}',r'\\1\\1\\1\\1', t)[:20])\n    ```\n\n## Installing dependencies\n\nYou can install the dependencies yourself with `pip`. 
They are all listed in the `requirements.txt` file.\n\nTo use this method, you would proceed as:\n\n```pip install -r requirements.txt```\n\n## Model Architecture\n![Char2Vec](./Results/images/Char2Vec.png)\n```python\nBiLSTMtagger(\n  (word_embeddings): Char2Vec(\n    (embeds): Embedding(300, 9, padding_idx=0)\n    (conv1): Sequential(\n      (0): Conv1d(9, 21, kernel_size=(3,), stride=(1,))\n      (1): ReLU()\n      (2): Dropout(p=0.1, inplace=False)\n    )\n    (convs2): ModuleList(\n      (0): Sequential(\n        (0): Conv1d(21, 5, kernel_size=(3,), stride=(1,))\n        (1): ReLU()\n      )\n      (1): Sequential(\n        (0): Conv1d(21, 5, kernel_size=(4,), stride=(1,))\n        (1): ReLU()\n      )\n      (2): Sequential(\n        (0): Conv1d(21, 5, kernel_size=(5,), stride=(1,))\n        (1): ReLU()\n      )\n    )\n    (linear): Sequential(\n      (0): Linear(in_features=15, out_features=15, bias=True)\n      (1): ReLU()\n    )\n  )\n  (lstm): LSTM(15, 128, num_layers=2, batch_first=True, dropout=0.3, bidirectional=True)\n  (hidden2tag): Linear(in_features=256, out_features=4, bias=True)\n)\n```\n#### Model Summary\n```bash\n=================================================================\nLayer (type:depth-idx)                   Param #\n=================================================================\n├─Char2Vec: 1-1                          --\n|    └─Embedding: 2-1                    2,700\n|    └─Sequential: 2-2                   --\n|    |    └─Conv1d: 3-1                  588\n|    |    └─ReLU: 3-2                    --\n|    |    └─Dropout: 3-3                 --\n|    └─ModuleList: 2-3                   --\n|    |    └─Sequential: 3-4              320\n|    |    └─Sequential: 3-5              425\n|    |    └─Sequential: 3-6              530\n|    └─Sequential: 2-4                   --\n|    |    └─Linear: 3-7                  240\n|    |    └─ReLU: 3-8                    --\n├─LSTM: 1-2                              543,744\n├─Linear: 1-3      
                      1,028\n=================================================================\nTotal params: 549,575\nTrainable params: 549,575\nNon-trainable params: 0\n=================================================================\n```\n\n## How to use the code\n\n### Training\n\nRun `train.py` from the `code` directory. The script assumes that the current working directory is `code`.\n\n### Prediction\n\nLaunch `predict.py` with the following arguments:\n\n- `model`: path of the pre-trained model\n- `text`: input text\n\nExample usage:\n```bash\npython predict.py --model pretrained_model.pth --text=\"@lililium This is an audio book !\"\n```\n\n## Result\n\nTraining the model on Google Colab with a `Tesla T4 GPU` for 100 epochs achieved a `validation f1-score` of `0.92`.\n\n ![plot](./Results/images/plot[2207191218]-Ep100B64BiLSTM+Char2Vec,%202Layers,%20Adam,%20lre-3,%20wde-5.png)\n\n### Classification Report\n\n```bash\n              precision    recall  f1-score   support\n\n          en       0.93      0.93      0.93      3028\n          es       0.94      0.96      0.95      4185\n       other       0.95      0.90      0.93      2370\n\n    accuracy                           0.94      9583\n   macro avg       0.94      0.93      0.94      9583\nweighted avg       0.94      0.94      0.94      9583\n```\n\n### Confusion Matrix\n\u003cimg src=\"./Results/images/ConfusionMatrix.png\" alt=\"confusion matrix\" width=\"45%\"/\u003e\n\n## TODO\n - [ ] Data augmentation\n - [ ] Fine-tuning the model to find the best hyper-parameters\n - [X] Prediction GUI\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjavadr%2Fpytorch-detect-code-switching","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjavadr%2Fpytorch-detect-code-switching","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjavadr%2Fpytorch-detect-code-switching/lists"}