https://github.com/labrijisaad/language-identifier-svm

Language identification script that can detect the language of a given text. Currently supports Swahili, Wolof, French, English, Arabic, and Dyula. Customizable language support.

arabic-identification dyula english-identification french-identification language-identifier python support-vector-machines svm swahili-identification wolof-identification

Last synced: 3 months ago

Host: GitHub
URL: https://github.com/labrijisaad/language-identifier-svm
Owner: labrijisaad
Created: 2022-07-28T10:02:31.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2022-12-19T08:52:12.000Z (over 2 years ago)
Last Synced: 2025-03-23T08:51:12.642Z (4 months ago)
Topics: arabic-identification, dyula, english-identification, french-identification, language-identifier, python, support-vector-machines, svm, swahili-identification, wolof-identification
Language: Jupyter Notebook
Homepage:
Size: 1.75 MB
Stars: 4
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

README

# 📙 `Language-Identifier with SVM in Python` 🐍



  



- 🎯 In this project, I developed a script that can identify the language used in a given text. 

- 🛠️ The script currently supports the following languages: **`Swahili`**, **`Wolof`**, **`French`**, **`English`**, **`Arabic`** and **`Dyula`**.

- ⚠️ To obtain accurate results, the input text should be relatively long (at least 4-5 words). The script can easily be adapted to add or change the supported languages by adding a training dataset for the desired language; such datasets can be found, for example, on [Hugging Face Datasets](https://huggingface.co/datasets?sort=downloads).
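
Adding a language amounts to retraining on a dataset that also contains labelled examples for it. The sketch below is not the notebook's actual training code: it assumes scikit-learn, a character n-gram `TfidfVectorizer` and a `LinearSVC`, and the function name `train_language_identifier` and the tiny inline dataset are placeholders for illustration only.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Placeholder data: (text, language label) pairs.
# A real dataset would hold many sentences per language.
samples = [
    ("the weather is very nice today", "english"),
    ("where are you going this evening", "english"),
    ("il fait très beau aujourd'hui", "french"),
    ("où vas-tu ce soir après le travail", "french"),
    ("habari ya asubuhi rafiki yangu", "swahili"),
    ("ninakwenda sokoni kununua matunda", "swahili"),
]


def train_language_identifier(pairs):
    """Fit a vectorizer and a linear SVM on (text, label) pairs."""
    texts = [text for text, _ in pairs]
    labels = [label for _, label in pairs]
    # Assumed feature choice: character 1- to 3-grams.
    vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(1, 3))
    features = vectorizer.fit_transform(texts)
    model = LinearSVC()
    model.fit(features, labels)
    return vectorizer, model


vectorizer, model = train_language_identifier(samples)

# Both objects must be saved: prediction needs the fitted vectorizer too.
model_bytes = pickle.dumps(model)
vectorizer_bytes = pickle.dumps(vectorizer)
```

To support one more language under these assumptions, append its labelled sentences to `samples` and retrain; the serialized bytes can then be written to files in place of the existing model and vectorizer.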




- You can find the **`model`** and the **`vectorizer`** in the **`/model`** directory. The same directory also contains the **`python script`** used in the second method below.

- Here are **TWO** ways to use the trained model in a notebook. Install the requirements first:

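```py
# pickle and sys are part of the Python standard library and need no installation.
# scikit-learn is an assumption: unpickling a model requires the library it was trained with.
!pip install pandas scikit-learn
```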

##### Method 1

> via model and vectorizer import

```py
import pickle

import pandas as pd

# Load the trained SVM model and its fitted vectorizer.
with open("model/SVM_model_language_identifier.pkl", "rb") as f:
    SVM_model = pickle.load(f)
with open("model/SVM_vectorizer.pk", "rb") as f:
    SVM_vectorizer = pickle.load(f)

def predict_language(text):
    """Return the predicted language label for a single text."""
    serie = pd.Series(text)
    vector = SVM_vectorizer.transform(serie)
    return str(SVM_model.predict(vector)[0])

text = "Na nga def ?"
print(predict_language(text))
# Output: wolof
```
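
Since accurate results need inputs of at least 4-5 words, a caller may want to refuse very short texts rather than return an unreliable label. The wrapper below is a minimal sketch: `predict_if_long_enough` and the `MIN_WORDS` threshold are hypothetical names, and "word" is taken to mean a whitespace-separated token.

```python
from typing import Callable, Optional

# Lower end of the "4-5 words" guideline; an assumed threshold.
MIN_WORDS = 4


def predict_if_long_enough(
    text: str,
    predict: Callable[[str], str],
    min_words: int = MIN_WORDS,
) -> Optional[str]:
    """Return predict(text), or None when the text has too few tokens."""
    if len(text.split()) < min_words:
        return None
    return predict(text)
```

With the `predict_language` function defined above, a call would look like `predict_if_long_enough(text, predict_language)`.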

##### Method 2

> by calling a script that does all the work for us

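```py
text = "I'm not really into the birthday thing honestly but I admit this was a really chill"
# IPython expands $text before running the command; the double quotes keep the
# apostrophe in the text from being read as an opening shell quote.
var = !python model/language_identifier.py "$text"
# The last line printed by the script is the predicted language.
print(var[-1])
# Output: english
```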
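
The `!python` syntax only works inside IPython or Jupyter. Outside a notebook, the same call can be sketched with `subprocess`. This assumes the script accepts the text as a command-line argument and prints the label on its last output line, as the `var[-1]` lookup suggests; `identify_with_script` is a hypothetical helper name.

```python
import subprocess
import sys


def identify_with_script(text: str, script_path: str = "model/language_identifier.py") -> str:
    """Run the script on `text` and return the last line it prints."""
    # Passing the arguments as a list bypasses the shell, so quotes and
    # apostrophes in the text need no escaping.
    result = subprocess.run(
        [sys.executable, script_path, text],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script fails
    )
    lines = result.stdout.strip().splitlines()
    return lines[-1] if lines else ""
```

Using `sys.executable` runs the script with the same interpreter as the caller, so it sees the same installed packages.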

- 💪 Model performance: here are the results obtained after training the model (values rounded to four decimal places):

  | Language | Precision | Recall | F1-score | Support |
  |----------|-----------|--------|----------|---------|
  | wolof    | 0.9956    | 0.9884 | 0.9920   | 687     |
  | french   | 0.9971    | 0.9788 | 0.9879   | 709     |
  | swahili  | 1.0000    | 0.9849 | 0.9924   | 729     |
  | english  | 0.9683    | 0.9737 | 0.9710   | 722     |
  | arabic   | 0.9363    | 0.9742 | 0.9549   | 619     |
  | dyula    | 1.0000    | 1.0000 | 1.0000   | 691     |

      

 > - Overall, the SVM model performs well on all six languages: every F1-score is above 0.95, ranging from about 0.955 for Arabic to 1.0 for Dyula.
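
To condense the per-language results into single figures, the macro-averaged and support-weighted F1-scores can be derived from the reported values. The snippet below is a small worked calculation, not part of the original notebook; the variable names are illustrative.

```python
# (F1-score, support) per language, copied from the results above.
results = {
    "wolof": (0.9919649379108838, 687),
    "french": (0.9879003558718862, 709),
    "swahili": (0.9923980649619903, 729),
    "english": (0.9709944751381215, 722),
    "arabic": (0.9548693586698337, 619),
    "dyula": (1.0, 691),
}

# Macro average: every language counts equally.
macro_f1 = sum(f1 for f1, _ in results.values()) / len(results)

# Weighted average: each language counts in proportion to its support.
total_support = sum(support for _, support in results.values())
weighted_f1 = sum(f1 * support for f1, support in results.values()) / total_support

print(f"macro F1:    {macro_f1:.4f}")
print(f"weighted F1: {weighted_f1:.4f}")
```

Because the supports are fairly balanced (619 to 729 samples per language), the two averages stay close to each other.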

- 📫 Feel free to contact me if anything is wrong or if anything needs to be changed 😎!  **[email protected]**



> - 🙌 Notebook made by [@labriji_saad](https://github.com/labrijisaad)

> - 🔗 LinkedIn [@labriji_saad](https://www.linkedin.com/in/labrijisaad/)
