https://github.com/andrianllmm/taglid

A word-level Language Identification (LID) tool for Tagalog-English (Taglish) text
code-mixing code-switching english language-identification linguistics nlp tagalog taglish

Host: GitHub
URL: https://github.com/andrianllmm/taglid
Owner: andrianllmm
License: mit
Created: 2024-08-09T11:01:08.000Z (almost 2 years ago)
Default Branch: main
Last Pushed: 2025-06-22T16:55:28.000Z (about 1 year ago)
Last Synced: 2025-10-23T01:02:05.499Z (9 months ago)
Topics: code-mixing, code-switching, english, language-identification, linguistics, nlp, tagalog, taglish
Language: Python
Homepage: https://andrianllmm.github.io/projects/taglid
Size: 617 KB
Stars: 2
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# TagLID

**A word-level Language Identification (LID) tool for Tagalog-English (Taglish) text**

[![Demo](https://asciinema.org/a/674332.svg)](https://asciinema.org/a/674332)



## About

TagLID is a library that labels each word in Taglish (mixed Tagalog-English)
text by language. It returns either a simple tag (`tgl` or `eng`) or detailed
frequency info with a flag indicating how the word was identified. It is a
rule-based, opinionated system that relies mostly on dictionary lookups. It
skips numbers, names, and interjections, and includes logic for slang,
abbreviations, contractions, inflected words (via stemming or lemmatization),
intrawords, and misspelling correction.
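
To make the dictionary-lookup idea concrete, here is a minimal sketch of word-level labeling. It is not TagLID's implementation: the word lists `ENG_WORDS` and `TGL_WORDS` and the function `toy_identify` are hypothetical, and the sketch ignores slang, inflection, intrawords, and spelling correction.

```python
import re

# Hypothetical toy word lists; a real system would load full dictionaries.
ENG_WORDS = {"hello", "what", "world"}
TGL_WORDS = {"mundo", "po", "ano", "lang"}


def toy_identify(text):
    """Label each alphabetic token as 'eng', 'tgl', or None (unknown)."""
    labels = []
    # Keep alphabetic tokens (optionally hyphenated); numbers and
    # punctuation are skipped because they never match the pattern.
    for token in re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", text.lower()):
        in_eng = token in ENG_WORDS
        in_tgl = token in TGL_WORDS
        if in_eng and not in_tgl:
            labels.append((token, "eng"))
        elif in_tgl and not in_eng:
            labels.append((token, "tgl"))
        else:
            # Found in both lists or in neither: leave undecided.
            labels.append((token, None))
    return labels


result = toy_identify("hello, mundo 123")
print(result)
```

Words found in both lists or in neither are left as `None` here; a fuller system would need extra rules for those cases.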

## Installation

```sh
pip install git+https://github.com/andrianllmm/taglid.git@main
```

## Usage

TagLID can be used as a library via `import taglid` or as a CLI application via
`python -m taglid.lid`.

### Library Mode

#### Textual data

Use the `lid` module for textual data.

Use `lang_identify` to identify the language of each word in a text. It takes
any string and returns a list of dictionaries, one per word, each holding the
word, its English and Tagalog values, a flag, and a correction.

```python
from taglid.lid import lang_identify

labeled_text = lang_identify("hello, mundo")
print(labeled_text)
```

Output:

```
[{'Word': 'hello', 'eng': 1.0, 'tgl': 0.0, 'Flag': 'DICT', 'Correction': None}, {'Word': 'mundo', 'eng': 0.0, 'tgl': 1.0, 'Flag': 'DICT', 'Correction': None}]
```

Use [`tabulate`](https://pypi.org/project/tabulate/) to view the output in
tabular format.

```python
from tabulate import tabulate

print(tabulate(labeled_text, headers="keys"))
```

Output:

```
word      eng    tgl  flag    correction
------  -----  -----  ------  ------------
hello       1      0  DICT
mundo       0      1  DICT
```

Use `simplify` to show only the words and their language. It takes the return
value of `lang_identify` and returns a list of tuples, each containing a word
and its language.

```python
from taglid.lid import simplify

simplified_text = simplify(labeled_text)
print(simplified_text)
```

Output:

```
[('hello', 'eng'), ('mundo', 'tgl')]
```
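
Because `simplify` returns plain tuples, the result is easy to post-process with the standard library. The sketch below hard-codes the sample output above rather than calling TagLID, and computes each language's share of the labeled words; the variable names `counts` and `shares` are illustrative.

```python
from collections import Counter

# Sample value copied from the simplify output above.
simplified_text = [("hello", "eng"), ("mundo", "tgl")]

# Count how many words received each language tag.
counts = Counter(lang for _, lang in simplified_text)

# Convert counts to proportions of all labeled words.
total = sum(counts.values())
shares = {lang: n / total for lang, n in counts.items()}
print(shares)
```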

#### Datasets

Use the `lid_dataset` module for datasets.

Use `lang_identify_df` to label each word in each cell of a
[`pandas`](https://pypi.org/project/pandas/) DataFrame. It takes a DataFrame
with multiple rows and columns, where each cell contains text, and returns a
labeled DataFrame in which each token is a row labeled by its original row,
original column, and token index.

```python
import pandas as pd
from taglid.lid_dataset import lang_identify_df

data = [['hello po', 'ano?'], ['mag-aask lang po', 'what?']]
df = pd.DataFrame(data)

labeled_df = lang_identify_df(df)
print(labeled_df)
```

Output:

```
     col  token_index      word  eng  tgl  flag correction
row
0      0            1     hello  1.0  0.0  DICT       None
0      0            2        po  0.0  1.0  DICT       None
0      1            1       ano  0.0  1.0  FREQ       None
1      0            1  mag-aask  0.5  0.5  INTW       None
1      0            2      lang  0.0  1.0  FREQ       None
1      0            3        po  0.0  1.0  DICT       None
1      1            1      what  1.0  0.0  DICT       None
```
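
The token-level table can be aggregated back to the original cells with ordinary pandas operations. The sketch below rebuilds the sample output by hand instead of calling `lang_identify_df`, omits the `flag` and `correction` columns for brevity, and assumes the index is named `row` as the printed output suggests. It then averages the `eng` and `tgl` values per original cell; `cell_scores` is an illustrative name.

```python
import pandas as pd

# Rows copied from the sample output above; in practice this would be the
# DataFrame returned by lang_identify_df.
labeled_df = pd.DataFrame(
    {
        "col": [0, 0, 1, 0, 0, 0, 1],
        "token_index": [1, 2, 1, 1, 2, 3, 1],
        "word": ["hello", "po", "ano", "mag-aask", "lang", "po", "what"],
        "eng": [1.0, 0.0, 0.0, 0.5, 0.0, 0.0, 1.0],
        "tgl": [0.0, 1.0, 1.0, 0.5, 1.0, 1.0, 0.0],
    },
    index=pd.Index([0, 0, 0, 1, 1, 1, 1], name="row"),
)

# Average the per-word values within each original cell (row, col).
cell_scores = labeled_df.groupby(["row", "col"])[["eng", "tgl"]].mean()
print(cell_scores)
```

A cell such as `(0, 0)` ("hello po") averages to an even split, while `(1, 0)` leans toward Tagalog.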

### CLI Mode

Run TagLID from the terminal.

```sh
python -m taglid.lid
```

Then type a sentence when prompted.

```
text: hello, mundo
```

Output:

```
word      eng    tgl  flag    correction
------  -----  -----  ------  ------------
hello       1      0  DICT
mundo       0      1  DICT
```

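Add `--simplify` to show only the words and their language. Pass the input with
`--text` to skip the prompt, quoting it so the shell treats it as one argument.

```sh
python -m taglid.lid --simplify --text "hello, mundo"
```

Output:

```
-----  ---
hello  eng
mundo  tgl
-----  ---
```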

Use `lid_dataset` to label Excel spreadsheets directly, replacing `in_path` and
`out_path` with the input and output file paths.

```sh
python -m taglid.lid_dataset in_path out_path
```

## Accuracy

Accuracy has not been evaluated yet.

## Contributing

Contributions are welcome! To get started:

1. Fork the project

2. Create your feature branch (`git checkout -b feature/AmazingFeature`)

3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)

4. Push to the branch (`git push origin feature/AmazingFeature`)

5. Open a pull request

## Issues

Found a bug or issue? Report it on the
[issues page](https://github.com/andrianllmm/taglid/issues).
