# unitoken

Working with large volumes of web text can be a bit arduous, because the web contains many different languages. This package provides an automated way of tokenizing web text in different languages by first detecting the language of a string, and then using an appropriate language-specific tokenizer to tokenize it.

It exposes two functions:
* `tokenize(str)`: tokenizes your text into words.
* `sent_tokenize(str)`: tokenizes your text into sentences, which are again split into words.

Both functions also accept a dictionary mapping language codes to spaCy models. This is handy if you'd like to use models you defined yourself.

Both functions return a tuple: the first item is the list of tokens in the text, while the second item is a `LanguageResult`, indicating the language, the confidence, and whether the language was valid in the current context. If the language is invalid, i.e., `valid == False`, the returned list of tokens will be empty.
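For instance, unpacking the result might look like the following sketch. Only `valid` is named above; `language` and `confidence` are assumed attribute names for the detected language and its score:

```python
from unitoken import tokenize

tokens, result = tokenize("The dog walked home and had a nice cookie.")

# `valid` is documented above; `language` and `confidence` are
# assumptions about how LanguageResult exposes its fields.
if result.valid:
    print(result.language, result.confidence, tokens)
else:
    print("Invalid language in this context; token list is empty:", tokens)
```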

## Example

```python
from unitoken import tokenize

my_sentences = [
    "日本、朝鮮半島、中国東部まで広く分布し、その姿や鳴き声はよく知られている。",
    "ประเทศโครเอเชียปรับมาใช้สกุลเงินยูโรและเข้าร่วมเขตเชงเกน",
    "The dog walked home and had a nice cookie.",
]

tokenized = []
for sentence in my_sentences:
    tokenized.append(tokenize(sentence))
```
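`sent_tokenize` works the same way. A minimal sketch, assuming it follows the tuple convention described above and returns a list of sentences, each itself a list of tokens:

```python
from unitoken import sent_tokenize

sentences, result = sent_tokenize("The dog walked home. It had a nice cookie.")

# Assuming `sentences` is a list of sentences, each a list of tokens.
for sentence in sentences:
    print(sentence)
```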

Using a pre-defined model:

```python
import spacy
from unitoken import tokenize

# Assumes en_core_web_sm is installed.
predefined_models = {"en": spacy.load("en_core_web_sm")}

my_sentences = [
    "日本、朝鮮半島、中国東部まで広く分布し、その姿や鳴き声はよく知られている。",
    "ประเทศโครเอเชียปรับมาใช้สกุลเงินยูโรและเข้าร่วมเขตเชงเกน",
    "The dog walked home and had a nice cookie.",
]

tokenized = []
for sentence in my_sentences:
    tokenized.append(tokenize(sentence, models=predefined_models))
```
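The mapping doesn't have to contain full pretrained pipelines. As a sketch, a blank spaCy pipeline should also fit the interface; note that it only ships a tokenizer, and whether that alone suffices here is an assumption, not something stated above:

```python
import spacy
from unitoken import tokenize

# spacy.blank builds a pipeline containing only a tokenizer, so no
# model download is required.
predefined_models = {"en": spacy.blank("en")}

tokens, result = tokenize(
    "The dog walked home and had a nice cookie.",
    models=predefined_models,
)
```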

## Installation

The package can be installed by cloning the repository and running `setup.py`:

```
python3 setup.py install
```

or directly from GitHub with `pip`:

```
pip install git+https://github.com/stephantul/unitoken.git
```

## Credits

This package builds on top of [spaCy](https://spacy.io) and [fastText](https://fasttext.cc/docs/en/language-identification.html). I didn't really do a lot of work.

## Improvements

* The current implementation assumes that a text consists of a single language, which is often false for web data. We could try a window approach to find windows in which the language differs, but this requires some experimentation; a rough sketch of the idea follows below.
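As an illustration of that window idea only (not part of unitoken; the helper name, window size, and model path are all assumptions), one could run fastText's language identifier over overlapping word windows:

```python
import fasttext

# lid.176.bin is fastText's language-identification model; it must be
# downloaded separately from the fastText website.
model = fasttext.load_model("lid.176.bin")

def detect_language_windows(text, window=10, stride=5):
    """Return (start_index, language, confidence) for each word window."""
    words = text.split()
    results = []
    for start in range(0, max(len(words) - window, 0) + 1, stride):
        chunk = " ".join(words[start:start + window])
        labels, probs = model.predict(chunk)
        # Labels look like "__label__en"; strip the prefix.
        results.append((start, labels[0][len("__label__"):], float(probs[0])))
    return results
```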

## Author

Stéphan Tulkens

## License

MIT

## Version

0.2.0