Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/stephantul/unitoken
Tokenization across languages. Useful as preprocessing for subword tokenization.
- Host: GitHub
- URL: https://github.com/stephantul/unitoken
- Owner: stephantul
- License: mit
- Created: 2023-01-17T10:54:32.000Z (about 2 years ago)
- Default Branch: main
- Last Pushed: 2023-02-07T07:59:17.000Z (about 2 years ago)
- Last Synced: 2023-08-01T05:42:23.918Z (over 1 year ago)
- Language: Python
- Size: 24.4 KB
- Stars: 19
- Watchers: 5
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# unitoken
Working with large volumes of webtext can be a bit arduous, as the web contains many different languages. This package provides an automated way of tokenizing webtext in different languages: it first detects the language of a string and then uses an appropriate language-specific tokenizer to tokenize it.
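A rough illustration of this two-step idea, under the assumption of a toy language detector and toy per-language tokenizers (none of this is unitoken's actual implementation; `detect_language`, `TOKENIZERS`, and `tokenize_any` are made up for the sketch):

```python
# Rough sketch of the two-step idea: identify the language of a string, then
# hand it to a tokenizer registered for that language. Everything here is a
# placeholder, not unitoken's real internals.

def detect_language(text: str) -> str:
    # Stand-in heuristic: a real system would use a language-identification
    # model (e.g. fastText's lid.176) instead of checking for Thai characters.
    return "th" if any("\u0e00" <= ch <= "\u0e7f" for ch in text) else "en"

TOKENIZERS = {
    "en": lambda text: text.split(),  # whitespace split as a toy English tokenizer
    "th": lambda text: list(text),    # character split as a toy stand-in for Thai
}

def tokenize_any(text: str) -> tuple[list[str], str]:
    language = detect_language(text)
    tokens = TOKENIZERS.get(language, str.split)(text)
    return tokens, language
```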
It exposes two functions:
* `tokenize(str)`: tokenizes your text into words.
* `sent_tokenize(str)`: tokenizes your text into sentences, which are again split into words.

The functions also take a dictionary mapping from languages to spacy models. This is handy if you'd like to use models you defined yourself.
Both functions return a tuple: the first item is the list of tokens in the text, and the second is a LanguageResult, indicating the detected language, the confidence, and whether the language was valid in the current context. If the language is invalid, i.e., `valid == False`, the returned list of tokens will be empty.
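For illustration, assuming the tuple can be unpacked directly and that the LanguageResult exposes `language`, `confidence`, and `valid` attributes (only `valid` is spelled out above; the other names are guesses), checking the result might look like this:

```python
from unitoken import tokenize

tokens, result = tokenize("The dog walked home and had a nice cookie.")

# `language` and `confidence` are assumed attribute names; `valid` comes from
# the description above.
if result.valid:
    print(result.language, result.confidence)
    print(tokens)
else:
    print("Language detection was not confident enough; token list is empty.")
```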
## Example
```python
from unitoken import tokenize

my_sentences = [
    "日本、朝鮮半島、中国東部まで広く分布し、その姿や鳴き声はよく知られている。",
    "ประเทศโครเอเชียปรับมาใช้สกุลเงินยูโรและเข้าร่วมเขตเชงเกน",
    "The dog walked home and had a nice cookie.",
]

tokenized = []
for sentence in my_sentences:
    tokenized.append(tokenize(sentence))
```

Using a pre-defined model:
```python
import spacy

from unitoken import tokenize

# Assumes en_core_web_sm is installed.
predefined_models = {"en": spacy.load("en_core_web_sm")}

my_sentences = [
    "日本、朝鮮半島、中国東部まで広く分布し、その姿や鳴き声はよく知られている。",
    "ประเทศโครเอเชียปรับมาใช้สกุลเงินยูโรและเข้าร่วมเขตเชงเกน",
    "The dog walked home and had a nice cookie.",
]

tokenized = []
for sentence in my_sentences:
    tokenized.append(tokenize(sentence, models=predefined_models))
```

## Installation
It can be installed by cloning the repository and running `setup.py`:

```
python3 setup.py install
```

or by installing directly from `git` with `pip`:

```
pip install git+https://github.com/stephantul/unitoken.git
```

## Credits
This package builds on top of [spacy](https://spacy.io) and [fasttext](https://fasttext.cc/docs/en/language-identification.html). I didn't really do a lot of work.
## Improvements
* The current implementation assumes that a text consists of a single language. This is obviously false. We could try a window approach to find windows in which languages differ, but this requires some experimentation.
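A very rough sketch of what such a window approach could look like (purely hypothetical; `detect_language` stands in for any language-identification function, and `split_by_language` is not part of the package):

```python
# Hypothetical sketch: detect the language per window of words and cut the
# text wherever the detected language changes.

def split_by_language(words, detect_language, window=10):
    spans = []
    start = 0
    current = detect_language(" ".join(words[:window]))
    for i in range(window, len(words), window):
        lang = detect_language(" ".join(words[i:i + window]))
        if lang != current:
            spans.append((current, words[start:i]))
            current, start = lang, i
    spans.append((current, words[start:]))
    return spans
```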
## Author
Stéphan Tulkens
## License
MIT
## Version
0.2.0