Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ysenarath/sinling
A collection of NLP tools for Sinhalese (සිංහල).
joiner language-processing morphological-analyser natural-language-processing nlp part-of-speech pos-tagging sinhala sinhala-nlp sinhala-stemmer sinhala-tokenizer splitter tokenizer tool toolkit
- Host: GitHub
- URL: https://github.com/ysenarath/sinling
- Owner: ysenarath
- License: apache-2.0
- Created: 2019-03-27T23:21:01.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2021-06-28T20:34:18.000Z (over 3 years ago)
- Last Synced: 2024-10-30T17:43:25.702Z (9 days ago)
- Topics: joiner, language-processing, morphological-analyser, natural-language-processing, nlp, part-of-speech, pos-tagging, sinhala, sinhala-nlp, sinhala-stemmer, sinhala-tokenizer, splitter, tokenizer, tool, toolkit
- Language: Jupyter Notebook
- Homepage: https://sinling.ysenarath.com
- Size: 44.4 MB
- Stars: 50
- Watchers: 7
- Forks: 17
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# A language processing tool for Sinhalese (සිංහල).
`Update 2020.11.01: Fixed pypi package. Use 'pip install sinling' to install sinling directly from repository.`
`Update 2020.08.16: Add pypi package @ https://pypi.org/project/sinling/.`
`Update 2020.08.16: Integrated Part of speech tagger and stemmer tool.`
`Update 2019.07.21: This tool no longer requires java to run sinhala tokenizer.
All java code is ported to Python implementation for convenience.`

[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/ysenarath/sinling.git/master?filepath=notebooks%2Fexamples.ipynb)
[![PyPI version](https://badge.fury.io/py/sinling.svg)](https://badge.fury.io/py/sinling)

## Installation
Run the following command in your virtualenv to install this package.
`pip install sinling`
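
To confirm the install landed in the active virtualenv, a quick check along these lines may help. This is a minimal sketch using only the Python standard library (3.8+); the helper name `installed_version` is hypothetical and not part of sinling.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("sinling")
    if found:
        print(f"sinling {found} is installed")
    else:
        print("sinling is not installed in this environment")
```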
## How to use
### Sinhala Tokenizer
```python
from sinling import SinhalaTokenizer

tokenizer = SinhalaTokenizer()
sentence = '...' # your sentence
tokenizer.tokenize(sentence)
```

### Sinhala Stemmer (Experimental)
```python
from sinling import SinhalaStemmer

stemmer = SinhalaStemmer()
word = '...' # your word
stemmer.stem(word)
```

Please cite [sinhala-stemmer](https://github.com/rksk/sinhala-news-analysis/tree/master/sinhala-stemmer) if you are using this implementation.
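
The tokenizer and stemmer can be chained to stem every token of a sentence. The sketch below is one possible way to do that, not part of sinling itself: `stem_sentence` is a hypothetical helper that only relies on the `tokenize` and `stem` calls shown above, and it makes no assumption about what `stem` returns.

```python
def stem_sentence(sentence, tokenizer, stemmer):
    """Tokenize `sentence`, then pair each token with the stemmer's output.

    `tokenizer` must provide `tokenize(str)` and `stemmer` must provide
    `stem(str)`, matching the calls used in the snippets above.
    """
    return [(token, stemmer.stem(token)) for token in tokenizer.tokenize(sentence)]


if __name__ == "__main__":
    try:
        from sinling import SinhalaStemmer, SinhalaTokenizer
    except ImportError:
        print("sinling is not installed; run `pip install sinling` first")
    else:
        sentence = '...'  # your sentence
        pairs = stem_sentence(sentence, SinhalaTokenizer(), SinhalaStemmer())
        for token, stemmed in pairs:
            print(token, stemmed)
```

Passing the tokenizer and stemmer in as arguments keeps the helper easy to test with stand-in objects and avoids rebuilding them for every sentence.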
### Part-of-Speech Tagger
```python
from sinling import SinhalaTokenizer, POSTagger

tokenizer = SinhalaTokenizer()
document = '...' # may contain multiple sentences
tokenized_sentences = [tokenizer.tokenize(f'{ss}.') for ss in tokenizer.split_sentences(document)]
tagger = POSTagger()
pos_tags = tagger.predict(tokenized_sentences)
```

### Word Joiner (Morphological Joiner)
```python
from sinling import preprocess, word_joiner

w1 = preprocess('මුනි')
w2 = preprocess('උතුමා')
results = word_joiner.join(w1, w2)
# Returns a list of possible results after applying join rules ['මුනිතුමා', ...]
```

### Word Splitter (Morphological Splitter) / corpus based - *experimental*
```python
from sinling import word_splitter

word = '...'
results = word_splitter.split(word)
# Returns a dict containing debug information, base word and affix
```

Visit [here](https://github.com/ysenarath/sinling/blob/master/notebooks/splitter.ipynb) to see some sample splits.
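
To inspect splits for several words at once, something like the following could be used. It is a sketch under stated assumptions: `split_all` and `format_results` are hypothetical helpers, not sinling functions, and the only thing assumed about `split` is that it returns a dict, as noted above. No particular keys are assumed.

```python
def split_all(words, splitter):
    """Map each word to whatever `splitter.split(word)` returns."""
    return {word: splitter.split(word) for word in words}


def format_results(results):
    """Render each word followed by the key/value pairs of its result dict."""
    lines = []
    for word, info in results.items():
        lines.append(word)
        for key, value in info.items():
            lines.append(f"  {key}: {value}")
    return "\n".join(lines)
```

With sinling installed, `print(format_results(split_all(words, word_splitter)))` prints every field the splitter reports for each entry in your own list `words`.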
## Contributions
- Contact `[email protected]` if you would like to contribute to this project.

## License
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/