https://github.com/apdullahyayik/TrTokenizer
🧩 A simple sentence tokenizer.
- Host: GitHub
- URL: https://github.com/apdullahyayik/TrTokenizer
- Owner: apdullahyayik
- License: MIT
- Created: 2020-01-03T19:18:53.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-10-31T07:54:03.000Z (over 1 year ago)
- Last Synced: 2024-10-11T12:46:05.697Z (8 months ago)
- Topics: regular-expression, sentence-tokenizer, turkish-language, turkish-nlp, word-segmentation, word-tokenizing
- Language: Python
- Homepage:
- Size: 480 KB
- Stars: 20
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
- turkish-nlp-resources - TrTokenizer
README
# 🧩 TrTokenizer
A simple sentence tokenizer.
[PyPI](https://pypi.org/project/trtokenizer/)

## Overview
**TrTokenizer** provides sentence and word tokenization for Turkish, tailored to the language's extensive orthographic conventions. If you're seeking robust, fast, and accurate tokenization for natural language models, you've come to the right place. Sentence tokenization relies on a list of non-suffix keywords (abbreviations that end with a period but do not end a sentence) kept in the `tr_non_suffixes` file. Developers can conveniently extend this file; lines starting with `#` are treated as comments. All regular expressions are pre-compiled for optimal performance.
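To make that approach concrete, here is a minimal, hypothetical sketch of the general technique (a pre-compiled boundary regex plus an abbreviation list). It is not TrTokenizer's actual implementation, and the names `NON_SUFFIXES` and `naive_sentence_split` are invented for illustration:

```python
import re

# Hypothetical stand-ins for entries in the 'tr_non_suffixes' file:
# abbreviations that end with a period but do not end a sentence.
NON_SUFFIXES = {"Dr.", "Prof.", "vb.", "örn."}

# Pre-compiling the pattern once keeps repeated tokenization fast.
CANDIDATE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def naive_sentence_split(paragraph: str) -> list[str]:
    """Split on sentence punctuation, then repair splits caused by
    known abbreviations. A sketch of the technique, not the library."""
    sentences: list[str] = []
    for part in CANDIDATE_BOUNDARY.split(paragraph):
        # A boundary right after a known abbreviation is a false positive.
        if sentences and sentences[-1].split()[-1] in NON_SUFFIXES:
            sentences[-1] += " " + part
        else:
            sentences.append(part)
    return sentences

print(naive_sentence_split("Dr. Yayık toplantıya katıldı. Sunum başladı."))
# ['Dr. Yayık toplantıya katıldı.', 'Sunum başladı.']
```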
## Installation
You can install **TrTokenizer** via pip:
```sh
pip install trtokenizer
```

## Usage
Here's how you can use **TrTokenizer** in your Python projects:

```python
from trtokenizer.tr_tokenizer import SentenceTokenizer, WordTokenizer

# Initialize a SentenceTokenizer object
sentence_tokenizer = SentenceTokenizer()

# Tokenize a given paragraph as a string
sentence_tokenizer.tokenize("Your paragraph goes here.")

# Initialize a WordTokenizer object
word_tokenizer = WordTokenizer()

# Tokenize a given sentence as a string
word_tokenizer.tokenize("Your sentence goes here.")
```
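For a more concrete picture, the round trip below tokenizes a short Turkish paragraph into sentences and then into words. The printed output is an assumption: this README does not document the return types, so the example assumes the tokenizers return lists of strings; verify against the installed package:

```python
from trtokenizer.tr_tokenizer import SentenceTokenizer, WordTokenizer

sentence_tokenizer = SentenceTokenizer()
word_tokenizer = WordTokenizer()

paragraph = "Merhaba dünya. Bugün hava çok güzel."
for sentence in sentence_tokenizer.tokenize(paragraph):
    print(word_tokenizer.tokenize(sentence))

# Assuming list-of-strings outputs, something like:
# ['Merhaba', 'dünya']
# ['Bugün', 'hava', 'çok', 'güzel']
```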
## To-do List
Our to-do list includes:
- Usage examples (Complete)
- Cython C-API for enhanced performance (Complete; see `build/tr_tokenizer.c` and the build sketch after this list)
- Release platform-specific shared dynamic libraries (Complete, e.g., `build/tr_tokenizer.cpython-38-x86_64-linux-gnu.so`, available for Debian Linux with the GCC compiler)
- Document any limitations
- Provide a straightforward guide for contributing
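The Cython items above are marked complete in the repository. For readers unfamiliar with that workflow, here is a generic, hypothetical `setup.py` sketch for compiling a pure-Python module into a shared library with Cython; it is not taken from this repo's build configuration, so check the repository's own build files for the real setup:

```python
# setup.py -- generic Cython build sketch (hypothetical, not this
# repository's actual configuration).
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="trtokenizer",
    # cythonize() also accepts pure-Python sources, producing a C
    # extension that is imported like the original module.
    ext_modules=cythonize("trtokenizer/tr_tokenizer.py"),
)
```

Running `python setup.py build_ext --inplace` would then emit a platform-specific shared object, analogous to the `tr_tokenizer.cpython-38-x86_64-linux-gnu.so` mentioned above.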
## Additional Resources
Explore more about natural language processing and related topics:
- [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)
- [Bogazici University CMPE-561](https://www.cmpe.boun.edu.tr/tr/courses/cmpe561)