https://github.com/apdullahyayik/TrTokenizer
🧩 A simple sentence tokenizer.
- Host: GitHub
- URL: https://github.com/apdullahyayik/TrTokenizer
- Owner: apdullahyayik
- License: MIT
- Created: 2020-01-03T19:18:53.000Z (over 5 years ago)
- Default Branch: master
- Last Pushed: 2023-10-31T07:54:03.000Z (over 1 year ago)
- Last Synced: 2024-10-11T12:46:05.697Z (8 months ago)
- Topics: regular-expression, sentence-tokenizer, turkish-language, turkish-nlp, word-segmentation, word-tokenizing
- Language: Python
- Homepage:
- Size: 480 KB
- Stars: 20
- Watchers: 3
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
Awesome Lists containing this project
- turkish-nlp-resources - TrTokenizer
README
# 🧩 TrTokenizer
A simple sentence tokenizer.
[PyPI](https://pypi.org/project/trtokenizer/)

## Overview
**TrTokenizer** provides sentence and word tokenization for Turkish, tailored to the language's extensive orthographic conventions. If you're seeking robust, fast, and accurate tokenization for natural language models, you've come to the right place. Sentence tokenization relies on a list of non-suffix keywords (abbreviations that end with a period but do not end a sentence) kept in the `tr_non_suffixes` file. Developers can conveniently extend this file; lines starting with `#` are treated as comments. All regular expressions are pre-compiled for optimal performance.
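To make that approach concrete, here is a minimal, hypothetical sketch of the general technique (a pre-compiled boundary regex plus an abbreviation list). It is not TrTokenizer's actual implementation, and the names `NON_SUFFIXES` and `naive_sentence_split` are invented for illustration:

```python
import re

# Hypothetical stand-ins for entries in the 'tr_non_suffixes' file:
# abbreviations that end with a period but do not end a sentence.
NON_SUFFIXES = {"Dr.", "Prof.", "vb.", "örn."}

# Pre-compiling the pattern once keeps repeated tokenization fast.
CANDIDATE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def naive_sentence_split(paragraph: str) -> list[str]:
    """Split on sentence punctuation, then repair splits caused by
    known abbreviations. A sketch of the technique, not the library."""
    sentences: list[str] = []
    for part in CANDIDATE_BOUNDARY.split(paragraph):
        # A boundary right after a known abbreviation is a false positive.
        if sentences and sentences[-1].split()[-1] in NON_SUFFIXES:
            sentences[-1] += " " + part
        else:
            sentences.append(part)
    return sentences

print(naive_sentence_split("Dr. Yayık toplantıya katıldı. Sunum başladı."))
# ['Dr. Yayık toplantıya katıldı.', 'Sunum başladı.']
```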
## Installation
You can install **TrTokenizer** via pip:
```sh
pip install trtokenizer
```

## Usage
Here's how you can use **TrTokenizer** in your Python projects:

```python
from trtokenizer.tr_tokenizer import SentenceTokenizer, WordTokenizer

# Initialize a SentenceTokenizer object
sentence_tokenizer = SentenceTokenizer()

# Tokenize a given paragraph as a string
sentence_tokenizer.tokenize("Your paragraph goes here.")

# Initialize a WordTokenizer object
word_tokenizer = WordTokenizer()

# Tokenize a given sentence as a string
word_tokenizer.tokenize("Your sentence goes here.")
```
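For a more concrete picture, the round trip below tokenizes a short Turkish paragraph into sentences and then into words. The printed output is an assumption: this README does not document the return types, so the example assumes the tokenizers return lists of strings; verify against the installed package:

```python
from trtokenizer.tr_tokenizer import SentenceTokenizer, WordTokenizer

sentence_tokenizer = SentenceTokenizer()
word_tokenizer = WordTokenizer()

paragraph = "Merhaba dünya. Bugün hava çok güzel."
for sentence in sentence_tokenizer.tokenize(paragraph):
    print(word_tokenizer.tokenize(sentence))

# Assuming list-of-strings outputs, something like:
# ['Merhaba', 'dünya']
# ['Bugün', 'hava', 'çok', 'güzel']
```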
## To-do List
Our to-do list includes:
- Usage examples (Complete)
- Cython C-API for enhanced performance (Complete; see `build/tr_tokenizer.c` and the build sketch after this list)
- Release platform-specific shared dynamic libraries (Complete, e.g., `build/tr_tokenizer.cpython-38-x86_64-linux-gnu.so`, available for Debian Linux with the GCC compiler)
- Document any limitations
- Provide a straightforward guide for contributing
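The Cython items above are marked complete in the repository. For readers unfamiliar with that workflow, here is a generic, hypothetical `setup.py` sketch for compiling a pure-Python module into a shared library with Cython; it is not taken from this repo's build configuration, so check the repository's own build files for the real setup:

```python
# setup.py -- generic Cython build sketch (hypothetical, not this
# repository's actual configuration).
from setuptools import setup
from Cython.Build import cythonize

setup(
    name="trtokenizer",
    # cythonize() also accepts pure-Python sources, producing a C
    # extension that is imported like the original module.
    ext_modules=cythonize("trtokenizer/tr_tokenizer.py"),
)
```

Running `python setup.py build_ext --inplace` would then emit a platform-specific shared object, analogous to the `tr_tokenizer.cpython-38-x86_64-linux-gnu.so` mentioned above.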
## Additional Resources
Explore more about natural language processing and related topics:
- [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)
- [Bogazici University CMPE-561](https://www.cmpe.boun.edu.tr/tr/courses/cmpe561)