{"id":16225674,"url":"https://github.com/apdullahyayik/TrTokenizer","last_synced_at":"2025-10-25T00:31:16.345Z","repository":{"id":41239806,"uuid":"231650353","full_name":"apdullahyayik/TrTokenizer","owner":"apdullahyayik","description":"🧩  A simple sentence tokenizer.","archived":false,"fork":false,"pushed_at":"2023-10-31T07:54:03.000Z","size":492,"stargazers_count":20,"open_issues_count":0,"forks_count":1,"subscribers_count":3,"default_branch":"master","last_synced_at":"2024-10-11T12:46:05.697Z","etag":null,"topics":["regular-expression","sentence-tokenizer","turkish-language","turkish-nlp","word-segmentation","word-tokenizing"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/apdullahyayik.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-03T19:18:53.000Z","updated_at":"2024-08-20T08:08:28.000Z","dependencies_parsed_at":"2022-08-31T10:21:39.237Z","dependency_job_id":null,"html_url":"https://github.com/apdullahyayik/TrTokenizer","commit_stats":null,"previous_names":[],"tags_count":2,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apdullahyayik%2FTrTokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apdullahyayik%2FTrTokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apdullahyayik%2FTrTokenizer/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/apdullahyayik%2FTrTokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/apdullahyayik","download_url":"htt
ps://codeload.github.com/apdullahyayik/TrTokenizer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":238053514,"owners_count":19408699,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["regular-expression","sentence-tokenizer","turkish-language","turkish-nlp","word-segmentation","word-tokenizing"],"created_at":"2024-10-10T12:45:55.546Z","updated_at":"2025-10-25T00:31:11.012Z","avatar_url":"https://github.com/apdullahyayik.png","language":"Python","readme":"# 🧩 TrTokenizer\n\nA simple sentence tokenizer.\n\n[![Python Version](https://img.shields.io/pypi/pyversions/trtokenizer.svg?style=for-the-badge)](https://pypi.org/project/trtokenizer/)\n[![PyPI Version](https://img.shields.io/pypi/v/trtokenizer.svg?style=for-the-badge)](https://pypi.org/project/trtokenizer/)\n\n## Overview\n\n**TrTokenizer** is a comprehensive solution for Turkish sentence and word tokenization, tailored to accommodate extensive language conventions. If you're seeking robust, fast, and accurate tokenization for natural language models, you've come to the right place. Our sentence tokenization approach employs a list of non-suffix keywords found in the 'tr_non_suffixes' file. Developers can conveniently expand this file, and lines starting with '#' are treated as comments. 
We've designed regular expressions that are pre-compiled for optimal performance.\n\n## Installation\n\nYou can install **TrTokenizer** via pip:\n\n```sh\npip install trtokenizer\n```\n\n## Usage\n\nHere's how you can use **TrTokenizer** in your Python projects:\n\n```python\nfrom trtokenizer.tr_tokenizer import SentenceTokenizer, WordTokenizer\n\n# Initialize a SentenceTokenizer object\nsentence_tokenizer = SentenceTokenizer()\n\n# Tokenize a given paragraph as a string\nsentence_tokenizer.tokenize(\"Your paragraph goes here.\")\n\n# Initialize a WordTokenizer object\nword_tokenizer = WordTokenizer()\n\n# Tokenize a given sentence as a string\nword_tokenizer.tokenize(\"Your sentence goes here.\")\n```\n\n## To-do List\n\nOur to-do list includes:\n\n- Usage examples (Complete)\n- Cython C-API for enhanced performance (Complete, see `build/tr_tokenizer.c`)\n- Release platform-specific shared dynamic libraries (Complete, e.g., `build/tr_tokenizer.cpython-38-x86_64-linux-gnu.so`, available for Debian Linux with GCC compiler)\n- Document any limitations\n- Provide a straightforward guide for contributing\n\n## Additional Resources\n\nExplore more about natural language processing and related topics:\n\n- [Speech and Language Processing](https://web.stanford.edu/~jurafsky/slp3/)\n- [Bogazici University CMPE-561](https://www.cmpe.boun.edu.tr/tr/courses/cmpe561)\n","funding_links":[],"categories":["Tools/Libraries"],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapdullahyayik%2FTrTokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fapdullahyayik%2FTrTokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fapdullahyayik%2FTrTokenizer/lists"}