{"id":20457831,"url":"https://github.com/pythainlp/nlpo3","last_synced_at":"2026-02-01T03:02:22.700Z","repository":{"id":41892707,"uuid":"365686584","full_name":"PyThaiNLP/nlpo3","owner":"PyThaiNLP","description":"Thai natural language processing library in Rust, with Python and Node bindings.","archived":false,"fork":false,"pushed_at":"2024-11-13T16:55:08.000Z","size":1142,"stargazers_count":35,"open_issues_count":7,"forks_count":8,"subscribers_count":4,"default_branch":"main","last_synced_at":"2025-03-30T00:06:21.073Z","etag":null,"topics":["hacktoberfest","natural-language-processing","nodejs","python","rust","text-processing","thai-language","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/PyThaiNLP.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-05-09T06:47:35.000Z","updated_at":"2025-02-17T04:13:22.000Z","dependencies_parsed_at":"2024-06-21T14:17:27.184Z","dependency_job_id":"434f871c-e841-4eaa-a927-238bdb38dde2","html_url":"https://github.com/PyThaiNLP/nlpo3","commit_stats":{"total_commits":341,"total_committers":7,"mean_commits":"48.714285714285715","dds":0.4897360703812317,"last_synced_commit":"0c537684e2f9dc65a651c4032a08707bc1b9913b"},"previous_names":[],"tags_count":15,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fnlpo3","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fnlpo3/tags","releases_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fnlpo3/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/PyThaiNLP%2Fnlpo3/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/PyThaiNLP","download_url":"https://codeload.github.com/PyThaiNLP/nlpo3/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247419859,"owners_count":20936012,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["hacktoberfest","natural-language-processing","nodejs","python","rust","text-processing","thai-language","tokenizer"],"created_at":"2024-11-15T12:09:22.958Z","updated_at":"2026-02-01T03:02:22.691Z","avatar_url":"https://github.com/PyThaiNLP.png","language":"Rust","readme":"---\nSPDX-FileCopyrightText: 2024-2026 PyThaiNLP Project\nSPDX-License-Identifier: Apache-2.0\n---\n\n# nlpO3\n\n[![crates.io](https://img.shields.io/crates/v/nlpo3.svg \"crates.io\")](https://crates.io/crates/nlpo3/)\n[![Apache-2.0](https://img.shields.io/badge/License-Apache%202.0-blue.svg \"Apache-2.0\")](https://opensource.org/license/apache-2-0)\n[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.14082448.svg)](https://doi.org/10.5281/zenodo.14082448)\n\nA Thai natural language processing library written in Rust with optional\nPython and Node.js bindings. 
Formerly known as `oxidized-thainlp`.\n\nUsing in a Rust project\n\n```shell\ncargo add nlpo3\n```\n\nUsing in a Python project\n\n```shell\npip install nlpo3\n```\n\n## Table of contents\n\n- [Features](#features)\n- [Use](#use)\n  - [Node.js binding](#nodejs-binding)\n  - [Python binding](#python-binding)\n  - [Rust library](#rust-library)\n  - [Command-line interface](#command-line-interface)\n  - [Dictionary](#dictionary)\n- [Build](#build)\n- [Develop](#develop)\n- [License](#license)\n\n## Features\n\n- Thai word tokenizer\n  - Uses a maximal-matching, dictionary-based tokenization algorithm\n    and respects [Thai Character Cluster][tcc] boundaries.\n    - Approximately [2.5× faster][benchmark] than the comparable pure-Python\n      implementation (PyThaiNLP's `newmm`).\n  - Load a dictionary from a plain text file (one word per line)\n    or from `Vec\u003cString\u003e`\n\n[tcc]: https://dl.acm.org/doi/10.1145/355214.355225\n[benchmark]: ./nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb\n\n## Use\n\n### Node.js binding\n\nSee [nlpo3-nodejs](./nlpo3-nodejs/).\n\n### Python binding\n\n[![PyPI](https://img.shields.io/pypi/v/nlpo3.svg \"PyPI\")](https://pypi.python.org/pypi/nlpo3)\n\nExample:\n\n```python\nfrom nlpo3 import load_dict, segment\n\nload_dict(\"path/to/dict.file\", \"dict_name\")\nsegment(\"สวัสดีครับ\", \"dict_name\")\n```\n\nSee more at [nlpo3-python](./nlpo3-python/).\n\n### Rust library\n\n[![crates.io](https://img.shields.io/crates/v/nlpo3.svg \"crates.io\")](https://crates.io/crates/nlpo3/)\n\n#### Add as a dependency\n\nTo add `nlpo3` to your project's dependencies:\n\n```shell\ncargo add nlpo3\n```\n\nThis updates `Cargo.toml` with:\n\n```toml\n[dependencies]\nnlpo3 = \"1.4.0\"\n```\n\n#### Example\n\nCreate a tokenizer from a dictionary file and use it to tokenize a string\n(safe mode = true, parallel mode = false):\n\n```rust\nuse nlpo3::tokenizer::newmm::NewmmTokenizer;\nuse nlpo3::tokenizer::tokenizer_trait::Tokenizer;\n\nlet 
tokenizer = NewmmTokenizer::new(\"path/to/dict.file\");\nlet tokens = tokenizer.segment(\"ห้องสมุดประชาชน\", true, false).unwrap();\n```\n\nCreate a tokenizer from a vector of strings:\n\n```rust\nlet words = vec![\"ปาลิเมนต์\".to_string(), \"คอนสติติวชั่น\".to_string()];\nlet tokenizer = NewmmTokenizer::from_word_list(words);\n```\n\nAdd words to an existing tokenizer:\n\n```rust\ntokenizer.add_word(\u0026[\"มิวเซียม\"]);\n```\n\nRemove words from an existing tokenizer:\n\n```rust\ntokenizer.remove_word(\u0026[\"กระเพรา\", \"ชานชลา\"]);\n```\n\n### Command-line interface\n\n[![crates.io](https://img.shields.io/crates/v/nlpo3-cli.svg \"crates.io\")](https://crates.io/crates/nlpo3-cli/)\n\nExample:\n\n```bash\necho \"ฉันกินข้าว\" | nlpo3 segment\n```\n\nSee more at [nlpo3-cli](./nlpo3-cli/).\n\n### Dictionary\n\n- To keep the library small, `nlpO3` does not include a dictionary; users must\n  provide one to use the dictionary-based word tokenizer.\n- For a tokenization dictionary, try\n  - [words_th.txt][dict-pythainlp] from [PyThaiNLP][pythainlp]\n    - ~62,000 words\n    - CC0-1.0\n  - [word break dictionary][dict-libthai] from [libthai][libthai]\n    - consists of dictionaries in different categories, with a make script\n    - LGPL-2.1\n\n[pythainlp]: https://github.com/PyThaiNLP/pythainlp\n[libthai]: https://github.com/tlwg/libthai/\n[dict-pythainlp]: https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/corpus/words_th.txt\n[dict-libthai]: https://github.com/tlwg/libthai/tree/master/data\n\n## Build\n\n### Requirements\n\n- [Rust 2018 Edition](https://www.rust-lang.org/tools/install)\n\n### Steps\n\nRun the tests:\n\n```bash\ncargo test\n```\n\nBuild the API documentation and open it to check:\n\n```bash\ncargo doc --open\n```\n\nBuild (remove `--release` to keep debug information):\n\n```bash\ncargo build --release\n```\n\nCheck `target/` for build artifacts.\n\n## Develop\n\n### Development 
document\n\n- [Notes on custom string](src/NOTE_ON_STRING.md)\n\n### Issues\n\n- Please report issues at \u003chttps://github.com/PyThaiNLP/nlpo3/issues\u003e\n\n## License\n\nnlpO3 is copyrighted by its authors\nand licensed under the terms of the Apache License 2.0 (Apache-2.0).\nSee the [LICENSE](./LICENSE) file for details.\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythainlp%2Fnlpo3","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpythainlp%2Fnlpo3","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpythainlp%2Fnlpo3/lists"}