{"id":19162888,"url":"https://github.com/centre-for-humanities-computing/chinese-tokenizer","last_synced_at":"2026-06-17T11:30:19.941Z","repository":{"id":99222232,"uuid":"234040852","full_name":"centre-for-humanities-computing/chinese-tokenizer","owner":"centre-for-humanities-computing","description":"A Rusty way of tokenizing Chinese texts","archived":false,"fork":false,"pushed_at":"2020-01-20T12:37:22.000Z","size":10,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-01-03T21:42:12.791Z","etag":null,"topics":["jieba","rust","tokenizer"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/centre-for-humanities-computing.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.md","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-01-15T09:04:07.000Z","updated_at":"2020-02-04T14:09:21.000Z","dependencies_parsed_at":null,"dependency_job_id":"a7c93b9f-abd8-4460-a01d-9ba63e81ab37","html_url":"https://github.com/centre-for-humanities-computing/chinese-tokenizer","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/centre-for-humanities-computing%2Fchinese-tokenizer","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/centre-for-humanities-computing%2Fchinese-tokenizer/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/centre-for-humanities-computing%2Fchinese-tokenizer/releases","manifests_ur
l":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/centre-for-humanities-computing%2Fchinese-tokenizer/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/centre-for-humanities-computing","download_url":"https://codeload.github.com/centre-for-humanities-computing/chinese-tokenizer/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240245885,"owners_count":19771029,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["jieba","rust","tokenizer"],"created_at":"2024-11-09T09:13:26.383Z","updated_at":"2026-06-17T11:30:19.897Z","avatar_url":"https://github.com/centre-for-humanities-computing.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# A Rust-y tokenizer for Chinese texts #\n\nThis is a short program for tokenizing Chinese text, using a Rust port of jieba.\n\nThe default tokenizer is a maximum likelihood matching algorithm working from a Chinese lexicon (i.e. dictionary-based). However, jieba-rs also implements a Hidden Markov Model tokenizer. The preferred tokenizer can be easily selected by making the necessary changes in src/main.rs.\n\n## Getting started\n\nIn order to run on your machine, you'll need to first install Rust and the Cargo package manager. This is done a number of different ways, depending on whether you use macOS, Linux, or Windows. 
You can find more information on how to do this [here](https://www.rust-lang.org/tools/install) and [here](https://doc.rust-lang.org/cargo/getting-started/installation.html).\n\nOnce that's completed, you'll need to copy your data into the empty 'data' folder. Note that the current structure of this program only allows for folder structures one level deep. In other words:\n\n``` \ndata/subfolder/file.txt\n```\n\nBe sure to check the comments at the beginning of src/main.rs. Some paths and variables may need to be modified to suit your needs.\n\n## Building the program\n\nWith Rust, you have two options when running the program. Firstly, you can simply do the following in the root directory:\n\n```\ncargo run --release\n``` \n\nThis builds the local package and executes the binary. However, you can also run these steps separately. \n\nFirst build:\n\n```\ncargo build --release\n```\n\nThen run:\n```\n./target/release/chinese\n```\n\nNote that in both cases, we're using the --release flag. This prompts the compiler to perform optimisations which substantially improve performance of the tokenizer.\n\n## NB!\n\nThis was written quite quickly to solve a specific problem and is still essentially work-in-progress. It will work for any collection of Chinese texts, as long as the corpus is structured in the format outlined above. However, I hope at some point to return to this and make it more flexible, as well as offering the user the chance to set certain flags. \n\n\n## Author\nAuthor:\t\t[rdkm89](https://github.com/rdkm89) \u003cbr\u003e\nDate:\t\t2020-01-13\n\n## Built with\n\nThis tokenizer pipeline is dependent on _jieba-rs_ by GitHub user [messense](https://github.com/messense). 
The original repo for that project can be found [here](https://github.com/messense/jieba-rs).\n\n## License\n\nThis project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.\n\n \n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcentre-for-humanities-computing%2Fchinese-tokenizer","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcentre-for-humanities-computing%2Fchinese-tokenizer","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcentre-for-humanities-computing%2Fchinese-tokenizer/lists"}