{"id":22089922,"url":"https://github.com/dhchenx/rsnltk","last_synced_at":"2025-07-24T19:31:42.361Z","repository":{"id":57663732,"uuid":"454286842","full_name":"dhchenx/rsnltk","owner":"dhchenx","description":"Rust-based Natural Language Toolkit using Python Bindings","archived":false,"fork":false,"pushed_at":"2022-02-04T18:31:15.000Z","size":3427,"stargazers_count":16,"open_issues_count":0,"forks_count":1,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-08-10T19:03:08.539Z","etag":null,"topics":["human-language","natural-language-processing","nlp-in-rust","rsnltk","rust-text-analysis","stanza","text-analysis"],"latest_commit_sha":null,"homepage":"https://crates.io/crates/rsnltk","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dhchenx.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2022-02-01T06:40:23.000Z","updated_at":"2024-08-08T05:41:43.000Z","dependencies_parsed_at":"2022-08-28T02:10:53.746Z","dependency_job_id":null,"html_url":"https://github.com/dhchenx/rsnltk","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dhchenx%2Frsnltk","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dhchenx%2Frsnltk/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dhchenx%2Frsnltk/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dhchenx%2Frsnltk/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dhchenx","download_url":"https://codeload.github.com/dhchenx/rsnltk/tar.gz/r
efs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":227470068,"owners_count":17778930,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["human-language","natural-language-processing","nlp-in-rust","rsnltk","rust-text-analysis","stanza","text-analysis"],"created_at":"2024-12-01T02:14:44.159Z","updated_at":"2024-12-01T02:14:44.750Z","avatar_url":"https://github.com/dhchenx.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Rust-based Natural Language Toolkit (rsnltk)\nA Rust library to support natural language processing with pure Rust implementation and Python bindings\n\n[Rust Docs](https://docs.rs/rsnltk/0.1.1) | [Crates Home Page](https://crates.io/crates/rsnltk) | [Tests](https://github.com/dhchenx/rsnltk/tree/main/tests) | [NER-Kit](https://pypi.org/project/ner-kit/)\n\n![example workflow](https://github.com/dhchenx/rsnltk/actions/workflows/rust.yml/badge.svg)\n\n## Features\nThe `rsnltk` library integrates various existing Python-written NLP toolkits for powerful text analysis in Rust-based applications. 
\n\n## Functions\nThis toolkit builds on the Python-based [Stanza](https://stanfordnlp.github.io/stanza/) library and several NLP crates.\n\nThe functions bound from Stanza and other packages include:\n- Tokenization\n- Sentence Segmentation\n- Multi-Word Token Expansion\n- Part-of-Speech \u0026 Morphological Features\n- Named Entity Recognition\n- Sentiment Analysis\n- Language Identification\n- Dependency Tree Analysis\n\nSeveral crates are also wrapped in `rsnltk` with simplified APIs:\n- [word2vec](https://crates.io/crates/word2vec)\n- [natural](https://crates.io/crates/natural), [yn](https://crates.io/crates/yn), [whatlang](https://crates.io/crates/whatlang). \n\nAdditionally, we can calculate the similarity between words based on WordNet through the `semantic-kit` PyPI project, installed via `pip install semantic-kit`.\n\n## Installation\n\n1. Make sure Python 3.6.6+ and pip are installed on your computer. Running `python -V` in the terminal should print the version without an error;\n\n2. Install our Python-based [ner-kit](https://pypi.org/project/ner-kit/) (version\u003e=0.0.5a2), which binds the `Stanza` package, via `pip install ner-kit==0.0.5a2`;\n\n3. Install Rust on your computer. I use IntelliJ to develop Rust-based applications;\n\n4. Create a simple Rust application project with a `main()` function. \n\n5. Add the `rsnltk` dependency to the `Cargo.toml` file, using the latest version.\n\n6. After adding the `rsnltk` dependency to `Cargo.toml`, install the necessary Stanza language models by running the following Rust code the first time you use this package.\n\n```rust\nfn init_rsnltk_and_test(){\n    // 1. first install the necessary language models \n    // using language codes\n    let list_lang=vec![\"en\",\"zh\"]; \n    //e.g. you install two language models, \n    // namely, for English and Chinese text analysis.\n    download_langs(list_lang);\n    // 2. 
then do test NLP tasks\n    let text=\"I like Beijing!\";\n    let lang=\"en\";\n    // 2. Uncomment the below codes for Chinese NER\n    // let text=\"我喜欢北京、上海和纽约！\";\n    // let lang=\"zh\";\n    let list_ner=ner(text,lang);\n    for ner in list_ner{\n        println!(\"{:?}\",ner);\n    }\n}\n```\n\nOr you can manually install those [language models](https://stanfordnlp.github.io/stanza/available_models.html) via the Python-written `ner-kit` package which provides more features in using Stanza. Go to: [ner-kit](https://pypi.org/project/ner-kit/)\n\nIf no error occurs in the above example, then it works. Finally, you can try the following advanced example usage.\n\nCurrently, we tested the use of English and Chinese language models; however, other language models should work as well. \n\n## Examples with Stanza Bindings\n\nExample 1: Part-of-speech Analysis\n\n```rust\n    fn test_pos(){\n    //let text=\"我喜欢北京、上海和纽约！\";\n    //let lang=\"zh\";\n    let text=\"I like apple\";\n    let lang=\"en\";\n    let list_result=pos(text,lang);\n    for word in list_result{\n        println!(\"{:?}\",word);\n    }\n}\n```\n\nExample 2: Sentiment Analysis\n```rust\n    fn test_sentiment(){\n        //let text=\"I like Beijing!\";\n        //let lang=\"en\";\n        let text=\"我喜欢北京\";\n        let lang=\"zh\";\n        let sentiments=sentiment(text,lang);\n        for sen in sentiments{\n            println!(\"{:?}\",sen);\n        }\n    }\n```\n\nExample 3: Named Entity Recognition\n\n```rust\n    fn test_ner(){\n        // 1. for English NER\n        let text=\"I like Beijing!\";\n        let lang=\"en\";\n        // 2. 
Uncomment the below codes for Chinese NER\n        // let text=\"我喜欢北京、上海和纽约！\";\n        // let lang=\"zh\";\n        let list_ner=ner(text,lang);\n        for ner in list_ner{\n            println!(\"{:?}\",ner);\n        }\n    }\n```\n\nExample 4: Tokenize for Multiple Languages\n\n```rust\n    fn test_tokenize(){\n        let text=\"我喜欢北京、上海和纽约！\";\n        let lang=\"zh\";\n        let list_result=tokenize(text,lang);\n        for ner in list_result{\n            println!(\"{:?}\",ner);\n        }\n    }\n```\n\nExample 5: Tokenize Sentence\n\n```rust\n    fn test_tokenize_sentence(){\n        let text=\"I like apple. Do you like it? No, I am not sure!\";\n        let lang=\"en\";\n        let list_sentences=tokenize_sentence(text,lang);\n        for sentence in list_sentences{\n            println!(\"Sentence: {}\",sentence);\n        }\n    }\n```\n\nExample 6: Language Identification\n\n```rust\nfn test_lang(){\n    let list_text = vec![\"I like Beijing!\",\n                         \"我喜欢北京！\", \n                         \"Bonjour le monde!\"];\n    let list_result=lang(list_text);\n    for lang in list_result{\n        println!(\"{:?}\",lang);\n    }\n}\n```\n\nExample 7: MWT expand\n\n```rust\n    fn test_mwt_expand(){\n        let text=\"Nous avons atteint la fin du sentier.\";\n        let lang=\"fr\";\n        let list_result=mwt_expand(text,lang);\n    }\n```\n\nExample 8: Estimate the similarity between words in WordNet\n\nYou need to firstly install `semantic-kit` PyPI package!\n\n```rust\n    fn test_wordnet_similarity(){\n        let s1=\"dog.n.1\";\n        let s2=\"cat.n.2\";\n        let sims=wordnet_similarity(s1,s2);\n        for sim in sims{\n            println!(\"{:?}\",sim);\n        }\n    }\n```\n\nExample 9: Obtain a dependency tree from a text\n```rust\nfn test_dependency_tree(){\n    let text=\"I like you. 
Do you like me?\";\n    let lang=\"en\";\n    let list_results=dependency_tree(text,lang);\n    for list_token in list_results{\n        for token in list_token{\n            println!(\"{:?}\",token)\n        }\n\n    }\n}\n```\n\n## Examples in Pure Rust\n\nExample 1: Word2Vec similarity\n\n```rust\nfn test_open_wv_bin(){\n    let wv_model=wv_get_model(\"GoogleNews-vectors-negative300.bin\");\n    let positive = vec![\"woman\", \"king\"];\n    let negative = vec![\"man\"];\n    println!(\"analogy: {:?}\", wv_analogy(\u0026wv_model,positive, negative, 10));\n    println!(\"cosine: {:?}\", wv_cosine(\u0026wv_model,\"man\", 10));\n}\n```\n\nExample 2: Text summarization\n\n```rust\n    use rsnltk::native::summarizer::*;\n    fn test_summarize(){\n        let text=\"Some large txt...\";\n        let stopwords=\u0026[];\n        let summarized_text=summarize(text,stopwords,5);\n        println!(\"{}\",summarized_text);\n    }\n```\n\nExample 3: Get a token list from English strings\n```rust\nuse rsnltk::native::token::get_token_list;\nfn test_get_token_list(){\n        let s=\"Hello, Rust. How are you?\";\n        let result=get_token_list(s);\n        for r in result{\n            println!(\"{}\\t{:?}\",r.text,r);\n        }\n}\n```\n\nExample 4: Word segmentation for languages with no spaces between terms, e.g. 
Chinese text.\n\nWe implement three word segmentation methods in this version:\n\n- Forward Maximum Matching (fmm), which is the baseline method\n- Backward Maximum Matching (bmm), which is considered better\n- Bidirectional Maximum Matching (bimm), which offers high accuracy but low speed\n\n```rust\nuse rsnltk::native::segmentation::*;\nfn test_real_word_segmentation(){\n    let dict_path=\"30wdict.txt\"; // empty if only for tokenizing\n    let stop_path=\"baidu_stopwords.txt\";// empty when no stop words\n    let _sentence=\"美国太空总署希望，在深海的探险发现将有助于解开一些外太空的秘密，\\\n    同时也可以测试前往太阳系其他星球探险所需的一些设备和实验。\";\n    let meaningful_words=get_segmentation(_sentence,dict_path,stop_path, \"bimm\");\n    // bimm can be changed to fmm or bmm. \n    println!(\"Result: {:?}\",meaningful_words);\n}\n```\n\n## Credits\n\nThanks to the [Stanford NLP Group](https://github.com/stanfordnlp/stanza) for their hard work on [Stanza](https://stanfordnlp.github.io/stanza/). \n\n## License\nThe `rsnltk` library is provided by [Donghua Chen](https://github.com/dhchenx) under the MIT License. \n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdhchenx%2Frsnltk","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdhchenx%2Frsnltk","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdhchenx%2Frsnltk/lists"}