{"id":18772740,"url":"https://github.com/sctg-development/sentencepiece-js","last_synced_at":"2025-04-13T08:32:04.824Z","repository":{"id":257811193,"uuid":"867779303","full_name":"sctg-development/sentencepiece-js","owner":"sctg-development","description":"sentencepiece port to webassembly with browser compatibility","archived":false,"fork":false,"pushed_at":"2024-10-07T16:33:29.000Z","size":8872,"stargazers_count":13,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-10T10:42:06.512Z","etag":null,"topics":["ai","sentencepiece","tokenizer"],"latest_commit_sha":null,"homepage":"https://sctg-development.github.io/sentencepiece-js/","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sctg-development.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["sctg-development"]}},"created_at":"2024-10-04T17:48:11.000Z","updated_at":"2025-03-06T10:24:46.000Z","dependencies_parsed_at":null,"dependency_job_id":"d5aafe44-f842-4c26-8f00-00e196dd5cc6","html_url":"https://github.com/sctg-development/sentencepiece-js","commit_stats":null,"previous_names":["sctg-development/sentencepiece-js"],"tags_count":6,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sctg-development%2Fsentencepiece-js","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sctg-development%2Fsentencepiece-js/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/
sctg-development%2Fsentencepiece-js/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sctg-development%2Fsentencepiece-js/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sctg-development","download_url":"https://codeload.github.com/sctg-development/sentencepiece-js/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248684351,"owners_count":21145061,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","sentencepiece","tokenizer"],"created_at":"2024-11-07T19:30:02.650Z","updated_at":"2025-04-13T08:32:04.818Z","avatar_url":"https://github.com/sctg-development.png","language":"TypeScript","funding_links":["https://github.com/sponsors/sctg-development"],"categories":[],"sub_categories":[],"readme":"# Javascript wrapper for the sentencepiece library\n\n![Build React App](https://github.com/sctg-development/sentencepiece-js/actions/workflows/build_react.yaml/badge.svg)\n![Publish to npmjs registry](https://github.com/sctg-development/sentencepiece-js/actions/workflows/build.yaml/badge.svg)\n\n## Browser Demo\n\nYou can see Sentencepiece-js in action for counting and displaying tokens using the [Meta Llama 3.1 tokenizer model](https://huggingface.co/spaces/Xanthius/llama-token-counter/blob/main/tokenizer.model) on GitHub Pages: https://sctg-development.github.io/sentencepiece-js/. All computations are performed in your browser, and no data is sent to the server. 
To display the tokens, click on the `tokens` link.\n\nThis simple React app is located in the `tokenCount` directory of this repository. It is built with React 18, Vite, and the Fluent UI v9 framework.\n\n## Build\n\nSentencepiece is compiled to WebAssembly using Emscripten.\n\nTo rebuild this project:\n\n```bash\n\ngit clone --recurse-submodules https://github.com/sctg-development/sentencepiece-js.git\n\ncd sentencepiece-js\n\nnpm install\n\nnpm run build\n\n```\n\n## Use\n\nTo use this tool in Node.js, you can use the following code:\n\n```js\n\nconst { SentencePieceProcessor, cleanText } = require(\"../dist\");\nconst ROOT = require('app-root-path')\n\nasync function main() {\n\n    let text = \"I am still waiting on my card?\"\n    let cleaned = cleanText(text)\n\n    let spp = new SentencePieceProcessor()\n    await spp.load(`${ROOT}/test/30k-clean.model`)\n    let ids = spp.encodeIds(cleaned) // text -\u003e list of token ids (numbers)\n    console.log(ids)\n    let str = spp.decodeIds(ids) // list of token ids -\u003e text\n    console.log(str)\n\n    let pieces = spp.encodePieces(cleaned) // text -\u003e list of token pieces (strings)\n    console.log(pieces)\n}\nmain()\n\n```\n\nIn the browser, you can use the following code (see the `tokenCount` directory for a full example):\n\n```js\nimport { SentencePieceProcessor, cleanText, llama_3_1_tokeniser_b64 } from \"@sctg/sentencepiece-js\";\n\n// built-in models: llama_3_1_tokeniser_b64, clean_30k_b64, smart_b64\nasync function main() {\n\n    let text = \"I am still waiting on my card?\"\n    let cleaned = cleanText(text)\n\n    let spp = new SentencePieceProcessor()\n    await spp.loadFromB64StringModel(llama_3_1_tokeniser_b64);\n    let ids = spp.encodeIds(cleaned) // text -\u003e list of token ids (numbers)\n    console.log(ids)\n    let str = spp.decodeIds(ids) // list of token ids -\u003e text\n    console.log(str)\n\n    let pieces = spp.encodePieces(cleaned) // text -\u003e list of token pieces (strings)\n    console.log(pieces)\n}\nmain()\n```\n\nSee https://github.com/sctg-development/ai-outlook/blob/HEAD/src/aipane/aipane.ts#L12-L34 for an example of how to use 
this in a React app.  \nSee also webpack.config.js for the webpack bundler configuration.\n\n- devilyouwei updated this repo so that the module supports the JS `require` keyword and added the usage example.\n- 2023-1-10, devilyouwei added `encodePieces`.\n- original author: https://github.com/JanKaul/sentencepiece\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsctg-development%2Fsentencepiece-js","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsctg-development%2Fsentencepiece-js","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsctg-development%2Fsentencepiece-js/lists"}