{"id":33123220,"url":"https://github.com/cedricrupb/code_tokenize","last_synced_at":"2025-11-19T23:02:17.150Z","repository":{"id":41082775,"uuid":"422895387","full_name":"cedricrupb/code_tokenize","owner":"cedricrupb","description":"Fast tokenization and structural analysis of any programming language","archived":false,"fork":false,"pushed_at":"2025-01-14T09:15:05.000Z","size":156,"stargazers_count":59,"open_issues_count":2,"forks_count":10,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-10-29T18:59:54.557Z","etag":null,"topics":["ast","code-analysis","language","parser","tokenization"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/cedricrupb.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2021-10-30T13:55:02.000Z","updated_at":"2025-08-24T05:06:36.000Z","dependencies_parsed_at":"2022-07-21T04:38:50.068Z","dependency_job_id":null,"html_url":"https://github.com/cedricrupb/code_tokenize","commit_stats":null,"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/cedricrupb/code_tokenize","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedricrupb%2Fcode_tokenize","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedricrupb%2Fcode_tokenize/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedricrupb%2Fcode_tokenize/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedricrupb%2Fcode_tokenize/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/cedricrupb","download_url
":"https://codeload.github.com/cedricrupb/code_tokenize/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/cedricrupb%2Fcode_tokenize/sbom","scorecard":{"id":270308,"data":{"date":"2025-08-11","repo":{"name":"github.com/cedricrupb/code_tokenize","commit":"6797bcf682edea672677bf3bce708d38f9d20dd0"},"scorecard":{"version":"v5.2.1-40-gf6ed084d","commit":"f6ed084d17c9236477efd66e5b258b9d4cc7b389"},"score":2.9,"checks":[{"name":"Code-Review","score":0,"reason":"Found 0/12 approved changesets -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project requires human code review before pull requests (aka merge requests) are merged.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#code-review"}},{"name":"Dangerous-Workflow","score":-1,"reason":"no workflows found","details":null,"documentation":{"short":"Determines if the project's GitHub Action workflows avoid dangerous patterns.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#dangerous-workflow"}},{"name":"Token-Permissions","score":-1,"reason":"No tokens found","details":null,"documentation":{"short":"Determines if the project's workflows follow the principle of least privilege.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#token-permissions"}},{"name":"Maintained","score":0,"reason":"0 commit(s) and 0 issue activity found in the last 90 days -- score normalized to 0","details":null,"documentation":{"short":"Determines if the project is \"actively maintained\".","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#maintained"}},{"name":"Packaging","score":-1,"reason":"packaging workflow not detected","details":["Warn: no GitHub/GitLab publishing workflow detected."],"documentation":{"short":"Determines if the project is published as 
a package that others can easily download, install, easily update, and uninstall.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#packaging"}},{"name":"Binary-Artifacts","score":10,"reason":"no binaries found in the repo","details":null,"documentation":{"short":"Determines if the project has generated executable (binary) artifacts in the source repository.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#binary-artifacts"}},{"name":"Pinned-Dependencies","score":-1,"reason":"no dependencies found","details":null,"documentation":{"short":"Determines if the project has declared and pinned the dependencies of its build process.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#pinned-dependencies"}},{"name":"CII-Best-Practices","score":0,"reason":"no effort to earn an OpenSSF best practices badge detected","details":null,"documentation":{"short":"Determines if the project has an OpenSSF (formerly CII) Best Practices Badge.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#cii-best-practices"}},{"name":"Security-Policy","score":0,"reason":"security policy file not detected","details":["Warn: no security policy file detected","Warn: no security file to analyze","Warn: no security file to analyze","Warn: no security file to analyze"],"documentation":{"short":"Determines if the project has published a security policy.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#security-policy"}},{"name":"Fuzzing","score":0,"reason":"project is not fuzzed","details":["Warn: no fuzzer integrations found"],"documentation":{"short":"Determines if the project uses 
fuzzing.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#fuzzing"}},{"name":"License","score":10,"reason":"license file detected","details":["Info: project has a license file: LICENSE:0","Info: FSF or OSI recognized license: MIT License: LICENSE:0"],"documentation":{"short":"Determines if the project has defined a license.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#license"}},{"name":"Signed-Releases","score":-1,"reason":"no releases found","details":null,"documentation":{"short":"Determines if the project cryptographically signs release artifacts.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#signed-releases"}},{"name":"Branch-Protection","score":0,"reason":"branch protection not enabled on development/release branches","details":["Warn: branch protection not enabled for branch 'main'"],"documentation":{"short":"Determines if the default and release branches are protected with GitHub's branch protection settings.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#branch-protection"}},{"name":"Vulnerabilities","score":9,"reason":"1 existing vulnerabilities detected","details":["Warn: Project is vulnerable to: GHSA-9hjg-9r4m-mvj7"],"documentation":{"short":"Determines if the project has open, known unfixed vulnerabilities.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#vulnerabilities"}},{"name":"SAST","score":0,"reason":"SAST tool is not run on all commits -- score normalized to 0","details":["Warn: 0 commits out of 19 are checked with a SAST tool"],"documentation":{"short":"Determines if the project uses static code 
analysis.","url":"https://github.com/ossf/scorecard/blob/f6ed084d17c9236477efd66e5b258b9d4cc7b389/docs/checks.md#sast"}}]},"last_synced_at":"2025-08-17T13:10:11.098Z","repository_id":41082775,"created_at":"2025-08-17T13:10:11.098Z","updated_at":"2025-08-17T13:10:11.098Z"},"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":285342137,"owners_count":27155385,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-11-19T02:00:05.673Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ast","code-analysis","language","parser","tokenization"],"created_at":"2025-11-15T05:00:43.636Z","updated_at":"2025-11-19T23:02:17.144Z","avatar_url":"https://github.com/cedricrupb.png","language":"Python","readme":"\u003cp align=\"center\"\u003e\n  \u003cimg height=\"150\" src=\"https://github.com/cedricrupb/ptokenizers/raw/main/resources/code_tokenize.svg\" /\u003e\n\u003c/p\u003e\n\n------------------------------------------------\n\u003e Fast tokenization and structural analysis of\nany programming language in Python\n\nProgramming Language Processing (PLP) brings the capabilities of modern NLP systems to the world of programming languages. \nTo achieve high performance PLP systems, existing methods often take advantage of the fully defined nature of programming languages. 
In particular, the syntactic structure can be exploited to gain knowledge about programs.\n\n**code.tokenize** provides easy access to the syntactic structure of a program. The tokenizer converts a program into a sequence of program tokens ready for further end-to-end processing.\nBy relating each token to an AST node, it is possible to easily extend the program representation with further syntactic information.\n\n## Installation\nThe package is tested under Python 3. It can be installed via:\n```\npip install code-tokenize\n```\n\n## Usage\ncode.tokenize can tokenize nearly any program code in a few lines of code:\n```python\nimport code_tokenize as ctok\n\n# Python\nctok.tokenize(\n    '''\n        def my_func():\n            print(\"Hello World\")\n    ''',\nlang = \"python\")\n\n# Output: [def, my_func, (, ), :, #NEWLINE#, ...]\n\n# Java\nctok.tokenize(\n    '''\n        public static void main(String[] args){\n          System.out.println(\"Hello World\");\n        }\n    ''',\nlang = \"java\", \nsyntax_error = \"ignore\")\n\n# Output: [public, static, void, main, (, String, [, ], args, ), {, System, ...]\n\n# JavaScript\nctok.tokenize(\n    '''\n        alert(\"Hello World\");\n    ''',\nlang = \"javascript\", \nsyntax_error = \"ignore\")\n\n# Output: [alert, (, \"Hello World\", ), ;]\n\n\n```\n\n## Supported languages\ncode.tokenize employs [tree-sitter](https://tree-sitter.github.io/tree-sitter/) as a backend. Therefore, in principle, any language supported by tree-sitter is also\nsupported by a tokenizer in code.tokenize.\n\nFor some languages, this library supports additional\nfeatures that are not directly supported by tree-sitter.\nTherefore, we distinguish between three language classes\nand support the following language identifiers:\n\n- `native`: python\n- `advanced`: java\n- `basic`: javascript, go, ruby, cpp, c, swift, rust, ...\n\nLanguages in the `native` class support all features \nof this library and are extensively tested. 
`advanced` languages are tested but do not support the full feature set. Languages of the `basic` class are not tested and\nonly support the feature set of the backend. They can still be used for tokenization and AST parsing.\n\n## How to contribute\n**Is your language not natively supported by code.tokenize, or does the tokenization seem incorrect?** Then change it!\n\nWhile code.tokenize is developed mainly as a helper library for internal research projects, we welcome pull requests of any sort (whether a new feature or a bug fix). \n\n**Want to help test more languages?**\nOur goal is to support as many languages as possible at a `native` level. However, languages at the `basic` level are completely untested. You can help by testing `basic` languages and reporting issues in the tokenization process!\n\n## Release history\n* 0.2.0\n    * Major API redesign!\n    * CHANGE: AST parsing is now done by an external library: [code_ast](https://github.com/cedricrupb/code_ast)\n    * CHANGE: Visitor pattern instead of custom tokenizer\n    * CHANGE: Custom visitors for language-dependent tokenization\n* 0.1.0\n    * The first proper release\n    * CHANGE: Language-specific tokenizer configuration\n    * CHANGE: Basic analyses of the program structure and token role\n    * CHANGE: Documentation\n* 0.0.1\n    * Work in progress\n\n## Project Info\nThe goal of this project is to provide developers in the\nprogramming language processing community with easy\naccess to program tokenization and AST parsing. This is currently developed as a helper library for internal research projects. Therefore, it will only be updated\nas needed.\n\nFeel free to open an issue if anything unexpected\nhappens. \n\nDistributed under the MIT license. 
See ``LICENSE`` for more information.\n\nThis project was developed as part of our research related to:\n```bibtex\n@inproceedings{richter2022tssb,\n  title={TSSB-3M: Mining single statement bugs at massive scale},\n  author={Richter, Cedric and Wehrheim, Heike},\n  booktitle={MSR},\n  year={2022}\n}\n```\n\nWe thank the developers of the [tree-sitter](https://tree-sitter.github.io/tree-sitter/) library. Without tree-sitter, this project would not be possible. \n","funding_links":[],"categories":["Tools"],"sub_categories":["Code Analysis"],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcedricrupb%2Fcode_tokenize","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcedricrupb%2Fcode_tokenize","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcedricrupb%2Fcode_tokenize/lists"}