{"id":19512519,"url":"https://github.com/vberlier/tokenstream","last_synced_at":"2026-02-28T11:09:39.553Z","repository":{"id":37963196,"uuid":"376339960","full_name":"vberlier/tokenstream","owner":"vberlier","description":"A versatile token stream for handwritten parsers.","archived":false,"fork":false,"pushed_at":"2023-08-03T00:20:59.000Z","size":855,"stargazers_count":13,"open_issues_count":5,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-11-14T06:22:23.163Z","etag":null,"topics":["lexer","parsing","recursive-descent-parser","token-stream","tokenizer"],"latest_commit_sha":null,"homepage":"https://vberlier.github.io/tokenstream/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vberlier.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":["vberlier"]}},"created_at":"2021-06-12T16:46:29.000Z","updated_at":"2024-05-31T16:21:55.000Z","dependencies_parsed_at":"2024-11-10T23:37:00.214Z","dependency_job_id":null,"html_url":"https://github.com/vberlier/tokenstream","commit_stats":{"total_commits":168,"total_committers":3,"mean_commits":56.0,"dds":0.5297619047619048,"last_synced_commit":"c596c302cc07d54d0090a3cc45a5972872c09d7e"},"previous_names":[],"tags_count":40,"template":false,"template_full_name":null,"purl":"pkg:github/vberlier/tokenstream","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vberlier%2Ftokenstream","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vberlier%2Ftokenstream/tags","releases_
url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vberlier%2Ftokenstream/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vberlier%2Ftokenstream/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vberlier","download_url":"https://codeload.github.com/vberlier/tokenstream/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vberlier%2Ftokenstream/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29931486,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-28T09:58:13.507Z","status":"ssl_error","status_checked_at":"2026-02-28T09:57:57.047Z","response_time":90,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["lexer","parsing","recursive-descent-parser","token-stream","tokenizer"],"created_at":"2024-11-10T23:26:29.412Z","updated_at":"2026-02-28T11:09:39.518Z","avatar_url":"https://github.com/vberlier.png","language":"Python","funding_links":["https://github.com/sponsors/vberlier"],"categories":[],"sub_categories":[],"readme":"# tokenstream\n\n[![GitHub 
Actions](https://github.com/vberlier/tokenstream/workflows/CI/badge.svg)](https://github.com/vberlier/tokenstream/actions)\n[![PyPI](https://img.shields.io/pypi/v/tokenstream.svg)](https://pypi.org/project/tokenstream/)\n[![PyPI - Python Version](https://img.shields.io/pypi/pyversions/tokenstream.svg)](https://pypi.org/project/tokenstream/)\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\n\u003e A versatile token stream for handwritten parsers.\n\n```python\nfrom tokenstream import TokenStream\n\ndef parse_sexp(stream: TokenStream):\n    \"\"\"A basic S-expression parser.\"\"\"\n    with stream.syntax(brace=r\"\\(|\\)\", number=r\"\\d+\", name=r\"\\w+\"):\n        brace, number, name = stream.expect((\"brace\", \"(\"), \"number\", \"name\")\n        if brace:\n            return [parse_sexp(stream) for _ in stream.peek_until((\"brace\", \")\"))]\n        elif number:\n            return int(number.value)\n        elif name:\n            return name.value\n\nprint(parse_sexp(TokenStream(\"(hello (world 42))\")))  # ['hello', ['world', 42]]\n```\n\n## Introduction\n\nWriting recursive-descent parsers by hand can be quite elegant, but it's often more verbose than expected, especially when it comes to handling indentation and reporting proper syntax errors. 
This package provides a powerful general-purpose token stream that addresses these issues and more.\n\n### Features\n\n- Define the set of recognizable tokens dynamically with regular expressions\n- Transparently skip over irrelevant tokens\n- Expressive API for matching, collecting, peeking, and expecting tokens\n- Clean error reporting with line numbers and column numbers\n- Contextual support for indentation-based syntax\n- Checkpoints for backtracking parsers\n- Works well with Python 3.10+ match statements\n\nCheck out the [`examples`](https://github.com/vberlier/tokenstream/tree/main/examples) directory for practical examples.\n\n## Installation\n\nThe package can be installed with `pip`.\n\n```bash\npip install tokenstream\n```\n\n## Getting started\n\nYou can define tokens with the `syntax()` method. The keyword arguments associate regular expression patterns to token types. The method returns a context manager during which the specified tokens will be recognized.\n\n```python\nstream = TokenStream(\"hello world\")\n\nwith stream.syntax(word=r\"\\w+\"):\n    print([token.value for token in stream])  # ['hello', 'world']\n```\n\nCheck out the full [API reference](https://vberlier.github.io/tokenstream/api_reference/) for more details.\n\n### Expecting tokens\n\nThe token stream is iterable and will yield all the extracted tokens one after the other. 
You can also retrieve tokens from the token stream one at a time by using the `expect()` method.\n\n```python\nstream = TokenStream(\"hello world\")\n\nwith stream.syntax(word=r\"\\w+\"):\n    print(stream.expect().value)  # \"hello\"\n    print(stream.expect().value)  # \"world\"\n```\n\nThe `expect()` method lets you ensure that the extracted token matches a specified type and will raise an exception otherwise.\n\n```python\nstream = TokenStream(\"hello world\")\n\nwith stream.syntax(number=r\"\\d+\", word=r\"\\w+\"):\n    print(stream.expect(\"word\").value)  # \"hello\"\n    print(stream.expect(\"number\").value)  # UnexpectedToken: Expected number but got word 'world'\n```\n\n### Filtering the stream\n\nNewlines and whitespace are ignored by default. You can reject interspersed whitespace by intercepting the built-in `newline` and `whitespace` tokens.\n\n```python\nstream = TokenStream(\"hello world\")\n\nwith stream.syntax(word=r\"\\w+\"), stream.intercept(\"newline\", \"whitespace\"):\n    print(stream.expect(\"word\").value)  # \"hello\"\n    print(stream.expect(\"word\").value)  # UnexpectedToken: Expected word but got whitespace ' '\n```\n\nThe opposite of the `intercept()` method is `ignore()`. It allows you to ignore tokens and handle comments pretty easily.\n\n```python\nstream = TokenStream(\n    \"\"\"\n    # this is a comment\n    hello # also a comment\n    world\n    \"\"\"\n)\n\nwith stream.syntax(word=r\"\\w+\", comment=r\"#.+$\"), stream.ignore(\"comment\"):\n    print([token.value for token in stream])  # ['hello', 'world']\n```\n\n### Indentation\n\nTo enable indentation you can use the `indent()` method. 
The stream will now yield balanced pairs of `indent` and `dedent` tokens when the indentation changes.\n\n```python\nsource = \"\"\"\nhello\n    world\n\"\"\"\nstream = TokenStream(source)\n\nwith stream.syntax(word=r\"\\w+\"), stream.indent():\n    stream.expect(\"word\")\n    stream.expect(\"indent\")\n    stream.expect(\"word\")\n    stream.expect(\"dedent\")\n```\n\nTo prevent some tokens from triggering unwanted indentation changes you can use the `skip` argument.\n\n```python\nsource = \"\"\"\nhello\n        # some comment\n    world\n\"\"\"\nstream = TokenStream(source)\n\nwith stream.syntax(word=r\"\\w+\", comment=r\"#.+$\"), stream.indent(skip=[\"comment\"]):\n    stream.expect(\"word\")\n    stream.expect(\"comment\")\n    stream.expect(\"indent\")\n    stream.expect(\"word\")\n    stream.expect(\"dedent\")\n```\n\n### Checkpoints\n\nThe `checkpoint()` method returns a context manager that resets the stream to the current token at the end of the `with` statement. You can use the returned `commit()` function to keep the state of the stream at the end of the `with` statement.\n\n```python\nstream = TokenStream(\"hello world\")\n\nwith stream.syntax(word=r\"\\w+\"):\n    with stream.checkpoint():\n        print([token.value for token in stream])  # ['hello', 'world']\n    with stream.checkpoint() as commit:\n        print([token.value for token in stream])  # ['hello', 'world']\n        commit()\n    print([token.value for token in stream])  # []\n```\n\n### Match statements\n\nMatch statements make it very intuitive to process tokens extracted from the token stream. 
If you're using Python 3.10+, give it a try and see if you like it.\n\n```python\nfrom tokenstream import TokenStream, Token\n\ndef parse_sexp(stream: TokenStream):\n    \"\"\"A basic S-expression parser that uses Python 3.10+ match statements.\"\"\"\n    with stream.syntax(brace=r\"\\(|\\)\", number=r\"\\d+\", name=r\"\\w+\"):\n        match stream.expect_any((\"brace\", \"(\"), \"number\", \"name\"):\n            case Token(type=\"brace\"):\n                return [parse_sexp(stream) for _ in stream.peek_until((\"brace\", \")\"))]\n            case Token(type=\"number\") as number:\n                return int(number.value)\n            case Token(type=\"name\") as name:\n                return name.value\n```\n\n## Contributing\n\nContributions are welcome. Make sure to first open an issue discussing the problem or the new feature before creating a pull request. The project uses [`poetry`](https://python-poetry.org/).\n\n```bash\n$ poetry install\n```\n\nYou can run the tests with `poetry run pytest`.\n\n```bash\n$ poetry run pytest\n```\n\nThe project must type-check with [`pyright`](https://github.com/microsoft/pyright). If you're using VS Code, the [`pylance`](https://marketplace.visualstudio.com/items?itemName=ms-python.vscode-pylance) extension should report diagnostics automatically. You can also install the type-checker locally with `npm install` and run it from the command line.\n\n```bash\n$ npm run watch\n$ npm run check\n$ npm run verifytypes\n```\n\nThe code follows the [`black`](https://github.com/psf/black) code style. 
Import statements are sorted with [`isort`](https://pycqa.github.io/isort/).\n\n```bash\n$ poetry run isort tokenstream examples tests\n$ poetry run black tokenstream examples tests\n$ poetry run black --check tokenstream examples tests\n```\n\n---\n\nLicense - [MIT](https://github.com/vberlier/tokenstream/blob/main/LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvberlier%2Ftokenstream","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvberlier%2Ftokenstream","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvberlier%2Ftokenstream/lists"}