{"id":35620860,"url":"https://github.com/crwsolutions/ntokenizers","last_synced_at":"2026-05-30T19:00:45.633Z","repository":{"id":324748492,"uuid":"1097033122","full_name":"crwsolutions/ntokenizers","owner":"crwsolutions","description":"Collection of stream-capable tokenizers for JSON, YAML, XML, SQL, Typescript, CSS, CSharp and Markup processing","archived":false,"fork":false,"pushed_at":"2026-05-26T15:04:28.000Z","size":586,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-26T16:36:36.744Z","etag":null,"topics":["csharp-parser","css-parser","dotnet","javascript-parser","json-parser","markup-parser","sql-parser","stream","streaming","tokenizer","typescript-parser","xml-parser"],"latest_commit_sha":null,"homepage":"https://crwsolutions.github.io/ntokenizers","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crwsolutions.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"crwsolutions"}},"created_at":"2025-11-15T12:13:25.000Z","updated_at":"2026-05-26T15:07:41.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/crwsolutions/ntokenizers","commit_stats":null,"previous_names":["crwsolutions/ntokenizers"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/crwsolutions/ntokenizers","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crwso
lutions%2Fntokenizers","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crwsolutions%2Fntokenizers/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crwsolutions%2Fntokenizers/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crwsolutions%2Fntokenizers/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crwsolutions","download_url":"https://codeload.github.com/crwsolutions/ntokenizers/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crwsolutions%2Fntokenizers/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33705207,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-30T02:00:06.278Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["csharp-parser","css-parser","dotnet","javascript-parser","json-parser","markup-parser","sql-parser","stream","streaming","tokenizer","typescript-parser","xml-parser"],"created_at":"2026-01-05T06:24:19.401Z","updated_at":"2026-05-30T19:00:45.623Z","avatar_url":"https://github.com/crwsolutions.png","language":"C#","funding_links":["https://github.com/sponsors/crwsolutions"],"categories":[],"sub_categories":[],"readme":"# NTokenizers\n\nLightweight **Stream Tokenizers** for syntax highlighting and formatting. 
A building block for **chat applications** and **AI response rendering**. Tokenize streaming AI responses in real time for syntax-highlighted output.\n\n| Markup languages | Data formats | Programming languages |\n|------------------|--------------|----------------------|\n| Markdown | JSON | CSharp |\n| HTML | YAML | C |\n| | TOML | C++ |\n| | XML | Go |\n| | | Java |\n| | | Kotlin |\n| | | Rust |\n| | | Swift |\n| | | TypeScript |\n| | | SQL |\n| | | CSS |\n| | | Python |\n\n## How to: kickoff token processing\n\n### Composite tokenizers\n\n#### Markup languages\n\n```csharp\n// kickoff markdown tokenizer\nawait MarkdownTokenizer.Create().ParseAsync(stream, onToken: async token =\u003e { /* handle markdown-tokens here */ });\n\n// kickoff html tokenizer\nawait HtmlTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle html-tokens here */ });\n```\n\n### Individual tokenizers\n\n#### Data formats\n\n```csharp\n// kickoff json tokenizer\nawait JsonTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle json-tokens here */ });\n\n// kickoff yaml tokenizer\nawait YamlTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle yaml-tokens here */ });\n\n// kickoff toml tokenizer\nawait TomlTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle toml-tokens here */ });\n\n// kickoff xml tokenizer\nawait XmlTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle xml-tokens here */ });\n```\n\n#### Programming languages\n\n```csharp\n// kickoff csharp tokenizer\nawait CSharpTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle csharp-tokens here */ });\n\n// kickoff c tokenizer\nawait CTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle c-tokens here */ });\n\n// kickoff cpp tokenizer\nawait CppTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle cpp-tokens here */ });\n\n// kickoff go tokenizer\nawait 
GoTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle go-tokens here */ });\n\n// kickoff java tokenizer\nawait JavaTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle java-tokens here */ });\n\n// kickoff kotlin tokenizer\nawait KotlinTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle kotlin-tokens here */ });\n\n// kickoff rust tokenizer\nawait RustTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle rust-tokens here */ });\n\n// kickoff swift tokenizer\nawait SwiftTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle swift-tokens here */ });\n\n// kickoff typescript/javascript tokenizer\nawait TypescriptTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle typescript-tokens here */ });\n\n// kickoff sql tokenizer\nawait SqlTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle sql-tokens here */ });\n\n// kickoff css tokenizer\nawait CssTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle css-tokens here */ });\n\n// kickoff python tokenizer\nawait PythonTokenizer.Create().ParseAsync(stream, onToken: token =\u003e { /* handle python-tokens here */ });\n```\n\n## Overview\n\nNTokenizers is a .NET library written in C# that provides tokenizers for processing structured text formats like Markdown, JSON, XML, HTML, YAML, TOML, SQL, TypeScript, CSS, CSharp, C, C++, Go, Java, Kotlin, Rust, Swift and Python. The `Tokenize` method is the core functionality that breaks down structured text into meaningful components (tokens) for processing. 
Its key feature is **stream processing capability**: it can handle data as it arrives in real time, making it ideal for processing large files or streaming data without loading everything into memory at once.\n\n\u003e [!WARNING] \n\u003e\n\u003e These tokenizers are **not validation-based** and are primarily intended for **prettifying**, **formatting**, or **visualizing** structured text. They do not perform strict validation of the input format, so they may produce unexpected results when processing malformed or invalid XML, JSON, or HTML. Use them with caution when dealing with untrusted or poorly formatted input.\n\n\u003e [!WARNING] \n\u003e\n\u003e MarkupTokenizer was renamed to MarkdownTokenizer in v2.\n\n\n## Used by\n\n- [NTokenizers.Extensions.Spectre.Console](https://www.nuget.org/packages/NTokenizers.Extensions.Spectre.Console/): Spectre.Console rendering extensions for NTokenizers, for style-rich console syntax highlighting.\n\n# Architecture\n\nMost **tokenizers**, such as the JSON or XML tokenizer, can be used individually, depending on the specific format you want to parse.\n\nThe `MarkdownTokenizer`, however, is a special case. Instead of working on a single format, it acts as a **composite tokenizer**, using the other tokenizers as **subtokenizers**. When parsing a stream, `MarkdownTokenizer` delegates portions of the input to the appropriate subtokenizer, allowing it to handle multiple formats in one pass.\n\nThe same principle applies to inline tokenizers such as Heading, Blockquote, ListItem, and others. 
However, they cannot be used individually and produce the same token types as the `MarkdownTokenizer`.\n\n### Diagram\n\n```\n         ┌─────────┐\n         │ stream  │\n         └─────────┘\n              │  ParseAsync()\n              ▼\n   ┌─────────────────────┐\n   │  MarkdownTokenizer  │ ───────────► fire markdown tokens\n   └─────────────────────┘\n              │\n              ▼       ┌─────────┐\n              ├──────►│   json  │ ───► fire json tokens\n              │       └─────────┘\n              │\n              │       ┌─────────┐\n              ├──────►│ Heading │ ───► fire markdown tokens\n              │       └─────────┘\n              │\n              │       ┌─────────┐\n              ├──────►│   html  │ ───► fire html tokens\n              │       └─────────┘\n              │            │\n              │            ▼       ┌─────────┐\n              │            ├──────►│   css   │ ───► fire css tokens\n              │            │       └─────────┘\n              │            │\n              │            │       ┌─────────┐\n              │            └──────►│ script  │ ───► fire typescript tokens\n              │                    └─────────┘\n              │       ┌─────────┐\n              └──────►│  etc..  
│ ───► etc\n                      └─────────┘\n```\n\n## Example\n\nHere's a simple example showing how to use the `MarkdownTokenizer`:\n\n```csharp\nusing NTokenizers.Core;\nusing NTokenizers.Css;\nusing NTokenizers.Html;\nusing NTokenizers.Json;\nusing NTokenizers.Markdown;\nusing NTokenizers.Markdown.Metadata;\nusing NTokenizers.Typescript;\nusing NTokenizers.Xml;\nusing Spectre.Console;\nusing System.Diagnostics;\nusing System.IO.Pipes;\nusing System.Text;\n\nclass Program\n{\n    static async Task Main()\n    {\n        string markdown = \"\"\"\n        Here is some **bold** text and some *italic* text.\n\n        # NTokenizers Showcase\n        \n        ## Css example\n        ```css\n        .user {\n            color: #FFFFFF;\n            active: true;\n        }\n        ```\n\n        ## XML example\n        ```xml\n        \u003cuser id=\"4821\" active=\"true\"\u003e\n            \u003cname\u003eLaura Smith\u003c/name\u003e\n        \u003c/user\u003e\n        ```\n\n        ## HTML example\n        ```html\n        \u003chtml\u003e\n        \u003chead\u003e\n            \u003cstyle\u003e\n                body { font-family: Arial, sans-serif; background-color: #f0f8ff; }\n                .header { color: #4682b4; text-align: center; }\n                .content { margin: 20px; padding: 15px; background-color: white; border-radius: 5px; }\n            \u003c/style\u003e\n        \u003c/head\u003e\n        \u003cbody\u003e\n            \u003cp\u003eHello world!\u003c/p\u003e\n            \u003cscript\u003e\n                console.log(\"Hello from the sample script!\");\n                document.addEventListener('DOMContentLoaded', function() {\n                    console.log(\"DOM is fully loaded\");\n                });\n            \u003c/script\u003e\n        \u003c/body\u003e\n        \u003c/html\u003e\n        ```\n\n        ## JSON example\n        ```json\n        {\n            \"name\": \"Laura Smith\",\n            \"active\": true\n        
}\n        ```\n\n        ## TypeScript example\n        ```typescript\n        const user = {\n            name: \"Laura Smith\",\n            active: true\n        };\n        ```\n        \"\"\";\n\n        // Create connected streams\n        using var pipe = new AnonymousPipeServerStream(PipeDirection.Out);\n        using var reader = new AnonymousPipeClientStream(PipeDirection.In, pipe.ClientSafePipeHandle);\n\n        // Start slow writer\n        var writerTask = EmitSlowlyAsync(markdown, pipe);\n\n        // Parse markup\n        await MarkdownTokenizer.Create().ParseAsync(reader, onToken: async token =\u003e\n        {\n            if (token.Metadata is ICodeBlockMetadata codeBlock)\n            {\n                AnsiConsole.WriteLine();\n                AnsiConsole.Write(new Markup($\"[bold lime]{codeBlock.Language}:[/]\"));\n                AnsiConsole.WriteLine();\n            }\n\n            if (token.Metadata is ListItemMetadata listMetadata)\n            {\n                AnsiConsole.Write(new Markup($\"[bold lime]{listMetadata.Marker} [/]\"));\n                await listMetadata.RegisterInlineTokenHandler(inlineToken =\u003e\n                {\n                    var value = Markup.Escape(inlineToken.Value);\n                    AnsiConsole.Write(new Markup($\"[bold red]{value}[/]\"));\n                });\n                Debug.WriteLine(\"Written listItem inlines\");\n\n            }\n            else if (token.Metadata is HeadingMetadata headingMetadata)\n            {\n                await headingMetadata.RegisterInlineTokenHandler(inlineToken =\u003e\n                {\n                    var value = Markup.Escape(inlineToken.Value);\n                    var colored = headingMetadata.Level != 1 ?\n                        new Markup($\"[bold GreenYellow]{value}[/]\") :\n                        new Markup($\"[bold yellow]** {value} **[/]\");\n                    AnsiConsole.Write(colored);\n                });\n                
Debug.WriteLine(\"Written Heading inlines\");\n            }\n            else if (token.Metadata is XmlCodeBlockMetadata xmlMetadata)\n            {\n                await xmlMetadata.RegisterInlineTokenHandler(inlineToken =\u003e\n                {\n                    var value = Markup.Escape(inlineToken.Value);\n                    var colored = inlineToken.TokenType switch\n                    {\n                        XmlTokenType.ElementName =\u003e new Markup($\"[blue]{value}[/]\"),\n                        XmlTokenType.OpeningAngleBracket =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        XmlTokenType.ClosingAngleBracket =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        XmlTokenType.SelfClosingSlash =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        XmlTokenType.AttributeName =\u003e new Markup($\"[cyan]{value}[/]\"),\n                        XmlTokenType.AttributeEquals =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        XmlTokenType.AttributeQuote =\u003e new Markup($\"[grey]{value}[/]\"),\n                        XmlTokenType.AttributeValue =\u003e new Markup($\"[green]{value}[/]\"),\n                        XmlTokenType.Text =\u003e new Markup($\"[white]{value}[/]\"),\n                        XmlTokenType.Whitespace =\u003e new Markup($\"[grey]{value}[/]\"),\n                        _ =\u003e new Markup(value)\n                    };\n                    AnsiConsole.Write(colored);\n                });\n            }\n            else if (token.Metadata is JsonCodeBlockMetadata jsonMetadata)\n            {\n                await jsonMetadata.RegisterInlineTokenHandler(inlineToken =\u003e\n                {\n                    var value = Markup.Escape(inlineToken.Value);\n                    var colored = inlineToken.TokenType switch\n                    {\n                        JsonTokenType.StartObject =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        
JsonTokenType.EndObject =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        JsonTokenType.StartArray =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        JsonTokenType.EndArray =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        JsonTokenType.PropertyName =\u003e new Markup($\"[cyan]{value}[/]\"),\n                        JsonTokenType.StringValue =\u003e new Markup($\"[green]{value}[/]\"),\n                        JsonTokenType.Number =\u003e new Markup($\"[magenta]{value}[/]\"),\n                        JsonTokenType.True =\u003e new Markup($\"[orange1]{value}[/]\"),\n                        JsonTokenType.False =\u003e new Markup($\"[orange1]{value}[/]\"),\n                        JsonTokenType.Null =\u003e new Markup($\"[grey]{value}[/]\"),\n                        JsonTokenType.Colon =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        JsonTokenType.Comma =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        JsonTokenType.Whitespace =\u003e new Markup($\"[grey]{value}[/]\"),\n                        _ =\u003e new Markup(value)\n                    };\n                    AnsiConsole.Write(colored);\n                });\n            }\n            else if (token.Metadata is HtmlCodeBlockMetadata htmlMetadata)\n            {\n                await htmlMetadata.RegisterInlineTokenHandler(async inlineToken =\u003e\n                {\n                    if (inlineToken.Metadata is TypeScriptCodeBlockMetadata tsMeta)\n                    {\n                        await HandleScript(tsMeta);\n                    }\n                    else if (inlineToken.Metadata is CssCodeBlockMetadata cssMeta)\n                    {\n                        await HandleCss(cssMeta);\n                    }\n                    else\n                    {\n                    var value = Markup.Escape(inlineToken.Value);\n                    var colored = inlineToken.TokenType switch\n             
       {\n                        HtmlTokenType.OpeningAngleBracket =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        HtmlTokenType.ClosingAngleBracket =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        HtmlTokenType.SelfClosingSlash =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        HtmlTokenType.AttributeName =\u003e new Markup($\"[cyan]{value}[/]\"),\n                        HtmlTokenType.AttributeEquals =\u003e new Markup($\"[yellow]{value}[/]\"),\n                        HtmlTokenType.AttributeQuote =\u003e new Markup($\"[grey]{value}[/]\"),\n                        HtmlTokenType.AttributeValue =\u003e new Markup($\"[green]{value}[/]\"),\n                        HtmlTokenType.Text =\u003e new Markup($\"[white]{value}[/]\"),\n                        HtmlTokenType.Comment =\u003e new Markup($\"[grey]{value}[/]\"),\n                        HtmlTokenType.Whitespace =\u003e new Markup($\"[grey]{value}[/]\"),\n                        _ =\u003e new Markup(value)\n                    };\n                    AnsiConsole.Write(colored);\n                    }\n                });\n            }\n            else if (token.Metadata is TypeScriptCodeBlockMetadata tsMetadata)\n            {\n                await HandleScript(tsMetadata);\n            }\n            else if (token.Metadata is CssCodeBlockMetadata cssMetadata)\n            {\n                await HandleCss(cssMetadata);\n            }\n            else\n            {\n                // Handle regular markup tokens\n                var value = Markup.Escape(token.Value);\n                var colored = token.TokenType switch\n                {\n                    MarkdownTokenType.Text =\u003e new Markup($\"{value}\"),\n                    MarkdownTokenType.Bold =\u003e new Markup($\"[bold]{value}[/]\"),\n                    MarkdownTokenType.Italic =\u003e new Markup($\"[italic]{value}[/]\"),\n                    _ =\u003e new Markup(value)\n       
         };\n\n                AnsiConsole.Write(colored);\n            }\n\n            if (token.Metadata is InlineMetadata)\n            {\n                AnsiConsole.WriteLine();\n            }\n        });\n\n        await writerTask;\n\n        Console.WriteLine();\n        Console.WriteLine(\"Done.\");\n    }\n\n    private static async Task HandleScript(TypeScriptCodeBlockMetadata tsMetadata)\n    {\n        await tsMetadata.RegisterInlineTokenHandler(inlineToken =\u003e\n        {\n            var value = Markup.Escape(inlineToken.Value);\n            var colored = inlineToken.TokenType switch\n            {\n                TypescriptTokenType.Identifier =\u003e new Markup($\"[cyan]{value}[/]\"),\n                TypescriptTokenType.Keyword =\u003e new Markup($\"[blue]{value}[/]\"),\n                TypescriptTokenType.StringValue =\u003e new Markup($\"[green]{value}[/]\"),\n                TypescriptTokenType.Number =\u003e new Markup($\"[magenta]{value}[/]\"),\n                TypescriptTokenType.Operator =\u003e new Markup($\"[yellow]{value}[/]\"),\n                TypescriptTokenType.Comment =\u003e new Markup($\"[grey]{value}[/]\"),\n                TypescriptTokenType.Whitespace =\u003e new Markup($\"[grey]{value}[/]\"),\n                _ =\u003e new Markup(value)\n            };\n            AnsiConsole.Write(colored);\n        });\n    }\n\n    private static async Task HandleCss(CssCodeBlockMetadata cssMetadata)\n    {\n        await cssMetadata.RegisterInlineTokenHandler(inlineToken =\u003e\n        {\n            var value = Markup.Escape(inlineToken.Value);\n            var colored = inlineToken.TokenType switch\n            {\n                CssTokenType.Identifier =\u003e new Markup($\"[white]{value}[/]\"),\n                CssTokenType.Number =\u003e new Markup($\"[magenta]{value}[/]\"),\n                CssTokenType.Operator =\u003e new Markup($\"[yellow]{value}[/]\"),\n                CssTokenType.Selector =\u003e new 
Markup($\"[yellow]{value}[/]\"),\n                CssTokenType.Comment =\u003e new Markup($\"[green]{value}[/]\"),\n                CssTokenType.Whitespace =\u003e new Markup($\"[grey]{value}[/]\"),\n                _ =\u003e new Markup(value)\n            };\n            AnsiConsole.Write(colored);\n        });\n    }\n\n    static async Task EmitSlowlyAsync(string markdown, Stream output)\n    {\n        var rng = new Random();\n        byte[] bytes = Encoding.UTF8.GetBytes(markdown);\n\n        foreach (var b in bytes)\n        {\n            await output.WriteAsync(new[] { b }.AsMemory(0, 1));\n            await output.FlushAsync();\n            await Task.Delay(rng.Next(0, 2));\n        }\n\n        output.Close(); // EOF\n    }\n}\n```\n\nFor more information, check out the documentation [here](https://crwsolutions.github.io/ntokenizers/).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrwsolutions%2Fntokenizers","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrwsolutions%2Fntokenizers","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrwsolutions%2Fntokenizers/lists"}