{"id":24381866,"url":"https://github.com/xeverous/arbitrary_code_highlighter","last_synced_at":"2026-05-21T11:03:24.301Z","repository":{"id":62066163,"uuid":"205031365","full_name":"Xeverous/arbitrary_code_highlighter","owner":"Xeverous","description":"semi-automatic code highlighter generating HTML span elements supporting precise color specification","archived":false,"fork":false,"pushed_at":"2023-02-12T23:12:11.000Z","size":282,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-01-19T09:13:48.743Z","etag":null,"topics":["cpp","generator","html"],"latest_commit_sha":null,"homepage":null,"language":"C++","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Xeverous.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-08-28T22:04:54.000Z","updated_at":"2022-10-21T19:20:55.000Z","dependencies_parsed_at":"2025-01-19T09:13:49.135Z","dependency_job_id":"5e65009a-ca1f-4901-b37b-0f6bcb7dc5e5","html_url":"https://github.com/Xeverous/arbitrary_code_highlighter","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Xeverous%2Farbitrary_code_highlighter","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Xeverous%2Farbitrary_code_highlighter/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Xeverous%2Farbitrary_code_highlighter/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/Git
Hub/repositories/Xeverous%2Farbitrary_code_highlighter/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Xeverous","download_url":"https://codeload.github.com/Xeverous/arbitrary_code_highlighter/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243249011,"owners_count":20260768,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cpp","generator","html"],"created_at":"2025-01-19T09:13:45.408Z","updated_at":"2025-12-27T14:12:15.134Z","avatar_url":"https://github.com/Xeverous.png","language":"C++","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Arbitrary Code Highlighter\n\nA tool that given source code and additional information, outputs it embedded in HTML `\u003cspan\u003e` tags containing classes for CSS highlight. 
[GeSHi](http://qbnz.com/highlighter/) and [Pygments](https://pygments.org) are a good comparison but instead of supporting highlight output for a vast set of languages, this tool focuses on specific use cases where the source code is accompanied by additional information for a richer colorization.\n\nThe use cases are:\n\n- highlighting arbitrary code with \"code mirror color file\"\n\n```cpp\nint main() // a comment\n{\n    std::cout \u003c\u003c \"sizeof(long) = \" \u003c\u003c sizeof(long) \u003c\u003c '\\n';\n}\n```\n\n```\nkeyword func() 0comment\n{\n    namespace::global \u003c\u003c str \u003c\u003c keyword(keyword) \u003c\u003c chr;\n}\n```\n\nThis would produce HTML code where the `int` gets a span with class `keyword`, `main` a span with class `func` and so on...\n\nThe format has been specifically designed to visually mirror the original text. This allows fast, copy-and-paste editing of the original text while still being human readable.\n\nBy default sequences of tokens - that is whitespace, *identifiers* (an alpha character followed by any number of alphanumeric characters) and symbols are expected to match 1:1 but there are many additional features like fixed-length tokens (`0` in the above example meaning to-the-end-of-line) and automatic quotation matching (`str` and `chr`).\n\nThe goal is to provide a very customizable highlight (typically for small snippets) where there is no highlighter available and/or only manual coloration can achieve the desired effect. 
Intended to colorize specifications, syntax examples and anything else that has no dedicated highlighter.\n\n- highlighting C or C++ code with information provided by [clangd](https://clangd.llvm.org)\n\nThis is similar to already available highlighters but instead of implementing a more-or-less fuzzy/wonky/lax parser that doesn't fully understand the language the goal is to implement a parser that is able to highlight the code with maximum precision by utilizing compiler-level knowledge about the code delivered by clangd.\n\nBelow is a simplified example of a block of information delivered by clangd about a sample program (token text positions omitted for brevity):\n\n```\n\"#error Misconfigured build!\", type: comment, modifiers:\n\"MACRO\",        type: macro,         modifiers: globalScope\n\"T\",            type: typeParameter, modifiers: declaration\n\"T\",            type: typeParameter, modifiers:\n\"result\",       type: unknown,       modifiers: dependentName, classScope\n\"main\",         type: function,      modifiers: declaration, globalScope\n\"power_states\", type: enum,          modifiers: declaration, globalScope\n\"on\",           type: enumMember,    modifiers: declaration, readonly, globalScope\n\"sleep\",        type: enumMember,    modifiers: declaration, readonly, globalScope\n\"off\",          type: enumMember,    modifiers: declaration, readonly, globalScope\n\"point\",        type: class,         modifiers: declaration, globalScope\n\"x\",            type: property,      modifiers: declaration, classScope\n\"y\",            type: property,      modifiers: declaration, classScope\n\"auto\",         type: class,         modifiers: deduced, defaultLibrary, globalScope\n\"auto\",         type: typeParameter, modifiers: functionScope\n\"std\",          type: namespace,     modifiers: defaultLibrary, globalScope\n\"cout\",         type: variable,      modifiers: defaultLibrary, globalScope\n```\n\nSpecifically, this is taken from [Semantic Tokens 
LSP call](https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocument_semanticTokens). The same token (by string contents but not by position) can be reported multiple times (e.g. a keyword or variable used multiple times). As of writing this, roughly speaking, clangd reports every non-keyword identifier and a few additional tokens like preprocessor-disabled code. With basic syntax understanding (for comments, keywords, operators, literals etc.) it's possible to implement IDE-level highlighting in generated HTML. Obviously input code snippets should be self-contained and free of errors. Otherwise clangd may not deliver full information. In other words, the code must be compilable.\n\n## Documentation - mirror\n\nIn general, tokens are matched 1:1, starting by reading the mirror color file first. The color file contents dictate how the code file contents should be consumed and transformed into HTML.\n\nWhen the color file contains an identifier, it will be matched against an arbitrary identifier in the code.\n\n- An identifier within the color file is a non-zero sequence of characters from the set of `a-zA-Z_`.\n- An identifier within the code file is a character from the set of `a-zA-Z_` followed by any number of characters from the set of `a-zA-Z_0-9`.\n\nThis means that a color identifier like `var` can be matched against `x1` identifier in the code. Identifiers in the color file are not allowed to contain digits because digits have other, special meaning for color specification.\n\nExample (code/color/output):\n\n```\nint x1\nkeyword var\n\u003cspan class=\"keyword\"\u003eint\u003c/span\u003e \u003cspan class=\"var\"\u003ex1\u003c/span\u003e\n```\n\nSymbols without special meaning and whitespace are compared exactly. 
This forces the color file to \"mirror\" the code file contents in regards to syntax:\n\n```\nfoo(bar);\nfunc(arg);\n\u003cspan class=\"func\"\u003efoo\u003c/span\u003e(\u003cspan class=\"arg\"\u003ebar\u003c/span\u003e);\n```\n\n3 configurable color identifiers have special meaning (in other words, color keywords):\n\n- `num` - matches against a sequence of digits\n- `chr` - matches against a literal, formed by `'` quotes\n- `str` - matches against a literal, formed by `\"` quotes\n\nThe literals may contain the simplest form of literal escapes: `\\` followed by any single character. Escaped content automatically receives separate CSS span classes (there is no keyword for escaped content):\n\n```\n\"string\\nwith\\bescapes\"\nstr\n\u003cspan class=\"str\"\u003e\"string\u003cspan class=\"str_esc\"\u003e\\n\u003c/span\u003ewith\u003cspan class=\"str_esc\"\u003e\\b\u003c/span\u003eescapes\"\u003c/span\u003e\n```\n\nDue to very high variance in numeric syntax across languages, the `num` keyword supports only ASCII digits. 
Anything more complex requires length-based matching.\n\nIf the color identifier is preceded by a number, it changes the matching behavior from color-identifier-to-code-identifier to color-identifier-to-exactly-this-many-characters:\n\n```\nonelongidentifier\n3a4b10c\n\u003cspan class=\"a\"\u003eone\u003c/span\u003e\u003cspan class=\"b\"\u003elong\u003c/span\u003e\u003cspan class=\"c\"\u003eidentifier\u003c/span\u003e\n```\n\nthese characters can be any characters:\n\n```\n\"text = %s\"\n\"7str2fmt\"\n\"\u003cspan class=\"str\"\u003etext = \u003c/span\u003e\u003cspan class=\"fmt\"\u003e%s\u003c/span\u003e\"\n```\n\nIf the number is 0, it will consume all characters left on the line:\n\n```\nf(); // comment\nfunc(); 0com\n\u003cspan class=\"func\"\u003ef\u003c/span\u003e(); \u003cspan class=\"com\"\u003e// comment\u003c/span\u003e\n```\n\nIf specifying length is not desired and there is no whitespace between tokens, a special, configurable character (` by default) can be used to separate color token names:\n\n```\n**kwargs\n2op`param\n\u003cspan class=\"op\"\u003e**\u003c/span\u003e\u003cspan class=\"param\"\u003ekwargs\u003c/span\u003e\n```\n\nthis includes the possibility to generate fixed-length spanless outputs:\n\n```\nonetwothree\n3first3`5third\n\u003cspan class=\"first\"\u003eone\u003c/span\u003etwo\u003cspan class=\"third\"\u003ethree\u003c/span\u003e\n```\n\nAdditional remarks:\n\n- More examples can be found in `src/test/mirror_tests.cpp`.\n- Because CSS class names frequently use `-` and this character is not allowed in color identifier names, one of the generation options allows converting `_` to `-`. This means it's possible to use `underscored_names` within color file and receive `dashed-names` in the span class names in the output.\n- There is no special handling for Unicode. Everything is compared byte-to-byte. 
Both code and color files may be Unicode, but then one must be aware that because identifiers are matched only by ASCII characters, non-ASCII Unicode bytes will not form identifiers (in both types of files) and fall into byte-to-byte matching intended for symbols. Additionally, the color file will have to use a lot of numeric specifiers to correctly match a certain number of bytes from the code file.\n\n## Documentation - clangd\n\nACH does only basic tokenization (keywords, identifiers, literals, preprocessor and comments). No proper C++ parsing takes place. It works by first splitting the code into these fundamental syntax categories, then trying to improve this information by application of semantic token information - most of which augments identifiers. Other parts of syntax like literals, comments and keywords are generally not reported by clangd.\n\nThis section assumes the reader is familiar with [LSP](https://microsoft.github.io/language-server-protocol/overviews/lsp/overview/), particularly with the [Semantic Tokens call](https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocument_semanticTokens). If not, read the Semantic Tokens section now.\n\nClangd has no documentation on how exactly it implements LSP. 
Because LSP gives some freedom of implementation (like defining custom token types and modifiers, in addition to the standard ones) the information below is based on the author's own experiments.\n\nReported types (each token has exactly 1 type):\n\n- objects: `variable`, `parameter`, `property` (field), `enumMember`\n- functions: `function`, `method`\n- types: `class`, `interface`, `enum`, `type`\n- templates: `typeParameter` (both TTP and NTTP), `concept`\n- `namespace`\n- `comment` (used to report preprocessor-disabled code, not comments) (always a whole line)\n- `macro` (definitions in preprocessor and usages outside preprocessor - if the macro has a body and the body uses other macros those are not reported)\n- `modifier` - `override` and `final` when used as intended\n- `operator` - both built-in and overloaded\n- `bracket` - `\u003c` and `\u003e` when not used as operator and not used in preprocessor\n- `label`\n- `unknown`\n\nReported modifiers as of clangd 20.1.8 (each token can have 0+ modifiers):\n\n- `declaration`\n- `definition`\n- `deprecated`\n- `deduced` (applies to `auto` when possible - e.g. in initializer but not in generic lambdas)\n- `readonly` (`const` and `constexpr`)\n- `static`\n- `abstract`\n- `virtual`\n- `dependentName`\n- `defaultLibrary` (standard library entities)\n- `usedAsMutableReference`\n- `usedAsMutablePointer`\n- `constructorOrDestructor`\n- `userDefined` (e.g. overloaded operators)\n- 0 or 1 of `functionScope`, `classScope`, `fileScope`, `globalScope`\n\nDiscoveries and details:\n\n- Literals and their prefixes/suffixes are not reported.\n- Generally, keywords are not reported but:\n\n  - `auto` is reported when used as a type deduction, e.g. `auto x = f();` (with semantic info about deduced type, as if auto wasn't used)\n  - `auto` is not reported when used as return type, e.g. `auto f() { return /* ... */; }`\n  - `auto` is not reported when used in trailing return type syntax, e.g. 
`auto f() -\u003e T;`\n  - `override` and `final` are reported when used as entity names (type depends on entity)\n  - `override` and `final` are reported when used as intended (type = modifier)\n  - `declaration` works more like \"definition\" - it's present when the entity appears for the first time (e.g. type definition, function parameter name in the parameters list).\n\n- `readonly` is present for `const` and `constexpr` objects, including references (e.g. parameter names that have const reference type).\n- `static` is not reported for every entity using `static` keyword in C and C++. It's present only for static class members (`static` non-member functions and globals do not have this token modifier).\n- `usedAsMutableReference` is reported for objects passed by non-const reference, only at the caller side. The same applies analogously to `usedAsMutablePointer`.\n- All or almost all entities will have one of the scope modifiers set. It's basically entity's visibility.\n- Macro calls outside preprocessor lines are reported. If such macros have parameters, they *may* be reported depending on how they are being used inside the macro (basically it depends on what the macro parameter becomes after macro expansion).\n\n## Usage\n\nThe program has 3 interfaces:\n\n- (unfinished) command-line\n- directly through C++ (see `src/ach/mirror/core.hpp` and `src/ach/clangd/core.hpp`)\n- indirectly through Python bindings\n\nThe mirror API requires the contents of 2 files (code and \"mirror\" color specification).\n\nThe clangd API requires the contents of the source file and clangd-semantic-token-information (already transformed into a form required by the project's C++ or Python interface). The clangd API does only parsing and application of clangd information - invoking clangd on the source file (with appropriate compiler settings) and then transforming received JSONs to required interface is your responsibility.\n\n## Building\n\nModern CMake build recipe (targets not variables). 
See top-level `CMakeLists.txt` for details.\n\n- (static library) Project core is the only mandatory part and requires only C++17.\n- (executable) Unit tests require Boost test library (header-only).\n- (executable) Command-line interface requires Boost with program_options library built.\n- (shared library) Python bindings require Python 3.6+ development installation. Everything else is provided in submodules.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxeverous%2Farbitrary_code_highlighter","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fxeverous%2Farbitrary_code_highlighter","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fxeverous%2Farbitrary_code_highlighter/lists"}