{"id":50893972,"url":"https://github.com/pmarreck/chardetz","last_synced_at":"2026-06-15T23:01:21.600Z","repository":{"id":364876841,"uuid":"1269557541","full_name":"pmarreck/chardetz","owner":"pmarreck","description":"Pure-Zig character-encoding detector — a faithful translation of uchardet (no C/C++ dependency, cross-compilable, WASM-able, drop-in uchardet C ABI)","archived":false,"fork":false,"pushed_at":"2026-06-14T23:14:38.000Z","size":311,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"yolo","last_synced_at":"2026-06-14T23:20:53.199Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Zig","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pmarreck.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"COPYING","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-14T21:27:59.000Z","updated_at":"2026-06-14T23:14:42.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/pmarreck/chardetz","commit_stats":null,"previous_names":["pmarreck/chardetz"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/pmarreck/chardetz","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fchardetz","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fchardetz/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fchardetz/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fchardetz/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pmarreck","download_url":"https://codeload.github.com/pmarreck/chardetz/tar.gz/refs/heads/yolo","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pmarreck%2Fchardetz/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34383468,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-15T02:00:07.085Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-06-15T23:01:20.272Z","updated_at":"2026-06-15T23:01:21.591Z","avatar_url":"https://github.com/pmarreck.png","language":"Zig","funding_links":[],"categories":[],"sub_categories":[],"readme":"# chardetz\n\n[![Garnix](https://img.shields.io/endpoint.svg?url=https%3A%2F%2Fgarnix.io%2Fapi%2Fbadges%2Fpmarreck%2Fchardetz%3Fbranch%3Dyolo)](https://garnix.io/repo/pmarreck/chardetz)\n\nA pure-[Zig](https://ziglang.org) character-encoding detector — a faithful\ntranslation of [uchardet](https://github.com/BYVoid/uchardet) (the Mozilla\nuniversalchardet-lineage detector) with **no C/C++ runtime dependency**,\ncross-compilable to every Zig target, and **WASM-able**.\n\nchardetz mirrors uchardet's C ABI, so it is a **drop-in** replacement for\nexisting uchardet consumers.\n\n\u003e **Status:** the full detection engine is complete and verified at **oracle\n\u003e parity** — chardetz matches uchardet exactly across the entire test corpus\n\u003e (every charset uchardet supports: UTF-8/16/32, the CJK multibyte set, all\n\u003e single-byte language models, Hebrew, Latin-1, and the ISO-2022/HZ escape\n\u003e sets), backed by a differential fuzz harness. A drop-in `uchardet_*` C ABI and\n\u003e a C CLI are included. See `PLAN.md`.\n\n## Beyond uchardet: `PRINTABLE-BINARY`\n\nchardetz adds one charset uchardet doesn't know:\n[`printable_binary`](https://github.com/pmarreck/printable_binary) — binary data\nencoded with a fixed 256-glyph printable-UTF-8 map. It's reported as\n`PRINTABLE-BINARY`. Detection is a high-precision heuristic over the **260-codepoint\nallowed set** (those 256 glyphs **plus** the literal space/tab/CR/LF the preserve\nmodes `-s/-t/-n/-w` can emit): ≥99% of codepoints in that set, ≥10% \"distinctive\"\ncontrol/punctuation/high-byte glyphs, and ≥32 codepoints. It pre-empts the UTF-8\nverdict only on a strong PB signal and is\notherwise inert — so it never perturbs uchardet-faithful detection (verified: the\nfull corpus still matches uchardet exactly, and the differential fuzz finds zero\nnew divergences). It's validated metamorphically (round-trip through PB's own\nencoder) rather than against the oracle, since uchardet has no such charset.\n\n## Performance\n\nThe pure-Zig rewrite isn't just a portability win — it is **faster than the C++\nuchardet it ports**. Measured by `./bm` (ReleaseFast, min-of-31, Apple aarch64;\n256 KiB synthetic per-path inputs — the same algorithm runs in both, so the ratio\nis what matters):\n\n| detection path | chardetz | uchardet (C++) | speedup |\n|---|---:|---:|---:|\n| single-byte (34-prober group) | 25.6 MB/s | 22.6 MB/s | **1.13×** |\n| CJK multibyte (MBCS group) | 24.6 MB/s | 21.8 MB/s | **1.13×** |\n| UTF-8 (multibyte validation) | 278 MB/s | 276 MB/s | 1.01× |\n| pure-ASCII fast path | 1315 MB/s | 1574 MB/s | 0.84× |\n| **aggregate** | **47.6 MB/s** | **42.4 MB/s** | **≈1.10×** |\n\nchardetz is ~13% faster on the two heavy paths that dominate real-world\nundeclared-encoding detection, at parity on UTF-8, and only trails on the trivial\nASCII fast path (\u003e1.3 GB/s either way — far above any I/O that would feed it).\n\n`./bm` also enforces a machine-independent **scaling-ratio gate**: it runs each hot\nkernel at N/2N/4N/8N and fails if any doubling exceeds 2.5× — catching accidental\nsuper-linear regressions. All paths are confirmed `O(n)` (worst observed 2.04×).\nPer-run throughput is logged to `bench/\u003cmachine-id\u003e.ndjson` with two-sided tolerance.\n\n## Why\n\nuchardet is excellent but is C++, which complicates static linking, cross-\ncompilation, and especially WebAssembly. chardetz removes the C/C++ toolchain\nrequirement while preserving uchardet's tables and logic exactly — validated by\na differential oracle against the original.\n\n## Building\n\n```bash\n./build      # nix build (sandboxed, ReleaseFast) — static library\n./test       # full test suite (hermetic Zig tests; differential oracle harness)\n```\n\n(Uses Nix; native `zig build` is intentionally avoided — see `RULES.md`.)\n\n## License\n\nchardetz is a **derivative work** of uchardet / Mozilla universalchardet. Its\nported logic and generated data tables carry the upstream license and are **not**\nrelicensed.\n\n**Tri-licensed: [MPL 1.1](https://www.mozilla.org/MPL/1.1/) /\nGPL 2.0-or-later / LGPL 2.1-or-later.**\n\nFull text in [`COPYING`](COPYING). Lineage and the pinned upstream commit are\ndocumented in [`PROVENANCE.md`](PROVENANCE.md); attributions in\n[`THIRD_PARTY_LICENSES`](THIRD_PARTY_LICENSES).\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpmarreck%2Fchardetz","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpmarreck%2Fchardetz","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpmarreck%2Fchardetz/lists"}