{"id":44503984,"url":"https://github.com/msquareau/dns-blocklist","last_synced_at":"2026-06-11T02:01:38.742Z","repository":{"id":338166937,"uuid":"1156660395","full_name":"msquareau/dns-blocklist","owner":"msquareau","description":"DNS blocklist builder","archived":false,"fork":false,"pushed_at":"2026-05-26T05:32:47.000Z","size":147,"stargazers_count":1,"open_issues_count":2,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-05-26T07:30:52.796Z","etag":null,"topics":["ad-block","blocklist","blocklist-aggregator","dns","sdbl"],"latest_commit_sha":null,"homepage":"","language":"Rust","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/msquareau.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-02-12T22:45:52.000Z","updated_at":"2026-05-14T05:24:19.000Z","dependencies_parsed_at":null,"dependency_job_id":"58ff7fc2-9a8d-45b3-a3d6-af74b3a93f3a","html_url":"https://github.com/msquareau/dns-blocklist","commit_stats":null,"previous_names":["msquareau/dns-blocklist"],"tags_count":119,"template":false,"template_full_name":null,"purl":"pkg:github/msquareau/dns-blocklist","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msquareau%2Fdns-blocklist","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msquareau%2Fdns-blocklist/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msquareau%2Fdns-blocklist/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msquareau%2Fdns-blocklist/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/msquareau","download_url":"https://codeload.github.com/msquareau/dns-blocklist/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/msquareau%2Fdns-blocklist/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34178819,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-11T02:00:06.485Z","response_time":57,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ad-block","blocklist","blocklist-aggregator","dns","sdbl"],"created_at":"2026-02-13T08:25:14.705Z","updated_at":"2026-06-11T02:01:38.717Z","avatar_url":"https://github.com/msquareau.png","language":"Rust","funding_links":[],"categories":[],"sub_categories":[],"readme":"# DNS Blocklist Compiler\n\nRust CLI tool that downloads DNS blocklists from popular open-source upstream sources and compiles them into a single categorized binary file (SDBL v3 format). Designed for DNS filtering apps, ad blockers, and network-level content filtering.\n\n## Build Instructions\n\n```bash\n# Requirements: Rust 1.85+\ngit clone https://github.com/msquareau/dns-blocklist.git\ncd dns-blocklist\ncargo build --release\n./target/release/dns-blocklist-compiler --output ./output\n# Output: output/blocklist.bin, output/blocklist.bin.gz, output/blocklist.json\n```\n\n### CLI flags\n\n| Flag | Default | Description |\n|---|---|---|\n| `--output \u003cdir\u003e` | `.` | Where to write `blocklist.bin`, `blocklist.bin.gz`, and `blocklist.json`. |\n| `--strict` | (default) | Abort on any validation failure: bad download, parse-count regression, canary mismatch, per-bit floor breach. CI should use strict. |\n| `--best-effort` | | Tolerate up to 2 source-level (download or parse) failures and per-bit floor breaches — they downgrade to `WARN` lines. Canary mismatches and round-trip mismatches still abort because they indicate the artifact is broken, not just under-supplied. Use for local development iterations. |\n\n### Validation layers\n\nThe compiler runs three validation passes; any of them can stop a bad artifact from shipping:\n\n1. **Layer 1 — download.** HTTP status must be 2xx, body must meet the source's optional `minSizeBytes`, `Content-Type` must be `text/*` (not `text/html` / `application/json`), and the first 30 non-comment lines must contain at least one parseable domain. Retries: 3× with 1s/2s/4s ±20 % jitter for network errors and 5xx.\n2. **Layer 2 — parse.** If the source emits a HaGeZi-style `# Number of entries: N` header, parsed count must be ≥ 90 % of N. Independently, the source's optional `minParsedEntries` floor must be met.\n3. **Layer 3 — output.** The just-compiled binary is parsed back through `src/reader.rs`, every canary in [`canary-domains.json`](canary-domains.json) is looked up and its `expectedMinBitmap` bits must be present, ~1000 random store entries are round-tripped through the trie, and every source's optional `minTrieEntries` floor is checked against the trie's per-bit terminal counts.\n\n## Testing\n\n```bash\ncargo test                        # Unit + integration tests\ncargo test -- --ignored           # Plus live-download tests (hits jsDelivr)\ncargo clippy -- -D warnings       # Lint check\ncargo fmt --check                 # Format check\n```\n\nThe test suite covers:\n\n- **Inline unit tests** in every `src/*.rs` module — parser formats, trie serialization, header encoding, metadata generation, config deserialization, SDBL reader, the three validation layers.\n- **`tests/integration_test.rs`** — end-to-end compilation through the SDBL v3 reader to verify domain lookups, category bitmaps, wildcard handling, determinism, and gzip round-trips.\n- **`tests/validation_test.rs`** — the issue-#20 regression suite: HTTP 404, 500, too-small body, `text/html` rejection, smell-test rejection of HTML error pages, parse-count ratio guard, `minParsedEntries` floor, canary mismatch (including the literal \"Ultimate bit 4 dropped\" symptom), per-bit trie-entry floor, round-trip sampling.\n- **`tests/download_integration_test.rs`** — two `#[ignore]`d live-download tests gated behind `cargo test -- --ignored`; run in the release workflow.\n\n## How It Works\n\n1. Downloads DNS blocklists in parallel from upstream open-source sources\n2. Parses three list formats: plain domains, hosts files, and adblock rules\n3. Builds a trie data structure with category bitmaps\n4. Serializes to SDBL v3 binary format for fast domain lookup\n5. Generates metadata JSON with SHA-256 checksums and domain statistics\n\n## Configuring Blocklist Sources\n\nAll blocklist sources are defined in [`blocklist-sources.json`](blocklist-sources.json). You can add, remove, or modify sources by editing this file.\n\n### File Structure\n\n```json\n{\n  \"version\": 1,\n  \"description\": \"Human-readable description of this config\",\n  \"baseUrls\": {\n    \"domains\": \"https://example.com/domains\",\n    \"adblock\": \"https://example.com/adblock\"\n  },\n  \"sources\": [\n    {\n      \"category\": \"adsTrackers\",\n      \"categoryIndex\": 0,\n      \"file\": \"ads.txt\",\n      \"baseUrl\": \"domains\",\n      \"format\": \"domains\",\n      \"displayName\": \"Ad Trackers List\"\n    }\n  ]\n}\n```\n\n### Fields\n\n**Top-level:**\n\n| Field | Description |\n|-------|-------------|\n| `version` | Config schema version (currently `1`) |\n| `description` | Human-readable description |\n| `baseUrls` | Named URL prefixes referenced by sources |\n| `sources` | Array of blocklist source entries |\n\n**Each source entry:**\n\n| Field | Required | Description |\n|-------|----------|-------------|\n| `category` | yes | Unique category identifier (camelCase) |\n| `categoryIndex` | yes | Unique integer `0–31` — used as the bit position in the binary category bitmap |\n| `file` | yes | Filename appended to the base URL to form the download URL |\n| `baseUrl` | yes | Key into `baseUrls` — the download URL is `baseUrls[baseUrl]/file` |\n| `format` | yes | List format: `domains`, `hosts`, or `adblock` (see below) |\n| `displayName` | yes | Human-readable name shown in build output |\n| `minSizeBytes` | optional | Layer 1 — reject downloads smaller than this many bytes. Set ~80 % of the current upstream size. |\n| `minParsedEntries` | optional | Layer 2 — reject parses producing fewer than this many entries (line count). Independent of the upstream's declared count. |\n| `minTrieEntries` | optional | Layer 3 — abort if the compiled trie has fewer than this many entries with this source's bit set. |\n\n### Supported Formats\n\n| Format | Description | Example line |\n|--------|-------------|--------------|\n| `domains` | Plain domain list, one per line. Supports `*.` and `.` wildcard prefixes. | `example.com` |\n| `hosts` | Hosts file format (`0.0.0.0` or `127.0.0.1` followed by a domain). | `0.0.0.0 example.com` |\n| `adblock` | Adblock filter syntax. Only `\\|\\|domain^` rules are used; rules with `$`, `/`, or `*` are skipped. | `\\|\\|example.com^` |\n\n### Adding a New Source\n\n1. If the source uses a new base URL, add it to `baseUrls`.\n2. Add a new entry to `sources` with a unique `category` and `categoryIndex`.\n3. Run the compiler to verify the source downloads and parses correctly:\n   ```bash\n   cargo run -- --output ./output\n   ```\n\n## Third-Party Sources and Licensing\n\nThis tool aggregates domain lists from the following open-source projects. The compiled binary output incorporates data from these sources and is subject to the terms of their respective licenses.\n\n| Source | License | Notes |\n|--------|---------|-------|\n| [HaGeZi DNS Blocklists](https://github.com/hagezi/dns-blocklists) | GPL-3.0-only | 28 category lists |\n| [The Block List Project](https://github.com/blocklistproject/Lists) | Unlicense | 1 category list |\n\nA complete list of upstream sources with URLs is maintained in [`blocklist-sources.json`](blocklist-sources.json).\n\nBecause the majority of upstream data is licensed under the GNU General Public License v3.0, the compiled output and this tool are also distributed under GPL-3.0. If you redistribute the compiled blocklist binary, you must comply with the GPL-3.0 terms — including making the corresponding source code (this repository) available to recipients.\n\nFor a complete list of all third-party software and data sources, including full license texts, see [THIRD-PARTY-NOTICES.md](THIRD-PARTY-NOTICES.md).\n\n## Contributing\n\nContributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.\n\n## License\n\nCopyright (C) 2026 M-SQUARE Pty Ltd, Australia\n\nThis program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, version 3 of the License.\n\nThis program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.\n\nYou should have received a copy of the GNU General Public License along with this program. If not, see \u003chttps://www.gnu.org/licenses/\u003e.\n\nSee [LICENSE.txt](LICENSE.txt) for the full license text.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsquareau%2Fdns-blocklist","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmsquareau%2Fdns-blocklist","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmsquareau%2Fdns-blocklist/lists"}