{"id":37883324,"url":"https://github.com/crissyfield/troll-a","last_synced_at":"2026-01-16T16:50:09.196Z","repository":{"id":211329839,"uuid":"728841058","full_name":"crissyfield/troll-a","owner":"crissyfield","description":"Drill into WARC web archives","archived":false,"fork":false,"pushed_at":"2024-10-16T12:09:39.000Z","size":247,"stargazers_count":133,"open_issues_count":0,"forks_count":11,"subscribers_count":6,"default_branch":"main","last_synced_at":"2024-10-18T22:29:19.775Z","etag":null,"topics":["command-line-tool","common-crawl","internet-archive","security","security-tools","warc"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/crissyfield.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-12-07T20:12:05.000Z","updated_at":"2024-10-16T12:08:34.000Z","dependencies_parsed_at":"2023-12-07T21:28:21.999Z","dependency_job_id":"507dcc16-d465-4263-9a84-ca530fa55bfe","html_url":"https://github.com/crissyfield/troll-a","commit_stats":null,"previous_names":["crissyfield/troll-a"],"tags_count":3,"template":false,"template_full_name":null,"purl":"pkg:github/crissyfield/troll-a","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crissyfield%2Ftroll-a","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crissyfield%2Ftroll-a/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crissyfield%2Ftroll-a/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crissyfield%2F
troll-a/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/crissyfield","download_url":"https://codeload.github.com/crissyfield/troll-a/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/crissyfield%2Ftroll-a/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28480081,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-16T11:59:17.896Z","status":"ssl_error","status_checked_at":"2026-01-16T11:55:55.838Z","response_time":107,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["command-line-tool","common-crawl","internet-archive","security","security-tools","warc"],"created_at":"2026-01-16T16:50:09.073Z","updated_at":"2026-01-16T16:50:09.167Z","avatar_url":"https://github.com/crissyfield.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003cimg width=\"256\" src=\"assets/logo.png\"\u003e\n\u003c/p\u003e\n\n# Troll-A\n\n[![License: Apache](https://img.shields.io/github/license/crissyfield/troll-a?color=orange)](LICENSE)\n[![Go Report Card](https://goreportcard.com/badge/github.com/crissyfield/troll-a)](https://goreportcard.com/report/github.com/crissyfield/troll-a)\n[![Go Reference](https://pkg.go.dev/badge/github.com/crissyfield/troll-a.svg)](https://pkg.go.dev/github.com/crissyfield/troll-a)\n\n`Troll-A` is 
a command line tool for extracting secrets such as passwords, API keys, and tokens from WARC (Web\nARChive) files. `Troll-A` is an **easy-to-use**, **comprehensive**, and **fast** solution for finding\nsecrets in web archives.\n\n\n## Features\n\n- **Protocols:** Supports retrieving web archives directly from a network server via HTTP/HTTPS, from the\n  [Amazon S3](https://aws.amazon.com/pm/serv-s3/) object storage service, from the local file system, or\n  from STDIN.\n- **Compression:** Supports web archives compressed with [GZip](https://www.gzip.org),\n  [BZip2](https://sourceware.org/bzip2/), [XZ](https://github.com/tukaani-project/xz), or \n  [ZStd](https://github.com/facebook/zstd). For ZStd, it also supports custom dictionaries prepended to the\n  compressed data stream (as used by `*.megawarc.warc.zst` files).\n- **Comprehensive:** Uses the battle-tested ruleset from the [Gitleaks](https://gitleaks.io) project to\n  detect up to 166 different types of secrets, tokens, keys, or other sensitive information.\n- **Performance:** Works concurrently and optionally uses optimized regular expressions (via\n  [go-re2](https://github.com/wasilibs/go-re2)) to process a typical [Common Crawl](https://commoncrawl.org)\n  web archive (~34.000 pages) in less than 30 seconds on AWS `c7g.12xlarge`. 
This can be further improved by\n  narrowing down the WARC records to process, via the `--filter` option.\n- **Distribution:** `Troll-A` is distributed as prebuilt binaries, as a Docker image, or in source form.\n\n\n## Installation\n\n### Docker\n\n`Troll-A` is available on [GitHub's container\nregistry](https://github.com/crissyfield/troll-a/pkgs/container/troll-a) and can be used as follows:\n\n```bash\ndocker run --rm ghcr.io/crissyfield/troll-a [flags] [url]\n```\n\n### Prebuilt Binaries\n\n`Troll-A` is also available in binary form for macOS and Linux on the\n[releases page](https://github.com/crissyfield/troll-a/releases).\n\n\u003e [!NOTE]\n\u003e Unlike the Docker image, the prebuilt binaries are compiled using the regular expressions from Go's standard\n\u003e library and are therefore noticeably slower. If native binaries are preferred and performance is crucial, it\n\u003e is recommended to build the binaries from source.\n\n### Build From Source\n\nFor better performance, it is recommended to build `Troll-A` from source, as this allows the use of the optimized\nregular expression engine provided by [go-re2](https://github.com/wasilibs/go-re2). For this to work, the\n[RE2](https://github.com/google/re2) dependency must be installed first.\n\n#### macOS\n\n```\n# Install dependencies\nbrew install re2\n\n# Install with RE2 activated\ngo install -tags re2_cgo github.com/crissyfield/troll-a@v1.2.0\n```\n\n#### Debian / Ubuntu\n\n```\n# Install dependencies\nsudo apt install -u build-essential libre2-dev\n\n# Install with RE2 activated\ngo install -tags re2_cgo github.com/crissyfield/troll-a@v1.2.0\n```\n\n\n## Usage\n\n```\nUsage:\n  troll-a [flags] [url]\n\nThis tool allows to extract (potential) secrets such as passwords, API keys, and tokens\nfrom WARC (Web ARChive) files. 
Extracted information is output as structured text org\nJSON, which simplifies further processing of the data.\n\n\"url\" can be either a regular HTTP or HTTPS reference (\"https://domain/path\"), an Amazon\nS3 reference (\"s3://bucket/path\"), a file path (either \"file:///path\" or simply \"path\"),\nor a dash (\"-\") to read from STDIN. If \"url\" is omitted data is read from STDIN. If the\ninput data is compressed with either GZip, BZip2, XZ, or ZStd it is automatically\ndecompressed. ZStd with a prepended custom dictionary (as used by \"*.megawarc.warc.zstd\")\nis also handled transparently.\n\nThis tool uses rules from the Gitleaks project (https://gitleaks.io) to detect secrets.\n\nFlags:\n  -c, --custom stringArray     additional custom rule to apply. Secrets that match the\n                               given regular expression (using RE2 syntax) will also be\n                               reported. Can be specified multiple times.\n  -e, --enclosed               only report secrets that are enclosed within their context\n  -f, --filter string          filter for the target URL of each WARC record. Only WARC\n                               records that match the given regular expression (using RE2\n                               syntax) will be checked for secrets. An empty filter will\n                               match everything.\n  -h, --help                   help for troll-a\n  -j, --jobs uint              detect secrets with this many concurrent jobs (default 8)\n  -s, --json                   output detected secrets as JSON\n  -p, --preset rules-preset    rules preset to use. 
This could be one of the following:\n                               all:         All known rules will be applied, which can\n                                            result in a significant amount of noise for\n                                            large data sets.\n                               most:        Most of the rules are applied, skipping the\n                                            biggest culprits for false positives.\n                               secret:      Only rules are applied that are most likely\n                                            to result in an actual leak of a secret.\n                               none:        No rules at all are applied. This can be used\n                                            in combination with custom rules via the\n                                            --custom/-c switch.\n                               No other values are allowed. (default secret)\n  -q, --quiet                  suppress success message(s)\n  -r, --retry retry-strategy   retry strategy to use. This could be one of the following:\n                               never:       This strategy will fail after the first fetch\n                                            failure and will not attempt to retry.\n                               constant:    This strategy will attempt to retry up to 5\n                                            times, with a 5s delay after each attempt.\n                               exponential: This strategy will attempt to retry for 15\n                                            minutes, with an exponentially increasing\n                                            delay after each attempt.\n                               always:      This strategy will attempt to retry forever,\n                                            with no delay at all after each attempt.\n                               No other values are allowed. 
(default never)\n  -t, --timeout duration       fetching timeout (does not apply to files) (default 30m0s)\n  -v, --version                version for troll-a\n```\n\n\n## Examples\n\n### Common Crawl\n\n[Common Crawl](https://commoncrawl.org) maintains a free, open repository of web crawl data that can be used by\nanyone. The Common Crawl corpus contains petabytes of data collected regularly since 2008.\n\nFor example, to extract secrets from all of the 3.35 billion pages of the November/December 2023 crawl\n(called `CC-MAIN-2023-50`), you can do this:\n\n```bash\n# Download the list of all 90.000 WARC paths\ncurl -sSL -O https://data.commoncrawl.org/crawl-data/CC-MAIN-2023-50/warc.paths.gz\n\n# Iterate through all paths using 64 scanning jobs, output matches as JSON\ngzcat warc.paths.gz | \\\nxargs -I{} -- troll-a -e -s -j64 https://data.commoncrawl.org/{} \u003e secrets.json\n```\n\n\u003e [!WARNING]\n\u003e This will take a long time! Depending on your hardware and Internet connection, this can take anywhere from\n\u003e a week to several months. You may want to run this example only for the first few lines of `warc.paths.gz`.\n\n### Internet Archive\n\nThe [Archive Team](http://archiveteam.org/index.php) is a group dedicated to digital preservation and web\narchiving founded in 2009. 
Web archives are stored as WARC files (more specifically, in MegaWARC format) and\nmade available through the [Internet Archive](https://archive.org/details/archiveteam).\n\nFor example, to extract secrets from the 113.372 pages the Archive Team crawled from\n[pastebin.com](https://pastebin.com) in April of 2023 (here's the corresponding\n[publication](https://archive.org/details/archiveteam_pastebin_20230421003309_a3b951b4) on the Internet\nArchive), you can do this:\n\n```bash\n# Call troll-a directly with the MegaWARC URL\ntroll-a -e https://archive.org/download/archiveteam_pastebin_20230421003309_a3b951b4/pastebin_20230421003309_a3b951b4.1603050931.megawarc.warc.zst\n```\n\n...which results in...\n\n```\nDetected: secret=\"acf30fb56amsh654fa8104418601p1e420cjsn3152a0032f0b\" rule=\"rapidapi-access-token\" uri=\"https://pastebin.com/raw/bKMJXkQE\" line=36 column=15\nDetected: secret=\"acf30fb56amsh654fa8104418601p1e420cjsn3152a0032f0b\" rule=\"rapidapi-access-token\" uri=\"https://pastebin.com/raw/bKMJXkQE\" line=36 column=15\nDetected: secret=\"acf30fb56amsh654fa8104418601p1e420cjsn3152a0032f0b\" rule=\"rapidapi-access-token\" uri=\"https://pastebin.com/raw/nferefe2\" line=37 column=6\nDetected: secret=\"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\" rule=\"github-pat\" uri=\"https://pastebin.com/print/cQEA2GCS\" line=39 column=123\nDetected: secret=\"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\" rule=\"github-pat\" uri=\"https://pastebin.com/embed_js/cQEA2GCS\" line=11 column=2688\nDetected: secret=\"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\" rule=\"github-pat\" uri=\"https://pastebin.com/embed_iframe/cQEA2GCS?theme=dark\" line=49 column=123\nDetected: secret=\"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\" rule=\"github-pat\" uri=\"https://pastebin.com/cQEA2GCS\" line=222 column=123\nDetected: secret=\"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\" rule=\"github-pat\" uri=\"https://pastebin.com/raw/cQEA2GCS\" line=22 column=22\nDetected: 
secret=\"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\" rule=\"github-pat\" uri=\"https://pastebin.com/embed_iframe/cQEA2GCS\" line=48 column=123\nDetected: secret=\"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\" rule=\"github-pat\" uri=\"https://pastebin.com/embed_js/cQEA2GCS?theme=dark\" line=11 column=2796\nDetected: secret=\"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\" rule=\"github-pat\" uri=\"https://pastebin.com/clone/cQEA2GCS\" line=152 column=27\nSuccess: Processed https://archive.org/download/archiveteam_pastebin_20230421003309_a3b951b4/pastebin_20230421003309_a3b951b4.1603050931.megawarc.warc.zst (113372 records)\n```\n\n\n## Credits\n\nThe set of rules used to detect the actual secrets is part of the [Gitleaks](https://gitleaks.io) project. We\nare very grateful for the tremendous work they have done in compiling all this information!\n\n\n## What's up with the name?\n\nThe [Troll A platform](https://en.wikipedia.org/wiki/Troll_A_platform) is a natural gas platform in the Troll\ngas field off the west coast of Norway. As of 2014, it was the tallest structure that has ever been moved to\nanother position, relative to the surface of the Earth, and is among the largest and most complex engineering\nprojects in history. In 1996, the platform set the Guinness World Record for the largest offshore gas platform.\n\n\u003e [!NOTE]\n\u003e While we deeply dislike the exploitation of natural resources, we admire the engineering feat!\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrissyfield%2Ftroll-a","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fcrissyfield%2Ftroll-a","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fcrissyfield%2Ftroll-a/lists"}