{"id":22235021,"url":"https://github.com/pzaino/thecrowler","last_synced_at":"2026-02-06T03:13:16.642Z","repository":{"id":218102403,"uuid":"718884277","full_name":"pzaino/thecrowler","owner":"pzaino","description":"A Content Discovery and Development Platform. Empowering Cybersecurity, AI, Marketing, and Finance professionals and researchers to discover, analyze, and interact with the web in all its dimensions.","archived":false,"fork":false,"pushed_at":"2025-07-10T02:29:38.000Z","size":39770,"stargazers_count":47,"open_issues_count":1,"forks_count":9,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-07-10T11:24:34.212Z","etag":null,"topics":["automation","blue-team-tool","content-detection","content-discovery","crawler","crawling","cyber-security","cybersecurity","cybersecurity-tools","data-collection","data-science","distributed-systems","golang","indexer","indexing","reconnaissance","red-team-tools","scraping","search-engine","vulnerability-detection"],"latest_commit_sha":null,"homepage":"https://paolozaino.wordpress.com/portfolio/the-crowler/","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/pzaino.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":"support/content_type_detection.yaml","governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null},"funding":{"github":["pzaino"]}},"created_at":"2023-11-15T01:36:50.000Z","updated_at":"2025-06-13T21:13:18.000Z","dependencies_parsed_at":"2024-04-16T22:23:38.450Z","dependency_job_id":"a9f051f6-a2c0-4992-bbd9-0e05e41f6d1c","html_url":"https://github.com/pzain
o/thecrowler","commit_stats":null,"previous_names":["pzaino/thecrowler"],"tags_count":13,"template":false,"template_full_name":null,"purl":"pkg:github/pzaino/thecrowler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pzaino%2Fthecrowler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pzaino%2Fthecrowler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pzaino%2Fthecrowler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pzaino%2Fthecrowler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/pzaino","download_url":"https://codeload.github.com/pzaino/thecrowler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/pzaino%2Fthecrowler/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":265444467,"owners_count":23766431,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","blue-team-tool","content-detection","content-discovery","crawler","crawling","cyber-security","cybersecurity","cybersecurity-tools","data-collection","data-science","distributed-systems","golang","indexer","indexing","reconnaissance","red-team-tools","scraping","search-engine","vulnerability-detection"],"created_at":"2024-12-03T02:11:19.407Z","updated_at":"2026-01-30T05:46:54.125Z","avatar_url":"https://github.com/pzaino.png","language":"Go","funding_links":["https://github.com/sponsors/pzaino"],"categories":[],"sub_categories":[],"readme":"# The CROWler\n\n\u003cimg 
align=\"right\" width=\"320\" height=\"280\"\n src=\"https://raw.githubusercontent.com/pzaino/thecrowler/main/images/TheCROWler_v1JPG.jpg\" alt=\"TheCROWLer Logo\"\u003e\n\n![Go build: ](https://github.com/pzaino/TheCROWler/actions/workflows/go.yml/badge.svg)\n![CodeQL: ](https://github.com/pzaino/TheCROWler/actions/workflows/github-code-scanning/codeql/badge.svg)\n![Scorecard supply-chain security: ](https://github.com/pzaino/TheCROWler/actions/workflows/scorecard.yml/badge.svg)\n\u003c!-- \u003ca href=\"https://www.bestpractices.dev/projects/8344\"\u003e\u003cimg\nsrc=\"https://www.bestpractices.dev/projects/8344/badge\"\nalt=\"OpenSSF Security Best Practices badge\"\u003e\u003c/a\u003e //--\u003e\n[![OpenSSF Best Practices](https://www.bestpractices.dev/projects/8344/badge)](https://www.bestpractices.dev/projects/8344)\n[![Codacy Badge](https://app.codacy.com/project/badge/Grade/bb3868a72f044f1fb2ebe5516224d943)](https://app.codacy.com/gh/pzaino/thecrowler/dashboard?utm_source=gh\u0026utm_medium=referral\u0026utm_content=\u0026utm_campaign=Badge_grade)\n\u003c!-- ![Docker build: ]() --\u003e\n[![Go Report Card](https://goreportcard.com/badge/github.com/pzaino/TheCROWler)](https://goreportcard.com/report/github.com/pzaino/thecrowler)\n[![Go-VulnCheck](https://github.com/pzaino/thecrowler/actions/workflows/go-vulncheck.yml/badge.svg)](https://github.com/pzaino/thecrowler/actions/workflows/go-vulncheck.yml)\n[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fpzaino%2Fthecrowler.svg?type=shield\u0026issueType=license)](https://app.fossa.com/projects/git%2Bgithub.com%2Fpzaino%2Fthecrowler?ref=badge_shield\u0026issueType=license)\n[![FOSSA Status](https://app.fossa.com/api/projects/git%2Bgithub.com%2Fpzaino%2Fthecrowler.svg?type=shield\u0026issueType=security)](https://app.fossa.com/projects/git%2Bgithub.com%2Fpzaino%2Fthecrowler?ref=badge_shield\u0026issueType=security)\n![License: ](https://img.shields.io/github/license/pzaino/TheCROWler)\n[![Go 
Reference](https://pkg.go.dev/badge/github.com/pzaino/thecrowler.svg)](https://pkg.go.dev/github.com/pzaino/thecrowler)\n![GitHub language count](https://img.shields.io/github/languages/count/pzaino/thecrowler)\n![GitHub commit activity](https://img.shields.io/github/commit-activity/t/pzaino/thecrowler)\n![GitHub contributors](https://img.shields.io/github/contributors/pzaino/thecrowler)\n![GitHub code search count](https://img.shields.io/github/search?query=thecrowler)\n\n![GitHub go.mod Go version](https://img.shields.io/github/go-mod/go-version/pzaino/thecrowler)\n![GitHub Tag](https://img.shields.io/github/v/tag/pzaino/thecrowler)\n\n## What is it?\n\nThe CROWler is a self-hosted, event-driven Content Discovery and Intelligence development platform\ndesigned for advanced web crawling, scraping, detection, and automation using real browsers,\nrulesets, plugins, and agents.\n\n**Project status:** Still under active development (WIP). Most components are usable.\nBeta testers welcome. Full [daily progress stats](https://githubtracker.com/pzaino/thecrowler).\n\nAdditionally, the system is equipped with a powerful search API, providing\na streamlined interface for data queries. This feature ensures easy\nintegration and access to indexed data for various applications.\n\nThe CROWler is designed to be micro-services based, so it can be easily\ndeployed in a containerized environment.\n\nAs a Content Discovery Development Platform, The CROWler integrates a wide\nrange of technologies and features, including Plugins, Rulesets, Agents,\nEvents and more. 
These components work together to provide a comprehensive\nplatform to develop your own solutions for content discovery, data extraction,\nand more.\n\n## What Makes the CROWler Different\n\n- **Real browsers, not abstractions**\n  The CROWler uses actual browser engines (Chromium, Chrome, Firefox) instead of\n  simplified request pipelines.\n\n- **Declarative rulesets**\n  Crawling, scraping, detection, and actions are defined in versioned YAML/JSON rulesets.\n\n- **First-class detection logic**\n  Technologies, frameworks, objects, and vulnerabilities are detected using user-defined rules.\n\n- **Event-driven automation**\n  Crawling and detection events can trigger agents, plugins, and workflows.\n\n- **Extensible by design**\n  JavaScript plugins can extend the engine, browser, API, and event system.\n\n- **You own the data and infrastructure**\n  The CROWler is fully self-hosted and auditable.\n\n## Who Is the CROWler For?\n\nThe CROWler is designed for:\n\n- Engineers and developers\n- Security researchers\n- Intelligence and OSINT teams\n- Advanced data collection pipelines\n- Organizations that require full control and auditability\n\nIt is **not** designed as a point-and-click scraping SaaS or a turnkey data service.\n\n## Design Philosophy\n\nThe CROWler is built on a few core principles:\n\n- Control is better than abstraction\n- Logic should be explicit and \"inspectable\"\n- Automation should be event-driven\n- Intelligence should be user-defined\n- Infrastructure and data ownership matter\n\n## Getting Started\n\n- Documentation: [doc/](doc/)\n- GPT-based Support [Chatbot](https://chatgpt.com/g/g-dEfqHkqrW-the-crowler-support) (You must be logged in to ChatGPT to use it properly; CustomGPTs have very limited access otherwise)\n- Configuration examples: [config.default](config.default)\n- Ruleset schemas: [schemas/](schemas/)\n- Plugin examples: [plugins/](plugins/)\n\nStart with the installation guide (below) and the minimal configuration 
example.\n\n## Table of contents\n\n- [Features](#features-overview)\n- [What problem does it solve?](#what-problem-does-it-solve)\n- [How do I pronounce the name?](#how-do-i-pronounce-the-name)\n- [How to use it?](#how-to-use-it)\n  - [Prerequisites](#prerequisites)\n  - [Installation](#installation)\n    - [Easy Installation and deployment](#1-easy-installation-and-deployment)\n    - [If you're planning to install it manually](#2-if-youre-planning-to-install-it-manually)\n    - [Build from source](#build-from-source)\n- [Production](#production)\n- [DB Maintenance](#db-maintenance)\n- [License](#license)\n- [Contributing](#contributing)\n- [Code of Conduct](#code-of-conduct)\n- [Acknowledgements](#acknowledgements)\n- [Disclaimer](#disclaimer)\n- [Top Contributors](#top-contributors)\n\n## Features (Overview)\n\nThe CROWler is a **full-spectrum Content Discovery and Intelligence platform**.\nIts capabilities span crawling, interaction, detection, automation, security,\nand large-scale data analysis.\n\nBelow is a **high-level overview** of the main feature areas.\nFor a complete and detailed breakdown, see: **[doc/features.md](doc/features.md)**\n\n### Web Crawling \u0026 Interaction\n\n- Recursive and scoped crawling (URL, domain, subdomain, recursive)\n- Real browser rendering (Chromium, Chrome, Firefox)\n- Human Behavior Simulation (HBS)\n- Dynamic JavaScript content handling\n- Custom User-Agent, request filtering, and bandwidth control\n\n### Search \u0026 Discovery\n\n- High-performance API-based search engine\n- Advanced query operators and dorking\n- Entity extraction and correlation\n- Result export (CSV, JSON)\n\n### Scraping \u0026 Data Processing\n\n- Declarative scraping rules (CSS, XPath, regex)\n- Post-processing, transformation, and enrichment\n- Plugin- and AI-based data pipelines\n\n### Detection \u0026 Intelligence\n\n- Technology, framework, and library detection\n- Vulnerability and security header analysis\n- TLS/SSL fingerprinting (JA3, 
JA4, certificates)\n- Integration with external intelligence sources\n\n### Rules, Actions \u0026 Automation\n\n- Declarative rulesets (crawl, scrape, action, detection)\n- System-level action execution via real browsers\n- Event-driven workflows and scheduling\n\n### Extensibility\n\n- JavaScript plugin system (engine, browser, API, event)\n- Custom API endpoints\n- Agent-controlled execution\n\n### Agents \u0026 AI\n\n- Traditional and AI agents\n- Event-driven agent orchestration\n- Pre-deployed containerized AI models (CUDA / non-CUDA)\n- Multi-model AI workflows\n\n### Security \u0026 Cybersecurity\n\n- Network reconnaissance (DNS, WHOIS, service discovery)\n- Fuzzing and security testing\n- Native support for third-party security services\n\n### Deployment \u0026 Scalability\n\n- Microservices architecture\n- Horizontal scaling of engines, VDIs, APIs\n- Docker-based deployment\n- On-prem, cloud, and hybrid environments\n\n**Full feature list and detailed explanations:**\n[doc/features.md](doc/features.md)\n\n### What problem does it solve?\n\nThe CROWler is designed to solve a set of problems about web crawling, content\ndiscovery, technology detection and data extraction.\n\nWhile its main goal is to enable private, professional and enterprise users to\nquickly develop their content discovery solutions, it’s also designed to be\nable to crawl private networks and intranets, so you can use it to create your\nown or your company's search engine.\n\nOn top of that, it can also be used as the \"base\" for a more complex cyber security\ntool, as it can be used to gather information about a website, its network, its\nowners, vulnerabilities, which services are being exposed, etc.\n\nGiven it can also extract information, it can be used to create knowledge bases\nwith references to the sources, or to create a database of information about a\nspecific topic.\n\nObviously, it can also be used to do keyword analysis, language detection, etc.\nbut this is something every 
single crawler can be used for. However, all the\n\"classic\" features are implemented/being implemented.\n\n### How do I pronounce the name?\n\n**The**: Pronounced as /ðə/ before a consonant sound; it sounds like\n\"thuh.\"\n\n**CROW**: Pronounced as /kroʊ/, rhymes with \"know\" or \"snow.\"\n\n**ler**: The latter part is pronounced as /lər/, similar to the ending of the\nword \"crawler\" or the \"ler\" in \"tumbler.\"\n\nPutting it all together, it sounds like \"**thuh KROH-lər**\"\n\n### What ChatGPT thinks about the CROWler ;)\n\n(The following section is intentionally light-hearted and non-authoritative.)\n\n\"The CROWler is not just a tool; it's a commitment to ethical, efficient, and\neffective web crawling. Whether you're conducting academic research, market\nanalysis, or enhancing your cybersecurity posture, The CROWler delivers with\nintegrity and precision.\n\nJoin us in redefining the standards of web crawling. Explore more and contribute\nto The CROWler's journey towards a more respectful and insightful digital\nexploration.\"\n\n😂 that's clearly a bit over the top, but it was fun and I decided to include\nit here, just for fun. BTW it does make me feel like I want to add:\n\n\"...and there is one more thing!\" (I wonder why?!?!) 😂\n\n## How to use it?\n\n### Prerequisites\n\nThe CROWler is designed as a microservices-based system, allowing independent scaling, isolation, and orchestration of engines, VDIs, APIs, and event management.\n\n- [Docker](https://docs.docker.com/install/)\n- [Docker Compose](https://docs.docker.com/compose/install/)\n\nFor a docker compose based installation, that's all you need.\nIf you have docker and docker compose installed, you can go straight to the\n**Installation** section.\n\n### Installation\n\n#### 1. 
Easy Installation and deployment\n\nThe **easiest way** to install the CROWler is to use the docker compose file.\nTo do so, follow the [instructions here](doc/docker_build.md).\n\n**Please note(1)**: If you have questions about config.yaml, the ENV vars,\nthe rulesets, etc., you can use the GPT chatbot to help you. Just go to this\nlink [here (it's freely available to everyone)](https://chatgpt.com/g/g-dEfqHkqrW-the-crowler-support)\n\n**Please Note(2)**: If you're running the CROWler on a Raspberry Pi, you'll\nneed to build the CROWler for the `arm64` platform. To do so, the easiest way\nis to build the CROWler with the `docker-build.sh` script directly on the\nRaspberry Pi.\n\n#### 2. If you're planning to install it manually\n\nIf, instead, you're planning to install the CROWler manually, you'll need to\ninstall the following Docker container:\n\n- [PostgreSQL Container](https://hub.docker.com/_/postgres)\n  - Postgres 15 and later (for both ARM and x86) are supported at the moment.\n  - Then run the DB Schema setup script on it (make sure you check the\n    section of the db schema with the user credentials and set those SQL\n    variables correctly)\n\n- Also please note: The CROWler will need its VDI image to be built, so you'll\n  need to build the VDI image as well.\n\n### Build from source\n\nIf you use docker compose, everything will build automatically;\nall you need to do is follow the instructions in the Installation\nsection.\n\nIf, instead, you want to build locally on your machine, then follow the\ninstructions in this section.\n\nTo build the CROWler from source, you'll need to install the following:\n\n- [Go](https://golang.org/doc/install)\n\nThen you'll need to clone the repository and build the targets you need.\n\nTo build everything at once, run the following command:\n\n```bash\n./autobuild.sh\n```\n\nTo build individual targets:\n\nFirst, check which targets can be built and are available, then run the 
following\ncommand:\n\n```bash\n./autobuild.sh name-of-the-target\n```\n\nThis will build your requested component in `./bin`:\n\n```bash\n./bin/removeSite\n./bin/addSite\n./bin/addCategory\n./bin/api\n./bin/thecrowler\n```\n\nBuild them as you need them, or run `autobuild.sh` (no arguments) to build\nthem all.\n\nOptionally, you can build the Docker image. To do so, run the following command:\n\n```bash\ndocker build -t \u003cimage-name\u003e .\n```\n\n**Note**: If you build the CROWler engine docker container, remember to run\nit with the following docker command (it's required!):\n\n```bash\ndocker run -it --rm --cap-add=NET_ADMIN --cap-add=NET_RAW crowler_engine\n```\n\n**Important Note**: If you build from source, you still need to build a\nCROWler VDI docker image. This is needed because the CROWler uses a bunch of\nexternal tools to do its job and all those tools are grouped and built in the\nVDI image (Virtual Desktop Image).\n\n### Usage\n\nFor instructions on how to use it, see [here](doc/usage.md).\n\n## Production\n\nIf you want to use the CROWler in production, I recommend using the docker\ncompose installation. It's the easiest way to install it and it's the most\nsecure one.\n\nFor better security, I strongly recommend deploying the API in a separate container\nfrom the CROWler one. Also, there is no need to expose the CROWler container to the\noutside world; it will need internet access, though.\n\n## DB Maintenance\n\nThe CROWler default configuration uses PostgreSQL as its database. The database is\nstored in a Docker volume and is persistent.\n\nThe DB should need no maintenance; The CROWler will take care of that. Any time\nthere is no crawling activity and at least 1 hour has passed since the previous\nmaintenance activity, The CROWler will clean up the database and optimize the\nindexes.\n\n## License\n\nThe CROWler is licensed under the Apache 2.0 License. 
For more information, see\nthe [LICENSE](LICENSE) file.\n\n## Contributing\n\nIf you want to contribute to the project, please read the [CONTRIBUTING](CONTRIBUTING.md)\nfile.\n\n## Code of Conduct\n\nThe CROWler has adopted the Contributor Covenant Code of Conduct. For more\ninformation, see the [CODE_OF_CONDUCT](CODE_OF_CONDUCT.md) file.\n\n## Acknowledgements\n\nThe CROWler is built on top of a lot of open-source projects, and I want to\nthank all the developers that contributed to those projects. Without them, the\nCROWler would not be possible.\n\nAlso, I want to thank the people that are helping me with the project, either\nby contributing code, by testing it, or by providing feedback. Thank you all!\n\n## Disclaimer\n\nThe CROWler is a tool designed to help you crawl websites in a respectful way.\nHowever, it's up to you to use it in a respectful way. The CROWler is not\nresponsible for any misuse of the tool.\n\n## Top Contributors\n\n\u003ca href=\"https://github.com/pzaino/thecrowler/graphs/contributors\"\u003e\n  \u003cimg src=\"https://contrib.rocks/image?repo=pzaino/thecrowler\" /\u003e\n\u003c/a\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpzaino%2Fthecrowler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fpzaino%2Fthecrowler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fpzaino%2Fthecrowler/lists"}