{"id":29935119,"url":"https://github.com/kgruiz/stealth-crawler","last_synced_at":"2025-10-25T06:35:18.636Z","repository":{"id":296758045,"uuid":"994276256","full_name":"kgruiz/stealth-crawler","owner":"kgruiz","description":"Asynchronous headless-Chrome web crawler that discovers internal links and optionally saves HTML, Markdown, screenshots, or PDFs. Built for scripting, inspection, and automation.","archived":false,"fork":false,"pushed_at":"2025-07-31T03:34:41.000Z","size":1339,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-09-07T05:28:38.154Z","etag":null,"topics":["asyncio","cli","crawler","headless-chrome","html-scraper","pydoll","python","web-crawler"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kgruiz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-06-01T15:47:22.000Z","updated_at":"2025-07-31T03:34:44.000Z","dependencies_parsed_at":"2025-07-30T18:49:48.402Z","dependency_job_id":"b27832ed-fa5a-430c-b719-6e916fe7a936","html_url":"https://github.com/kgruiz/stealth-crawler","commit_stats":null,"previous_names":["kgruiz/crawler","kgruiz/stealth-crawler"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/kgruiz/stealth-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kgruiz%2Fstealth-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kgruiz%2Fstealth-crawler/tags","release
s_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kgruiz%2Fstealth-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kgruiz%2Fstealth-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kgruiz","download_url":"https://codeload.github.com/kgruiz/stealth-crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kgruiz%2Fstealth-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":280917358,"owners_count":26413205,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-25T02:00:06.499Z","response_time":81,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["asyncio","cli","crawler","headless-chrome","html-scraper","pydoll","python","web-crawler"],"created_at":"2025-08-02T20:34:59.162Z","updated_at":"2025-10-25T06:35:18.606Z","avatar_url":"https://github.com/kgruiz.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg src=\"icon.png\" alt=\"Stealth Crawler Icon\" width=\"200\"\u003e\n\u003c/p\u003e\n\n\u003ch1 align=\"center\"\u003eStealth Crawler\u003c/h1\u003e\n\n\u003cp align=\"center\"\u003e\nA headless-Chrome web crawler that discovers same-host links and optionally saves HTML, Markdown, PDF, or screenshots. 
Use as a library or via the \u003ccode\u003estealth-crawler\u003c/code\u003e CLI.\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://pypi.org/project/stealth-crawler/\"\u003e\n    \u003cimg src=\"https://img.shields.io/pypi/v/stealth-crawler.svg\" alt=\"PyPI\"\u003e\n  \u003c/a\u003e\u0026nbsp;\u0026nbsp;\n  \u003ca href=\"https://github.com/kgruiz/stealth-crawler/blob/main/LICENSE\"\u003e\n    \u003cimg src=\"https://img.shields.io/pypi/l/stealth-crawler.svg\" alt=\"License\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n---\n\n## Features\n\n- Asynchronous, headless Chrome browsing via `pydoll`\n- Discovers internal links starting from a root URL\n- Optional content saving:\n  - HTML\n  - Markdown (via `html2text`)\n  - PDF snapshots\n  - PNG screenshots\n- Rich progress bars with `rich`\n- Configurable URL filtering (base, exclude)\n- Pure-Python API and CLI\n\n---\n\n## Installation\n\nInstall the latest stable release:\n\n```bash\npip install stealth-crawler\n```\n\nOr in isolation:\n\n```bash\npipx install stealth-crawler\n```\n\nOr via other tools:\n\n* **uv**\n\n  ```bash\n  uv venv .venv\n  source .venv/bin/activate\n  uv pip install stealth-crawler\n  ```\n\n* **Poetry**\n\n  ```bash\n  poetry add stealth-crawler\n  ```\n\n---\n\n## Quickstart\n\n### \u003cimg src=\"https://upload.wikimedia.org/wikipedia/commons/b/b3/Terminalicon2.png\" alt=\"Terminal\" width=\"45\" valign=\"middle\" /\u003e Command-Line\n\n```bash\n# Discover URLs only\nstealth-crawler crawl https://example.com --urls-only\n\n# Crawl and save HTML + Markdown\nstealth-crawler crawl https://example.com \\\n  --save-html --save-md \\\n  --output-dir ./output\n\n# Exclude specific paths\nstealth-crawler crawl https://example.com \\\n  --exclude /private,/logout\n```\n\nRun `stealth-crawler --help` for full options.\n\n### \u003cimg src=\"https://s3.dualstack.us-east-2.amazonaws.com/pythondotorg-assets/media/files/python-logo-only.svg\" alt=\"Python\" width=\"35\" 
valign=\"middle\" /\u003e Python API\n\n```python\nimport asyncio\nfrom stealthcrawler import StealthCrawler\n\ncrawler = StealthCrawler(\n    base=\"https://example.com\",\n    exclude=[\"/admin\"],\n    save_html=True,\n    save_md=True,\n    output_dir=\"export\"\n)\nurls = asyncio.run(crawler.crawl(\"https://example.com\"))\nprint(urls)\n```\n\n---\n\n## Configuration\n\n| Option        | CLI flag       | API param    | Default    |\n| ------------- | -------------- | ------------ | ---------- |\n| Base URL(s)   | `--base`       | `base`       | start URL  |\n| Exclude paths | `--exclude`    | `exclude`    | none       |\n| Save HTML     | `--save-html`  | `save_html`  | `False`    |\n| Save Markdown | `--save-md`    | `save_md`    | `False`    |\n| URLs only     | `--urls-only`  | `urls_only`  | `False`    |\n| Output folder | `--output-dir` | `output_dir` | `./output` |\n\n---\n\n## Testing \u0026 Quality\n\n* Run tests:\n\n  ```bash\n  pytest\n  ```\n\n* Check formatting \u0026 linting:\n\n  ```bash\n  black src tests\n  ruff check src tests\n  ```\n\n---\n\n## Contributing\n\n1. Fork the repository and create a feature branch.\n2. Set up your development environment:\n\n   ```bash\n   python3 -m venv .venv\n   source .venv/bin/activate\n   pip install -e \".[dev]\"\n   ```\n\n   Or with **uv**:\n\n   ```bash\n   uv venv .venv\n   source .venv/bin/activate\n   uv pip install -e \".[dev]\"\n   ```\n3. Implement your changes, add tests, and run:\n\n   ```bash\n   black src tests\n   ruff check src tests\n   pytest\n   ```\n4. 
Open a pull request against `main`.\n\n---\n\n## License\n\nThis project is licensed under the **GNU General Public License v3.0 or later** (GPL-3.0-or-later).\nYou are free to use, modify, and redistribute it under the terms of the GPL.\nSee [LICENSE](./LICENSE) for full details.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkgruiz%2Fstealth-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkgruiz%2Fstealth-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkgruiz%2Fstealth-crawler/lists"}