https://github.com/kgruiz/stealth-crawler

Asynchronous headless-Chrome web crawler that discovers internal links and optionally saves HTML, Markdown, screenshots, or PDFs. Built for scripting, inspection, and automation.

asyncio cli crawler headless-chrome html-scraper pydoll python web-crawler

Last synced: 8 months ago

Host: GitHub
URL: https://github.com/kgruiz/stealth-crawler
Owner: kgruiz
License: gpl-3.0
Created: 2025-06-01T15:47:22.000Z (about 1 year ago)
Default Branch: main
Last Pushed: 2025-07-31T03:34:41.000Z (11 months ago)
Last Synced: 2025-09-07T05:28:38.154Z (10 months ago)
Topics: asyncio, cli, crawler, headless-chrome, html-scraper, pydoll, python, web-crawler
Language: Python
Homepage:
Size: 1.28 MB
Stars: 1
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# Stealth Crawler

A headless-Chrome web crawler that discovers same-host links and optionally saves each page as HTML, Markdown, PDF, or a screenshot. Use it as a library or through the `stealth-crawler` CLI.

---

## Features

- Asynchronous, headless Chrome browsing via `pydoll`
- Discovers internal links starting from a root URL
- Optional content saving:
  - HTML
  - Markdown (via `html2text`)
  - PDF snapshots
  - PNG screenshots
- Rich progress bars with `rich`
- Configurable URL filtering (base, exclude)
- Pure-Python API and CLI
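
To make the link discovery and URL filtering features concrete, here is a minimal sketch of how a same-host filter with excluded paths could work. It uses only the standard library and is not the package's actual implementation; the helper names `is_internal` and `normalize` are hypothetical, and treating each exclude entry as a path prefix is an assumption.

```python
from urllib.parse import urldefrag, urljoin, urlparse


def is_internal(url: str, base: str, exclude: tuple[str, ...] = ()) -> bool:
    """Return True if `url` is on the same host as `base` and not excluded."""
    target, root = urlparse(url), urlparse(base)
    if target.scheme not in ("http", "https"):
        return False
    if target.netloc.lower() != root.netloc.lower():
        return False
    # Assumption: each exclude entry is matched as a path prefix.
    return not any(target.path.startswith(prefix) for prefix in exclude)


def normalize(href: str, page_url: str) -> str:
    """Resolve a possibly relative href against the page and drop any #fragment."""
    absolute, _fragment = urldefrag(urljoin(page_url, href))
    return absolute


page = "https://example.com/"
links = ["/docs", "/logout", "https://other.example.org/", "about#team", "mailto:a@example.com"]
kept = [
    url
    for url in (normalize(href, page) for href in links)
    if is_internal(url, page, exclude=("/logout",))
]
print(kept)
```

Only `/docs` and `about` survive here: the logout path is excluded, the other host is external, and the `mailto:` link is not HTTP(S).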

---

## Installation

Install the latest stable release:

```bash
pip install stealth-crawler
```

Or install it in an isolated environment with pipx:

```bash
pipx install stealth-crawler
```

Or via other tools:

* **uv**

```bash
uv venv .venv
source .venv/bin/activate
uv pip install stealth-crawler
```

* **Poetry**

```bash
poetry add stealth-crawler
```

---

## Quickstart

### Command Line

```bash
# Discover URLs only
stealth-crawler crawl https://example.com --urls-only

# Crawl and save HTML + Markdown
stealth-crawler crawl https://example.com \
  --save-html --save-md \
  --output-dir ./output

# Exclude specific paths
stealth-crawler crawl https://example.com \
  --exclude /private,/logout
```

Run `stealth-crawler --help` for full options.
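
When driving the CLI from a script, it can help to assemble the argument list programmatically. The sketch below uses only the flags shown above; `build_crawl_command` is a hypothetical helper and not part of the package.

```python
import shlex


def build_crawl_command(url, *, urls_only=False, save_html=False, save_md=False,
                        output_dir=None, exclude=()):
    """Assemble an argv list for `stealth-crawler crawl` from the documented flags."""
    argv = ["stealth-crawler", "crawl", url]
    if urls_only:
        argv.append("--urls-only")
    if save_html:
        argv.append("--save-html")
    if save_md:
        argv.append("--save-md")
    if output_dir is not None:
        argv += ["--output-dir", str(output_dir)]
    if exclude:
        # The CLI examples pass excluded paths as one comma-separated value.
        argv += ["--exclude", ",".join(exclude)]
    return argv


cmd = build_crawl_command(
    "https://example.com", save_html=True, save_md=True, output_dir="./output"
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` rather than a shell string avoids quoting problems with URLs and paths.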

### Python API

```python
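import asyncio

from stealthcrawler import StealthCrawler

# Configure the crawler: base URL, excluded paths, and which formats to save.
crawler = StealthCrawler(
    base="https://example.com",
    exclude=["/admin"],
    save_html=True,
    save_md=True,
    output_dir="export",
)

# crawl() is awaited via asyncio.run(), which drives it to completion.
urls = asyncio.run(crawler.crawl("https://example.com"))
print(urls)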
```
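
For intuition about what an asynchronous crawl from a root URL involves, the following sketch performs a breadth-first traversal over a fake in-memory site. It illustrates the general technique only and is not how `StealthCrawler` is implemented; `FAKE_SITE`, `fetch_links`, and `discover` are hypothetical names, and the stub replaces real browser page loads.

```python
import asyncio

# Hypothetical in-memory "site": page URL -> links found on that page.
FAKE_SITE = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/c"],
    "https://example.com/b": ["https://example.com/"],
    "https://example.com/c": [],
}


async def fetch_links(url: str) -> list[str]:
    """Stand-in for loading a page in a browser and extracting its links."""
    await asyncio.sleep(0)  # yield control, as a real network call would
    return FAKE_SITE.get(url, [])


async def discover(start: str) -> list[str]:
    """Breadth-first discovery: fetch each frontier concurrently, skip seen URLs."""
    seen = {start}
    frontier = [start]
    while frontier:
        results = await asyncio.gather(*(fetch_links(u) for u in frontier))
        next_frontier = []
        for links in results:
            for link in links:
                if link not in seen:
                    seen.add(link)
                    next_frontier.append(link)
        frontier = next_frontier
    return sorted(seen)


found = asyncio.run(discover("https://example.com/"))
print(found)
```

The `seen` set is what keeps the cycle between `/` and `/b` from looping forever.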

---

## Configuration

| Option | CLI flag | API param | Default |
| ------------- | -------------- | ------------ | ---------- |
| Base URL(s) | `--base` | `base` | start URL |
| Exclude paths | `--exclude` | `exclude` | none |
| Save HTML | `--save-html` | `save_html` | `False` |
| Save Markdown | `--save-md` | `save_md` | `False` |
| URLs only | `--urls-only` | `urls_only` | `False` |
| Output folder | `--output-dir` | `output_dir` | `./output` |
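
One way to keep these options together in your own scripts is a small dataclass whose fields and defaults mirror the table. This is a sketch: `CrawlConfig` is a hypothetical container, not something the package provides, and whether every parameter belongs to the `StealthCrawler` constructor is an assumption to verify against the package itself.

```python
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class CrawlConfig:
    """Hypothetical container mirroring the options table; not part of the package."""

    base: Optional[str] = None  # table default: the start URL
    exclude: list[str] = field(default_factory=list)  # table default: none
    save_html: bool = False
    save_md: bool = False
    urls_only: bool = False
    output_dir: str = "./output"

    def to_kwargs(self, start_url: str) -> dict:
        """Return keyword arguments named after the API params in the table."""
        kwargs = asdict(self)
        if kwargs["base"] is None:
            # Fall back to the start URL, as the table's default column states.
            kwargs["base"] = start_url
        return kwargs


config = CrawlConfig(save_md=True, exclude=["/admin"])
print(config.to_kwargs("https://example.com"))
```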

---

## Testing & Quality

* Run tests:

```bash
pytest
```

* Check formatting & linting:

```bash
black src tests
ruff check src tests
```

---

## Contributing

1. Fork the repository and create a feature branch.
2. Set up your development environment:

```bash
python3 -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```

Or with **uv**:

```bash
uv venv .venv
source .venv/bin/activate
uv pip install -e ".[dev]"
```

3. Implement your changes, add tests, and run:

```bash
black src tests
ruff check src tests
pytest
```

4. Open a pull request against `main`.

---

## License

This project is licensed under the **GNU General Public License v3.0 or later** (GPL-3.0-or-later).
You are free to use, modify, and redistribute it under the terms of the GPL.
See [LICENSE](./LICENSE) for full details.
