https://github.com/soenneker/soenneker.playwrights.crawler
A configurable Playwright crawler with rich stealth and control options.
browser chrome chromium crawl crawler csharp dotnet playwright playwrightcrawler playwrights scrape scraper stealth util
- Host: GitHub
- URL: https://github.com/soenneker/soenneker.playwrights.crawler
- Owner: soenneker
- License: mit
- Created: 2026-03-27T13:05:11.000Z (3 months ago)
- Default Branch: main
- Last Pushed: 2026-06-09T23:38:05.000Z (14 days ago)
- Last Synced: 2026-06-10T00:19:17.890Z (14 days ago)
- Topics: browser, chrome, chromium, crawl, crawler, csharp, dotnet, playwright, playwrightcrawler, playwrights, scrape, scraper, stealth, util
- Language: C#
- Homepage: https://soenneker.com
- Size: 239 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Contributing: .github/CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Code of conduct: .github/CODE_OF_CONDUCT.md
- Security: .github/SECURITY.md
README
# Soenneker.Playwrights.Crawler
A configurable Playwright crawler for mirroring sites to disk with support for:
- HTML-only or full resource capture
- crawl limits by depth, page count, duration, and storage
- same-host restrictions with optional cross-origin asset capture
- DOM attribute resource discovery for lazy widgets and deferred assets
- throttling, retries, slow mode, and cooldown behavior
- optional stealth launch/context settings
## Related Repos
You might also be interested in:
- [soenneker.playwrights.installation](https://github.com/soenneker/soenneker.playwrights.installation) for ensuring Playwright browsers are installed before runtime.
- [soenneker.playwrights.extensions.stealth](https://github.com/soenneker/soenneker.playwrights.extensions.stealth) for stealth-oriented Chromium launch and browser-context extensions.
## Installation
```bash
dotnet add package Soenneker.Playwrights.Crawler
```
## Register With DI
```csharp
using Microsoft.Extensions.DependencyInjection;
using Soenneker.Playwrights.Crawler.Registrars;
var services = new ServiceCollection();
services.AddLogging();
services.AddPlaywrightCrawlerAsSingleton();
```
Use `AddPlaywrightCrawlerAsScoped()` if you prefer a scoped lifetime.
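A minimal sketch of the scoped variant, assuming `IPlaywrightCrawler` is the service type registered by the registrar (it is the type resolved in Basic Usage). Scoped services are normally resolved from a scope rather than from the root provider, so the example creates one explicitly:

```csharp
using Microsoft.Extensions.DependencyInjection;
using Soenneker.Playwrights.Crawler.Abstract;
using Soenneker.Playwrights.Crawler.Registrars;

var services = new ServiceCollection();
services.AddLogging();
services.AddPlaywrightCrawlerAsScoped();

await using ServiceProvider provider = services.BuildServiceProvider();

// Resolve the crawler inside a scope; it is disposed together with the scope.
await using AsyncServiceScope scope = provider.CreateAsyncScope();

// Assumption: IPlaywrightCrawler is the registered service type.
IPlaywrightCrawler crawler = scope.ServiceProvider.GetRequiredService<IPlaywrightCrawler>();
```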
## Basic Usage
```csharp
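using Microsoft.Extensions.DependencyInjection;
using Soenneker.Playwrights.Crawler.Abstract;
using Soenneker.Playwrights.Crawler.Dtos;
using Soenneker.Playwrights.Crawler.Enums;

// serviceProvider is the IServiceProvider built from the registration above.
IPlaywrightCrawler crawler = serviceProvider.GetRequiredService<IPlaywrightCrawler>();

// Mirror the site two link levels deep, saving pages and their resources.
PlaywrightCrawlResult result = await crawler.Crawl(new PlaywrightCrawlOptions
{
    Url = "https://example.com",
    SaveDirectory = @"C:\temp\example",
    Mode = PlaywrightCrawlMode.Full,
    MaxDepth = 2,
    ClearSaveDirectory = true,
    SameHostOnly = true
});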
```
## Advanced Example
```csharp
using System;
using Microsoft.Playwright; // WaitUntilState
using Soenneker.Playwrights.Crawler.Abstract;
using Soenneker.Playwrights.Crawler.Dtos;
using Soenneker.Playwrights.Crawler.Enums;
using Soenneker.Playwrights.Extensions.Stealth.Options;

PlaywrightCrawlResult result = await crawler.Crawl(new PlaywrightCrawlOptions
{
    // Target and crawl limits
    Url = "https://example.com",
    SaveDirectory = @"C:\temp\example",
    Mode = PlaywrightCrawlMode.Full,
    MaxDepth = 2,
    MaxPages = 50,
    MaxStorageBytes = 250_000_000,
    MaxDuration = TimeSpan.FromMinutes(10),
    SameHostOnly = true,
    IgnoreQueryStringsInDuplicateDetection = true,

    // Output handling
    FormatHtml = true,
    IncludeCrossOriginAssets = true,
    RewriteCrossOriginAssetUrls = true,
    ClearSaveDirectory = true,
    OverwriteExistingFiles = true,

    // Browser, navigation, and pacing
    Headless = true,
    UseStealth = true,
    ThrottleMode = PlaywrightCrawlThrottleMode.Automatic,
    NavigationTimeoutMs = 45_000,
    WaitUntil = WaitUntilState.NetworkIdle,
    PostNavigationDelayMs = 0,
    ContinueOnPageError = true,

    StealthLaunchOptions = new StealthLaunchOptions
    {
        IgnoreDetectableDefaultArguments = true
    },
    StealthContextOptions = new StealthContextOptions
    {
        NormalizeDocumentHeaders = true,
        EnableCdpDomainHardening = false
    },

    // Concurrency, delays, and retries
    Policy = new PlaywrightCrawlPolicy
    {
        GlobalMaxConcurrency = 20,
        PerDomainMaxConcurrency = 2,
        PerIpMaxConcurrency = 2,
        MinimumDelayBetweenRequestsMs = 750,
        DelayJitterMaxMs = 500,
        RequestTimeoutMs = 30_000,
        MaxRetries = 4
    }
});
```
## Modes
### `HtmlOnly`
Saves only rendered HTML documents discovered during the crawl.
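As a rough illustration, an HTML-only crawl might be configured as below. This is a sketch that assumes the enum member is named `PlaywrightCrawlMode.HtmlOnly` (matching the heading above) and that `crawler` was resolved as in Basic Usage; the save directory is a placeholder.

```csharp
using Soenneker.Playwrights.Crawler.Dtos;
using Soenneker.Playwrights.Crawler.Enums;

// Assumption: `crawler` is an IPlaywrightCrawler resolved from DI.
PlaywrightCrawlResult result = await crawler.Crawl(new PlaywrightCrawlOptions
{
    Url = "https://example.com",
    SaveDirectory = @"C:\temp\example-html", // placeholder path
    Mode = PlaywrightCrawlMode.HtmlOnly,     // rendered HTML documents only
    MaxDepth = 1,
    SameHostOnly = true
});
```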
### `Full`
Saves the following, with optional post-processing steps:
- rendered HTML documents
- same-origin network resources observed while pages load
- resource URLs discovered in DOM attributes such as `data-src` and `data-css-url`
- optional cross-origin assets under `_external` when `IncludeCrossOriginAssets = true`
- optional rewriting of cross-origin asset URLs in saved HTML when `RewriteCrossOriginAssetUrls = true`
- optional lazy-load scrolling to capture below-the-fold media
- optional rewriting of same-origin absolute URLs in saved HTML and CSS to root-relative paths when `RewriteSameOriginAbsoluteUrls = true`
## Key Options
| Option | Description |
| --- | --- |
| `Url` | Required absolute `http` or `https` root URL. |
| `SaveDirectory` | Required output directory for mirrored content. |
| `MaxDepth` | Link depth to follow from the root page. `0` crawls only the starting page. |
| `MaxPages` | Optional hard cap on visited pages. |
| `MaxStorageBytes` | Optional hard cap on bytes written to disk. |
| `MaxDuration` | Optional maximum crawl duration. |
| `SameHostOnly` | Restricts queued pages to the same host as the root URL. |
| `IgnoreQueryStringsInDuplicateDetection` | Treats query-string variants as the same page when detecting duplicates. |
| `FormatHtml` | Formats saved HTML documents with `Soenneker.Html.Formatter` when `true`. Defaults to `false`. |
| `IncludeCrossOriginAssets` | In `Full` mode, saves cross-origin resources under `_external`. |
| `RewriteCrossOriginAssetUrls` | Rewrites saved HTML so captured cross-origin asset URLs point at the local `_external` copy. Requires `IncludeCrossOriginAssets`. |
| `RewriteSameOriginAbsoluteUrls` | Rewrites same-origin absolute URLs in saved HTML and CSS to root-relative paths, such as `https://example.com/script.js` to `/script.js`. |
| `TriggerLazyLoading` | In `Full` mode, scrolls pages after navigation to trigger lazy-loaded media before resources are saved. Defaults to `true`. |
| `LazyLoadScrollStepPx` | Pixel distance for each lazy-load scroll step. |
| `LazyLoadScrollDelayMs` | Delay after each lazy-load scroll step. |
| `LazyLoadMaxScrolls` | Maximum number of lazy-load scroll steps per page. |
| `ClearSaveDirectory` | Deletes the output directory before crawling. |
| `OverwriteExistingFiles` | Controls whether existing files can be replaced. |
| `Headless` | Runs Chromium headlessly when `true`. |
| `UseStealth` | Enables the Soenneker stealth Playwright extensions. |
| `ThrottleMode` | Controls automatic pacing and adaptive throttling. Defaults to `Automatic`; use `Disabled` to bypass automatic pacing, slow mode, cooldown waiting, and implicit post-navigation jitter. |
| `NavigationTimeoutMs` | Navigation timeout per page. |
| `WaitUntil` | Playwright load state awaited during navigation. Defaults to `NetworkIdle`. |
| `PostNavigationDelayMs` | Extra delay after navigation to allow late assets to settle. |
| `ContinueOnPageError` | Continues crawling after an individual page fails. |
| `Policy` | Crawl throttling, retries, concurrency, slow mode, and cooldown configuration. |
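The lazy-loading and same-origin rewriting options do not appear in the examples above, so here is a hedged sketch of how they might be set. The property names come from the table; the integer types and the specific values are assumptions chosen for illustration, not documented defaults.

```csharp
using Soenneker.Playwrights.Crawler.Dtos;
using Soenneker.Playwrights.Crawler.Enums;

var options = new PlaywrightCrawlOptions
{
    Url = "https://example.com",
    SaveDirectory = @"C:\temp\example",
    Mode = PlaywrightCrawlMode.Full,

    // Scroll each page in steps so below-the-fold media is requested before saving.
    TriggerLazyLoading = true,
    LazyLoadScrollStepPx = 800,   // illustrative value
    LazyLoadScrollDelayMs = 250,  // illustrative value
    LazyLoadMaxScrolls = 20,      // illustrative value

    // Turn https://example.com/script.js into /script.js in saved HTML and CSS.
    RewriteSameOriginAbsoluteUrls = true
};
```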
## Result
`Crawl()` returns `PlaywrightCrawlResult`, which includes:
- crawl timing (`StartedAtUtc`, `CompletedAtUtc`, `Duration`)
- page counts (`PagesDiscovered`, `PagesVisited`)
- file counts (`HtmlFilesSaved`, `AssetFilesSaved`)
- total bytes written (`BytesWritten`)
- stop reasons (`StorageLimitReached`, `DurationLimitReached`, `PageLimitReached`)
- per-file details in `Files`
- page-level failures in `Errors`
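A small sketch of how a caller might summarize the result. The property names are taken from the list above; their exact types are assumptions (numeric counts, `bool` limit flags, and an enumerable, non-null `Errors`), and `CrawlReport` is a hypothetical helper rather than part of the library.

```csharp
using System;
using Soenneker.Playwrights.Crawler.Dtos;

public static class CrawlReport
{
    // Hypothetical helper: prints a short summary of a finished crawl.
    public static void Print(PlaywrightCrawlResult result)
    {
        Console.WriteLine($"Visited {result.PagesVisited} of {result.PagesDiscovered} discovered pages in {result.Duration}.");
        Console.WriteLine($"Saved {result.HtmlFilesSaved} HTML files and {result.AssetFilesSaved} asset files ({result.BytesWritten} bytes).");

        // Assumption: the stop-reason members are boolean flags.
        if (result.StorageLimitReached || result.DurationLimitReached || result.PageLimitReached)
            Console.WriteLine("The crawl stopped early because a configured limit was reached.");

        // Assumption: Errors is a non-null enumerable; the element type is not relied on here.
        foreach (var error in result.Errors)
            Console.WriteLine($"Page error: {error}");
    }
}
```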
## Output Layout
Saved files preserve URL structure so the output can be served by a simple static web server.
Examples:
- `https://example.com/` -> `index.html`
- `https://example.com/docs/getting-started` -> `docs/getting-started/index.html`
- `https://example.com/script.js` -> `/script.js` inside saved HTML when same-origin URL rewriting is enabled
- `https://cdn.example.com/app.css` -> `_external/cdn.example.com/app.css` when cross-origin asset capture is enabled
- a saved page two directories deep, such as `docs/getting-started/index.html`, can reference that asset as `../../_external/cdn.example.com/app.css` when cross-origin URL rewriting is enabled
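The mapping in these examples can be approximated with a few lines of code. The sketch below is not the library's implementation; `OutputLayoutSketch.ToRelativePath` is a hypothetical helper that only reproduces the examples listed above, treating extensionless paths as pages and placing other hosts under `_external`.

```csharp
using System;
using System.IO;

public static class OutputLayoutSketch
{
    // Hypothetical helper: maps a URL to a relative output path, following the examples above.
    public static string ToRelativePath(Uri url, string rootHost)
    {
        // AbsolutePath stays percent-encoded and excludes the query string.
        string path = url.AbsolutePath.Trim('/');

        if (path.Length == 0)
            path = "index.html";             // site root
        else if (!Path.HasExtension(path))
            path += "/index.html";           // extensionless page becomes a folder

        // Resources from other hosts are grouped under _external/<host>/.
        bool sameHost = string.Equals(url.Host, rootHost, StringComparison.OrdinalIgnoreCase);
        return sameHost ? path : $"_external/{url.Host}/{path}";
    }
}
```

For instance, `ToRelativePath(new Uri("https://example.com/docs/getting-started"), "example.com")` yields `docs/getting-started/index.html`. Real crawlers also have to deal with query strings, unsafe file name characters, and content types, which this sketch ignores.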
## Behavior Notes
- Playwright browser installation is ensured automatically before the crawl starts.
- Duplicate detection ignores query strings by default.
- HTML formatting is opt-in and uses `Soenneker.Html.Formatter` when `FormatHtml = true`.
- Challenge and captcha-like pages contribute to the crawler's blocking and slow-mode signals.
- Setting `ThrottleMode = PlaywrightCrawlThrottleMode.Disabled` keeps configured concurrency limits and retries, but skips the crawler's automatic pacing and adaptive slowdown behavior.
- Cross-origin URL rewriting only applies to captured cross-origin assets that are actually available on disk.
- `Full` mode captures resources observed during page loads, but URL rewriting is limited to captured cross-origin asset URLs and, when `RewriteSameOriginAbsoluteUrls = true`, same-origin absolute URLs. It is not a full offline-mirroring transform.
- Some response types are intentionally skipped, such as empty bodies and certain framework/internal fetch endpoints.
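To illustrate the `ThrottleMode` note above, the following sketch disables automatic pacing while still supplying concurrency and retry limits through `Policy`. All names appear elsewhere in this README; the values are illustrative.

```csharp
using Soenneker.Playwrights.Crawler.Dtos;
using Soenneker.Playwrights.Crawler.Enums;

var options = new PlaywrightCrawlOptions
{
    Url = "https://example.com",
    SaveDirectory = @"C:\temp\example",
    Mode = PlaywrightCrawlMode.Full,

    // Skips automatic pacing, slow mode, cooldown waiting, and implicit post-navigation jitter.
    ThrottleMode = PlaywrightCrawlThrottleMode.Disabled,

    // Per the notes above, configured concurrency limits and retries still apply.
    Policy = new PlaywrightCrawlPolicy
    {
        PerDomainMaxConcurrency = 2,
        MaxRetries = 4
    }
};
```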