{"id":47764966,"url":"https://github.com/soenneker/soenneker.playwrights.crawler","last_synced_at":"2026-06-14T00:01:33.656Z","repository":{"id":347390257,"uuid":"1193658774","full_name":"soenneker/soenneker.playwrights.crawler","owner":"soenneker","description":"A configurable Playwright crawler with rich stealth and control options.","archived":false,"fork":false,"pushed_at":"2026-06-09T23:38:05.000Z","size":245,"stargazers_count":0,"open_issues_count":2,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-06-10T00:19:17.890Z","etag":null,"topics":["browser","chrome","chromium","crawl","crawler","csharp","dotnet","playwright","playwrightcrawler","playwrights","scrape","scraper","stealth","util"],"latest_commit_sha":null,"homepage":"https://soenneker.com","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/soenneker.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":".github/CONTRIBUTING.md","funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":".github/CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":".github/SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"soenneker","thanks_dev":"soenneker"}},"created_at":"2026-03-27T13:05:11.000Z","updated_at":"2026-06-09T22:50:51.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/soenneker/soenneker.playwrights.crawler","commit_stats":null,"previous_names":["soenneker/soenneker.playwrights.crawler"],"tags_count":74,"template":false,"template_full_name":null,"purl":"pkg:github/soenneker/soenneker.playwrights.crawler","repos
itory_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soenneker%2Fsoenneker.playwrights.crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soenneker%2Fsoenneker.playwrights.crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soenneker%2Fsoenneker.playwrights.crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soenneker%2Fsoenneker.playwrights.crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/soenneker","download_url":"https://codeload.github.com/soenneker/soenneker.playwrights.crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/soenneker%2Fsoenneker.playwrights.crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34304629,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-13T02:00:06.617Z","response_time":62,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["browser","chrome","chromium","crawl","crawler","csharp","dotnet","playwright","playwrightcrawler","playwrights","scrape","scraper","stealth","util"],"created_at":"2026-04-03T06:04:25.656Z","updated_at":"2026-06-14T00:01:33.580Z","avatar_url":"https://github.com/soenneker.png","language":"C#","funding_links":["https://github.com/sponsors/soenneker","ht
tps://thanks.dev/soenneker"],"categories":[],"sub_categories":[],"readme":"[![](https://img.shields.io/nuget/v/soenneker.playwrights.crawler.svg?style=for-the-badge)](https://www.nuget.org/packages/soenneker.playwrights.crawler/)\n[![](https://img.shields.io/github/actions/workflow/status/soenneker/soenneker.playwrights.crawler/publish-package.yml?style=for-the-badge)](https://github.com/soenneker/soenneker.playwrights.crawler/actions/workflows/publish-package.yml)\n[![](https://img.shields.io/nuget/dt/soenneker.playwrights.crawler.svg?style=for-the-badge)](https://www.nuget.org/packages/soenneker.playwrights.crawler/)\n\n# ![](https://user-images.githubusercontent.com/4441470/224455560-91ed3ee7-f510-4041-a8d2-3fc093025112.png) Soenneker.Playwrights.Crawler\n\nA configurable Playwright crawler for mirroring sites to disk with support for:\n\n- HTML-only or full resource capture\n- crawl limits by depth, page count, duration, and storage\n- same-host restrictions with optional cross-origin asset capture\n- DOM attribute resource discovery for lazy widgets and deferred assets\n- throttling, retries, slow mode, and cooldown behavior\n- optional stealth launch/context settings\n\n## Related Repos\n\nYou might also be interested in:\n\n- [soenneker.playwrights.installation](https://github.com/soenneker/soenneker.playwrights.installation) for ensuring Playwright browsers are installed before runtime.\n- [soenneker.playwrights.extensions.stealth](https://github.com/soenneker/soenneker.playwrights.extensions.stealth) for stealth-oriented Chromium launch and browser-context extensions.\n\n## Installation\n\n```bash\ndotnet add package Soenneker.Playwrights.Crawler\n```\n\n## Register With DI\n\n```csharp\nusing Microsoft.Extensions.DependencyInjection;\nusing Soenneker.Playwrights.Crawler.Registrars;\n\nvar services = new ServiceCollection();\n\nservices.AddLogging();\nservices.AddPlaywrightCrawlerAsSingleton();\n```\n\nUse `AddPlaywrightCrawlerAsScoped()` if you prefer a 
scoped lifetime.\n\n## Basic Usage\n\n```csharp\nusing Microsoft.Extensions.DependencyInjection;\nusing Soenneker.Playwrights.Crawler.Abstract;\nusing Soenneker.Playwrights.Crawler.Dtos;\nusing Soenneker.Playwrights.Crawler.Enums;\n\nIPlaywrightCrawler crawler = serviceProvider.GetRequiredService\u003cIPlaywrightCrawler\u003e();\n\nPlaywrightCrawlResult result = await crawler.Crawl(new PlaywrightCrawlOptions\n{\n    Url = \"https://example.com\",\n    SaveDirectory = @\"C:\\temp\\example\",\n    Mode = PlaywrightCrawlMode.Full,\n    MaxDepth = 2,\n    ClearSaveDirectory = true,\n    SameHostOnly = true\n});\n```\n\n## Advanced Example\n\n```csharp\nusing Microsoft.Playwright;\nusing Soenneker.Playwrights.Crawler.Abstract;\nusing Soenneker.Playwrights.Crawler.Dtos;\nusing Soenneker.Playwrights.Crawler.Enums;\nusing Soenneker.Playwrights.Extensions.Stealth.Options;\n\nPlaywrightCrawlResult result = await crawler.Crawl(new PlaywrightCrawlOptions\n{\n    Url = \"https://example.com\",\n    SaveDirectory = @\"C:\\temp\\example\",\n    Mode = PlaywrightCrawlMode.Full,\n    MaxDepth = 2,\n    MaxPages = 50,\n    MaxStorageBytes = 250_000_000,\n    MaxDuration = TimeSpan.FromMinutes(10),\n    SameHostOnly = true,\n    IgnoreQueryStringsInDuplicateDetection = true,\n    FormatHtml = true,\n    IncludeCrossOriginAssets = true,\n    RewriteCrossOriginAssetUrls = true,\n    ClearSaveDirectory = true,\n    OverwriteExistingFiles = true,\n    Headless = true,\n    UseStealth = true,\n    ThrottleMode = PlaywrightCrawlThrottleMode.Automatic,\n    NavigationTimeoutMs = 45_000,\n    WaitUntil = WaitUntilState.NetworkIdle,\n    PostNavigationDelayMs = 0,\n    ContinueOnPageError = true,\n    StealthLaunchOptions = new StealthLaunchOptions\n    {\n        IgnoreDetectableDefaultArguments = true\n    },\n    StealthContextOptions = new StealthContextOptions\n    {\n        NormalizeDocumentHeaders = true,\n        EnableCdpDomainHardening = false\n    },\n    Policy = new PlaywrightCrawlPolicy\n    {\n        GlobalMaxConcurrency = 20,\n        
PerDomainMaxConcurrency = 2,\n        PerIpMaxConcurrency = 2,\n        MinimumDelayBetweenRequestsMs = 750,\n        DelayJitterMaxMs = 500,\n        RequestTimeoutMs = 30_000,\n        MaxRetries = 4\n    }\n});\n```\n\n## Modes\n\n### `HtmlOnly`\n\nSaves only rendered HTML documents discovered during the crawl.\n\n### `Full`\n\nSaves:\n\n- rendered HTML documents\n- same-origin network resources observed while pages load\n- resource URLs discovered in DOM attributes such as `data-src` and `data-css-url`\n- optional cross-origin assets under `_external` when `IncludeCrossOriginAssets = true`\n- optional rewriting of cross-origin asset URLs in saved HTML when `RewriteCrossOriginAssetUrls = true`\n- optional lazy-load scrolling to capture below-the-fold media\n- optional rewriting of same-origin absolute URLs in saved HTML and CSS to root-relative paths when `RewriteSameOriginAbsoluteUrls = true`\n\n## Key Options\n\n| Option | Description |\n| --- | --- |\n| `Url` | Required absolute `http` or `https` root URL. |\n| `SaveDirectory` | Required output directory for mirrored content. |\n| `MaxDepth` | Link depth to follow from the root page. `0` crawls only the starting page. |\n| `MaxPages` | Optional hard cap on visited pages. |\n| `MaxStorageBytes` | Optional hard cap on bytes written to disk. |\n| `MaxDuration` | Optional maximum crawl duration. |\n| `SameHostOnly` | Restricts queued pages to the same host as the root URL. |\n| `IgnoreQueryStringsInDuplicateDetection` | Treats query-string variants as the same page when detecting duplicates. |\n| `FormatHtml` | Formats saved HTML documents with `Soenneker.Html.Formatter` when `true`. Defaults to `false`. |\n| `IncludeCrossOriginAssets` | In `Full` mode, saves cross-origin resources under `_external`. |\n| `RewriteCrossOriginAssetUrls` | Rewrites saved HTML so captured cross-origin asset URLs point at the local `_external` copy. Requires `IncludeCrossOriginAssets`. 
|\n| `RewriteSameOriginAbsoluteUrls` | Rewrites same-origin absolute URLs in saved HTML and CSS to root-relative paths, such as `https://example.com/script.js` to `/script.js`. |\n| `TriggerLazyLoading` | In `Full` mode, scrolls pages after navigation to trigger lazy-loaded media before resources are saved. Defaults to `true`. |\n| `LazyLoadScrollStepPx` | Pixel distance for each lazy-load scroll step. |\n| `LazyLoadScrollDelayMs` | Delay after each lazy-load scroll step. |\n| `LazyLoadMaxScrolls` | Maximum number of lazy-load scroll steps per page. |\n| `ClearSaveDirectory` | Deletes the output directory before crawling. |\n| `OverwriteExistingFiles` | Controls whether existing files can be replaced. |\n| `Headless` | Runs Chromium headlessly when `true`. |\n| `UseStealth` | Enables the Soenneker stealth Playwright extensions. |\n| `ThrottleMode` | Controls automatic pacing and adaptive throttling. Defaults to `Automatic`; use `Disabled` to bypass automatic pacing, slow mode, cooldown waiting, and implicit post-navigation jitter. |\n| `NavigationTimeoutMs` | Navigation timeout per page. |\n| `WaitUntil` | Playwright load state awaited during navigation. Defaults to `NetworkIdle`. |\n| `PostNavigationDelayMs` | Extra delay after navigation to allow late assets to settle. |\n| `ContinueOnPageError` | Continues crawling after an individual page fails. |\n| `Policy` | Crawl throttling, retries, concurrency, slow mode, and cooldown configuration. 
|\n\n## Result\n\n`Crawl()` returns `PlaywrightCrawlResult`, which includes:\n\n- crawl timing (`StartedAtUtc`, `CompletedAtUtc`, `Duration`)\n- page counts (`PagesDiscovered`, `PagesVisited`)\n- file counts (`HtmlFilesSaved`, `AssetFilesSaved`)\n- total bytes written (`BytesWritten`)\n- stop reasons (`StorageLimitReached`, `DurationLimitReached`, `PageLimitReached`)\n- per-file details in `Files`\n- page-level failures in `Errors`\n\n## Output Layout\n\nSaved files preserve URL structure so the output can be served by a simple static web server.\n\nExamples:\n\n- `https://example.com/` -\u003e `index.html`\n- `https://example.com/docs/getting-started` -\u003e `docs/getting-started/index.html`\n- `https://example.com/script.js` -\u003e `/script.js` inside saved HTML when same-origin URL rewriting is enabled\n- `https://cdn.example.com/app.css` -\u003e `_external/cdn.example.com/app.css` when cross-origin asset capture is enabled\n- a saved page can reference that asset as `../../_external/cdn.example.com/app.css` when URL rewriting is enabled\n\n## Behavior Notes\n\n- Playwright browser installation is ensured automatically before the crawl starts.\n- Duplicate detection ignores query strings by default.\n- HTML formatting is opt-in and uses `Soenneker.Html.Formatter` when `FormatHtml = true`.\n- Challenge and captcha-like pages contribute to the crawler's blocking and slow-mode signals.\n- Setting `ThrottleMode = PlaywrightCrawlThrottleMode.Disabled` keeps configured concurrency limits and retries, but skips the crawler's automatic pacing and adaptive slowdown behavior.\n- Cross-origin URL rewriting only applies to captured cross-origin assets that are actually available on disk.\n- `Full` mode captures resources observed during page loads, but the rewrite pass is limited to captured cross-origin asset URLs rather than a full offline-mirroring transform.\n- Some response types are intentionally skipped, such as empty bodies and certain framework/internal fetch 
endpoints.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoenneker%2Fsoenneker.playwrights.crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsoenneker%2Fsoenneker.playwrights.crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsoenneker%2Fsoenneker.playwrights.crawler/lists"}