{"id":21346991,"url":"https://github.com/chataize/rabbit-hole","last_synced_at":"2026-05-18T05:10:21.560Z","repository":{"id":262619674,"uuid":"863225317","full_name":"chataize/rabbit-hole","owner":"chataize","description":"C# library for scraping text content from websites.","archived":false,"fork":false,"pushed_at":"2024-11-17T15:48:49.000Z","size":92,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-22T16:29:46.479Z","etag":null,"topics":["automatic","bot","chataize","content","csharp","dll","dotnet","html","html-agility-pack","lib","library","page","scraper","scraping","scraping-websites","site","text","web","webpage","website"],"latest_commit_sha":null,"homepage":"https://www.chataize.com","language":"C#","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/chataize.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-25T23:54:08.000Z","updated_at":"2024-11-17T15:48:12.000Z","dependencies_parsed_at":"2024-11-13T12:32:23.765Z","dependency_job_id":null,"html_url":"https://github.com/chataize/rabbit-hole","commit_stats":null,"previous_names":["chataize/rabbit-hole"],"tags_count":9,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chataize%2Frabbit-hole","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chataize%2Frabbit-hole/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chataize%2Frabbit-hole/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/chataize%2Frabbit-hole/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/chataize","download_url":"https://codeload.github.com/chataize/rabbit-hole/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":243822277,"owners_count":20353499,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automatic","bot","chataize","content","csharp","dll","dotnet","html","html-agility-pack","lib","library","page","scraper","scraping","scraping-websites","site","text","web","webpage","website"],"created_at":"2024-11-22T02:12:17.064Z","updated_at":"2026-05-18T05:10:21.553Z","avatar_url":"https://github.com/chataize.png","language":"C#","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Rabbit Hole\n\nRabbit Hole is a small, deterministic web text scraper for .NET. It discovers links within a root URL and extracts readable text from HTML pages. The output is a Markdown-like string suited for indexing, summarization, or offline processing.\n\n## Use cases\n\n- Build a lightweight search index for a site\n- Feed content into an LLM or summarization pipeline\n- Snapshot documentation pages for offline use\n- Validate a sitemap against actual in-page links\n\n## Features\n\n- Async breadth-first link discovery with de-duplication\n- Scope control to the root URL prefix\n- Skips common non-HTML assets by extension\n- HTML-only parsing based on Content-Type\n- Metadata extraction: title, meta description, meta keywords\n- Markdown-like content output for headings, paragraphs, and lists\n- Inline links and images preserved in the output\n- Cancellation support for long-running crawls\n\n## Requirements\n\n- .NET 10 (net10.0)\n\n## Install\n\n```bash\ndotnet add package ChatAIze.RabbitHole\n```\n\n## Quick start\n\n```csharp\nusing ChatAIze.RabbitHole;\n\nvar scraper = new WebsiteScraper();\n\nawait foreach (var link in scraper.ScrapeLinksAsync(\"https://example.com\", depth: 2))\n{\n    Console.WriteLine(link);\n}\n\nvar page = await scraper.ScrapeContentAsync(\"https://example.com\");\nConsole.WriteLine(page.Title);\nConsole.WriteLine(page.Content);\n```\n\n## Usage patterns\n\n### Crawl links, then fetch content\n\n```csharp\nusing ChatAIze.RabbitHole;\n\nvar scraper = new WebsiteScraper();\n\nawait foreach (var link in scraper.ScrapeLinksAsync(\"https://example.com\", depth: 3))\n{\n    var page = await scraper.ScrapeContentAsync(link);\n    Console.WriteLine($\"{page.Url} -\u003e {page.Title}\");\n}\n```\n\n### Cancel a long crawl\n\n```csharp\nusing ChatAIze.RabbitHole;\n\nvar scraper = new WebsiteScraper();\nusing var cts = new CancellationTokenSource(TimeSpan.FromSeconds(30));\n\nawait foreach (var link in scraper.ScrapeLinksAsync(\"https://example.com\", depth: 3, cts.Token))\n{\n    Console.WriteLine(link);\n}\n```\n\n### Filter links before scraping content\n\n```csharp\nusing ChatAIze.RabbitHole;\n\nvar scraper = new WebsiteScraper();\n\nawait foreach (var link in scraper.ScrapeLinksAsync(\"https://example.com\", depth: 3))\n{\n    if (!link.Contains(\"/docs/\"))\n    {\n        continue;\n    }\n\n    var page = await scraper.ScrapeContentAsync(link);\n    Console.WriteLine(page.Content);\n}\n```\n\n## Link discovery details\n\n- The root URL is always yielded first.\n- The crawl is breadth-first; the root is depth 1.\n- Links discovered on a page are yielded immediately.\n- Pages are only fetched if their depth is strictly less than the `depth` parameter.\n  - Example: `depth: 2` fetches the root page and yields its links, but does not fetch those links.\n  - Example: `depth: 3` fetches the root page and each linked page once, but does not go deeper.\n- URLs are normalized by trimming, lowercasing, and removing query strings and fragments.\n- Only URLs that start with the root URL prefix are considered in-scope.\n- Root-relative links (starting with `/`) are resolved against the root host.\n- Relative links without a leading slash are ignored.\n- The crawler ignores `mailto:`, `tel:`, and anchor-only (`#...`) links.\n- Responses are only parsed when the Content-Type is `text/html`.\n- Non-HTML assets are filtered by extension (see `WebsiteScraper` for the list).\n\n## Content extraction details\n\n- Non-HTML responses return a `PageDetails` instance with null metadata and content.\n- Standard metadata is extracted when available:\n  - `\u003ctitle\u003e`\n  - `\u003cmeta name=\"description\"\u003e`\n  - `\u003cmeta name=\"keywords\"\u003e`\n- Content is selected from `article`, `main`, or `div.content`, falling back to the entire document.\n- Output is a Markdown-like text representation:\n  - Headings `h1`-`h6` map to `#`-style headings\n  - Paragraphs become plain text with inline links and images preserved\n  - Lists become `-` or numbered list items\n- Whitespace is collapsed to keep the output readable.\n\n## Output format\n\nThe output is Markdown-like and optimized for readability, not strict Markdown compliance.\n\n```text\n# Welcome\n\nThis is a [link](https://example.com/about).\n\n- First item\n- Second item\n```\n\n## Error handling and resiliency\n\n- `ScrapeLinksAsync` performs best-effort crawling and skips pages that fail to load or parse.\n- `ScrapeContentAsync` throws `HttpRequestException` for non-success status codes.\n- Cancellation is honored during link crawling and during content fetches.\n\n## Limitations and notes\n\n- No JavaScript rendering; content must be present in the HTML response.\n- No robots.txt handling or rate limiting is built in. Be mindful when crawling.\n- Lowercasing and query/fragment removal may collapse distinct URLs on case-sensitive servers.\n- In-scope checks use a simple string prefix; paths like `/docs` and `/docs-old` are both treated as in-scope.\n- Root-relative URLs are resolved with scheme and host only, which drops non-default ports.\n- Only anchor tags (`\u003ca href=...\u003e`) are used for link discovery.\n\n## API reference\n\n### `WebsiteScraper`\n\n```csharp\npublic async IAsyncEnumerable\u003cstring\u003e ScrapeLinksAsync(\n    string url,\n    int depth = 2,\n    CancellationToken cancellationToken = default)\n\npublic async ValueTask\u003cPageDetails\u003e ScrapeContentAsync(\n    string url,\n    CancellationToken cancellationToken = default)\n```\n\n### `PageDetails`\n\n```csharp\npublic sealed record PageDetails(\n    string Url,\n    string? Title,\n    string? Description,\n    string? Keywords,\n    string? Content);\n```\n\n## Development\n\nBuild the library:\n\n```bash\ndotnet build\n```\n\nRun the preview app:\n\n```bash\ndotnet run --project ChatAIze.RabbitHole.Preview\n```\n\n## Links\n- GitHub: https://github.com/chataize/rabbit-hole\n- Chataize organization: https://github.com/chataize\n- Website: https://www.chataize.com\n\n## License\n\nGPL-3.0-or-later. See `LICENSE.txt`.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchataize%2Frabbit-hole","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchataize%2Frabbit-hole","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchataize%2Frabbit-hole/lists"}