{"id":27296531,"url":"https://github.com/tctien342/simple-doc-crawler","last_synced_at":"2026-03-03T09:32:35.331Z","repository":{"id":285346002,"uuid":"957802852","full_name":"tctien342/simple-doc-crawler","owner":"tctien342","description":"Craw all sub page from given URL to markdown","archived":false,"fork":false,"pushed_at":"2025-03-31T08:45:29.000Z","size":43,"stargazers_count":1,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-01T11:53:03.354Z","etag":null,"topics":["cli","crawler","llm","markdown"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tctien342.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-03-31T06:47:33.000Z","updated_at":"2025-03-31T08:56:04.000Z","dependencies_parsed_at":null,"dependency_job_id":"ed61615e-1a46-4506-b0cd-19d198ff9070","html_url":"https://github.com/tctien342/simple-doc-crawler","commit_stats":null,"previous_names":["tctien342/simple-doc-crawler"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tctien342/simple-doc-crawler","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tctien342%2Fsimple-doc-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tctien342%2Fsimple-doc-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tctien342%2Fsimple-doc-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tctien342%2Fsimple-doc-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tctien342","download_url":"https://codeload.github.com/tctien342/simple-doc-crawler/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tctien342%2Fsimple-doc-crawler/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":30039885,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-03T06:58:30.252Z","status":"ssl_error","status_checked_at":"2026-03-03T06:58:15.329Z","response_time":61,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cli","crawler","llm","markdown"],"created_at":"2025-04-11T23:42:42.297Z","updated_at":"2026-03-03T09:32:35.297Z","avatar_url":"https://github.com/tctien342.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Doc Crawler\n\nA Node.js/TypeScript CLI tool for crawling documentation websites and exporting them to Markdown, built with BunJS.\n\n## Features\n\n- Parallel web crawling with configurable concurrency\n- Domain-specific crawling option\n- Automatic conversion from HTML to Markdown\n- Generates a well-formatted document with table of contents\n- Handles timeouts and crawling limits for stability\n- Built with BunJS for optimal performance\n\n## Installation\n\n### From npm\n```bash\n# Install globally\nnpm install -g @saintno/doc-export\n\n# Or with yarn\nyarn global add @saintno/doc-export\n\n# Or with pnpm\npnpm add -g @saintno/doc-export\n```\n\n### From source\n\n#### Prerequisites\n\n- [Bun](https://bun.sh/) (v1.0.0 or higher)\n\n#### Setup\n\n1. Clone this repository\n2. Install dependencies:\n\n```bash\nbun install\n```\n\n3. Build the project:\n\n```bash\nbun run build\n```\n\n## Usage\n\nBasic usage:\n\n```bash\n# If installed from npm:\ndoc-export --url https://example.com/docs --output ./output\n\n# If running from source:\nbun run start --url https://example.com/docs --output ./output\n```\n\n### Command Line Options\n\n| Option | Alias | Description | Default |\n|--------|-------|-------------|---------|\n| --url | -u | URL to start crawling from (required) | - |\n| --output | -o | Output directory for the Markdown (required) | - |\n| --concurrency | -c | Maximum number of concurrent requests | 5 |\n| --same-domain | -s | Only crawl pages within the same domain | true |\n| --max-urls | | Maximum URLs to crawl per domain | 200 |\n| --request-timeout | | Request timeout in milliseconds | 5000 |\n| --max-runtime | | Maximum crawler run time in milliseconds | 30000 |\n| --allowed-prefixes | | Comma-separated list of URL prefixes to crawl | - |\n| --split-pages | | How to split pages: \"none\", \"subdirectories\", or \"flat\" | none |\n\n## Example\n\n```bash\ndoc-export --url https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide --output ./javascript-guide --concurrency 3 --max-urls 300 --request-timeout 10000 --max-runtime 60000\n```\n\nThis example will:\n- Start crawling from the MDN JavaScript Guide\n- Save files in the ./javascript-guide directory\n- Use 3 concurrent requests\n- Crawl up to 300 URLs (default is 200)\n- Set request timeout to 10 seconds (default is 5 seconds)\n- Run the crawler for a maximum of 60 seconds (default is 30 seconds)\n\n### URL Prefix Filtering Example\n\nTo only crawl URLs with specific prefixes:\n\n```bash\ndoc-export --url https://developer.mozilla.org/en-US/docs/Web/JavaScript --output ./javascript-guide --allowed-prefixes https://developer.mozilla.org/en-US/docs/Web/JavaScript/Guide,https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference\n```\n\nThis example will:\n- Start crawling from the MDN JavaScript documentation\n- Only process URLs that start with the specified prefixes (Guide and Reference sections)\n- Ignore other URLs even if they are within the same domain\n\n## How It Works\n\n1. **Crawling Phase**: The tool starts from the provided URL and crawls all linked pages (respecting domain restrictions and URL prefix filters if specified)\n2. **Processing Phase**: Each HTML page is converted to Markdown using Turndown\n3. **Aggregation Phase**: All Markdown content is combined into a single document with a table of contents\n\n### Filtering Options\n\nThe crawler supports two types of URL filtering:\n\n1. **Domain Filtering** (`--same-domain`): When enabled, only URLs from the same domain as the starting URL will be crawled.\n2. **Prefix Filtering** (`--allowed-prefixes`): When specified, only URLs that start with one of the provided prefixes will be crawled. This is useful for limiting the crawl to specific sections of a website.\n\nThese filters can be combined to precisely target the content you want to extract.\n\n## Implementation Details\n\n- Uses bloom filters for efficient link deduplication\n- Implements connection reuse with undici fetch\n- Handles memory management for large documents\n- Processes pages in parallel for maximum efficiency\n- Implements timeouts and limits to prevent crawling issues\n\n## PDF Support\n\nThe current implementation outputs Markdown (.md) files. To convert to PDF, you can use a third-party tool such as:\n\n- [Pandoc](https://pandoc.org/): `pandoc -f markdown -t pdf -o output.pdf document.md`\n- [mdpdf](https://github.com/BlueHatbRit/mdpdf): `mdpdf document.md`\n- Or any online Markdown to PDF converter\n\n## License\n\nMIT","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftctien342%2Fsimple-doc-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftctien342%2Fsimple-doc-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftctien342%2Fsimple-doc-crawler/lists"}