{"id":16278595,"url":"https://github.com/sigoden/rag-crawler","last_synced_at":"2025-07-24T07:38:34.802Z","repository":{"id":246589553,"uuid":"821576911","full_name":"sigoden/rag-crawler","owner":"sigoden","description":"Crawl a website to generate knowledge file for RAG","archived":false,"fork":false,"pushed_at":"2024-08-13T01:58:11.000Z","size":129,"stargazers_count":19,"open_issues_count":0,"forks_count":4,"subscribers_count":2,"default_branch":"main","last_synced_at":"2024-12-02T09:25:02.748Z","etag":null,"topics":["crawler","knowledge","llm","rag"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sigoden.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-28T21:39:18.000Z","updated_at":"2024-11-17T21:02:40.000Z","dependencies_parsed_at":"2024-06-28T22:03:06.505Z","dependency_job_id":"4935ff00-dbc6-440b-8f7a-10a9306db14d","html_url":"https://github.com/sigoden/rag-crawler","commit_stats":null,"previous_names":["sigoden/rag-crawler"],"tags_count":7,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sigoden%2Frag-crawler","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sigoden%2Frag-crawler/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sigoden%2Frag-crawler/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sigoden%2Frag-crawler/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/s
igoden","download_url":"https://codeload.github.com/sigoden/rag-crawler/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228520895,"owners_count":17932652,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["crawler","knowledge","llm","rag"],"created_at":"2024-10-10T18:59:06.545Z","updated_at":"2024-12-06T20:16:25.718Z","avatar_url":"https://github.com/sigoden.png","language":"TypeScript","funding_links":[],"categories":["TypeScript"],"sub_categories":[],"readme":"# rag-crawler\n\n[![CI](https://github.com/sigoden/rag-crawler/actions/workflows/ci.yaml/badge.svg)](https://github.com/sigoden/rag-crawler/actions/workflows/ci.yaml)\n[![NPM Version](https://img.shields.io/npm/v/rag-crawler)](https://www.npmjs.com/package/rag-crawler)\n\nCrawl a website to generate knowledge file for RAG.\n\n## Installation\n\n```bash\nnpm i -g rag-crawler\nyarn add --global rag-crawler\n```\n\n## Usage\n\n```\nUsage: rag-crawler [options] \u003cstartUrl\u003e [outPath]\n\nCrawl a website to generate knowledge file for RAG\n    \nExamples:\n   rag-crawler https://sigoden.github.io/mynotes/languages/\n   rag-crawler https://sigoden.github.io/mynotes/languages/ data.json\n   rag-crawler https://sigoden.github.io/mynotes/languages/ pages/\n   rag-crawler https://github.com/sigoden/mynotes/tree/main/src/languages/\n\nArguments:\n  startUrl                     The URL to start crawling from. Don't forget trailing slash. [required]\n  outPath                      The output path. 
If omitted, output to stdout\n\nOptions:\n  --preset \u003cvalue\u003e             Use predefined crawl options (default: \"auto\")\n  -c, --max-connections \u003cint\u003e  Maximum concurrent connections when crawling the pages\n  -e, --exclude \u003cvalues\u003e       Comma-separated list of path names to exclude from crawling\n  --extract \u003ccss-selector\u003e     Extract specific content using a CSS selector, If omitted, extract all content\n  --no-log                     Disable logging\n  -V, --version                output the version number\n  -h, --help                   display help for command\n```\n\n**Output to stdout**\n```\n$ rag-crawler https://sigoden.github.io/mynotes/languages/ \n[\n  {\n    \"path\": \"https://sigoden.github.io/mynotes/languages/\",\n    \"text\": \"# Languages ...\"\n  },\n  {\n    \"path\": \"https://sigoden.github.io/mynotes/languages/shell.html\",\n    \"text\": \"# Shell ...\"\n  }\n  ...\n]\n```\n\n**Output to JSON file**\n```\n$ rag-crawler https://sigoden.github.io/mynotes/languages/ knowledge.json\n```\n\n**Output to separate files**\n\n```\n$ rag-crawler https://sigoden.github.io/mynotes/languages/ pages/\n...\n$ tree pages\npages\n└── mynotes\n    ├── languages\n    │   ├── markdown.md\n    │   ├── nodejs.md\n    │   ├── rust.md\n    │   └── shell.md\n    └── languages.md\n```\n\n**Crawl Markdown files in GitHub Tree**\n\n```\n$ rag-crawler https://github.com/sigoden/mynotes/tree/main/src/languages/ knowledge.json\n```\n\n\u003e Many documentation sites host their source Markdown files on GitHub. The crawler has been optimized to crawl these files directly from GitHub.\n\n## Preset\n\nA preset consists of predefined crawl options. You can review the predefined presets at [./src/preset.ts](./src/preset.ts).\n\n### Why Use Preset?\n\nLet's use GitHub Wiki as an example. 
To enhance scraping quality, we need to configure both `--exclude` and `--extract`.\n\n```\n$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json --exclude _history --extract '#wiki-body'\n```\n\nSince all GitHub Wiki websites share these crawl options, we can define a preset for reusability.\n\n```js\n{\n  name: \"github-wiki\",\n  test: \"github.com/([^/]+)/([^/]+)/wiki\",\n  options: {\n    exclude: [\"_history\"],\n    extract: \"#wiki-body\",\n  },\n}\n```\n\nThis allows for a simplified command:\n\n```\n$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json --preset github-wiki\n// or\n$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json --preset auto\n// or\n$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json # `--preset` defaults to 'auto'\n```\n\n\u003e When the preset is set to `auto`, rag-crawler will automatically determine the appropriate preset. It does this by checking if the `startUrl` matches the `test` regex.\n\n### Custom Presets\n\nYou can add custom presets by editing the `~/.rag-crawler.json` file:\n\n```json\n[\n  {\n    \"name\": \"github-wiki\",\n    \"test\": \"github.com/([^/]+)/([^/]+)/wiki\",\n    \"options\": {\n      \"exclude\": [\"_history\"],\n      \"extract\": \"#wiki-body\"\n    }\n  },\n  ...\n]\n```\n\n## License\n\nThe project is under the MIT License. Refer to the [LICENSE](https://github.com/sigoden/rag-crawler/blob/main/LICENSE) file for detailed information.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsigoden%2Frag-crawler","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsigoden%2Frag-crawler","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsigoden%2Frag-crawler/lists"}