Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/sigoden/rag-crawler
Crawl a website to generate a knowledge file for RAG
- Host: GitHub
- URL: https://github.com/sigoden/rag-crawler
- Owner: sigoden
- License: mit
- Created: 2024-06-28T21:39:18.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2024-08-13T01:58:11.000Z (4 months ago)
- Last Synced: 2024-12-02T09:25:02.748Z (10 days ago)
- Topics: crawler, knowledge, llm, rag
- Language: TypeScript
- Size: 126 KB
- Stars: 19
- Watchers: 2
- Forks: 4
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
- my-awesome-list - rag-crawler
- jimsghstars - sigoden/rag-crawler - Crawl a website to generate knowledge file for RAG (TypeScript)
README
# rag-crawler
[![CI](https://github.com/sigoden/rag-crawler/actions/workflows/ci.yaml/badge.svg)](https://github.com/sigoden/rag-crawler/actions/workflows/ci.yaml)
[![NPM Version](https://img.shields.io/npm/v/rag-crawler)](https://www.npmjs.com/package/rag-crawler)

Crawl a website to generate a knowledge file for RAG.
## Installation
```bash
npm i -g rag-crawler
yarn global add rag-crawler
```

## Usage
```
Usage: rag-crawler [options] <startUrl> [outPath]

Crawl a website to generate knowledge file for RAG

Examples:
  rag-crawler https://sigoden.github.io/mynotes/languages/
  rag-crawler https://sigoden.github.io/mynotes/languages/ data.json
  rag-crawler https://sigoden.github.io/mynotes/languages/ pages/
  rag-crawler https://github.com/sigoden/mynotes/tree/main/src/languages/

Arguments:
  startUrl                        The URL to start crawling from. Don't forget the trailing slash. [required]
  outPath                         The output path. If omitted, output to stdout

Options:
  --preset <name>                 Use predefined crawl options (default: "auto")
  -c, --max-connections <num>     Maximum concurrent connections when crawling the pages
  -e, --exclude <paths>           Comma-separated list of path names to exclude from crawling
  --extract <selector>            Extract specific content using a CSS selector. If omitted, extract all content
  --no-log                        Disable logging
  -V, --version                   output the version number
  -h, --help                      display help for command
```

**Output to stdout**
```
$ rag-crawler https://sigoden.github.io/mynotes/languages/
[
{
"path": "https://sigoden.github.io/mynotes/languages/",
"text": "# Languages ..."
},
{
"path": "https://sigoden.github.io/mynotes/languages/shell.html",
"text": "# Shell ..."
}
...
]
```

**Output to JSON file**
```
$ rag-crawler https://sigoden.github.io/mynotes/languages/ knowledge.json
```
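Each record in the generated knowledge file is a `{ path, text }` pair, as shown in the stdout example above. Below is a minimal TypeScript sketch of loading `knowledge.json` and splitting it into chunks for a RAG pipeline; the `Page` type name and the fixed 1000-character chunk size are assumptions for illustration, not part of rag-crawler:

```ts
import { readFileSync } from "node:fs";

// Shape of each record in the generated knowledge file, as seen in the
// stdout example above. The type name `Page` is an assumption.
interface Page {
  path: string; // source URL of the crawled page
  text: string; // extracted Markdown content
}

// Load the crawler output and split each page into fixed-size chunks for
// embedding. The 1000-character chunk size is an arbitrary example value.
const pages: Page[] = JSON.parse(readFileSync("knowledge.json", "utf8"));
const chunks = pages.flatMap(({ path, text }) => {
  const parts: { source: string; content: string }[] = [];
  for (let i = 0; i < text.length; i += 1000) {
    parts.push({ source: path, content: text.slice(i, i + 1000) });
  }
  return parts;
});
```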
**Output to separate files**

```
$ rag-crawler https://sigoden.github.io/mynotes/languages/ pages/
...
$ tree pages
pages
└── mynotes
├── languages
│ ├── markdown.md
│ ├── nodejs.md
│ ├── rust.md
│ └── shell.md
└── languages.md
```

**Crawl Markdown files in GitHub Tree**
```
$ rag-crawler https://github.com/sigoden/mynotes/tree/main/src/languages/ knowledge.json
```

> Many documentation sites host their source Markdown files on GitHub. The crawler has been optimized to crawl these files directly from GitHub.
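To see why crawling GitHub directly works well for Markdown sources, note that each file in a repository tree can be fetched as raw text. A minimal sketch of that idea, assuming the usual `raw.githubusercontent.com` URL mapping; this illustrates the general approach, not rag-crawler's actual implementation:

```ts
// Sketch: map a GitHub blob URL to its raw form and fetch the Markdown
// directly, skipping the rendered HTML page entirely. Illustrative only;
// this is not rag-crawler's actual code.
async function fetchGithubMarkdown(blobUrl: string): Promise<string> {
  const rawUrl = blobUrl
    .replace("github.com", "raw.githubusercontent.com")
    .replace("/blob/", "/");
  const res = await fetch(rawUrl); // global fetch (Node 18+)
  if (!res.ok) throw new Error(`HTTP ${res.status} for ${rawUrl}`);
  return res.text();
}
```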
## Preset
A preset consists of predefined crawl options. You can review the predefined presets at [./src/preset.ts](./src/preset.ts).
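Judging from the examples below, a preset entry has roughly the following shape. This is a minimal TypeScript sketch; the type name `Preset` and which fields are optional are assumptions:

```ts
// Sketch of a preset entry, inferred from the JS and JSON examples below.
// The type name `Preset` and the optionality of fields are assumptions.
interface Preset {
  name: string;          // preset identifier, e.g. "github-wiki"
  test: string;          // regex matched against the startUrl
  options: {
    exclude?: string[];  // path names to skip while crawling
    extract?: string;    // CSS selector for the content to keep
  };
}
```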
### Why Use Preset?
Let's use GitHub Wiki as an example. To enhance scraping quality, we need to configure both `--exclude` and `--extract`.
```
$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json --exclude _history --extract '#wiki-body'
```

Since all GitHub Wiki websites share these crawl options, we can define a preset for reusability.
```js
{
name: "github-wiki",
test: "github.com/([^/]+)/([^/]+)/wiki",
options: {
exclude: ["_history"],
extract: "#wiki-body",
},
}
```

This allows for a simplified command:
```
$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json --preset github-wiki
# or
$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json --preset auto
# or
$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json # `--preset` defaults to 'auto'
```

> When the preset is set to `auto`, rag-crawler will automatically determine the appropriate preset. It does this by checking if the `startUrl` matches the `test` regex.
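That matching step amounts to a first-match scan over the preset list. A minimal sketch, reusing the `Preset` shape sketched earlier; the function name `resolvePreset` is hypothetical, not rag-crawler's actual API:

```ts
// Sketch of `auto` resolution: return the first preset whose `test` regex
// matches the start URL. Hypothetical name; not rag-crawler's actual API.
function resolvePreset(startUrl: string, presets: Preset[]): Preset | undefined {
  return presets.find((p) => new RegExp(p.test).test(startUrl));
}

// resolvePreset("https://github.com/sigoden/aichat/wiki", presets)
// would select the "github-wiki" preset shown above.
```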
### Custom Presets
You can add custom presets by editing the `~/.rag-crawler.json` file:
```json
[
{
"name": "github-wiki",
"test": "github.com/([^/]+)/([^/]+)/wiki",
"options": {
"exclude": ["_history"],
"extract": "#wiki-body"
}
},
...
]
```

## License
The project is under the MIT License. Refer to the [LICENSE](https://github.com/sigoden/rag-crawler/blob/main/LICENSE) file for detailed information.