https://github.com/sigoden/rag-crawler

Crawl a website to generate knowledge file for RAG
crawler knowledge llm rag

Last synced: 4 months ago

Host: GitHub
URL: https://github.com/sigoden/rag-crawler
Owner: sigoden
License: mit
Created: 2024-06-28T21:39:18.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2024-08-13T01:58:11.000Z (over 1 year ago)
Last Synced: 2024-12-02T09:25:02.748Z (12 months ago)
Topics: crawler, knowledge, llm, rag
Language: TypeScript
Homepage:
Size: 126 KB
Stars: 19
Watchers: 2
Forks: 4
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

Awesome Lists containing this project

my-awesome-list - rag-crawler
jimsghstars - sigoden/rag-crawler - Crawl a website to generate knowledge file for RAG (TypeScript)

README

# rag-crawler

[![CI](https://github.com/sigoden/rag-crawler/actions/workflows/ci.yaml/badge.svg)](https://github.com/sigoden/rag-crawler/actions/workflows/ci.yaml)
[![NPM Version](https://img.shields.io/npm/v/rag-crawler)](https://www.npmjs.com/package/rag-crawler)

Crawl a website to generate knowledge file for RAG.

## Installation

```bash
npm i -g rag-crawler
# or
yarn global add rag-crawler
```

## Usage

```
Usage: rag-crawler [options] <startUrl> [outPath]

Crawl a website to generate knowledge file for RAG

Examples:
  rag-crawler https://sigoden.github.io/mynotes/languages/
  rag-crawler https://sigoden.github.io/mynotes/languages/ data.json
  rag-crawler https://sigoden.github.io/mynotes/languages/ pages/
  rag-crawler https://github.com/sigoden/mynotes/tree/main/src/languages/

Arguments:
  startUrl               The URL to start crawling from. Don't forget trailing slash. [required]
  outPath                The output path. If omitted, output to stdout

Options:
  --preset               Use predefined crawl options (default: "auto")
  -c, --max-connections  Maximum concurrent connections when crawling the pages
  -e, --exclude          Comma-separated list of path names to exclude from crawling
  --extract              Extract specific content using a CSS selector, If omitted, extract all content
  --no-log               Disable logging
  -V, --version          output the version number
  -h, --help             display help for command
```

**Output to stdout**
```
$ rag-crawler https://sigoden.github.io/mynotes/languages/
[
  {
    "path": "https://sigoden.github.io/mynotes/languages/",
    "text": "# Languages ..."
  },
  {
    "path": "https://sigoden.github.io/mynotes/languages/shell.html",
    "text": "# Shell ..."
  }
  ...
]
```
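
The JSON output is an array of objects with a `path` and a `text` field, as shown above. As a rough illustration of how such a file might be consumed downstream, the sketch below parses a sample array and splits each page into paragraph-based chunks. The `Page` and `Chunk` types, the `chunkPage` helper, the `example.com` sample data, and the 20-character limit are all assumptions made for this example; they are not part of rag-crawler.

```typescript
// Shape of each entry, as seen in the sample output above.
interface Page {
  path: string;
  text: string;
}

// Hypothetical chunk type for this example (not part of rag-crawler).
interface Chunk {
  path: string;
  chunk: string;
}

// Hypothetical helper: group a page's paragraphs (separated by blank lines)
// into chunks of roughly `maxChars` characters. A single paragraph longer
// than `maxChars` is kept whole rather than split mid-paragraph.
function chunkPage(page: Page, maxChars: number): Chunk[] {
  const result: Chunk[] = [];
  let current = "";
  for (const para of page.text.split(/\n\s*\n/)) {
    const candidate = current ? `${current}\n\n${para}` : para;
    if (candidate.length > maxChars && current) {
      result.push({ path: page.path, chunk: current });
      current = para;
    } else {
      current = candidate;
    }
  }
  if (current) result.push({ path: page.path, chunk: current });
  return result;
}

// Inline sample standing in for the crawler's JSON output.
const pages: Page[] = JSON.parse(`[
  {"path": "https://example.com/docs/", "text": "# Docs\\n\\nFirst paragraph.\\n\\nSecond paragraph."}
]`);

const chunks: Chunk[] = [];
for (const page of pages) {
  chunks.push(...chunkPage(page, 20));
}
console.log(chunks);
```

Each chunk keeps the `path` of the page it came from, so a retrieval step can cite the source URL.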

**Output to JSON file**
```
$ rag-crawler https://sigoden.github.io/mynotes/languages/ knowledge.json
```

**Output to separate files**

```
$ rag-crawler https://sigoden.github.io/mynotes/languages/ pages/
...
$ tree pages
pages
└── mynotes
    ├── languages
    │   ├── markdown.md
    │   ├── nodejs.md
    │   ├── rust.md
    │   └── shell.md
    └── languages.md
```
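
In the listing above, each page appears to be written to a `.md` file whose location mirrors the URL path. The sketch below shows one mapping that is consistent with that listing. It is inferred from the example only and is not taken from the rag-crawler source; the `pageFilePath` function and the `index` fallback for a root URL are assumptions.

```typescript
// Hypothetical mapping inferred from the directory listing above;
// the real rag-crawler logic may differ.
function pageFilePath(outDir: string, pageUrl: string): string {
  // e.g. "/mynotes/languages/shell.html"
  const pathname = new URL(pageUrl).pathname;
  const trimmed = pathname
    .replace(/^\/+|\/+$/g, "") // drop leading and trailing slashes
    .replace(/\.html?$/, ""); // drop a trailing .html or .htm extension
  const base = outDir.replace(/\/+$/, "");
  // Assumption: a root URL with an empty path falls back to "index".
  return `${base}/${trimmed || "index"}.md`;
}

console.log(pageFilePath("pages/", "https://sigoden.github.io/mynotes/languages/"));
console.log(pageFilePath("pages/", "https://sigoden.github.io/mynotes/languages/shell.html"));
```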

**Crawl Markdown files in GitHub Tree**

```
$ rag-crawler https://github.com/sigoden/mynotes/tree/main/src/languages/ knowledge.json
```

> Many documentation sites host their source Markdown files on GitHub. The crawler has been optimized to crawl these files directly from GitHub.

## Preset

A preset consists of predefined crawl options. You can review the predefined presets at [./src/preset.ts](./src/preset.ts).

### Why Use Preset?

Take a GitHub Wiki as an example. To get clean output, we need to configure both `--exclude` (to skip the `_history` pages) and `--extract` (to keep only the content inside `#wiki-body`).

```
$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json --exclude _history --extract '#wiki-body'
```

Since all GitHub Wiki websites share these crawl options, we can define a preset for reusability.

```js
{
  name: "github-wiki",
  test: "github.com/([^/]+)/([^/]+)/wiki",
  options: {
    exclude: ["_history"],
    extract: "#wiki-body",
  },
}
```

This allows for a simplified command:

```
$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json --preset github-wiki
# or
$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json --preset auto
# or
$ rag-crawler https://github.com/sigoden/aichat/wiki wiki.json # `--preset` defaults to 'auto'
```

> When the preset is set to `auto`, rag-crawler will automatically determine the appropriate preset. It does this by checking if the `startUrl` matches the `test` regex.
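
The `auto` behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the `Preset` interface and the `resolvePreset` function are hypothetical names, and returning the first matching preset is an assumption about how ties would be handled.

```typescript
// Hypothetical preset shape, mirroring the fields shown in the example preset.
interface Preset {
  name: string;
  test: string; // regex source matched against the start URL
  options: { exclude?: string[]; extract?: string };
}

const presets: Preset[] = [
  {
    name: "github-wiki",
    test: "github.com/([^/]+)/([^/]+)/wiki",
    options: { exclude: ["_history"], extract: "#wiki-body" },
  },
];

// Hypothetical resolver: with "auto", return the first preset whose `test`
// regex matches the start URL; otherwise look the preset up by name.
function resolvePreset(
  startUrl: string,
  presetName: string,
  all: Preset[],
): Preset | undefined {
  for (const preset of all) {
    const matches =
      presetName === "auto"
        ? new RegExp(preset.test).test(startUrl)
        : preset.name === presetName;
    if (matches) return preset;
  }
  return undefined;
}

console.log(resolvePreset("https://github.com/sigoden/aichat/wiki", "auto", presets)?.name);
```

With this sketch, a URL that matches no `test` pattern resolves to `undefined`, meaning no preset options would be applied.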

### Custom Presets

You can add custom presets by editing the `~/.rag-crawler.json` file:

```json
[
  {
    "name": "github-wiki",
    "test": "github.com/([^/]+)/([^/]+)/wiki",
    "options": {
      "exclude": ["_history"],
      "extract": "#wiki-body"
    }
  },
  ...
]
```

## License

This project is released under the MIT License. Refer to the [LICENSE](https://github.com/sigoden/rag-crawler/blob/main/LICENSE) file for details.
