https://github.com/philippe2803/contentmap
Build a RAG dataset for your domain in just a few lines of code, using your XML sitemap
python rag sqlite
- Host: GitHub
- URL: https://github.com/philippe2803/contentmap
- Owner: philippe2803
- Created: 2024-01-08T15:38:37.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2024-08-24T10:20:52.000Z (almost 2 years ago)
- Last Synced: 2024-08-25T10:56:17.983Z (almost 2 years ago)
- Topics: python, rag, sqlite
- Language: Python
- Homepage: https://philippeoger.com/pages/can-we-rag-the-whole-web
- Size: 192 KB
- Stars: 29
- Watchers: 1
- Forks: 2
- Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
README
# Content map
Content map shares the content of a specific domain as a SQLite database, as an
alternative to RSS feeds. The library builds a dataset of all the content on your
website, using the XML sitemap as a starting point.
The dataset can optionally include vector similarity search.
The rationale behind this type of dataset is explained in [this article](https://philippeoger.com/pages/can-we-rag-the-whole-web/).
## Installation
```bash
pip install contentmap
```
## Quickstart
To build a contentmap.db that contains all your content and supports vector
search, using your XML sitemap as a starting point, you only need to write the
following:
```python
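from contentmap.sitemap import SitemapToContentDatabase

# Configure the build from one or more XML sitemap URLs.
database = SitemapToContentDatabase(
    sitemap_sources=["https://yourblog.com/sitemap.xml"],
    concurrency=10,
    include_vss=True,  # add vector search capabilities to the database
)

# Create the SQLite database file.
database.build()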
```
This automatically creates the SQLite database file with vector search
capabilities (relying on LangChain's sqlite-vss integration).
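
Once the file exists, you may want to check what it contains. The sketch below is not part of contentmap: it uses only Python's standard `sqlite3` module to list the tables in the generated file. The helper name `list_tables` is hypothetical, and the file name `contentmap.db` in the current directory is an assumption taken from the text above. No table names are assumed, since the schema is not described here.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return (name, type) pairs for every object stored in a SQLite file."""
    # Open read-only so that a wrong path raises an error
    # instead of silently creating an empty database file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        rows = conn.execute(
            "SELECT name, type FROM sqlite_master ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return rows


if __name__ == "__main__":
    # Assumed location of the generated file; adjust if yours differs.
    db_file = Path("contentmap.db")
    if db_file.exists():
        for name, kind in list_tables(db_file):
            print(kind, name)
```

Reading `sqlite_master` only inspects the schema, so this should work even without the sqlite-vss extension loaded.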
Thanks to @medoror for contributing.