https://github.com/philippe2803/contentmap

Build a RAG dataset for your domain in just a few lines of code, using your XML sitemap

python rag sqlite

Host: GitHub
URL: https://github.com/philippe2803/contentmap
Owner: philippe2803
Created: 2024-01-08T15:38:37.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-08-24T10:20:52.000Z (almost 2 years ago)
Last Synced: 2024-08-25T10:56:17.983Z (almost 2 years ago)
Topics: python, rag, sqlite
Language: Python
Homepage: https://philippeoger.com/pages/can-we-rag-the-whole-web
Size: 192 KB
Stars: 29
Watchers: 1
Forks: 2
Open Issues: 2
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md

README

# Content map

Content map is a way to share the content of a specific domain as a SQLite
database, as an alternative to RSS feeds. The library builds a dataset of all the
content on your website, using the XML sitemap as a starting point.

Vector similarity search can be added to the dataset with a single option (`include_vss=True`).

The rationale behind this type of dataset is explained in [this article](https://philippeoger.com/pages/can-we-rag-the-whole-web/).
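
To illustrate what "using the XML sitemap as a starting point" means, here is a minimal sketch that pulls page URLs out of a sitemap document with the Python standard library. It is not taken from contentmap's source: the `extract_urls` helper and the sample sitemap are hypothetical and for illustration only, and the library's own crawling logic may differ.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return the <loc> value of every <url> entry in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, for illustration only.
EXAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://yourblog.com/first-post</loc></url>
  <url><loc>https://yourblog.com/second-post</loc></url>
</urlset>"""

print(extract_urls(EXAMPLE_SITEMAP))
```

Each extracted URL is a page whose content would then be fetched and stored in the dataset.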

## Installation

```bash
pip install contentmap
```

## Quickstart

To build a `contentmap.db` file that contains all your content and supports
vector search, starting from your XML sitemap, you only need the following:

```python
from contentmap.sitemap import SitemapToContentDatabase

database = SitemapToContentDatabase(
    sitemap_sources=["https://yourblog.com/sitemap.xml"],  # one or more sitemap URLs
    concurrency=10,
    include_vss=True,  # build the database with vector search capabilities
)
database.build()
```

This automatically creates the SQLite database file with vector search
capabilities, which rely on the sqlite-vss integration in LangChain.
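
Because the result is an ordinary SQLite file, it can be inspected with Python's built-in `sqlite3` module. The sketch below makes no assumption about contentmap's schema: it only lists the table names recorded in the file. The `list_tables` helper is hypothetical, and the location of the generated file is an assumption (the Quickstart only names it `contentmap.db`).

```python
import sqlite3


def list_tables(db_path: str) -> list[str]:
    """Return the names of all tables recorded in a SQLite database file."""
    # Open read-only so that a wrong path raises an error instead of
    # silently creating an empty database file.
    connection = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        connection.close()
    return [name for (name,) in rows]
```

Calling `list_tables("contentmap.db")` after `database.build()` should show which tables were created. Reading the vector search tables themselves will likely require loading the sqlite-vss extension first.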

Thanks to @medoror for contributing.
