https://tidyverse.github.io/ragnar/

Host: GitHub
URL: https://tidyverse.github.io/ragnar/
Owner: tidyverse
License: other
Created: 2025-01-20T19:30:06.000Z (6 months ago)
Default Branch: main
Last Pushed: 2025-03-27T15:43:14.000Z (4 months ago)
Last Synced: 2025-04-02T02:37:37.111Z (3 months ago)
Language: R
Homepage: https://tidyverse.github.io/ragnar/
Size: 14.4 MB
Stars: 39
Watchers: 4
Forks: 3
Open Issues: 3
Metadata Files:
- Readme: README.Rmd
- License: LICENSE

Awesome Lists containing this project

awesome-generative-ai-data-scientist - Ragnar - Retrieval-Augmented Generation (RAG) workflows. | [Website](https://tidyverse.github.io/ragnar/) | (RAG in R)

README

---
output: github_document
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)
```

# ragnar 

[![R-CMD-check](https://github.com/tidyverse/ragnar/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/tidyverse/ragnar/actions/workflows/R-CMD-check.yaml)

`ragnar` is an R package that helps implement Retrieval-Augmented
Generation (RAG) workflows. It focuses on providing a complete solution
with sensible defaults, while still giving the knowledgeable user
precise control over each step. We don't believe that you can fully
automate the creation of a good RAG system, so it's important that
`ragnar` is not a black box. `ragnar` is designed to be transparent: you
can easily inspect outputs at intermediate steps to understand what's
happening.

## Installation

``` r
pak::pak("tidyverse/ragnar")
```

## Key Steps

### 1. Document Processing

`ragnar` works with a wide variety of document types, using
[MarkItDown](https://github.com/microsoft/markitdown) to convert content
to Markdown.

Key functions:

-   `ragnar_find_links()`: Find all links in a webpage

-   `ragnar_read()`: Convert a file or URL to markdown

### 2. Text Chunking

Next, we divide each document into multiple chunks. `ragnar` defaults to a
strategy that preserves some of the semantics of the document, but
provides plenty of options to tweak the approach.

Key functions:

-   `ragnar_chunk()`: Higher-level function that both identifies
    semantic boundaries and chunks text.
-   `ragnar_segment()`: Lower-level function that identifies semantic
    boundaries.
-   `ragnar_chunk_segments()`: Lower-level function that chunks
    pre-segmented text.
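
To make the idea of boundary-aware chunking concrete, here is a minimal base R sketch. It is not `ragnar`'s algorithm: `chunk_by_paragraph()` and its `max_chars` argument are hypothetical names used only for illustration, and the only boundary it respects is the blank line between paragraphs.

```r
# Hypothetical helper, not part of ragnar: greedily pack whole paragraphs
# into chunks of at most `max_chars` characters, never splitting a paragraph.
chunk_by_paragraph <- function(text, max_chars = 200) {
  # Treat one or more blank lines as a paragraph boundary.
  paragraphs <- strsplit(text, "\n[ \t]*\n+")[[1]]
  paragraphs <- trimws(paragraphs)
  paragraphs <- paragraphs[nzchar(paragraphs)]

  chunks <- character()
  current <- ""
  for (p in paragraphs) {
    candidate <- if (nzchar(current)) paste(current, p, sep = "\n\n") else p
    if (nchar(candidate) <= max_chars || !nzchar(current)) {
      # The paragraph fits, or the chunk is empty and must accept it anyway.
      current <- candidate
    } else {
      # Close the current chunk and start a new one with this paragraph.
      chunks <- c(chunks, current)
      current <- p
    }
  }
  if (nzchar(current)) chunks <- c(chunks, current)
  chunks
}

doc <- "First paragraph.\n\nSecond paragraph.\n\nThird paragraph is a bit longer."
chunks <- chunk_by_paragraph(doc, max_chars = 40)
print(chunks)
```

The first two paragraphs fit together within the 40-character budget, so they share a chunk; the third starts a new one. A real chunker would typically fall back to finer boundaries (such as sentences) when a single paragraph exceeds the budget.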

### 3. Context Augmentation (Optional)

RAG applications benefit from augmenting text chunks with additional
context, such as document headings and subheadings. While `ragnar`
doesn't directly export functions for this, it supports template-based
augmentation through `ragnar_read(frame_by_tags, split_by_tags)`. Future
versions will support generating context summaries via LLM calls.

Key functions:

-   `ragnar_read()`: Use `frame_by_tags` and/or `split_by_tags`
    arguments to associate text chunks with their document position.
-   `markdown_segment()`: Segment markdown text into a character vector
    using semantic tags (e.g., headings, paragraphs, or code chunks).
-   `markdown_frame()`: Convert markdown text into a dataframe.
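
As an illustration of template-based augmentation, the base R sketch below prefixes each chunk with the headings it sits under. The data frame layout (columns `h1`, `h2`, `text`) is an assumption about what a heading-framed document might look like, and `augment_with_headings()` is a hypothetical helper, not a `ragnar` function.

```r
# Hypothetical chunk table; the column names are assumptions for illustration.
chunks <- data.frame(
  h1 = c("Data import", "Data import"),
  h2 = c("Reading CSV files", NA),
  text = c("Use read_csv() to load a file.",
           "This chapter covers importing data."),
  stringsAsFactors = FALSE
)

# Hypothetical helper: prepend the available headings to each chunk's text.
augment_with_headings <- function(df) {
  heading_cols <- intersect(c("h1", "h2", "h3"), names(df))
  vapply(seq_len(nrow(df)), function(i) {
    headings <- unlist(df[i, heading_cols, drop = FALSE])
    headings <- headings[!is.na(headings)]
    prefix <- paste(headings, collapse = " > ")
    if (nzchar(prefix)) paste0("[", prefix, "]\n", df$text[i]) else df$text[i]
  }, character(1))
}

augmented <- augment_with_headings(chunks)
cat(augmented, sep = "\n\n")
```

Embedding the augmented text rather than the bare chunk gives the retriever a hint about where in the document each chunk came from.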

### 4. Embedding

`ragnar` can help compute embeddings for each chunk. The goal is for
`ragnar` to provide access to embeddings from popular LLM providers.
Currently, only the `ollama` and `openai` providers are supported.

Key functions:

-   `embed_ollama()`

-   `embed_openai()`

Note that calling the embedding function directly is typically not
necessary. Instead, the embedding function is specified when a store is
first created, and then automatically called when needed by
`ragnar_retrieve()` and `ragnar_store_insert()`.
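
To show the general shape of an embedding function, here is a toy base R stand-in. It assumes the contract is "character vector in, numeric matrix out, one row per input"; that contract and the name `embed_toy()` are illustrative assumptions, and the word-hashing trick below is in no way how `embed_ollama()` or `embed_openai()` compute embeddings.

```r
# Toy stand-in for a real embedding model (an assumption for illustration):
# hash each word into one of `dims` buckets, then L2-normalise the counts.
embed_toy <- function(x, dims = 8) {
  stopifnot(is.character(x))
  out <- matrix(0, nrow = length(x), ncol = dims)
  for (i in seq_along(x)) {
    words <- strsplit(tolower(x[i]), "[^a-z0-9]+")[[1]]
    words <- words[nzchar(words)]
    for (w in words) {
      # Crude hash: sum of code points, folded into 1..dims.
      bucket <- sum(utf8ToInt(w)) %% dims + 1
      out[i, bucket] <- out[i, bucket] + 1
    }
    # Normalise to unit length so dot products behave like cosine similarity.
    norm <- sqrt(sum(out[i, ]^2))
    if (norm > 0) out[i, ] <- out[i, ] / norm
  }
  out
}

emb <- embed_toy(c("tidy data in R", "embedding text chunks"))
print(dim(emb))
```

Any function with this shape can be swapped for another, which is why a store only needs to be told once which embedding function to use.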

### 5. Storage

Processed data is stored in a format optimized for efficient searching,
using `duckdb` by default. The API is designed to be extensible,
allowing additional packages to implement support for different storage
providers.

Key functions:

-   `ragnar_store_create()`

-   `ragnar_store_connect()`

-   `ragnar_store_insert()`

### 6. Retrieval

Given a prompt, retrieve related chunks based on embedding distance or
BM25 text search.

Key functions:

-   `ragnar_retrieve()`
-   `ragnar_retrieve_vss()`: Retrieve using the [`vss` DuckDB
    extension](https://duckdb.org/docs/extensions/vss.html)
-   `ragnar_retrieve_bm25()`: Retrieve using the
    [full-text search DuckDB extension](https://duckdb.org/docs/extensions/full_text_search.html)
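
Retrieval by embedding distance can be sketched in a few lines of base R. The example below ranks chunks by cosine similarity to a query vector; `retrieve_top_k()` is a hypothetical helper and the three-dimensional embeddings are made-up numbers, so this shows the idea rather than what the `vss` extension does internally.

```r
# Hypothetical helper: rank stored chunks by cosine similarity to a query.
retrieve_top_k <- function(query_vec, chunk_mat, chunk_text, k = 2) {
  # Cosine similarity = dot product divided by the product of the norms.
  sims <- as.vector(chunk_mat %*% query_vec) /
    (sqrt(rowSums(chunk_mat^2)) * sqrt(sum(query_vec^2)))
  top <- order(sims, decreasing = TRUE)[seq_len(min(k, length(sims)))]
  data.frame(
    text = chunk_text[top],
    similarity = sims[top],
    stringsAsFactors = FALSE
  )
}

# Made-up embeddings, one row per chunk.
chunk_text <- c(
  "Pivoting data with tidyr",
  "Plotting with ggplot2",
  "Joining tables with dplyr"
)
chunk_mat <- rbind(
  c(0.9, 0.1, 0.0),
  c(0.0, 0.2, 0.9),
  c(0.7, 0.6, 0.1)
)
query_vec <- c(1, 0.2, 0)

result <- retrieve_top_k(query_vec, chunk_mat, chunk_text, k = 2)
print(result)
```

With these numbers the tidyr chunk ranks first and the dplyr chunk second, because their vectors point in nearly the same direction as the query. BM25 retrieval differs in that it scores chunks by term overlap with the query text instead of by vector geometry.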

### 7. Re-ranking (Optional)

Re-ranking of retrieved chunks is planned for future releases.

### 8. Prompt Generation

`ragnar` can equip an `ellmer::Chat` object with a retrieve tool that
enables an LLM to retrieve content from a store on demand.

-   `ragnar_register_tool_retrieve(chat, store)`.

## Usage

Here's an example of using `ragnar` to create a knowledge store from the
*R for Data Science (2e)* book:

```{r, code = readLines("examples/example-create-store.R")}
```

Once the store is set up, you can then retrieve the most relevant text
chunks.

```{r, code = readLines("examples/example-retrieve.R")}
```
