https://github.com/christianromney/llama-org-rag

Experiment with LlamaIndex to perform RAG over my org-mode notes
https://github.com/christianromney/llama-org-rag
Last synced: 4 months ago
JSON representation
Experiment with LlamaIndex to perform RAG over my org-mode notes
Host: GitHub
URL: https://github.com/christianromney/llama-org-rag
Owner: christianromney
Created: 2024-02-22T02:59:09.000Z (over 2 years ago)
Default Branch: main
Last Pushed: 2024-05-03T22:05:48.000Z (about 2 years ago)
Last Synced: 2024-05-03T23:23:17.309Z (about 2 years ago)
Language: Python
Size: 2 MB
Stars: 1
Watchers: 2
Forks: 0
Open Issues: 5
Metadata Files:
- Readme: README.org
- Changelog: CHANGELOG.md
Awesome Lists containing this project

README

          #+TITLE: org-mode RAG Project with LlamaIndex

* Overview

This project is a proof of concept for a tool to index my org-mode files to

support semantic search using LlamaIndex.

** LlamaIndex

*** Environment Setup

This project uses the [[https://direnv.net/man/direnv-stdlib.1.html#codelayout-python3code][python3 layout]] from [[https://direnv.net/][direnv]]'s standard library. The ~.envrc~

file is managed by [[https://www.agwa.name/projects/git-crypt/][git-crypt]] which handles encryption transparently. This file

resembles the following example:

#+begin_src shell :file .envrc

export PROJECT_NAME=llama-org-rag

export OPENAI_API_KEY=

export TOKENIZERS_PARALLELISM=true

layout python3

#+end_src

With this file in place, running: ~direnv allow~ will enable direnv for this

project and create a python virtual environment under the (ignored) ~.direnv~

directory. Direnv will add the ~python~ and ~pip~ executables from the virtual

environment to the path automatically.

Now we can install this project's requirements with:

#+begin_src shell

pip install -r requirements.txt

#+end_src

**** Optional Tools

I also use [[https://docs.astral.sh/ruff/][ruff]] for Python linting and formatting and [[https://github.com/evilmartians/lefthook][lefthook]] for Git hook

management. The ~ruff.toml~ and ~lefthook.yml~ files in root directory contain their

respective configurations.

I use git-cliff to update the CHANGELOG.md.

*** Conceptual Overview

**** Retrieval Augmented Generation (RAG)

Capitalized nouns in this section correspond to LlamaIndex classes

***** Loading (Reading)

- reading source information from data source(s)

- Connectors (aka Readers) know how to ingest particular formats and sources of data

  - e.g. SimpleDirectoryReader

- Document is an abstraction that acts like generic container for loaded data

  - tracks metadata and relationships among data

  - example metadata: file attributes, parent/child relationships

***** Transforming

- splitting or chunking Documents into Nodes

  - by sentence, character, token, semantic chunk

  - inherit Document metadata

***** Indexing

- Indices are data structures that support efficient retrieval and query,

  particularly /semantic/ query

- use vector Embeddings, which map words to numerical vectors such that related

  concepts are nearby in the vector space

- calculating Embeddings for each Node and associating them

***** Storing

- persisting the index durably

- Indices can persist via their Storage Context directly to files on disk or

- can use some vector store / database

***** Retrieving

- Retrievers fetch Nodes from an index

- Routers select the optimal retriever from one or more possible choices

- embed a query and perform a similarity search against an index (possibly

  backed by a store)

- return top_k relevant Nodes

***** Node Post-Processing

- filter, augment, or reorder each node according to some criteria

***** Augmented Generation

- embed post-processed nodes into LLM context

  - llm prompt contains:

    - system prompt, retrieved nodes*, prior messages*, query

- Response Synthesizer generates a response from an LLM using a query and

  retrieved data

***** Query Engine

- end-to-end pipeline for producing a response to a query using an LLM and

  retrieved content

***** Chat Engine

- end-to-end pipeline for having chat (multiple back and forth Q&A)

***** Agents

- automated *decision maker* powered by an LLM that interacts with the world using

  Tools

*** Tools and Libraries

**** Vector Stores

***** Chroma DB

- in memory or embedded (sqlite) vector db

- experience showed sqlite embeddings get "stuck" in a queue table

***** Lance DB

- embedded vector db persisted to files on disk and run from memory

- initialization requires schema or data from which to infer it

***** Qdrant

- containerized or hosted vector db

- easy setup and usage from llama-index

- what is with the [[https://python-client.qdrant.tech/qdrant_client.http.models.models][horrible]] ui/readability choices for generated Python docs?

  + no list of classes or methods (have to scroll or search the page)

    * no link anchor / heading to each class or method when it does appear

  + poor contrast grey-on-grey color scheme for class and method names

  + if you ever needed an example of why types don't make everything better,

    have fun deciphering this…

    #+begin_example

    shard_key_selector: Optional[Union[int[int], str[str], List[Union[int[int], str[str]]]]] = None

    #+end_example

- otoh, [[https://qdrant.github.io/qdrant/redoc/index.html#tag/collections][REST documentation]] is readable and navigable

- would be nice to have a method for checking if a named collection exists

  + ~get_collection(collection_name="foo")~ [[[https://python-client.qdrant.tech/_modules/qdrant_client/qdrant_client#QdrantClient.get_collection][source]]] throws if collection not

    found

- need to explore features

***** TODO Weaviate

- popular containerized, embedded, or hosted vector db

**** LangChain v. LlamaIndex Impressions

***** API / Design

- LangChain's API is simpler, but seems more limiting than LlamaIndex's

***** Documentation

- LangChain's API docs are [[https://api.python.langchain.com/en/stable/langchain_api_reference.html][well-organized]], readable and link to [[https://api.python.langchain.com/en/stable/_modules/langchain/agents/agent.html#Agent.aplan][source]]

- LLamaIndex's core API docs just [[https://docs.llamaindex.ai/en/stable/api_reference/indices/vector_store.html][ok]] to read

  - don't like organization

    - prefer package/class listing like Javadoc

  - don't link to source

***** Community

- LangChain has lots of [[https://api.python.langchain.com/en/stable/community_api_reference.html#][community packages]]

- LlamaIndex has [[https://llamahub.ai/][LlamaHub]] community package implementations

***** Utilities

- create-llama :: [[https://www.npmjs.com/package/create-llama][node-based]] bootstrapper for LlamaIndex ([[https://blog.llamaindex.ai/create-llama-a-command-line-tool-to-generate-llamaindex-apps-8f7683021191][blog]], [[https://youtu.be/GOv4arrbVi8?si=9-TEs-_SbKUnhgWx][video]])

***** Observability

- LangSmith :: freemium hosted observability tooling ([[https://docs.smith.langchain.com/][docs]])

  - limit 1 project for free "Developer" plan

- DeepEval :: open-source observability for LLM apps ([[https://github.com/confident-ai/deepeval][Github]], [[https://docs.confident-ai.com/][docs]])

  - unit tests can report to Confident-AI (freemium like LangSmith)

  - metrics can be used with any framework

  - LlamaIndex Evaluators included

- openllmetry :: freemium? open-source observability ([[https://github.com/traceloop/openllmetry][Github]], [[https://www.traceloop.com/docs/openllmetry/introduction][docs]])

- Arize Phoneix :: ooh pretty! ([[https://github.com/Arize-ai/phoenix][Github]], [[https://docs.arize.com/phoenix][docs]])

**** Miscellaneous Libraries

- [[https://unstructured-io.github.io/unstructured/][unstructured.io]]'s so-called [[https://github.com/Unstructured-IO/unstructured/blob/1947375b2eee8477f7ac95f55783b8262cb90ca9/unstructured/partition/org.py#L4][org-mode support]] is disappointing

  - uses [[https://github.com/JessicaTegner/pypandoc#usage][pypandoc]] under the hood

  - parses as HTML

  - identifies headings and lists, but none of org's richness

*** RAG Proof of Concept (Python)

The code in [[https://github.com/christianromney/llama-org-rag/blob/main/rag.py][rag.py]] uses LlamaIndex to perform Retrieval Augmented Generation

(RAG) over my org-mode documents (org-roam notes, org todos and org agenda).

*** Output

Figure 1. List of all indexed files

[[file:img/list.png]]

Figure 2. Refreshing the disk index with novelty

[[file:img/refresh.png]]

Figure 3. One-shot query (suitable for automation)

[[file:img/query.png]]

Figure 4. Interactive chat

[[file:img/interactive.png]]

*** Impressions

This section captures what I learned from this experiment. Overall, I think

there's a lot of promise in semantic, generative search over my documents. I

need to learn more about techniques people use to get better results from RAG,

and there are lots of papers from which to draw

[cite:@barnett-SevenFailurePointsRAG-2024].

- I'm slightly disappointed in the LangChain API, Chroma DB, and Unstructured.

- I prefer LlamaIndex's API, though its docs are not as good as LangChain's.

- I dislike Sphinx-generated Python documentation generally for its complexity,

  layout, and theming.

- I like pdoc API documentation very much for its simplicity and clean UI.

- It's easy to forget LLMs don't know simple things, like the current date.

- LlamaIndex's on-disk persisted index refreshing seems broken, producing

  duplicate embeddings.

*** Future Work

- [X] +add result evaluation using a secondary LLM (chatgpt-4-turbo-preview)+ using

- [X] experiment with different retrieval parameters

- [X] persist my index to a proper vector database

- [ ] experiment with better retrieval techniques / architectures (e.g. Crew AI)

- [ ] convert this to a full-fledged agent with access to tools

  - [ ] use ReAct or LLMCompiler to leverage LLMs planning abilities

  - [ ] tools should include Google, Wikipedia, and Wolfram Alpha

  - [ ] a basic tool to get the current date and possibly holiday calendars

- [ ] improve result formatting consistency

- [ ] improve discovery

- [ ] improve performance (latency)

- [ ] periodically update my index `org-rag --refresh` (upsert)

- [ ] experiment with knowledge graph

- [ ] wire this up to an Emacs command (JSON API?)

- [ ] evaluate [[https://blog.streamlit.io/build-a-chatbot-with-custom-data-sources-powered-by-llamaindex/][different UIs]]
ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/christianromney/llama-org-rag

Awesome Lists containing this project

README