An open API service indexing awesome lists of open source software.

https://github.com/christianromney/llama-org-rag

Experiment with LlamaIndex to perform RAG over my org-mode notes
https://github.com/christianromney/llama-org-rag

Last synced: 4 months ago
JSON representation

Experiment with LlamaIndex to perform RAG over my org-mode notes

Awesome Lists containing this project

README

          

#+TITLE: org-mode RAG Project with LlamaIndex
* Overview
This project is a proof of concept for a tool to index my org-mode files to
support semantic search using LlamaIndex.
** LlamaIndex
*** Environment Setup
This project uses the [[https://direnv.net/man/direnv-stdlib.1.html#codelayout-python3code][python3 layout]] from [[https://direnv.net/][direnv]]'s standard library. The ~.envrc~
file is managed by [[https://www.agwa.name/projects/git-crypt/][git-crypt]] which handles encryption transparently. This file
resembles the following example:

#+begin_src shell :file .envrc
export PROJECT_NAME=llama-org-rag
export OPENAI_API_KEY=
export TOKENIZERS_PARALLELISM=true
layout python3
#+end_src

With this file in place, running: ~direnv allow~ will enable direnv for this
project and create a python virtual environment under the (ignored) ~.direnv~
directory. Direnv will add the ~python~ and ~pip~ executables from the virtual
environment to the path automatically.

Now we can install this project's requirements with:
#+begin_src shell
pip install -r requirements.txt
#+end_src
**** Optional Tools
I also use [[https://docs.astral.sh/ruff/][ruff]] for Python linting and formatting and [[https://github.com/evilmartians/lefthook][lefthook]] for Git hook
management. The ~ruff.toml~ and ~lefthook.yml~ files in root directory contain their
respective configurations.

I use git-cliff to update the CHANGELOG.md.
*** Conceptual Overview
**** Retrieval Augmented Generation (RAG)
Capitalized nouns in this section correspond to LlamaIndex classes
***** Loading (Reading)
- reading source information from data source(s)
- Connectors (aka Readers) know how to ingest particular formats and sources of data
- e.g. SimpleDirectoryReader
- Document is an abstraction that acts like generic container for loaded data
- tracks metadata and relationships among data
- example metadata: file attributes, parent/child relationships
***** Transforming
- splitting or chunking Documents into Nodes
- by sentence, character, token, semantic chunk
- inherit Document metadata
***** Indexing
- Indices are data structures that support efficient retrieval and query,
particularly /semantic/ query
- use vector Embeddings, which map words to numerical vectors such that related
concepts are nearby in the vector space
- calculating Embeddings for each Node and associating them
***** Storing
- persisting the index durably
- Indices can persist via their Storage Context directly to files on disk or
- can use some vector store / database
***** Retrieving
- Retrievers fetch Nodes from an index
- Routers select the optimal retriever from one or more possible choices
- embed a query and perform a similarity search against an index (possibly
backed by a store)
- return top_k relevant Nodes
***** Node Post-Processing
- filter, augment, or reorder each node according to some criteria
***** Augmented Generation
- embed post-processed nodes into LLM context
- llm prompt contains:
- system prompt, retrieved nodes*, prior messages*, query
- Response Synthesizer generates a response from an LLM using a query and
retrieved data
***** Query Engine
- end-to-end pipeline for producing a response to a query using an LLM and
retrieved content
***** Chat Engine
- end-to-end pipeline for having chat (multiple back and forth Q&A)
***** Agents
- automated *decision maker* powered by an LLM that interacts with the world using
Tools

*** Tools and Libraries
**** Vector Stores
***** Chroma DB
- in memory or embedded (sqlite) vector db
- experience showed sqlite embeddings get "stuck" in a queue table
***** Lance DB
- embedded vector db persisted to files on disk and run from memory
- initialization requires schema or data from which to infer it
***** Qdrant
- containerized or hosted vector db
- easy setup and usage from llama-index
- what is with the [[https://python-client.qdrant.tech/qdrant_client.http.models.models][horrible]] ui/readability choices for generated Python docs?
+ no list of classes or methods (have to scroll or search the page)
* no link anchor / heading to each class or method when it does appear
+ poor contrast grey-on-grey color scheme for class and method names
+ if you ever needed an example of why types don't make everything better,
have fun deciphering this…
#+begin_example
shard_key_selector: Optional[Union[int[int], str[str], List[Union[int[int], str[str]]]]] = None
#+end_example
- otoh, [[https://qdrant.github.io/qdrant/redoc/index.html#tag/collections][REST documentation]] is readable and navigable
- would be nice to have a method for checking if a named collection exists
+ ~get_collection(collection_name="foo")~ [[[https://python-client.qdrant.tech/_modules/qdrant_client/qdrant_client#QdrantClient.get_collection][source]]] throws if collection not
found
- need to explore features
***** TODO Weaviate
- popular containerized, embedded, or hosted vector db
**** LangChain v. LlamaIndex Impressions
***** API / Design
- LangChain's API is simpler, but seems more limiting than LlamaIndex's
***** Documentation
- LangChain's API docs are [[https://api.python.langchain.com/en/stable/langchain_api_reference.html][well-organized]], readable and link to [[https://api.python.langchain.com/en/stable/_modules/langchain/agents/agent.html#Agent.aplan][source]]
- LLamaIndex's core API docs just [[https://docs.llamaindex.ai/en/stable/api_reference/indices/vector_store.html][ok]] to read
- don't like organization
- prefer package/class listing like Javadoc
- don't link to source
***** Community
- LangChain has lots of [[https://api.python.langchain.com/en/stable/community_api_reference.html#][community packages]]
- LlamaIndex has [[https://llamahub.ai/][LlamaHub]] community package implementations
***** Utilities
- create-llama :: [[https://www.npmjs.com/package/create-llama][node-based]] bootstrapper for LlamaIndex ([[https://blog.llamaindex.ai/create-llama-a-command-line-tool-to-generate-llamaindex-apps-8f7683021191][blog]], [[https://youtu.be/GOv4arrbVi8?si=9-TEs-_SbKUnhgWx][video]])
***** Observability
- LangSmith :: freemium hosted observability tooling ([[https://docs.smith.langchain.com/][docs]])
- limit 1 project for free "Developer" plan
- DeepEval :: open-source observability for LLM apps ([[https://github.com/confident-ai/deepeval][Github]], [[https://docs.confident-ai.com/][docs]])
- unit tests can report to Confident-AI (freemium like LangSmith)
- metrics can be used with any framework
- LlamaIndex Evaluators included
- openllmetry :: freemium? open-source observability ([[https://github.com/traceloop/openllmetry][Github]], [[https://www.traceloop.com/docs/openllmetry/introduction][docs]])
- Arize Phoneix :: ooh pretty! ([[https://github.com/Arize-ai/phoenix][Github]], [[https://docs.arize.com/phoenix][docs]])
**** Miscellaneous Libraries
- [[https://unstructured-io.github.io/unstructured/][unstructured.io]]'s so-called [[https://github.com/Unstructured-IO/unstructured/blob/1947375b2eee8477f7ac95f55783b8262cb90ca9/unstructured/partition/org.py#L4][org-mode support]] is disappointing
- uses [[https://github.com/JessicaTegner/pypandoc#usage][pypandoc]] under the hood
- parses as HTML
- identifies headings and lists, but none of org's richness
*** RAG Proof of Concept (Python)
The code in [[https://github.com/christianromney/llama-org-rag/blob/main/rag.py][rag.py]] uses LlamaIndex to perform Retrieval Augmented Generation
(RAG) over my org-mode documents (org-roam notes, org todos and org agenda).

*** Output

Figure 1. List of all indexed files
[[file:img/list.png]]

Figure 2. Refreshing the disk index with novelty
[[file:img/refresh.png]]

Figure 3. One-shot query (suitable for automation)
[[file:img/query.png]]

Figure 4. Interactive chat
[[file:img/interactive.png]]

*** Impressions
This section captures what I learned from this experiment. Overall, I think
there's a lot of promise in semantic, generative search over my documents. I
need to learn more about techniques people use to get better results from RAG,
and there are lots of papers from which to draw
[cite:@barnett-SevenFailurePointsRAG-2024].

- I'm slightly disappointed in the LangChain API, Chroma DB, and Unstructured.
- I prefer LlamaIndex's API, though its docs are not as good as LangChain's.
- I dislike Sphinx-generated Python documentation generally for its complexity,
layout, and theming.
- I like pdoc API documentation very much for its simplicity and clean UI.
- It's easy to forget LLMs don't know simple things, like the current date.
- LlamaIndex's on-disk persisted index refreshing seems broken, producing
duplicate embeddings.

*** Future Work
- [X] +add result evaluation using a secondary LLM (chatgpt-4-turbo-preview)+ using
- [X] experiment with different retrieval parameters
- [X] persist my index to a proper vector database
- [ ] experiment with better retrieval techniques / architectures (e.g. Crew AI)
- [ ] convert this to a full-fledged agent with access to tools
- [ ] use ReAct or LLMCompiler to leverage LLMs planning abilities
- [ ] tools should include Google, Wikipedia, and Wolfram Alpha
- [ ] a basic tool to get the current date and possibly holiday calendars
- [ ] improve result formatting consistency
- [ ] improve discovery
- [ ] improve performance (latency)
- [ ] periodically update my index `org-rag --refresh` (upsert)
- [ ] experiment with knowledge graph
- [ ] wire this up to an Emacs command (JSON API?)
- [ ] evaluate [[https://blog.streamlit.io/build-a-chatbot-with-custom-data-sources-powered-by-llamaindex/][different UIs]]