{"id":44190440,"url":"https://github.com/christianromney/org-rag","last_synced_at":"2026-02-09T17:04:25.351Z","repository":{"id":223415397,"uuid":"759905681","full_name":"christianromney/org-rag","owner":"christianromney","description":"Experimenting with RAG over my org-mode notes","archived":false,"fork":false,"pushed_at":"2024-05-03T22:03:23.000Z","size":32,"stargazers_count":0,"open_issues_count":5,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-05-03T23:22:26.327Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/christianromney.png","metadata":{"files":{"readme":"README.org","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-02-19T15:03:46.000Z","updated_at":"2024-05-03T23:22:37.672Z","dependencies_parsed_at":"2024-05-03T23:22:30.070Z","dependency_job_id":"9a98cd3e-2d9a-48d7-a5e6-5006d094fc60","html_url":"https://github.com/christianromney/org-rag","commit_stats":null,"previous_names":["christianromney/org-rag"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/christianromney/org-rag","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christianromney%2Forg-rag","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christianromney%2Forg-rag/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christianromney%2Forg-rag/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christianromney%2Forg-rag/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/christianromney","download_url":"https://codeload.github.com/christianromney/org-rag/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/christianromney%2Forg-rag/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29273141,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-09T13:47:44.167Z","status":"ssl_error","status_checked_at":"2026-02-09T13:47:43.721Z","response_time":56,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-02-09T17:04:25.295Z","updated_at":"2026-02-09T17:04:25.346Z","avatar_url":"https://github.com/christianromney.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"#+TITLE: Building a RAG Application with LangChain\n* Basic RAG with LangChain and Ollama\n** Environment Setup\nI use my standard setup, with a dedicated conda environment for this project,\nspecified with the direnv layout feature. Required libraries can be installed\nwith:\n\n#+begin_src shell\npip3 install -r requirements.txt\n#+end_src\n\nNote that my .envrc is encrypted with [[https://www.agwa.name/projects/git-crypt/][git-crypt]].\n\n** Using a Local LLM with Ollama\nOllama allows us to run open-source LLMs on local machines. This is useful for\nenhanced privacy since one's private data is never shared with LLM providers.\n\n** Retrieval Augmented Generation\nRAG is the process of augmenting an LLM prompt with relevant context drawn from\none or more documents. Typically, documents are broken down into chunks (which\nmay optionally overlap) of a certain size. These chunks are then split into\ntokens and converted into vectors of real numbers in a process called embedding.\nThese vectors may be stored in a vector database or index. Later, a query may\nalso be converted to a vector embedding which can be used to perform a\nsimilarity search against the index. The top matches may be retrieved and added\nto the context of an LLM prompt, along with the prompt for the model.\n\n*** OrgModeDocumentStore class\nThis Python class organizes and wraps LangChain classes to provide a simplified\nRAG interface over org-mode documents.\n\n#+begin_src python :tangle orgstore.py\nfrom langchain_community.document_loaders import DirectoryLoader, UnstructuredOrgModeLoader\nfrom langchain.text_splitter import SentenceTransformersTokenTextSplitter\nfrom langchain_community.embeddings import OllamaEmbeddings\nfrom langchain_community.vectorstores import Chroma\nfrom langchain.retrievers import ParentDocumentRetriever\nimport os\n\nclass OrgModeDocumentStore:\n  def __init__(self, collection, directory, model=\"mixtral:latest\",\n               search_type=\"mmr\", mmr_diversity=0.75,\n               num_search_results=5, show_progress=False,\n               silent_errors=False):\n    self.collection = collection\n    self.directory = directory\n    if not os.path.exists(directory):\n      raise RuntimeError(f\"Directory {directory} does not exist.\")\n\n    self.index_directory = os.path.join(directory, \".chroma\")\n    if not os.path.exists(self.index_directory):\n      os.mkdir(self.index_directory)\n\n    self.loader = DirectoryLoader(directory, glob=\"**/*.org\", use_multithreading=True,\n                                  silent_errors=silent_errors,\n                                  loader_cls=UnstructuredOrgModeLoader,\n                                  loader_kwargs={\"mode\": \"single\"})\n\n    self.search_type = search_type\n    self.k = num_search_results\n    self.diversity = mmr_diversity\n\n    self.model = model\n    self.embeddings = OllamaEmbeddings(model=model, show_progress=show_progress)\n    self.db = Chroma(collection_name=collection,\n                     embedding_function=self.embeddings,\n                     persist_directory=self.index_directory)\n\n  def __repr__(self):\n    return f\"\"\"\n    OrgModeDocumentStore(\n      collection={self.collection!r},\n      directory={self.directory!r},\n      index_directory={self.index_directory!r},\n      loader={self.loader!r},\n      embeddings={self.embeddings!r},\n      model={self.model!r},\n      db={self.db!r},\n      search_type={self.search_type!r},\n      k={self.k!r},\n      diversity={self.diversity!r}\n    )\"\"\"\n\n  # indexing management\n  def load(self):\n    \"Loads all org-mode documents found under the given directory recursively.\"\n    self.documents = self.loader.load()\n\n  def add_documents(self, docs):\n    \"Adds the given docs to the Chroma vectorstore and returns the document ids.\"\n    return self.db.add_documents(docs)\n\n  def update_document(self, id, doc):\n    \"Updates the single document identified by the id.\"\n    return self.db.update_document(id, doc)\n\n  def update_documents(self, ids, docs):\n    \"Updates the documents identified by the given ids.\"\n    return self.db.update_documents(ids, docs)\n\n  def create_index(self):\n    \"Creates the index from the loaded documents. This should only be run once.\"\n    self.load()\n    if len(self.documents) \u003e 0:\n      print(f\"Indexing {len(self.documents)} documents.\")\n      return self.add_documents(self.documents)\n\n  # query\n  def print_documents(self):\n    \"Print the list of all documents.\"\n    self.load()\n    for d in self.documents:\n      print(d.metadata['source'])\n\n  def similarity_search(self, query):\n    \"Search the vectorstore for docs relevant to the query.\"\n    return self.db.similarity_search(query, self.k)\n\n  def mmr_search(self, query):\n    \"Executes max marginal relevance search for the query.\"\n    return self.db.max_marginal_relevance_search(query, k=self.k, lambda_mult=self.diversity)\n\n  def as_retriever(self):\n    \"Returns a retriever for this vectorstore.\"\n    return self.db.as_retriever()\n#+end_src\n\n*** Loading and Indexing (Chunked) Documents\nThe [[https://python.langchain.com/docs/modules/data_connection/document_loaders/][Document Loader]] abstraction presents a unified interface for loading various\nfile types, including plain text, Markdown, JSON, and more. The constructor\nidentifies the documents to load, and the load() method does the actual work.\n\n**** Splitting Documents into Chunks\n[[https://python.langchain.com/docs/modules/data_connection/document_transformers/][Text Splitters]] break long documents into smaller chunks so we can pass them into\nan LLM context window.\n***** Types of Splitters\n- recursive :: splits on user-defined chars, keeps related chunks next to each\n  other.\n- token :: splits text on tokens\n- character :: splits on user-defined chars\n- semantic chunker :: splits on sentences, then combines adjacent ones if they\n  are semantically similar enough\n\n#+begin_src python :tangle index.py\nfrom orgstore import OrgModeDocumentStore\ncollection = \"org-rag\"\ndirectory = \"/Users/christian/Documents/personal/notes/content/roam/\"\nstore = OrgModeDocumentStore(collection=collection, directory=directory, show_progress=True)\ndocument_ids = store.create_index()\nprint(f\"create_index: {document_ids}\")\n\n# data = zip(document_ids, store.documents)\n# for id, doc in data:\n#   print(f\"{id}: {doc.metadata['source']}\")\n#+end_src\n\n*** Retrieval\nUse the vector store to find relevant documents.\n#+begin_src python :tangle retrieval.py\nfrom orgstore import OrgModeDocumentStore\ncollection = \"org-rag\"\ndirectory = \"/Users/christian/Documents/personal/notes/content/roam/\"\nstore = OrgModeDocumentStore(collection=collection, directory=directory, silent_errors=True)\n\ni, query = 1, \"\"\nprint(\"Enter search query at the prompt or type '?list' for docs, or '?quit' to exit.\\n\")\nwhile not query.lower() == \"?quit\":\n  query = input(f\"{i}\u003e \")\n  if query == \"?quit\":\n    print(\"Goodbye.\")\n  elif query == \"?list\":\n    i += 1\n    store.print_documents()\n  else:\n    i += 1\n    #results = store.as_retriever().get_relevant_documents(query)\n    #results = store.mmr_search(query)\n    results = store.similarity_search(query)\n    for doc in results:\n      print(f\"file: {doc.metadata['source']}, length: {len(doc.page_content)}\")\n      display = input(\"Display page content? (y|n)\u003e \")\n      if display.lower() == \"y\":\n        print(f\"content: {doc.page_content}\\n\" )\n        print(\"-\" * 80)\n#+end_src\n\nI'm not thrilled with these results. The chunks are very small and anecdotally\nnot the most relevant. I'd like to feed more context to an LLM.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchristianromney%2Forg-rag","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fchristianromney%2Forg-rag","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fchristianromney%2Forg-rag/lists"}