{"id":15063588,"url":"https://github.com/dwyl/rag-elixir-doc","last_synced_at":"2025-10-11T08:17:40.551Z","repository":{"id":253136666,"uuid":"842569965","full_name":"dwyl/rag-elixir-doc","owner":"dwyl","description":"Livebook to run a Phoenix_LiveView documentation Retrieval-Augmented Generation (RAG) enhanced LLM","archived":false,"fork":false,"pushed_at":"2024-09-11T15:37:59.000Z","size":266,"stargazers_count":12,"open_issues_count":0,"forks_count":2,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-03-30T10:35:07.559Z","etag":null,"topics":["cross-encoder","elixir","embeddings","livebook","llm-inference","rag","retrieval-augmented-generation","sbert"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dwyl.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-08-14T16:06:11.000Z","updated_at":"2025-03-26T03:43:14.000Z","dependencies_parsed_at":"2024-08-14T18:13:10.840Z","dependency_job_id":"fbb6dc01-b478-4a30-a17e-14b074cebdcf","html_url":"https://github.com/dwyl/rag-elixir-doc","commit_stats":null,"previous_names":["dwyl/rag-elixir-doc"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dwyl%2Frag-elixir-doc","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dwyl%2Frag-elixir-doc/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dwyl%2Frag-elixir-doc/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposit
ories/dwyl%2Frag-elixir-doc/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dwyl","download_url":"https://codeload.github.com/dwyl/rag-elixir-doc/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248208613,"owners_count":21065203,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cross-encoder","elixir","embeddings","livebook","llm-inference","rag","retrieval-augmented-generation","sbert"],"created_at":"2024-09-25T00:04:40.266Z","updated_at":"2025-10-11T08:17:35.495Z","avatar_url":"https://github.com/dwyl.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# rag-elixir-doc\n\n\u003ch1 align=\"center\"\u003eBuilding LLM enhanced search with the help of LLMs....\u003c/h1\u003e\n\nWe want to improve the search for the Elixir/Phoenix/Plug/LiveView documentation when using an LLM and experiment with a `RAG` pipeline.\n\nAll the tools used here are \"free\", meaning everything is running locally.\n\n[![Run in Livebook](https://livebook.dev/badge/v1/blue.svg)](https://livebook.dev/run?url=https%3A%2F%2Fgithub.com%2Fdwyl%2Frag-elixir-doc%2Fblob%2Fmain%2Frag-elixir.livemd)\n\n## What is `RAG`?\n\nIt is a \"chat with your documents\" process, meaning you ask an LLM to respond based on additional resources.\n\nThese sources may or may not already be part of the data used to train the LLM.\n\nUsing RAG is _not about fine tuning_ the model, which is changing the coefficients or structure of the model based on additional sources.\n\nRAG is about giving 
an additional context - the \"context window\" - to enhance or constrain the response from the LLM.\n\n\u003e Note that the LLM accepts a limited number of tokens, so the context window is limited.\n\n\n## Scope of this POC:\n\nWe want to improve the LLM's response when we ask questions related to the Elixir/Phoenix/Plug/LiveView documentation. We will build a \"context\" that adds information to help the LLM build a response.\n\nRunning such a helper locally means that we need to have the extra resources available locally. Our database will be local and our LLM will run locally, using only local resources.\n\nWe will extract some markdown files from the Phoenix_LiveView GitHub repo.\n\nWe will use a database to store chunks extracted from these files.\n\n\n- One way is **SQL Full-Text-Search**. If we use `Postgres`, we have [built-in functionality](https://www.postgresql.org/docs/current/textsearch-intro.html#TEXTSEARCH-DOCUMENT). This works by matching **keywords**, so given that there are various ways to express the same question, we may want a more semantic search. See [a good introduction to SQL FTS](https://leandronsp.com/a-powerful-full-text-search-in-postgresql-in-less-than-20-lines) or [this guide](https://peterullrich.com/complete-guide-to-full-text-search-with-postgres-and-ecto).\n\n- We can also use `OpenSearch`, the open-source fork of Elasticsearch by Amazon, via a Docker image. We may also use the [local image of Elasticsearch](https://www.elastic.co/guide/en/elasticsearch/reference/current/run-elasticsearch-locally.html). There are [text chunking tools](https://opensearch.org/docs/latest/search-plugins/text-chunking/).  The matching is based on the **BM25** algorithm. Using this instead of the **bi-encoding** step might be an interesting extension, as a second version. Indeed, Elixir has an OpenSearch client: [snap](https://github.com/breakroom/snap)\n\n- We can build a RAG pipeline with an LLM. We follow this route here. 
We save chunks with their embeddings into a database, run a similarity search against our query, then rerank our top-k responses, and finally use the reranked results to build our prompt and submit it as context to the LLM. This will help to constrain responses to questions about Elixir/Plug/Phoenix_Liveview. \n\n\n## Using LLMs to find which models to use\n\nWe are using pre-trained NR models. We will also use an LLM to elaborate a response. \n\nThe first question is: what does a RAG pipeline bring?\n\nWe believe the next two paragraphs give some confidence that this technique works, or at least improves some responses.\n\n### Testing Codellama\n\n```text\nQuestion: \"how to handle forms server-side?\"\n```\n\n\u003cdetails\u003e\u003csummary\u003eThe \"raw\" response\u003c/summary\u003e\n\nHandling forms on the server-side involves using the `Phoenix.HTML` module's `form_for` function to create a form struct with the necessary information for submitting the form data to the server.\nHere's an example:\n\n```elixir\nimport Phoenix.HTML\n\n# Create a form struct with the necessary information\nform = form_for(conn, :action, :controller)\n\n# Add fields to the form using the `add_field` function\nform = form |\u003e add_field(:name, :text) |\u003e add_field(:age, :number)\n\n# Use the `handle_event` function to handle form submissions\nhandle_event(form, :submit, conn) do\n  # Do something with the submitted data here, e.g. update a database\nend\n```\n\nThe `form_for` function takes three arguments:\n\n\t* `conn`: The current connection struct\n\t* `action`: The action to perform when submitting the form(e.g. `:create`, `:update`, etc.)\n\t* `controller`: The controller that handles the form submission\n\nOnce you have created the form struct, you can add fields to it using the `add_field` function.\nThis function takes two arguments:\n\n\t* `name`: The name of the field (e.g. `:username`,`:password`, etc.)\n\t* `type`: The type of field (e.g. 
`:text`, `:number`, etc.)\n\nYou can then use the `handle_event` function to handle form submissions. This function takes three arguments:\n\n\t* `form`: The form struct that was created using the `form_for` function\n\t* `event`: The event that triggered the submission (e.g. `:submit`, etc.)\n\t* `conn`: The current connection struct\n\nInside the `handle_event` function, you can perform any necessary actions when the form is submitted, such as updating a database or sending an email.\n\nTo handle form validation and error handling, you can use the `Phoenix.LiveView` module's `validate` and `error` functions.\n\nFor example:\n\n```elixir\nimport Phoenix.LiveView\n\n# Add validation rules to the form using the `validate` function\nform = form |\u003e validate(:name, :required) |\u003e validate(:age, :number)\n\n# Handle errors in the form submission using the `error` function\nhandle_event(form, :submit, conn) do\n  if error?(conn) do\n    # Do something with the validation errors here, e.g. display an error message\n  else\n    # Do something with the submitted data here, e.g. update a database\n  end\nend\n```\n\nIn this example, we use the `validate` function to add validation rules to the form. The `:required` and `:number` arguments specify that the fields should be required and contain only numbers, respectively.\nIf there are any errors in the submitted data, the `error?` function will return true and we can handle them using the `error` function.\n\u003c/details\u003e\n\u003cbr/\u003e\n\n\u003cdetails\u003e\u003csummary\u003eThe \"RAGed\" response\u003c/summary\u003e\n\nTo handle forms with LiveView, you can use the `Phoenix.Component` module and its functions to define a form component that will submit and validate the form data using LiveView's built-in validation mechanism. 
Here's an example of how to define a form component in your LiveView:\n\n```elixir\ndefmodule MyAppWeb.Components.Form do\n  use Phoenix.Component\n  def form(assigns) do\n    ~H\"\"\"\n      \u003c.form for={@form} phx-change=\"validate\" phx-submit=\"save\"\u003e\\\n        ...\n      \u003c/.form\u003e\n    \"\"\"\n    end\n  end\n...\n```\n\nIn the above example, we define a `Form` component that renders a form tag with a `phx-change` and `phx-submit` event to handle validation and submission of the form data. The `for` attribute is used to bind the form to a changeset or user parameters, which can be created from a form struct using `Phoenix.Component.to_form/1`.\n\n\nAdditionally, we strongly recommend including a unique HTML \"id\" attribute on the form. When DOM siblings change, elements without an ID will be replaced rather than moved, which can cause issues such as form fields losing focus.\n\nYou may wish for an individual input to use its own change event or to target a different component. This can be accomplished by annotating the input itself with `phx-change`, for example:\n\n```elixir\n\u003c.form for={@form} phx-change=\"validate\" phx-submit=\"save\"\u003e\n  ...\n  \u003c.input field={@form[:email]}  phx-change=\"email_changed\" phx-target={@myself} /\u003e\n\u003c/.form\u003e\n```\n\nThe `phx-change` event will trigger the `email_changed` function on the `@myself` component, which can be used to update the form data or perform other actions based on changes to the input field.\n\nYou may also wish for an individual input to use its own change event or to target a different component. 
This can be accomplished by annotating the input itself with `phx-change`, for example:\n\n```elixir\n\u003c.form for={@form} phx-change=\"validate\" phx-submit=\"save\"\u003e\n  ...\n  \u003c.input field={@form[:email]}  phx-change=\"email_changed\" phx-target={@myself} /\u003e\n\u003c/.form\u003e\n```\n\nThe `phx-change` event will trigger the `email_changed` function on the `@myself` component, which can be used to update the form data or perform other actions based on changes to the input field.\n\nIt's also important to note that you may want to use a unique HTML \"id\" attribute on the form. When DOM siblings change, elements without an ID will be replaced rather than moved, which can cause issues such as form fields losing focus.\n\nYou may also wish for an individual input to use its own change event or to target a different component. This can be accomplished by annotating the input itself with `phx-change`, for example:\n\n```elixir\n\u003c.form for={@form} phx-change=\"validate\" phx-submit=\"save\"\u003e\n  ...\n  \u003c.input field={@form[:email]}  phx-change=\"email_changed\" phx-target={@myself} /\u003e\n\u003c/.form\u003e\n```\n\nThe `phx-change` event will trigger the `email_changed` function on the `@myself` component, which can be used to update the form data or perform other actions based on changes to the input field.\n\u003c/details\u003e\n\u003cbr/\u003e\n\nThe first conclusion is that the RAG pipeline improves the response a lot.\nThis is most probably also because the current version of the LLM (July 2024) is not up-to-date with the state of the art at the time of writing.\n\nWe continue by asking the LLM to help us with the technical task below. We encounter the main limitation: hallucination.\n\n\u003cdetails\u003e\u003csummary\u003eWhich embedding models can we use? 
Some examples of how LLMs respond to this question\u003c/summary\u003e\n\nWe know that many questions submitted to an LLM will produce so-called \"hallucinated\" responses, in other words invented ones.\n\nLet's test this with Llama3.1, chosen because you can run it locally via `ollama run llama3.1`.\n\n```txt\nquestion: \"I want to use an Elixir portage of llama. Can you suggest a library?\"\n\nresponse: \"**Elixir-T5**: This is an Elixir port of a popular LLM architecture (T5). It's designed specifically for use with Elixir and might be worth considering. You may also consider **LLaMA-erl**: This is an Elixir wrapper for the LLaMA model, which is similar to Ollama.\"\n```\n\nNeither `Elixir-T5` nor `LLaMA-erl` exists!\n\nI tried CodeLlama: `ollama run codellama`:\n\n```txt\nquestion: which Elixir library would you suggest for text chunking to compute embeddings efficiently?\n```\n\n\u003cdetails\u003e\u003csummary\u003ecodellama response\u003c/summary\u003e\n\n```console\nresponse:\nThere are several Elixir libraries available for text chunking and computing embeddings, each with its own strengths and weaknesses. Here are a few popular options:\n\n1. `elixir_bert`: This is an Elixir port of the BERT (Bidirectional Encoder Representations from Transformers) model, which is a powerful language model that can be\nused for text chunking. It provides a simple and efficient interface for computing embeddings, and it supports both CPU and GPU acceleration.\n2. `elixir_bert_embedding`: This is an Elixir wrapper for the BERT embedding layer, which allows you to compute embeddings for input sequences using a pre-trained\nBERT model. It provides a simple interface for text chunking and supports both CPU and GPU acceleration.\n3. `elixir_word2vec`: This is an Elixir port of the Word2Vec model, which is another popular language model that can be used for text chunking. 
It provides a simple\nand efficient interface for computing embeddings, and it supports both CPU and GPU acceleration.\n4. `elixir_glove`: This is an Elixir port of the GLOVE (Global Vectors for Word Representation) model, which is a language model that can be used for text chunking.\nIt provides a simple and efficient interface for computing embeddings, and it supports both CPU and GPU acceleration.\n\nAll of these libraries should work well for your purposes, but you may want to consider which one best fits your specific needs. If you have any specific requirements\nor preferences (e.g., support for a particular file format or feature), you may want to choose the library that is most suitable for your use case.\n```\n\u003c/details\u003e\n\u003cbr/\u003e\n\nWe checked: none of `elixir_bert`, `elixir_bert_embedding`, `elixir_word2vec` or `elixir_glove` exists. \n\nHowever, after a bit of research, we found that the library `ExFinalFusion` does exist, a NIF wrapper for `FinalFusion`.\n[![Screenshot 2024-08-24 at 19 33 08](https://github.com/user-attachments/assets/11d7daeb-8f3f-49ea-a2af-38c2c654c42d)](https://hexdocs.pm/ex_final_fusion/ExFinalFusion.html#summary)\n\n### Let the LLMs choose the models\n\nThe question:\n\n```console\nI want to code a RAG pipeline in the Elixir langugage using its ML libraries (Nx, Axon, Bumble\nbee). I want to use a bi-encoder to compute embeddings from chunks of text, and then run a\ncross-encoding process. 
Can you recommend  models firstly for the bi-encoding and then for the\ncross-encoding step, having in mind that I need a model for tokenizing the input.\n```\n\nSome responses from different LLMs.\n\n\u003cdetails\u003e\u003csummary\u003ecodellama response\u003c/summary\u003e\n\n```console\nBi-encoding (Tokenization):\n\n* BERT (Bidirectional Encoder Representations from Transformers)\n\t+ This is a popular pre-trained language model that can be fine-tuned for various NLP tasks,\nincluding tokenization.\n\t+ BERT uses a multi-layer bidirectional transformer encoder to generate contextualized\nrepresentations of words in a sentence.\n* RoBERTa (Robustly Optimized BERT Pretraining Approach)\n\t+ This is a variant of BERT that was specifically designed for text classification tasks and\nhas been shown to achieve state-of-the-art results on several benchmark datasets.\n\t+ RoBERTa uses a combination of tokenization and language modeling techniques to generate\n\ncontextualized representations of words in a sentence.\n* DistilBERT (Distilled BERT)\n\t+ This is a smaller and more efficient variant of BERT that has been trained to match the\n\nperformance of the full BERT model while requiring fewer computational resources.\n\t+ DistilBERT uses a combination of tokenization and attention techniques to generate\ncontextualized representations of words in a sentence.\n\nCross-encoding:\n\n* MLP (Multi-layer perceptron)\n\t+ This is a simple feedforward neural network that can be used for classification tasks such\nas text classification, sentiment analysis, and topic modeling.\n\t+ An MLP consists of multiple layers of artificial neurons that learn to represent the input\ndata in a higher dimensional space.\n* CNN (Convolutional neural network)\n\t+ This is a deep learning architecture that is particularly well-suited for text\nclassification tasks where the input data has a spatial or temporal structure.\n\t+ A CNN consists of multiple convolutional layers that learn to 
extract features from the\n\ninput data, followed by pooling and fully connected layers to make predictions.\n\nIt's important to note that the specific models you choose will depend on your specific use\ncase and requirements. You may also want to consider using a combination of different models\nto achieve better results.\n```\n\u003c/details\u003e\n\u003cbr/\u003e\n\n\u003cdetails\u003e\u003csummary\u003eLlama3 response\u003c/summary\u003e\n\n```console\n**Bi-Encoding Step: Tokenization and Embedding**\n\nFor tokenization, you can use a simple word tokenizer like `Nx.Tokenizer` from the Elixir\n`axon` library. However, if you need more advanced tokenization capabilities (e.g., handling\nout-of-vocabulary words), consider using Bumblebee's `Bee.Tokenizer`. For embedding the\ntokens into vectors, I recommend using:\n\n1. **Universal Sentence Encoder (USE)**: This is a pre-trained language model that can be\nused for token embedding. You can use the Elixir wrapper provided by Axon to load and use\nthe USE model.\n\n**Cross-Encoding Step**\n\nFor cross-encoding, you'll need to take the output of the bi-encoding step and generate a\nnew representation based on it. I recommend using:\n\n1. **Reformer**: This is a state-of-the-art transformer architecture that's well-suited for\nthe RAG pipeline. You can use Bumblebee to load and use the Reformer model.\n\nHere are some additional recommendations:\n\n* For tokenizing input text, you might want to consider using more advanced techniques like\nsubwording (e.g., WordPiece) or character-level encoding.\n* If you're dealing with long documents, you may need to split them into smaller chunks\nbefore applying bi-encoding and cross-encoding. Consider using a technique like document\nchunking or summarization.\n\nTo give you a better understanding of the pipeline flow, here's a high-level outline:\n\n1. **Tokenize input text**: Use `Nx.Tokenizer` or Bumblebee's `Bee.Tokenizer` to break down\nthe text into individual tokens.\n2. 
**Bi-encode tokens**: Apply USE to embed each token into a vector space (e.g.,\n128-dimensional).\n3. **Cross-encode output**: Use Reformer to generate a new representation based on the\nbi-encoded output.\n\nRemember to preprocess your data (e.g., normalize text, remove stop words) before feeding it\ninto these models.\n\nFor more information and implementation details, I recommend checking out the following\nresources:\n\n* [Bumblebee documentation](https://hexdocs.pm/bumblebee/)\n* [Axon documentation](https://hexdocs.pm/axon/)\n* [Elixir ML\nlibraries](https://elixir-lang.org/docs/master/libraries.html#machine-learning-libraries)\n* [RAG paper and implementation](https://arxiv.org/abs/2006.16262)\n```\n\u003c/details\u003e\n\u003cbr/\u003e\n\n\u003cdetails\u003e\u003csummary\u003eClaude3.5 Sonnet response\u003c/summary\u003e\n\n```console\nBi-encoding step:\n\nFor the bi-encoding step, where you'll encode both the chunks from the documentation and the query, a good choice would be a sentence transformer model. A popular option that works well for semantic search is:\n\nModel: \"sentence-transformers/all-MiniLM-L6-v2\"\n\nThis model provides a good balance between performance and speed. It's relatively small (80MB) but still gives good results for semantic similarity tasks.\n\nCross-encoding step:\n\nFor the cross-encoding step, where you'll want to get a more precise relevance score between the query and the retrieved documents, you could use:\n\nModel: \"cross-encoder/ms-marco-MiniLM-L-6-v2\"\n\nThis model is specifically trained for relevance ranking and should work well for your cross-encoding needs.\n\nTokenization:\n\nFor tokenization, you'll want to use a tokenizer that matches the models you're using. 
Since both recommended models are based on MiniLM, you can use the MiniLM tokenizer:\n\nTokenizer: \"microsoft/MiniLM-L6-H384-uncased\"\n```\n\u003c/details\u003e\n\u003cbr/\u003e\n\n\u003cdetails\u003e\u003csummary\u003eChatGPT response\u003c/summary\u003e\n\n```console\nTo create a Retrieval-Augmented Generation (RAG) pipeline in Elixir using the Nx, Axon, and Bumblebee libraries, you'll need to carefully choose models that are suitable for both the bi-encoding and cross-encoding processes. Let's break down the components:\n\n1. Bi-Encoder Step:\nPurpose: The bi-encoder step involves encoding chunks of text independently to generate embeddings that can be used for retrieval.\nRecommended Models:\nMiniLM (Microsoft): A small, efficient model designed for sentence embeddings. It's often used for bi-encoder tasks due to its balance between speed and accuracy.\nDistilBERT: A smaller and faster version of BERT, ideal for generating embeddings from text while maintaining a good balance of performance.\nSentence-BERT (SBERT): This is a modification of BERT that is specifically designed for creating dense vector representations of sentences, making it a solid choice for bi-encoding.\n2. Cross-Encoder Step:\nPurpose: The cross-encoder step involves comparing the query and retrieved passages together, typically used to re-rank the retrieved results.\nRecommended Models:\nBERT: The standard BERT model is highly effective for cross-encoding tasks because it considers the interaction between the query and the passage.\nRoBERTa: A variant of BERT with more robust training, which tends to perform better in NLP tasks.\nDeBERTa: Known for its improved handling of word order and dependency information, making it a strong candidate for cross-encoding.\n3. Tokenization:\nPurpose: Tokenizing the input is crucial for both the bi-encoder and cross-encoder steps. 
The tokenizer needs to be compatible with the chosen models.\nRecommended Tokenizers:\nBERT Tokenizer: Compatible with BERT-based models (BERT, RoBERTa, DistilBERT, etc.).\nSentencePiece Tokenizer: Often used with models like MiniLM and other transformer-based models.\nElixir Implementation:\nLibraries:\nNx: For numerical computation, handling tensors, and integrating with the underlying ML operations.\nAxon: For building neural networks, including the bi-encoder and cross-encoder models.\nBumblebee: For pre-trained models, including tokenizers and transformers, which will help with the embedding and cross-encoding tasks.\nExample Models in Bumblebee:\nFor bi-encoding: Use models like \"MiniLM-L6-v2\" or \"distilbert-base-uncased\" from Bumblebee.\nFor cross-encoding: Use models like \"bert-base-uncased\" or \"roberta-base\".\nSteps to Implement:\nLoad the Model: Use Bumblebee to load the bi-encoder and cross-encoder models.\nTokenize the Input: Tokenize the text using the appropriate tokenizer.\nCompute Embeddings: Generate embeddings using the bi-encoder model.\nRetrieve and Rank: Use the cross-encoder model to compare and rank the retrieved documents.\n```\n\u003c/details\u003e\n\n\n\u003c/details\u003e\n\n### First conclusion for the LLM\n\nIt seems that Claude3.5 Sonnet and ChatGPT give much \"better\" responses than Llama. However, we need an API key to reach their REST API, whilst we can run a local version of the Llama LLM. \n\nWe will therefore choose Llama, running it at no cost, knowing that results can only get better by choosing a \"paid\" version from OpenAI or, better, Anthropic.\n\n### Conclusion for the choice of our models\n\nWe will use SBERT-based models:\n- the bi-encoder \"sentence-transformers/all-MiniLM-L6-v2\", which also provides a tokenizer,\n- the cross-encoder \"cross-encoder/ms-marco-MiniLM-L-6-v2\" along with the tokenizer \"bert-base-uncased\"\n\n\nWe check that the models are correctly implemented in Elixir by running the same code in Python. 
This is done in [#8](https://github.com/dwyl/rag-elixir-doc/discussions/8) and [#9](https://github.com/dwyl/rag-elixir-doc/discussions/9).\n\n## Source of knowledge\n\nWe first seed the vector database with some GitHub markdown pages from the Phoenix LiveView documentation.\n\nThe sources will be extracted from the files that the GitHub API returns when querying some directories:\n  - \u003chttps://api.github.com/repos/phoenixframework/phoenix_live_view/contents/guides/server/\u003e\n  - \u003chttps://api.github.com/repos/phoenixframework/phoenix_live_view/contents/guides/client/\u003e\n  - \u003chttps://github.com/phoenixframework/phoenix_live_view/blob/main/guides/introduction/welcome.md\u003e\n\n  - we can also add some \".ex\" modules when they provide documentation in a moduledoc.\n\n\n## Overview of the RAG process:\n  * Installed tools: the database `Postgres` with the `pgvector` extension, and the platform `ollama` to run the LLM locally.\n    \n  * Build the external sources.\n    - download \"external sources\" as a string\n    - chunk the sources\n    - produce an embedding based on a \"sentence-transformer\" model for each chunk\n    - insert chunk + embedding into a vector database using an HNSW index\n      \n  * Build a RAG pipeline\n    - produce an embedding (a vector representation) from the question\n    - perform a first vector similarity search (HNSW) against the database\n    - rerank the top-k with \"cross-encoding\"\n    - build a prompt by injecting the latter result as context, along with the query\n    - submit the prompt to the LLM for completion\n\n### Pseudo-code pipeline\n\nThe pipeline will use three SBERT-based models: \"sentence-transformers/all-MiniLM-L6-v2\" for the embedding, \"cross-encoder/ms-marco-MiniLM-L-6-v2\" for the reranking, and \"bert-base-uncased\" for tokenizing.\n\nIn pseudo-code, we have:\n  \n```elixir\n# Data collection and chunking\ndefmodule DataCollector do\n  def fetch_and_chunk_docs do\n    ...\n  end\nend\n\n# Embedding 
generation: \"sentence-transformers/all-MiniLM-L6-v2\"\ndefmodule Embedder do\n  def generate_embeddings(text) do\n    ...\n  end\nend\n\n# Semantic search\ndefmodule SemanticSearch do\n  def search(query, top_k) do\n    ...\n  end\nend\n\n# Cross-encoder reranking: \"cross-encoder/ms-marco-MiniLM-L-6-v2\"\ndefmodule CrossEncoder do\n  # documents come first so that the search results can be piped in\n  def rerank(documents, query) do\n    ...\n  end\nend\n\n# Prompt construction\ndefmodule PromptBuilder do\n  # context (the reranked documents) comes first so that it can be piped in\n  def build_prompt(context, query) do\n    ...\n  end\nend\n\n# LLM integration\ndefmodule LLM do\n  def generate_response(prompt) do\n    ...\n  end\nend\n\n# Main RAG pipeline\ndefmodule RAG do\n  def process_query(query) do\n    query\n    |\u003e SemanticSearch.search(10)\n    |\u003e CrossEncoder.rerank(query)\n    |\u003e PromptBuilder.build_prompt(query)\n    |\u003e LLM.generate_response()\n  end\nend\n```\n\n## What are **bi-encoding** and **cross-encoding**?\n\n- Bi-encoders: Encode the query and document separately, then compare their vector representations. This is the \"standard\" similarity search.\n\n  Bi-encoding does consider the relationship between the query and each document, but it does so independently for each document. The main problem is that bi-encoding might not capture nuanced differences between documents or complex query-document relationships. `HNSW` indexes (or, as a keyword-based alternative, `BM25`) can be used for this first retrieval stage.\n  \n- Cross-encoders: Take both the query and document as input simultaneously, allowing for more complex interactions between them. A cross-encoder processes the query and document together through a neural network (typically a transformer model like BERT) to produce a single relevance score. 
This allows the model to capture complex interactions between the query and document at all levels of representation.\n\n  Cross-encoders typically perform better than bi-encoders in terms of accuracy, but are computationally more expensive and slower at inference time.\n  They are not suitable for large-scale retrieval because they require comparing the query with every document from scratch, which doesn't scale well.\n  Therefore, cross-encoding is often used as the second step of a two-stage retrieval process.\n\n- How do cross-encoders work in reranking?\n  - After initial retrieval (e.g., using vector similarity), you pass each query-document pair through the cross-encoder.\n  - The cross-encoder outputs a relevance score for each pair.\n  - Results are then sorted based on these scores, potentially significantly changing the order from the initial retrieval.\n    \n  \n## How to **chunk**? \n\nWe need to define how to ingest these documents to produce _embeddings_ saved into a _vector database_. \n\nDo we run a naive chunking, or [use this package](https://github.com/revelrylabs/text_chunker_ex), or [structured chunks](https://docs.llamaindex.ai/en/stable/examples/retrievers/auto_vs_recursive_retriever/), or [Chunk + Document Hybrid Retrieval](https://docs.llamaindex.ai/en/stable/examples/retrievers/multi_doc_together_hybrid/), or use [BM25](https://docs.llamaindex.ai/en/stable/examples/retrievers/bm25_retriever/), which has an Elixir implementation, [BM25](https://github.com/elliotekj/bm25)? \n\n## Which embedding? 
\n\n- [SBERT](https://www.sbert.net/): see the leaderboard at \u003chttps://huggingface.co/spaces/mteb/leaderboard\u003e.\n\n\n## Vector database or index?\n\n- An index: [HNSW](https://github.com/elixir-nx/hnswlib), the Elixir port of `hnswlib`, an approximate nearest-neighbour (KNN) search,\n- or a vector database?\n\t- Postgres with [pgvector](https://github.com/pgvector/pgvector) and its Elixir client: [pgvector-elixir](https://github.com/pgvector/pgvector-elixir),\n\t- SQLite with [sqlite-vec](https://github.com/asg017/sqlite-vec). The extension has to be installed manually from the repo and loaded (with `exqlite`),\n\t- or [Supabase](https://github.com/supabase/supabase), with an [Elixir client](https://github.com/zoedsoupe/supabase-ex)\n\t- or [ChromaDB](https://github.com/3zcurdia/chroma), with an [Elixir client](https://github.com/3zcurdia/chroma)\n\nWe will use Postgres with the extension `pgvector` and the `HNSW` algorithm. See the discussion on the Postgres + `pgvector` setup.\n  \n## How to **prompt**? \n\nThis is where we define the scope of the response we want from the LLM, given the context retrieved by the database nearest-neighbour search. 
\n\nThe LLM should be able to generate an \"accurate\" response constrained by this context.\n\n## A word on **LLMs**\n\nA Dockyard post on this: \u003chttps://dockyard.com/blog/2023/05/16/open-source-elixir-alternatives-to-chatgpt\u003e.\n\nA comparison of different LLMs (source: Anthropic)\n\u003cimg width=\"592\" alt=\"Screenshot 2024-08-14 at 17 56 40\" src=\"https://github.com/user-attachments/assets/af4ef9ea-88f8-42bf-b963-013ea35d429f\"\u003e\n\n### Pricing\n\n[![Screenshot 2024-08-28 at 21 19 41](https://github.com/user-attachments/assets/a66e8689-a7c2-46e6-a597-fb141426b9cf)](https://openai.com/api/pricing/)\n\n[![Screenshot 2024-08-28 at 21 21 51](https://github.com/user-attachments/assets/081a4b1c-579d-4801-b441-d7f14be9c76a)](https://www.anthropic.com/pricing#anthropic-api)\n\n## Going further?\n\n- accept new documents \"on the fly\" (download a given link), and maybe run the database ingestion in a background job.\n\n\n- use `OpenSearch` instead of the bi-encoding:\n\t- install locally:\n\t\t- \u003chttps://opensearch.org/docs/latest/install-and-configure/install-opensearch/docker/\u003e\n\t\t- \u003chttps://waytohksharma.medium.com/opensearch-local-running-instance-with-docker-with-m1-1e60d90a263c\u003e\n\t- ingest data: \u003chttps://opensearch.org/docs/latest/getting-started/ingest-data/\u003e and \u003chttps://opensearch.org/docs/latest/api-reference/document-apis/index-document/\u003e\n\n- cluster the data?\n\n## Sources of inspiration 
\n\nWhich repos and blog posts?\n  - \u003chttps://dockyard.com/blog/2024/05/16/retrieval-augmented-generation-what-it-is-how-to-use-it\u003e\n  - using the cross-encoder: \u003chttps://github.com/elixir-nx/bumblebee/issues/251\u003e\n  - Bumblebee, RAG: \u003chttps://hexdocs.pm/bumblebee/llms_rag.html#introduction\u003e\n  - Supabase: \u003chttps://github.com/supabase-community/chatgpt-your-files\u003e\n  - Langchain: \u003chttps://github.com/brainlid/langchain_demo\u003e\n  - \u003chttps://github.com/nileshtrivedi/autogen\u003e\n  - \u003chttps://dockyard.com/blog/2023/05/16/open-source-elixir-alternatives-to-chatgpt\u003e\n  - A Fly.io post on using `llama.cpp` with `Rustler`: \u003chttps://fly.io/phoenix-files/using-llama-cpp-with-elixir-and-rustler/\u003e\n  - ExLLama: LlammaCpp.rs NIF wrapper for Elixir/Erlang: \u003chttps://hexdocs.pm/ex_llama/readme.html\u003e\n  - ollama-ex to run an LLM locally: \u003chttps://hexdocs.pm/ollama/Ollama.html\u003e\n    \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdwyl%2Frag-elixir-doc","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdwyl%2Frag-elixir-doc","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdwyl%2Frag-elixir-doc/lists"}