# Ask Poddy

![A screenshot of the Ask Poddy web app showing a chat between the user and the AI](./assets/20240610_screenshot_ask_poddy_what_is_a_network_volume.png)

**Ask Poddy** _(named after ["Poddy"](./public/poddy.png), the [RunPod](https://runpod.io) bot on
[Discord](https://discord.gg/cUpRmau42V))_ is a user-friendly RAG (Retrieval-Augmented Generation)
web application designed to showcase the ease of setting up OpenAI-compatible APIs using open-source
models running serverless on [RunPod](https://runpod.io). Built with [Next.js](https://nextjs.org/),
[React](https://reactjs.org/), [Tailwind](https://tailwindcss.com/),
[Vercel AI SDK](https://sdk.vercel.ai/docs/introduction), and
[LangChain](https://js.langchain.com/), it uses
[Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) as the LLM and
[multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) for
text embeddings.

This tutorial guides you through deploying **Ask Poddy** in your own environment so that it can
effectively answer questions about [RunPod](https://runpod.io), leveraging the open-source workers
[worker-vllm](https://github.com/runpod-workers/worker-vllm) and
[worker-infinity-embedding](https://github.com/runpod-workers/worker-infinity-embedding).

---


- [Concept](#concept)
- [Tutorial: Setting Up "Ask Poddy" in Your Environment](#tutorial-setting-up-ask-poddy-in-your-environment)
  - [Prerequisites](#prerequisites)
  - [1. Clone the Repository](#1-clone-the-repository)
  - [2. Install Dependencies](#2-install-dependencies)
  - [3. Set Up RunPod Serverless Endpoints](#3-set-up-runpod-serverless-endpoints)
    - [3.1 Network Volumes](#31-network-volumes)
    - [3.2 Worker-vLLM Endpoint](#32-worker-vllm-endpoint)
    - [3.3 Worker-Infinity-Embedding Endpoint](#33-worker-infinity-embedding-endpoint)
  - [4. Configure Environment Variables](#4-configure-environment-variables)
  - [5. Populate the Vector Store](#5-populate-the-vector-store)
  - [6. Start the Local Web Server](#6-start-the-local-web-server)
  - [7. Ask Poddy](#7-ask-poddy)


---

## Concept

**Ask Poddy** is designed to demonstrate the integration of serverless OpenAI-compatible APIs with
open-source models. The application runs locally (though it could also be deployed to the cloud),
while the computational heavy lifting is handled by serverless endpoints on
[RunPod](https://runpod.io). This architecture allows seamless use of existing OpenAI-compatible
tools and frameworks without needing to develop custom APIs.

Here's how RAG works in **Ask Poddy**:

![Diagram showing how the RAG process works](./assets/20240613_diagram_rag.png)

1. **User**: Asks a question.
2. **Vector Store**: The question is sent to LangChain, which uses the
[worker-infinity-embedding](https://github.com/runpod-workers/worker-infinity-embedding) endpoint
to convert the question into an embedding using the
[multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)
model.
3. **Vector Store**: Performs a similarity search to find relevant documents based on the question.
4. **AI SDK**: The retrieved documents and the user's question are sent to the
[worker-vllm](https://github.com/runpod-workers/worker-vllm) endpoint.
5. **worker-vllm**: Generates an answer using the
[Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model.
6. **User**: Receives the answer.

> [!TIP]
> You can use [any of the models supported](https://docs.vllm.ai/en/latest/models/supported_models.html) by [vLLM](https://github.com/vllm-project/vllm).

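Because both workers expose OpenAI-compatible APIs, any OpenAI client can talk to them directly. As
a minimal sketch (not part of this repository), this is how the vLLM endpoint could be called with
the official `openai` package, assuming worker-vllm's OpenAI-compatible route at
`https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1` and the environment variables configured in
step 4 of the tutorial:

```typescript
// Minimal sketch, not part of this repo: a chat completion against the RunPod
// vLLM endpoint via its (assumed) OpenAI-compatible route.
import OpenAI from "openai";

const openai = new OpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_VLLM}/openai/v1`,
});

const completion = await openai.chat.completions.create({
  model: "meta-llama/Meta-Llama-3-8B-Instruct",
  messages: [{ role: "user", content: "What is a network volume?" }],
});

console.log(completion.choices[0].message.content);
```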

---

## Tutorial: Setting Up "Ask Poddy" in Your Environment

### Prerequisites

- [git](https://git-scm.com/book/en/v2/Getting-Started-Installing-Git) installed
- [Node.js and npm](https://nodejs.org/en) installed
- [RunPod](https://www.runpod.io/) account

### 1. Clone the Repository

1. Clone the **Ask Poddy** repository and go into the cloned directory:

```bash
git clone https://github.com/blib-la/ask-poddy.git
cd ask-poddy
```

2. Clone the [RunPod docs](https://github.com/runpod/docs) repository into
   `ask-poddy/data/runpod-docs`:

```bash
git clone https://github.com/runpod/docs.git ./data/runpod-docs
```

> [!NOTE]
> The [RunPod docs](https://github.com/runpod/docs) repository contains the [RunPod documentation](https://docs.runpod.io) that **Ask Poddy** will use to answer
> questions.

3. Copy the `img` folder from `./data/runpod-docs/static/img` to `./public`

> [!NOTE]
> This makes it possible for **Ask Poddy** to include images from the [RunPod documentation](https://docs.runpod.io).


### 2. Install Dependencies

Navigate to the `ask-poddy` directory and install the dependencies:

```bash
npm install
```


### 3. Set Up RunPod Serverless Endpoints

#### 3.1 Network Volumes

1. Create two network volumes with 15 GB of storage each, in the same data center that you will use
   for the serverless endpoints:
   - Volume for embeddings: `infinity_embeddings`
   - Volume for LLM: `vllm_llama3`

> [!NOTE]
> Using network volumes ensures that the models and embeddings are stored persistently, allowing for
> faster subsequent requests as the data does not need to be downloaded or recreated each time.

#### 3.2 Worker-vLLM Endpoint

1. [Follow the guide for setting up the vLLM endpoint](https://docs.runpod.io/serverless/workers/vllm/get-started),
   but use the `meta-llama/Meta-Llama-3-8B-Instruct` model instead of the one mentioned in the
   guide, and make sure to select the network volume `vllm_llama3` when creating the endpoint.

> [!TIP]
> This endpoint uses [worker-vllm](https://github.com/runpod-workers/worker-vllm) under the hood.

#### 3.3 Worker-Infinity-Embedding Endpoint

1. [Create a new template](https://docs.runpod.io/pods/templates/manage-templates#creating-a-template)
2. Use the Docker image `runpod/worker-infinity-embedding:stable-cuda12.1.0` from
[worker-infinity-embedding](https://github.com/runpod-workers/worker-infinity-embedding) and set
the environment variable `MODEL_NAMES` to `intfloat/multilingual-e5-large-instruct`.
3. [Create a serverless endpoint](https://docs.runpod.io/serverless/workers/get-started#deploy-a-serverless-endpoint)
and make sure to select the network volume `infinity_embeddings`.

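Once the endpoint is deployed, you can optionally smoke-test it from Node.js. The sketch below
assumes the worker exposes the same OpenAI-compatible route format
(`https://api.runpod.ai/v2/<ENDPOINT_ID>/openai/v1`) and uses the API key and endpoint ID you will
configure in the next step:

```typescript
// Hypothetical smoke test, not part of this repo: request one embedding from the
// worker-infinity-embedding endpoint via its (assumed) OpenAI-compatible route.
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_EMBEDDING}/openai/v1`,
});

const response = await client.embeddings.create({
  model: "intfloat/multilingual-e5-large-instruct",
  input: "What is a network volume?",
});

console.log(response.data[0].embedding.length); // dimensionality of the returned vector
```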

### 4. Configure Environment Variables

1. [Generate your RunPod API key](https://docs.runpod.io/get-started/api-keys)
2. Find the endpoint IDs underneath the
[deployed serverless endpoints](https://www.runpod.io/console/serverless).

*(Screenshot: the endpoint ID is displayed underneath the endpoint's title.)*

3. Create your `.env.local` based on [.env.local.example](./.env.local.example), or create a new
   file with the following variables:

```bash
RUNPOD_API_KEY=your_runpod_api_key
RUNPOD_ENDPOINT_ID_VLLM=your_vllm_endpoint_id
RUNPOD_ENDPOINT_ID_EMBEDDING=your_embedding_endpoint_id
```


### 5. Populate the Vector Store

To populate the vector store, run the following command:

```bash
npm run populate
```

> [!NOTE]
> The first run will take some time as the worker downloads the embeddings model
> ([multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct)).
> Subsequent requests will use the downloaded model stored in the network volume.

This command reads all markdown documents from the `ask-poddy/data/runpod-docs/` folder, creates
embeddings using the embedding endpoint running on RunPod, and stores these embeddings in the local
vector store:

![Diagram showing how the vector store gets populated with documents](./assets/20240613_diagram_populate_vector_store.png)

1. **Documents**: The markdown documents from the `ask-poddy/data/runpod-docs/` folder are read by
LangChain.
2. **Chunks**: LangChain converts the documents into smaller chunks, which are then sent to the
`worker-infinity-embedding` endpoint.
3. **worker-infinity-embedding**: Receives chunks, generates embeddings using the
`multilingual-e5-large-instruct` model, and sends them back.
4. **Vector Store**: LangChain saves these embeddings in the local vector store (`HNSWlib`).

> [!TIP]
> A vector store is a database that stores embeddings (vector representations of text) to
> enable efficient similarity search. This is crucial for the RAG process as it allows the system to
> quickly retrieve relevant documents based on the user's question.

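For orientation, the population step corresponds roughly to the following LangChain.js sketch. It is
not the repository's actual script: the loader setup, chunk sizes, save location, and the
OpenAI-compatible base URL are assumptions made for illustration.

```typescript
// Rough sketch of the populate flow described above (assumptions noted inline).
import { DirectoryLoader } from "langchain/document_loaders/fs/directory";
import { TextLoader } from "langchain/document_loaders/fs/text";
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { OpenAIEmbeddings } from "@langchain/openai";
import { HNSWLib } from "@langchain/community/vectorstores/hnswlib";

// 1. Documents: read the markdown files from the cloned RunPod docs.
const loader = new DirectoryLoader("./data/runpod-docs", {
  ".md": (path) => new TextLoader(path),
});
const docs = await loader.load();

// 2. Chunks: split the documents into smaller pieces (sizes are illustrative).
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1000, chunkOverlap: 100 });
const chunks = await splitter.splitDocuments(docs);

// 3. worker-infinity-embedding: embed the chunks via the (assumed) OpenAI-compatible route.
const embeddings = new OpenAIEmbeddings({
  apiKey: process.env.RUNPOD_API_KEY,
  model: "intfloat/multilingual-e5-large-instruct",
  configuration: {
    baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_EMBEDDING}/openai/v1`,
  },
});

// 4. Vector Store: persist the vectors in a local HNSWLib index (path is an assumption).
const store = await HNSWLib.fromDocuments(chunks, embeddings);
await store.save("./vectorstore");
```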

### 6. Start the Local Web Server

1. Start the local web server:

```bash
npm run dev
```

2. Open http://localhost:3000 to access the UI.

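Under the hood, the chat UI posts the conversation to a Next.js route handler that streams the
answer back from the vLLM endpoint through the Vercel AI SDK. The sketch below is not the
repository's actual route: the provider setup, the base URL format, and the response helper are
assumptions and depend on the installed `ai` / `@ai-sdk/openai` versions.

```typescript
// Hypothetical app/api/chat/route.ts — illustrative only, not this repo's implementation.
import { createOpenAI } from "@ai-sdk/openai";
import { streamText, convertToCoreMessages } from "ai";

// Point the AI SDK's OpenAI provider at the RunPod vLLM endpoint (assumed route format).
const runpod = createOpenAI({
  apiKey: process.env.RUNPOD_API_KEY,
  baseURL: `https://api.runpod.ai/v2/${process.env.RUNPOD_ENDPOINT_ID_VLLM}/openai/v1`,
});

export async function POST(req: Request) {
  const { messages } = await req.json();

  // Stream the model's answer back to the chat UI.
  const result = await streamText({
    model: runpod("meta-llama/Meta-Llama-3-8B-Instruct"),
    messages: convertToCoreMessages(messages),
  });

  return result.toAIStreamResponse();
}
```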

### 7. Ask Poddy

Now that everything is running, you can ask your [RunPod](https://runpod.io)-related questions, for example:

- What is RunPod?
- How do I create a serverless endpoint?
- What are the benefits of using a network volume?
- How can I become a host for the community cloud?
- Can RunPod help my startup to get going?

> [!NOTE]
> The first run will take some time as the worker downloads the LLM
> ([Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)).
> Subsequent requests will use the downloaded model stored in the network volume.