{"id":26498241,"url":"https://github.com/andrewhsugithub/rag-research-agent","last_synced_at":"2026-04-12T21:36:17.909Z","repository":{"id":280694398,"uuid":"942852153","full_name":"andrewhsugithub/RAG-Research-Agent","owner":"andrewhsugithub","description":"YouTube RAG Research Agent","archived":false,"fork":false,"pushed_at":"2025-03-19T22:31:39.000Z","size":251,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-19T22:37:55.578Z","etag":null,"topics":["chromadb","langgraph","llama","llamaindex","llamaindex-rag","oll","qdrant","rag","reranker","youtube"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/andrewhsugithub.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-04T19:26:44.000Z","updated_at":"2025-03-19T22:31:42.000Z","dependencies_parsed_at":null,"dependency_job_id":"9316f1fa-7fb8-455b-903a-6665b9c673ff","html_url":"https://github.com/andrewhsugithub/RAG-Research-Agent","commit_stats":null,"previous_names":["andrewhsugithub/yt-rag"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewhsugithub%2FRAG-Research-Agent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewhsugithub%2FRAG-Research-Agent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/andrewhsugithub%2FRAG-Research-Agent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/andrewhsugithub%2FRAG-Research-Agent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/andrewhsugithub","download_url":"https://codeload.github.com/andrewhsugithub/RAG-Research-Agent/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":244629434,"owners_count":20484200,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chromadb","langgraph","llama","llamaindex","llamaindex-rag","oll","qdrant","rag","reranker","youtube"],"created_at":"2025-03-20T14:27:47.696Z","updated_at":"2026-04-11T00:53:10.131Z","avatar_url":"https://github.com/andrewhsugithub.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# RAG Research\nThis project is a simple implementation of a RAG system over YouTube data, using LlamaIndex for data retrieval and Qdrant or Chroma as the VectorDB to store and search the vectors. 
It also includes an optional Web Research Workflow that leverages real-time web data.\n\n## RAG-Only Usage\n- Replace `\u003cquery\u003e` with your query and `\u003cyoutube_url\u003e` with the YouTube URL.\n    ```bash\n    yt-dlp -f bestaudio --extract-audio --audio-format mp3 \u003cyoutube_url\u003e -o \"audio/audio.mp3\"\n    cd src/rag\n    uv run whisper.py # Stop here if you only want to load the data\n    uv run rag.py --query \u003cquery\u003e --path \"../../qdrant\" --collection \"yt\" --qdrant\n    ```\n- Explanation of args for `rag.py`:\n    - `--query`: The query you want to search for\n    - `--path`: The path to the VectorDB on disk\n    - `--collection`: The collection name in the VectorDB\n    - `--qdrant`: Use Qdrant as the VectorDB (default)\n    - `--chroma`: Use Chroma as the VectorDB\n\n### Using Cloud VectorDBs instead of Local VectorDBs\n- If you use [Qdrant](https://qdrant.tech/) or [Pinecone](https://www.pinecone.io/) *(Note: not supported yet)*\n    ```bash\n    cp .env.example .env\n    ```\n    Copy your **QDRANT_API_KEY** and **QDRANT_URL** into the `.env` file\n\n## Web Research Workflow\n- This workflow adds RAG to the workflow implemented in [Ollama Deep Researcher](https://github.com/langchain-ai/ollama-deep-researcher); see that project for more details.\n- The RAG system is used on the **hf_docs** dataset to answer the query by default\n- Modify to use duckduckgo as the search API\n- Graph Workflow:\n\n    ![graph](output.png)\n\n### Usage\n- Spin up [Ollama](https://github.com/ollama/ollama) server:\n    ```bash\n    ollama serve\n    ```\n    \u003e **_NOTE:_** Pull the model you want first, for example: `ollama pull deepseek-r1:8b`\n\n- See [Ollama Deep Researcher](https://github.com/langchain-ai/ollama-deep-researcher) for details on the environment variables.\n    ```bash\n    cp .env.example .env\n    ```\n\n- If you want to use your YouTube data as the dataset for the RAG system, follow the steps in the [RAG-Only Usage](#rag-only-usage) section to load 
the data first.\n    \u003e **_DON'T_** run the `rag.py` script.\n\n- Run the workflow:\n    ```bash\n    uvx --refresh --from \"langgraph-cli[inmem]\" --with-editable . --python 3.11 langgraph dev\n    ```\n    \u003e **_NOTE:_** In `graph.py`, see the comments in the `rag_research` function if you want to use mock RAG data instead of the real data.\n\n## Examples:\n### hf_docs\n- RAG-Only Usage:\n    Uses the [HF Docs](https://huggingface.co/datasets/hf_docs) dataset\n    ```bash\n    cd src/rag\n    uv run hf_docs.py\n    uv run rag.py --query \"How to create a pipeline object?\" --path \"../../qdrant\" --collection \"hf_docs\" --qdrant\n    ```\n    See [llama3.1_hf_qdrant.txt](llama3.1_hf_qdrant.txt) for the output.\n    \n- Web Research Workflow:\n    - Uses the `deepseek-r1:8b` model\n    ```bash\n    ollama pull deepseek-r1:8b\n    ollama serve\n    uvx --refresh --from \"langgraph-cli[inmem]\" --with-editable . --python 3.11 langgraph dev\n    ```\n    1. Prompt 1: What's Model Context Protocol?\n        - See [output_What's Model Context Protocol?.md](output_What's%20Model%20Context%20Protocol%3F.md) for the output.\n    2. Prompt 2: What are the FAANG companies?\n        - See [output_What are the FAANG companies?.md](output_What%20are%20the%20FAANG%20companies%3F.md) for the output.\n    3. 
Prompt 3: How to create a custom huggingface pipeline object?\n        - See [output_How to create a custom huggingface pipeline object?.md](output_How%20to%20create%20a%20custom%20huggingface%20pipeline%20object%3F.md) for the output.\n\n## Technologies Used for RAG System:\n- [LlamaIndex](https://docs.llamaindex.ai/en/stable/)\n\n- Embeddings (loaded from [HuggingFace](https://huggingface.co/)):\n    - dense vectors: [gte-small](https://huggingface.co/thenlper/gte-small) \n    - sparse vectors: [Splade_PP_en_v1](https://huggingface.co/prithivida/Splade_PP_en_v1)\n\n- VectorDBs:\n    - Support Hybrid Vectors (dense + sparse)\n        - [Qdrant](https://qdrant.tech/)\n        - [Pinecone](https://www.pinecone.io/) *(Note: not supported yet)*\n        \u003e Note: sparse vectors default to [prithivida/Splade_PP_en_v1](https://huggingface.co/prithivida/Splade_PP_en_v1)\n    - Dense Vectors: [Chroma](https://chroma.farfetch.com/)\n    - Sparse Vectors: [BM25](https://docs.llamaindex.ai/en/stable/examples/retrievers/bm25_retriever)\n\n- Reranker:\n    - [bge-m3](https://huggingface.co/BAAI/bge-m3)\n\n- Language Models (loaded from [HuggingFace](https://huggingface.co/)):\n    - [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct)\n    - [Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrewhsugithub%2Frag-research-agent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fandrewhsugithub%2Frag-research-agent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fandrewhsugithub%2Frag-research-agent/lists"}