https://github.com/samjoesilvano/multi-source-knowledge-retrieval-system
An end-to-end multi-source knowledge retrieval system using LangChain, FAISS, and OpenAI embeddings. This Retrieval-Augmented Generation (RAG) pipeline intelligently searches across Wikipedia, arXiv, and custom websites, optimizing source selection and delivering precise, real-time results based on query relevance.
ai-pipeline document-search faiss information-retrieval knowledge-retrieval langchain langchain-agents langchain-tools machine-learning multi-source-retrieval natural-language-processing openai-embeddings python retrieval-augmented-generation semantic-search
Last synced: 6 months ago
- Host: GitHub
- URL: https://github.com/samjoesilvano/multi-source-knowledge-retrieval-system
- Owner: SamJoeSilvano
- License: gpl-3.0
- Created: 2024-10-24T12:41:13.000Z (12 months ago)
- Default Branch: main
- Last Pushed: 2024-11-03T15:02:38.000Z (11 months ago)
- Last Synced: 2025-04-08T01:42:37.612Z (6 months ago)
- Topics: ai-pipeline, document-search, faiss, information-retrieval, knowledge-retrieval, langchain, langchain-agents, langchain-tools, machine-learning, multi-source-retrieval, natural-language-processing, openai-embeddings, python, retrieval-augmented-generation, semantic-search
- Language: Jupyter Notebook
- Homepage:
- Size: 5.03 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE.txt
README
# Multi-Source Knowledge Retrieval System
A high-speed Retrieval-Augmented Generation (RAG) pipeline that integrates multiple knowledge sources to deliver precise, relevant search results. This end-to-end system is built with LangChain and FAISS and uses OpenAI embeddings to run semantic searches across sources such as [Wikipedia](https://www.wikipedia.org/), [arXiv](https://arxiv.org/), and custom user-defined websites.
## Features
- **RAG Pipeline with LangChain & FAISS**:
  Efficient semantic search using FAISS, enhanced by OpenAI embeddings, to achieve high-speed, accurate query matching.
- **Multi-Source Retrieval**:
  Seamlessly integrates Wikipedia, arXiv research papers, and custom websites for diverse data retrieval.
- **Smart Agent Selection**:
  Employs LangChain agents and tools to intelligently select the optimal source based on the query, optimizing response relevance.
- **Real-Time API Access**:
  Utilizes real-time API connections for updated and precise content retrieval across multiple repositories.
## Project Overview
This project addresses the need for comprehensive information retrieval across various sources by:
1. Building a retrieval-augmented generation pipeline to automate document retrieval from multiple sources.
2. Optimizing search relevance through OpenAI-powered embeddings and FAISS similarity search.
3. Enabling smart decision-making on which source to use based on query context, using LangChain agents.
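
Step 3 above, choosing a source from the query, can be pictured with a small sketch. In this project the choice is made by a LangChain agent; the version below is a deliberately simplified stand-in that uses neither an LLM nor LangChain. It scores each source by how many query words appear in a hand-written description. The `Source` class, the descriptions, and `pick_source` are hypothetical names introduced only for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """A retrieval source plus a description used for routing (hypothetical)."""
    name: str
    description: str


# Assumed descriptions, written for this sketch only. An agent would read
# tool descriptions of a similar kind when deciding which tool to call.
SOURCES = [
    Source("wikipedia", "general encyclopedia facts history people places"),
    Source("arxiv", "research papers preprints physics machine learning"),
    Source("website", "product documentation guides custom site"),
]


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def pick_source(query: str, sources: list[Source] = SOURCES) -> Source:
    """Return the source whose description shares the most words with the query.

    Ties resolve to the earliest source in the list, so the first entry
    acts as the default when nothing matches.
    """
    query_words = tokenize(query)
    return max(sources, key=lambda s: len(query_words & tokenize(s.description)))


if __name__ == "__main__":
    print(pick_source("recent research papers on machine translation").name)
    print(pick_source("history of the Roman Empire").name)
```

Word overlap is a crude proxy for the judgment an agent makes, but it shows the shape of the decision: each source advertises what it is good for, and the query is matched against those descriptions before any retrieval happens.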
## Technology Stack
- **LangChain**: Framework for creating language model applications, utilized for agent-based source selection and tool integration.
- **FAISS (Facebook AI Similarity Search)**: Provides vector-based similarity search for efficient retrieval.
- **OpenAI Embeddings**: Supports semantic search by converting queries and documents into high-dimensional embeddings for better relevance.
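
To make the FAISS and embeddings entries concrete, the sketch below shows exhaustive nearest-neighbor search, the idea behind an exact vector index: compare the query vector with every stored vector and keep the closest ones. It is plain Python over hand-made three-dimensional vectors, not FAISS and not OpenAI embeddings. Cosine similarity is an assumed choice of metric, and `cosine_similarity`, `top_k`, and the toy `index` are hypothetical names used only for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(
    query: list[float], index: dict[str, list[float]], k: int = 2
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and return the k best matches."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" invented for this sketch. Vectors from an embedding model
# would have hundreds or thousands of dimensions.
index = {
    "doc_cats": [0.9, 0.1, 0.0],
    "doc_dogs": [0.8, 0.3, 0.1],
    "doc_stocks": [0.0, 0.2, 0.9],
}

if __name__ == "__main__":
    for doc_id, score in top_k([1.0, 0.2, 0.0], index):
        print(f"{doc_id}: {score:.3f}")
```

With the query `[1.0, 0.2, 0.0]`, the two pet documents rank above `doc_stocks` because their vectors point in nearly the same direction as the query. A dedicated vector library does the same ranking but is built to handle far larger collections efficiently.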
## Installation
1. **Clone the repository**:
   ```bash
   git clone https://github.com/samjoesilvano/multi-source-knowledge-retrieval-system.git
   cd multi-source-knowledge-retrieval-system
   ```
2. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```
3. **Set up API keys**:
   - Add your OpenAI API key and any other API credentials to a `.env` file:
   ```bash
   OPENAI_API_KEY="your_openai_api_key"
   LANGCHAIN_API_KEY="your_langchain_api_key"
   GROQ_API_KEY="your_groq_api_key"
   OTHER_API_KEYS="..."
   ```
## Usage
1. **Run the pipeline**:
   - Go to the `agents` folder and run the following command, which executes every cell in the notebook and saves the output back to the same file:
   ```bash
   jupyter nbconvert --execute --to notebook --inplace agents.ipynb
   ```
2. **Make a query**:
   - Use the terminal interface to input queries.
   - The system will automatically select the best source and return relevant information.
## Results
This project delivers rapid, relevant responses by drawing on the strengths of multiple knowledge sources. FAISS provides efficient vector-based matching, while LangChain agents choose the source best suited to each query.
## Future Enhancements
- **Additional Knowledge Sources**: Expand integration to other repositories or specific domain databases.
- **Enhanced Customization**: User-defined filtering to limit results based on document type, date, or other attributes.
## License
This project is licensed under the GNU General Public License v3.0 (GPL-3.0). See the [LICENSE](LICENSE.txt) file for details.