https://github.com/mattjesc/mlops-framework-dds-ss-llm

MLOps Framework for Dynamic Dataset Selection using Semantic Search and LLM

ai api llm machine-learning ml mlops rag semantic-search semanticsearch

Last synced: 3 months ago

Host: GitHub
URL: https://github.com/mattjesc/mlops-framework-dds-ss-llm
Owner: Mattjesc
Created: 2024-08-28T14:12:26.000Z (9 months ago)
Default Branch: main
Last Pushed: 2024-08-28T14:26:29.000Z (9 months ago)
Last Synced: 2025-01-18T04:32:11.128Z (4 months ago)
Topics: ai, api, llm, machine-learning, ml, mlops, rag, semantic-search, semanticsearch
Language: Python
Homepage:
Size: 5.86 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md


# MLOps Framework for Dynamic Dataset Selection using Semantic Search and LLM

![image](https://github.com/user-attachments/assets/514b382d-1b0b-4027-880c-d07b658bd1ff)

## Overview

This project provides an MLOps framework that dynamically selects datasets to augment Large Language Models (LLMs) using semantic search. The framework supports both an API-based approach and a local model approach, so it can be deployed in cloud, hybrid, or local environments.

## Features

- **Dynamic Dataset Selection**: Automatically selects relevant datasets based on user queries using semantic search.
- **Semantic Search Integration**: Enhances LLM responses with external data retrieved using semantic search techniques.
- **MLOps Practices**: Incorporates automation, monitoring, and reproducibility for efficient model management.
- **Flexible Deployment**: Supports cloud, hybrid, and local architectures.

## Prerequisites

Before you begin, ensure you have met the following requirements:

- **Python 3.7+**
- **API-based Approach**:
  - OpenAI API Key
- **Local Model Approach**:
  - CUDA-enabled GPU (recommended for performance)
  - PyTorch with CUDA support

## Installation

### API-based Approach

1. Clone the repository:
```bash
git clone https://github.com/mattjesc/mlops-framework-dds-ss-llm.git
cd mlops-framework-dds-ss-llm
```

2. Install the required packages:
```bash
pip install -r requirements_API.txt
```

**Note**: If newer releases of the listed packages turn out to be incompatible, adjust the versions in `requirements_API.txt` accordingly.

3. Set your OpenAI API key:
```bash
export OPENAI_API_KEY=your_api_key_here
```
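Because the key is supplied through the environment, an application can check for it at startup rather than fail on the first API call. The following is a minimal sketch of such a check. `get_api_key` is a hypothetical helper introduced here for illustration and is not taken from `app_API.py`.

```python
import os


def get_api_key(env=None):
    """Return the OpenAI API key from the environment, or raise a clear error.

    Hypothetical helper for illustration; not part of the repository.
    """
    # Allow a custom mapping to be passed in so the check is easy to test.
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before starting the app."
        )
    return key
```

Calling `get_api_key()` with no arguments reads from `os.environ`, which is what the `export` command above populates.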

### Local Model Approach

1. Clone the repository:
```bash
git clone https://github.com/mattjesc/mlops-framework-dds-ss-llm.git
cd mlops-framework-dds-ss-llm
```

2. Install the required packages:
```bash
pip install -r requirements_local.txt
```

**Note**: If newer releases of the listed packages turn out to be incompatible, adjust the versions in `requirements_local.txt` accordingly.

## Usage

### API-based Approach

1. Run the Streamlit app:
```bash
streamlit run app_API.py
```

2. Open your web browser and navigate to the URL displayed in the terminal.

### Local Model Approach

1. Run the Streamlit app:
```bash
streamlit run app_local.py
```

2. Open your web browser and navigate to the URL displayed in the terminal.

## Customization

- **UI Framework**: This project includes a simple Streamlit UI as an example. You are free to customize the UI or use any other framework that suits your needs.

## Configuration

- **Dataset Mapping**: Modify the `DATASET_MAPPING` dictionary in `app_API.py` or `app_local.py` to include your dataset paths and keywords.
- **Model Configuration**:
  - **Local Model Approach**: Choose a model from Hugging Face's model hub and update the `load_model_and_tokenizer` function in `app_local.py` accordingly.
  - **API-based Approach**: While this example uses the OpenAI API, you can modify the `run_rag_pipeline` function in `app_API.py` to use any other API provider of your choice.
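
The exact shape of `DATASET_MAPPING` is defined in the app files. The workflow description suggests that it maps keyword keys to dataset paths. The sketch below is a hypothetical example under that assumption; the keywords and file paths are placeholders, not files shipped with the repository.

```python
# Assumed shape: lowercase keyword -> path of the dataset to load.
# Keywords and paths are placeholders chosen for illustration only.
DATASET_MAPPING = {
    "finance": "data/finance_reports.csv",
    "health": "data/health_articles.csv",
    "sports": "data/sports_news.csv",
}
```

Lowercase keys fit the lowercased query matching described under Keyword Detection.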

## Architecture and Workflow

### Keyword Detection

The framework uses a simple keyword-based detection mechanism to identify relevant datasets. When a user query is submitted, the system converts the query to lowercase and checks it against the keys in the `DATASET_MAPPING` dictionary. If a keyword from the query matches a key in the dictionary, the corresponding dataset is loaded and used for semantic search.
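
The lookup described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's code: `detect_dataset` is a hypothetical name, the mapping entries are placeholders, and substring matching is one plausible reading of "matches a key".

```python
from typing import Dict, Optional

# Placeholder mapping for illustration; real keys and paths live in the app files.
EXAMPLE_MAPPING: Dict[str, str] = {
    "finance": "data/finance_reports.csv",
    "health": "data/health_articles.csv",
}


def detect_dataset(query: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the dataset path for the first mapping key found in the query.

    Returns None when no key occurs in the query.
    """
    lowered = query.lower()
    for keyword, path in mapping.items():
        # Substring match (an assumption); whole-word matching is an alternative.
        if keyword in lowered:
            return path
    return None


selected = detect_dataset("What are the latest FINANCE trends?", EXAMPLE_MAPPING)
```

Substring matching is simple but can misfire when one keyword is contained in an unrelated word. Splitting the query into tokens first is a stricter alternative.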

### Semantic Search

Semantic search is performed using a pre-trained model from the Hugging Face Transformers library. The query and dataset entries are converted into embeddings, and cosine similarity is used to find the most relevant documents. The top results are then used to augment the LLM response.
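
The ranking step can be illustrated without loading a model. The sketch below assumes the embeddings have already been computed and uses toy three-dimensional vectors in their place. `cosine_similarity` and `top_k` are hypothetical helper names, and the repository may compute similarity with a library routine instead.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    doc_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar documents."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real model embeddings.
docs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query = [1.0, 0.1, 0.0]
ranked = top_k(query, docs, k=2)  # indices of the two closest documents
```

In a real pipeline the vectors would come from the embedding model, and the returned indices would be used to look up the matching dataset entries.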

### LLM Augmentation

For the API-based approach, the augmented prompt is sent to the OpenAI API, which returns a response generated by the LLM. For the local model approach, the augmented prompt is processed by a locally hosted model from Hugging Face, generating a response based on the augmented context.
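
Both approaches send the same kind of augmented prompt to their model. One way to assemble such a prompt from the retrieved documents is sketched below. The template wording and the `build_augmented_prompt` name are assumptions made for illustration and are not taken from the repository.

```python
from typing import List


def build_augmented_prompt(query: str, documents: List[str]) -> str:
    """Combine retrieved documents and the user query into a single prompt."""
    # Each retrieved document becomes one bullet in the context section.
    context = "\n".join(f"- {doc}" for doc in documents)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )


prompt = build_augmented_prompt(
    "What is X?",
    ["X is a thing.", "X is used for Y."],
)
```

The resulting string would then be passed either to the API call or to the local model's generation step.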
