![Phenotype RAG Bio-Phenotype Insights Assistant](https://github.com/user-attachments/assets/af61c859-7f99-421b-b353-03b72e8b4fa6)

# 🤖🧬 Phenotype RAG: Bio-Phenotype Insights Assistant

https://github.com/user-attachments/assets/ea5a7935-fc04-4c2b-8656-309de25a7d29

📌 You can explore and interact with the Bio-Phenotype Insights Assistant through the following link: https://dry-recipe-9383.ploomberapp.io.

## 🧬 Project Overview
This project, **Phenotype RAG**, was developed as the final assignment for the LLM Zoomcamp. It implements a Retrieval-Augmented Generation (RAG) system that intelligently answers questions related to phenotypes by utilizing both a knowledge base and large language models (LLMs). The system is designed to assist with queries about phenotypes in fields such as genetics, evolutionary biology, and medical diagnostics. By integrating retrieval and generation capabilities, the project provides precise and contextually accurate information, making it a powerful tool for phenotype-related research and clinical applications.

## 🧬 Problem Description
Phenotyping plays a crucial role in various domains like genetics, evolutionary biology, and medical diagnostics, helping researchers and clinicians understand the observable traits influenced by genetic and environmental factors. However, the complexity and vastness of phenotype data make it challenging to access and retrieve relevant information efficiently. This project addresses the challenge by developing an intelligent assistant capable of answering complex phenotype-related queries. By leveraging RAG techniques, the system combines the reasoning ability of LLMs with the precision of a curated knowledge base, making the retrieval of phenotype information more accessible and accurate for researchers, healthcare professionals, and educators.

## 🧬 Project Objectives
The **Phenotype RAG** project aims to achieve the following objectives:
- **1. Enhance Data Retrieval**: Implement a Retrieval-Augmented Generation (RAG) system to efficiently access and retrieve accurate information about phenotypes from a comprehensive knowledge base.
- **2. Improve Query Accuracy**: Utilize advanced language models to reformulate and optimize queries, ensuring that the answers provided are contextually relevant and precise.
- **3. Offer Educational Value**: Create an accessible platform for students and professionals to learn about phenotyping, improving their grasp of complex concepts through a user-friendly interface.
- **4. Ensure Scalability and Flexibility**: Develop a system with a flexible architecture that can integrate with various tools and adapt to different research needs, promoting scalability and adaptability in diverse applications.
- **5. Foster Collaboration**: Make the project's code and documentation available to the community, encouraging collaborative development and knowledge sharing to advance the field.

## 🧬 Technologies and Tools Used
### ⚗️ Key Technologies

- **Anaconda**: Used for managing dependencies and environment configurations.
- **Docker**: Containerizes the application for easy deployment and consistent execution across different platforms.
- **Grafana**: Provides monitoring and visualization dashboards to track application performance and usage metrics.
- **Streamlit**: Offers a user-friendly interface for interacting with the **Phenotype RAG** system.
- **Prefect**: Orchestrates the data ingestion workflow to keep it smooth and automated (a minimal flow sketch follows this list).
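
The Prefect-orchestrated ingestion mentioned above might look roughly like the following flow. This is a hedged sketch rather than the project's actual `prefect_ingest.py`, and the cleaning steps are illustrative:

```python
# Minimal sketch of a Prefect ingestion flow (not the project's actual prefect_ingest.py).
import pandas as pd
from prefect import flow, task


@task
def load_dataset(path: str) -> pd.DataFrame:
    # Read the question/answer dataset from disk.
    return pd.read_csv(path)


@task
def clean(df: pd.DataFrame) -> pd.DataFrame:
    # Basic cleaning: drop empty and duplicated records.
    return df.dropna().drop_duplicates()


@flow(name="bio-phenotype-ingestion")
def ingest(path: str = "data/bio-phenotype.csv") -> int:
    df = clean(load_dataset(path))
    # Downstream tasks would embed the records and index them into Pinecone
    # (see the indexing sketch further below).
    return len(df)


if __name__ == "__main__":
    ingest()
```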

### ⚗️ Models and AI Services
- **gemma2-9b-it**: Used for question reformulation, optimizing queries for better understanding.
- **mixtral-8x7b-32768**: Powers the retrieval-augmented generation step, turning the retrieved context into contextually accurate answers.
- **all-MiniLM-L6-v2**: Generates the embeddings used for semantic search, allowing precise query-to-answer matching.
- **Groq**: Serves the LLMs above through its fast inference API during the answer-generation phase.
- **Pinecone**: Manages vector indexing and provides fast, scalable retrieval of information via semantic search. A minimal sketch of how these components work together follows this list.
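
The sketch below shows one way these pieces can be wired together in Python. It is an outline, not the project's exact code; the index name `bio`, the metadata field `text`, and the environment variable names are assumptions.

```python
# Minimal RAG sketch: embed the question, retrieve similar records from Pinecone,
# and generate an answer with an LLM served by Groq.
import os

from groq import Groq
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")                      # 384-dim embeddings
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("bio")   # assumed index name
groq_client = Groq(api_key=os.environ["GROQ_API_KEY"])


def answer_question(question: str, top_k: int = 5) -> str:
    # 1. Semantic search: embed the question and query the Pinecone index.
    query_vector = embedder.encode(question).tolist()
    results = index.query(vector=query_vector, top_k=top_k, include_metadata=True)

    # 2. Build the context from the retrieved records ("text" metadata field is assumed).
    context = "\n".join((m.metadata or {}).get("text", "") for m in results.matches)

    # 3. Generation: ask the LLM to answer using only the retrieved context.
    completion = groq_client.chat.completions.create(
        model="mixtral-8x7b-32768",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```

Called as `answer_question("How do phenotypic traits relate to disease susceptibility?")`, the function returns an answer grounded in the retrieved records.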

### ⚗️ Other Tools Used for Development
- **Pytest**: Ensures code reliability through unit and integration tests.
- **Git**: Version control for tracking changes and collaboration.
- **Visual Studio Code**: Integrated development environment (IDE) for writing and debugging code.
- **Jupyter Notebook**: Facilitates exploratory data analysis and preprocessing through interactive notebooks.
- **PostgreSQL**: Relational database used for storing and querying structured data.

## 🧬 Project Structure
The project is organized into the following directories and files:

```text
phenotype-rag/
├── bio-phenotype/                     # Root folder for the main application logic
│   ├── data/                          # Project-specific datasets
│   │   └── bio-phenotype.csv          # Main dataset: phenotype-related questions and answers
│   ├── sql/                           # Database management and schema scripts
│   │   ├── .env                       # Environment file storing database credentials and connection strings
│   │   └── create_table.py            # Script that creates the tables in PostgreSQL
│   ├── tests/                         # Unit tests to ensure code quality and correctness
│   │   └── test.py                    # Test cases for the project's core functionality
│   ├── __init__.py                    # Initializes the `bio-phenotype` package, making its modules importable
│   ├── main.py                        # Streamlit entry point; defines the UI and handles user interaction
│   ├── prefect_ingest.py              # Prefect workflow that automates data ingestion and processing
│   ├── requirements.txt               # Python dependencies for pip-based installations
│   └── utils.py                       # Utility functions for data processing, I/O, and other common tasks
├── data/                              # Raw data files shared across components
│   └── bio-phenotype.csv              # Same dataset as in `bio-phenotype/data`, kept for testing and backup
├── grafana/                           # Grafana monitoring setup
│   └── monitoring/
│       ├── docker-compose.yaml        # Docker Compose configuration for Grafana
│       └── grafana_datasources.yaml   # Data sources Grafana connects to (PostgreSQL)
├── images/                            # Project-related images and screenshots
│   ├── app.png                        # Screenshot of the Streamlit app's interface
│   ├── grafana.png                    # Screenshot of the Grafana dashboard displaying key metrics
│   ├── gloq.png                       # Screenshot of the Groq console with the API key setup
│   └── pinecone.png                   # Screenshot of the Pinecone vector database powering semantic search
├── notebook/                          # Jupyter notebooks for exploratory data analysis and experimentation
│   ├── .env                           # Notebook-specific configuration (API keys, credentials)
│   └── vector_Indexing_.ipynb         # Vectorizes the data and indexes it into Pinecone
├── docker-compose.yaml                # Orchestrates the multi-container setup (app, database, Grafana)
├── README.md                          # Project documentation: setup, usage, and purpose
├── requirements.txt                   # Python dependencies for the entire project
└── test.py                            # Standalone tests covering ingestion, database interactions, and the API
```

## 🧬 Phenotype Dataset
The dataset used for this project contains questions and answers about phenotypes, with a focus on genetic research, evolutionary biology, and medical diagnostics. It explores how phenotypic traits relate to cognitive function, disease susceptibility, and treatment outcomes, highlighting the role of phenotyping in personalized medicine. The dataset also covers the impact of traits on aging, chronic diseases, and mental health disorders. Phenotypic trait analysis is crucial in understanding genetic predispositions, environmental adaptations, and evolutionary processes. This resource supports the development of diagnostic tools, therapeutic strategies, and health interventions by linking observable traits to genetic and environmental factors. Additionally, it is valuable for research in agricultural phenotypes, such as plant growth and disease resistance.

### 📝 Some Questions and Answers
![image](https://github.com/user-attachments/assets/9340d71a-9c3f-4013-9931-f8c904f0ed7a)

## 🧬 Project Execution Locally
### ⚗️ Pre-requisites
Ensure the following are installed on your machine:

- Anaconda (latest version)
- Python (version 3.10 or later)
- PostgreSQL (latest version)
- Grafana (latest version)

### ⚗️ Environment Setup
1. Clone the repository:

```bash
git clone https://github.com/nathadriele/biophenotype-rag.git
cd biophenotype-rag
```

2. Create and activate the virtual environment:

```bash
conda create -n bio-phenotype python=3.10
conda activate bio-phenotype
```

3. Install dependencies:

```bash
pip install -r requirements.txt
```

## 🧬 Data Exploration and Preprocessing
- Start the `vector_Indexing_.ipynb` notebook with **Jupyter**:

```bash
jupyter notebook
```
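
In outline, the notebook's indexing step embeds each dataset record with **all-MiniLM-L6-v2** and upserts the vectors into **Pinecone**. A minimal sketch follows; the column names `question`/`answer`, the index name `bio`, and the environment variable name are assumptions, not confirmed details of the notebook:

```python
# Sketch of the indexing step: embed each dataset record and upsert it into Pinecone.
import os

import pandas as pd
from pinecone import Pinecone
from sentence_transformers import SentenceTransformer

df = pd.read_csv("data/bio-phenotype.csv")                              # path relative to the repo root
embedder = SentenceTransformer("all-MiniLM-L6-v2")
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("bio")   # assumed index name

vectors = []
for i, row in df.iterrows():
    text = f"Q: {row['question']}\nA: {row['answer']}"                  # assumed column names
    vectors.append({
        "id": str(i),
        "values": embedder.encode(text).tolist(),                       # 384-dimensional vector
        "metadata": {"text": text},
    })

# Upsert in small batches to stay under Pinecone's request size limits.
for start in range(0, len(vectors), 100):
    index.upsert(vectors=vectors[start:start + 100])
```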

## 🧬 Running the Application
To run the application, you need API keys for both **GroqCloud** and **Pinecone**, plus an index created in **Pinecone**, so accounts on both platforms are required. The steps below walk through creating the keys and the index and wiring them into the project.

### Step 1: Create API Key on GroqCloud

![gloq](https://github.com/user-attachments/assets/474b3dbc-54d9-4965-ad54-6d505e77ddb2)

- Create or log into your **GroqCloud** account and navigate to **`API Keys` > `Create API Key`**.
- Copy and save the **`Key`** in a text editor for later use.

### Step 2: Create an Index on Pinecone

![pinecone](https://github.com/user-attachments/assets/68ccc4a9-0e0e-4209-a96e-f76448cc98ba)

- On the **Pinecone** website, go to **Indexes > Create Index**.
- Configure the index as follows:
  - **Name**: `bio` (in the Default project)
  - **Dimensions**: 384
  - **Metric**: Cosine
  - **Capacity mode**: Serverless
  - **Cloud provider**: AWS
  - **Region**: Virginia | us-east-1
- Complete the setup by clicking **Create Index**.

**Note**: The region can be changed without significantly affecting the code. However, altering other configurations would require significant code adjustments.
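
If you prefer to script this step, the same configuration can also be created with the Pinecone Python client. This is a sketch that assumes the index is named `bio`, matching the configuration above:

```python
# Sketch: create the Pinecone index described above from Python instead of the web console.
import os

from pinecone import Pinecone, ServerlessSpec

pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])

pc.create_index(
    name="bio",                                            # assumed index name
    dimension=384,                                         # matches all-MiniLM-L6-v2 embeddings
    metric="cosine",
    spec=ServerlessSpec(cloud="aws", region="us-east-1"),  # Serverless on AWS us-east-1
)
```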

### Step 3: Add the API Keys to Environment Files
After completing the previous steps, add your API keys to the `.env` files listed in the project structure above (`notebook/.env` and `bio-phenotype/sql/.env`), as shown below:

![image](https://github.com/user-attachments/assets/0acae8c7-6f60-4c20-bc5e-b987074e9d85)

Make sure to replace `your-pinecone-api-key` and `your-groqcloud-api-key` with the actual keys you generated earlier.
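
At runtime the application can read these keys from the `.env` files, for example with `python-dotenv`. In this minimal sketch the variable names `PINECONE_API_KEY` and `GROQ_API_KEY` are assumptions rather than the project's confirmed names:

```python
# Sketch: loading the API keys from a .env file with python-dotenv.
import os

from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

pinecone_api_key = os.environ["PINECONE_API_KEY"]   # assumed variable name
groq_api_key = os.environ["GROQ_API_KEY"]           # assumed variable name
```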

### Step 4: Running the Application Locally
To run the application locally, you may need to adjust the configurations in the .env file to match your environment. This also applies to the Grafana setup parameters shown below.

- In the **Anaconda Prompt**, make sure you are in the folder that contains `main.py` (`bio-phenotype/`) and run the following command:

```bash
streamlit run main.py
```

## 🧬 Monitoring and Performance Metrics
![grafana](https://github.com/user-attachments/assets/8027f204-6112-4cb5-99f0-f2cf07a039a1)

**Grafana** is used to monitor performance; the image above shows a dashboard configured with key metrics:

- **Average Response Time**: the current average response time, tracked in real time to ensure the system stays responsive.
- **Record Count by Month**: tracks the number of records entered into the system each month.
- **Total Conversations**: a gauge showing the total number of conversations monitored, with a green status indicating acceptable levels.
- **Distribution of Questions and Answers**: compares the average question length with the average response length (161 characters), highlighting that responses tend to be considerably longer than the questions.

## 🧬 Contribution of the Phenotype RAG Application
The **Phenotype RAG: Bio-Phenotype Insights Assistant** enhances research and practice in genetics and medical diagnostics by integrating retrieval and generation of phenotype information. It facilitates efficient access to complex data, supports accurate diagnostics, and serves as a valuable educational tool. With a flexible architecture, the application improves interaction with large volumes of data and fosters innovation through a collaborative, accessible approach for the community.

![app](https://github.com/user-attachments/assets/38ac1d64-2eaf-436a-8c9d-e7c3eec72fae)

## More Information
This project was developed as the final assignment for the **LLM Zoomcamp** course.