https://github.com/ansh-info/stark-agent
STaRK: Agentic AI benchmark, which is designed to evaluate how well LLMs and retrieval systems work with semi-structured knowledge bases.
agentic-ai agents gpt-4 graph information-retrieval knowledge-base knowledge-graph llm multimodal nlp openai openai-api semi-structured-data
- Host: GitHub
- URL: https://github.com/ansh-info/stark-agent
- Owner: ansh-info
- License: mit
- Created: 2025-02-23T08:23:00.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2025-02-27T23:58:13.000Z (over 1 year ago)
- Last Synced: 2025-06-01T01:59:46.179Z (about 1 year ago)
- Topics: agentic-ai, agents, gpt-4, graph, information-retrieval, knowledge-base, knowledge-graph, llm, multimodal, nlp, openai, openai-api, semi-structured-data
- Language: Python
- Homepage: https://stark.stanford.edu/
- Size: 2.53 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
# STaRK Benchmark Evaluation
A hierarchical agent system for evaluating LLM retrieval performance on semi-structured knowledge bases.
Overview • Features • Architecture • Getting Started • Usage • Metrics • Why STaRK?
## 📋 Overview
STaRK (Semi-structured Text and Relational Knowledge) is a benchmark designed to evaluate how well large language models (LLMs) and retrieval systems work with semi-structured knowledge bases (SKBs). These knowledge bases combine structured data (e.g., entity relationships) with unstructured data (e.g., textual descriptions), reflecting the complexity of real-world knowledge.
This project implements a hierarchical LLM-powered agent system to process, evaluate, and visualize the performance of retrieval systems on the STaRK benchmark, spanning three key domains:
1. **Product Search**: Detailed product metadata, reviews, and relationships
2. **Academic Paper Search**: Paper citations, author relationships, and content
3. **Precision Medicine**: Drug-disease interactions and clinical trial data
## 🌟 Features
- **Hierarchical Agent System**: Multi-agent architecture with specialized agents for different tasks
- **Comprehensive Evaluation**: Calculate standard retrieval metrics (MRR, MAP, NDCG, Recall@K, etc.)
- **Interactive Knowledge Graph**: Visualize node relationships and similarities
- **Streamlit UI**: User-friendly interface for uploading data, running evaluations, and analyzing results
- **Memory Optimization**: Efficiently process large embedding datasets with chunked computation
- **LLM-powered Analysis**: Natural language interface to query and understand results
- **Data Enrichment**: Tools to enhance datasets with additional features
- **Remote Processing**: Support for both local and remote (GPT-4o-mini) evaluation
## 🧠 Architecture
The system implements a hierarchical agent structure:
### Data Flow Diagram with Remote Processing
```mermaid
graph TD
A[User] -->|Upload Embeddings| B[Streamlit UI]
B -->|Configure Settings| B1[Local/Remote Selection]
B -->|Trigger Evaluation| C[Main Agent]
C -->|Route Request| D[StarkQA Agent]
D -->|Local Processing| E1[Local Evaluation Tool]
D -->|Remote Processing| E2[GPT-4o-mini API]
E1 -->|Read| F1[Query Embeddings]
E1 -->|Read| F2[Node Embeddings]
E1 -->|Process| G1[Local Similarity Computation]
E2 -->|Send| F3[Base64 Encoded Files]
E2 -->|Process| G2[Remote Computation]
G1 -->|Calculate| H1[Local Metrics]
G2 -->|Calculate| H2[Remote Metrics]
G1 -->|Generate| I1[Knowledge Graph]
H1 -->|Store In| J[Shared State]
H2 -->|Store In| J
I1 -->|Store In| J
J -->|Display In| K1[Metrics Visualization]
J -->|Display In| K2[Graph Visualization]
J -->|Reference For| K3[Chat Interface]
K1 -->|Show In| B
K2 -->|Show In| B
K3 -->|Show In| B
%% Styling
classDef user fill:#f96,stroke:#333,stroke-width:2px
classDef ui fill:#9cf,stroke:#333,stroke-width:2px
classDef agent fill:#fcf,stroke:#333,stroke-width:2px
classDef local fill:#cfc,stroke:#333,stroke-width:2px
classDef remote fill:#f99,stroke:#333,stroke-width:2px
classDef data fill:#fc9,stroke:#333,stroke-width:2px
classDef process fill:#ff9,stroke:#333,stroke-width:2px
classDef state fill:#c9f,stroke:#333,stroke-width:2px
classDef output fill:#9c9,stroke:#333,stroke-width:2px
class A user
class B,B1 ui
class C,D agent
class E1,G1,H1,I1 local
class E2,G2,H2 remote
class F1,F2,F3 data
class J state
class K1,K2,K3 output
```
### Agents
- **Main Agent**: Supervisory agent that routes user queries and orchestrates tasks
- **StarkQA Agent**: Specialized agent focused on evaluation and metrics
- **Query Agent**: Handles natural language queries against the knowledge graph
- **Enrichment Agent**: Provides data enhancement capabilities
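The routing relationship between the Main Agent and the specialized agents can be pictured with a framework-free sketch. The project itself lists LangChain/LangGraph as its agent framework; the handler functions, the `MainAgent` class, and the keyword-based routing rule below are hypothetical stand-ins for illustration, not the repository's code.

```python
from typing import Callable, Dict


# Hypothetical handlers standing in for the specialized agents.
def starkqa_agent(request: str) -> str:
    return f"[StarkQA] evaluating: {request}"


def query_agent(request: str) -> str:
    return f"[Query] searching graph for: {request}"


def enrichment_agent(request: str) -> str:
    return f"[Enrichment] enriching: {request}"


class MainAgent:
    """Supervisor that picks one sub-agent for each request."""

    def __init__(self) -> None:
        # Keyword -> handler table (an assumption made for this sketch;
        # an LLM-based supervisor would make this decision itself).
        self.routes: Dict[str, Callable[[str], str]] = {
            "evaluate": starkqa_agent,
            "metric": starkqa_agent,
            "enrich": enrichment_agent,
        }

    def route(self, request: str) -> str:
        lowered = request.lower()
        for keyword, handler in self.routes.items():
            if keyword in lowered:
                return handler(request)
        # Free-form questions fall back to the Query Agent.
        return query_agent(request)


main_agent = MainAgent()
print(main_agent.route("Evaluate the uploaded embeddings"))
print(main_agent.route("Which nodes are closest to node 42?"))
```

A lookup table keeps the supervisor's decision easy to test in isolation, which is the main reason to separate routing from the work each agent does.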
### Tools
- **Evaluation Tool**: Processes embeddings and computes similarities
- **Metrics Tool**: Calculates standard retrieval metrics
- **Visualization Tool**: Generates interactive visualizations
- **Knowledge Graph Tool**: Builds and queries the knowledge graph
- **Feature Generation Tool**: Enhances datasets with additional features
## 📊 Evaluation Pipeline
1. **Data Loading**: Process query and node embeddings from parquet files
2. **Similarity Computation**: Calculate similarities between queries and nodes using optimized batching
3. **Metrics Calculation**: Compute standard retrieval metrics (MRR, MAP, Recall@K)
4. **Knowledge Graph Generation**: Create interactive visualizations of node relationships
5. **Result Analysis**: Provide natural language insights about evaluation results
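Steps 2 and 3 of the pipeline can be illustrated with a small, dependency-free sketch: queries are handled in batches, each query is scored against every node with cosine similarity, and only the top-k node IDs are kept. The function names, the toy vectors, and the choice of cosine similarity are assumptions made for this example. With tensors, the batch size bounds how large the query-by-node similarity matrix gets at any one time; plain lists are used here only to keep the example runnable anywhere.

```python
import heapq
import math
from typing import Dict, Iterator, List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def batches(items: List[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def rank_nodes(
    queries: Dict[int, Vector],
    nodes: Dict[int, Vector],
    k: int,
    batch_size: int,
) -> Dict[int, List[int]]:
    """Return the top-k node IDs per query, processing queries in batches."""
    rankings: Dict[int, List[int]] = {}
    for batch in batches(list(queries), batch_size):
        for qid in batch:
            scored = ((cosine(queries[qid], vec), nid) for nid, vec in nodes.items())
            # Keep only the k best (score, node_id) pairs for this query.
            rankings[qid] = [nid for _, nid in heapq.nlargest(k, scored)]
    return rankings


# Toy embeddings (illustrative values only).
nodes = {10: [1.0, 0.0], 11: [0.0, 1.0], 12: [0.7, 0.7]}
queries = {0: [0.9, 0.1], 1: [0.1, 0.9]}

rankings = rank_nodes(queries, nodes, k=2, batch_size=1)
print(rankings)  # {0: [10, 12], 1: [11, 12]}
```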
## Memory-Optimized Component Architecture
```mermaid
graph TD
subgraph "UI Layer"
A1[File Upload]
A2[Configuration Settings]
A3[Results Display]
A4[Chat Interface]
A5[Demo & Samples]
end
subgraph "Agent Layer"
B1[Main Agent]
B2[StarkQA Agent]
B3[Enrichment Agent]
B4[Query Agent]
end
subgraph "Tool Layer"
C1[Evaluation Tool]
C2[Knowledge Graph Generator]
C3[Metrics Calculator]
C4[Feature Generator]
C5[Query Processor]
end
subgraph "Computation Layer"
D1[Local Processing]
D2[Remote GPT-4o-mini]
D3[Memory Optimization]
end
subgraph "Data Layer"
E1[Query Embeddings]
E2[Node Embeddings]
E3[Shared State]
E4[Visualization Cache]
end
A1 --> B1
A2 --> B1
B1 --> B2
B1 --> B3
B1 --> B4
B2 --> C1
B2 --> C2
B2 --> C3
B3 --> C4
B4 --> C5
C1 --> D1
C1 --> D2
C2 --> D1
C3 --> D1
C3 --> D2
C4 --> D1
C5 --> D1
D1 --> D3
D3 --> E1
D3 --> E2
D1 --> E3
D2 --> E3
E3 --> E4
E4 --> A3
E4 --> A4
E3 --> A5
%% Styling
classDef ui fill:#9cf,stroke:#333,stroke-width:2px
classDef agent fill:#fcf,stroke:#333,stroke-width:2px
classDef tool fill:#ff9,stroke:#333,stroke-width:2px
classDef compute fill:#f99,stroke:#333,stroke-width:2px
classDef data fill:#cfc,stroke:#333,stroke-width:2px
class A1,A2,A3,A4,A5 ui
class B1,B2,B3,B4 agent
class C1,C2,C3,C4,C5 tool
class D1,D2,D3 compute
class E1,E2,E3,E4 data
```
## 🚀 Getting Started
### Prerequisites
- Python 3.10+
- LangChain/LangGraph
- OpenAI API key (for GPT-4o-mini)
- PyTorch
- Streamlit
### Installation
```bash
# Clone the repository
git clone https://github.com/ansh-info/stark-agent
cd stark-agent
# Create and activate virtual environment
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install dependencies
pip install -r requirements.txt
# Set up your OpenAI API key
export OPENAI_API_KEY=your_api_key_here # On Windows: set OPENAI_API_KEY=your_api_key_here
```
### Running the Application
```bash
# Start the Streamlit app
streamlit run ui/demo_app.py
```
## 📝 Usage
1. **Upload Files**:
- Query Embeddings (parquet format)
- Node Embeddings (parquet format)
2. **Configure Settings**:
- Batch Size: Number of queries to process at once
- Processing Type: Local or Remote (GPT-4o-mini)
- Evaluation Split: Data split to evaluate on
- Model Selection: Choose LLM for analysis
3. **Run Evaluation**:
- Click "Run Evaluation" to start the process
- Monitor progress in the UI
4. **Analyze Results**:
- View metrics visualization
- Explore the knowledge graph
- Ask questions in natural language
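The settings from step 2 can be thought of as one small, validated configuration object. This is only a sketch: the class name, field names, and default values are hypothetical and are not taken from `config/config.py`.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvaluationSettings:
    """Hypothetical container for the options chosen in the UI."""

    batch_size: int = 256            # queries processed at once (placeholder default)
    processing_type: str = "local"   # "local" or "remote"
    split: str = "test"              # placeholder name for the evaluation split
    model: str = "gpt-4o-mini"       # LLM used for analysis

    def __post_init__(self) -> None:
        # Reject values that no evaluation run could use.
        if self.batch_size <= 0:
            raise ValueError("batch_size must be a positive integer")
        if self.processing_type not in ("local", "remote"):
            raise ValueError("processing_type must be 'local' or 'remote'")


settings = EvaluationSettings(batch_size=128, processing_type="remote")
print(settings)
```

Validating at construction time means a bad value fails before any embeddings are loaded, rather than partway through a long evaluation.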
## 📈 Evaluation Metrics
The system calculates and displays the following metrics:
| Metric | Description |
| --------------- | ----------------------------------------------------------------------------- |
| **MRR** | Mean Reciprocal Rank - Average of 1/rank of the first correct answer for each query |
| **MAP** | Mean Average Precision - Mean over queries of the precision averaged at each relevant item's rank |
| **R-Precision** | Precision at rank R, where R is the number of relevant items for the query |
| **Recall@K** | Proportion of relevant items found in top K results |
| **Hit@K** | Whether any relevant item appears in top K results |
| **NDCG** | Normalized Discounted Cumulative Gain - Evaluates ranking quality |
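For reference, the per-query form of each metric in the table can be written in a few lines of plain Python. This is a minimal sketch using the standard textbook definitions with binary relevance, not the project's implementation (the acknowledgements mention TorchMetrics); the function names are chosen for this example, and each ranked list is assumed to contain no duplicate IDs. MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all queries.

```python
import math
from typing import List, Set


def reciprocal_rank(ranked: List[int], relevant: Set[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def hit_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """1.0 if any relevant item is in the top k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked[:k]) else 0.0


def recall_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Fraction of relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def average_precision(ranked: List[int], relevant: Set[int]) -> float:
    """Mean of precision@rank taken at the rank of each relevant item."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)


def r_precision(ranked: List[int], relevant: Set[int]) -> float:
    """Precision at rank R, where R is the number of relevant items."""
    r = len(relevant)
    return len(set(ranked[:r]) & relevant) / r if r else 0.0


def ndcg_at_k(ranked: List[int], relevant: Set[int], k: int) -> float:
    """Binary-relevance NDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 1)
        for position, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(p + 1) for p in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0


# Worked example: relevant items 3 and 5 appear at ranks 2 and 5.
ranked, relevant = [7, 3, 9, 1, 5], {3, 5}
print(reciprocal_rank(ranked, relevant))    # 1/2
print(recall_at_k(ranked, relevant, k=3))   # 1 of 2 relevant items
print(average_precision(ranked, relevant))  # (1/2 + 2/5) / 2
print(r_precision(ranked, relevant))        # 1 hit in the top 2
print(ndcg_at_k(ranked, relevant, k=5))
```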
## Evaluation Process
```mermaid
graph TD
A[Start Evaluation] --> B1{Local or Remote?}
B1 -->|Local| C1[Load Parquet Files]
B1 -->|Remote| C2[Encode Files to Base64]
C1 --> D1[Parse Answer IDs]
C2 --> D2[Send to GPT-4o-mini]
D1 --> E1[Convert to Tensors]
D2 --> E2[Await Remote Response]
E1 --> F1[Compute Similarities in Chunks]
F1 --> G1[Calculate Memory-Optimized Metrics]
E2 --> G2[Process Remote Results]
G1 --> H1[Store Local Results]
G2 --> H2[Store Remote Results]
H1 --> I[Generate Visualizations]
H2 --> I
I --> J1[Interactive Knowledge Graph]
I --> J2[Metric Dashboards]
I --> J3[Performance Reports]
J1 --> K[End Evaluation]
J2 --> K
J3 --> K
%% Styling
classDef start fill:#9f9,stroke:#333,stroke-width:2px
classDef decision fill:#fcf,stroke:#333,stroke-width:2px
classDef local fill:#9cf,stroke:#333,stroke-width:1px
classDef remote fill:#f99,stroke:#333,stroke-width:1px
classDef compute fill:#fc9,stroke:#333,stroke-width:1px
classDef output fill:#cfc,stroke:#333,stroke-width:1px
classDef final fill:#f96,stroke:#333,stroke-width:2px
class A start
class B1 decision
class C1,D1,E1,F1,G1,H1 local
class C2,D2,E2,G2,H2 remote
class I compute
class J1,J2,J3 output
class K final
```
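The "Encode Files to Base64" step on the remote branch exists because raw parquet bytes cannot be placed directly in a JSON request body. A minimal sketch of that encoding and its inverse follows; the function names and the `{"files": ...}` payload shape are hypothetical, and no actual API call is made.

```python
import base64
import json


def build_remote_payload(files: dict[str, bytes]) -> str:
    """Encode raw file bytes as base64 text so they fit in a JSON body."""
    encoded = {
        name: base64.b64encode(data).decode("ascii")
        for name, data in files.items()
    }
    return json.dumps({"files": encoded})


def read_remote_payload(payload: str) -> dict[str, bytes]:
    """Inverse of build_remote_payload: recover the original bytes."""
    return {
        name: base64.b64decode(text)
        for name, text in json.loads(payload)["files"].items()
    }


# Stand-in bytes; a real run would read the uploaded parquet files.
original = {"query_embeddings.parquet": b"\x00\x01binary\xff"}
payload = build_remote_payload(original)
restored = read_remote_payload(payload)
print(restored == original)  # True
```

Base64 output is roughly one third larger than its input, so large embedding files become noticeably heavier on the remote path.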
## Advanced Memory Optimization Strategy
```mermaid
graph TD
A[Large Embedding Data] --> B{Size > Memory?}
B -->|No| C1[Direct Processing]
B -->|Yes| C2[Memory-Optimized Flow]
C2 --> D[Batched Processing Strategy]
D --> E1[Chunking]
D --> E2[Streaming]
D --> E3[Checkpointing]
E1 --> F1[Process in Chunks]
E1 --> F2[Size: 1000 queries/chunk]
E2 --> G1[Stream Results]
E2 --> G2[Progressive UI Updates]
E3 --> H1[Save Intermediate Results]
E3 --> H2[Resume Capability]
F1 --> I[Merge Chunk Results]
G1 --> I
H1 --> I
C1 --> J[Calculate Final Metrics]
I --> J
J --> K[Store in Shared State]
K --> L[Visualization & Report]
%% Styling
classDef data fill:#9cf,stroke:#333,stroke-width:2px
classDef decision fill:#fcf,stroke:#333,stroke-width:2px
classDef strategy fill:#f99,stroke:#333,stroke-width:2px
classDef technique fill:#cfc,stroke:#333,stroke-width:1px
classDef detail fill:#ff9,stroke:#333,stroke-width:1px
classDef process fill:#fc9,stroke:#333,stroke-width:2px
classDef output fill:#9f9,stroke:#333,stroke-width:2px
class A data
class B decision
class C1,C2 strategy
class D strategy
class E1,E2,E3 technique
class F1,F2,G1,G2,H1,H2 detail
class I,J,K process
class L output
```
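The chunking and checkpointing branches of the diagram can be combined in one loop: process a fixed number of queries, save the intermediate results, and skip anything already saved when the run restarts. The sketch below assumes a JSON checkpoint file and a caller-supplied scoring function; `process_with_checkpoints`, `fake_scores`, and the file name are hypothetical, and the default chunk size of 1000 follows the diagram.

```python
import json
import os
import tempfile
from typing import Callable, Dict, List


def process_with_checkpoints(
    query_ids: List[int],
    score_chunk: Callable[[List[int]], Dict[int, float]],
    checkpoint_path: str,
    chunk_size: int = 1000,
) -> Dict[str, float]:
    """Score queries chunk by chunk, saving progress after every chunk."""
    results: Dict[str, float] = {}
    # Resume capability: reload whatever an earlier run already finished.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            results = json.load(fh)

    for start in range(0, len(query_ids), chunk_size):
        # JSON object keys are strings, so IDs are stored as str.
        chunk = [q for q in query_ids[start:start + chunk_size] if str(q) not in results]
        if not chunk:
            continue  # this chunk was completed in an earlier run
        results.update({str(q): s for q, s in score_chunk(chunk).items()})
        # Save intermediate results so a crash loses at most one chunk.
        with open(checkpoint_path, "w") as fh:
            json.dump(results, fh)
    return results


calls: List[int] = []


def fake_scores(chunk: List[int]) -> Dict[int, float]:
    """Placeholder scorer that records how many queries it was given."""
    calls.append(len(chunk))
    return {q: 1.0 / (q + 1) for q in chunk}


path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
first = process_with_checkpoints(list(range(2500)), fake_scores, path)
second = process_with_checkpoints(list(range(2500)), fake_scores, path)
print(calls)  # [1000, 1000, 500]: the second run found nothing left to do
```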
## 🔍 Knowledge Graph Visualization
The interactive knowledge graph visualization provides:
- **Node Representation**: Entities in the knowledge base
- **Edge Connections**: Similarity relationships between nodes
- **Color Coding**: Visual differentiation of node types
- **Interactive Exploration**: Zoom, pan, and click for details
- **Filtering**: Focus on specific node types or relationship strengths
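As a rough illustration of how similarity edges and filtering fit together, the sketch below keeps only node pairs whose similarity meets a minimum strength and then restricts the result to one node type. The data layout, function names, node types, and similarity values are all invented for this example.

```python
from typing import Dict, List, Tuple

Edge = Tuple[int, int, float]


def build_edges(similarity: Dict[Tuple[int, int], float], min_strength: float) -> List[Edge]:
    """Keep pairs at or above the threshold, strongest first."""
    edges = [(a, b, s) for (a, b), s in similarity.items() if s >= min_strength]
    return sorted(edges, key=lambda edge: edge[2], reverse=True)


def filter_by_type(edges: List[Edge], node_types: Dict[int, str], wanted: str) -> List[Edge]:
    """Keep edges whose two endpoints both have the wanted node type."""
    return [
        (a, b, s)
        for a, b, s in edges
        if node_types[a] == wanted and node_types[b] == wanted
    ]


# Illustrative nodes and pairwise similarities.
node_types = {1: "drug", 2: "drug", 3: "disease"}
similarity = {(1, 2): 0.91, (1, 3): 0.40, (2, 3): 0.75}

edges = build_edges(similarity, min_strength=0.7)
print(edges)                                      # [(1, 2, 0.91), (2, 3, 0.75)]
print(filter_by_type(edges, node_types, "drug"))  # [(1, 2, 0.91)]
```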
## 📊 Sample Reports
The system generates comprehensive reports including:
- **Performance Summary**: Key metrics and performance indicators
- **Detailed Analysis**: Breakdown of performance across query types
- **Comparative Metrics**: Comparison against baseline systems
- **Knowledge Graph Insights**: Network analysis of node relationships
- **Recommendations**: Suggestions for system improvement
## 🧩 Project Structure
```
.
├── .env                            # Environment variables
├── agents/
│   ├── __init__.py
│   ├── main_agent.py               # Main supervisory agent
│   └── stark_agent.py              # StarkQA evaluation agent
├── config/
│   └── config.py                   # Configuration settings
├── state/
│   └── shared_state.py             # Shared state management
├── tests/
│   ├── __init__.py
│   └── test_stark_evaluation.py
├── tools/
│   ├── __init__.py
│   └── stark/
│       ├── __init__.py
│       └── evaluation_retrival.py  # Core evaluation tool
├── ui/
│   ├── stark_app.py                # Streamlit interface
│   └── demo_app.py
└── utils/
    ├── __init__.py
    └── llm.py                      # LLM utilities
```
## 🔬 Why STaRK?
STaRK addresses critical challenges in modern retrieval systems:
1. **Semi-structured Data**: Real-world knowledge bases combine structured and unstructured data
2. **Complex Queries**: Users formulate queries involving multiple relationships and text
3. **Domain Diversity**: Different domains require specialized evaluation approaches
### Target Users
- **ML Engineers**: Evaluating retrieval system performance
- **Researchers**: Benchmarking LLM capabilities on complex knowledge bases
- **Product Teams**: Optimizing search and recommendation systems
- **Domain Experts**: Analyzing domain-specific retrieval effectiveness
### Real-world Applications
- **E-commerce**: Improved product search leveraging reviews and specifications
- **Academic Research**: Enhanced literature search across citation networks
- **Healthcare**: Better drug discovery through relationship analysis
- **Enterprise Search**: More effective information retrieval from corporate knowledge bases
## Agent Communication Pattern
```mermaid
sequenceDiagram
participant User
participant UI as Streamlit UI
participant MA as Main Agent
participant SQA as StarkQA Agent
participant ET as Evaluation Tool
participant SS as Shared State
User->>UI: Upload Files & Configure
UI->>MA: Request Evaluation
MA->>SQA: Route Evaluation Request
SQA->>ET: Invoke Tool
ET->>ET: Process Embeddings
Note over ET: Memory-optimized processing
ET->>SS: Store Results
ET->>SQA: Return Results
SQA->>MA: Update Status
MA->>UI: Display Results
UI->>User: Show Visualizations
User->>UI: Ask Question
UI->>MA: Process Query
MA->>SQA: Request Context
SQA->>SS: Retrieve Data
SS->>SQA: Return Context
SQA->>MA: Provide Answer
MA->>UI: Display Answer
UI->>User: Show Response
```
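The two halves of the sequence above meet in the shared state: the evaluation tool writes results once, and later chat questions read them back. A minimal sketch of that hand-off follows. The `SharedState` class, its `set`/`get` methods, and the metric values are hypothetical; the contents of `state/shared_state.py` are not shown in this README.

```python
from threading import Lock
from typing import Any, Dict


class SharedState:
    """Hypothetical key-value store shared by tools and agents."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._lock = Lock()  # guards against concurrent writers

    def set(self, key: str, value: Any) -> None:
        with self._lock:
            self._data[key] = value

    def get(self, key: str, default: Any = None) -> Any:
        with self._lock:
            return self._data.get(key, default)


def evaluation_tool(state: SharedState) -> None:
    # Placeholder numbers, not real benchmark results.
    state.set("metrics", {"mrr": 0.5, "hit@5": 1.0})


def answer_question(state: SharedState, metric_name: str) -> str:
    """Answer a chat question from stored results instead of re-evaluating."""
    metrics = state.get("metrics")
    if metrics is None:
        return "No evaluation has been run yet."
    if metric_name not in metrics:
        return f"{metric_name} was not computed."
    return f"{metric_name} = {metrics[metric_name]:.3f}"


state = SharedState()
print(answer_question(state, "mrr"))  # No evaluation has been run yet.
evaluation_tool(state)
print(answer_question(state, "mrr"))  # mrr = 0.500
```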
## 🤝 Contributing
We welcome contributions! Please feel free to submit a Pull Request.
1. Fork the repository
2. Create your feature branch (`git checkout -b feature/amazing-feature`)
3. Commit your changes (`git commit -m 'Add some amazing feature'`)
4. Push to the branch (`git push origin feature/amazing-feature`)
5. Open a Pull Request
## 📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgements
- STaRK benchmark creators (Stanford SNAP Group)
- LangChain and LangGraph for agent framework
- Streamlit for the UI framework
- PyTorch and TorchMetrics for evaluation metrics
- OpenAI for GPT-4o-mini access
- This project was developed for Team VPE during the Biodatathon - [VirtualPatientEngine/AIAgents4Pharma](https://github.com/VirtualPatientEngine/AIAgents4Pharma)
## 📚 Citation
If you use the STaRK Benchmark Suite in your research or project, please cite:
```bibtex
@software{stark_benchmark,
author = {Kumar, Ansh and Gupta, Apoorva},
title = {STaRK: Benchmarking LLM Retrieval on Semi-structured Knowledge Bases},
url = {https://github.com/ansh-info/stark-agent},
year = {2025},
month = {February},
note = {STaRK Agent: A hierarchical agent system for evaluating LLM retrieval performance}
}
```