https://github.com/ansh-info/stark-agent

STaRK: an agentic AI benchmark designed to evaluate how well LLMs and retrieval systems work with semi-structured knowledge bases.
Host: GitHub
URL: https://github.com/ansh-info/stark-agent
Owner: ansh-info
License: mit
Created: 2025-02-23T08:23:00.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-02-27T23:58:13.000Z (over 1 year ago)
Last Synced: 2025-06-01T01:59:46.179Z (about 1 year ago)
Topics: agentic-ai, agents, gpt-4, graph, information-retrieval, knowledge-base, knowledge-graph, llm, multimodal, nlp, openai, openai-api, semi-structured-data
Language: Python
Homepage: https://stark.stanford.edu/
Size: 2.53 MB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

# STaRK Benchmark Evaluation

A hierarchical agent system for evaluating LLM retrieval performance on semi-structured knowledge bases.

Overview • Features • Architecture • Getting Started • Usage • Metrics • Why STaRK?



## 📋 Overview

STaRK (Semi-structured Text and Relational Knowledge) is a benchmark for evaluating how well large language models (LLMs) and retrieval systems work with semi-structured knowledge bases (SKBs). An SKB combines structured data (e.g., entity relationships) with unstructured data (e.g., textual descriptions), so a single query may involve both relational and textual constraints.

This project implements a hierarchical LLM-powered agent system to process, evaluate, and visualize the performance of retrieval systems on the STaRK benchmark, spanning three key domains:

1. **Product Search**: Detailed product metadata, reviews, and relationships

2. **Academic Paper Search**: Paper citations, author relationships, and content

3. **Precision Medicine**: Drug-disease interactions and clinical trial data

## 🌟 Features

- **Hierarchical Agent System**: Multi-agent architecture with specialized agents for different tasks

- **Comprehensive Evaluation**: Calculate standard retrieval metrics (MRR, MAP, NDCG, Recall@K, etc.)

- **Interactive Knowledge Graph**: Visualize node relationships and similarities

- **Streamlit UI**: User-friendly interface for uploading data, running evaluations, and analyzing results

- **Memory Optimization**: Efficiently process large embedding datasets with chunked computation

- **LLM-powered Analysis**: Natural language interface to query and understand results

- **Data Enrichment**: Tools to enhance datasets with additional features

- **Remote Processing**: Support for both local and remote (GPT-4o-mini) evaluation

## 🧠 Architecture

The system implements a hierarchical agent structure:

### Data Flow Diagram with Remote Processing

```mermaid

graph TD

    A[User] -->|Upload Embeddings| B[Streamlit UI]

    B -->|Configure Settings| B1[Local/Remote Selection]

    B -->|Trigger Evaluation| C[Main Agent]

    C -->|Route Request| D[StarkQA Agent]

    D -->|Local Processing| E1[Local Evaluation Tool]

    D -->|Remote Processing| E2[GPT-4o-mini API]

    E1 -->|Read| F1[Query Embeddings]

    E1 -->|Read| F2[Node Embeddings]

    E1 -->|Process| G1[Local Similarity Computation]

    E2 -->|Send| F3[Base64 Encoded Files]

    E2 -->|Process| G2[Remote Computation]

    G1 -->|Calculate| H1[Local Metrics]

    G2 -->|Calculate| H2[Remote Metrics]

    G1 -->|Generate| I1[Knowledge Graph]

    H1 -->|Store In| J[Shared State]

    H2 -->|Store In| J

    I1 -->|Store In| J

    J -->|Display In| K1[Metrics Visualization]

    J -->|Display In| K2[Graph Visualization]

    J -->|Reference For| K3[Chat Interface]

    K1 -->|Show In| B

    K2 -->|Show In| B

    K3 -->|Show In| B

    %% Styling

    classDef user fill:#f96,stroke:#333,stroke-width:2px

    classDef ui fill:#9cf,stroke:#333,stroke-width:2px

    classDef agent fill:#fcf,stroke:#333,stroke-width:2px

    classDef local fill:#cfc,stroke:#333,stroke-width:2px

    classDef remote fill:#f99,stroke:#333,stroke-width:2px

    classDef data fill:#fc9,stroke:#333,stroke-width:2px

    classDef process fill:#ff9,stroke:#333,stroke-width:2px

    classDef state fill:#c9f,stroke:#333,stroke-width:2px

    classDef output fill:#9c9,stroke:#333,stroke-width:2px

    class A user

    class B,B1 ui

    class C,D agent

    class E1,G1,H1,I1 local

    class E2,G2,H2 remote

    class F1,F2,F3 data

    class J state

    class K1,K2,K3 output

```

### Agents

- **Main Agent**: Supervisory agent that routes user queries and orchestrates tasks

- **StarkQA Agent**: Specialized agent focused on evaluation and metrics

- **Query Agent**: Handles natural language queries against the knowledge graph

- **Enrichment Agent**: Provides data enhancement capabilities

### Tools

- **Evaluation Tool**: Processes embeddings and computes similarities

- **Metrics Tool**: Calculates standard retrieval metrics

- **Visualization Tool**: Generates interactive visualizations

- **Knowledge Graph Tool**: Builds and queries the knowledge graph

- **Feature Generation Tool**: Enhances datasets with additional features
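
The supervisor-and-specialists layout above can be sketched in a few lines. This is a minimal illustration, not the project's implementation: the README describes an LLM-powered supervisor, whereas the keyword rules, the `main_agent` function, and the handler names below are hypothetical stand-ins chosen so the example runs offline.

```python
from typing import Callable, Dict

# Hypothetical sub-agent handlers; each stands in for an LLM-backed agent.
def starkqa_agent(request: str) -> str:
    return f"[StarkQA] evaluating: {request}"

def query_agent(request: str) -> str:
    return f"[Query] searching knowledge graph for: {request}"

def enrichment_agent(request: str) -> str:
    return f"[Enrichment] generating features for: {request}"

# Keyword rules are an assumption; a real supervisor would ask an LLM to route.
ROUTES: Dict[str, Callable[[str], str]] = {
    "evaluate": starkqa_agent,
    "metric": starkqa_agent,
    "enrich": enrichment_agent,
    "feature": enrichment_agent,
}

def main_agent(request: str) -> str:
    """Supervisor: pick a sub-agent for the request, defaulting to the Query Agent."""
    lowered = request.lower()
    for keyword, handler in ROUTES.items():
        if keyword in lowered:
            return handler(request)
    return query_agent(request)

print(main_agent("Evaluate the uploaded embeddings"))
print(main_agent("Which nodes are most similar to paper 42?"))
```

Keeping the routing table separate from the handlers means a new specialist can be added without touching the supervisor logic.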

## 📊 Evaluation Pipeline

1. **Data Loading**: Process query and node embeddings from parquet files

2. **Similarity Computation**: Calculate similarities between queries and nodes using optimized batching

3. **Metrics Calculation**: Compute standard retrieval metrics (MRR, MAP, R-Precision, Recall@K, Hit@K, NDCG)

4. **Knowledge Graph Generation**: Create interactive visualizations of node relationships

5. **Result Analysis**: Provide natural language insights about evaluation results
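
Step 2 of the pipeline can be sketched as follows. The README describes PyTorch tensors; NumPy is swapped in here to keep the example small, and the function name `chunked_top_k`, its parameters, and the synthetic data are all assumptions. The sketch also assumes no embedding is the zero vector.

```python
import numpy as np

def chunked_top_k(query_emb, node_emb, k=10, chunk_size=1000):
    """Return the indices of the k most similar nodes for every query.

    Cosine similarity is computed one chunk of queries at a time, so only a
    (chunk_size x num_nodes) score matrix is held in memory at once.
    """
    # Normalise rows so that a dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    n = node_emb / np.linalg.norm(node_emb, axis=1, keepdims=True)
    k = min(k, n.shape[0])

    top_k = []
    for start in range(0, q.shape[0], chunk_size):
        scores = q[start:start + chunk_size] @ n.T
        # argsort on negated scores orders each row from most to least similar.
        top_k.append(np.argsort(-scores, axis=1)[:, :k])
    return np.vstack(top_k)

# Synthetic data: three queries that are near-copies of nodes 3, 17 and 42.
rng = np.random.default_rng(0)
nodes = rng.normal(size=(50, 8))
queries = nodes[[3, 17, 42]] + 0.01 * rng.normal(size=(3, 8))
ranked = chunked_top_k(queries, nodes, k=5, chunk_size=2)
print(ranked[:, 0])  # top-1 node index for each query
```

A full sort per row is used for clarity; for large node sets, a partial selection such as `np.argpartition` would avoid sorting every score.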

## Memory-Optimized Component Architecture

```mermaid

graph TD

    subgraph "UI Layer"

        A1[File Upload]

        A2[Configuration Settings]

        A3[Results Display]

        A4[Chat Interface]

        A5[Demo & Samples]

    end

    subgraph "Agent Layer"

        B1[Main Agent]

        B2[StarkQA Agent]

        B3[Enrichment Agent]

        B4[Query Agent]

    end

    subgraph "Tool Layer"

        C1[Evaluation Tool]

        C2[Knowledge Graph Generator]

        C3[Metrics Calculator]

        C4[Feature Generator]

        C5[Query Processor]

    end

    subgraph "Computation Layer"

        D1[Local Processing]

        D2[Remote GPT-4o-mini]

        D3[Memory Optimization]

    end

    subgraph "Data Layer"

        E1[Query Embeddings]

        E2[Node Embeddings]

        E3[Shared State]

        E4[Visualization Cache]

    end

    A1 --> B1

    A2 --> B1

    B1 --> B2

    B1 --> B3

    B1 --> B4

    B2 --> C1

    B2 --> C2

    B2 --> C3

    B3 --> C4

    B4 --> C5

    C1 --> D1

    C1 --> D2

    C2 --> D1

    C3 --> D1

    C3 --> D2

    C4 --> D1

    C5 --> D1

    D1 --> D3

    D3 --> E1

    D3 --> E2

    D1 --> E3

    D2 --> E3

    E3 --> E4

    E4 --> A3

    E4 --> A4

    E3 --> A5

    %% Styling

    classDef ui fill:#9cf,stroke:#333,stroke-width:2px

    classDef agent fill:#fcf,stroke:#333,stroke-width:2px

    classDef tool fill:#ff9,stroke:#333,stroke-width:2px

    classDef compute fill:#f99,stroke:#333,stroke-width:2px

    classDef data fill:#cfc,stroke:#333,stroke-width:2px

    class A1,A2,A3,A4,A5 ui

    class B1,B2,B3,B4 agent

    class C1,C2,C3,C4,C5 tool

    class D1,D2,D3 compute

    class E1,E2,E3,E4 data

```

## 🚀 Getting Started

### Prerequisites

- Python 3.10+

- LangChain/LangGraph

- OpenAI API key (for GPT-4o-mini)

- PyTorch

- Streamlit

### Installation

```bash
# Clone the repository
git clone https://github.com/ansh-info/stark-agent
cd stark-agent

# Create and activate a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

# Set up your OpenAI API key
export OPENAI_API_KEY=your_api_key_here  # On Windows: set OPENAI_API_KEY=your_api_key_here
```

### Running the Application

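```bash
# Start the Streamlit app
# (the Project Structure section lists demo_app.py under ui/; use ui/demo_app.py if app/ is absent)
streamlit run app/demo_app.py
```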

## 📝 Usage

1. **Upload Files**:

   - Query Embeddings (parquet format)

   - Node Embeddings (parquet format)

2. **Configure Settings**:

   - Batch Size: Number of queries to process at once

   - Processing Type: Local or Remote (GPT-4o-mini)

   - Evaluation Split: Data split to evaluate on

   - Model Selection: Choose LLM for analysis

3. **Run Evaluation**:

   - Click "Run Evaluation" to start the process

   - Monitor progress in the UI

4. **Analyze Results**:

   - View metrics visualization

   - Explore the knowledge graph

   - Ask questions in natural language
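
The settings in step 2 could be grouped into a small validated container. This is only a sketch: the class `EvaluationSettings`, its field names, and the default values are hypothetical placeholders, not the project's actual configuration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvaluationSettings:
    """Hypothetical container for the options listed under Configure Settings."""
    batch_size: int = 256              # placeholder default
    processing_type: str = "local"     # "local" or "remote"
    split: str = "test"                # placeholder default
    model: str = "gpt-4o-mini"

    def __post_init__(self):
        # Reject invalid values early, before an evaluation run starts.
        if self.batch_size <= 0:
            raise ValueError("batch_size must be a positive integer")
        if self.processing_type not in ("local", "remote"):
            raise ValueError("processing_type must be 'local' or 'remote'")

settings = EvaluationSettings(batch_size=128, processing_type="remote")
print(settings)
```

Making the dataclass frozen keeps a run's configuration immutable once evaluation has started.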

## 📈 Evaluation Metrics

The system calculates and displays the following metrics:

| Metric          | Description                                                                              |
| --------------- | ---------------------------------------------------------------------------------------- |
| **MRR**         | Mean Reciprocal Rank: average over queries of 1 / rank of the first relevant result      |
| **MAP**         | Mean Average Precision: average over queries of the precision at each relevant result    |
| **R-Precision** | Precision at rank R, where R is the number of relevant items for the query               |
| **Recall@K**    | Proportion of relevant items found in the top K results                                  |
| **Hit@K**       | Whether any relevant item appears in the top K results                                   |
| **NDCG**        | Normalized Discounted Cumulative Gain: rank-discounted gain relative to an ideal ranking |
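
For reference, the per-query form of each metric can be written in plain Python. This is a sketch under stated assumptions: relevance is binary, the ranked list contains no duplicates, and the function names are illustrative rather than taken from the project (which credits TorchMetrics). MRR and MAP are the means of `reciprocal_rank` and `average_precision` over all queries.

```python
import math
from typing import Sequence, Set

def reciprocal_rank(ranked: Sequence[int], relevant: Set[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranked: Sequence[int], relevant: Set[int]) -> float:
    """Mean of precision@rank taken at each rank that holds a relevant item."""
    hits, total = 0, 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0

def r_precision(ranked: Sequence[int], relevant: Set[int]) -> float:
    """Precision within the top R results, R being the number of relevant items."""
    r = len(relevant)
    return len(set(ranked[:r]) & relevant) / r if r else 0.0

def recall_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

def hit_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    return 1.0 if set(ranked[:k]) & relevant else 0.0

def ndcg_at_k(ranked: Sequence[int], relevant: Set[int], k: int) -> float:
    """Binary-gain NDCG: DCG of the ranking divided by DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(ranked[:k], start=1)
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(rank + 1) for rank in range(1, min(len(relevant), k) + 1))
    return dcg / ideal if ideal else 0.0

# One query: the relevant items 2 and 4 sit at ranks 2 and 4.
ranked = [7, 2, 9, 4, 1]
relevant = {2, 4}
print(reciprocal_rank(ranked, relevant))    # 0.5
print(average_precision(ranked, relevant))  # 0.5
print(recall_at_k(ranked, relevant, k=3))   # 0.5
print(hit_at_k(ranked, relevant, k=1))      # 0.0
```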

## Evaluation Process

```mermaid

graph TD

    A[Start Evaluation] --> B1{Local or Remote?}

    B1 -->|Local| C1[Load Parquet Files]

    B1 -->|Remote| C2[Encode Files to Base64]

    C1 --> D1[Parse Answer IDs]

    C2 --> D2[Send to GPT-4o-mini]

    D1 --> E1[Convert to Tensors]

    D2 --> E2[Await Remote Response]

    E1 --> F1[Compute Similarities in Chunks]

    F1 --> G1[Calculate Memory-Optimized Metrics]

    E2 --> G2[Process Remote Results]

    G1 --> H1[Store Local Results]

    G2 --> H2[Store Remote Results]

    H1 --> I[Generate Visualizations]

    H2 --> I

    I --> J1[Interactive Knowledge Graph]

    I --> J2[Metric Dashboards]

    I --> J3[Performance Reports]

    J1 --> K[End Evaluation]

    J2 --> K

    J3 --> K

    %% Styling

    classDef start fill:#9f9,stroke:#333,stroke-width:2px

    classDef decision fill:#fcf,stroke:#333,stroke-width:2px

    classDef local fill:#9cf,stroke:#333,stroke-width:1px

    classDef remote fill:#f99,stroke:#333,stroke-width:1px

    classDef compute fill:#fc9,stroke:#333,stroke-width:1px

    classDef output fill:#cfc,stroke:#333,stroke-width:1px

    classDef final fill:#f96,stroke:#333,stroke-width:2px

    class A start

    class B1 decision

    class C1,D1,E1,F1,G1,H1 local

    class C2,D2,E2,G2,H2 remote

    class I compute

    class J1,J2,J3 output

    class K final

```
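
The "Encode Files to Base64" step on the remote branch can be illustrated with the standard library. The README does not describe how the request is packaged, so the payload field names and both helper functions below are assumptions.

```python
import base64
import json

def encode_file_payload(name: str, raw: bytes) -> dict:
    """Wrap raw file bytes in a JSON-serialisable dict using Base64 text."""
    return {"filename": name, "content_b64": base64.b64encode(raw).decode("ascii")}

def decode_file_payload(payload: dict) -> bytes:
    """Recover the original bytes from a payload built by encode_file_payload."""
    return base64.b64decode(payload["content_b64"])

# Stand-in bytes; in practice these would come from open(path, "rb").read().
raw = b"\x00\x01 example binary content \xff"
payload = encode_file_payload("query_embeddings.parquet", raw)
message = json.dumps(payload)  # plain text, safe to embed in a request body

# Round trip: decoding the transmitted message yields the original bytes.
restored = decode_file_payload(json.loads(message))
print(restored == raw)
```

Base64 output is about one third larger than the input, which is worth keeping in mind before sending large embedding files to a remote service.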

## Advanced Memory Optimization Strategy

```mermaid

graph TD

    A[Large Embedding Data] --> B{Size > Memory?}

    B -->|No| C1[Direct Processing]

    B -->|Yes| C2[Memory-Optimized Flow]

    C2 --> D[Batched Processing Strategy]

    D --> E1[Chunking]

    D --> E2[Streaming]

    D --> E3[Checkpointing]

    E1 --> F1[Process in Chunks]

    E1 --> F2[Size: 1000 queries/chunk]

    E2 --> G1[Stream Results]

    E2 --> G2[Progressive UI Updates]

    E3 --> H1[Save Intermediate Results]

    E3 --> H2[Resume Capability]

    F1 --> I[Merge Chunk Results]

    G1 --> I

    H1 --> I

    C1 --> J[Calculate Final Metrics]

    I --> J

    J --> K[Store in Shared State]

    K --> L[Visualization & Report]

    %% Styling

    classDef data fill:#9cf,stroke:#333,stroke-width:2px

    classDef decision fill:#fcf,stroke:#333,stroke-width:2px

    classDef strategy fill:#f99,stroke:#333,stroke-width:2px

    classDef technique fill:#cfc,stroke:#333,stroke-width:1px

    classDef detail fill:#ff9,stroke:#333,stroke-width:1px

    classDef process fill:#fc9,stroke:#333,stroke-width:2px

    classDef output fill:#9f9,stroke:#333,stroke-width:2px

    class A data

    class B decision

    class C1,C2 strategy

    class D strategy

    class E1,E2,E3 technique

    class F1,F2,G1,G2,H1,H2 detail

    class I,J,K process

    class L output

```
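
The chunking and checkpointing branches of the diagram can be combined in one short sketch. It assumes results are JSON-serialisable and uses the 1000-item chunk size shown above; the function `process_in_chunks` and the checkpoint file layout are hypothetical.

```python
import json
import os
import tempfile

def process_in_chunks(items, process_chunk, checkpoint_path, chunk_size=1000):
    """Process items chunk by chunk, saving results so a rerun can resume."""
    results = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as fh:
            results = json.load(fh)  # one entry per completed chunk

    for index, start in enumerate(range(0, len(items), chunk_size)):
        if index < len(results):
            continue  # this chunk was finished in an earlier run
        results.append(process_chunk(items[start:start + chunk_size]))
        with open(checkpoint_path, "w") as fh:
            json.dump(results, fh)  # checkpoint after every chunk

    # Merge step: flatten the per-chunk outputs into a single list.
    return [value for chunk in results for value in chunk]

calls = []  # records the size of every chunk that was actually processed

def square_chunk(chunk):
    calls.append(len(chunk))
    return [x * x for x in chunk]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "checkpoint.json")
    first = process_in_chunks(list(range(2500)), square_chunk, path)
    second = process_in_chunks(list(range(2500)), square_chunk, path)  # resumes from file

print(calls)  # [1000, 1000, 500]: the second call did no new work
```

Rewriting the whole checkpoint after each chunk is simple but grows with the data; an append-only file per chunk would scale better for very large runs.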

## 🔍 Knowledge Graph Visualization

The interactive knowledge graph visualization provides:

- **Node Representation**: Entities in the knowledge base

- **Edge Connections**: Similarity relationships between nodes

- **Color Coding**: Visual differentiation of node types

- **Interactive Exploration**: Zoom, pan, and click for details

- **Filtering**: Focus on specific node types or relationship strengths
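
Similarity edges with a strength filter can be derived from embeddings as sketched below. The node IDs, the vectors, the 0.8 threshold, and the function names are illustrative assumptions, and the pairwise loop is quadratic in the number of nodes, so it suits small graphs only.

```python
import math
from itertools import combinations

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def build_similarity_edges(embeddings, threshold=0.8):
    """Return (node_a, node_b, similarity) for every pair at or above threshold."""
    edges = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(embeddings.items(), 2):
        sim = cosine(vec_a, vec_b)
        if sim >= threshold:
            edges.append((id_a, id_b, round(sim, 3)))
    return edges

nodes = {
    "paper_1": [1.0, 0.0],
    "paper_2": [0.9, 0.1],
    "drug_7": [0.0, 1.0],
}
edges = build_similarity_edges(nodes, threshold=0.8)
print(edges)  # [('paper_1', 'paper_2', 0.994)]
```

Raising the threshold corresponds to the "relationship strengths" filter: fewer, stronger edges remain in the graph.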

## 📊 Sample Reports

The system generates comprehensive reports including:

- **Performance Summary**: Key metrics and performance indicators

- **Detailed Analysis**: Breakdown of performance across query types

- **Comparative Metrics**: Comparison against baseline systems

- **Knowledge Graph Insights**: Network analysis of node relationships

- **Recommendations**: Suggestions for system improvement

## 🧩 Project Structure

```
.
├── .env                  # Environment variables
├── agents/
│   ├── __init__.py
│   ├── main_agent.py     # Main supervisory agent
│   └── stark_agent.py    # StarkQA evaluation agent
├── config/
│   └── config.py         # Configuration settings
├── state/
│   └── shared_state.py   # Shared state management
├── tests/
│   ├── __init__.py
│   └── test_stark_evaluation.py
├── tools/
│   ├── __init__.py
│   └── stark/
│       ├── __init__.py
│       └── evaluation_retrival.py  # Core evaluation tool
├── ui/
│   ├── stark_app.py      # Streamlit interface
│   └── demo_app.py
└── utils/
    ├── __init__.py
    └── llm.py            # LLM utilities
```

## 🔬 Why STaRK?

STaRK addresses critical challenges in modern retrieval systems:

1. **Semi-structured Data**: Real-world knowledge bases combine structured and unstructured data

2. **Complex Queries**: Users formulate queries involving multiple relationships and text

3. **Domain Diversity**: Different domains require specialized evaluation approaches

### Target Users

- **ML Engineers**: Evaluating retrieval system performance

- **Researchers**: Benchmarking LLM capabilities on complex knowledge bases

- **Product Teams**: Optimizing search and recommendation systems

- **Domain Experts**: Analyzing domain-specific retrieval effectiveness

### Real-world Applications

- **E-commerce**: Improved product search leveraging reviews and specifications

- **Academic Research**: Enhanced literature search across citation networks

- **Healthcare**: Better drug discovery through relationship analysis

- **Enterprise Search**: More effective information retrieval from corporate knowledge bases

## Agent Communication Pattern

```mermaid

sequenceDiagram

    participant User

    participant UI as Streamlit UI

    participant MA as Main Agent

    participant SQA as StarkQA Agent

    participant ET as Evaluation Tool

    participant SS as Shared State

    User->>UI: Upload Files & Configure

    UI->>MA: Request Evaluation

    MA->>SQA: Route Evaluation Request

    SQA->>ET: Invoke Tool

    ET->>ET: Process Embeddings

    Note over ET: Memory-optimized processing

    ET->>SS: Store Results

    ET->>SQA: Return Results

    SQA->>MA: Update Status

    MA->>UI: Display Results

    UI->>User: Show Visualizations

    User->>UI: Ask Question

    UI->>MA: Process Query

    MA->>SQA: Request Context

    SQA->>SS: Retrieve Data

    SS->>SQA: Return Context

    SQA->>MA: Provide Answer

    MA->>UI: Display Answer

    UI->>User: Show Response

```
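
The role of Shared State in the sequence above can be sketched as a small key-value store that the tool writes to and the agent later reads from. The `SharedState` class, both functions, and the metric values are hypothetical placeholders, not the contents of `state/shared_state.py` and not real results.

```python
import threading
from typing import Any, Dict, Optional

class SharedState:
    """Hypothetical thread-safe key-value store shared by tools and agents."""

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}
        self._lock = threading.Lock()

    def set(self, key: str, value: Any) -> None:
        with self._lock:
            self._data[key] = value

    def get(self, key: str, default: Optional[Any] = None) -> Any:
        with self._lock:
            return self._data.get(key, default)

def evaluation_tool(state: SharedState) -> dict:
    metrics = {"mrr": 0.41, "hit@1": 0.30}  # placeholder values, not real results
    state.set("metrics", metrics)           # ET ->> SS: Store Results
    return metrics                          # ET ->> SQA: Return Results

def answer_question(state: SharedState, question: str) -> str:
    metrics = state.get("metrics")          # SQA ->> SS: Retrieve Data
    if metrics is None:
        return "No evaluation has been run yet."
    return f"{question} -> MRR is {metrics['mrr']:.2f}"

state = SharedState()
print(answer_question(state, "What is the MRR?"))  # nothing stored yet
evaluation_tool(state)
print(answer_question(state, "What is the MRR?"))  # answered from stored metrics
```

Because answers are built from stored results, follow-up questions do not require the evaluation to be rerun.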

## 🤝 Contributing

We welcome contributions! Please feel free to submit a Pull Request.

1. Fork the repository

2. Create your feature branch (`git checkout -b feature/amazing-feature`)

3. Commit your changes (`git commit -m 'Add some amazing feature'`)

4. Push to the branch (`git push origin feature/amazing-feature`)

5. Open a Pull Request

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 🙏 Acknowledgements

- STaRK benchmark creators (Stanford SNAP Group)

- LangChain and LangGraph for agent framework

- Streamlit for the UI framework

- PyTorch and TorchMetrics for evaluation metrics

- OpenAI for GPT-4o-mini access

- This project was developed for Team VPE during the Biodatathon - [VirtualPatientEngine/AIAgents4Pharma](https://github.com/VirtualPatientEngine/AIAgents4Pharma)

## 📚 Citation

If you use the STaRK Benchmark Suite in your research or project, please cite:

```bibtex
@software{stark_benchmark,
  author = {Kumar, Ansh and Gupta, Apoorva},
  title = {STaRK: Benchmarking LLM Retrieval on Semi-structured Knowledge Bases},
  url = {https://github.com/ansh-info/stark-agent},
  year = {2025},
  month = {February},
  note = {STaRK Agent: A hierarchical agent system for evaluating LLM retrieval performance}
}
```