{"id":25904063,"url":"https://github.com/ansh-info/stark-agent","last_synced_at":"2026-04-29T16:38:00.158Z","repository":{"id":279026985,"uuid":"937509511","full_name":"ansh-info/stark-agent","owner":"ansh-info","description":"STaRK: Agentic AI benchmark, which is designed to evaluate how well LLMs and retrieval systems work with semi-structured knowledge bases.","archived":false,"fork":false,"pushed_at":"2025-02-27T23:58:13.000Z","size":2650,"stargazers_count":0,"open_issues_count":1,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-06-01T01:59:46.179Z","etag":null,"topics":["agentic-ai","agents","gpt-4","graph","information-retrieval","knowledge-base","knowledge-graph","llm","multimodal","nlp","openai","openai-api","semi-structured-data"],"latest_commit_sha":null,"homepage":"https://stark.stanford.edu/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ansh-info.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-02-23T08:23:00.000Z","updated_at":"2025-02-27T23:58:16.000Z","dependencies_parsed_at":null,"dependency_job_id":"50b91c95-02f5-499c-8549-19d8005c43e2","html_url":"https://github.com/ansh-info/stark-agent","commit_stats":null,"previous_names":["ansh-info/stark-agent"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ansh-info%2Fstark-agent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ansh-info%2Fstark-agent/tags","releases_url":"https://repos.ec
osyste.ms/api/v1/hosts/GitHub/repositories/ansh-info%2Fstark-agent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ansh-info%2Fstark-agent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ansh-info","download_url":"https://codeload.github.com/ansh-info/stark-agent/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ansh-info%2Fstark-agent/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":258567350,"owners_count":22721618,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agentic-ai","agents","gpt-4","graph","information-retrieval","knowledge-base","knowledge-graph","llm","multimodal","nlp","openai","openai-api","semi-structured-data"],"created_at":"2025-03-03T04:17:06.224Z","updated_at":"2026-04-29T16:38:00.124Z","avatar_url":"https://github.com/ansh-info.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# STaRK Benchmark Evaluation\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"https://stark.stanford.edu/images/logo.png\" alt=\"STaRK Logo\" width=\"200\"/\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003cstrong\u003eA hierarchical agent system for evaluating LLM retrieval performance on semi-structured knowledge bases.\u003c/strong\u003e\n\u003c/p\u003e\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"#overview\"\u003eOverview\u003c/a\u003e •\n  \u003ca href=\"#features\"\u003eFeatures\u003c/a\u003e •\n  \u003ca 
href=\"#architecture\"\u003eArchitecture\u003c/a\u003e •\n  \u003ca href=\"#getting-started\"\u003eGetting Started\u003c/a\u003e •\n  \u003ca href=\"#usage\"\u003eUsage\u003c/a\u003e •\n  \u003ca href=\"#evaluation-metrics\"\u003eMetrics\u003c/a\u003e •\n  \u003ca href=\"#why-stark\"\u003eWhy STaRK?\u003c/a\u003e\n\u003c/p\u003e\n\n## 📋 Overview\n\nSTaRK (Semi-structured Text and Relational Knowledge) is a comprehensive benchmark designed to evaluate how well large language models (LLMs) and retrieval systems work with semi-structured knowledge bases (SKBs). These knowledge bases combine structured data (e.g., entity relationships) with unstructured data (e.g., textual descriptions), representing real-world knowledge complexity.\n\nThis project implements a hierarchical LLM-powered agent system to process, evaluate, and visualize the performance of retrieval systems on the STaRK benchmark, spanning three key domains:\n\n1. **Product Search**: Detailed product metadata, reviews, and relationships\n2. **Academic Paper Search**: Paper citations, author relationships, and content\n3. 
**Precision Medicine**: Drug-disease interactions and clinical trial data\n\n## 🌟 Features\n\n- **Hierarchical Agent System**: Multi-agent architecture with specialized agents for different tasks\n- **Comprehensive Evaluation**: Calculate standard retrieval metrics (MRR, MAP, NDCG, Recall@K, etc.)\n- **Interactive Knowledge Graph**: Visualize node relationships and similarities\n- **Streamlit UI**: User-friendly interface for uploading data, running evaluations, and analyzing results\n- **Memory Optimization**: Efficiently process large embedding datasets with chunked computation\n- **LLM-powered Analysis**: Natural language interface to query and understand results\n- **Data Enrichment**: Tools to enhance datasets with additional features\n- **Remote Processing**: Support for both local and remote (GPT-4o-mini) evaluation\n\n## 🧠 Architecture\n\nThe system implements a hierarchical agent structure:\n\n### Data Flow Diagram with Remote Processing\n\n```mermaid\ngraph TD\n    A[User] --\u003e|Upload Embeddings| B[Streamlit UI]\n    B --\u003e|Configure Settings| B1[Local/Remote Selection]\n    B --\u003e|Trigger Evaluation| C[Main Agent]\n\n    C --\u003e|Route Request| D[StarkQA Agent]\n\n    D --\u003e|Local Processing| E1[Local Evaluation Tool]\n    D --\u003e|Remote Processing| E2[GPT-4o-mini API]\n\n    E1 --\u003e|Read| F1[Query Embeddings]\n    E1 --\u003e|Read| F2[Node Embeddings]\n    E1 --\u003e|Process| G1[Local Similarity Computation]\n\n    E2 --\u003e|Send| F3[Base64 Encoded Files]\n    E2 --\u003e|Process| G2[Remote Computation]\n\n    G1 --\u003e|Calculate| H1[Local Metrics]\n    G2 --\u003e|Calculate| H2[Remote Metrics]\n\n    G1 --\u003e|Generate| I1[Knowledge Graph]\n\n    H1 --\u003e|Store In| J[Shared State]\n    H2 --\u003e|Store In| J\n    I1 --\u003e|Store In| J\n\n    J --\u003e|Display In| K1[Metrics Visualization]\n    J --\u003e|Display In| K2[Graph Visualization]\n    J --\u003e|Reference For| K3[Chat Interface]\n\n    K1 --\u003e|Show 
In| B\n    K2 --\u003e|Show In| B\n    K3 --\u003e|Show In| B\n\n    %% Styling\n    classDef user fill:#f96,stroke:#333,stroke-width:2px\n    classDef ui fill:#9cf,stroke:#333,stroke-width:2px\n    classDef agent fill:#fcf,stroke:#333,stroke-width:2px\n    classDef local fill:#cfc,stroke:#333,stroke-width:2px\n    classDef remote fill:#f99,stroke:#333,stroke-width:2px\n    classDef data fill:#fc9,stroke:#333,stroke-width:2px\n    classDef process fill:#ff9,stroke:#333,stroke-width:2px\n    classDef state fill:#c9f,stroke:#333,stroke-width:2px\n    classDef output fill:#9c9,stroke:#333,stroke-width:2px\n\n    class A user\n    class B,B1 ui\n    class C,D agent\n    class E1,G1,H1,I1 local\n    class E2,G2,H2 remote\n    class F1,F2,F3 data\n    class J state\n    class K1,K2,K3 output\n```\n\n### Agents\n\n- **Main Agent**: Supervisory agent that routes user queries and orchestrates tasks\n- **StarkQA Agent**: Specialized agent focused on evaluation and metrics\n- **Query Agent**: Handles natural language queries against the knowledge graph\n- **Enrichment Agent**: Provides data enhancement capabilities\n\n### Tools\n\n- **Evaluation Tool**: Processes embeddings and computes similarities\n- **Metrics Tool**: Calculates standard retrieval metrics\n- **Visualization Tool**: Generates interactive visualizations\n- **Knowledge Graph Tool**: Builds and queries the knowledge graph\n- **Feature Generation Tool**: Enhances datasets with additional features\n\n## 📊 Evaluation Pipeline\n\n1. **Data Loading**: Process query and node embeddings from parquet files\n2. **Similarity Computation**: Calculate similarities between queries and nodes using optimized batching\n3. **Metrics Calculation**: Compute standard retrieval metrics (MRR, MAP, Recall@K)\n4. **Knowledge Graph Generation**: Create interactive visualizations of node relationships\n5. 
**Result Analysis**: Provide natural language insights about evaluation results\n\n## Memory-Optimized Component Architecture\n\n```mermaid\ngraph TD\n    subgraph \"UI Layer\"\n        A1[File Upload]\n        A2[Configuration Settings]\n        A3[Results Display]\n        A4[Chat Interface]\n        A5[Demo \u0026 Samples]\n    end\n\n    subgraph \"Agent Layer\"\n        B1[Main Agent]\n        B2[StarkQA Agent]\n        B3[Enrichment Agent]\n        B4[Query Agent]\n    end\n\n    subgraph \"Tool Layer\"\n        C1[Evaluation Tool]\n        C2[Knowledge Graph Generator]\n        C3[Metrics Calculator]\n        C4[Feature Generator]\n        C5[Query Processor]\n    end\n\n    subgraph \"Computation Layer\"\n        D1[Local Processing]\n        D2[Remote GPT-4o-mini]\n        D3[Memory Optimization]\n    end\n\n    subgraph \"Data Layer\"\n        E1[Query Embeddings]\n        E2[Node Embeddings]\n        E3[Shared State]\n        E4[Visualization Cache]\n    end\n\n    A1 --\u003e B1\n    A2 --\u003e B1\n    B1 --\u003e B2\n    B1 --\u003e B3\n    B1 --\u003e B4\n\n    B2 --\u003e C1\n    B2 --\u003e C2\n    B2 --\u003e C3\n    B3 --\u003e C4\n    B4 --\u003e C5\n\n    C1 --\u003e D1\n    C1 --\u003e D2\n    C2 --\u003e D1\n    C3 --\u003e D1\n    C3 --\u003e D2\n    C4 --\u003e D1\n    C5 --\u003e D1\n\n    D1 --\u003e D3\n    D3 --\u003e E1\n    D3 --\u003e E2\n    D1 --\u003e E3\n    D2 --\u003e E3\n\n    E3 --\u003e E4\n    E4 --\u003e A3\n    E4 --\u003e A4\n    E3 --\u003e A5\n\n    %% Styling\n    classDef ui fill:#9cf,stroke:#333,stroke-width:2px\n    classDef agent fill:#fcf,stroke:#333,stroke-width:2px\n    classDef tool fill:#ff9,stroke:#333,stroke-width:2px\n    classDef compute fill:#f99,stroke:#333,stroke-width:2px\n    classDef data fill:#cfc,stroke:#333,stroke-width:2px\n\n    class A1,A2,A3,A4,A5 ui\n    class B1,B2,B3,B4 agent\n    class C1,C2,C3,C4,C5 tool\n    class D1,D2,D3 compute\n    class E1,E2,E3,E4 data\n```\n\n## 🚀 Getting 
Started\n\n### Prerequisites\n\n- Python 3.10+\n- LangChain/LangGraph\n- OpenAI API key (for GPT-4o-mini)\n- PyTorch\n- Streamlit\n\n### Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/ansh-info/stark-agent\ncd stark-agent\n\n# Create and activate virtual environment\npython -m venv venv\nsource venv/bin/activate  # On Windows: venv\\Scripts\\activate\n\n# Install dependencies\npip install -r requirements.txt\n\n# Set up your OpenAI API key\nexport OPENAI_API_KEY=your_api_key_here  # On Windows: set OPENAI_API_KEY=your_api_key_here\n```\n\n### Running the Application\n\n```bash\n# Start the Streamlit app\nstreamlit run app/demo_app.py\n```\n\n## 📝 Usage\n\n1. **Upload Files**:\n\n   - Query Embeddings (parquet format)\n   - Node Embeddings (parquet format)\n\n2. **Configure Settings**:\n\n   - Batch Size: Number of queries to process at once\n   - Processing Type: Local or Remote (GPT-4o-mini)\n   - Evaluation Split: Data split to evaluate on\n   - Model Selection: Choose LLM for analysis\n\n3. **Run Evaluation**:\n\n   - Click \"Run Evaluation\" to start the process\n   - Monitor progress in the UI\n\n4. 
**Analyze Results**:\n   - View metrics visualization\n   - Explore the knowledge graph\n   - Ask questions in natural language\n\n## 📈 Evaluation Metrics\n\nThe system calculates and displays the following metrics:\n\n| Metric          | Description                                                                   |\n| --------------- | ----------------------------------------------------------------------------- |\n| **MRR**         | Mean Reciprocal Rank - Measures where first correct answers appear in ranking |\n| **MAP**         | Mean Average Precision - Measures precision across all relevant items         |\n| **R-Precision** | Precision at the position equal to number of relevant items                   |\n| **Recall@K**    | Proportion of relevant items found in top K results                           |\n| **Hit@K**       | Whether any relevant item appears in top K results                            |\n| **NDCG**        | Normalized Discounted Cumulative Gain - Evaluates ranking quality             |\n\n## Evaluation Process\n\n```mermaid\ngraph TD\n    A[Start Evaluation] --\u003e B1{Local or Remote?}\n\n    B1 --\u003e|Local| C1[Load Parquet Files]\n    B1 --\u003e|Remote| C2[Encode Files to Base64]\n\n    C1 --\u003e D1[Parse Answer IDs]\n    C2 --\u003e D2[Send to GPT-4o-mini]\n\n    D1 --\u003e E1[Convert to Tensors]\n    D2 --\u003e E2[Await Remote Response]\n\n    E1 --\u003e F1[Compute Similarities in Chunks]\n    F1 --\u003e G1[Calculate Memory-Optimized Metrics]\n\n    E2 --\u003e G2[Process Remote Results]\n\n    G1 --\u003e H1[Store Local Results]\n    G2 --\u003e H2[Store Remote Results]\n\n    H1 --\u003e I[Generate Visualizations]\n    H2 --\u003e I\n\n    I --\u003e J1[Interactive Knowledge Graph]\n    I --\u003e J2[Metric Dashboards]\n    I --\u003e J3[Performance Reports]\n\n    J1 --\u003e K[End Evaluation]\n    J2 --\u003e K\n    J3 --\u003e K\n\n    %% Styling\n    classDef start fill:#9f9,stroke:#333,stroke-width:2px\n    classDef 
decision fill:#fcf,stroke:#333,stroke-width:2px\n    classDef local fill:#9cf,stroke:#333,stroke-width:1px\n    classDef remote fill:#f99,stroke:#333,stroke-width:1px\n    classDef compute fill:#fc9,stroke:#333,stroke-width:1px\n    classDef output fill:#cfc,stroke:#333,stroke-width:1px\n    classDef final fill:#f96,stroke:#333,stroke-width:2px\n\n    class A start\n    class B1 decision\n    class C1,D1,E1,F1,G1,H1 local\n    class C2,D2,E2,G2,H2 remote\n    class I compute\n    class J1,J2,J3 output\n    class K final\n```\n\n## Advanced Memory Optimization Strategy\n\n```mermaid\ngraph TD\n    A[Large Embedding Data] --\u003e B{Size \u003e Memory?}\n    B --\u003e|No| C1[Direct Processing]\n    B --\u003e|Yes| C2[Memory-Optimized Flow]\n\n    C2 --\u003e D[Batched Processing Strategy]\n\n    D --\u003e E1[Chunking]\n    D --\u003e E2[Streaming]\n    D --\u003e E3[Checkpointing]\n\n    E1 --\u003e F1[Process in Chunks]\n    E1 --\u003e F2[Size: 1000 queries/chunk]\n\n    E2 --\u003e G1[Stream Results]\n    E2 --\u003e G2[Progressive UI Updates]\n\n    E3 --\u003e H1[Save Intermediate Results]\n    E3 --\u003e H2[Resume Capability]\n\n    F1 --\u003e I[Merge Chunk Results]\n    G1 --\u003e I\n    H1 --\u003e I\n\n    C1 --\u003e J[Calculate Final Metrics]\n    I --\u003e J\n\n    J --\u003e K[Store in Shared State]\n    K --\u003e L[Visualization \u0026 Report]\n\n    %% Styling\n    classDef data fill:#9cf,stroke:#333,stroke-width:2px\n    classDef decision fill:#fcf,stroke:#333,stroke-width:2px\n    classDef strategy fill:#f99,stroke:#333,stroke-width:2px\n    classDef technique fill:#cfc,stroke:#333,stroke-width:1px\n    classDef detail fill:#ff9,stroke:#333,stroke-width:1px\n    classDef process fill:#fc9,stroke:#333,stroke-width:2px\n    classDef output fill:#9f9,stroke:#333,stroke-width:2px\n\n    class A data\n    class B decision\n    class C1,C2 strategy\n    class D strategy\n    class E1,E2,E3 technique\n    class F1,F2,G1,G2,H1,H2 detail\n    class 
I,J,K process\n    class L output\n```\n\n## 🔍 Knowledge Graph Visualization\n\nThe interactive knowledge graph visualization provides:\n\n- **Node Representation**: Entities in the knowledge base\n- **Edge Connections**: Similarity relationships between nodes\n- **Color Coding**: Visual differentiation of node types\n- **Interactive Exploration**: Zoom, pan, and click for details\n- **Filtering**: Focus on specific node types or relationship strengths\n\n## 📊 Sample Reports\n\nThe system generates comprehensive reports including:\n\n- **Performance Summary**: Key metrics and performance indicators\n- **Detailed Analysis**: Breakdown of performance across query types\n- **Comparative Metrics**: Comparison against baseline systems\n- **Knowledge Graph Insights**: Network analysis of node relationships\n- **Recommendations**: Suggestions for system improvement\n\n## 🧩 Project Structure\n\n```\n.\n├── .env                  # Environment variables\n├── agents/\n│   ├── __init__.py\n│   ├── main_agent.py     # Main supervisory agent\n│   └── stark_agent.py    # StarkQA evaluation agent\n├── config/\n│   └── config.py         # Configuration settings\n├── state/\n│   └── shared_state.py   # Shared state management\n├── tests/\n│   ├── __init__.py\n│   └── test_stark_evaluation.py\n├── tools/\n│   ├── __init__.py\n│   └── stark/\n│       ├── __init__.py\n│       └── evaluation_retrival.py  # Core evaluation tool\n├── ui/\n│   ├── stark_app.py            # Streamlit interface\n│   └── demo_app.py\n└── utils/\n    ├── __init__.py\n    └── llm.py            # LLM utilities\n```\n\n## 🔬 Why STaRK?\n\nSTaRK addresses critical challenges in modern retrieval systems:\n\n1. **Semi-structured Data**: Real-world knowledge bases combine structured and unstructured data\n2. **Complex Queries**: Users formulate queries involving multiple relationships and text\n3. 
**Domain Diversity**: Different domains require specialized evaluation approaches\n\n### Target Users\n\n- **ML Engineers**: Evaluating retrieval system performance\n- **Researchers**: Benchmarking LLM capabilities on complex knowledge bases\n- **Product Teams**: Optimizing search and recommendation systems\n- **Domain Experts**: Analyzing domain-specific retrieval effectiveness\n\n### Real-world Applications\n\n- **E-commerce**: Improved product search leveraging reviews and specifications\n- **Academic Research**: Enhanced literature search across citation networks\n- **Healthcare**: Better drug discovery through relationship analysis\n- **Enterprise Search**: More effective information retrieval from corporate knowledge bases\n\n## Agent Communication Pattern\n\n```mermaid\nsequenceDiagram\n    participant User\n    participant UI as Streamlit UI\n    participant MA as Main Agent\n    participant SQA as StarkQA Agent\n    participant ET as Evaluation Tool\n    participant SS as Shared State\n\n    User-\u003e\u003eUI: Upload Files \u0026 Configure\n    UI-\u003e\u003eMA: Request Evaluation\n    MA-\u003e\u003eSQA: Route Evaluation Request\n    SQA-\u003e\u003eET: Invoke Tool\n\n    ET-\u003e\u003eET: Process Embeddings\n    Note over ET: Memory-optimized processing\n\n    ET-\u003e\u003eSS: Store Results\n    ET-\u003e\u003eSQA: Return Results\n    SQA-\u003e\u003eMA: Update Status\n    MA-\u003e\u003eUI: Display Results\n    UI-\u003e\u003eUser: Show Visualizations\n\n    User-\u003e\u003eUI: Ask Question\n    UI-\u003e\u003eMA: Process Query\n    MA-\u003e\u003eSQA: Request Context\n    SQA-\u003e\u003eSS: Retrieve Data\n    SS-\u003e\u003eSQA: Return Context\n    SQA-\u003e\u003eMA: Provide Answer\n    MA-\u003e\u003eUI: Display Answer\n    UI-\u003e\u003eUser: Show Response\n```\n\n## 🤝 Contributing\n\nWe welcome contributions! Please feel free to submit a Pull Request.\n\n1. Fork the repository\n2. 
Create your feature branch (`git checkout -b feature/amazing-feature`)\n3. Commit your changes (`git commit -m 'Add some amazing feature'`)\n4. Push to the branch (`git push origin feature/amazing-feature`)\n5. Open a Pull Request\n\n## 📄 License\n\nThis project is licensed under the MIT License - see the LICENSE file for details.\n\n## 🙏 Acknowledgements\n\n- STaRK benchmark creators (Stanford SNAP Group)\n- LangChain and LangGraph for agent framework\n- Streamlit for the UI framework\n- PyTorch and TorchMetrics for evaluation metrics\n- OpenAI for GPT-4o-mini access\n- This project was developed for Team VPE during the Biodatathon - [VirtualPatientEngine/AIAgents4Pharma](https://github.com/VirtualPatientEngine/AIAgents4Pharma)\n\n## 📚 Citation\n\nIf you use the STaRK Benchmark Suite in your research or project, please cite:\n\n```bibtex\n@software{stark_benchmark,\n  author = {Kumar, Ansh and Apoorva Gupta},\n  title = {STaRK: Benchmarking LLM Retrieval on Semi-structured Knowledge Bases},\n  url = {https://github.com/ansh-info/stark-agent},\n  year = {2025},\n  month = {February},\n  note = {STaRK Agent: A hierarchical agent system for evaluating LLM retrieval performance}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fansh-info%2Fstark-agent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fansh-info%2Fstark-agent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fansh-info%2Fstark-agent/lists"}