{"id":50900875,"url":"https://github.com/exalsius/rca-llm","last_synced_at":"2026-06-16T02:30:46.569Z","repository":{"id":353589039,"uuid":"1170862290","full_name":"exalsius/rca-llm","owner":"exalsius","description":"An evaluation framework for root cause analysis in large-scale LLM inference systems","archived":false,"fork":false,"pushed_at":"2026-04-24T14:12:16.000Z","size":269,"stargazers_count":6,"open_issues_count":0,"forks_count":2,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-04-24T16:30:27.134Z","etag":null,"topics":["inference-engine","large-language-models","load-testing","root-cause-analysis"],"latest_commit_sha":null,"homepage":"https://arxiv.org/abs/2603.02057","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/exalsius.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE.txt","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-02T15:54:38.000Z","updated_at":"2026-04-24T14:12:20.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/exalsius/rca-llm","commit_stats":null,"previous_names":["exalsius/rca-llm"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/exalsius/rca-llm","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exalsius%2Frca-llm","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exalsius%2Frca-llm/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exalsius%2Frca-llm/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exalsius%2Frca-llm/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/exalsius","download_url":"https://codeload.github.com/exalsius/rca-llm/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/exalsius%2Frca-llm/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34388669,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-16T02:00:06.860Z","response_time":126,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["inference-engine","large-language-models","load-testing","root-cause-analysis"],"created_at":"2026-06-16T02:30:45.954Z","updated_at":"2026-06-16T02:30:46.564Z","avatar_url":"https://github.com/exalsius.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Root Cause Analysis for Large Language Model Inference Systems\n\nOur work has been accepted to the [QualITA](https://qualitawg.github.io/) workshop co-located with the [ICPE 2026](https://icpe2026.spec.org/) conference.\n\n- **Publication:** [Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads](https://dl.acm.org/doi/10.1145/3777911.3800637)\n- **Preprint:** Find it [here](https://arxiv.org/abs/2603.02057)\n\nThis repository contains the complete experimental framework and evaluation suite for our research on **Root Cause Analysis (RCA) in Large Language Model (LLM) inference systems**. Our work addresses the critical challenge of diagnosing performance anomalies and failures in distributed LLM deployments through comprehensive telemetry analysis and automated root cause detection.\n\n## Research Contributions\n\nOur research makes several key contributions to the field of LLM system reliability and observability:\n\n1. **Comprehensive RCA Evaluation Framework**: We present the first systematic evaluation of root cause analysis methods specifically tailored for LLM inference systems, testing 20+ different RCA algorithms across various failure scenarios.\n\n2. **Real-world LLM Telemetry Datasets**: We provide novel telemetry datasets collected from actual distributed LLM deployments under various stress conditions, including GPU stress, memory pressure, and network chaos scenarios.\n\n3. **Chaos Engineering for LLM Systems**: We develop and validate chaos engineering techniques specifically designed for LLM inference workloads, enabling controlled failure injection and system behavior analysis.\n\n4. **Multi-modal Observability Integration**: We demonstrate how to effectively combine metrics, logs, traces, and network data for comprehensive system diagnosis in LLM deployments.\n\n## Repository Structure\n\n### 🔬 **`ansible/`** - Experiment Orchestration Framework\nThe core infrastructure automation system that enables reproducible, large-scale experiments on Kubernetes clusters.\n\n- **Key Components:**\n  - **Infrastructure Setup**: Automated deployment of k3s, NVIDIA GPU operator, Ray clusters, and comprehensive observability stack\n  - **Experiment Execution**: Configurable experiment workflows with support for multiple LLM models and failure scenarios\n  - **Chaos Engineering**: Automated anomaly injection for CPU, memory, GPU, and network stress testing\n  - **Data Collection**: Automated telemetry gathering from Prometheus, Grafana, Loki, OpenTelemetry, and DeepFlow\n\n- **Supported Models**: Falcon, Llama, Mistral, Gemma, Phi, Qwen (7B-11B parameter range); many others can be easily added\n- **Experiment Types**: Baseline performance, stress testing, chaos injection, and RCA evaluation\n\n### 🔬 **`llm-benchmark/`** - Load Testing and Performance Evaluation\nA comprehensive benchmarking suite for evaluating LLM inference performance under various load conditions.\n\n- **Features:**\n  - Multi-provider support (VLLM, OpenAI API, Together.ai, Anyscale, TGI, Triton)\n  - Configurable load patterns (fixed QPS, burst mode, continuous load)\n  - Comprehensive metrics collection (latency, throughput, token generation rates)\n  - Streaming and non-streaming API support\n\n### 🔬 **`rca-benchmark/`** - Root Cause Analysis Evaluation Suite\nAn enhanced version of RCAEval specifically adapted for LLM inference systems, featuring 20+ RCA algorithms and custom LLM telemetry datasets.\n\n- **RCA Methods Evaluated:**\n  - **Graph-based**: PC, FCI, Granger Causality, LiNGAM, GES, CMLP with PageRank/Random Walk\n  - **Specialized**: BARO, CausalRCA, CIRCA, MicroCause, E-Diagnosis, RCD, NSigma, CausalAI\n  - **Trace-based**: MicroRank, TraceRCA\n  - **Multi-source**: MMBaro, PDiagnose\n  - **Custom Implementations**: MicroRCA, MicroScope, MonitorRank\n\n- **Datasets**: Custom LLM inference deployment telemetry data with controlled failure injections\n\n### 📊 **`data/`** - Experimental Results and Artifacts\nContains the complete experimental data from our evaluation, including:\n- **Smoke Tests**: Performance baselines and capability limits\n- **RCA Evaluation**: Chaos injection results and RCA algorithm performance\n- **Telemetry Data**: Comprehensive metrics, logs, traces, and network data\n\n## Quick Start\n\n### 1. **Understanding the Research Scope**\n- Start with `ansible/README.md` for the complete experimental framework overview\n- Review `rca-benchmark/README.md` for RCA evaluation methodology\n- Check `llm-benchmark/README.md` for load testing capabilities\n\n### 2. **Exploring Experimental Data**\n```bash\n# Extract experimental results\ncd data/\ntar -xzvf artifacts.tar.gz\n# Browse smoketests/ and rcaevaluation/ directories\n```\n\n### 3. **Reproducing Experiments**\n```bash\n# Set up infrastructure (requires Kubernetes cluster)\ncd ansible/\nansible-playbook -i inventories/your-cluster.ini install-k3s.yaml\n# Check the ansible directory for more information, these commands are not complete\nansible-playbook -i inventories/your-cluster.ini install-observability.yaml\n\n# Run a baseline experiment\nansible-playbook -i inventories/your-cluster.ini execute-experiment.yaml \\\n  -e \"model_id=falcon-h1-7b-instruct\" \\\n  -e \"exp_type_identifier=baseline-test\"\n```\n\n### 4. **RCA Evaluation**\n```bash\n# Run RCA benchmark on collected telemetry data\ncd rca-benchmark/\ndocker build -t rca-benchmark .\n./execute_experiments.sh python3.12\n```\n\n## Technical Highlights\n\n- **Infrastructure**: Kubernetes-based distributed system with Ray for model serving\n- **Observability**: Comprehensive telemetry stack (Prometheus, Grafana, Loki, OpenTelemetry, DeepFlow)\n- **Chaos Engineering**: Controlled failure injection for realistic failure scenarios\n- **Evaluation**: Systematic comparison of 20+ RCA algorithms across multiple failure types\n- **Reproducibility**: Complete automation of experiment setup, execution, and data collection\n\n## How to Cite\n\nPublication:\n\n```bibtex\n@inproceedings{10.1145/3777911.3800637,\n      author = {Scheinert, Dominik and Acker, Alexander and Wittkopp, Thorsten and Becker, Soeren and Yous, Hamza and Reddy, Karnakar and Farhat, Ibrahim and Hacid, Hakim and Kao, Odej},\n      title = {Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads},\n      year = {2026},\n      isbn = {9798400723261},\n      publisher = {Association for Computing Machinery},\n      address = {New York, NY, USA},\n      url = {https://doi.org/10.1145/3777911.3800637},\n      doi = {10.1145/3777911.3800637},\n      abstract = {Large language model (LLM) services have become an integral part of search, assistance, and decision-making applications. However, unlike traditional web or microservices, the hardware and software stack enabling LLM inference deployment is of higher complexity and far less field-tested, making it more susceptible to failures that are difficult to resolve. Keeping outage costs and quality of service degradations in check depends on shortening mean time to repair, which in practice is gated by how quickly the fault is identified, located, and diagnosed. Automated root cause analysis (RCA) accelerates failure localization by identifying the system component that failed and tracing how the failure propagated. Numerous RCA methods have been developed for traditional services, using request path tracing, resource metric and log data analysis. Yet, existing RCA methods have not been designed for LLM deployments that present distinct runtime characteristics. In this study, we evaluate the effectiveness of RCA methods on a best-practice LLM inference deployment under controlled failure injections. Across 24 methods—20 metric-based, two trace-based, and two multi-source—we find that multi-source approaches achieve the highest accuracy, metric-based methods show fault-type-dependent performance, and trace-based methods largely fail. These results reveal that existing RCA tools do not generalize to LLM systems, motivating tailored analysis techniques and enhanced observability, for which we formulate guidelines.},\n      booktitle = {Companion of the 17th ACM/SPEC International Conference on Performance Engineering},\n      pages = {163–172},\n      numpages = {10},\n      keywords = {distributed systems, reliability engineering, root cause analysis, aiops, large language models},\n      location = {Italy},\n      series = {ICPE Companion '26}\n}\n```\n\nPreprint:\n\n```bibtex\n@misc{scheinert2026microservicestestingwebscalerca,\n      title={Beyond Microservices: Testing Web-Scale RCA Methods on GPU-Driven LLM Workloads},\n      author={Dominik Scheinert and Alexander Acker and Thorsten Wittkopp and Soeren Becker and Hamza Yous and Karnakar Reddy and Ibrahim Farhat and Hakim Hacid and Odej Kao},\n      year={2026},\n      eprint={2603.02057},\n      archivePrefix={arXiv},\n      primaryClass={cs.DC},\n      url={https://arxiv.org/abs/2603.02057},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexalsius%2Frca-llm","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fexalsius%2Frca-llm","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fexalsius%2Frca-llm/lists"}