awesome-failure-diagnosis
Related resources for incident failure diagnosis research.
https://github.com/phamquiluan/awesome-failure-diagnosis
-
Other Papers
- ICSOC'21 - Localization of Operational Faults in Cloud Applications by Mining Causal Dependencies in Logs Using Golden Signals.
- 2022 - Constructing Large-Scale Real-World Benchmark Datasets for AIOps
- ISSRE'22 - Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling
- ASE'22 - WOLFFI: A fault injection platform for learning AIOps models.
- ICSE'21 - Fast outage analysis of large-scale production clouds with service correlation mining.
- 2020 - Loghub: a large collection of system log datasets towards automated log analytics.
- ICSE'19 - Tools and Benchmarks for Automated Log Parsing.
- TNSM'17 - Mining causality of network events in log data.
- 2024 - A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends
- 2024 - Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis
-
Uncategorized
-
Misc
- IBM Cloud - https://cloud.ibm.com/status/incident-reports
- Google Cloud - https://status.cloud.google.com/summary
- Google Workspace - https://www.google.com/appsstatus/dashboard/summary
- AWS Post-Event Summaries - https://aws.amazon.com/premiumsupport/technology/pes/
- Alibaba Cloud - https://status.alibabacloud.com/
- Verica Open Incident Database
- GitHub Incidents
- Atlassian
- Computer failure data repository.
- Apache log files.
- Datadog Incident Management
- Google SRE - Monitoring Distributed Systems
- CNCF Cloud Native Interactive Landscape
- Inside Azure Search: Chaos Engineering
- Computer Security Incident Handling Guide
- AWS Health Dashboard - https://health.aws.amazon.com/health/status
- Train Ticket @ Fudan University
- Online Boutique @ Google Cloud
- Sock Shop @ RMIT
- Sock Shop @ Weaveworks
- Robot Shop @ Instana
- A list of security log data.
- AWS Observability Recipes
- Introducing Bits AI, your new DevOps copilot
- Kibana vs Grafana
- CircleCI
- Notion
- Train Ticket @ RMIT
- Toward generating a new intrusion detection dataset and intrusion traffic characterization.
-
Traces
-
Chaos Engineering / Fault Injection
- TC (Traffic Control)
- tc-netem (Network Emulator)
- Strace
- Chaos Genius
- Chaos Mesh - An open-source chaos engineering platform for Kubernetes. It provides APIs and CLI tools for defining and orchestrating chaos experiments such as network latency injection and pod failure.
- ChaosBlade
- Chaos Toolkit
-
Conferences and Journals
-
Anomaly Detection
- ICSE'21 - Log-based Anomaly Detection with Deep Learning: How Far Are We?
- ATC'21 - Jump-Starting Multivariate Time Series Anomaly Detection for Online Service Systems
- WWW'18 - Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications.
- Robust multimodal failure detection for microservice systems
- IMC'15 - Opprentice: Towards practical and automatic anomaly detection through machine learning.
- UAC-AD: Unsupervised Adversarial Contrastive Learning for Anomaly Detection on Multi-modal Data in Microservice Systems
-
Root cause analysis / Fault Localization
- FSE'24 - BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection
- NeurIPS'22 - Root Cause Discovery: Root Cause Analysis of Failures in Microservices through Causal Discovery
- ICSE'21 - MicroHECL: High-efficient root cause localization in large-scale microservice systems.
- ISSRE'21 - Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems.
- ISSRE'19 - FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation.
- VLDB'22 - Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases.
- HeMiRCA: Fine-Grained Root Cause Analysis for Microservices with Heterogeneous Data Sources
- SIGCOMM'23 - Network-centric distributed tracing with deepflow: Troubleshooting your microservices in zero code
- FSE'23 - Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data
- TSE'24 - TrinityRCL: Multi-Granular and Code-Level Root Cause Localization Using Multiple Types of Telemetry Data in Microservice Systems
- ASE'24 - Root Cause Analysis for Microservice System based on Causal Inference: How Far Are We?
- ASE'24 - Giving Every Modality a Voice in Microservice Failure Diagnosis via Multimodal Adaptive Optimization
- ASE'24 - MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data
- FSE'22 - Actionable and interpretable fault localization for recurring failures in online service systems.
- FSE'19 - Latent error prediction and fault localization for microservice applications by learning from system trace logs
- KDD'22 - Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition
- ICSE'23 - Eadro: An End-to-End Troubleshooting Framework for Microservices on Multi-source Data
-
Researcher
- Prof. Hongyu Zhang - Chongqing University
- Prof. Michael R. Lyu - The Chinese University of Hong Kong
- Assoc. Prof. Dan Pei - Tsinghua University
- Prof. Tao Xie - Fudan University
- Prof. Peng Xin - Fudan University
- Dr. Dongmei Zhang - Microsoft Research Asia
- Qingwei Lin - Microsoft Research Asia
- Assoc. Prof. Pengfei Chen - Sun Yat-sen University
- Guangba Yu - Sun Yat-sen University
- Causal Inference Course Lectures - Brady Neal
- Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE
- The Smallest Possible SRE Team
- Banking on Continuous Delivery - Capital One
-
Metrics
-
Logs
-
Load generators