awesome-failure-diagnosis
Related resources for incident failure diagnosis research.
https://github.com/phamquiluan/awesome-failure-diagnosis
- ICSE'21 - Log-based Anomaly Detection with Deep Learning: How Far Are We?
- IPCCC'18 - Rapid deployment of anomaly detection models for large number of emerging KPI streams
- IMC'15 - Opprentice: Towards practical and automatic anomaly detection through machine learning.
- ATC'21 - Jump-Starting Multivariate Time Series Anomaly Detection for Online Service Systems
- WWW'18 - Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal KPIs in Web Applications.
- FSE'24 - BARO: Robust Root Cause Analysis for Microservices via Multivariate Bayesian Online Change Point Detection
- NeurIPS'22 - Root Cause Discovery: Root Cause Analysis of Failures in Microservices through Causal Discovery
- KDD'22 - Causal Inference-Based Root Cause Analysis for Online Service Systems with Intervention Recognition
- VLDB'22 - Diagnosing Root Causes of Intermittent Slow Queries in Cloud Databases.
- FSE'22 - Actionable and interpretable fault localization for recurring failures in online service systems.
- ICSE'21 - MicroHECL: High-efficient root cause localization in large-scale microservice systems.
- ISSRE'21 - Identifying Root-Cause Metrics for Incident Diagnosis in Online Service Systems.
- FSE'20 - Graph-based trace analysis for microservice architecture understanding and problem diagnosis.
- FSE'19 - Latent error prediction and fault localization for microservice applications by learning from system trace logs
- ISSRE'19 - FluxRank: A Widely-Deployable Framework to Automatically Localizing Root Cause Machines for Software Service Failure Mitigation.
- 2022 - Constructing Large-Scale Real-World Benchmark Datasets for AIOps
- ASE'22 - Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems
- ISSRE'22 - Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling
- CSUR'22 - A Survey on Deep Learning for Software Engineering
- ASE'22 - WOLFFI: A fault injection platform for learning AIOps models.
- SIGOPS'22 - An Intelligent Framework for Timely, Accurate, and Comprehensive Cloud Incident Detection.
- WWW'21 - MicroRank: End-to-End Latency Issue Localization with Extended Spectrum Analysis in Microservice Environments
- ICSOC'21 - Localization of Operational Faults in Cloud Applications by Mining Causal Dependencies in Logs Using Golden Signals.
- CSUR'21 - A survey on automated log analysis for reliability engineering.
- ICSE'21 - Fast outage analysis of large-scale production clouds with service correlation mining.
- FSE'21 - Onion: Identifying Incident-Indicating Logs for Cloud Systems.
- FSE'20 - Towards intelligent incident management: why we need it and how we make it.
- 2020 - Loghub: a large collection of system log datasets towards automated log analytics.
- ICSE'19 - Tools and Benchmarks for Automated Log Parsing.
- FSE'18 - Identifying impactful service system problems via log analysis.
- TNSM'17 - Mining causality of network events in log data.
- Datadog Incident Management
- Introducing Bits AI, your new DevOps copilot
- AWS Observability Recipes
- Monitoring the Golden Signals (Latency, Traffic, Errors, and Saturation)
- Kibana vs Grafana
- Google SRE - Monitoring Distributed Systems
- Computer Security Incident Handling Guide
- CNCF Cloud Native Interactive Landscape
- Inside Azure Search: Chaos Engineering
- Azure Cloud - https://status.azure.com/en-us/status/history
- IBM Cloud - https://cloud.ibm.com/status/incident-reports
- Google Cloud - https://status.cloud.google.com/summary
- Google Workspace - https://www.google.com/appsstatus/dashboard/summary
- AWS Health Dashboard - https://health.aws.amazon.com/health/status
- AWS Post-Event Summaries - https://aws.amazon.com/premiumsupport/technology/pes/
- Alibaba Cloud - https://status.alibabacloud.com/
- Verica Open Incident Database
- GitHub Incidents
- Atlassian
- CircleCI
- Notion
- Train Ticket @ Fudan University
- Train Ticket @ RMIT
- Online Boutique @ Google Cloud
- Sock Shop @ RMIT
- Sock Shop @ Weaveworks
- Robot Shop @ Instana
- Error logs produced by OpenStack.
- Computer failure data repository.
- A list of security log data.
- Apache log files.
- Toward generating a new intrusion detection dataset and intrusion traffic characterization.
- Prometheus - Node Exporter
- Prometheus - Blackbox prober exporter
- tsfresh
- cAdvisor (Container Advisor)
- Elasticsearch
- OpenTelemetry
- Locust
- Vegeta
- JMeter
- Stress-ng
- wrk2
- Chaos Mesh - An open-source chaos engineering platform for Kubernetes. It provides APIs and CLI tools for defining and orchestrating chaos experiments, such as network latency injection and pod failure.
- TC (Traffic Control)
- tc-netem (Network Emulator)
- ChaosBlade
- strace
- Chaos Toolkit
- Chaos Genius
- ICSE | [ASE](https://dblp.org/db/conf/kbse/index.html) | [WWW](https://dblp.org/db/conf/www/index.html) | [KDD](https://dblp.org/db/conf/kdd/index.html) | [NeurIPS](https://dblp.org/db/conf/nips/index.html)
- IEEE TSE
- Check Conference Rank
- Prof. Hongyu Zhang - Chongqing University
- Prof. Michael R. Lyu - The Chinese University of Hong Kong
- Assoc. Prof. Dan Pei - Tsinghua University
- Prof. Tao Xie - Fudan University
- Prof. Peng Xin - Fudan University
- Dr. Dongmei Zhang - Microsoft Research Asia
- Qingwei Lin - Microsoft Research Asia
- Assoc. Prof. Pengfei Chen - Sun Yat-sen University
- Guangba Yu - Sun Yat-sen University
- Causal Inference Course Lectures - Brady Neal
- Adobe - The Good, the Bad and the Ugly: The 3 Learnings of an SRE
- The Smallest Possible SRE Team
- Banking on Continuous Delivery - Capital One
Keywords
- kubernetes (4)
- chaos-engineering (3), prometheus (3), java (3)
- chaos-testing (2), fault-injection (2), golang (2), microservices (2), site-reliability-engineering (2), microservice (2), performance (2), load-testing (2), http (2), benchmarking (2), prometheus-exporter (2)
- freebsd (1), kernel (1), linux (1), system-metrics (1), memory (1), openbsd (1), system-information (1), overheating (1), procfs (1), posix (1), node-metrics (1), metrics (1), stress-testing (1), x86 (1), machine-metrics (1), chaos (1), host-metrics (1), elasticsearch (1), time-series (1), search-engine (1), feature-extraction (1), data-science (1), load-generator (1), load-test (1), icmp (1), load-tests (1), locust (1), blackbox-exporter (1), performance-testing (1), python (1), go (1), test (1), c (1), cpu (1), disk (1)