# Compass
[Chinese Document](README_zh.md)

### Abstract
Compass is a platform for diagnosing computing engines and schedulers in the big data ecosystem. It aims to improve
the efficiency of troubleshooting and reduce the complexity of tuning. It automatically collects logs and
metrics and uses heuristic rules to identify problems and provide tuning advice. In addition, ChatGPT is used to
provide diagnostic suggestions for logs. The logs are first aggregated into templates with the Drain algorithm,
which also makes them easier to review manually and improves the automation of diagnosis and optimization.
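As an illustration of the log-template step, below is a minimal sketch of aggregating raw log lines into templates with the `drain3` Python package (the Drain implementation cited in the Reference section). The sample log lines are illustrative and not taken from Compass.

```python
# Minimal sketch: aggregate similar log lines into templates with drain3
# (https://github.com/logpai/Drain3). Sample lines are illustrative only.
from drain3 import TemplateMiner

template_miner = TemplateMiner()  # default in-memory state and config

log_lines = [
    "Task 12 failed on executor 3: OutOfMemoryError",
    "Task 47 failed on executor 9: OutOfMemoryError",
    "Fetching block shuffle_0_12_3 from host-a",
    "Fetching block shuffle_0_47_9 from host-b",
]

for line in log_lines:
    result = template_miner.add_log_message(line)
    print(result["change_type"], "->", result["template_mined"])

# Similar lines collapse into a shared template such as
# "Task <*> failed on executor <*> OutOfMemoryError", so only one
# representative per template needs a ChatGPT diagnosis.
```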

### Feature
1. Non-invasive, timely diagnosis; no need to modify the code of the original platform.
2. Compatible with multiple versions of different components, such as Spark 2.4+, Flink 1.2+, Hadoop 2.4+, DolphinScheduler 2.x+, Airflow, etc.
3. Supports diagnostics for various scheduling job issues, such as failure, abnormal elapsed time, abnormal baseline, etc.
4. Supports diagnostics for various engine task issues, such as data skew, large table scan, memory waste, long tail task, etc.
5. Supports capturing log exceptions and offering advice or solutions.
6. Supports ChatGPT diagnosis of abnormal logs with suggested solutions; the Drain algorithm aggregates logs into templates to save cost (see the sketch below).
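As a rough illustration of feature 6, the sketch below sends one representative log template to ChatGPT for a diagnostic suggestion. It assumes the `openai` Python client; the model name and prompt wording are illustrative and not part of Compass.

```python
# Hedged sketch: ask ChatGPT for a diagnosis of one aggregated log template.
# Assumes the openai Python client (openai>=1.0); model and prompt are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

template = "Task <*> failed on executor <*> OutOfMemoryError"

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system",
         "content": "You are a big data task diagnosis assistant."},
        {"role": "user",
         "content": f"Diagnose this Spark log template and suggest a fix:\n{template}"},
    ],
)
print(response.choices[0].message.content)
```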

### Feature Support
- [x] ChatGPT
- [x] Spark
- [x] Flink
- [x] MapReduce
- [ ] Trino
- [ ] Spark Tez
- [x] Airflow
- [x] DolphinScheduler
- [ ] Azkaban
- [ ] Oozie
- [ ] Debezium (synchronize PostgreSQL data to PostgreSQL)
- [ ] Other (any suggestions are welcome and highly valued)...

### Documents

[Deployment document](document/manual/deployment.md)

[Architecture document](document/manual/architecture.md)

### Community
Welcome to join the community for the usage or development of Compass.
- Submit an [issue](https://github.com/cubefs/compass/issues).
- Submit a pull request, please read the [contributing guideline](https://github.com/cubefs/compass/blob/main/CONTRIBUTING.md).
- Discuss [idea & question](https://github.com/cubefs/compass/discussions).

We usually reply quickly.

### Categories of Diagnosis

| Category | Scope | Dimension | Description |
|-------------------------------|------------------|---------------------|----------------------------------------------------------------------------------------------------------------|
| Failed task | Scheduler | Runtime Analysis | The task still fails after retries in a running cycle |
| First failed task | Scheduler | Runtime Analysis | The task fails on the first attempt but succeeds after retrying in a running cycle |
| Long-term failed task | Scheduler | Runtime Analysis | The task keeps failing in every running cycle |
| Exceed base-time task | Scheduler | Time Analysis | The run ends earlier or later than normal |
| Abnormal time-elapsed task | Scheduler | Time Analysis | The elapsed time of the task is either too short or too long compared to normal |
| Long time-consuming task | Scheduler | Time Analysis | The elapsed time of the task exceeds 2 hours |
| Failed SQL task | Spark | Runtime Analysis | The SQL fails to run |
| Shuffle failed task | Spark | Runtime Analysis | The task fails because the shuffle cannot complete successfully |
| Memory overflow | Spark | Runtime Analysis | There is not enough memory to run the task |
| CPU waste | Spark, MapReduce | Resource Analysis | CPU usage is low |
| Memory waste | Spark | Resource Analysis | Memory usage is low |
| Large table scan | Spark, MapReduce | Efficiency Analysis | Too many rows of a large table are scanned because of missing partitions or filters |
| Memory overflow warning | Spark | Efficiency Analysis | The size or row count of data broadcast from the driver to executors is too large, which may cause memory overflow |
| Data skew | Spark, MapReduce | Efficiency Analysis | The maximum amount of data handled by a processing unit (task/map/reduce) is larger than the median |
| Abnormal time-consuming job | Spark | Efficiency Analysis | There is a high ratio of idle time during the run of the job |
| Abnormal time-consuming stage | Spark | Efficiency Analysis | There is a high ratio of idle time during the run of the stage |
| Long tail task | Spark, MapReduce | Efficiency Analysis | The maximum running time of a processing unit (task/map/reduce) is much larger than the median |
| HDFS read/write stuck | Spark | Efficiency Analysis | The rate of data processing per task is much slower than in a normal stage |
| Speculative tasks | Spark, MapReduce | Efficiency Analysis | There are too many speculative tasks because executors are processing slowly |
| Abnormal global sort | Spark | Efficiency Analysis | The whole Spark application contains only one task |
| Abnormal GC | MapReduce | Efficiency Analysis | The ratio of GC time to CPU time is high |
| High memory usage | Flink | Resource Analysis | Memory usage is high |
| Low memory usage | Flink | Resource Analysis | Memory usage is low |
| Abnormal jobmanager memory | Flink | Resource Analysis | The jobmanager memory is abnormal when there are too many taskmanagers |
| No data processing | Flink | Resource Analysis | No data is processed in the job |
| No data in partial task | Flink | Resource Analysis | No data is processed in some taskmanagers |
| Optimize taskmanager memory | Flink | Resource Analysis | The taskmanager memory should be tuned because the configured memory is abnormal |
| Not enough parallelism | Flink | Resource Analysis | The parallelism of the Flink job is too low |
| High CPU usage | Flink | Resource Analysis | CPU usage is high |
| Low CPU usage | Flink | Resource Analysis | CPU usage is low |
| High maximum CPU usage | Flink | Resource Analysis | The peak CPU usage is high |
| Slow operators | Flink | Runtime Analysis | There are slow operators in the Flink job |
| Back pressure | Flink | Runtime Analysis | There is back pressure in the Flink job |
| High delay | Flink | Runtime Analysis | There is high latency in the Flink job |
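
To make the task-level heuristics above concrete, here is a minimal sketch (not Compass code) of the kind of max-versus-median rule behind the data skew and long tail diagnoses; the metric values and the 5x threshold are illustrative assumptions.

```python
# Hedged sketch of a max-vs-median heuristic, in the spirit of the
# "Data skew" and "Long tail task" rules above. Threshold and sample
# metrics are illustrative assumptions, not Compass's actual values.
from statistics import median

def detect_skew(values, ratio_threshold=5.0):
    """Return True when the largest per-task value dwarfs the median."""
    if not values:
        return False
    med = median(values)
    return med > 0 and max(values) / med >= ratio_threshold

# Bytes shuffled per task: one task handles far more data than the rest.
shuffle_bytes_per_task = [120, 130, 125, 118, 900]
print(detect_skew(shuffle_bytes_per_task))   # data skew -> True

# Run time per task in seconds: no single long tail here.
task_runtime_seconds = [40, 42, 39, 45, 44]
print(detect_skew(task_runtime_seconds))     # long tail -> False
```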

### UI
![overview](document/manual/img/spark_report.png)
![overview-1](document/manual/img/spark_report_trend.png)
![tasks](document/manual/img/spark_scheduler.png)
![onclick](document/manual/img/application_report_memory.png)

### License

Compass is licensed under the [Apache License, Version 2.0](http://www.apache.org/licenses/LICENSE-2.0). For details,
see [LICENSE](LICENSE) and [NOTICE](NOTICE).

### Reference
The Drain algorithm implementation is based on the `logpai` project. For more details, see
- [https://github.com/logpai/Drain3](https://github.com/logpai/Drain3)
- [https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf](https://jiemingzhu.github.io/pub/pjhe_icws2017.pdf)