# DevOps AI Assistant Open Leaderboard

This project tracks, ranks, and evaluates DevOps AI Assistants across knowledge domains.

_📅 [Book a time on my calendar](https://calendly.com/derek-haynes) or email [email protected] to chat about these benchmarks._

## 🏆 Current Leaderboard

### AWS Services ([dataset](datasets/aws_services.csv))

| Name | Accuracy | Median Duration (s) | Created At |
|-----------|-----------------|---------------------|------------|
| [OpsTower.ai](https://github.com/opstower-ai/llm-opstower) | [92%](results/OpsTower-2023-09-17-aws_services.csv) 🏆 | 29 | 2023-09-17 |
| [ReleaseAI](https://release.ai/) | [72%](results/ReleaseAi-2023-09-17-aws_services.csv) | 11 | 2023-09-17 |

### AWS CloudWatch Metrics ([dataset](datasets/aws_cloudwatch_metrics.csv))

| Name | Accuracy | Median Duration (s) | Created At |
|-----------|-----------------|---------------------|------------|
| [OpsTower.ai](https://github.com/opstower-ai/llm-opstower) | [89%](results/OpsTower-2023-09-17-aws_cloudwatch_metrics.csv) 🏆 | 42 | 2023-09-17 |
| [ReleaseAI](https://release.ai/) | [56%](results/ReleaseAi-2023-09-18-aws_cloudwatch_metrics.csv) | 20 | 2023-09-18 |

### AWS Billing ([dataset](datasets/aws_billing.csv))

| Name | Accuracy | Median Duration (s) | Created At |
|-----------|-----------------|---------------------|------------|
| [OpsTower.ai](https://github.com/opstower-ai/llm-opstower) | [91%](results/OpsTower-2023-09-18-aws_billing.csv) 🏆 | 53 | 2023-09-18 |
| [ReleaseAI](https://release.ai/) | [73%](results/ReleaseAi-2023-09-18-aws_billing.csv) | 23 | 2023-09-18 |

### kubectl ([dataset](datasets/kubectl.csv))

| Name | Accuracy | Median Duration (s) | Created At |
|-----------|-----------------|---------------------|------------|
| [abhishek-ch/kubectl-GPT](https://github.com/abhishek-ch/Kubectl-GPT) | [83%](results/AbhishekchKubectlGpt-2023-09-19-kubectl.csv) 🏆 | 5 | 2023-09-19 |
| [devinjeon/kubectl-gpt](https://github.com/devinjeon/kubectl-gpt) | [50%](results/DevinjeonKubectlGpt-2023-09-19-kubectl.csv) | 1 | 2023-09-19 |
| [mico](https://github.com/tahtaciburak/mico) | [17%](results/Mico-2023-09-19-kubectl.csv) | 1 | 2023-09-19 |

Metrics:

* `Accuracy`: The percentage of questions the DevOps AI Assistant answered correctly.
* `Median Duration`: The median time, in seconds, the DevOps AI Assistant took to answer a question.
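
For reference, here is a minimal Ruby sketch of how these two metrics could be computed from a results CSV. The `correct` and `duration` column names are assumptions for illustration; see the files under [results/](results/) for the actual schema.

```ruby
require "csv"

# Hypothetical metric computation; the column names are assumed,
# not the repo's actual results schema.
rows = CSV.read("results/OpsTower-2023-09-17-aws_services.csv", headers: true)

accuracy = rows.count { |r| r["correct"] == "true" }.fdiv(rows.size) * 100

durations = rows.map { |r| r["duration"].to_f }.sort
mid = durations.size / 2
median = durations.size.odd? ? durations[mid] : (durations[mid - 1] + durations[mid]) / 2.0

puts "Accuracy: #{accuracy.round}%, median duration: #{median.round}s"
```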

## What is a DevOps AI Assistant?

A DevOps AI Assistant is an LLM-backed autonomous agent that helps DevOps engineers perform their daily tasks. It connects to external systems like AWS and Kubernetes and performs actions on behalf of the user.

## List of DevOps AI Assistants

This list only includes assistants that can be invoked from the command line or via a REST API, are functional, and are available for immediate use (not in a private beta).

| Name | Focus | Evaluated? |
| ------------------------------------------------------------ | ------------------------- | ------------------------------ |
| [aiac](https://github.com/gofireflyio/aiac) | Terraform, kubectl, AWS | No - code generation only |
| [aiws](https://github.com/huseyinbabal/aiws) | AWS | No - does not decipher command output |
| [cloud copilot](https://github.com/aavetis/cloud-copilot) | Azure | No - does not decipher command output |
| [k8sgpt](https://github.com/k8sgpt-ai/k8sgpt) | Kubernetes | Planned |
| [kubectl-GPT](https://github.com/abhishek-ch/Kubectl-GPT) | kubectl | ✅ |
| [kubectl-gpt](https://github.com/devinjeon/kubectl-gpt) | kubectl | ✅ |
| [KubeCtl-ai](https://github.com/sozercan/kubectl-ai) | Kubernetes manifests | No - code generation only |
| [mico](https://github.com/tahtaciburak/mico) | kubectl | ✅ |
| [OpsTower.ai](https://github.com/opstower-ai/llm-opstower) | AWS | ✅ |
| [ReleaseAI](https://release.ai/) | AWS, Kubectl | ✅ |
| [Terraform AI](https://github.com/jigsaw373/terraform-ai) | Terraform | No - code generation only |
| [tfgpt](https://github.com/flavius-dinu/tfgpt) | Terraform | No - code generation only |

### Submit a DevOps AI Assistant for evaluation

Open a PR to submit a DevOps AI Assistant for automated evaluation. To be evaluated, an assistant must meet the following criteria:

1. Can be invoked from the command line or via a REST API.
2. Is not in a private beta.

## Question Datasets

See the [datasets/](datasets/) directory for the question datasets. Each dataset CSV file has three columns:

1. `question`: The question to ask the DevOps AI Assistant.
2. `answer_format`: The expected answer format, in natural language.
3. `reference_functions`: The reference functions the DevOps AI Assistant should call to answer the question.

List of datasets:

| Name | Example Question |
| -------- | -------- |
| [aws_cloudwatch_metrics.csv](datasets/aws_cloudwatch_metrics.csv) | Were there any Lambda invocations that lasted over 30 seconds in the last day? |
| [aws_services.csv](datasets/aws_services.csv) | Do our EC2 instances have any unexpected reboots or terminations over the past 7 days? |
| [aws_billing.csv](datasets/aws_billing.csv) | Which region has the highest AWS expenses for me over the past 3 months? |
| [kubectl.csv](datasets/kubectl.csv) | How many pods are currently running in the default namespace? |
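
As a quick illustration, the datasets can be read with Ruby's standard CSV library; the column names match the schema above:

```ruby
require "csv"

# Print each question and its expected answer format from one dataset.
CSV.foreach("datasets/kubectl.csv", headers: true) do |row|
  puts row["question"]
  puts "  expected: #{row["answer_format"]}"
end
```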

## Evaluation Process

1. Iterate over each question in the dataset and store:
   * the answer from the DevOps AI Assistant
   * the truth answer, derived by executing the human-evaluated [reference functions](functions/) and summarizing their output into an answer with a [prompt](prompts/answer_from_saved_methods.rb)
2. Iterate over the stored answers, using the [dynamic eval prompt](prompts/dynamic_eval.rb) to compare each DevOps AI Assistant answer to the truth answer. This generates a confidence score and a short explanation of the score.
3. Store the results in the [results/](results/) directory (the full loop is sketched below).
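
A rough Ruby sketch of this pipeline. `ask_assistant`, `truth_answer_for`, and `dynamic_eval` are illustrative stand-ins, not the repo's actual methods; they would be wired to the assistant's CLI/REST API, the reference functions, and the eval prompts.

```ruby
require "csv"

# Stand-in helpers: replace with calls to the assistant under test,
# the reference functions in functions/, and prompts/dynamic_eval.rb.
def ask_assistant(question)
  "stubbed assistant answer"
end

def truth_answer_for(reference_functions)
  "stubbed truth answer"
end

def dynamic_eval(assistant_answer, truth_answer)
  [assistant_answer == truth_answer ? 1.0 : 0.0, "stubbed explanation"]
end

results = CSV.foreach("datasets/aws_services.csv", headers: true).map do |row|
  answer = ask_assistant(row["question"])                # step 1: ask the assistant
  truth  = truth_answer_for(row["reference_functions"])  # step 1: derive the truth answer
  score, explanation = dynamic_eval(answer, truth)       # step 2: LLM-based comparison
  [row["question"], score, explanation]
end

CSV.open("results/example-run.csv", "w") do |csv|        # step 3: store the results
  csv << %w[question score explanation]
  results.each { |r| csv << r }
end
```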

## A note on dynamic evaluation

A critical component of the evaluation process is the dynamic evaluation. It's not feasible to provide a static answer for most questions as the correct answer is environment-specific. For example, the answer to "What is the average CPU utilization across my EC2 instances?" is not a static answer. It depends on the current state of the EC2 instances.

To solve this, I've stored a set of human-evaluated functions that generate the data behind the correct answers. Then, I use an LLM prompt to generate a natural language answer from that data. This would be a poor evaluation process if the LLM produced an incorrect answer from the returned data, but I have yet to observe significant errors in the LLM's reasoning over the function output.
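
To make this concrete, a comparison prompt might look roughly like the following. The real prompt lives in [prompts/dynamic_eval.rb](prompts/dynamic_eval.rb); this wording is illustrative only.

```ruby
# Illustrative only; not the repo's actual prompt wording.
def dynamic_eval_prompt(question, assistant_answer, truth_answer)
  <<~PROMPT
    Question: #{question}
    Assistant's answer: #{assistant_answer}
    Reference answer: #{truth_answer}

    Do the two answers agree? Respond with a confidence score between
    0 and 1 and a one-sentence explanation of the score.
  PROMPT
end
```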

Please submit a PR if you believe a reference function is incorrect.

## Contact Info

Reach out to [email protected] if you have general questions about this leaderboard.