# sysgrok

LLM-driven assistant for analyzing, understanding and optimizing systems

https://github.com/elastic/sysgrok

`sysgrok` is an experimental proof-of-concept, intended to demonstrate how
LLMs can be used to help SWEs and SREs understand systems, debug issues, and optimise performance.

It can do things like:

* Take the most expensive functions and processes identified by a profiler, explain
the functionality that each provides, and suggest optimisations.
* Take a host and a description of a problem that host is encountering, automatically
debug the issue, and suggest remediations.
* Take source code that has been annotated by a profiler, explain the hot paths, and
suggest ways to improve the performance of the code.

Check out the release [blog post](https://www.elastic.co/blog/open-sourcing-sysgrok-ai-assistant) for
a longer introduction.

See the Usage section below for the full list of available commands.

Here's an example using the `analyzecmd` sub-command, which connects to a remote host,
executes one or more commands, and summarises the result. The demo shows how this can be
used to automate the process described in Brendan Gregg's article -
[Linux Performance Analysis in 60 seconds](https://www.brendangregg.com/Articles/Netflix_Linux_Perf_Analysis_60s.pdf).

[![asciicast](https://asciinema.org/a/593518.svg)](https://asciinema.org/a/593518)

# Installation

1. Copy `.env.example` to `.env` and fill in the required variables. The `GAI_API_TYPE` must be either "azure" or "openai",
and the `GAI_API_KEY` must be your API key. If you are using an Azure endpoint then you must also provide the
`GAI_API_BASE` and `GAI_API_VERSION` variables. The correct values for these can be found in your Azure portal. An example is sketched after these steps.

2. Install requirements via pip

```
$ python -m venv venv # Create a virtual environment
$ source venv/bin/activate # Activate the virtual environment
$ pip install -r requirements.txt # Install requirements in the virtual environment
```
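
Here is a minimal `.env` sketch, assuming an Azure endpoint. Every value is a placeholder; `.env.example` remains the authoritative list of variables:

```
# Placeholder values only - copy .env.example and substitute your own
GAI_API_TYPE=azure                        # "azure" or "openai"
GAI_API_KEY=<your-api-key>
GAI_API_BASE=<your-azure-endpoint>        # Azure endpoints only
GAI_API_VERSION=<your-azure-api-version>  # Azure endpoints only
```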

# Usage

For now, `sysgrok` is a command line tool and takes input either via stdin
or from a file, depending on the command. Usage is as follows:

```
usage: ./sysgrok.py [-h] [-d] [-e] [-c] [--output-format OUTPUT_FORMAT] [-m MODEL] [--temperature TEMPERATURE] [--max-concurrent-queries MAX_CONCURRENT_QUERIES]
{analyzecmd,code,explainfunction,explainprocess,debughost,findfaster,stacktrace,topn} ...

                               _
 ___ _   _ ___  __ _ _ __ ___ | | __
/ __| | | / __|/ _` | '__/ _ \| |/ /
\__ \ |_| \__ \ (_| | | | (_) |   <
|___/\__, |___/\__, |_|  \___/|_|\_\
     |___/     |___/

System analysis and optimisation with LLMs

positional arguments:
  {analyzecmd,code,explainfunction,explainprocess,debughost,findfaster,stacktrace,topn}
                        The sub-command to execute
    analyzecmd          Summarise the output of a command, optionally with respect to a problem under investigation
    code                Summarise profiler-annotated code and suggest optimisations
    explainfunction     Explain what a function does and suggest optimisations
    explainprocess      Explain what a process does and suggest optimisations
    debughost           Debug an issue by executing CLI tools and interpreting the output
    findfaster          Search for faster alternatives to a provided library or program
    stacktrace          Summarise a stack trace and suggest changes to optimise the software
    topn                Summarise Top-N output from a profiler and suggest improvements

options:
  -h, --help            show this help message and exit
  -d, --debug           Debug output
  -e, --echo-input      Echo the input provided to sysgrok. Useful when input is piped in and you want to see what it is
  -c, --chat            Enable interactive chat after each LLM response
  --output-format OUTPUT_FORMAT
                        Specify the output format for the LLM to use
  -m MODEL, --model-or-deployment-id MODEL
                        The OpenAI model, or Azure deployment ID, to use.
  --temperature TEMPERATURE
                        ChatGPT temperature. See OpenAI docs.
  --max-concurrent-queries MAX_CONCURRENT_QUERIES
                        Maximum number of parallel queries to OpenAI
```
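
As an illustrative invocation (the input file name is hypothetical, and whether a given sub-command reads stdin or a file argument varies by command):

```
$ ./sysgrok.py -e -c stacktrace < trace.txt
```

Here `-e` echoes the piped-in input back to you and `-c` drops into an interactive chat after the LLM responds.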

# Feature Requests, Bugs and Suggestions

Please log them via the GitHub Issues tab. If you have specific requests or bugs
then great, but I'm also happy to discuss open-ended topics, future work, and
ideas.

# Adding a New Command

Adding a new command is easy (a minimal skeleton is sketched after these steps). You need to:
1. Create a file, `yourcommand.py`, in the `sysgrok/commands` directory. It's
likely easiest to just copy an existing command, e.g. `stacktrace.py`
2. Your command file needs to have the following components:
* A top-level `command` variable, which is the name users will use to invoke
your command
* A top-level `help` variable describing the command; this text appears
when the `-h` flag is passed.
* An `add_to_command_subparsers` function, which should add a sub-parser
to handle the command line arguments that are specific to your
command.
* A `run` function that is the interface to your command. It is the
function that creates the LLM queries and produces a result. It should
return 0 upon success, or -1 otherwise.
3. Update `sysgrok.py`:
* Add your command to the imports
* Add your command to the `commands` dict.
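
The following is a minimal sketch of `sysgrok/commands/yourcommand.py`. The exact signatures are assumptions; treat an existing command such as `stacktrace.py` as the authoritative reference.

```
# Illustrative skeleton only - the real argument and return conventions
# are whatever the existing commands in sysgrok/commands use

command = "yourcommand"
help = "One-line description shown in the -h output"


def add_to_command_subparsers(subparsers):
    # Register a sub-parser for the arguments specific to this command
    parser = subparsers.add_parser(command, help=help)
    # Hypothetical example argument; add whatever your command needs
    parser.add_argument("--target", help="The thing to analyse")


def run(args):
    # Create the LLM queries and produce the result here.
    # Return 0 on success, -1 otherwise.
    return 0
```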

# Examples

Note 1: The output of `sysgrok` is heavily dependent on the input prompts, and
I am still experimenting with them. This means that the output you get from
running `sysgrok` may differ from what you see below. It may also vary based
on minor differences in your input format, or the usual quirks of LLMs. If you
have an example where the output you get from a command is incorrect or
otherwise not helpful, please file an issue and let me know. If I can figure out
how to get `sysgrok` to act more usefully, I will.

Note 2: The output of `sysgrok` is also heavily dependent on the OpenAI model
used. The default is `gpt-3.5-turbo` and these examples have been generated
using that. See the usage docs for other options. If the results you get are not
good enough with `gpt-3.5-turbo` then try `gpt-4`. It is slower and more
expensive but may provide higher quality output.
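
For example, to select `gpt-4` for any sub-command, use the `-m` flag documented in the Usage section (the angle-bracket text is a placeholder for a real sub-command and its arguments):

```
$ ./sysgrok.py -m gpt-4 <sub-command and its arguments>
```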

## Finding faster replacement programs and libraries

One of the easiest wins in optimisation is replacing an existing library with
a functionally equivalent, faster alternative. Usually this process begins
with an engineer looking at the Top-N output of their profiler, which lists
the most expensive libraries and functions in their system, and then going on
a hunt for an optimised version. The `findfaster` command solves this problem
for you. Here are some examples.

[![asciicast](https://asciinema.org/a/593479.svg)](https://asciinema.org/a/593479)

## Analysing the TopN functions and suggesting optimisations

A good starting place for analysing a system is often looking at which libraries
and functions your profiler tells you are using the most CPU. Most commercial
profilers have a tab for this information, and you can get it from `perf`
via `perf report --stdio --max-stack=0`. One of the stumbling blocks when
encountering this data is that first you need to understand what each
program, library and function actually is, and then you need to come up with
ideas for how to optimise them. This is made even more complicated in the
world of whole-system, or whole-data-center, profiling, where there are a
huge number of programs and libraries running, and you are often unfamiliar
with many of them.

The `sysgrok topn` command helps with this. Provide it with your Top-N, and
it will try to summarise what each program, library and function is doing, as
well as providing you with some suggestions as to what you might do to
optimise your system.
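
For example, assuming the `perf` report is piped in via stdin:

```
$ perf report --stdio --max-stack=0 | ./sysgrok.py topn
```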

[![asciicast](https://asciinema.org/a/593492.svg)](https://asciinema.org/a/593492)

## Explaining a specific function and suggesting optimisations

If you know a particular function is using significant CPU then, using the
`explainfunction` command, you can ask for an explanation of that
specific function, and for optimisation suggestions, instead of asking about
the entire Top N.

[![asciicast](https://asciinema.org/a/593483.svg)](https://asciinema.org/a/593483)

## Executing commands on a remote host and analysing the response using the LLM

The `analyzecmd` command takes a host and one or more Linux commands to execute.
It connects to the host, executes the commands and summarises the results. You can
also provide it with an optional description of an issue you are investigating, and
the command output will be summarised with respect to that problem.

This first example shows how one or more commands can be executed.

[![asciicast](https://asciinema.org/a/593515.svg)](https://asciinema.org/a/593515)

We can also use `analyzecmd` to analyse commands that produce logs.

[![asciicast](https://asciinema.org/a/593516.svg)](https://asciinema.org/a/593516)

And in this final example we execute and analyse the commands recommended by
Brendan Gregg in his article "Linux Performance Analysis in 60 seconds".

[![asciicast](https://asciinema.org/a/593518.svg)](https://asciinema.org/a/593518)

## Automatically debugging a problem on a host

The `debughost` command takes a host and a problem description and then:

1. Queries the LLM for commands to run that may generate information useful in
debugging the problem.
2. Connects to the host via ssh and executes the commands.
3. Uses the LLM to summarise the output of each command, individually.
4. Concatenates the summaries and passes them to the LLM to ask for a report on
the likely source of the problem the user is facing.
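
Schematically, the flow looks something like this. This is a hypothetical sketch of the steps above, not sysgrok's actual implementation, and the helper functions are invented for illustration:

```
# Hypothetical sketch of the debughost flow; helper functions are illustrative
def debug_host(host, problem):
    commands = llm_suggest_commands(problem)             # 1. ask the LLM for useful commands
    outputs = [ssh_exec(host, cmd) for cmd in commands]  # 2. execute each command over ssh
    summaries = [llm_summarise(cmd, out, problem)        # 3. summarise each output individually
                 for cmd, out in zip(commands, outputs)]
    return llm_report("\n".join(summaries), problem)     # 4. report on the likely root cause
```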

[![asciicast](https://asciinema.org/a/593520.svg)](https://asciinema.org/a/593520)