{"id":13839373,"url":"https://github.com/vmware-tanzu/crash-diagnostics","last_synced_at":"2025-07-11T03:32:04.371Z","repository":{"id":40680449,"uuid":"212403317","full_name":"vmware-tanzu/crash-diagnostics","owner":"vmware-tanzu","description":"Crash-Diagnostics (Crashd) is a tool to help investigate, analyze, and troubleshoot unresponsive or crashed Kubernetes clusters.","archived":false,"fork":false,"pushed_at":"2025-05-22T11:02:06.000Z","size":740,"stargazers_count":185,"open_issues_count":50,"forks_count":47,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-06-06T17:08:57.561Z","etag":null,"topics":["kubernetes","kubernetes-cluster","troubleshooting"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vmware-tanzu.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE.txt","code_of_conduct":"CODE-OF-CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":"ROADMAP.md","authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-10-02T17:39:58.000Z","updated_at":"2025-05-22T10:01:25.000Z","dependencies_parsed_at":"2023-02-12T23:01:09.706Z","dependency_job_id":"639bb35a-7f66-4a6a-8f61-85969c7fd18e","html_url":"https://github.com/vmware-tanzu/crash-diagnostics","commit_stats":null,"previous_names":[],"tags_count":34,"template":false,"template_full_name":null,"purl":"pkg:github/vmware-tanzu/crash-diagnostics","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmware-tanzu%2Fcrash-diagnostics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmware-tanzu%2Fcrash-diagnostics/tags","releases_url":"https://repos.
ecosyste.ms/api/v1/hosts/GitHub/repositories/vmware-tanzu%2Fcrash-diagnostics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmware-tanzu%2Fcrash-diagnostics/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vmware-tanzu","download_url":"https://codeload.github.com/vmware-tanzu/crash-diagnostics/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vmware-tanzu%2Fcrash-diagnostics/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264721342,"owners_count":23653923,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["kubernetes","kubernetes-cluster","troubleshooting"],"created_at":"2024-08-04T17:00:20.884Z","updated_at":"2025-07-11T03:32:04.365Z","avatar_url":"https://github.com/vmware-tanzu.png","language":"Go","funding_links":[],"categories":["OPS"],"sub_categories":[],"readme":"![](https://github.com/vmware-tanzu/crash-diagnostics/workflows/Crash%20Diagnostics%20Build/badge.svg)\n[![Go Report Card](https://goreportcard.com/badge/github.com/vmware-tanzu/crash-diagnostics)](https://goreportcard.com/report/github.com/vmware-tanzu/crash-diagnostics)\n\n# Crashd - Crash Diagnostics\n\nCrash Diagnostics (Crashd) is a tool that helps human operators to easily interact and collect information from infrastructures running on Kubernetes for tasks such as automated diagnosis and troubleshooting.  
\n\n## Crashd Features\n* Crashd uses the [Starlark language](https://github.com/google/starlark-go/blob/master/doc/spec.md), a Python dialect, to express and invoke automation functions\n* Easily automate interaction with infrastructures running Kubernetes\n* Interact with and capture information from compute resources such as machines (via SSH)\n* Automatically execute commands on compute nodes to capture results \n* Capture objects and cluster logs from the Kubernetes API server\n* Easily extract data from Cluster-API managed clusters \n\n\n## How Does it Work?\nCrashd executes script files, written in Starlark, that interact with a specified infrastructure along with its cluster resources.  Starlark script files contain predefined Starlark functions that are capable of interacting with, and collecting diagnostics and other information from, the servers in the cluster.\n\nFor details on the design of Crashd, see the [Google Doc design document](https://docs.google.com/document/d/1pqYOdTf6ZIT_GSis-AVzlOTm3kyyg-32-seIfULaYEs/edit?usp=sharing).\n\n## Installation\nThere are two ways to get started with Crashd. Either download a pre-built binary or pull down the code and build it locally.\n\n### Download binary\n1. Download the latest [binary release](https://github.com/vmware-tanzu/crash-diagnostics/releases/) for your platform\n2. Extract the tarball from the release\n   ```\n   tar -xvf \u003cRELEASE_TARBALL_NAME\u003e.tar.gz\n   ```\n3. Move the binary to a directory on your operating system's `PATH`\n\n\n### Compiling from source\nCrashd is written in Go and requires Go version 1.11 or later.  Clone the source from its repo or download it to your local directory.  
From the project's root directory, compile the code with the\nfollowing:\n\n```\nGO111MODULE=on go build -o crashd .\n```\n\nOr, you can run a versioned build using the `build.go` source code:\n\n```\ngo run .ci/build/build.go\n\nBuild amd64/darwin OK: .build/amd64/darwin/crashd\nBuild amd64/linux OK: .build/amd64/linux/crashd\n```\n\n## Getting Started\nA Crashd script consists of a collection of Starlark functions stored in a file.  For instance, the following script (saved as `diagnostics.crsh`) collects system information from a list of provided hosts using SSH.  The collected data is then bundled as a tar.gz file at the end: \n\n```python\n# Crashd global config\ncrshd = crashd_config(workdir=\"{0}/crashd\".format(os.home))\n\n# Enumerate compute resources \n# Define a host list provider with configured SSH\nhosts=resources(\n    provider=host_list_provider(\n        hosts=[\"170.10.20.30\", \"170.40.50.60\"], \n        ssh_config=ssh_config(\n            username=os.username,\n            private_key_path=\"{0}/.ssh/id_rsa\".format(os.home),\n        ),\n    ),\n)\n\n# collect data from hosts\ncapture(cmd=\"sudo df -i\", resources=hosts)\ncapture(cmd=\"sudo crictl info\", resources=hosts)\ncapture(cmd=\"df -h /var/lib/containerd\", resources=hosts)\ncapture(cmd=\"sudo systemctl status kubelet\", resources=hosts)\ncapture(cmd=\"sudo systemctl status containerd\", resources=hosts)\ncapture(cmd=\"sudo journalctl -xeu kubelet\", resources=hosts)\n\n# archive collected data\narchive(output_file=\"diagnostics.tar.gz\", source_paths=[crshd.workdir])\n```\n\nThe previous code snippet connects to two hosts (specified in the `host_list_provider`), executes commands remotely over SSH, and uses `capture` to store the results.\n\n\u003e See the complete list of supported [functions here](./docs/README.md).\n\n### Running the script\nTo run the script, do the following:\n\n```\n$\u003e crashd run diagnostics.crsh \n```\n\nIf you want to output debug information, use the 
`--debug` flag as shown:\n\n```\n$\u003e crashd run --debug diagnostics.crsh\n\nDEBU[0000] creating working directory /home/user/crashd\nDEBU[0000] run: executing command on 2 resources\nDEBU[0000] run: executing command on localhost using ssh: [sudo df -i]\nDEBU[0000] ssh.run: /usr/bin/ssh -q -o StrictHostKeyChecking=no -i /home/user/.ssh/id_rsa -p 22  user@localhost \"sudo df -i\"\nDEBU[0001] run: executing command on 170.10.20.30 using ssh: [sudo df -i]\n...\n```\n\nTo run Crashd in restricted mode, use the `--restrictedMode` flag as shown:\n\n```\n$\u003e crashd run --restrictedMode diagnostics.crsh\n```\nRestricted mode is used to prevent the execution of potentially harmful commands.  In restricted mode, the following commands are disabled: `run_local`, `capture_local`, `copy_to`\n\n## Compute Resource Providers\nCrashd utilizes the concept of a provider to enumerate compute resources. Each implementation of a provider is responsible for enumerating compute resources on which Crashd can execute commands using a transport (e.g., SSH). Crashd comes with several providers, including:\n\n* *Host List Provider* - uses an explicit list of host addresses (see previous example)\n* *Kubernetes Nodes Provider* - extracts host information from Kubernetes API node objects\n* *CAPV Provider* - uses Cluster-API to discover machines in a vSphere cluster\n* *CAPA Provider* - uses Cluster-API to discover machines running on AWS\n* More providers coming!\n\n\n## Accessing script parameters\nCrashd scripts can access external values that can be used as script parameters.\n### Environment variables\nCrashd scripts can access environment variables at runtime using the `os.getenv` method:\n```python\nkube_capture(what=\"logs\", namespaces=[os.getenv(\"KUBE_DEFAULT_NS\")])\n```\n\n### Command-line arguments\nScripts can also access command-line arguments passed as key/value pairs using the `--args` or `--args-file` flags. 
For instance, when the following command is used to start a script:\n\n```bash\n$ crashd run --args=\"kube_ns=kube-system, username=$(whoami)\" diagnostics.crsh\n```\n\nValues from `--args` can be accessed as shown below:\n\n```python\nkube_capture(what=\"logs\", namespaces=[\"default\", args.kube_ns])\n```\n\n## More Examples\n### SSH Connection via a jump host\nThe SSH configuration function can be configured with a jump user and jump host.  This is useful for providers that require a host proxy for SSH connection as shown in the following example:\n```python\nssh=ssh_config(username=os.username, jump_user=args.jump_user, jump_host=args.jump_host)\nhosts=host_list_provider(hosts=[\"some.host\", \"172.100.100.20\"], ssh_config=ssh)\n...\n```\n\n### Connecting to Kubernetes nodes with SSH\nThe following uses the `kube_nodes_provider` to connect to Kubernetes nodes and execute remote commands against those nodes using SSH:\n\n```python\n# SSH configuration\nssh=ssh_config(\n    username=os.username,\n    private_key_path=\"{0}/.ssh/id_rsa\".format(os.home),\n    port=args.ssh_port,\n    max_retries=5,\n)\n\n# enumerate nodes as compute resources\nnodes=resources(\n    provider=kube_nodes_provider(\n        kube_config=kube_config(path=args.kubecfg),\n        ssh_config=ssh,\n    ),\n)\n\n# exec `uptime` command on each node\nuptimes = run(cmd=\"uptime\", resources=nodes)\n\n# print `run` result from first node\nprint(uptimes[0].result)\n```\n\n\n### Retrieving Kubernetes API objects and logs\nThe `kube_capture` function is used in the following example to connect to a Kubernetes API server and retrieve Kubernetes API objects and logs.  
The retrieved data is then saved to the filesystem as shown below:\n\n```python\nnspaces=[\n    \"capi-kubeadm-bootstrap-system\",\n    \"capi-kubeadm-control-plane-system\",\n    \"capi-system\",\n    \"capi-webhook-system\",\n    \"cert-manager\",\n    \"tkg-system\",\n]\n\nconf=kube_config(path=args.kubecfg)\n\n# capture Kubernetes API object and store in files\nkube_capture(what=\"logs\", namespaces=nspaces, kube_config=conf)\nkube_capture(what=\"objects\", kinds=[\"services\", \"pods\"], namespaces=nspaces, kube_config=conf)\nkube_capture(what=\"objects\", kinds=[\"deployments\", \"replicasets\"], namespaces=nspaces, kube_config=conf)\n```\n\n### Interacting with Cluster-API managed machines running on vSphere (CAPV)\nAs mentioned, Crashd provides the `capv_provider` which allows scripts to interact with Cluster-API managed clusters running on a vSphere infrastructure (CAPV).  The following shows an abbreviated snippet of a Crashd script that retrieves diagnostic information from the management cluster machines managed by a CAPV-initiated cluster:\n\n```python\n# enumerates management cluster nodes\nnodes = resources(\n    provider=capv_provider(\n        ssh_config=ssh_config(username=\"capv\", private_key_path=args.private_key),\n        kube_config=kube_config(path=args.mc_config)\n    )\n)\n\n# execute and capture commands output from management nodes\ncapture(cmd=\"sudo df -i\", resources=nodes)\ncapture(cmd=\"sudo crictl info\", resources=nodes)\ncapture(cmd=\"sudo cat /var/log/cloud-init-output.log\", resources=nodes)\ncapture(cmd=\"sudo cat /var/log/cloud-init.log\", resources=nodes)\n...\n\n```\n\nThe previous snippet interacts with management cluster machines. 
The provider can be configured to enumerate workload machines (by specifying the name of a workload cluster) as shown in the following example:\n\n```python\n# enumerates workload cluster nodes\nnodes = resources(\n    provider=capv_provider(\n        workload_cluster=args.cluster_name,\n        ssh_config=ssh_config(username=\"capv\", private_key_path=args.private_key),\n        kube_config=kube_config(path=args.mc_config)\n    )\n)\n\n# execute and capture commands output from workload nodes\ncapture(cmd=\"sudo df -i\", resources=nodes)\ncapture(cmd=\"sudo crictl info\", resources=nodes)\n...\n```\n\n### All Examples\nSee all script examples in the [./examples](./examples) directory.\n\n## Roadmap\nThis project has numerous possibilities ahead of it.  Read about our evolving [roadmap here](ROADMAP.md).\n\n\n## Contributing\n\nNew contributors will need to sign a CLA (contributor license agreement). Details are described in our [contributing](CONTRIBUTING.md) documentation.\n\n\n## License\nThis project is available under the [Apache License, Version 2.0](LICENSE.txt)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvmware-tanzu%2Fcrash-diagnostics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvmware-tanzu%2Fcrash-diagnostics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvmware-tanzu%2Fcrash-diagnostics/lists"}