{"id":19347179,"url":"https://github.com/dcos/dcos-diagnostics","last_synced_at":"2026-02-28T11:09:14.929Z","repository":{"id":52413058,"uuid":"94367552","full_name":"dcos/dcos-diagnostics","owner":"dcos","description":"DC/OS Distributed Diagnostics Tool \u0026 Aggregation Service","archived":false,"fork":false,"pushed_at":"2023-02-25T06:09:54.000Z","size":9018,"stargazers_count":6,"open_issues_count":12,"forks_count":26,"subscribers_count":42,"default_branch":"master","last_synced_at":"2024-06-20T12:39:38.548Z","etag":null,"topics":["dcos","dcos-checks","dcos-diagnostics","dcos-ux-guild"],"latest_commit_sha":null,"homepage":"","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dcos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":"CODEOWNERS","security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-06-14T19:49:08.000Z","updated_at":"2021-01-24T10:38:14.000Z","dependencies_parsed_at":"2024-06-20T11:55:06.048Z","dependency_job_id":"1bb9147d-211c-4c81-b772-389f7f90bf30","html_url":"https://github.com/dcos/dcos-diagnostics","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcos%2Fdcos-diagnostics","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcos%2Fdcos-diagnostics/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcos%2Fdcos-diagnostics/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dcos%2Fdcos-diagnostics/manifests","owner_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/owners/dcos","download_url":"https://codeload.github.com/dcos/dcos-diagnostics/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223909905,"owners_count":17223592,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dcos","dcos-checks","dcos-diagnostics","dcos-ux-guild"],"created_at":"2024-11-10T04:14:38.799Z","updated_at":"2026-02-28T11:09:09.877Z","avatar_url":"https://github.com/dcos.png","language":"Go","funding_links":[],"categories":[],"sub_categories":[],"readme":"# dcos-diagnostics [![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0) [![Jenkins](https://jenkins.mesosphere.com/service/jenkins/buildStatus/icon?job=public-dcos-cluster-ops/dcos-diagnostics/dcos-diagnostics-master)](https://jenkins.mesosphere.com/service/jenkins/job/public-dcos-cluster-ops/job/dcos-diagnostics/job/dcos-diagnostics-master/) [![Go Report Card](https://goreportcard.com/badge/github.com/dcos/dcos-diagnostics)](https://goreportcard.com/report/github.com/dcos/dcos-diagnostics)\n\ndcos-diagnostics is a monitoring agent which exposes an HTTP API that can be queried through the `/system/health/v1` DC/OS API.\nThe dcos-diagnostics puller collects data from agents and reports individual node health for things like\nsystem resources as well as DC/OS-specific services.\n\ndcos-diagnostics generates historical mesos-states bundles.\nFor more context see: https://github.com/dcos/dcos/pull/5907.\n\n### Architecture\n\nOriginally dcos-diagnostics was designed in 
a Master/Agent model. It runs on every DC/OS node.\n\n* Master\n\nMaster runs on DC/OS Masters. It is the point of entry to dcos-diagnostics for remote systems (e.g., the UI).\nMaster is able to query other nodes for health status. Master is responsible for generating the cluster diagnostics bundle.\n\n* Public Agent and Agent\n\nAgent runs on every non-Master node (excluding the bootstrap node). The main responsibility of Agent is providing a JSON\nreport of the health of DC/OS systemd components. Agent also provides logs that should appear in the cluster bundle.\n\n\n### Diagnostics Bundle\n\nA diagnostics bundle is just a ZIP file with all the files useful when debugging problems.\nIt can be treated as a flight recorder (black box), but for clusters. The list of interesting files, commands and endpoints\nthat should be fetched into the bundle is configurable and deployed with the dcos-diagnostics binary.\nThe diagnostics bundle generation process fetches all configured files and stores them in a single ZIP.\nThe ZIP contains directories named after each node's IP and role\n(see: [api/rest/coordinator.go](https://github.com/dcos/dcos-diagnostics/blob/f719ffce0339f07f1be4a1ca24fb8b96fe94dff4/api/rest/coordinator.go#L361-L363)).\n\nThe contents of the generated bundle are not stable over time, and any internal or third-party bundle analysis tooling\nshould be programmed very defensively in this regard.\nSee: [dcos-docs-site#2253](https://github.com/mesosphere/dcos-docs-site/pull/2253)\n\n#### API\n\nAPI documentation can be found in the [docs](/docs) directory. It uses\n[OpenAPI v3.0](https://github.com/OAI/OpenAPI-Specification/blob/master/versions/3.0.0.md).\nYou can see a rendered version\n[here](https://temando.github.io/open-api-renderer/demo/?url=https://raw.githubusercontent.com/dcos/dcos-diagnostics/master/docs/api.yaml).\nThere are two versions of the bundle API.\n\n1. Old serial API – a single master calls every node for data. 
This API is deprecated and should be removed in DC/OS 2.2.\n\n![deprecated cluster bundle creation diagram](docs/diagrams/deprecated_cluster_bundle_creation_diagram.png)\n\n2. New parallel API – a single master schedules local bundle creation on every node in the cluster. The master then waits until\nthe nodes finish their bundles, downloads the finished bundles, and merges them into a single cluster bundle ZIP.\n\n![cluster bundle creation diagram](docs/diagrams/cluster_bundle_creation_diagram.png)\n\nThe old API is faster for smaller clusters but slow for large clusters, so we recommend using only the new API,\nwhich has been available since DC/OS 2.0.\n\nFor more information, read [the design doc](https://docs.google.com/document/d/1UU47_ZVBPQRzzSc9D57W4h7VtzRyMxiLTcZ4XKfwA5I/edit?usp=sharing).\n\n### History\n\nIn the past dcos-diagnostics was bundled with:\n\n* [dcos-checks](https://github.com/dcos/dcos-checks)\n* [dcos-checks-runner](https://github.com/dcos/dcos-check-runner)\n\n– see: https://github.com/dcos/dcos-diagnostics/pull/35\nAt that time dcos-diagnostics was called `3dt`\n([DC/OS Distributed Diagnostics Tool](https://github.com/dcos/3dt/tree/master)).\nIt was deprecated in June 2017, but some references might still exist.\n\n\n## Build\n\n```\ngo get github.com/dcos/dcos-diagnostics\ncd $GOPATH/src/github.com/dcos/dcos-diagnostics\nmake\nbuild/dcos-diagnostics --version\n```\n\n## Run\nRun dcos-diagnostics once on a DC/OS host to check systemd units:\n\n```\ndcos-diagnostics --diag\n```\n\nGet verbose log output:\n\n```\ndcos-diagnostics --diag --verbose\n```\n\nRun the dcos-diagnostics aggregation service to query all cluster hosts for health state:\n\n```\ndcos-diagnostics daemon --pull\n```\n\nStart the dcos-diagnostics health API endpoint:\n\n```\ndcos-diagnostics daemon\n```\n\n### dcos-diagnostics daemon options\n\n| Flag                          |   Type  | Description                                                                                               
|\n|-------------------------------|:-------:|-----------------------------------------------------------------------------------------------------------|\n| agent-port                    |   int   | Use TCP port to connect to agents. (default 1050)                                                         |\n| ca-cert                       |  string | Use certificate authority.                                                                                |\n| command-exec-timeout          |   int   | Set command executing timeout (default 50)                                                                |\n| debug                         |   bool  | Enable pprof debugging endpoints.                                                                         |\n| diagnostics-bundle-dir        |  string | Set a path to store diagnostic bundles (default \"/var/run/dcos/dcos-diagnostics/diagnostic_bundles\")      |\n| diagnostics-job-timeout       |   int   | Set a global diagnostics job timeout (default 720)                                                        |\n| diagnostics-units-since       |  string | Collect systemd units logs since (default \"24h\")                                                          |\n| diagnostics-url-timeout       |   int   | Set a local timeout for every single GET request to a log endpoint (default 1)                            |\n| endpoint-config               | strings | Use endpoints_config.json (default [/opt/mesosphere/etc/endpoints_config.json])                           |\n| exhibitor-url                 |  string | Use Exhibitor URL to discover master nodes. (default \"http://127.0.0.1:8181/exhibitor/v1/cluster/status\") |\n| fetchers-count                |   int   | Set a number of concurrent fetchers gathering nodes logs (default 1)                                      |\n| force-tls                     |   bool  | Use HTTPS to do all requests.                                                                             
|\n| health-update-interval        |   int   | Set update health interval in seconds. (default 60)                                                       |\n| hostname                      |  string | A host name (by default it uses system hostname) (default \"orion\")                                        |\n| iam-config                    |  string | A path to identity and access management config                                                           |\n| ip-discovery-command-location |  string | A command used to get local IP address                                                                    |\n| master-port                   |   int   | Use TCP port to connect to masters. (default 1050)                                                        |\n| no-unix-socket                |   bool  | Disable use unix socket provided by systemd activation.                                                   |\n| port                          |   int   | Web server TCP port. (default 1050)                                                                       |\n| pull                          |   bool  | Try to pull runner from DC/OS hosts.                                                                      |\n| pull-interval                 |   int   | Set pull interval in seconds. (default 60)                                                                |\n| pull-timeout                  |   int   | Set pull timeout. (default 3)                                                                             |\n\n## Test\n```\nmake test\n```\n\n## Future\n\nStarting with DC/OS 2.0 we deprecated \"old\" bundle API and proposed new parallel API. The deprecation process should\nbe finished with DC/OS 2.3 and all code responsible for old API can be deleted. 
In order to do this we need to change\nall scripts in other DC/OS components to use the new DC/OS Diagnostics CLI.\n\nThe new Diagnostics Bundle API gives us the opportunity to create a diagnostics bundle on a single node even if the DC/OS cluster is down.\nThe next step should be making dcos-diagnostics independent of DC/OS. Currently, a cluster bundle will not be generated if\nMesos, Admin Router or DNS is down. To achieve this we should move from a single service to a binary deployed on the cluster.\nThis idea is described in the [design doc](https://docs.google.com/document/d/1Z6dcOK1_IQFlHGiQ_y1jsZ4RTo0IVpkRrKPGzniHC4E/edit?usp=sharing).\n\nWe keep user stories in [this doc](https://docs.google.com/document/d/1tuzwye3EvraGw15bqE7yI9PZ_QYwX3x_eNi3xmFP554/edit?usp=sharing).\nTasks are gathered under [DCOS-57837](https://jira.mesosphere.com/browse/DCOS-57837).","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcos%2Fdcos-diagnostics","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdcos%2Fdcos-diagnostics","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdcos%2Fdcos-diagnostics/lists"}