https://github.com/caplin/caplin-platform-diagnostics


# Caplin Platform Diagnostics

Caplin Platform Diagnostics is a collection of Bash scripts that collect diagnostics on a running or crashed Caplin Platform component.

The scripts automate a series of common Linux diagnostic commands that Caplin Support ask customers to run when raising a support request (see [Send diagnostic information to Caplin Support](https://www.caplin.com/developer/caplin-platform/platform-architecture/get-information-about-a-failed-platform-component) on the Caplin website).

Caplin Platform Diagnostics is made available under an MIT licence.

**Contents**:

* [Requirements](#requirements)
* [Installation](#installation)
* [Quick start](#quick-start)
* [Running diagnostics on a core file](#running-diagnostics-on-a-core-file)
* [Running diagnostics on a process](#running-diagnostics-on-a-process)

## Requirements

The Caplin Platform Diagnostics scripts have the following requirements:

* [CentOS](https://www.centos.org/)/[RHEL](https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux) 6 or 7
* GNU Debugger: `$ sudo yum install gdb`
* Red Hat OpenJDK 8 (full JDK, not just JRE): `$ sudo yum install java-1.8.0-openjdk-devel`

## Installation

Copy (or symlink) the two diagnostic scripts to a directory on your executable path, for example `~/bin/` or `/usr/local/bin/`.
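As a rough illustration of the symlink option, the sketch below links both scripts into a target directory. The function name `install_diagnostics` and its two arguments are hypothetical and not part of the Caplin scripts; only the two script file names come from this README.

```bash
#!/usr/bin/env bash
# Hypothetical helper: symlink the two diagnostic scripts into a directory
# on the executable path. Assumes the scripts sit together in one directory.
install_diagnostics() {
  local src_dir dest_dir=$2 script
  # Resolve to an absolute path so the symlinks do not dangle
  src_dir=$(cd "$1" 2>/dev/null && pwd) || {
    echo "Source directory not found: $1" >&2
    return 1
  }
  mkdir -p "$dest_dir" || return 1
  for script in caplin-corefile-diagnostics.sh caplin-process-diagnostics.sh; do
    if [ ! -f "$src_dir/$script" ]; then
      echo "Missing $src_dir/$script" >&2
      return 1
    fi
    chmod +x "$src_dir/$script" || return 1
    # -f replaces a stale link left by a previous install
    ln -sf "$src_dir/$script" "$dest_dir/$script" || return 1
  done
}

# Example: install_diagnostics "$PWD" "$HOME/bin"
```

Symlinking rather than copying means an updated checkout of the scripts is picked up without reinstalling.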

## Quick start

To run diagnostics on a **process**, follow the steps below:

1. Install dependencies, if not already installed:

   ```
   $ sudo yum install gdb java-1.8.0-openjdk-devel
   ```

1. Run the script below _under the same user as the target process_, replacing `pid` with the target's process ID (run time 20 seconds):

   ```
   $ caplin-process-diagnostics.sh pid
   ```

   For full details and options, see [Running diagnostics on a process](#running-diagnostics-on-a-process).

1. Upload the generated tar file and any log files requested by Caplin Support to Caplin's [File Upload Facility](https://www.caplin.com/account/uploads).

To run diagnostics on a **core file**, follow the steps below:

1. Install dependencies, if not already installed:

   ```
   $ sudo yum install gdb
   ```

1. Run the script below, replacing `core` with the path to the core file:

   ```
   $ caplin-corefile-diagnostics.sh core
   ```

   For full details and options, see [Running diagnostics on a core file](#running-diagnostics-on-a-core-file).

1. Upload the generated tar file and any log files requested by Caplin Support to Caplin's [File Upload Facility](https://www.caplin.com/account/uploads).

## Running diagnostics on a core file

The `caplin-corefile-diagnostics.sh` script collates diagnostics for a core file dumped by a crashed Caplin Platform component.

The diagnostics collated include all the files Caplin Support require to analyse the core file: the component binary, the core file, and all shared libraries referenced in the core file. For the full list of information collated, see [Information collated](#information-collated), below.

Run this script on the crashed component's host, or, if this is not possible, on an identically configured host (same operating system and Java versions).

After running the script, log in to Caplin's secure [File Upload Facility](https://www.caplin.com/account/uploads) and upload the following files:

* Tar archive generated by the `caplin-corefile-diagnostics.sh` script
* Java virtual machine log and error files (if available):
  * HotSpot JVM error file (`hs_err_pid<pid>.log`)
  * Heap dump file (`var/java_pid<pid>.hprof`)
  * Garbage collection log (`var/gc.log`)
* Caplin log files for the period of the incident
* Caplin configuration files

### Requirements

This script has the following requirements:

* [CentOS](https://www.centos.org/)/[RHEL](https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux) 6 or 7
* GNU Debugger (`gdb` RPM package)
* Write permission to the current directory
* Run on the crashed component's host or, if this is not possible, on an identically configured host (same operating system and Java versions)

### Usage

**Syntax**: `caplin-corefile-diagnostics.sh core [binary]`

* `core`: path to the core file dumped by the crashed process.
* `binary`: path to the crashed process's binary. Defaults to the path of the binary recorded in the core file.

**Run as**: any user

**Runtime**: < 1 minute

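**Output**: a `.tar.gz` archive named after the host, binary, and core file, with a timestamp suffix (for example, `diagnostics-server1-rttpd-core.4972-20190916104354.tar.gz`)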

### Information collated

This script collates the following information:

| Diagnostic | Dependencies | User |
|----------------------------------------------|-------------------|------|
| `/etc/os-release` | - | - |
| `/etc/redhat-release` | - | - |
| `/etc/security/limits.conf` | - | - |
| `/etc/security/limits.d/*` | - | - |
| `ulimit -aS` output | - | - |
| `ulimit -aH` output | - | - |
| `uname -a` output | - | - |
| `df` output for binary's 'var' directory | - | - |
| Caplin `dfw versions` output | Binary is in a [DFW](https://www.caplin.com/developer/caplin-platform/deployment-framework/) | - |
| Core file | - | - |
| Core file backtrace | `gdb` RPM package | - |
| Core file libraries | `gdb` RPM package | - |
| Component binary | - | - |
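For orientation, the "Core file backtrace" and "Core file libraries" rows can be approximated by hand with standard GDB commands. This is a sketch only: the function name `core_backtrace` is hypothetical, and the exact GDB commands the Caplin script runs are not stated in this README.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print thread backtraces and the shared-library list
# for a core file. An approximation, not the Caplin script's own logic.
core_backtrace() {
  local binary=$1 core=$2
  if [ ! -r "$binary" ] || [ ! -r "$core" ]; then
    echo "Usage: core_backtrace BINARY CORE (both must be readable)" >&2
    return 2
  fi
  if ! command -v gdb >/dev/null 2>&1; then
    echo "gdb not found; install the gdb RPM package" >&2
    return 3
  fi
  # --batch runs the -ex commands in order and exits without a prompt
  gdb --batch \
      -ex 'thread apply all bt' \
      -ex 'info sharedlibrary' \
      "$binary" "$core"
}

# Example: core_backtrace bin/rttpd core.4972 > core.4972.backtrace.out
```

Backtraces are only meaningful when GDB can load the same shared libraries the crashed process used, which is why the script is best run on the crashed component's host.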

### Example

The example below collates diagnostics for a core file, `core.4972`, dumped by a Liberator binary, `rttpd`:

```console
$ ./caplin-corefile-diagnostics.sh ~/dfw1/servers/Liberator/core.4972

Caplin Core-file Diagnostics
============================

Host: server1
Core: /home/caplin/dfw1/servers/Liberator/core.4972
Binary: /home/caplin/dfw1/servers/Liberator/bin/rttpd
GDB installed: 1
Script temp dir: diagnostics-server1-rttpd-core.4972-20190916104354

Recording /etc/os-release
Recording /etc/redhat-release
Recording 'uname -a' output
Recording 'df' output for /home/caplin/dfw1/servers/Liberator/var
Recording 'dfw versions' output
Getting thread backtraces from core.4972
Getting list of libraries referenced by core.4972
Copying libraries referenced by core.4972

DONE

Files collected:

core.4972
core.4972.backtrace.out
core.4972.libs.tar
dfw-versions.out
diagnostics.log
libs-list.out
libs-list.txt
os-release
redhat-release
rttpd
uname.out

Archiving files to diagnostics-server1-rttpd-core.4972-20190916104354.tar.gz

Please login to https://www.caplin.com/account/uploads
and upload the archive to Caplin Support.
```

## Running diagnostics on a process

The `caplin-process-diagnostics.sh` script collates diagnostics for a process without terminating the process.

The script's run time is 20 seconds for the default set of diagnostics. Optional diagnostics take longer, and their run time varies. For example, the run time for the optional GDB core dump (`--gcore`) depends on the size of the target process in memory, and on the host's disk I/O and CPU performance.

For the full list of information collated, see [Information collated](#information-collated-1), below.

After running the script, log in to Caplin's secure [File Upload Facility](https://www.caplin.com/account/uploads) and upload the following files:

* Tar archive generated by the `caplin-process-diagnostics.sh` script
* Java virtual machine log files (if available):
  * Garbage collection log (`var/gc.log`)
* Caplin log files for the period of the incident
* Caplin configuration files

### Requirements

The main dependency is the GNU Debugger (`gdb` package). This is required for generating stack traces and a core dump.

If any requirements are missing when you run the script, the script lists the missing dependencies and asks if you wish to continue. If you choose to continue, the script skips any diagnostics with missing dependencies.
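A dependency check of this kind can be sketched with the shell's `command -v` builtin. The function name `missing_commands` is hypothetical; the script's actual check is not shown in this README.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each named command that is not on the PATH.
missing_commands() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
  done
}

# Example: check the tools named in this README before collecting diagnostics.
# missing=$(missing_commands gdb jcmd strace)
# [ -z "$missing" ] || echo "Missing dependencies: $missing"
```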

**All diagnostics**:

* [CentOS](https://www.centos.org/)/[RHEL](https://www.redhat.com/en/technologies/linux-platforms/enterprise-linux) 6 or 7
* Write permission to the current directory

**GDB core dump and backtrace diagnostics**:
* `gdb` RPM package
* Free disk space greater than the process's virtual memory
* **CentOS/RHEL 7**: [SELinux](https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/selinux_users_and_administrators_guide/index) boolean `deny_ptrace` set to off (if SELinux is enabled and enforcing).
* **CentOS/RHEL 7**: [Yama kernel module](https://www.kernel.org/doc/Documentation/security/Yama.txt) sysctl setting `kernel.yama.ptrace_scope` set to 0, 1, or 2.
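The free-disk-space requirement can be checked before requesting a core dump. The sketch below compares the process's virtual memory size with the space available in a target directory, both in KiB. The function names are hypothetical, and the check is an approximation rather than the script's own logic.

```bash
#!/usr/bin/env bash
# Succeed only if free space (KiB) is greater than the process size (KiB).
enough_space_for_core() {
  local vsz_kib=$1 free_kib=$2
  [ "$free_kib" -gt "$vsz_kib" ]
}

# Hypothetical wrapper: look up both numbers for a live process and a
# directory (defaults to the current directory).
check_core_space() {
  local pid=$1 dir=${2:-.} vsz_kib free_kib
  # ps reports vsz in KiB; "vsz=" suppresses the column header
  vsz_kib=$(ps -o vsz= -p "$pid" | tr -d ' ')
  # df -P prints one line per filesystem; column 4 is available KiB with -k
  free_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
  if [ -z "$vsz_kib" ] || [ -z "$free_kib" ]; then
    echo "Could not read process size or free space" >&2
    return 2
  fi
  enough_space_for_core "$vsz_kib" "$free_kib"
}

# Example: check_core_space 4972 . || echo "Not enough space for a core dump"
```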

**JVM diagnostics**:
* `java-1.8.0-openjdk-devel` RPM package. This package installs the full JDK, which includes the `jcmd` diagnostic tool.

**Optional `strace` diagnostic**:
* `strace` RPM package. Only required if requested by Caplin Support.

### Usage

**Syntax**: `caplin-process-diagnostics.sh [options] pid`

* `pid`: process identifier of the running component
* Options:
  * `--gcore`: include the optional GDB core file dump diagnostic. Only include this diagnostic when requested by Caplin Support.
  * `--jvm-heap`: include the optional JVM heap dump diagnostic. Halts the JVM temporarily for the duration of the diagnostic. Only include this diagnostic when requested by Caplin Support.
  * `--jvm-class-histogram`: include the optional JVM class histogram diagnostic. Halts the JVM temporarily for the duration of the diagnostic. Only include this diagnostic when requested by Caplin Support.
  * `--strace`: include the optional `strace` diagnostic. Only include this diagnostic when requested by Caplin Support.
  * `--help`: display help and exit
  * `--version`: display version and exit

**Run as**:

* CentOS/RHEL 6: the process's user
* CentOS/RHEL 7:
  * `kernel.yama.ptrace_scope=0`: the process's user
  * `kernel.yama.ptrace_scope=1`: root (required for core dump, thread backtraces, and `strace`)
  * `kernel.yama.ptrace_scope=2`: root (required for core dump, thread backtraces, and `strace`)
  * `kernel.yama.ptrace_scope=3`: the process's user (core dump, thread backtraces, and `strace` prohibited for all users)
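The mapping above can be expressed as a small lookup, sketched below. The function names are hypothetical. Treating a missing Yama sysctl file as scope 0 is an assumption, based on this README's statement that CentOS/RHEL 6 runs the script as the process's user.

```bash
#!/usr/bin/env bash
# Map a kernel.yama.ptrace_scope value to the user this README recommends.
run_as_for_scope() {
  case $1 in
    0)   echo "process user" ;;
    1|2) echo "root" ;;
    3)   echo "process user (ptrace diagnostics unavailable)" ;;
    *)   echo "unknown"; return 1 ;;
  esac
}

# Read the live setting. Assumption: if the file is absent (Yama not
# present), fall back to 0.
current_ptrace_scope() {
  cat /proc/sys/kernel/yama/ptrace_scope 2>/dev/null || echo 0
}

# Example: run_as_for_scope "$(current_ptrace_scope)"
```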

**Runtime**: 20s for the default set of diagnostics

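**Output**: a `.tar.gz` archive named after the host, binary, and process ID, with a timestamp suffix (for example, `diagnostics-server1-rttpd-4972-20190916102608.tar.gz`)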
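To locate or predict the archive from another script, the name can be rebuilt from its parts. The pattern below is inferred from the example output later in this README (`diagnostics-server1-rttpd-4972-20190916102608.tar.gz`), not from the script source, and the function name is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: compose an archive name in the form seen in this
# README's example output: diagnostics-HOST-BINARY-PID-TIMESTAMP.tar.gz
diagnostics_archive_name() {
  local host=$1 binary=$2 pid=$3 timestamp=$4
  # Only the file name of the binary appears in the archive name
  printf 'diagnostics-%s-%s-%s-%s.tar.gz\n' \
    "$host" "$(basename "$binary")" "$pid" "$timestamp"
}

# Example with live values (timestamp format assumed to be YYYYmmddHHMMSS):
# diagnostics_archive_name "$(hostname -s)" /path/to/rttpd 4972 "$(date +%Y%m%d%H%M%S)"
```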

### Information collated

Default diagnostics:

| Diagnostic | Dependencies | User |
|---------------------------------------|----------------------|----------|
| `/etc/os-release` | - | - |
| `/etc/redhat-release` | - | - |
| `uname -a` output | - | - |
| `/proc/sys/kernel/core_pattern` | - | - |
| `/proc/sys/kernel/core_uses_pid` | - | - |
| `/proc/<pid>/limits` | - | - |
| `/etc/security/limits.conf` | - | - |
| `/etc/security/limits.d/*` | - | - |
| `top` output for the system (5 seconds)| - | - |
| `top` output for the process (5 seconds)| - | - |
| `df` output for the process's `var` directory| - | - |
| `free` output | - | - |
| `vmstat` output (5 seconds) | - | - |
| Caplin `dfw info` output | Process binary is in a [DFW](https://www.caplin.com/developer/caplin-platform/deployment-framework/)| - |
| Caplin `dfw status` output | Process binary is in a [DFW](https://www.caplin.com/developer/caplin-platform/deployment-framework/)| - |
| Caplin `dfw versions` output | Process binary is in a [DFW](https://www.caplin.com/developer/caplin-platform/deployment-framework/)| - |
| JVM `jcmd Thread.print` output | `jcmd` JDK command | _Note 1_ |
| JVM `jcmd GC.heap_info` output | `jcmd` JDK command | _Note 1_ |
| JVM `jcmd VM.system_properties` output | `jcmd` JDK command| _Note 1_ |
| JVM `jcmd VM.flags` output | `jcmd` JDK command | _Note 1_ |
| JVM `jcmd PerfCounter.print` output | `jcmd` JDK command | _Note 1_ |
| JVM `jstat -gc <pid>` output | `jstat` JDK command | _Note 1_ |
| JVM `jstat -gcutil <pid>` output | `jstat` JDK command | _Note 1_ |
| GDB thread backtrace | `gdb` RPM package | _Note 2_ |
| Process binary | - | - |

Optional diagnostics (only enable if requested by Caplin Support):

| Diagnostic | Dependencies | User |
|---------------------------------------|----------------------|----------|
| GDB core-file dump, backtrace, and libraries | `gdb` RPM package | _Note 2_ |
| JVM `jcmd GC.heap_dump` output | `jcmd` JDK command | _Note 1_ |
| JVM `jcmd GC.class_histogram` output | `jcmd` JDK command| _Note 1_ |
| `strace` output (system-call logging) | `strace` RPM package | _Note 2_ |

**Note 1**: JVM diagnostics must be run as the process's user. If you run the script as root, then the script uses `sudo` to run the JVM diagnostics as the process's user.

**Note 2**: GDB thread backtraces, GDB core dump, and `strace` can be run as the process's user, unless prohibited by the [Yama kernel module](https://www.kernel.org/doc/Documentation/security/Yama.txt) (introduced in CentOS/RHEL 7). The script will advise you if root privileges are required to run these diagnostics.
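The behaviour described in Note 1 can be sketched as follows: find the owner of the target process and, if the caller is a different user, run `jcmd` through `sudo` as that owner. The function names are hypothetical, the lookup relies on Linux `/proc` and GNU `stat`, and the script's real implementation may differ.

```bash
#!/usr/bin/env bash
# Numeric UID that owns a running process (Linux /proc, GNU stat).
process_uid() {
  stat -c %u "/proc/$1"
}

# Hypothetical helper: run a jcmd diagnostic as the owner of the target JVM.
jcmd_as_owner() {
  local pid=$1 uid
  shift
  uid=$(process_uid "$pid") || return 1
  if [ "$(id -u)" -eq "$uid" ]; then
    jcmd "$pid" "$@"
  else
    # sudo accepts a numeric UID when it is prefixed with '#'
    sudo -u "#$uid" jcmd "$pid" "$@"
  fi
}

# Example: jcmd_as_owner 4972 Thread.print
```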

### Performance impact

The default set of diagnostics includes only one diagnostic that directly impacts the performance of the target process:

* **GDB thread backtrace**: the target process is halted temporarily for less than 1 second for each backtrace.
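The example output at the end of this README shows three backtraces taken one second apart. A sampling loop of that shape might look like the sketch below. The function name, output file names, and GDB command line are assumptions, not the script's actual code.

```bash
#!/usr/bin/env bash
# Hypothetical helper: take COUNT thread backtraces of a live process,
# one second apart. GDB halts the process briefly while it is attached.
sample_backtraces() {
  local pid=$1 count=${2:-3} i
  case $pid in
    ''|*[!0-9]*) echo "PID must be numeric" >&2; return 2 ;;
  esac
  for ((i = 1; i <= count; i++)); do
    echo "$i/$count: dumping GDB thread backtraces for process $pid"
    # --batch with -p attaches, runs the -ex command, then detaches and exits
    gdb --batch -p "$pid" -ex 'thread apply all bt' \
      > "backtrace-$pid-$(date +%Y%m%d%H%M%S).out" 2>&1
    if [ "$i" -lt "$count" ]; then
      sleep 1
    fi
  done
}

# Example: sample_backtraces 4972 3
```

Taking several samples makes it easier to tell a thread that is stuck from one that was merely busy at a single instant.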

The optional diagnostics have a potentially greater performance impact and should only be enabled when requested by Caplin Support:

* **GDB core dump**: the target process is halted temporarily for the time it takes the [gcore](http://man7.org/linux/man-pages/man1/gcore.1.html) command to write the process's virtual memory to a core file. The execution time is determined by the size of the process's virtual memory (`ps -o vsz= -q <pid>`) and the host's CPU and I/O performance.

* **strace**: slows performance of the target process for the duration of the diagnostic (40 seconds).

* **JVM heap dump**: halts the JVM temporarily for the duration of the diagnostic.

* **JVM class histogram**: halts the JVM temporarily for the duration of the diagnostic.

### Example

The example below collates diagnostics for a Liberator running as process 4972:

```console
$ ./caplin-process-diagnostics.sh 4972

Caplin Process Diagnostics
==========================

Process ID: 4972
Process binary: /home/caplin/dfw1/kits/Liberator/Liberator-7.1.9-313149/bin/rttpd

Script user: same user as process 4972
Script temp dir: ./diagnostics-server1-rttpd-4972-20190916102608

Recording /etc/redhat-release
Recording 'uname -a' output
Recording /proc/sys/kernel/core_pattern
Recording /proc/sys/kernel/core_uses_pid
Recording /proc/4972/limits
Recording 'top' output (5 seconds)
Recording 'top' output for process 4972 (5 seconds)
Recording process 4972 limits (/proc/4972/limits)
Recording 'df' output for /home/caplin/dfw1/servers/Liberator/var
Recording 'free' output
Recording 'vmstat' output (5 seconds)
Recording 'dfw info' output
Recording 'dfw status' output
Recording 'dfw versions' output
1/3: Dumping GDB thread backtraces for process 4972
Sleeping for 1 second...
2/3: Dumping GDB thread backtraces for process 4972
Sleeping for 1 second...
3/3: Dumping GDB thread backtraces for process 4972
1/3: Dumping JVM stack trace for process 4972
Sleeping for 1 second...
2/3: Dumping JVM stack trace for process 4972
Sleeping for 1 second...
3/3: Dumping JVM stack trace for process 4972
Recording JVM heap info
Recording JVM properties
Recording JVM flags
Recording JVM performance counters
Recording JVM jstat GC output

DONE

Files collected:
df.out
dfw-info.out
dfw-status.out
dfw-versions.out
diagnostics.log
free.out
jvm-flags
jvm-heapinfo
jvm-jstat-gc
jvm-jstat-gcutil
jvm-perfcounter
jvm-props
jvm-stacktrace-20190916102810.out
jvm-stacktrace-20190916102811.out
jvm-stacktrace-20190916102812.out
proc-4972-limits
proc-sys-kernel-core_pattern
proc-sys-kernel-core_uses_pid
redhat-release
rttpd-backtrace-20190916102806.out
rttpd-backtrace-20190916102808.out
rttpd-backtrace-20190916102809.out
top-4972.out
top.out
uname.out
vmstat.out

Archiving files to diagnostics-server1-rttpd-4972-20190916102608.tar.gz

Please login to https://www.caplin.com/account/uploads
and upload the archive to Caplin Support.
```