https://github.com/cdapio/chaos-monkey

Chaos Monkey for CDAP

# Chaos Monkey

Chaos Monkey provides a convenient way to disrupt CDAP and Hadoop services on a cluster.
Disruptions can be scheduled, randomized, or issued on command.

## Standalone Chaos Monkey
To start the Chaos Monkey daemon and HTTP server, set the configuration properties in chaos-monkey-site.xml and run ChaosMonkeyMain.

### Configurations
**Disruptions setup**

>By default, the following disruptions will be available to each service:

>* start

>* restart

>* stop

>* terminate

>* kill

>* rolling-restart

>
>Custom disruptions can be added by extending the Disruption class and associating them with a service.
>A custom disruption is started by calling ClusterDisruptor.disrupt(serviceName, disruptionName, actionArguments),
>where disruptionName is the value returned by the disruption's Disruption.getName() method.
>Each disruption receives a collection of RemoteProcess objects, selected according to actionArguments, which it can
>use to execute commands via SSH. To add a custom disruption to a service:
>* {service}.disruptions - Class paths of custom disruptions, separated by commas

**Initialize a service for Chaos Monkey**

>Any configured service can be controlled through ClusterDisruptor or the REST endpoints. To configure a service for
>Chaos Monkey, provide either custom disruptions or a pid file for the default disruptions:

>* {service}.pidFile - Path to the .pid file of the service

**Configurations for scheduled disruptions**

>These additional properties can be set on a given service to start a scheduled disruption:

>* {service}.interval - Number of seconds between each disruption

>* {service}.killProbability - Number between 0 and 1 giving the probability that a kill occurs in each iteration.

>* {service}.stopProbability - Number between 0 and 1 giving the probability that a stop occurs in each iteration.

>* {service}.restartProbability - Number between 0 and 1 giving the probability that a restart occurs in each iteration.

>* {service}.minNodesPerIteration - Minimum number of nodes affected each iteration.

>* {service}.maxNodesPerIteration - Maximum number of nodes affected each iteration.
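
The scheduled-disruption properties above can be sanity-checked and rendered before they are placed in chaos-monkey-site.xml. The following is a minimal sketch, not part of Chaos Monkey itself. It assumes the file uses the Hadoop-style `<configuration>`/`<property>` layout, and the service name `hbase-master` and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def scheduled_properties(service, interval, kill=0.0, stop=0.0, restart=0.0,
                         min_nodes=1, max_nodes=1):
    """Build the {service}.* properties for a scheduled disruption as a dict."""
    for name, p in (("kill", kill), ("stop", stop), ("restart", restart)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} probability must be between 0 and 1, got {p}")
    if min_nodes > max_nodes:
        raise ValueError("minNodesPerIteration must not exceed maxNodesPerIteration")
    return {
        f"{service}.interval": str(interval),
        f"{service}.killProbability": str(kill),
        f"{service}.stopProbability": str(stop),
        f"{service}.restartProbability": str(restart),
        f"{service}.minNodesPerIteration": str(min_nodes),
        f"{service}.maxNodesPerIteration": str(max_nodes),
    }


def to_site_xml(properties):
    """Render properties as XML text.

    Assumption: chaos-monkey-site.xml follows the Hadoop-style layout of
    <configuration><property><name/><value/></property></configuration>.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical service: every 600 seconds, 20% chance of a kill on 1 to 2 nodes.
print(to_site_xml(scheduled_properties("hbase-master", 600, kill=0.2,
                                       min_nodes=1, max_nodes=2)))
```

The validation only mirrors the ranges stated in the list above. How Chaos Monkey itself treats out-of-range values is not documented here.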

**Cluster information collector**

>By default, Chaos Monkey retrieves cluster information from Coopr.

>To get cluster information from Coopr, the following configurations need to be set:

>* cluster.info.collector.coopr.clusterId

>* cluster.info.collector.coopr.tenantId

>* cluster.info.collector.coopr.server.uri

>
>To get cluster information from other sources, include a plugin that implements ClusterInfoCollector and set the
>following configuration:

>* cluster.info.collector.class - classpath of the implementation of ClusterInfoCollector
>
>Additional properties can be passed to the ClusterInfoCollector implementation. Setting the property
>cluster.info.collector.{propertyName} in the configuration makes {propertyName} available in the properties map
>passed in via the initialize method.
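
The prefix-to-property mapping described above can be pictured with a short sketch. It only illustrates the naming rule and is not the project's code. The class name `com.example.MyCollector` and the property `inventory.url` are hypothetical, and whether the `class` entry itself also ends up in the map is an assumption.

```python
PREFIX = "cluster.info.collector."


def collector_properties(config):
    """Return entries named cluster.info.collector.{propertyName}, keyed by {propertyName}."""
    return {
        key[len(PREFIX):]: value
        for key, value in config.items()
        if key.startswith(PREFIX)
    }


config = {
    # Hypothetical plugin class and custom property.
    "cluster.info.collector.class": "com.example.MyCollector",
    "cluster.info.collector.inventory.url": "http://inventory.example.com",
    # Unrelated service property: not passed to the collector.
    "hbase-master.interval": "600",
}
print(collector_properties(config))
```

Under these assumptions a custom collector would look up `inventory.url` in the map it receives in its initialize method.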

**SSH configurations**

>* username - Username of the SSH profile (if different from the system user)

>* keyPassphrase - Passphrase for the private key, if applicable

>* privateKey - Path to the private key (default locations are checked unless specified)

## HTTP endpoints
The HTTP server is hosted on port 11020, with the following endpoints:

>**POST /v1/services/{service}/{action}**

>{action} includes stop, kill, terminate, start, restart, and rolling-restart

>The action, by default, will be performed on all nodes configured with the service. To specify affected nodes, include
>one of the following request bodies:
>```
>{
> nodes:[,...]
>}
>```
>```
>{
> percentage:
>}
>```
>```
>{
> count:
>}
>```
>In addition to the above request bodies, rolling-restart can also be configured with:
>```
>{
> restartTime:
> delay:
>}
>```
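
As an illustration of the POST endpoint above, the following sketch builds a request with Python's standard library. The host `localhost`, the service name `hbase-master`, the helper name, the JSON value types (an integer for `count`) and the `Content-Type` header are all assumptions, since only the port, the path and the body keys are given here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:11020"  # assumption: server runs on the local host


def build_disruption_request(service, action, body=None, base_url=BASE_URL):
    """Build (but do not send) a POST to /v1/services/{service}/{action}."""
    url = f"{base_url}/v1/services/{service}/{action}"
    data = json.dumps(body).encode("utf-8") if body is not None else None
    # Assumption: the server accepts a JSON body labelled as application/json.
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method="POST")


# Restart a hypothetical service on two nodes.
request = build_disruption_request("hbase-master", "restart", {"count": 2})
print(request.get_method(), request.full_url, request.data)

# To send it against a running Chaos Monkey server:
#     with urllib.request.urlopen(request) as response:
#         print(response.status)
```

Omitting `body` produces a request with no payload, which according to the description above applies the action to all nodes configured with the service.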

>**GET /v1/nodes/{ip}/status**

>Get the status of all configured services on a given address.

>**GET /v1/status**

>Get the status of all configured services on every node of the cluster.
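
The two status endpoints can be queried in the same way. This is a minimal sketch that assumes the server is reachable on `localhost`; the function names are hypothetical. The response format is not documented here, so the body is returned as raw text rather than parsed.

```python
import urllib.request


def status_url(ip=None, base_url="http://localhost:11020"):
    """Return the per-node status URL when ip is given, else the cluster-wide one."""
    if ip is None:
        return f"{base_url}/v1/status"
    return f"{base_url}/v1/nodes/{ip}/status"


def fetch_status(ip=None, base_url="http://localhost:11020", timeout=10):
    """GET the status endpoint and return the response body as text."""
    with urllib.request.urlopen(status_url(ip, base_url), timeout=timeout) as response:
        return response.read().decode("utf-8")


print(status_url())            # cluster-wide status URL
print(status_url("10.0.0.5"))  # status URL for one (hypothetical) node address
```

Calling `fetch_status()` requires a running Chaos Monkey server, so the example only prints the URLs it would request.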