https://github.com/platformbuilds/otel-fintrans-simulator

This simulator generates realistic OpenTelemetry metrics, logs, and traces for a financial transaction processing system.

Host: GitHub
URL: https://github.com/platformbuilds/otel-fintrans-simulator
Owner: platformbuilds
License: apache-2.0
Created: 2025-11-29T06:22:07.000Z (8 months ago)
Default Branch: main
Last Pushed: 2025-12-27T02:08:24.000Z (7 months ago)
Last Synced: 2025-12-28T18:04:35.047Z (7 months ago)
Language: Go
Size: 247 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE

README

# OpenTelemetry Financial Transaction Simulator

## Purpose

This simulator generates realistic OpenTelemetry metrics, logs, and traces for a financial transaction processing system. It is used for:

- **Local development**: Testing Mirador Core's correlation and RCA engines with realistic telemetry data
- **Demo scenarios**: Showcasing platform capabilities with domain-specific observability patterns
- **Load testing**: Generating controlled telemetry volumes for performance testing

## Overview

The simulator is fully configuration-driven:

- Telemetry naming and outputs are configured through `simulator-config.yaml`.
- Failure scenarios (e.g., bursty failures) are configurable in `simulator-config.yaml`.
- Telemetry outputs can be OTLP, stdout, or both — controlled by `telemetry.outputs` in the YAML.

The simulator is primarily intended for local development, demo scenarios, and load testing.

## Usage

### Running Locally

```bash
# Build the simulator
go build -o bin/otel-fintrans-simulator cmd/otel-fintrans-simulator/main.go

# Run with default settings (sends to localhost:4317)
./bin/otel-fintrans-simulator

# Run with custom OTLP endpoint (configured via simulator-config.yaml)
# Edit `simulator-config.yaml` and set `telemetry.endpoint: "http://my-collector:4317"` and optionally `telemetry.insecure: true`.
# Then run normally:
./bin/otel-fintrans-simulator

# Run with specific transaction rate
TRANSACTION_RATE=100 ./bin/otel-fintrans-simulator

# Use a deterministic RNG seed for reproducible runs
./bin/otel-fintrans-simulator --rand-seed 12345

# Print simulator logs to stdout instead of no-op logging
./bin/otel-fintrans-simulator --log-output stdout
```

Helper scripts
--------------
Small helper scripts under `scripts/` make local runs and scenario testing easier. They are convenience wrappers that build the simulator binary if it is missing and then run the requested scenario(s).

Make them executable first (one-time):

```bash
chmod +x ./scripts/*.sh
```

Common helper scripts

- `./scripts/build_and_run.sh [args...]`: build (if needed) and run the simulator binary, passing through any arguments you supply. For example:

```bash
./scripts/build_and_run.sh --config simulator-config.yaml --log-output stdout --signal-time-interval=5s
```

- `./scripts/run_examples.sh list`: list all example scenario YAML files shipped in `examples/scenarios`.
- `./scripts/run_examples.sh run <scenario>`: run a particular scenario (delegates to `examples/run_scenario.sh`). Example:

```bash
./scripts/run_examples.sh run cassandra_disk_pressure
```

- `./scripts/run_examples.sh run-all`: run all example scenarios sequentially with lightweight defaults (short run lengths and reduced transaction volumes). Handy for smoke-testing.

- `./scripts/gen_varied_scenarios.sh`: generate short/long/ramp variants for every scenario and write them to `examples/generated/`, so you can test variant behaviours without editing the original files.

Example: generate variants and run one

```bash
./scripts/gen_varied_scenarios.sh
./scripts/run_examples.sh run examples/generated/cassandra_disk_pressure.short.yaml
```

### Metric export interval

You can control how often the simulator collects and exports metrics to the configured exporters (OTLP/stdout) with the `--signal-time-interval` flag. The value is a Go duration string (for example `15s`, `30s`, `1m`). The default is `15s`.

Examples:

```bash
# default (15s)
./bin/otel-fintrans-simulator --signal-time-interval=15s

# set to 30 seconds
./bin/otel-fintrans-simulator --signal-time-interval=30s

# set to 1 minute
./bin/otel-fintrans-simulator --signal-time-interval=1m

# using `go run` with a custom interval
go run . --signal-time-interval=15s
```
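Because the flag takes a Go duration string, the accepted syntax is that of the standard library's `time.ParseDuration`. The following is a minimal sketch of parsing such a value with a fallback default. The helper name `parseInterval` and the rejection of non-positive values are assumptions made for illustration, not the simulator's actual code.

```go
package main

import (
	"fmt"
	"time"
)

// parseInterval is a hypothetical helper. It parses a Go duration string
// such as "15s" or "1m" and returns def when the input is empty.
func parseInterval(s string, def time.Duration) (time.Duration, error) {
	if s == "" {
		return def, nil
	}
	d, err := time.ParseDuration(s)
	if err != nil {
		return 0, fmt.Errorf("invalid interval %q: %w", s, err)
	}
	// Assumption: zero or negative intervals are treated as errors.
	if d <= 0 {
		return 0, fmt.Errorf("interval must be positive, got %s", d)
	}
	return d, nil
}

func main() {
	for _, s := range []string{"", "15s", "30s", "1m", "soon"} {
		d, err := parseInterval(s, 15*time.Second)
		fmt.Printf("%q -> %v (err: %v)\n", s, d, err)
	}
}
```

Values without a unit (for example `15`) are rejected by `time.ParseDuration`, so always include a suffix such as `s` or `m`.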

For testing dense, continuous time series (recommended when you want good rate() and histogram results):

```bash
# Example: 300 transactions spread over 5 minutes with 10s data and export intervals
./bin/otel-fintrans-simulator \
  --transactions=300 \
  --time-window=5m \
  --data-interval=10s \
  --signal-time-interval=10s \
  --concurrency=10 \
  --failure-mode=mixed \
  --failure-rate=0.2 \
  --config=simulator-config.yaml \
  --log-output=stdout
```

This produces frequent, evenly spaced metric points for 5 minutes, so PromQL functions such as `rate(...[1m])` and `histogram_quantile(...)` have dense data to operate on.

Note: extremely short intervals may increase CPU/network load; pick an interval appropriate for your testing scenario.
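The density argument is simple arithmetic, sketched below for the example flags above. The helper `intervalsIn` is hypothetical, and the even spread of transactions across ticks is an assumption about how the run is paced rather than a documented guarantee.

```go
package main

import (
	"fmt"
	"time"
)

// intervalsIn returns how many whole steps fit into total.
func intervalsIn(total, step time.Duration) int {
	if step <= 0 {
		return 0
	}
	return int(total / step)
}

func main() {
	const transactions = 300

	// --time-window=5m with --data-interval=10s
	ticks := intervalsIn(5*time.Minute, 10*time.Second)
	fmt.Println("data ticks in the run:", ticks)
	fmt.Println("transactions per tick (even spread assumed):", transactions/ticks)

	// Points available to a PromQL [1m] range at two export intervals.
	fmt.Println("samples per 1m window at 10s:", intervalsIn(time.Minute, 10*time.Second))
	fmt.Println("samples per 1m window at 15s:", intervalsIn(time.Minute, 15*time.Second))
}
```

With a 10s export interval a `[1m]` range holds about six points instead of four at the 15s default, which gives `rate()` more samples to work with.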

### Configuration

Telemetry settings are read from `simulator-config.yaml`; a few run parameters can also be set through environment variables.

Telemetry endpoint & protocol
-----------------------------
- `telemetry.endpoint`: OTLP collector endpoint (default: `localhost:4317` when not set in config). The simulator supports both gRPC (default port 4317) and HTTP/OTLP (default port 4318).
- `telemetry.insecure`: when `true`, use plaintext (no TLS) for the selected protocol (default: `true`).
- `telemetry.skip_tls_verify`: when using TLS, set to `true` to skip certificate verification (`InsecureSkipVerify`). Default: `false`.

Validation & helpful warnings
-----------------------------
The simulator performs lightweight validation of your `telemetry` settings at startup and logs warnings for inconsistent combinations. Examples:

- `telemetry.endpoint` uses `http://` but `telemetry.insecure=false` — HTTP is plaintext; either set `telemetry.insecure: true` or use `https://` for TLS.
- `telemetry.endpoint` uses `https://` but `telemetry.insecure=true` — that's inconsistent; either set `telemetry.insecure: false` to use TLS or change the endpoint scheme to `http://`.
- `telemetry.skip_tls_verify` is ignored when `telemetry.insecure` is `true` (plaintext).
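The three checks above can be sketched as a small function that returns warnings instead of failing. The `TelemetryConfig` struct, its field names, and the warning texts are illustrative assumptions; the simulator's real types and messages may differ.

```go
package main

import (
	"fmt"
	"strings"
)

// TelemetryConfig mirrors the three settings described above.
// The struct and field names are hypothetical.
type TelemetryConfig struct {
	Endpoint      string
	Insecure      bool
	SkipTLSVerify bool
}

// validateTelemetry returns one warning per inconsistent combination.
func validateTelemetry(c TelemetryConfig) []string {
	var warnings []string
	switch {
	case strings.HasPrefix(c.Endpoint, "http://") && !c.Insecure:
		warnings = append(warnings, "endpoint uses http:// but insecure=false; set insecure: true or use https://")
	case strings.HasPrefix(c.Endpoint, "https://") && c.Insecure:
		warnings = append(warnings, "endpoint uses https:// but insecure=true; set insecure: false or use http://")
	}
	if c.Insecure && c.SkipTLSVerify {
		warnings = append(warnings, "skip_tls_verify is ignored when insecure=true")
	}
	return warnings
}

func main() {
	cfg := TelemetryConfig{Endpoint: "https://collector:4318", Insecure: true, SkipTLSVerify: true}
	for _, w := range validateTelemetry(cfg) {
		fmt.Println("warning:", w)
	}
}
```

Returning warnings rather than an error matches the "lightweight validation" described above: the run continues, and the log tells you which setting to fix.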

Telemetry outputs
-----------------
Telemetry outputs are configured via the `telemetry.outputs` field in `simulator-config.yaml` (no CLI override).

Supported values (single or combined):
- `otlp` — send traces, metrics and logs to the configured OTLP endpoint (default)
- `stdout` — export traces + metrics to stdout (pretty-printed) and print logs to stdout
- `both` — export to both OTLP and stdout

Examples:

1) Use the default OTLP exporter (no change): keep `telemetry.outputs` empty / absent and the simulator will send telemetry to the OTLP endpoint.

2) Use stdout-only or both: edit `simulator-config.yaml` and add `telemetry.outputs: ["stdout"]` or `telemetry.outputs: ["otlp","stdout"]` for the desired behavior (then start simulator normally).
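One way to turn the `telemetry.outputs` list into exporter switches is sketched below. The function name `resolveOutputs`, the case-insensitive matching, and the rejection of unknown values are assumptions for illustration only.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveOutputs maps the configured output names to two switches.
// An empty or absent list means OTLP only, as described above.
func resolveOutputs(outputs []string) (otlp, stdout bool, err error) {
	if len(outputs) == 0 {
		return true, false, nil
	}
	for _, o := range outputs {
		switch strings.ToLower(strings.TrimSpace(o)) {
		case "otlp":
			otlp = true
		case "stdout":
			stdout = true
		case "both":
			otlp, stdout = true, true
		default:
			// Assumption: unknown values are reported rather than ignored.
			return false, false, fmt.Errorf("unknown telemetry output %q", o)
		}
	}
	return otlp, stdout, nil
}

func main() {
	for _, in := range [][]string{nil, {"stdout"}, {"otlp", "stdout"}, {"both"}} {
		otlp, stdout, err := resolveOutputs(in)
		fmt.Printf("%v -> otlp=%v stdout=%v err=%v\n", in, otlp, stdout, err)
	}
}
```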

Environment variables:

- `OTEL_SERVICE_NAME`: Service name for root traces (default: `api-gateway`)
- `TRANSACTION_RATE`: Transactions per second (default: `10`)
- `ERROR_RATE`: Percentage of failed transactions (default: `5`)
- `SIMULATION_DURATION`: How long to run (default: unlimited)
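Reading these variables with their documented defaults could look like the sketch below. The helpers `stringOr` and `intOr` are hypothetical, and falling back to the default on an unparsable number is an assumption. `SIMULATION_DURATION` is left out because its value format is not specified here.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// lookupFunc has the same signature as os.LookupEnv, so a fake can be
// substituted in tests.
type lookupFunc func(key string) (string, bool)

// stringOr returns the value of key, or def when it is unset or empty.
func stringOr(lookup lookupFunc, key, def string) string {
	if v, ok := lookup(key); ok && v != "" {
		return v
	}
	return def
}

// intOr parses key as an integer and returns def when the variable is
// unset or not a valid integer.
func intOr(lookup lookupFunc, key string, def int) int {
	v, ok := lookup(key)
	if !ok {
		return def
	}
	n, err := strconv.Atoi(v)
	if err != nil {
		return def
	}
	return n
}

func main() {
	fmt.Println("service name:", stringOr(os.LookupEnv, "OTEL_SERVICE_NAME", "api-gateway"))
	fmt.Println("transactions/s:", intOr(os.LookupEnv, "TRANSACTION_RATE", 10))
	fmt.Println("error rate %:", intOr(os.LookupEnv, "ERROR_RATE", 5))
}
```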

Configuration file (YAML)
-------------------------
The simulator now supports an optional YAML configuration file that controls telemetry names and failure scheduling.

By default the example config shipped with the tool is `simulator-config.yaml` (in this folder). Use `--config` to point to a custom config file:

```bash
# Use a custom YAML config
./bin/otel-fintrans-simulator --config ./cmd/otel-fintrans-simulator/simulator-config.yaml
```

The `failure` section supports a `bursty` mode and a list of `bursts` where the failure rate is multiplied for a time window. This enables more realistic, correlated failures.
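The burst mechanism amounts to multiplying the base failure rate while a window is open. A minimal sketch follows; the `Burst` type and `effectiveRate` function are hypothetical, and the handling of overlapping bursts (multipliers compound) and the clamp to 1.0 are assumptions, not documented behaviour.

```go
package main

import (
	"fmt"
	"time"
)

// Burst describes one window in which the failure rate is multiplied.
// Field names follow the start/duration/multiplier keys of the config.
type Burst struct {
	Start      time.Duration
	Duration   time.Duration
	Multiplier float64
}

// effectiveRate returns the failure probability at elapsed time t.
func effectiveRate(base float64, bursts []Burst, t time.Duration) float64 {
	rate := base
	for _, b := range bursts {
		// Half-open window: [Start, Start+Duration).
		if t >= b.Start && t < b.Start+b.Duration {
			rate *= b.Multiplier
		}
	}
	if rate > 1 {
		rate = 1 // a probability cannot exceed 1
	}
	return rate
}

func main() {
	bursts := []Burst{{Start: 10 * time.Second, Duration: 20 * time.Second, Multiplier: 4}}
	for _, t := range []time.Duration{0, 15 * time.Second, 45 * time.Second} {
		fmt.Printf("t=%v rate=%.2f\n", t, effectiveRate(0.05, bursts, t))
	}
}
```

Because failures cluster inside the window instead of being spread uniformly, downstream tooling sees a correlated spike rather than background noise.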

Dynamic metric declarations
-------------------------
The simulator can now create extra metrics at startup driven purely by configuration using `telemetry.dynamic_metrics`. This enables teams to add new gauges, counters or histograms without changing code. Example:

```yaml
telemetry:
  dynamic_metrics:
    - name: cassandra_disk_pressure
      type: gauge
      dataType: float
      description: "Synthetic disk pressure metric"

    - name: api_request_latency_seconds
      type: histogram
      dataType: float
      buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0]
```

The simulator will validate the dynamic metric schema and create OTEL instruments at startup. Recording can be configured via scenarios or the simulator will emit sample values for gauge/histogram types.
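As an illustration of what schema validation for such declarations might involve, the sketch below checks a declaration before any instrument is created. The `DynamicMetric` struct mirrors the YAML keys above, but the specific rules (required name, known type, strictly increasing histogram buckets, no buckets on other types) are assumptions; the simulator's actual checks are not documented here.

```go
package main

import (
	"errors"
	"fmt"
)

// DynamicMetric mirrors one entry of telemetry.dynamic_metrics.
type DynamicMetric struct {
	Name        string
	Type        string // "gauge", "counter" or "histogram"
	DataType    string
	Description string
	Buckets     []float64
}

// validate applies a few plausible schema rules (assumed, see lead-in).
func validate(m DynamicMetric) error {
	if m.Name == "" {
		return errors.New("metric name is required")
	}
	switch m.Type {
	case "gauge", "counter":
		if len(m.Buckets) > 0 {
			return fmt.Errorf("%s: buckets only apply to histograms", m.Name)
		}
	case "histogram":
		for i := 1; i < len(m.Buckets); i++ {
			if m.Buckets[i] <= m.Buckets[i-1] {
				return fmt.Errorf("%s: buckets must be strictly increasing", m.Name)
			}
		}
	default:
		return fmt.Errorf("%s: unknown type %q", m.Name, m.Type)
	}
	return nil
}

func main() {
	metrics := []DynamicMetric{
		{Name: "cassandra_disk_pressure", Type: "gauge", DataType: "float"},
		{Name: "api_request_latency_seconds", Type: "histogram", DataType: "float",
			Buckets: []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0}},
	}
	for _, m := range metrics {
		fmt.Println(m.Name, "->", validate(m))
	}
}
```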

Fully dynamic (all built-in metrics)
----------------------------------
The simulator now registers all built-in instruments via the dynamic MetricRegistry at startup. That means:

- You can override any of the default metric names using `telemetry.metric_names` in the YAML; the registry will create instruments using the effective names at startup.
- You can add entirely new metrics via `telemetry.dynamic_metrics` and the simulator will create and expose those instruments at startup without any code changes.
- Runtime recording prefers registry-backed handles so the simulator supports a fully dynamic telemetry surface. If a metric is declared in `dynamic_metrics` it will be available to scenarios and background generators.

This enables teams to add or rename KPIs and instrumentation without modifying the simulator binary — edit the YAML and restart.
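The "effective name" idea can be sketched as a map overlay: start from built-in defaults and apply `telemetry.metric_names` on top. Everything in this example is hypothetical, including the function `effectiveNames`, the keys, and the default names; overrides for unknown keys are ignored here purely as an illustrative choice.

```go
package main

import "fmt"

// effectiveNames overlays configured names on top of built-in defaults.
func effectiveNames(defaults, overrides map[string]string) map[string]string {
	out := make(map[string]string, len(defaults))
	for key, name := range defaults {
		out[key] = name
	}
	for key, name := range overrides {
		// Assumption: only known keys with a non-empty value are applied.
		if _, known := defaults[key]; known && name != "" {
			out[key] = name
		}
	}
	return out
}

func main() {
	// Hypothetical defaults and override, for illustration only.
	defaults := map[string]string{
		"transactions": "transactions_total",
		"db_latency":   "db_latency_seconds",
	}
	overrides := map[string]string{"transactions": "fintrans_transactions_total"}

	names := effectiveNames(defaults, overrides)
	fmt.Println(names["transactions"], names["db_latency"])
}
```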

Configuration example
---------------------
The bundled `simulator-config.yaml` (in this folder) contains a compact example which demonstrates:

- Overriding `service_names` used in spans/attributes
- Custom `metric_names` for all instrumented metrics
- A `failure` section that sets a base `rate`, chooses a mode (`bursty` recommended) and one or more `bursts` with `start`, `duration`, and `multiplier` values

Behavior notes
--------------
- If `--config` is not provided or the `failure` section is absent, the simulator falls back to the CLI flags `--failure-rate` and `--failure-mode` (original behaviour).
- If the YAML `failure.seed` is set, the simulator seeds randomness for deterministic runs, which is useful for reproducible demos/tests.
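The reproducibility that a fixed seed provides can be demonstrated with the standard `math/rand` package. This sketch assumes failures are decided by comparing a random draw against the failure rate; the function `failureSequence` is hypothetical and not taken from the simulator.

```go
package main

import (
	"fmt"
	"math/rand"
)

// failureSequence draws n pass/fail decisions at the given failure rate
// from a generator seeded with seed. true means the transaction fails.
func failureSequence(seed int64, rate float64, n int) []bool {
	rng := rand.New(rand.NewSource(seed))
	out := make([]bool, n)
	for i := range out {
		out[i] = rng.Float64() < rate
	}
	return out
}

func main() {
	a := failureSequence(12345, 0.2, 10)
	b := failureSequence(12345, 0.2, 10)
	fmt.Println(a)
	fmt.Println(b) // identical to the line above because the seed matches
}
```

Using a dedicated `rand.Rand` rather than the package-level generator keeps the sequence independent of any other code that draws random numbers. Note that concurrent workers can still interleave draws in a different order between runs unless each has its own generator.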

### Failure scenarios (config-driven)

The simulator supports richer, configuration-driven scenario injection. Use the `failure.scenarios` block in `simulator-config.yaml` to declare correlated, multi-metric scenarios. Each scenario contains a `start`, `duration`, optional `labels` (to scope the scenario to specific label values) and a list of `effects`.

An effect targets a named simulator dimension or metric and uses one of the following operations:
- `scale` — multiply the target by the specified value
- `add` — add the specified value
- `set` — set the target to the given value
- `ramp` — increment the target by `step` on each simulation tick
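The four operations can be expressed as a single function applied once per tick. This is a sketch: the `Effect` struct and `apply` function are hypothetical, and whether the simulator applies `scale` and `add` to a baseline or to the previous tick's value is not specified above.

```go
package main

import "fmt"

// Effect mirrors one entry of a scenario's effects list.
type Effect struct {
	Metric string
	Op     string
	Value  float64
	Step   float64 // used by "ramp" only
}

// apply returns the value of the target after applying e once.
func apply(e Effect, current float64) (float64, error) {
	switch e.Op {
	case "scale":
		return current * e.Value, nil
	case "add":
		return current + e.Value, nil
	case "set":
		return e.Value, nil
	case "ramp":
		return current + e.Step, nil
	default:
		return current, fmt.Errorf("unknown op %q for metric %q", e.Op, e.Metric)
	}
}

func main() {
	latency := 20.0
	scaled, _ := apply(Effect{Metric: "db_latency", Op: "scale", Value: 5}, latency)
	fmt.Println("scaled latency:", scaled)

	// A ramp grows by Step on every tick.
	pressure := 0.0
	for tick := 0; tick < 3; tick++ {
		pressure, _ = apply(Effect{Metric: "cassandra_disk_pressure", Op: "ramp", Step: 0.1}, pressure)
	}
	fmt.Printf("pressure after 3 ticks: %.1f\n", pressure)
}
```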

Example (see `simulator-config.yaml` in repo):

```yaml
failure:
  scenarios:
    - name: "db_slow_cascade"
      start: "5s"
      duration: "60s"
      labels:
        OrgId: ["bank_01", "bank_02"]
      effects:
        - metric: "db_latency"
          op: "scale"
          value: 5.0
        - metric: "jvm_gc"
          op: "scale"
          value: 3.0
        - metric: "transaction_failures"
          op: "scale"
          value: 4.0

    - name: "bank03_outage"
      start: "20s"
      duration: "40s"
      labels:
        OrgId: ["bank_03"]
      effects:
        - metric: "kafka_controller_UnderReplicatedPartitions"
          op: "add"
          value: 2
        - metric: "transaction_failures"
          op: "scale"
          value: 8.0
```

When a scenario is active the simulator applies its effects to the runtime state during each background tick. You can mix bursts (simple failure-rate multipliers) with scenario windows for rich, realistic fault patterns.
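Deciding whether a scenario applies on a given tick comes down to two checks: is the elapsed time inside the window, and do the labels match. The sketch below uses the `db_slow_cascade` values from the example. The `Scenario` type and both methods are hypothetical, and treating a scenario with no `labels` as matching everything is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// Scenario holds the scheduling and scoping fields of one scenario.
type Scenario struct {
	Name     string
	Start    time.Duration
	Duration time.Duration
	Labels   map[string][]string
}

// activeAt reports whether t falls inside [Start, Start+Duration).
func (s Scenario) activeAt(t time.Duration) bool {
	return t >= s.Start && t < s.Start+s.Duration
}

// matches reports whether every scoped label key has an allowed value.
func (s Scenario) matches(labels map[string]string) bool {
	for key, allowed := range s.Labels {
		found := false
		for _, v := range allowed {
			if labels[key] == v {
				found = true
				break
			}
		}
		if !found {
			return false
		}
	}
	return true
}

func main() {
	s := Scenario{
		Name:     "db_slow_cascade",
		Start:    5 * time.Second,
		Duration: 60 * time.Second,
		Labels:   map[string][]string{"OrgId": {"bank_01", "bank_02"}},
	}
	for _, t := range []time.Duration{0, 30 * time.Second, 70 * time.Second} {
		fmt.Printf("t=%v active=%v\n", t, s.activeAt(t))
	}
	fmt.Println("bank_01:", s.matches(map[string]string{"OrgId": "bank_01"}))
	fmt.Println("bank_03:", s.matches(map[string]string{"OrgId": "bank_03"}))
}
```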

Hardware-fault scenarios
------------------------
In addition to service- and KPI-focused scenarios, the simulator now supports hardware/infra-fault style effects. These simulate problems such as disk failures impacting Kafka or a bad memory module impacting in-memory datastores (KeyDB/valkey). Example metric names you can use in scenario `effects` include:
- `kafka_disk_failure` — drives increased Kafka produce/consume errors and ISR noise
- `keydb_memory_fault` / `valkey_bad_memory` — drives KeyDB/valkey operation failures and increases redis memory/error signals

Use these effects to model outages that originate in underlying infrastructure (hardware, nodes, network) rather than just service deployments.

Network-fault scenarios
-----------------------
We also support network-specific scenarios to simulate packet drops and network-induced latency — useful when failures originate from unreliable network interfaces, congested links, or router problems. Typical metric names for scenario effects:
- `network_latency` / `node_network_latency_ms` — scales up simulated network latency (affects produce/consume and API gateway processing)
- `network_packet_drop` / `node_network_packet_drops_total` — increases packet drop counts and causes higher messaging errors

When these scenarios are active the simulator increases network latency on affected nodes and emits packet drop counters. That also increases Kafka/consumer errors and may cascade into higher transaction failures.

Scenario examples — copy/paste ready
-----------------------------------
Below are practical, ready-to-use scenario YAML snippets you can copy into `failure.scenarios` in your `simulator-config.yaml`. These show how to simulate common outage classes — service deployment problems, hardware failures, memory faults, and network problems.

1) Database slowdown / deployment outage
```yaml
- name: "db_slow_cascade"
  start: "5s"
  duration: "60s"
  labels:
    OrgId: ["bank_01", "bank_02"]
  effects:
    - metric: "db_latency"
      op: "scale"
      value: 5.0
    - metric: "transaction_failures"
      op: "scale"
      value: 4.0
```