https://github.com/hubzero/hzmetrics

HUBzero metrics pipeline — Python rewrite of the legacy PHP/Perl/Bash pipeline

# Hubzero Metrics Pipeline


Apache logs → MariaDB analytics, in one Python file.


Python 3.10+ | License: MIT | Status: beta

---

`hzmetrics.py` is the analytics pipeline for a HUBzero-based science
gateway. It ingests Apache access logs and CMS authentication logs,
enriches them (reverse DNS, domain classification, GeoIP, session
coalescing), and produces monthly summary statistics in a MariaDB
metrics database. Those statistics drive the hub's usage reporting
pages and grant reporting.
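
As a rough illustration of the ingest step, the sketch below parses one access-log line into typed fields. It assumes the stock Apache "combined" log format; the hub's actual `LogFormat` and the parser inside `hzmetrics.py` are not shown in this README, and the names `COMBINED` and `parse_line` are hypothetical.

```python
import re
from datetime import datetime

# Assumption: the stock Apache "combined" LogFormat:
#   %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return a dict for one access-log line, or None if it does not match."""
    m = COMBINED.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    # Apache timestamps look like 10/Oct/2023:13:55:36 +0000
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    # Apache logs "-" when no response body was sent
    rec["size"] = 0 if rec["size"] == "-" else int(rec["size"])
    return rec


sample = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
          '"GET /tools/demo HTTP/1.1" 200 2326 "-" "Mozilla/5.0"')
record = parse_line(sample)
print(record["host"], record["status"], record["time"].isoformat())
```

This simple pattern does not handle escaped quotes inside the request or user-agent fields, and the enrichment steps (reverse DNS, GeoIP, session coalescing) would run on the parsed records afterwards.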

One Python file (~8000 lines) replaces the decade-plus accumulation
of PHP, Perl, and Bash scripts that previously lived at
`/opt/hubzero/bin/metrics/`. The legacy reference implementation is
preserved verbatim under [`tests/legacy/`](tests/legacy/) and is the
bug-for-bug parity target the A/B test harness compares against.

## Quickstart

```sh
# 1. Deps + /opt tree + scripts (root; idempotent).
sudo make install

# 2. Drop the unified per-tenant config in place (DB creds + DNS settings).
sudo install -o apache -g apache -m 0600 hzmetrics.conf \
    /opt/hubzero/metrics/conf/hzmetrics.conf

# 3. Create the metrics DB, run baseline DDL, apply migrations.
sudo -u apache python3 /opt/hubzero/metrics/bin/hzmetrics.py init

# 4. Confirm everything is healthy.
sudo -u apache python3 /opt/hubzero/metrics/bin/hzmetrics.py doctor

# 5. Register the cron line.
sudo -u apache crontab /opt/hubzero/metrics/conf/hzmetrics.cron.apache.sample
```

`make install`, `init`, and `doctor` are idempotent. The same `init`
machinery also runs automatically on the first cron tick when invoked
as `apache` / `www-data`, so if you skip step 3 the next tick will
catch up — see
[`docs/architecture.md → Self-bootstrap`](docs/architecture.md#self-bootstrap).
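
One common way to make an `init`-style step idempotent is to record each applied migration and skip it on later runs. The sketch below shows that pattern only; it is not the project's implementation. It uses the standard-library `sqlite3` module as a stand-in for MariaDB so that it runs anywhere, and the `MIGRATIONS` list, the `schema_migrations` table, and `apply_migrations` are all hypothetical names.

```python
import sqlite3

# Hypothetical migrations; the project's real DDL is not shown in this README.
MIGRATIONS = [
    (1, "CREATE TABLE IF NOT EXISTS hits (id INTEGER PRIMARY KEY, host TEXT)"),
    (2, "CREATE INDEX IF NOT EXISTS hits_host ON hits (host)"),
]


def apply_migrations(conn):
    """Apply each migration at most once; return the versions applied now."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS schema_migrations (version INTEGER PRIMARY KEY)"
    )
    done = {row[0] for row in conn.execute("SELECT version FROM schema_migrations")}
    applied = []
    for version, ddl in MIGRATIONS:
        if version in done:
            continue  # already applied on an earlier run
        conn.execute(ddl)
        conn.execute("INSERT INTO schema_migrations (version) VALUES (?)", (version,))
        applied.append(version)
    conn.commit()
    return applied


conn = sqlite3.connect(":memory:")  # stand-in for the MariaDB metrics database
first = apply_migrations(conn)      # applies every pending migration
second = apply_migrations(conn)     # finds nothing left to do
print(first, second)
```

Because the second call is a no-op, running the step both by hand and again from the first cron tick would be harmless under this scheme.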

The cron line is one entry, every five minutes:

```
*/5 * * * * python3 /opt/hubzero/metrics/bin/hzmetrics.py tick
```

`tick` refreshes the whoisonline map on every invocation; at `:30` past
each hour it also opportunistically runs the metrics pipeline under a
PID lock. The pipeline is a three-mode state machine (`normal`,
`catchup`, `rebuild`), so even a multi-year backlog drains without
operator intervention.
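
The scheduling described above can be sketched as a small dispatcher: do the cheap refresh on every call, and start the heavy job only on the `:30` tick when no other run holds the lock. This is a minimal sketch under stated assumptions, not the code in `hzmetrics.py`: it uses a non-blocking `fcntl.flock` on a PID file (Unix only), whereas the real lock mechanism is not documented here, and `pipeline_due`, `run_locked`, and the `tick` signature are hypothetical.

```python
import fcntl
import os
import tempfile
from datetime import datetime


def pipeline_due(now):
    """True for the five-minute tick that lands on :30 past the hour."""
    return now.minute == 30


def run_locked(lock_path, job):
    """Run job() only if nobody else holds the lock; return whether it ran."""
    with open(lock_path, "a+") as fh:
        try:
            fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return False  # another run is still in progress
        fh.seek(0)
        fh.truncate()
        fh.write(str(os.getpid()))  # record the owner for operators
        fh.flush()
        job()
        return True  # the lock is released when fh is closed


def tick(now, lock_path, refresh, pipeline):
    """One cron invocation: always refresh, sometimes run the pipeline."""
    refresh()
    if pipeline_due(now):
        return run_locked(lock_path, pipeline)
    return False


calls = []
lock = os.path.join(tempfile.mkdtemp(), "hzmetrics.pid")
ran_a = tick(datetime(2024, 1, 1, 9, 25), lock,
             lambda: calls.append("refresh"), lambda: calls.append("pipeline"))
ran_b = tick(datetime(2024, 1, 1, 9, 30), lock,
             lambda: calls.append("refresh"), lambda: calls.append("pipeline"))
print(calls, ran_a, ran_b)
```

With a non-blocking lock, a tick that finds a long catch-up run still in progress simply skips the pipeline and tries again an hour later, which is one way a backlog could drain without runs piling up.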

For everything else, see `hzmetrics.py --help` and the
[full documentation](https://hubzero.github.io/hzmetrics/).

## Source layout

```
.
├── hzmetrics.py     the entire pipeline
├── Makefile         install / uninstall / test / lint
├── conf/            templates: hzmetrics.conf.sample, cron
├── docs/            plain-markdown documentation
├── gh-pages/        static-site templates + builder
└── tests/
    ├── legacy/      pre-rewrite PHP/Perl/Bash baseline
    └── ab/          A/B + golden + defensive harness
                     (44 ports; see docs/testing.md)
```

## Documentation

Start at [`docs/README.md`](docs/README.md) (or the
[rendered site](https://hubzero.github.io/hzmetrics/)). Most-touched
operational pages:

- [`docs/deployment.md`](docs/deployment.md): install, cron,
  logrotate, hzmetrics.conf.
- [`docs/operations.md`](docs/operations.md): runbook covering catch-up,
  stuck lock, bot inflation, DNS issues, crash recovery,
  ANALYZE TABLE, etc.
- [`docs/architecture.md`](docs/architecture.md): pipeline phases,
  tables, scheduling, the catchup state machine, self-bootstrap.
- [`docs/testing.md`](docs/testing.md): A/B + golden + defensive
  test modes.

## Acknowledgments

The HUBzero metrics subsystem was originally written in Perl by
Swaroop Shivarajapura and later ported to PHP by Nicholas J.
Kisseberth. J.M. Sperhac (SDSC), among others, has provided long-term
stewardship of the codebase. This Python rewrite builds
directly on their work.