raptor-compare [![Build Status][travisimage]][travislink]
============================================================

[travisimage]: https://travis-ci.org/stasm/raptor-compare.png?branch=master
[travislink]: https://travis-ci.org/stasm/raptor-compare

Compare sets of Raptor results and test the observed differences for
statistical significance ([t-test][] with a 0.05 alpha).

[t-test]: https://en.wikipedia.org/wiki/Student%27s_t-test

```
$ raptor-compare my_test.ldjson

music.gaiamobile.org   base: mean  1: mean  1: delta  1: p-value
---------------------  ----------  -------  --------  ----------
navigationLoaded              711      726        14        0.06
navigationInteractive         737      748        12        0.10
visuallyLoaded               1322     1217      -105      * 0.00
contentInteractive           1323     1217      -105      * 0.00
fullyLoaded                  1462     1442       -20        0.14
uss                        19.881   20.370     0.489      * 0.00
pss                        23.468   23.981     0.513      * 0.00
rss                        39.640   40.152     0.512      * 0.00
```

In the example above, the Raptor measurements for the Music app show a
statistically significant improvement for the `visuallyLoaded` and
`contentInteractive` events, as indicated by the asterisks next to the
p-values. At the same time, we can see that the memory footprint has
regressed: the mean `uss` usage is higher than the base measurement and the
difference is statistically significant as well.

For all measurements marked with an asterisk (`*`), it is valid to assume that
the means are indeed significantly different between the base and the try runs.

The remaining results, e.g. the 20 ms `fullyLoaded` speed-up, are not
statistically significant and might be caused by random instability in the
data. Try increasing the sample size (via Raptor's `--runs` option; see below)
and run Raptor again.

What is a p-value?
------------------

The p-value is a concept from statistical testing which represents our
willingness to be wrong about the data. A low p-value means that there is only
a small risk of making a mistake when we conclude that the means are truly
different and that the observed difference is not due to poor sampling or
randomness.

For the data above, a p-value of 0.14 for `fullyLoaded` means that the risk of
being wrong is 14% when we conclude that the 20 ms difference between the means
is due to an actual code change and not to randomness.

Good p-values are below 0.05.
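
To illustrate what the test behind the asterisks does, below is a rough sketch
of a two-sample t-test written in JavaScript. The helper names and the 1.96
cutoff are illustrative assumptions for large sample sizes; _raptor-compare_
performs the actual test and reports exact p-values.

```javascript
// A sketch of a two-sample (Welch's) t-test, assuming `base` and `trial`
// are arrays of per-run measurements. The 1.96 critical value roughly
// corresponds to a 0.05 alpha for large samples (e.g. --runs 30).

function mean(xs) {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

function sampleVariance(xs) {
  const m = mean(xs);
  return xs.reduce((sum, x) => sum + Math.pow(x - m, 2), 0) / (xs.length - 1);
}

// Welch's t statistic for two independent samples.
function tStatistic(base, trial) {
  const standardError = Math.sqrt(
    sampleVariance(base) / base.length +
    sampleVariance(trial) / trial.length);
  return (mean(trial) - mean(base)) / standardError;
}

// Treat the difference in means as significant when |t| exceeds the
// critical value, i.e. roughly when p < 0.05.
function isSignificant(base, trial, critical) {
  return Math.abs(tStatistic(base, trial)) > (critical || 1.96);
}
```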

Installation
------------

```
npm install -g raptor-compare
```

Running Raptor tests
--------------------

(For best results, follow the [Raptor][] guide on MDN.)

Install Raptor with:

```
$ sudo npm install -g @mozilla/raptor
```

Connect your device to the computer, go into your Gaia directory and build Gaia:

```
$ make raptor
```

Then, run the desired perf test:

```
$ raptor test coldlaunch --runs 30 --app music --metrics my_test.ldjson
```

Raptor will print the output to `stdout`. The raw data will be saved in the
`ldjson` file specified in the `--metrics` option. The data is appended, so you
can run multiple tests for different revisions and apps, and `raptor-compare`
will figure out how to handle it. All testing is conducted relative to the
first result set for the given app.
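
For example, a base-versus-try comparison of a single app could look like the
sketch below; the `git checkout` step and the branch name are hypothetical and
stand for however you switch to the revision under test:

```
$ raptor test coldlaunch --runs 30 --app music --metrics my_test.ldjson
$ git checkout my-feature && make raptor
$ raptor test coldlaunch --runs 30 --app music --metrics my_test.ldjson
$ raptor-compare my_test.ldjson
```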

[Raptor]: https://developer.mozilla.org/en-US/Firefox_OS/Automated_testing/Raptor

API
---

You can also use _raptor-compare_ programmatically. It exposes three functions
for working with Raptor data: `read` reads in an LDJSON stream with the raw
metrics data, `parse` aggregates the data into a Map, and `build` creates the
comparison tables with p-values for significance testing.

```javascript
// Needed for Node.js 0.10 and 0.12.
require('babel/polyfill');

const fs = require('fs');
const compare = require('raptor-compare');

compare.read(fs.createReadStream(filename))
  .then(compare.parse)
  .then(compare.build)
  .then(tables => tables.forEach(
    table => console.log(table.toString())))
  .catch(console.error);
```