
[Screenshot of the simulation]

Try it yourself!


Why you shouldn't peek at significance in A/B tests

## What is Early Stopping?

Early stopping occurs when you repeatedly check your A/B test results and stop the experiment as soon as you see a "statistically significant" result. This is a form of [p-hacking](https://en.wikipedia.org/wiki/Data_dredging#Optional_stopping) that dramatically increases false positive rates.

For example, with a significance level of 0.05 (5%), you expect a 5% false positive rate when checking just once at the end. But every extra look is another chance for random noise to cross the significance threshold: if you check results daily during a 30-day test, you could see false positive rates as high as 30%, meaning many of your "winning" tests are actually detecting nothing at all!
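
The inflation is easy to reproduce. Here's a minimal, self-contained sketch in plain JavaScript (illustrative only, not the simulator's own code) using the same setup as the simulation described below: identical variants at a 10% conversion rate, 250 visitors per day per variant, over 30 days. The helper names (`normalCdf`, `pValue`, `binomial`) are introduced here just for the example.

```js
// Monte Carlo sketch of how daily peeking inflates false positives.
// Both variants are identical, so every "significant" result is a false positive.

// Standard normal CDF via the Abramowitz & Stegun erf approximation (7.1.26).
function normalCdf(z) {
  const x = Math.abs(z) / Math.SQRT2;
  const t = 1 / (1 + 0.3275911 * x);
  const erf = 1 - t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429)))) * Math.exp(-x * x);
  return z >= 0 ? (1 + erf) / 2 : (1 - erf) / 2;
}

// Two-sided p-value from a two-proportion z-test with pooled standard error.
function pValue(convA, nA, convB, nB) {
  const pooled = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  if (se === 0) return 1;
  const z = (convA / nA - convB / nB) / se;
  return 2 * (1 - normalCdf(Math.abs(z)));
}

// Draw the number of conversions among n visitors converting at rate p.
function binomial(n, p) {
  let k = 0;
  for (let i = 0; i < n; i++) if (Math.random() < p) k++;
  return k;
}

const DAYS = 30, VISITORS_PER_DAY = 250, RATE = 0.1, ALPHA = 0.05, RUNS = 2000;
let peekingFalsePositives = 0, fixedFalsePositives = 0;

for (let run = 0; run < RUNS; run++) {
  let convA = 0, convB = 0, peekerFooled = false;
  for (let day = 1; day <= DAYS; day++) {
    convA += binomial(VISITORS_PER_DAY, RATE);
    convB += binomial(VISITORS_PER_DAY, RATE);
    const n = day * VISITORS_PER_DAY;
    // The peeker checks every day and stops at the first "significant" result.
    if (!peekerFooled && pValue(convA, n, convB, n) < ALPHA) {
      peekingFalsePositives++;
      peekerFooled = true;
    }
  }
  // The disciplined tester checks exactly once, at the end.
  const n = DAYS * VISITORS_PER_DAY;
  if (pValue(convA, n, convB, n) < ALPHA) fixedFalsePositives++;
}

console.log(`Check once at the end: ${(100 * fixedFalsePositives / RUNS).toFixed(1)}% false positives`);
console.log(`Check every day:       ${(100 * peekingFalsePositives / RUNS).toFixed(1)}% false positives`);
```

Running this typically reports roughly 5% false positives for the single final check, and several times that for daily peeking.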

## Features

- **Interactive simulation**: See how early stopping affects false positive rates in real time
- **Adjustable parameters**: Test different experiment durations, checking frequencies, and significance levels
- **Visual comparison**: Directly compare the false positive rates of proper testing vs. early stopping

## How it works

This tool simulates A/B tests where both variants have identical performance (a true "null hypothesis" scenario):

1. Each simulation generates two identical variants with:
   - 250 visitors per day per variant
   - 10% conversion rate for both variants
   - No actual difference between variants

2. The tool runs 625 simulated tests with two different stopping criteria:
   - **Left grid**: Tests that only check for significance once, after the full duration
   - **Right grid**: Tests that check at your specified frequency and stop when significance is found

3. Results are color-coded:
   - 🟢: tests correctly showing no significant difference
   - 🟥: tests incorrectly showing a significant difference (false positives)

Statistical significance is determined using a two-proportion Z-test, mathematically equivalent to a chi-squared test for this 2×2 case.
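
As a quick check of that equivalence, the following sketch (again illustrative, not the repository's code) computes both statistics for an arbitrary 2×2 outcome and shows that the squared z-statistic equals the Pearson chi-squared value:

```js
// For a 2×2 conversion table, the squared two-proportion z-statistic
// (with pooled standard error) equals the Pearson chi-squared statistic.

function zStatistic(convA, nA, convB, nB) {
  const pooled = (convA + convB) / (nA + nB);
  const se = Math.sqrt(pooled * (1 - pooled) * (1 / nA + 1 / nB));
  return (convA / nA - convB / nB) / se;
}

function chiSquared(convA, nA, convB, nB) {
  // 2×2 table: rows = variants, columns = [converted, not converted].
  const observed = [[convA, nA - convA], [convB, nB - convB]];
  const rowTotals = [nA, nB];
  const total = nA + nB;
  const colTotals = [convA + convB, total - (convA + convB)];
  let chi2 = 0;
  for (let r = 0; r < 2; r++) {
    for (let c = 0; c < 2; c++) {
      const expected = (rowTotals[r] * colTotals[c]) / total;
      chi2 += (observed[r][c] - expected) ** 2 / expected;
    }
  }
  return chi2;
}

// Arbitrary example counts: both statistics agree (≈ 1.841 here).
const z = zStatistic(780, 7500, 730, 7500);
console.log(z * z);                            // 1.8409…
console.log(chiSquared(780, 7500, 730, 7500)); // 1.8409…
```

Since z² follows a chi-squared distribution with one degree of freedom, the two-sided z-test and the chi-squared test also produce identical p-values.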

## Contributing

Please do! I'd appreciate bug reports, improvements (however minor), suggestions…

The simulator uses vanilla JavaScript, HTML, and CSS. To run locally:

1. Clone the repository: `git clone https://github.com/welpo/stop-early-stopping.git`
2. Navigate to the project directory: `cd stop-early-stopping`
3. Start a local server: `python3 -m http.server`
4. Visit `http://localhost:8000` in your browser

The important files are:

- `index.html`: Basic structure
- `styles.css`: Styles
- `app.js`: Main UI logic
- `simulationWorker.js`: Web worker that runs simulations in the background
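
For orientation, the split between `app.js` and `simulationWorker.js` follows the standard web worker pattern sketched below. The message shapes and function names here (`renderGrids`, `runSimulations`, the parameter fields) are hypothetical, not the repository's actual API:

```js
// app.js (main thread) — hypothetical sketch of the wiring, not actual code.
const worker = new Worker('simulationWorker.js');
// Field names are illustrative; the real message format may differ.
worker.postMessage({ days: 30, checksPerDay: 1, alpha: 0.05 });
worker.onmessage = (event) => {
  renderGrids(event.data); // hypothetical UI update function
};

// simulationWorker.js (worker thread)
self.onmessage = (event) => {
  const results = runSimulations(event.data); // hypothetical: the 625 simulated tests
  self.postMessage(results); // keeps the heavy loop off the UI thread
};
```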

## Need help?

Something not working? Have an idea? Let me know!

- Questions or ideas → [Start a discussion](https://github.com/welpo/stop-early-stopping/discussions)
- Found a bug? → [Report it here](https://github.com/welpo/stop-early-stopping/issues/new?&labels=bug)
- Feature request? → [Let me know](https://github.com/welpo/stop-early-stopping/issues/new?&labels=feature)

## License

This simulator is free software: you can redistribute it and/or modify it under the terms of the [GNU Affero General Public License as published by the Free Software Foundation](./COPYING), either version 3 of the License, or (at your option) any later version.