{"id":48932448,"url":"https://github.com/sdaza/experiment-utils-pd","last_synced_at":"2026-05-03T16:03:54.047Z","repository":{"id":290971626,"uuid":"975987924","full_name":"sdaza/experiment-utils-pd","owner":"sdaza","description":"Generic functions for experiment analysis and design","archived":false,"fork":false,"pushed_at":"2026-04-24T14:52:10.000Z","size":2855,"stargazers_count":3,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-04-24T16:45:29.362Z","etag":null,"topics":["analytics","experiment","experimentation","ipw"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/sdaza.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-05-01T09:15:41.000Z","updated_at":"2026-04-24T14:52:01.000Z","dependencies_parsed_at":"2025-10-23T11:22:10.322Z","dependency_job_id":"452bc569-95a2-4d18-a223-8c51ab5895b6","html_url":"https://github.com/sdaza/experiment-utils-pd","commit_stats":null,"previous_names":["sdaza/experiment-utils-pd"],"tags_count":34,"template":false,"template_full_name":null,"purl":"pkg:github/sdaza/experiment-utils-pd","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sdaza%2Fexperiment-utils-pd","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sdaza%2Fexperiment-utils-pd/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sdaza%
2Fexperiment-utils-pd/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sdaza%2Fexperiment-utils-pd/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/sdaza","download_url":"https://codeload.github.com/sdaza/experiment-utils-pd/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/sdaza%2Fexperiment-utils-pd/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32575121,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-03T06:36:36.687Z","status":"ssl_error","status_checked_at":"2026-05-03T06:36:09.306Z","response_time":103,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["analytics","experiment","experimentation","ipw"],"created_at":"2026-04-17T10:00:26.496Z","updated_at":"2026-05-03T16:03:54.038Z","avatar_url":"https://github.com/sdaza.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"[![ci](https://github.com/sdaza/experiment-utils-pd/actions/workflows/ci.yaml/badge.svg)](https://github.com/sdaza/experiment-utils-pd/actions/workflows/ci.yaml)\n[![PyPI version](https://img.shields.io/pypi/v/experiment-utils-pd.svg)](https://pypi.org/project/experiment-utils-pd/)\n\n# Experiment Utils\n\nA comprehensive Python package for designing, analyzing, and validating experiments with advanced causal inference 
capabilities.\n\n## Features\n\n- **Experiment Analysis**: Estimate treatment effects with multiple adjustment methods (covariate balancing, regression, IV, AIPW)\n- **Multiple Outcome Models**: OLS, logistic, Poisson, negative binomial, and Cox proportional hazards\n- **Doubly Robust Estimation**: Augmented IPW (AIPW) for OLS, logistic, Poisson, and negative binomial models\n- **Survival Analysis**: Cox proportional hazards with IPW and regression adjustment\n- **Covariate Balance**: Check and visualize balance between treatment groups\n- **Marginal Effects**: Average marginal effects for GLMs (probability change, count change)\n- **Overlap Weighting \u0026 Trimming**: Overlap weights (ATO) and propensity score trimming for robust handling of limited common support\n- **Meta-Analysis**: Fixed-effects (IVW) and random-effects (Paule-Mandel + HKSJ) pooling across experiments, with heterogeneity diagnostics (τ², I², Cochran's Q)\n- **Bootstrap Inference**: Robust confidence intervals and p-values via bootstrap resampling\n- **Multiple Comparison Correction**: Family-wise error rate control (Bonferroni, Holm, Sidak) and false discovery rate control (Benjamini-Hochberg)\n- **Effect Visualization**: Cleveland dot plots of treatment effects across experiments, with auto-scaled percentage-point annotations, combined absolute/relative labels, fixed or random-effects pooling, magnitude sorting, and grouping by any experiment column\n- **Overlap Diagnostics**: Mirror density plots of propensity score distributions (`plot_overlap`) with overlap coefficient annotation and group-by splitting\n- **Equivalence Testing (TOST)**: Two One-Sided Tests for equivalence, non-inferiority, and non-superiority following Lakens (2017), with absolute, relative, and Cohen's d bounds, Lakens' four-cell conclusion matrix, and dedicated visualization\n- **Power Analysis**: Calculate statistical power and find optimal sample sizes, including TOST equivalence power\n- **Retrodesign Analysis**: Assess reliability of study designs (Type S/M 
errors)\n- **Random Assignment**: Generate balanced treatment assignments with stratification\n\n## Table of Contents\n\n- [Experiment Utils](#experiment-utils)\n  - [Features](#features)\n  - [Table of Contents](#table-of-contents)\n  - [Installation](#installation)\n    - [From PyPI (Recommended)](#from-pypi-recommended)\n    - [From GitHub (Latest Development Version)](#from-github-latest-development-version)\n  - [Quick Start](#quick-start)\n  - [User Guide](#user-guide)\n    - [Basic Experiment Analysis](#basic-experiment-analysis)\n    - [Covariate Parameters](#covariate-parameters)\n    - [Checking Covariate Balance](#checking-covariate-balance)\n    - [Covariate Adjustment Methods](#covariate-adjustment-methods)\n    - [Outcome Models](#outcome-models)\n    - [Ratio Metrics (Delta Method)](#ratio-metrics-delta-method)\n    - [Survival Analysis (Cox Models)](#survival-analysis-cox-models)\n    - [Bootstrap Inference](#bootstrap-inference)\n    - [Multiple Experiments](#multiple-experiments)\n    - [Categorical Treatment Variables](#categorical-treatment-variables)\n    - [Instrumental Variables (IV)](#instrumental-variables-iv)\n    - [Multiple Comparison Adjustments](#multiple-comparison-adjustments)\n    - [Equivalence Testing (TOST)](#equivalence-testing-tost)\n    - [Combining Effects (Meta-Analysis)](#combining-effects-meta-analysis)\n    - [Visualizing Effects](#visualizing-effects)\n    - [Common Support / Propensity Score Overlap](#common-support--propensity-score-overlap)\n    - [Retrodesign Analysis](#retrodesign-analysis)\n  - [Power Analysis](#power-analysis)\n    - [Calculate Power](#calculate-power)\n    - [Power from Real Data](#power-from-real-data)\n    - [Grid Power Simulation](#grid-power-simulation)\n    - [Find Sample Size](#find-sample-size)\n    - [TOST Equivalence Power](#tost-equivalence-power)\n    - [Simulate Retrodesign](#simulate-retrodesign)\n  - [Utilities](#utilities)\n    - [Balanced Random 
Assignment](#balanced-random-assignment)\n    - [Standalone Balance Checker](#standalone-balance-checker)\n  - [Advanced Topics](#advanced-topics)\n    - [When to Use Different Adjustment Methods](#when-to-use-different-adjustment-methods)\n    - [Non-Collapsibility of Hazard and Odds Ratios](#non-collapsibility-of-hazard-and-odds-ratios)\n    - [Handling Missing Data](#handling-missing-data)\n    - [Best Practices](#best-practices)\n    - [Common Workflows](#common-workflows)\n  - [Contributing](#contributing)\n  - [License](#license)\n  - [Citation](#citation)\n\n## Installation\n\n### From PyPI (Recommended)\n\n```bash\npip install experiment-utils-pd\n```\n\n### From GitHub (Latest Development Version)\n\n```bash\npip install git+https://github.com/sdaza/experiment-utils-pd.git\n```\n\n## Quick Start\n\nAll main classes and standalone functions are available directly from the package:\n\n```python\nfrom experiment_utils import ExperimentAnalyzer, PowerSim\nfrom experiment_utils import balanced_random_assignment, check_covariate_balance\nfrom experiment_utils import plot_effects, plot_equivalence, plot_overlap, plot_power\n```\n\nHere's a complete example analyzing an A/B test with covariate adjustment:\n\n```python\nimport pandas as pd\nimport numpy as np\nfrom experiment_utils import ExperimentAnalyzer\n\n# Create sample experiment data\nnp.random.seed(42)\ndf = pd.DataFrame({\n    \"user_id\": range(1000),\n    \"treatment\": np.random.choice([0, 1], 1000),\n    \"conversion\": np.random.binomial(1, 0.15, 1000),\n    \"revenue\": np.random.normal(50, 20, 1000),\n    \"age\": np.random.normal(35, 10, 1000),\n    \"is_member\": np.random.choice([0, 1], 1000),\n})\n\n# Initialize analyzer\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\", \"revenue\"],\n    balance_covariates=[\"age\", \"is_member\"],  # balance checking\n    adjustment=\"balance\",\n    balance_method=\"ps-logistic\",\n)\n\n# Estimate 
treatment effects\nanalyzer.get_effects()\n\n# View results\nresults = analyzer.results\nprint(results[[\"outcome\", \"absolute_effect\", \"relative_effect\", \n               \"pvalue\", \"stat_significance\"]])\n\n# Balance is automatically calculated when covariates are provided\nbalance = analyzer.balance\nprint(f\"\\nBalance: {balance['balance_flag'].mean():.1%} of covariates balanced\")\n```\n\nOutput:\n```\n       outcome  absolute_effect  relative_effect   pvalue stat_significance\n0   conversion           0.0234           0.1623   0.0456                 1\n1      revenue           2.1450           0.0429   0.1234                 0\n\nBalance: 100.0% of covariates balanced\n```\n\n## User Guide\n\n### Basic Experiment Analysis\n\nAnalyze a simple A/B test without covariate adjustment:\n\n```python\nfrom experiment_utils import ExperimentAnalyzer\n\n# Simple analysis (no covariates)\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\"],\n)\n\nanalyzer.get_effects()\nprint(analyzer.results)\n```\n\n**Key columns in results:**\n- `outcome`: Outcome variable name\n- `absolute_effect`: Treatment effect (treatment - control mean)\n- `relative_effect`: Lift (absolute_effect / control_mean)\n- `standard_error`: Standard error of the effect\n- `pvalue`: P-value for hypothesis test\n- `stat_significance`: 1 if significant at alpha level, 0 otherwise\n- `abs_effect_lower/upper`: Confidence interval bounds (absolute)\n- `rel_effect_lower/upper`: Confidence interval bounds (relative)\n\n### Covariate Parameters\n\nThree covariate parameters control balance checking and regression adjustment. Each can be specified independently and they can overlap freely — any covariate appearing in any list is automatically included in the balance table.\n\n| Parameter | Role | Balance checked? | In regression formula? 
|\n|---|---|---|---|\n| `balance_covariates` | Balance checking only | Yes | No |\n| `regression_covariates` | Regression main effects | Yes | Yes (main effects) |\n| `interaction_covariates` | CUPED / Lin interactions | Yes | Yes (`z_col + treatment:z_col`) |\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"revenue\"],\n    balance_covariates=[\"region\"],           # balance table only\n    regression_covariates=[\"age\", \"tenure\"], # OLS main effects + balance\n    interaction_covariates=[\"pre_revenue\"],  # CUPED variance reduction + balance\n)\n\nanalyzer.get_effects()\n\n# Balance table covers all three lists\nprint(analyzer.balance[[\"covariate\", \"smd\", \"balance_flag\"]])\n```\n\n\u003e `covariates` is still accepted as a deprecated alias for `balance_covariates`.\n\n### Checking Covariate Balance\n\n**Balance is automatically calculated** when you provide any covariates and run `get_effects()`:\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\"],\n    balance_covariates=[\"age\", \"income\", \"region\"],  # Can include categorical\n)\n\nanalyzer.get_effects()\n\n# Balance is automatically available\nbalance = analyzer.balance\nprint(balance[[\"covariate\", \"smd\", \"balance_flag\"]])\nprint(f\"\\nBalanced: {balance['balance_flag'].mean():.1%}\")\n\n# Identify imbalanced covariates\nimbalanced = balance[balance[\"balance_flag\"] == 0]\nif not imbalanced.empty:\n    print(f\"Imbalanced: {imbalanced['covariate'].tolist()}\")\n```\n\n**Check balance independently** (optional, before running `get_effects()` or with custom parameters):\n\n```python\n# Check balance with different threshold\nbalance_strict = analyzer.check_balance(threshold=0.05)\n```\n\n**Balance metrics explained:**\n- `smd`: Standardized Mean Difference (|SMD| \u003c 0.1 indicates good balance)\n- `balance_flag`: 1 if balanced, 0 if imbalanced\n- 
`mean_treated/control`: Group means for the covariate\n\n### Covariate Adjustment Methods\n\nWhen treatment and control groups differ on covariates, adjust for bias:\n\n**Option 1: Propensity Score Weighting (Recommended)**\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\", \"revenue\"],\n    balance_covariates=[\"age\", \"income\", \"is_member\"],\n    adjustment=\"balance\",\n    balance_method=\"ps-logistic\",  # Logistic regression for propensity scores\n    estimand=\"ATT\",  # Average Treatment Effect on Treated\n)\n\nanalyzer.get_effects()\n\n# Check post-adjustment balance\nprint(analyzer.adjusted_balance)\n\n# Retrieve weights for transparency\nweights_df = analyzer.weights\nprint(weights_df.head())\n```\n\n**Available methods:**\n- `ps-logistic`: Propensity score via logistic regression (fast, interpretable)\n- `ps-xgboost`: Propensity score via XGBoost (flexible, non-linear)\n- `entropy`: Entropy balancing (exact moment matching)\n\n**Target estimands:**\n\n- `ATT`: Average Treatment Effect on Treated (most common)\n- `ATE`: Average Treatment Effect (entire population)\n- `ATC`: Average Treatment Effect on Control\n- `ATO`: Average Treatment Effect for the Overlap population (overlap weights — see below)\n\n**Option 2: Regression Adjustment**\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\"],\n    regression_covariates=[\"age\", \"income\"],\n    adjustment=None,  # No weighting, just regression\n)\n\nanalyzer.get_effects()\n```\n\n**Option 3: CUPED / Interaction Adjustment**\n\nAdd pre-experiment metrics as treatment interactions (Lin 2013 estimator). Each covariate is standardized and entered as `z_col + treatment:z_col`. 
This reduces variance without changing the point estimate interpretation:\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"revenue\"],\n    interaction_covariates=[\"pre_revenue\", \"pre_orders\"],\n)\n\nanalyzer.get_effects()\n# adjustment column in results will show \"regression+interactions\"\n```\n\n**Option 4: IPW + Regression (Combined)**\n\nUse both propensity score weighting and regression covariates for extra robustness:\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\", \"revenue\"],\n    balance_covariates=[\"age\", \"income\", \"is_member\"],\n    adjustment=\"balance\",\n    regression_covariates=[\"age\", \"income\"],\n    estimand=\"ATE\",\n)\n\nanalyzer.get_effects()\n```\n\n**Option 5: Doubly Robust / AIPW**\n\nAugmented Inverse Probability Weighting is consistent if either the propensity score model or the outcome model is correctly specified. Available for OLS, logistic, Poisson, and negative binomial models:\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"revenue\"],\n    balance_covariates=[\"age\", \"income\", \"is_member\"],\n    adjustment=\"aipw\",\n    estimand=\"ATE\",\n)\n\nanalyzer.get_effects()\n\n# AIPW results include influence-function based standard errors\nprint(analyzer.results[[\"outcome\", \"absolute_effect\", \"standard_error\", \"pvalue\"]])\n```\n\nAIPW works by fitting separate outcome models for treated and control groups, predicting potential outcomes for all units, and combining them with IPW via the augmented influence function. Standard errors are derived from the influence function, making them robust without requiring bootstrap.\n\n\u003e **Note**: AIPW is not supported for Cox survival models due to the complexity of survival-specific doubly robust methods. 
For Cox models, use IPW + Regression instead.\n\n**Option 6: Overlap Weighting (ATO)**\n\nOverlap weights (Li, Morgan \u0026 Zaslavsky 2018) naturally downweight units with extreme propensity scores — treated units receive weight `(1 - ps)` and control units receive weight `ps`. Units near `ps = 0.5` (the region of maximum overlap) receive the highest weight. No trimming threshold is required.\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"revenue\"],\n    balance_covariates=[\"age\", \"income\"],\n    adjustment=\"balance\",\n    balance_method=\"ps-logistic\",  # or \"ps-xgboost\"\n    estimand=\"ATO\",                # overlap weights\n)\n\nanalyzer.get_effects()\n```\n\n\u003e **Note**: ATO is only supported with `balance_method=\"ps-logistic\"` or `\"ps-xgboost\"`. It is not compatible with `\"entropy\"`.\n\n**Option 7: Propensity Score Trimming**\n\nTrimming drops units with propensity scores outside `[trim_ps_lower, trim_ps_upper]` and recomputes weights on the remaining sample. 
This is useful as a robustness check when overlap is already reasonable but you want to restrict to the region where PS estimation is reliable.\n\n```python\n# Always trim to [0.1, 0.9]\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"revenue\"],\n    balance_covariates=[\"age\", \"income\"],\n    adjustment=\"balance\",\n    trim_ps=True,\n    trim_ps_lower=0.1,  # default\n    trim_ps_upper=0.9,  # default\n)\n\n# Trim only when overlap is good (overlap coefficient \u003e= threshold)\nanalyzer = ExperimentAnalyzer(\n    ...,\n    trim_ps=True,\n    trim_overlap_threshold=0.8,  # skip trimming if overlap \u003c 0.8\n    assess_overlap=True,\n)\n\nanalyzer.get_effects()\n\n# trimmed_units column shows how many units were dropped\nprint(analyzer.results[[\"outcome\", \"absolute_effect\", \"trimmed_units\"]])\n```\n\n**Choosing between overlap weights and trimming:**\n\n| | Overlap weights (`ATO`) | Trimming |\n|---|---|---|\n| Mechanism | Continuously downweights extreme-PS units | Drops units outside threshold |\n| Threshold required | No | Yes (`trim_ps_lower`, `trim_ps_upper`) |\n| Changes `n` | No | Yes |\n| Estimand | ATO (overlap population) | ATT/ATE/ATC on trimmed sample |\n| When overlap is poor | Handles gracefully | May drop many units |\n| Use as robustness check | Yes | Yes |\n\n### Outcome Models\n\nBy default, all outcomes are analyzed with OLS. 
Use `outcome_models` to specify different model types:\n\n**Logistic regression (binary outcomes)**\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"converted\", \"churned\"],\n    outcome_models=\"logistic\",  # Apply to all outcomes\n    balance_covariates=[\"age\", \"tenure\"],\n)\n\nanalyzer.get_effects()\n\n# By default, results report marginal effects (probability change in percentage points)\n# Use compute_marginal_effects=False for odds ratios instead\n```\n\n**Poisson / Negative binomial (count outcomes)**\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"orders\", \"page_views\"],\n    outcome_models=\"poisson\",  # or \"negative_binomial\" for overdispersed counts\n    balance_covariates=[\"age\", \"tenure\"],\n)\n\nanalyzer.get_effects()\n\n# Results report change in expected count (marginal effects) by default\n# Use compute_marginal_effects=False for rate ratios\n```\n\n**Mixed models per outcome**\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"revenue\", \"converted\", \"orders\"],\n    outcome_models={\n        \"revenue\": \"ols\",\n        \"converted\": \"logistic\",\n        \"orders\": [\"poisson\", \"negative_binomial\"],  # Compare both\n    },\n    balance_covariates=[\"age\"],\n)\n\nanalyzer.get_effects()\n\n# Results include model_type column to distinguish\nprint(analyzer.results[[\"outcome\", \"model_type\", \"absolute_effect\", \"pvalue\"]])\n```\n\n**Marginal effects options**\n\n```python\n# Average Marginal Effect (default) - recommended\nanalyzer = ExperimentAnalyzer(..., compute_marginal_effects=\"overall\")\n\n# Marginal Effect at the Mean\nanalyzer = ExperimentAnalyzer(..., compute_marginal_effects=\"mean\")\n\n# Odds ratios / rate ratios instead of marginal effects\nanalyzer = ExperimentAnalyzer(..., compute_marginal_effects=False)\n```\n\n| 
`compute_marginal_effects` | Logistic output | Poisson/NB output |\n|---|---|---|\n| `\"overall\"` (default) | Probability change (pp) | Change in expected count |\n| `\"mean\"` | Probability change at mean | Count change at mean |\n| `False` | Odds ratio | Rate ratio |\n\n### Ratio Metrics (Delta Method)\n\nUse `ratio_outcomes` for metrics where both the numerator and denominator include randomness — for example, *leads per converter* or *revenue per session*. Conditioning on the denominator (e.g., analysing only converters) introduces selection bias, so the correct approach is the **delta method linearization** (Deng et al. 2018):\n\n```\nlinearized_i = numerator_i  −  R_control × denominator_i\nwhere  R_control = mean(numerator_control) / mean(denominator_control)\n```\n\nOLS on `linearized_i` estimates the difference in population-average ratios with correct standard errors. `R_control` is computed separately for each `(treatment, control)` comparison pair, so multi-arm experiments work out of the box.\n\n**Basic usage**\n\n```python\nimport numpy as np\nimport pandas as pd\nfrom experiment_utils import ExperimentAnalyzer\n\nnp.random.seed(42)\nn = 20_000\ntreatment = np.random.choice([\"control\", \"variant_1\", \"variant_2\"], n)\n\n# ~30% of users convert; converters generate ~2 leads on average\nconverters = np.where(\n    treatment == \"variant_2\", np.random.binomial(1, 0.32, n),\n    np.where(treatment == \"variant_1\", np.random.binomial(1, 0.31, n),\n                                       np.random.binomial(1, 0.30, n)),\n)\nleads = np.where(converters == 1, np.random.poisson(2 + 0.1 * (treatment == \"variant_2\"), n), 0)\n\ndf = pd.DataFrame({\"treatment\": treatment, \"converters\": converters, \"leads\": leads})\n\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"converters\", \"leads\"],           # regular outcomes\n    ratio_outcomes={\"leads_per_converter\": (\"leads\", 
\"converters\")},\n)\n\nanalyzer.get_effects()\n\ncols = [\"outcome\", \"treatment_group\", \"control_group\",\n        \"control_value\", \"absolute_effect\", \"standard_error\",\n        \"stat_significance\", \"effect_type\"]\nprint(analyzer.results[cols].to_string())\n```\n\nOutput:\n```\n               outcome treatment_group control_group  control_value  absolute_effect  standard_error  stat_significance      effect_type\n0           converters       variant_1       control       0.301              0.010        0.006                  1   mean_difference\n1                leads       variant_1       control       0.602              0.046        0.017                  1   mean_difference\n2  leads_per_converter       variant_1       control       1.977              0.022        0.011                  1  ratio_difference\n3           converters       variant_2       control       0.301              0.019        0.006                  1   mean_difference\n4                leads       variant_2       control       0.602              0.076        0.017                  1   mean_difference\n5  leads_per_converter       variant_2       control       1.977              0.037        0.011                  1  ratio_difference\n6           converters       variant_2     variant_1       0.311              0.009        0.007                  0   mean_difference\n7                leads       variant_2     variant_1       0.647              0.030        0.017                  0   mean_difference\n8  leads_per_converter       variant_2     variant_1       2.049              0.014        0.012                  0  ratio_difference\n```\n\nThe `control_value` column shows `R_control` (the control arm's ratio), and `absolute_effect` is the estimated difference in ratios. 
Results integrate normally with `plot_effects`, `calculate_retrodesign`, and MCP correction.\n\n**With bootstrap**\n\nBootstrap correctly re-estimates `R_control` on each resample, so standard errors fully capture the uncertainty in the ratio baseline:\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"leads\"],\n    ratio_outcomes={\"leads_per_converter\": (\"leads\", \"converters\")},\n    bootstrap=True,\n    bootstrap_iterations=1000,\n    bootstrap_seed=42,\n)\n\nanalyzer.get_effects()\nprint(analyzer.results[[\"outcome\", \"absolute_effect\", \"standard_error\",\n                         \"abs_effect_lower\", \"abs_effect_upper\"]])\n```\n\n\u003e **Why not just subset to converters?** Analysing only users who converted conditions on a post-randomisation variable, creating selection bias. The delta method preserves the full randomised sample and gives an unbiased estimate of the causal effect on the population-average ratio.\n\n**Key result columns for ratio outcomes**\n\n| Column | Meaning |\n|---|---|\n| `control_value` | `R_control = mean(num_control) / mean(den_control)` for this comparison |\n| `absolute_effect` | Estimated difference in population-average ratios |\n| `relative_effect` | `absolute_effect / control_value` |\n| `effect_type` | `\"ratio_difference\"` |\n\n### Survival Analysis (Cox Models)\n\nAnalyze time-to-event outcomes using Cox proportional hazards:\n\n```python\nfrom experiment_utils import ExperimentAnalyzer\n\n# Specify Cox outcomes as tuples: (time_col, event_col)\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[(\"time_to_event\", \"event_occurred\")],\n    outcome_models=\"cox\",\n    balance_covariates=[\"age\", \"income\"],\n)\n\nanalyzer.get_effects()\n\n# Results report log(HR) as absolute_effect and HR as relative_effect\nprint(analyzer.results[[\"outcome\", \"absolute_effect\", \"relative_effect\", 
\"pvalue\"]])\n```\n\n**Cox with regression adjustment**\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[(\"survival_time\", \"died\")],\n    outcome_models=\"cox\",\n    regression_covariates=[\"age\", \"comorbidity_score\"],\n)\n\nanalyzer.get_effects()\n```\n\n**Cox with IPW + Regression (recommended for confounded data)**\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[(\"survival_time\", \"died\")],\n    outcome_models=\"cox\",\n    balance_covariates=[\"age\", \"comorbidity_score\"],\n    adjustment=\"balance\",\n    regression_covariates=[\"age\", \"comorbidity_score\"],\n    estimand=\"ATE\",\n)\n\nanalyzer.get_effects()\n```\n\n\u003e **Note**: IPW alone for Cox models estimates the marginal hazard ratio, which differs from the conditional HR due to non-collapsibility. The package will warn you if you use IPW without regression covariates. See [Non-Collapsibility](#non-collapsibility-of-hazard-and-odds-ratios) for details.\n\n**Alternative: separate event_col parameter**\n\n```python\n# Equivalent to tuple notation\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"survival_time\"],\n    outcome_models=\"cox\",\n    event_col=\"died\",  # Applies to all outcomes\n)\n```\n\n**Bootstrap for survival models**\n\nBootstrap can be slow for Cox models with low event rates. 
Use `skip_bootstrap_for_survival` to fall back to robust standard errors:\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[(\"survival_time\", \"died\")],\n    outcome_models=\"cox\",\n    bootstrap=True,\n    skip_bootstrap_for_survival=True,  # Use Cox robust SEs instead\n)\n```\n\n### Bootstrap Inference\n\nGet robust confidence intervals and p-values via bootstrapping:\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\"],\n    balance_covariates=[\"age\", \"income\"],\n    adjustment=\"balance\",\n    bootstrap=True,\n    bootstrap_iterations=2000,\n    bootstrap_ci_method=\"percentile\",\n    bootstrap_seed=42,  # For reproducibility\n)\n\nanalyzer.get_effects()\n\n# Bootstrap results include robust CIs\nresults = analyzer.results\nprint(results[[\"outcome\", \"absolute_effect\", \"abs_effect_lower\", \n               \"abs_effect_upper\", \"inference_method\"]])\n```\n\n**When to use bootstrap:**\n- Small sample sizes\n- Non-normal distributions\n- Skepticism about asymptotic assumptions\n- Want robust, distribution-free inference\n\n**Effect probabilities and ROPE (bootstrap only):**\n\nRead off decision-ready probabilities directly from the bootstrap distribution:\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\"],\n    bootstrap=True,\n    bootstrap_iterations=2000,\n    # P(effect \u003e threshold) — threshold is 0 by default\n    prob_threshold_abs=0.0,        # outcome units\n    prob_threshold_rel=0.02,       # fractional scale: 0.02 = 2% lift\n    # Region of Practical Equivalence (optional)\n    rope_abs=(-0.5, 0.5),          # outcome units\n    rope_rel=(-0.01, 0.01),        # +/- 1% relative\n)\nanalyzer.get_effects()\nr = analyzer.results.iloc[0]\n# P(absolute_effect \u003e 0), P(relative_effect \u003e 2%)\nr[\"prob_abs_effect_gt\"], 
r[\"prob_rel_effect_gt\"]\n# Three-way ROPE decision on the relative scale\nr[\"prob_rel_effect_below_rope\"], r[\"prob_rel_effect_in_rope\"], r[\"prob_rel_effect_above_rope\"]\n```\n\nEach scale is configured independently. Absolute params (`prob_threshold_abs`, `rope_abs`) use outcome units; relative params (`prob_threshold_rel`, `rope_rel`) use fractions of the control mean. Leaving `rope_abs` / `rope_rel` as `None` disables the ROPE columns for that scale.\n\n### Multiple Experiments\n\nAnalyze multiple experiments simultaneously:\n\n```python\n# Data with multiple experiments\ndf = pd.DataFrame({\n    \"experiment\": [\"exp_A\", \"exp_A\", \"exp_B\", \"exp_B\"] * 100,\n    \"treatment\": [0, 1, 0, 1] * 100,\n    \"outcome\": np.random.randn(400),\n    \"age\": np.random.normal(35, 10, 400),\n})\n\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"outcome\"],\n    experiment_identifier=\"experiment\",  # Group by experiment\n    balance_covariates=[\"age\"],\n)\n\nanalyzer.get_effects()\n\n# Results include experiment column\nresults = analyzer.results\nprint(results.groupby(\"experiment\")[[\"absolute_effect\", \"pvalue\"]].first())\n\n# Balance per experiment (automatically calculated)\nbalance = analyzer.balance\nprint(balance.groupby(\"experiment\")[\"balance_flag\"].mean())\n```\n\n### Categorical Treatment Variables\n\nCompare multiple treatment variants:\n\n```python\ndf = pd.DataFrame({\n    \"treatment\": np.random.choice([\"control\", \"variant_A\", \"variant_B\"], 1000),\n    \"outcome\": np.random.randn(1000),\n})\n\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"outcome\"],\n)\n\nanalyzer.get_effects()\n\n# Results show all pairwise comparisons\nresults = analyzer.results\nprint(results[[\"treatment_group\", \"control_group\", \"absolute_effect\", \"pvalue\"]])\n```\n\n### Instrumental Variables (IV)\n\nWhen treatment assignment is confounded (e.g., 
non-compliance in an experiment), use an instrument -- a variable that affects treatment receipt but only affects the outcome through treatment:\n\n```python\nimport numpy as np\nimport pandas as pd\nfrom experiment_utils import ExperimentAnalyzer\n\n# Simulate encouragement design with non-compliance\nnp.random.seed(42)\nn = 5000\nZ = np.random.binomial(1, 0.5, n)            # Random encouragement (instrument)\nU = np.random.normal(0, 1, n)                 # Unobserved confounder\nD = np.random.binomial(1, 1 / (1 + np.exp(-(-1 + 0.5 * U + 2.5 * Z))))  # Actual treatment (confounded)\nY = 2.0 * D + 1.0 * U + np.random.normal(0, 1, n)  # Outcome (true LATE = 2.0)\n\ndf = pd.DataFrame({\"encouragement\": Z, \"treatment\": D, \"outcome\": Y})\n\n# IV estimation using encouragement as instrument for treatment\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"outcome\"],\n    instrument_col=\"encouragement\",\n    adjustment=\"IV\",\n)\n\nanalyzer.get_effects()\nprint(analyzer.results[[\"outcome\", \"absolute_effect\", \"standard_error\", \"pvalue\"]])\n```\n\n**IV with covariates:**\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"outcome\"],\n    instrument_col=\"encouragement\",\n    adjustment=\"IV\",\n    balance_covariates=[\"age\", \"region\"],  # Balance checked on instrument\n)\n\nanalyzer.get_effects()\n```\n\n**Key assumptions for valid IV estimation:**\n- **Relevance**: The instrument must be correlated with treatment (check first-stage F-statistic)\n- **Exclusion restriction**: The instrument affects the outcome *only* through treatment\n- **Independence**: The instrument is independent of unobserved confounders (holds by design in randomized encouragement)\n\n\u003e **Note**: IV estimation is only supported for OLS outcome models. 
For other model types (logistic, Cox, etc.), the analyzer will fall back to unadjusted estimation with a warning.\n\n### Multiple Comparison Adjustments\n\nControl the family-wise error rate (FWER) or the false discovery rate (FDR) when testing multiple hypotheses:\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\", \"revenue\", \"retention\", \"engagement\"],\n)\n\nanalyzer.get_effects()\n\n# Apply Bonferroni correction\nanalyzer.adjust_pvalues(method=\"bonferroni\")\n\nresults = analyzer.results\nprint(results[[\"outcome\", \"pvalue\", \"pvalue_mcp\", \"stat_significance_mcp\"]])\n```\n\n**Available methods:**\n- `bonferroni`: Most conservative, controls FWER\n- `holm`: Less conservative than Bonferroni, still controls FWER\n- `sidak`: Slightly less conservative than Bonferroni, exact when the tests are independent\n- `fdr_bh`: Benjamini-Hochberg, controls the FDR rather than the FWER (less conservative)\n\n### Equivalence Testing (TOST)\n\nTest whether two groups are practically equivalent using the **Two One-Sided Tests (TOST)** procedure, following [Lakens (2017)](https://doi.org/10.1177/1948550617697177). 
The unified `test_equivalence()` method handles equivalence, non-inferiority, and non-superiority as related tests within the same framework.\n\n**Equivalence testing** asks: *\"Can we confidently say the effect is small enough to be negligible?\"* — the opposite of standard NHST which tests for a difference.\n\n```python\nanalyzer.get_effects()\n\n# TOST equivalence: effect must fall within ±1.0 units\nanalyzer.test_equivalence(absolute_bound=1.0)\n\n# Bound as a fraction of control value (10%)\nanalyzer.test_equivalence(relative_bound=0.10)\n\n# Bound in standardized units (Cohen's d = 0.3, OLS only)\nanalyzer.test_equivalence(cohens_d_bound=0.3)\n\nresults = analyzer.results\nprint(results[[\"outcome\", \"absolute_effect\", \"eq_pvalue\", \"eq_conclusion\", \"eq_cohens_d\"]])\n```\n\n**Non-inferiority and non-superiority** are one-sided special cases:\n\n```python\n# Non-inferiority: treatment must not be worse than control by more than 1 unit\nanalyzer.test_equivalence(\n    test_type=\"non_inferiority\",\n    absolute_bound=1.0,\n    direction=\"higher_is_better\",\n)\n\n# Non-superiority: treatment must not be better than control by more than 1 unit\nanalyzer.test_equivalence(\n    test_type=\"non_superiority\",\n    absolute_bound=1.0,\n    direction=\"higher_is_better\",\n)\n```\n\n**Conclusion logic** (Lakens' four-cell matrix) combines NHST and TOST results:\n\n| NHST significant | TOST significant | Conclusion |\n|---|---|---|\n| No | Yes | `equivalent` — no significant effect, confirmed within bounds |\n| No | No | `inconclusive` — can't reject zero or confirm equivalence |\n| Yes | Yes | `equivalent_with_difference` — statistically significant but practically trivial |\n| Yes | No | `not_equivalent` — significant effect outside equivalence bounds |\n\nAdded columns (all prefixed with `eq_`):\n- `eq_test_type` — \"equivalence\", \"non_inferiority\", or \"non_superiority\"\n- `eq_bound_lower`, `eq_bound_upper` — equivalence bounds in raw units\n- 
`eq_pvalue_lower`, `eq_pvalue_upper` — p-values for lower and upper one-sided tests\n- `eq_pvalue` — TOST: max of both p-values; NI/NS: the relevant one-sided p-value\n- `eq_ci_lower`, `eq_ci_upper` — 90% confidence interval (1 − 2α)\n- `eq_cohens_d` — observed effect in Cohen's d units\n- `eq_conclusion` — interpretive label from the four-cell matrix\n\n**Visualizing equivalence results:**\n\n```python\n# Standalone function\nfrom experiment_utils import plot_equivalence\nfig = plot_equivalence(data=analyzer.results)\n\n# Or as a class method\nfig = analyzer.plot_equivalence()\n```\n\n![Equivalence plot example](docs/assets/plot_equivalence_experiments.png)\n\nWhen the results contain multiple outcomes, each one gets its own stacked panel:\n\n![Equivalence plot — multiple outcomes](docs/assets/plot_equivalence_outcomes.png)\n\n### Combining Effects (Meta-Analysis)\n\nWhen you have multiple experiments or segments, pool results using fixed-effects or random-effects meta-analysis, or a simple weighted average.\n\n**Fixed-effects meta-analysis (inverse-variance weighting)**\n\nAssumes a single common true effect across all experiments. Pools estimates using inverse-variance weighting and produces a pooled effect with proper standard errors:\n\n```python\nfrom experiment_utils import ExperimentAnalyzer\n\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\"],\n    experiment_identifier=\"experiment\",\n    balance_covariates=[\"age\"],\n)\n\nanalyzer.get_effects()\n\n# Pool across experiments — fixed effects (default)\npooled = analyzer.combine_effects(grouping_cols=[\"outcome\"])\nprint(pooled[[\"outcome\", \"experiments\", \"absolute_effect\", \"standard_error\", \"pvalue\"]])\n```\n\n**Random-effects meta-analysis (Paule-Mandel + HKSJ)**\n\nWhen experiments may have genuinely different true effects (e.g., different markets, time periods, or populations), use random-effects. 
The Paule-Mandel τ² estimator quantifies between-experiment heterogeneity, and Hartung-Knapp-Sidik-Jonkman (HKSJ) confidence intervals are used for robustness, especially with few experiments:\n\n```python\n# Random-effects pooling\npooled_re = analyzer.combine_effects(grouping_cols=[\"outcome\"], method=\"random\")\nprint(pooled_re[[\"outcome\", \"experiments\", \"absolute_effect\", \"standard_error\", \"pvalue\"]])\n\n# Inspect heterogeneity diagnostics (τ², I², Cochran's Q, k)\nprint(analyzer.meta_stats_)\n```\n\nKey heterogeneity metrics stored in `analyzer.meta_stats_`:\n\n| Metric | Description |\n|--------|-------------|\n| `tau2` | Between-experiment variance (τ²); 0 means no heterogeneity |\n| `i2` | % of total variance due to heterogeneity (I²); \u003e50% = substantial |\n| `q` | Cochran's Q statistic |\n| `k` | Number of experiments pooled |\n\n**Custom grouping:**\n\n```python\n# Pool by outcome and region (e.g., combine experiments within each region)\npooled_by_region = analyzer.combine_effects(grouping_cols=[\"region\", \"outcome\"], method=\"random\")\nprint(pooled_by_region)\n```\n\n**Weighted average aggregation (`aggregate_effects`)**\n\nA simpler alternative that weights by treatment group size (useful for quick summaries, but `combine_effects` provides better standard error estimates):\n\n```python\naggregated = analyzer.aggregate_effects(grouping_cols=[\"outcome\"])\nprint(aggregated[[\"outcome\", \"experiments\", \"absolute_effect\", \"pvalue\"]])\n```\n\n**When to use fixed vs. 
random effects:**\n\n| Scenario | Recommended |\n|---|---|\n| Experiments are replications of the same study | Fixed effects |\n| Experiments span different markets, regions, or time periods | Random effects |\n| Small number of experiments (k \u003c 10) | Random effects with HKSJ CIs |\n| Exploring heterogeneity | Random effects (inspect `meta_stats_`) |\n\n### Visualizing Effects\n\n`plot_effects` produces a Cleveland dot plot with confidence intervals and optional meta-analysis pooling. It is available both as a **standalone function** and as a method on `ExperimentAnalyzer`.\n\n![plot_effects with pooled random-effects row and pp + relative combined labels](docs/assets/plot_effects_pooled.png)\n\n*Cleveland dot plot with per-experiment rows, a random-effects pooled row (diamond), and combined annotations. `pct_points=True` is applied automatically only to `converted` (a proportion — control ~8%), while `revenue` (dollar values ~$45) is left in raw units.*\n\nThe two axis roles are controlled by `y`:\n\n| `y` | Rows (y-axis) | Panels (subplots) |\n|---|---|---|\n| `\"experiment\"` *(default)* | Experiment labels | Outcomes |\n| `\"outcome\"` | Outcomes | Experiment labels |\n\n**Basic usage — multiple experiments, outcomes as panels (default)**\n\n```python\nanalyzer.get_effects()\n\n# show_values=True is the default — each dot is annotated with its effect value\nfig = analyzer.plot_effects(title=\"Treatment Effects\")\nplt.show()\n```\n\n**Percentage points (`pct_points=True`)**\n\nFor rate/proportion outcomes, display absolute effects as percentage points instead of raw decimals (e.g. `+3.0pp` instead of `+0.030`). The scaling is applied **per outcome** automatically — outcomes whose control value is outside [0, 1] (e.g. 
revenue in dollars) are left in their original units:\n\n```python\nfig = analyzer.plot_effects(\n    outcomes=\"converted\",\n    pct_points=True,\n    title=\"Conversion Rate (pp)\",\n)\nplt.show()\n```\n\n**Combined label — absolute (pp) + relative in one annotation**\n\nShow both metrics on a single panel with `combine_values=True`. The x-axis label updates automatically:\n\n```python\n# \"+3.0pp (+15.4%)\" on the absolute panel\nfig = analyzer.plot_effects(\n    outcomes=\"converted\",\n    effect=\"absolute\",\n    pct_points=True,\n    combine_values=True,\n    title=\"Conversion Rate\",\n)\nplt.show()\n\n# \"+15.4% (+3.0pp)\" on the relative panel\nfig = analyzer.plot_effects(\n    outcomes=\"converted\",\n    effect=\"relative\",\n    pct_points=True,\n    combine_values=True,\n    title=\"Conversion Rate\",\n)\nplt.show()\n```\n\nX-axis labels when `combine_values=True`:\n\n| `effect` | `pct_points` | x-axis label |\n|---|---|---|\n| `\"absolute\"` | `False` | `Absolute (Relative) Effect` |\n| `\"absolute\"` | `True` | `Absolute (Relative) Effect (pp)` |\n| `\"relative\"` | — | `Relative (Absolute) Effect` |\n\n**Side-by-side absolute (pp) and relative panels**\n\n```python\nfig = analyzer.plot_effects(\n    effect=[\"absolute\", \"relative\"],\n    pct_points=True,\n    title=\"Effects — Absolute \u0026 Relative\",\n)\nplt.show()\n```\n\n**Single experiment, multiple outcomes on the y-axis**\n\nWhen you have one experiment and several outcomes, flip the axes with `y=\"outcome\"`.\nSet `show_panel_titles=False` when the single experiment panel heading would be redundant:\n\n```python\nfig = analyzer.plot_effects(\n    y=\"outcome\",\n    title=\"My Experiment\",\n    show_panel_titles=False,\n)\nplt.show()\n```\n\nUse `panel_titles` when you do want to customise the panel heading:\n\n```python\nfig = analyzer.plot_effects(\n    y=\"outcome\",\n    title=\"My Experiment\",\n    panel_titles=\"Treatment vs Control\",   # single string → same for all 
panels\n)\nplt.show()\n```\n\n**Multiple experiments, outcomes on the y-axis**\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"revenue\", \"converted\", \"orders\"],\n    experiment_identifier=[\"country\", \"type\"],\n)\nanalyzer.get_effects()\n\n# One panel per experiment group; rows = outcomes\nfig = analyzer.plot_effects(\n    y=\"outcome\",\n    panel_titles={\"US | email\": \"US — Email\", \"EU | push\": \"EU — Push\"},\n)\nplt.show()\n```\n\n**Standalone usage**\n\n```python\nfrom experiment_utils import plot_effects\n\nfig = plot_effects(\n    results=analyzer.results,\n    experiment_identifier=\"experiment\",\n    alpha=0.05,\n    title=\"Treatment Effects\",\n    save_path=\"effects.png\",   # optional; supports png, pdf, svg, ...\n)\nplt.show()\n```\n\n**Add a pooled meta-analysis row**\n\n```python\n# Auto-compute pooled estimate (IVW fixed effects, default)\nfig = analyzer.plot_effects(\n    outcomes=\"revenue\",\n    meta_analysis=True,\n    title=\"Revenue — with Pooled Estimate\",\n)\nplt.show()\n\n# Random-effects pooling (Paule-Mandel + HKSJ)\nfig = analyzer.plot_effects(\n    meta_analysis=True,\n    meta_method=\"random\",      # \"fixed\" (default) or \"random\"\n    title=\"Revenue — Random-Effects Pooled\",\n)\nplt.show()\n\n# Pass a pre-computed combine_effects() DataFrame\npooled = analyzer.combine_effects(grouping_cols=[\"outcome\"], method=\"random\")\nfig = analyzer.plot_effects(meta_analysis=pooled)\nplt.show()\n```\n\n**Side-by-side absolute (pp) and relative panels with random-effects pooling**\n\n```python\nfig = analyzer.plot_effects(\n    effect=[\"absolute\", \"relative\"],\n    pct_points=True,\n    meta_analysis=True,\n    meta_method=\"random\",\n    title=\"Effects — Absolute \u0026 Relative\",\n)\nplt.show()\n```\n\n![Absolute and relative side-by-side with random-effects pooled row](docs/assets/plot_effects_absolute_relative.png)\n\n*Side-by-side absolute (pp) 
and relative panels. The pooled diamond row uses random-effects meta-analysis.*\n\n**Split into one figure per group**\n\nWhen `experiment_identifier` contains multiple columns (e.g. `[\"country\", \"type\"]`), `group_by` produces one figure per unique value. Row labels are built from the remaining identifier columns automatically.\n\n```python\n# One figure per country; rows = type\nfigs = analyzer.plot_effects(group_by=\"country\", meta_analysis=True)\nfor fig in figs.values():\n    plt.figure(fig.number)\n    plt.show()\n\n# save_path inserts the group key before the extension:\n#   \"effects.png\" → \"effects_US.png\", \"effects_EU.png\", ...\nfigs = analyzer.plot_effects(group_by=\"country\", save_path=\"effects.png\")\n```\n\n`group_by` returns `dict[str, Figure]`; without it a single `Figure` is returned.\n\n**Multiple comparison adjustments**\n\nIf `adjust_pvalues()` has been called before plotting, the plot automatically uses the adjusted significance column (`stat_significance_mcp`) and updates the legend label accordingly:\n\n```python\nanalyzer.get_effects()\nanalyzer.adjust_pvalues(method=\"holm\")\n\n# Legend shows \"Significant (holm, α=0.05)\" and coloring uses adjusted p-values\nfig = analyzer.plot_effects()\nplt.show()\n```\n\n**Color by direction with a custom palette**\n\nUse `color_direction=True` to color effects by sign and significance. 
Override any of the default colors with `color_palette`; passing `color_palette` also enables direction coloring automatically:\n\n```python\nfig = analyzer.plot_effects(\n    color_direction=True,\n    color_palette={\n        \"sig_pos\": \"#047857\",\n        \"sig_neg\": \"#be123c\",\n        \"nsig_pos\": \"#86efac\",\n        \"nsig_neg\": \"#94a3b8\",\n        \"nsig_zero\": \"#64748b\",\n    },\n)\nplt.show()\n```\n\n**Key parameters**\n\n| Parameter | Default | Description |\n|---|---|---|\n| `y` | `\"experiment\"` | `\"experiment\"` — rows = experiments, panels = outcomes; `\"outcome\"` — rows = outcomes, panels = experiments |\n| `panel_titles` | `None` | Override subplot titles: `str` (all panels) or `dict` (per-panel) |\n| `show_panel_titles` | `True` | Show outcome/experiment subplot headings; set `False` to hide redundant panel titles |\n| `outcomes` | `None` | Outcome(s) to include; `None` = all |\n| `effect` | `\"absolute\"` | `\"absolute\"`, `\"relative\"`, or `[\"absolute\", \"relative\"]` for side-by-side |\n| `meta_analysis` | `None` | `True` (auto-compute pooled row from visible rows), `DataFrame` (pre-computed), or `None` |\n| `meta_method` | `\"fixed\"` | Meta-analysis method: `\"fixed\"` (IVW) or `\"random\"` (Paule-Mandel + HKSJ) |\n| `sort_by_magnitude` | `True` | Sort rows by `\\|effect\\|` descending |\n| `group_by` | `None` | Column(s) to split into separate figures |\n| `comparison` | `None` | `(treatment, control)` tuple or list of tuples to filter to specific comparisons |\n| `title` | `None` | Figure suptitle (group value used automatically when `group_by` is set) |\n| `show_zero_line` | `True` | Vertical reference line at zero |\n| `show_values` | `True` | Annotate each dot with its effect value (`*` when significant) |\n| `value_decimals` | auto | Decimal places for value labels. 
Defaults to `1` when `pct_points=True` or relative effect shown; `2` otherwise |\n| `pct_points` | `False` | When `True`, auto-detects proportion-scale outcomes (control value in [0, 1]) and scales their absolute effects ×100 for display as percentage points (pp). Raw-unit outcomes such as revenue are left unscaled. Axis tick labels and annotations are updated per panel. |\n| `combine_values` | `False` | Append the secondary effect in parentheses to each annotation: `+3.0pp (+15.4%)` or `+15.4% (+3.0pp)`. Also updates the x-axis label |\n| `color_direction` | `False` | Color effects by sign and significance |\n| `color_palette` | `None` | Override `sig_pos`, `sig_neg`, `nsig_pos`, `nsig_neg`, and/or `nsig_zero` colors. Passing a palette also enables `color_direction` |\n| `panel_spacing` | `None` | Horizontal whitespace between panels (`wspace`). Try `0.4`–`0.8` when panels overlap |\n| `repeat_ylabels` | `False` | Show y-axis tick labels on every panel, not only the leftmost |\n| `row_labels` | `None` | Rename individual y-axis row labels. `dict` mapping auto-generated labels to display strings, e.g. `{\"US \\| email\": \"Email (US)\"}` |\n| `save_path` | `None` | File path to save the figure. With `group_by`, the group key is inserted before the extension: `\"effects.png\"` → `\"effects_US.png\"`, etc. |\n| `figsize` | auto | `(width, height)` in inches |\n\n### Common Support / Propensity Score Overlap\n\n`plot_overlap` produces a mirror density plot of propensity scores for common-support diagnostics. It is available both as a **standalone function** and as a method on `ExperimentAnalyzer`. 
Requires `adjustment=\"balance\"` with a PS-based method (`ps-logistic` or `ps-xgboost`).\n\n**Standalone usage**\n\n```python\nfrom experiment_utils import plot_overlap\n\n# After get_effects(), propensity scores are stored in analyzer.weights\nfig = plot_overlap(\n    analyzer.weights,\n    treatment_col=\"treatment\",\n    propensity_col=\"propensity_score\",\n    title=\"Common Support\",\n)\nplt.show()\n```\n\n**Via `ExperimentAnalyzer` (recommended)**\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\", \"revenue\"],\n    balance_covariates=[\"age\", \"income\", \"region\"],\n    adjustment=\"balance\",\n    balance_method=\"ps-logistic\",\n)\n\nanalyzer.get_effects()\n\n# Mirror density plot from stored propensity scores\nfig = analyzer.plot_overlap(title=\"Common Support\")\nplt.show()\n```\n\n**Split by experiment with `group_by`**\n\n```python\n# One figure per experiment group\nfigs = analyzer.plot_overlap(group_by=\"region\")\nfor region, fig in figs.items():\n    plt.figure(fig.number)\n    plt.show()\n```\n\nThe mirror density plot shows treatment scores (facing up, blue) and control scores (facing down, red). The green band marks the shared overlap region; the annotation shows the KDE-based overlap coefficient.\n\n![Common support mirror density plot](docs/assets/common_support.png)\n\n*Mirror density plot of estimated propensity scores. 
Large overlap indicates good common support; thin or non-overlapping tails may warrant overlap weights (`estimand=\"ATO\"`) or trimming.*\n\n**Auto-plot during `get_effects()`**\n\nPass `overlap_plot=True` to render the mirror density plot automatically for each comparison during `get_effects()`, without needing to call `plot_overlap()` separately:\n\n```python\nanalyzer = ExperimentAnalyzer(\n    ...\n    assess_overlap=True,   # log the KDE-based overlap coefficient\n    overlap_plot=True,     # render mirror density plot automatically\n)\nanalyzer.get_effects()\n```\n\n**Key parameters of `plot_overlap`**\n\n| Parameter | Default | Description |\n|---|---|---|\n| `group_by` | `None` | Column(s) to split into separate figures |\n| `bw_method` | `None` | KDE bandwidth (Scott's rule when `None`) |\n| `show_overlap_region` | `True` | Shade the region where both densities exceed 5% of their peak |\n| `show_overlap_coef` | `True` | Annotate with the KDE overlap coefficient |\n| `title` | `None` | Figure title |\n| `figsize` | `(7, 4)` | Figure size in inches |\n| `save_path` | `None` | File path to save; group key inserted before extension with `group_by` |\n\n**Overlap coefficient**\n\nThe KDE-based overlap coefficient — the integral of `min(f_treat(x), f_control(x))` — is a single number between 0 (no overlap) and 1 (identical distributions). 
A value above 0.7 is generally considered acceptable.\n\n```python\ncoef = analyzer.get_overlap_coefficient(\n    treatment_scores=ps_treat,\n    control_scores=ps_control,\n)\nprint(f\"Overlap coefficient: {coef:.3f}\")\n```\n\n**When overlap is poor** (bimodal distributions, thin tails):\n\n- Switch to `estimand=\"ATO\"` (overlap weights) to automatically downweight extreme units\n- Or use `trim_ps=True` to drop units outside `[trim_ps_lower, trim_ps_upper]`\n- Set `trim_overlap_threshold` to skip trimming when overlap is already poor\n\n### Retrodesign Analysis\n\nAssess reliability of significant results (post-hoc power analysis):\n\n```python\nfrom experiment_utils import ExperimentAnalyzer\n\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\"],\n)\n\nanalyzer.get_effects()\n\n# Calculate Type S and Type M errors assuming true effect is 0.02\nretro = analyzer.calculate_retrodesign(true_effect=0.02)\n\nprint(retro[[\"outcome\", \"power\", \"type_s_error\", \"type_m_error\",\n             \"relative_bias\", \"trimmed_abs_effect\"]])\n```\n\n**Metrics explained:**\n- `power`: Probability of detecting the assumed true effect\n- `type_s_error`: Probability of wrong sign when significant (if underpowered)\n- `type_m_error`: Expected exaggeration ratio (mean |observed|/|true|)\n- `relative_bias`: Expected bias ratio preserving signs (mean observed/true); typically lower than `type_m_error` because wrong-sign estimates partially cancel overestimates\n- `trimmed_abs_effect`: Bias-corrected effect estimate (`absolute_effect / relative_bias`); deflates the observed effect by the sign-preserving exaggeration factor to approximate the true effect\n\n## Power Analysis\n\nDesign well-powered experiments using simulation-based power analysis.\n\n### Calculate Power\n\nEstimate statistical power for a given sample size:\n\n```python\nfrom experiment_utils import PowerSim\n\n# Initialize power simulator for proportion 
metric\npower_sim = PowerSim(\n    metric=\"proportion\",      # or \"average\" for continuous outcomes\n    relative_effect=False,    # False = absolute effect, True = relative\n    variants=1,               # Number of treatment variants\n    nsim=1000,               # Number of simulations\n    alpha=0.05,              # Significance level\n    alternative=\"two-tailed\" # or \"one-tailed\"\n)\n\n# Calculate power\npower_result = power_sim.get_power(\n    baseline=[0.10],          # Control conversion rate\n    effect=[0.02],           # Absolute effect size (2pp lift)\n    sample_size=[5000]       # Total sample size\n)\n\nprint(f\"Power: {power_result['power'].iloc[0]:.2%}\")\n```\n\n**Example: Multiple variants**\n\n```python\n# Compare 2 treatments vs control\npower_sim = PowerSim(metric=\"proportion\", variants=2, nsim=1000)\n\npower_result = power_sim.get_power(\n    baseline=0.10,\n    effect=[0.02, 0.03],  # Different effects for each variant\n    sample_size=6000\n)\n\nprint(power_result[[\"comparison\", \"power\"]])\n```\n\n### Power from Real Data\n\nWhen your data doesn't follow standard parametric assumptions, estimate power by bootstrapping directly from observed data using `get_power_from_data()`. 
Instead of generating synthetic data from a distribution, it repeatedly samples from your actual dataset and injects the specified effect:\n\n```python\nfrom experiment_utils import PowerSim\nimport pandas as pd\n\n# Use real data for power estimation\npower_sim = PowerSim(metric=\"average\", variants=1, nsim=1000)\n\npower_result = power_sim.get_power_from_data(\n    df=historical_data,          # Your actual dataset\n    metric_col=\"revenue\",        # Column to test\n    sample_size=5000,            # Sample size per group\n    effect=3.0,                  # Effect to inject (absolute)\n)\n\nprint(f\"Power: {power_result['power'].iloc[0]:.2%}\")\n```\n\n**When to use `get_power_from_data` vs `get_power`:**\n- Use `get_power_from_data` when your metric has a non-standard distribution (heavy tails, skewed, zero-inflated)\n- Use `get_power` for standard parametric scenarios (proportions, means, counts)\n\n**With compliance:**\n\n```python\n# Account for 80% compliance\npower_result = power_sim.get_power_from_data(\n    df=historical_data,\n    metric_col=\"revenue\",\n    sample_size=5000,\n    effect=3.0,\n    compliance=0.80,\n)\n```\n\n### Grid Power Simulation\n\nExplore power across a grid of parameter combinations using `grid_sim_power()`. 
This is useful for understanding how power varies with sample size, effect size, and baseline rates:\n\n```python\nfrom experiment_utils import PowerSim\n\npower_sim = PowerSim(metric=\"proportion\", variants=1, nsim=1000)\n\n# Simulate power across a grid of scenarios\ngrid_results = power_sim.grid_sim_power(\n    baseline_rates=[0.05, 0.10, 0.15],\n    effects=[0.02, 0.03, 0.05],\n    sample_sizes=[1000, 2000, 5000, 10000],\n    plot=True,  # Generate power curves\n)\n\nprint(grid_results.head())\n```\n\n**With multiple variants and custom compliance:**\n\n```python\npower_sim = PowerSim(metric=\"average\", variants=2, nsim=1000)\n\ngrid_results = power_sim.grid_sim_power(\n    baseline_rates=[50.0],\n    effects=[2.0, 5.0],\n    sample_sizes=[500, 1000, 2000, 5000],\n    standard_deviations=[[20.0]],\n    compliances=[[0.8]],\n    threads=4,        # Parallelize across scenarios\n    plot=True,\n)\n```\n\nThe output DataFrame includes all input parameters alongside the estimated power for each comparison, making it easy to filter and compare scenarios.\n\n### Find Sample Size\n\nFind the minimum sample size needed to achieve target power:\n\n```python\nfrom experiment_utils import PowerSim\n\npower_sim = PowerSim(metric=\"proportion\", variants=1, nsim=1000)\n\n# Find sample size for 80% power\nsample_result = power_sim.find_sample_size(\n    power=0.80,\n    baseline=0.10,\n    effect=0.02\n)\n\nprint(f\"Required sample size: {sample_result['total_sample_size'].iloc[0]:,.0f}\")\nprint(f\"Achieved power: {sample_result['achieved_power_by_comparison'].iloc[0]:.2%}\")\n```\n\n**Different power targets per comparison:**\n\n```python\n# Primary outcome needs 90%, secondary needs 80%\npower_sim = PowerSim(metric=\"proportion\", variants=2, nsim=1000)\n\nsample_result = power_sim.find_sample_size(\n    power={(0,1): 0.90, (0,2): 0.80},\n    baseline=0.10,\n    effect=[0.05, 0.03]\n)\n\nprint(sample_result[[\"comparison\", \"sample_size_by_group\", 
\"achieved_power\"]])\n```\n\n**Optimize allocation ratio:**\n\n```python\n# Find optimal allocation to minimize total sample size\nsample_result = power_sim.find_sample_size(\n    power=0.80,\n    baseline=0.10,\n    effect=0.05,\n    optimize_allocation=True\n)\n\nprint(f\"Optimal allocation: {sample_result['allocation_ratio'].iloc[0]}\")\nprint(f\"Total sample size: {sample_result['total_sample_size'].iloc[0]:,.0f}\")\n```\n\n**Custom allocation:**\n\n```python\n# 30% control, 70% treatment\nsample_result = power_sim.find_sample_size(\n    power=0.80,\n    baseline=0.10,\n    effect=0.02,\n    allocation_ratio=[0.3, 0.7]\n)\n```\n\n### TOST Equivalence Power\n\nEstimate the sample size needed to demonstrate equivalence using TOST. Equivalence tests require substantially larger samples than standard superiority tests — `power_tost()` uses simulation to estimate power for a given equivalence bound:\n\n```python\nfrom experiment_utils import PowerSim\n\npower_sim = PowerSim(metric=\"average\", nsim=500)\n\n# How much power do we have to demonstrate equivalence within ±1.0 units?\npower_results = power_sim.power_tost(\n    sample_sizes=[100, 200, 500, 1000],\n    equivalence_bound=1.0,   # absolute Δ\n    true_effect=0.0,          # assumed true difference (0 = truly equivalent)\n    pooled_sd=2.0,            # population SD\n    alpha=0.05,\n)\n\nprint(power_results)\n#  sample_size  power    se  nsim\n#          100   0.44  0.02   500\n#          200   0.78  0.02   500\n#          500   0.99  0.00   500\n#         1000   1.00  0.00   500\n```\n\nFor proportion metrics, pass `baseline` and use `equivalence_bound` as a probability difference:\n\n```python\npower_sim = PowerSim(metric=\"proportion\", nsim=500)\n\npower_results = power_sim.power_tost(\n    sample_sizes=[500, 1000, 2000],\n    equivalence_bound=0.05,   # ±5 percentage points\n    baseline=0.50,\n)\n```\n\n### Simulate Retrodesign\n\nProspective analysis of Type S (sign) and Type M (magnitude) 
errors:\n\n```python\nfrom experiment_utils import PowerSim\n\npower_sim = PowerSim(metric=\"proportion\", variants=1, nsim=5000)\n\n# Simulate underpowered study\nretro = power_sim.simulate_retrodesign(\n    true_effect=0.02,\n    sample_size=500,\n    baseline=0.10\n)\n\nprint(f\"Power: {retro['power'].iloc[0]:.2%}\")\nprint(f\"Type S Error: {retro['type_s_error'].iloc[0]:.2%}\")\nprint(f\"Exaggeration Ratio: {retro['exaggeration_ratio'].iloc[0]:.2f}x\")\nprint(f\"Relative Bias: {retro['relative_bias'].iloc[0]:.2f}x\")\n```\n\n**Understanding retrodesign metrics:**\n\n| Metric | Description |\n|--------|-------------|\n| `power` | Probability of detecting the true effect |\n| `type_s_error` | Probability of getting wrong sign when significant |\n| `exaggeration_ratio` | Expected overestimation (mean \u0026#124;observed\u0026#124;/\u0026#124;true\u0026#124;) |\n| `relative_bias` | Expected bias preserving signs (mean observed/true) \u003cbr\u003e Lower than exaggeration_ratio because Type S errors partially cancel out overestimates |\n| `median_significant_effect` | Median effect among significant results |\n| `prop_overestimate` | % of significant results that overestimate |\n\n**Compare power scenarios:**\n\n```python\n# Low power scenario\nretro_low = power_sim.simulate_retrodesign(\n    true_effect=0.02, sample_size=500, baseline=0.10\n)\n\n# High power scenario\nretro_high = power_sim.simulate_retrodesign(\n    true_effect=0.02, sample_size=5000, baseline=0.10\n)\n\nprint(f\"Low power - Exaggeration: {retro_low['exaggeration_ratio'].iloc[0]:.2f}x, \"\n      f\"Relative bias: {retro_low['relative_bias'].iloc[0]:.2f}x\")\nprint(f\"High power - Exaggeration: {retro_high['exaggeration_ratio'].iloc[0]:.2f}x, \"\n      f\"Relative bias: {retro_high['relative_bias'].iloc[0]:.2f}x\")\n```\n\n**Multiple variants:**\n\n```python\npower_sim = PowerSim(metric=\"proportion\", variants=3, nsim=5000)\n\nretro = power_sim.simulate_retrodesign(\n    true_effect=[0.02, 0.03, 
0.04],  # Different effects per variant\n    sample_size=1000,\n    baseline=0.10,\n    comparisons=[(0, 1), (0, 2)]\n)\n\nprint(retro[[\"comparison\", \"power\", \"type_s_error\", \"exaggeration_ratio\", \"relative_bias\"]])\n```\n\n## Utilities\n\n### Balanced Random Assignment\n\nGenerate balanced treatment assignments with optional block randomization.\nVariant distribution and, when covariates are provided, a covariate balance\nsummary are always printed.\n\n```python\nfrom experiment_utils import balanced_random_assignment\nimport pandas as pd\nimport numpy as np\n\n# Create sample data\nnp.random.seed(42)\nusers = pd.DataFrame({\n    \"user_id\": range(1000),\n    \"age_group\": np.random.choice([\"18-25\", \"26-35\", \"36-45\", \"46+\"], 1000),\n    \"region\": np.random.choice([\"North\", \"South\", \"East\", \"West\"], 1000),\n    \"age\": np.random.normal(35, 10, 1000),\n})\n\n# Simple 50/50 split — prints variant distribution automatically\nusers[\"treatment\"] = balanced_random_assignment(\n    users,\n    allocation_ratio=0.5,\n    seed=42\n)\n```\n\n**Block randomization (stratify within subgroups):**\n\n```python\n# Stratify by age_group and region; check balance on the same variables\nusers[\"treatment_stratified\"] = balanced_random_assignment(\n    users,\n    allocation_ratio=0.5,\n    stratification_covariates=[\"age_group\", \"region\"],\n    seed=42\n)\n```\n\nWarns automatically if any stratification category has low prevalence (\u003c 5 % by\ndefault) and suggests not blocking on that variable.\n\n**Check balance on additional covariates:**\n\n```python\n# Stratify by region; check balance on a broader set\nusers[\"treatment_stratified\"] = balanced_random_assignment(\n    users,\n    allocation_ratio=0.5,\n    stratification_covariates=[\"region\"],\n    balance_covariates=[\"age_group\", \"region\", \"age\"],\n    seed=42\n)\n```\n\n**Multiple variants:**\n\n```python\n# Three variants with equal allocation\nusers[\"assignment\"] = 
balanced_random_assignment(\n    users,\n    variants=[\"control\", \"variant_A\", \"variant_B\"]\n)\n\n# Custom allocation ratios with stratification\nusers[\"assignment_custom\"] = balanced_random_assignment(\n    users,\n    variants=[\"control\", \"variant_A\", \"variant_B\"],\n    allocation_ratio={\"control\": 0.5, \"variant_A\": 0.3, \"variant_B\": 0.2},\n    stratification_covariates=[\"age_group\"]\n)\n```\n\n**Key parameters:**\n- `allocation_ratio`: Float (binary) or dict (multiple variants)\n- `stratification_covariates`: Columns to block-randomize on (continuous vars are auto-binned)\n- `balance_covariates`: Columns to check balance for after assignment (defaults to `stratification_covariates`)\n- `smd_threshold`: SMD threshold for balance flag (default `0.1`)\n- `min_stratum_pct`: Minimum category prevalence before a stratification warning is raised (default `0.05`)\n- `min_stratum_n`: Minimum absolute category count before a stratification warning is raised (default `10`)\n- `seed`: Random seed for reproducibility\n\n### Standalone Balance Checker\n\nCheck covariate balance on any dataset without using ExperimentAnalyzer:\n\n```python\nfrom experiment_utils import check_covariate_balance\nimport pandas as pd\nimport numpy as np\n\n# Create sample data with imbalance\nnp.random.seed(42)\nn_treatment = 300\nn_control = 200\n\ndf = pd.concat([\n    pd.DataFrame({\n        \"treatment\": [1] * n_treatment,\n        \"age\": np.random.normal(40, 10, n_treatment),      # Older in treatment\n        \"income\": np.random.normal(60000, 15000, n_treatment),  # Higher income\n    }),\n    pd.DataFrame({\n        \"treatment\": [0] * n_control,\n        \"age\": np.random.normal(30, 10, n_control),         # Younger in control\n        \"income\": np.random.normal(45000, 15000, n_control),    # Lower income\n    })\n])\n\n# Check balance\nbalance = check_covariate_balance(\n    data=df,\n    treatment_col=\"treatment\",\n    covariates=[\"age\", \"income\"],\n  
  threshold=0.1  # SMD threshold\n)\n\nprint(balance)\n```\n\nOutput:\n```\n  covariate  mean_treated  mean_control       smd  balance_flag\n0       age         40.23         30.15  1.012345             0\n1    income      59823.45      45234.12  0.923456             0\n```\n\n**With categorical variables:**\n\n```python\ndf[\"region\"] = np.random.choice([\"North\", \"South\", \"East\", \"West\"], len(df))\n\nbalance = check_covariate_balance(\n    data=df,\n    treatment_col=\"treatment\",\n    covariates=[\"age\", \"income\", \"region\"],  # Automatic categorical detection\n    threshold=0.1\n)\n\n# Region will be expanded to dummy variables\nprint(balance[balance[\"covariate\"].str.contains(\"region\")])\n```\n\n**Use cases:**\n- Pre-experiment: Check if randomization worked\n- Post-assignment: Validate treatment assignment quality\n- Observational data: Assess comparability before adjustment\n- Research: Standalone balance analysis for publications\n\n## Advanced Topics\n\n### When to Use Different Adjustment Methods\n\n| Method | `adjustment` | Covariate params | Best for |\n|---|---|---|---|\n| No adjustment | `None` | none | Well-randomized experiments |\n| Regression | `None` | `regression_covariates=[\"x1\",\"x2\"]` | Variance reduction |\n| CUPED | `None` | `interaction_covariates=[\"pre_x\"]` | Variance reduction with pre-experiment data |\n| IPW | `\"balance\"` | `balance_covariates=[\"x1\",\"x2\"]` | Many covariates, non-linear confounding |\n| IPW + Regression | `\"balance\"` | both `balance_covariates` and `regression_covariates` | Extra robustness, survival models |\n| Overlap weights (ATO) | `\"balance\"` + `estimand=\"ATO\"` | `balance_covariates=[\"x1\",\"x2\"]` | Poor or moderate overlap, no threshold needed |\n| Trimming | `\"balance\"` + `trim_ps=True` | `balance_covariates=[\"x1\",\"x2\"]` | Robustness check, restrict to overlap region |\n| AIPW (doubly robust) | `\"aipw\"` | `balance_covariates=[\"x1\",\"x2\"]` | Best protection against 
misspecification |\n| IV | `\"IV\"` | `balance_covariates` optional | Non-compliance, endogenous treatment (requires `instrument_col`) |\n\n**Choosing a balance method:**\n- `ps-logistic`: Default, fast, interpretable\n- `ps-xgboost`: Non-linear relationships, complex interactions\n- `entropy`: Exact moment matching, but can be unstable with many covariates\n\n**Choosing an outcome model:**\n\n| Outcome type | Parameter |\n|---|---|\n| Continuous (revenue, time) | `outcome_models=\"ols\"` (default) |\n| Binary (converted, churned) | `outcome_models=\"logistic\"` |\n| Count (orders, clicks) | `outcome_models=\"poisson\"` |\n| Overdispersed count | `outcome_models=\"negative_binomial\"` |\n| Time-to-event | `outcome_models=\"cox\"` |\n| Ratio (leads/converter, revenue/session) | `ratio_outcomes={\"name\": (\"num_col\", \"den_col\")}` |\n\n### Non-Collapsibility of Hazard and Odds Ratios\n\nWhen using IPW without regression covariates for Cox or logistic models, the estimated effect may differ from the conditional effect even with perfect covariate balancing. This is not a bug -- it reflects a fundamental property called **non-collapsibility**.\n\n**What happens**: IPW creates a pseudo-population where treatment is independent of covariates, then fits a model without covariates. This estimates the **marginal** effect (population-average). For non-collapsible measures like hazard ratios and odds ratios, the marginal effect differs from the conditional effect.\n\n**When it matters**: The gap increases with stronger covariate effects on the outcome. 
For Cox models the gap is typically larger than for logistic models.\n\n**Recommendations**:\n- For Cox models: use **regression adjustment** or **IPW + Regression** to recover the conditional HR\n- For logistic models: the default marginal effects output (probability change) is collapsible, so this mainly affects odds ratios (`compute_marginal_effects=False`)\n- For OLS: no issue (mean differences are collapsible)\n- AIPW estimates are on the marginal scale but are doubly robust\n\nThe package warns when IPW is used without regression covariates for Cox models.\n\n### Handling Missing Data\n\nThe package handles missing data automatically:\n\n- **Treatment variable**: Rows with missing treatment are dropped (logged as a warning)\n- **Categorical covariates**: Missing values become an explicit \"Missing\" category\n- **Numeric covariates**: Mean imputation\n- **Binary covariates**: Mode imputation\n\n```python\nanalyzer = ExperimentAnalyzer(\n    data=df,  # Can contain missing values\n    treatment_col=\"treatment\",\n    outcomes=[\"conversion\"],\n    balance_covariates=[\"age\", \"region\"],\n)\n# Missing data is handled automatically\nanalyzer.get_effects()\n```\n\n### Best Practices\n\n**1. Always check balance:**\n\n```python\nanalyzer = ExperimentAnalyzer(data=df, treatment_col=\"treatment\",\n                              outcomes=[\"conversion\"],\n                              balance_covariates=[\"age\", \"income\"])\n\nanalyzer.get_effects()\n\n# Check balance from results\nbalance = analyzer.balance\nif balance[\"balance_flag\"].mean() \u003c 0.8:  # \u003c80% balanced\n    print(\"Consider rerunning with covariate adjustment\")\n```\n\n**2. Use bootstrap for small samples:**\n\n```python\nif len(df) \u003c 500:\n    analyzer = ExperimentAnalyzer(..., bootstrap=True, bootstrap_iterations=2000)\n```\n\n**3. 
Apply multiple comparison correction:**\n\n```python\n# Always correct p-values when testing multiple outcomes/experiments\nanalyzer.get_effects()\nanalyzer.adjust_pvalues(method=\"holm\")  # Less conservative than Bonferroni\n```\n\n**4. Report both absolute and relative effects:**\n\n```python\nresults = analyzer.results\nprint(results[[\"outcome\", \"absolute_effect\", \"relative_effect\",\n               \"abs_effect_lower\", \"abs_effect_upper\"]])\n```\n\n**5. Check sensitivity with retrodesign:**\n\n```python\n# After finding a significant result, check its reliability\nretro = analyzer.calculate_retrodesign(true_effect=0.01)\nif retro[\"type_m_error\"].iloc[0] \u003e 2:\n    print(\"Warning: Results may be exaggerated\")\n```\n\n### Common Workflows\n\n**Pre-experiment: Sample size calculation**\n\n```python\nfrom experiment_utils import PowerSim\n\n# Determine required sample size\npower_sim = PowerSim(metric=\"proportion\", variants=1, nsim=1000)\nresult = power_sim.find_sample_size(\n    power=0.80,\n    baseline=0.10,\n    effect=0.02\n)\nprint(f\"Need {result['total_sample_size'].iloc[0]:,.0f} users\")\n```\n\n**During experiment: Balance check**\n\n```python\nfrom experiment_utils import check_covariate_balance\n\n# Check if randomization worked\nbalance = check_covariate_balance(\n    data=experiment_df,\n    treatment_col=\"treatment\",\n    covariates=[\"age\", \"region\", \"tenure\"]\n)\nprint(f\"Balance: {balance['balance_flag'].mean():.1%}\")\n```\n\n**Post-experiment: Analysis**\n\n```python\nfrom experiment_utils import ExperimentAnalyzer\n\n# Full analysis pipeline\nanalyzer = ExperimentAnalyzer(\n    data=df,\n    treatment_col=\"treatment\",\n    outcomes=[\"primary_metric\", \"secondary_metric\"],\n    balance_covariates=[\"age\", \"region\"],\n    adjustment=\"balance\",\n    bootstrap=True,\n)\n\nanalyzer.get_effects()\nanalyzer.adjust_pvalues(method=\"holm\")\n\n# Report\nresults = analyzer.results\nprint(results[[\"outcome\", \"absolute_effect\", 
\"relative_effect\", \n               \"pvalue_mcp\", \"stat_significance_mcp\"]])\n```\n\n## Contributing\n\nContributions are welcome! Please feel free to submit a Pull Request.\n\n## License\n\nThis project is licensed under the MIT License.\n\n## Citation\n\nIf you use this package in your research, please cite:\n\n```bibtex\n@software{experiment_utils_pd,\n  title = {Experiment Utils PD: A Python Package for Experiment Analysis},\n  author = {Sebastian Daza},\n  year = {2026},\n  url = {https://github.com/sdaza/experiment-utils-pd}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsdaza%2Fexperiment-utils-pd","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fsdaza%2Fexperiment-utils-pd","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fsdaza%2Fexperiment-utils-pd/lists"}