https://github.com/finite-sample/onlinerake

Online raking with SGD or MWU

Host: GitHub
URL: https://github.com/finite-sample/onlinerake
Owner: finite-sample
Created: 2025-07-19T07:15:17.000Z (11 months ago)
Default Branch: main
Last Pushed: 2025-07-26T19:02:32.000Z (11 months ago)
Last Synced: 2025-07-26T23:31:44.421Z (11 months ago)
Language: Python
Size: 280 KB
Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

## onlinerake: Streaming Survey Raking Via MWU and SGD

[![PyPI version](https://img.shields.io/pypi/v/onlinerake.svg)](https://pypi.org/project/onlinerake/)

[![PyPI Downloads](https://static.pepy.tech/badge/onlinerake)](https://pepy.tech/projects/onlinerake)

Modern online surveys and passive data collection streams generate
responses one record at a time.  Classic weighting methods such as
iterative proportional fitting (IPF, or “raking”) and calibration
weighting are inherently *batch* procedures: they reprocess the entire
dataset whenever a new case arrives.  The `onlinerake` package
provides **incremental**, per-observation updates to survey weights so
that weighted margins track known population totals in real time.

The package implements two complementary algorithms:

* **SGD raking** – an additive update that performs stochastic
  gradient descent on a squared-error loss over the margins.  It
  produces smooth weight trajectories and maintains high effective
  sample size (ESS).
* **MWU raking** – a multiplicative update inspired by the
  multiplicative-weights update rule.  It corresponds to mirror
  descent under the Kullback–Leibler divergence and yields weight
  distributions reminiscent of classic IPF.  However, it can produce
  heavier tails when the learning rate is large.
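The contrast between the additive and multiplicative rules can be sketched in a few lines of NumPy. This is a batch-style illustration and not the package's implementation: the helper names (`weighted_margins`, `margin_loss`, `sgd_step`, `mwu_step`), the clipping floor and the learning rates are all assumptions made for the example, and the real classes apply their updates incrementally as observations arrive.

```python
import numpy as np

def weighted_margins(X, w):
    """Weighted proportion of rows with indicator = 1, per column."""
    return X.T @ w / w.sum()

def margin_loss(X, w, targets):
    """Squared-error loss between weighted margins and targets."""
    return float(np.sum((weighted_margins(X, w) - targets) ** 2))

def loss_gradient(X, w, targets):
    """Gradient of margin_loss with respect to each weight."""
    m = weighted_margins(X, w)
    # d m_j / d w_i = (x_ij - m_j) / sum(w); chain rule through the squared error.
    return (X - m) @ (2.0 * (m - targets)) / w.sum()

def sgd_step(X, w, targets, learning_rate):
    """Additive update; the clip floor (an assumption) keeps weights positive."""
    return np.clip(w - learning_rate * loss_gradient(X, w, targets), 1e-3, None)

def mwu_step(X, w, targets, learning_rate):
    """Multiplicative update; weights stay positive by construction."""
    return w * np.exp(-learning_rate * loss_gradient(X, w, targets))

rng = np.random.default_rng(0)
targets = np.array([0.5, 0.4])
# Hypothetical biased sample: both indicators are over-represented.
X = (rng.random((200, 2)) < np.array([0.7, 0.6])).astype(float)

w_sgd = np.ones(len(X))
w_mwu = np.ones(len(X))
initial_loss = margin_loss(X, w_sgd, targets)
for _ in range(50):
    w_sgd = sgd_step(X, w_sgd, targets, learning_rate=5.0)
    w_mwu = mwu_step(X, w_mwu, targets, learning_rate=1.0)

print("initial loss:", initial_loss)
print("SGD loss:    ", margin_loss(X, w_sgd, targets))
print("MWU loss:    ", margin_loss(X, w_mwu, targets))
```

Both rules follow the same gradient; they differ only in whether the step is subtracted from the weights or applied as an exponential factor, which is why the multiplicative version never needs clipping.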

Both methods share the same API: call `.partial_fit(obs)` for each
incoming observation and inspect properties such as `.margins`, `.loss`
and `.effective_sample_size` to monitor progress.
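Effective sample size is commonly measured with Kish's formula, the squared sum of the weights divided by the sum of squared weights. Whether `.effective_sample_size` uses exactly this definition is an assumption here; the sketch below only shows how the quantity behaves, with a hypothetical helper name `kish_ess`.

```python
import numpy as np

def kish_ess(weights):
    """Kish effective sample size: (sum of w)^2 / (sum of w^2)."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.sum(w ** 2)

print(kish_ess([1.0, 1.0, 1.0, 1.0]))  # equal weights: 4.0
print(kish_ess([4.0, 0.0, 0.0, 0.0]))  # all weight on one case: 1.0
```

Equal weights give an ESS equal to the number of cases, and the value falls as the weights become more uneven, so a drop in ESS signals that a few observations are carrying most of the adjustment.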

## Installation

Clone or download this repository and install in editable mode:

```bash
git clone https://github.com/finite-sample/onlinerake
cd onlinerake
pip install -e .
```

No external dependencies are required beyond `numpy` and `pandas`.

## Usage

```python
from onlinerake import OnlineRakingSGD, OnlineRakingMWU, Targets

# define target population margins (proportion of the population with indicator = 1)
targets = Targets(age=0.5, gender=0.5, education=0.4, region=0.3)

# instantiate a raker
raker = OnlineRakingSGD(targets, learning_rate=5.0)

# stream demographic observations
for obs in stream_of_dicts:
    raker.partial_fit(obs)
    print(raker.margins)  # current weighted margins

print("final effective sample size", raker.effective_sample_size)
```
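The loop above iterates over `stream_of_dicts`, which the snippet leaves undefined. A minimal sketch of a synthetic stream follows, assuming each observation is a dict of 0/1 indicators keyed by the same names passed to `Targets`; the helper `make_stream` and the sampling rates are hypothetical and chosen only so that the raw sample differs from the targets.

```python
import random

def make_stream(n, probs, seed=0):
    """Yield n dicts of 0/1 indicators, one per incoming respondent."""
    rng = random.Random(seed)
    for _ in range(n):
        yield {name: int(rng.random() < p) for name, p in probs.items()}

# Hypothetical sampling rates that deliberately miss the target margins.
sample_rates = {"age": 0.3, "gender": 0.4, "education": 0.6, "region": 0.5}
stream_of_dicts = list(make_stream(300, sample_rates))

print(stream_of_dicts[0])
```

In a live setting the list would be replaced by whatever iterator delivers responses as they arrive.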

To use the multiplicative-weights version, replace
`OnlineRakingSGD` with `OnlineRakingMWU` and adjust the
`learning_rate` (a typical default is `1.0`).  See the docstrings
for full parameter descriptions.

## Simulation results

To understand the behaviour of the two update rules we simulated
three typical non-stationary bias patterns: a **linear drift** in
demographic composition, a **sudden shift** halfway through the stream,
and an **oscillation** around the target frame.  For each scenario we
generated 300 observations per seed and averaged results over five
random seeds.  SGD used a learning rate of 5.0 and MWU used a
learning rate of 1.0 with three update steps per observation.  The
table below summarises the mean improvement in absolute margin error
relative to the unweighted baseline (positive values indicate an
improvement), the final effective sample size (ESS) and the mean final
loss (squared-error on margins).  Higher ESS and larger improvements
are better.

| Scenario | Method | Age Imp (%) | Gender Imp (%) | Education Imp (%) | Region Imp (%) | Overall Imp (%) | Final ESS | Final Loss |
|---------|--------|-------------|---------------|------------------|---------------|----------------|---------:|-----------:|
| linear | SGD | 82.8 | 78.6 | 76.8 | 67.5 | 77.0 | 251.8 | 0.00147 |
| linear | MWU | 57.2 | 53.6 | 46.9 | 34.6 | 48.8 | 240.9 | 0.00676 |
| sudden | SGD | 82.9 | 82.3 | 79.6 | 63.5 | 79.5 | 225.5 | 0.00102 |
| sudden | MWU | 52.6 | 51.2 | 46.3 | 26.3 | 47.3 | 175.9 | 0.01235 |
| oscillating | SGD | 69.7 | 78.5 | 65.6 | 72.0 | 72.2 | 278.7 | 0.00023 |
| oscillating | MWU | 49.6 | 57.3 | 48.3 | 50.1 | 52.0 | 276.0 | 0.00048 |

**Interpretation**

* In all scenarios the online rakers dramatically reduce the margin
  errors relative to the unweighted baseline.  For example, in the
  sudden-shift scenario the SGD raker reduces the average age error
  from 0.20 to about 0.03 (an 83% improvement).
* The SGD update consistently yields *higher* improvements and lower
  final loss than the MWU update, albeit at the cost of choosing a
  more aggressive learning rate (5.0 versus 1.0).
* The MWU update, while less accurate in these settings, maintains
  comparable effective sample sizes in the linear and oscillating
  scenarios (240.9 vs. 251.8 and 276.0 vs. 278.7), though its ESS is
  noticeably lower after a sudden shift (175.9 vs. 225.5).  It might be
  preferable when multiplicative adjustments are desired (e.g., when
  starting from unequal base weights).
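The improvement percentages can be checked by hand. The sketch below assumes the metric is the relative reduction in absolute margin error, as the table description states; the function names are hypothetical. It recovers the error implied by the sudden-shift SGD age figure of 82.9% from a baseline error of 0.20.

```python
def improvement_pct(baseline_error, weighted_error):
    """Relative reduction in absolute margin error, in percent."""
    return 100.0 * (baseline_error - weighted_error) / baseline_error

def implied_error(baseline_error, improvement):
    """Weighted error implied by a baseline error and an improvement in percent."""
    return baseline_error * (1.0 - improvement / 100.0)

print(round(implied_error(0.20, 82.9), 3))  # 0.034
```

An implied error of roughly 0.034 is consistent with the "about 0.03" quoted for the sudden-shift scenario.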

You can reproduce these results or design new experiments by running

```bash
python -m onlinerake.simulation
```

from the repository root.  See the source of
`onlinerake/simulation.py` for details.
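When designing a new experiment, the three bias patterns described in this section can be expressed as time-varying sampling probabilities. The sketch below is one possible reading of those patterns and is not taken from `onlinerake/simulation.py`: the function name `bias_probability`, the `start` and `end` values and the number of oscillation cycles are all hypothetical.

```python
import math

def bias_probability(pattern, t, n, start=0.3, end=0.7, cycles=3):
    """Probability that indicator = 1 for observation t of n under a bias pattern."""
    frac = t / max(n - 1, 1)  # position in the stream, from 0.0 to 1.0
    if pattern == "linear":
        # Composition drifts steadily from start to end.
        return start + (end - start) * frac
    if pattern == "sudden":
        # Composition jumps from start to end halfway through the stream.
        return start if t < n // 2 else end
    if pattern == "oscillating":
        # Composition swings around the midpoint between start and end.
        mid = (start + end) / 2
        amplitude = (end - start) / 2
        return mid + amplitude * math.sin(2 * math.pi * cycles * frac)
    raise ValueError(f"unknown pattern: {pattern!r}")

for pattern in ("linear", "sudden", "oscillating"):
    print(pattern, [round(bias_probability(pattern, t, 300), 2) for t in (0, 149, 150, 299)])
```

Drawing each indicator as a Bernoulli variable with this probability gives a stream whose raw margins move over time, which is the situation the online rakers are meant to correct.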

## Contributing

Pull requests are welcome!  Feel free to open issues if you find bugs
or have suggestions for new features, such as support for multi-level
controls or adaptive learning-rate schedules.
