https://github.com/twangodev/readback
- Host: GitHub
- URL: https://github.com/twangodev/readback
- Owner: twangodev
- Created: 2026-06-01T08:41:26.000Z (15 days ago)
- Default Branch: main
- Last Pushed: 2026-06-04T19:55:59.000Z (11 days ago)
- Last Synced: 2026-06-04T21:38:32.920Z (11 days ago)
- Language: Python
- Size: 223 KB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# readback
Ensemble ASR pseudo-labeling for Air Traffic Control.
Labels [`tartanaviation-atc-adsb-utterances`](https://huggingface.co/datasets/twangodev/tartanaviation-atc-adsb-utterances)
and publishes [`tartanaviation-atc-labels`](https://huggingface.co/datasets/twangodev/tartanaviation-atc-labels):
531k rows, each with one transcript and one confidence score, aligned 1:1 with the source rows.
## Pipeline
| stage | does |
|---|---|
| `infer` | three ASR models over the source shards |
| `fuse` | weighted ROVER + ADS-B callsign snap |
| `serve` | review studio (optional) |
| `publish` | upload-ready parquet shards + card |
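As a rough illustration of the weighted voting at the heart of `fuse`, the sketch below picks the highest-weight token in each slot. It is not the repository's implementation: it assumes the hypotheses have already been aligned into equal-length token lists (real ROVER builds that alignment itself), and the `weighted_vote` helper and the `<eps>` gap marker are hypothetical names.

```python
from collections import defaultdict

def weighted_vote(aligned, weights, null="<eps>"):
    """Pick the highest-weight token in each aligned slot.

    aligned: one token list per voter, all the same length (assumed
             to be pre-aligned; gaps are marked with `null`).
    weights: one weight per voter, in the same order.
    """
    if len(aligned) != len(weights):
        raise ValueError("need exactly one weight per voter")
    if len({len(h) for h in aligned}) > 1:
        raise ValueError("hypotheses must be aligned to the same length")

    fused = []
    for slot in zip(*aligned):
        scores = defaultdict(float)
        for token, weight in zip(slot, weights):
            scores[token] += weight
        # Ties go to the token seen first, i.e. the earliest voter.
        best = max(scores, key=scores.get)
        if best != null:  # a winning gap means "emit nothing here"
            fused.append(best)
    return fused

# Three voters weighted 1, 1, 2, mirroring `--weights 1,1,2`.
hyps = [
    ["delta", "one", "two", "three", "<eps>", "<eps>"],
    ["delta", "one", "to", "tree", "heavy", "<eps>"],
    ["delta", "one", "two", "three", "heavy", "<eps>"],
]
fused = weighted_vote(hyps, [1, 1, 2])
```

With these weights the third voter can outvote either of the others alone, but not both together.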
```bash
uv sync
uv run readback infer --config configs/models.example.toml --run data/run
uv run readback fuse --run data/run --voters parakeet-v2,canary-qwen,whisper-atc --weights 1,1,2 --advisory rasr-v1
uv run readback serve --run data/run
uv run readback publish --run data/run --out out/atc-labels
hf upload twangodev/tartanaviation-atc-labels out/atc-labels . --repo-type dataset
```
The pipeline is shard-resumable. `confidence` is a ranking score, not a calibrated probability.
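The ADS-B callsign snap in `fuse` can be pictured as fuzzy-matching a transcribed callsign against the callsigns seen on ADS-B around the same time. The sketch below is an assumption about the general idea, not the repository's code: `snap_callsign`, the `cutoff` value, and the sample callsigns are all hypothetical, and it uses only the standard library's `difflib`.

```python
import difflib

def snap_callsign(heard, candidates, cutoff=0.8):
    """Return the closest ADS-B candidate, or `heard` unchanged if none is close.

    heard:      callsign as transcribed by ASR (hypothetical input format).
    candidates: callsigns observed on ADS-B near the utterance.
    cutoff:     minimum difflib similarity ratio required to snap.
    """
    pool = [c.upper() for c in candidates]
    matches = difflib.get_close_matches(heard.upper(), pool, n=1, cutoff=cutoff)
    return matches[0] if matches else heard

nearby = ["DAL1234", "UAL88"]
snapped = snap_callsign("DAL1233", nearby)    # one digit off, close enough to snap
unchanged = snap_callsign("N12345", nearby)   # nothing similar, left as heard
```

A high cutoff keeps the snap conservative: a callsign that matches nothing nearby is left alone rather than forced onto the wrong aircraft.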
## Use
```python
from datasets import load_dataset, concatenate_datasets
src = load_dataset("twangodev/tartanaviation-atc-adsb-utterances", split="train")
lab = load_dataset("twangodev/tartanaviation-atc-labels", split="train")
joined = concatenate_datasets([src, lab], axis=1) # 1:1, same order
```
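Since `confidence` ranks rows rather than giving a calibrated probability, one reasonable way to filter is by rank (keep the top fraction) instead of by a fixed threshold. The sketch below works on plain dicts so it runs without downloading anything; `top_fraction` and the `transcript` field in the sample rows are hypothetical names, and only `confidence` comes from the README.

```python
def top_fraction(rows, frac, key="confidence"):
    """Keep the highest-ranked `frac` of rows, ordered by `key` descending."""
    if not 0 < frac <= 1:
        raise ValueError("frac must be in (0, 1]")
    ranked = sorted(rows, key=lambda r: r[key], reverse=True)
    if not ranked:
        return []
    keep = max(1, int(len(ranked) * frac))  # always keep at least one row
    return ranked[:keep]

# Stand-in rows; in practice these would come from the labels dataset.
rows = [
    {"transcript": "a", "confidence": 0.2},
    {"transcript": "b", "confidence": 0.9},
    {"transcript": "c", "confidence": 0.5},
    {"transcript": "d", "confidence": 0.7},
]
best_half = top_fraction(rows, 0.5)
```

Selecting by rank only depends on the ordering of scores, which is the property the README says `confidence` has.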