{"id":50375023,"url":"https://github.com/axanthos/jadt2026-typical-source-estimation","last_synced_at":"2026-05-30T09:02:37.584Z","repository":{"id":359173148,"uuid":"1244853018","full_name":"axanthos/jadt2026-typical-source-estimation","owner":"axanthos","description":"Reproduction package for typical-source estimation in imbalanced corpora (JADT 2026).","archived":false,"fork":false,"pushed_at":"2026-05-20T18:23:32.000Z","size":177,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-05-20T22:59:09.332Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/axanthos.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-20T16:59:18.000Z","updated_at":"2026-05-20T18:23:36.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/axanthos/jadt2026-typical-source-estimation","commit_stats":null,"previous_names":["axanthos/jadt2026-typical-source-estimation"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/axanthos/jadt2026-typical-source-estimation","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/axanthos%2Fjadt2026-typical-source-estimation","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/axanthos%2Fjadt2026-typical-source-estimation/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/axanthos%2Fjadt2026-typical-source-estimation/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/axanthos%2Fjadt2026-typical-source-estimation/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/axanthos","download_url":"https://codeload.github.com/axanthos/jadt2026-typical-source-estimation/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/axanthos%2Fjadt2026-typical-source-estimation/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33686018,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-30T02:00:06.278Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2026-05-30T09:02:36.791Z","updated_at":"2026-05-30T09:02:37.575Z","avatar_url":"https://github.com/axanthos.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# JADT 2026 typical-source estimation reproduction package\n\n[![DOI](https://zenodo.org/badge/1244853018.svg)](https://doi.org/10.5281/zenodo.20314184)\n\nThis repository is the paper-specific reproduction package for:\n\n\u003e Aris Xanthos. 2026. *Estimating the typical-source distribution in imbalanced corpora*. JADT 2026.\n\nIt contains code, scripts, documentation, and small toy inputs for reproducing\nthe simulation study and estimator-comparison tables/figures reported in the\npaper.\n\n## Status\n\nRelease-ready reproduction package with the source-count data model, the three\npaper estimators, total-variation distance, source-size summaries, preparation\nscripts for the What's New, Switzerland? (WNS) corpus, final real-data table\nscripts, seeded simulation scripts, and tests.\n\n## What this repository contains\n\n- Implementations of the population estimators discussed in the paper:\n  - `POOL`: pooled maximum-likelihood estimate;\n  - `UNIF`: uniform average over sources;\n  - `CAP`: capped source-mass estimator.\n- Core utilities for source-by-token count datasets.\n- Total-variation distance and source-size imbalance summaries.\n- Small synthetic toy inputs for examples and smoke tests.\n- Preparation scripts that regenerate paper-specific WNS TSV inputs locally for authorized users.\n- Emoji and lexical table-generation scripts consuming prepared TSV inputs.\n- Seeded simulation scripts that regenerate paper-facing simulation summaries and figures.\n- Documentation for data access, simulation provenance, and the reproduction workflow.\n\n## What this repository does not contain\n\nThis repository does **not** redistribute token-level data derived from WNS.\n\nThe dataset is available on demand for research purposes, under a restricted\nlicense contract, from the SWISSUbase repository (https://www.swissubase.ch).\nIn accordance with the corpus privacy commitments, WNS-derived token-level\ninputs used in the paper are not published here. Authorized WNS users can\nregenerate them locally with the preparation scripts provided in this\nrepository.\n\n## Repository layout\n\n```text\n.\n├── README.md\n├── LICENSE\n├── CITATION.cff\n├── pyproject.toml\n├── data/\n│   ├── README.md\n│   └── toy/\n├── docs/\n│   ├── data_access.md\n│   ├── reproduction_plan.md\n│   └── simulation_provenance.md\n├── configs/\n│   └── wns_jadt_preprocessing.ini\n├── scripts/\n│   ├── README.md\n│   ├── prepare_wns_posts_tsv.py\n│   ├── prepare_wns_emoji_tsv.py\n│   ├── prepare_wns_lexical_tsv.py\n│   ├── reproduce_emoji_table.py\n│   ├── reproduce_lexical_tables.py\n│   └── reproduce_simulation.py\n├── src/\n│   └── typical_source_estimation/\n└── tests/\n```\n\n## Installation\n\nThe package declares its runtime dependencies in `pyproject.toml`, including\n`numpy`, `pandas`, `matplotlib`, `emoji`, and `lxml`. The `dev` extra adds\n`pytest` for the test suite.\n\nWith `uv`, set up the development environment and run the tests with:\n\n```bash\nuv sync --extra dev\nuv run python -m pytest -q\n```\n\nWith standard `pip`, use an editable install with the development extra:\n\n```bash\npython -m venv .venv\nsource .venv/bin/activate  # Windows: .venv\\Scripts\\activate\npython -m pip install -e \".[dev]\"\npython -m pytest -q\n```\n\n## Minimal example\n\n```python\nfrom typical_source_estimation import load_sequence_tsv, pooled_mle, uniform_sources, capped_mass_alpha\n\n# Load a small source/sequence table.\nds = load_sequence_tsv(\"data/toy/toy_emoji.tsv\")\n\n# Compute the three paper estimators.\nq_pool = pooled_mle(ds).q_hat\nq_unif = uniform_sources(ds).q_hat\nq_cap = capped_mass_alpha(ds, alpha=1.0).q_hat\n```\n\n## Data access and reproduction\n\nThe real-data analyses in the paper use two TSV inputs derived from WNS:\n\n1. an emoji-sequence table;\n2. a lexical message-text table.\n\nFor public reproducibility, this repository provides:\n\n- conversion scripts that authorized users can run on a local copy of WNS;\n- estimator/table-generation scripts that consume the derived local TSV files;\n- toy inputs with the same column conventions.\n\nFor details, see:\n\n- [Data access](docs/data_access.md)\n- [Reproduction plan](docs/reproduction_plan.md)\n- [Simulation provenance](docs/simulation_provenance.md)\n\n## Citation\n\nPlease cite both the accompanying JADT paper and the archived software release.\nThe paper citation is, pending final proceedings metadata:\n\n\u003e Xanthos, Aris. 2026. *Estimating the typical-source distribution in imbalanced corpora*. JADT 2026.\n\nWhen using the WNS real-data analyses, also cite WNS through its official\nSWISSUbase/LaRS citation:\n\n\u003e Xanthos, A., Gupta, P., Benkais, L., Doudot, L., \u0026 Grütter, A. (2024). What's New, Switzerland? Corpus (Version 1.0.0) [Data set]. LaRS - Language Repository of Switzerland. https://doi.org/10.48656/pa3t-xh52\n\nA `CITATION.cff` file is included for software citation metadata.\n\n## License\n\nCode in this repository is released under the MIT License unless otherwise\nstated. See [LICENSE](LICENSE).\n\nNo license is granted here for WNS-derived token-level data, because such data\nare not redistributed in this repository.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faxanthos%2Fjadt2026-typical-source-estimation","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faxanthos%2Fjadt2026-typical-source-estimation","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faxanthos%2Fjadt2026-typical-source-estimation/lists"}