https://github.com/jturner-uofl/pycorpdiff
Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference.
computational-social-science corpus-linguistics digital-humanities discourse-analysis keyness nlp python semantic-shift text-analysis
- Host: GitHub
- URL: https://github.com/jturner-uofl/pycorpdiff
- Owner: jturner-uofl
- License: mit
- Created: 2026-05-26T12:44:05.000Z (17 days ago)
- Default Branch: main
- Last Pushed: 2026-06-09T21:25:24.000Z (3 days ago)
- Last Synced: 2026-06-09T22:10:02.402Z (3 days ago)
- Topics: computational-social-science, corpus-linguistics, digital-humanities, discourse-analysis, keyness, nlp, python, semantic-shift, text-analysis
- Language: Python
- Homepage: https://pypi.org/project/pycorpdiff/
- Size: 47.4 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE
- Citation: CITATION.cff
README
# pycorpdiff
**Comparative corpus analysis for modern Python workflows.**
`pycorpdiff` is the **missing comparative layer** between R's
[`quanteda`](https://quanteda.io/), the closed-source SketchEngine
platform, and the fragmented Python NLP stack
(`nltk`/`spaCy`/`gensim`/`sentence-transformers`). Three public verbs
— `compare(a, b)`, `track(c, term)`, `compare.before_after(c, event)` —
consolidate keyness, collocations, dispersion, temporal trajectories,
changepoint detection, interrupted time series, causal-impact analysis,
forecasting, online changepoint detection, and embedding-based semantic
shift under a single notebook-native API. Keyness and collocation
results carry their own KWIC evidence: `.explain(term)` returns the
source-text concordances behind any ranked term.
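For readers new to KWIC (keyword in context), the idea behind a concordance can be sketched in a few lines of standard-library Python. This is an illustration of the general technique only: the `kwic` function, its regex tokenisation, and the `width` parameter are assumptions made for this example, not the internals of `.explain()`.

```python
import re


def kwic(text: str, term: str, width: int = 3) -> list[tuple[str, str, str]]:
    """Return (left context, keyword, right context) for each hit of `term`."""
    tokens = re.findall(r"\w+", text.lower())  # naive tokenisation, illustration only
    target = term.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


sample = "The criminal gangs were blamed, but no criminal charge followed."
for left, keyword, right in kwic(sample, "criminal", width=2):
    # Right-align the left context so the keywords line up in a column.
    print(f"{left:>20} [{keyword}] {right}")
```

Lining the keyword up in a column is what makes a concordance quick to scan: the eye can read down the hits and see the recurring neighbours on either side.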
The package answers the questions that corpus linguistics, digital
humanities, and computational social science routinely ask:
- *How does corpus A differ from corpus B?* — `compare(a, b).keyness()`
- *How has discourse around X evolved over time?* — `track(c, "x").over_time()`
- *What did "migrant" mean in 2005 vs 2023?* — `compare(...).semantic_shift("migrant", embedder=...)`
- *Did this event actually shift the conversation?* — `track(...).causal_impact(event_date=...)`
- *Where is the discourse heading?* — `track(...).forecast(horizon=4)`
`pycorpdiff` is positioned as **orchestration**, not reinvention.
Tokenizers (`spaCy`, `Stanza`, `jieba`, `fugashi`) and embedders (any
`SBERT`-compatible model) plug in via two `typing.Protocol` extension
points — one-line adapters, no plugin registry. The base install's
direct runtime dependencies are `numpy`, `pandas`, `scipy`, and
`pyarrow`; everything else is opt-in via extras.
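As a rough illustration of how a `typing.Protocol` extension point works in general, the sketch below defines a structural tokenizer interface and a one-class adapter that satisfies it without inheriting from anything. The names `Tokenizer`, `WhitespaceTokenizer`, and `count_tokens` are hypothetical and chosen for this example; the actual protocol definitions in `pycorpdiff` may differ.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class Tokenizer(Protocol):
    """Hypothetical protocol: any object callable on a string that returns tokens."""

    def __call__(self, text: str) -> list[str]: ...


class WhitespaceTokenizer:
    """Toy adapter. A spaCy or jieba wrapper would have the same shape."""

    def __call__(self, text: str) -> list[str]:
        return text.lower().split()


def count_tokens(text: str, tokenizer: Tokenizer) -> int:
    # Accepts anything that structurally matches Tokenizer; no registration step.
    return len(tokenizer(text))


tok = WhitespaceTokenizer()
print(isinstance(tok, Tokenizer))                   # True
print(count_tokens("Plug in any tokenizer", tok))   # 4
```

Because the match is structural, an adapter never needs to import or subclass anything from the host package, which is what keeps such adapters down to a line or two.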
> **Status: alpha (0.1.0a32).** Public API is stable for the features
> described below; on PyPI as `pip install pycorpdiff`. Alpha releases
> are intentionally rapid (audit-driven), each shipping fixes and tests
> behind the published version; dependency pins will tighten at beta.
## The three-layer architecture
| Layer | Purpose | Key surface |
|---|---|---|
| **1 — Ingestion + `Corpus`** | get text in, slice it, hash it | `from_dataframe`, `read_csv`, `read_parquet`, `read_txt`, `read_duckdb`, `from_huggingface`, `fetch_hansard`, `Corpus.slice/by_time/__hash__/doc_term_counts(_sparse)/to_polars` |
| **2 — Pure math** | statistics with no I/O | `keyness.{log_likelihood,chi_squared,log_ratio,percent_diff,bayes_factor,permutation_pvalues,keyness_multi,juilland_d,benjamini_hochberg}`; `collocation.{logdice,pmi,t_score,mi_three,collocation_shift,cooccurrence_network}`; `semantic.{HashEmbedder,SBERTEmbedder,semantic_trajectory,neighborhood_drift,induce_senses,sense_drift}`; `temporal.{changepoints,interrupted_time_series,forecast,causal_impact,bocpd}` |
| **3 — Verbs + Results** | public API | `compare`, `track`, `compare.before_after`, `keyness_multi`, plus 9 frozen-dataclass Result types each implementing the relevant subset of `.to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json()` |
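To make the "pure math, no I/O" idea of Layer 2 concrete, here is a minimal standard-library sketch of the Benjamini-Hochberg adjustment, one of the procedures named in the table. The signature is an assumption made for illustration; the library's own `benjamini_hochberg` may take different arguments or return a different type.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return BH-adjusted p-values (q-values) in the original input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value can be capped by the
    # q-values of all larger p-values (this enforces monotonicity).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


q_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
print([round(q, 4) for q in q_values])
```

The function takes plain numbers and returns plain numbers, so it can be unit-tested against hand-worked values without building a corpus first.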
## Quick start
```bash
pip install "pycorpdiff[viz]"
```
```python
import pycorpdiff as pcd
# Bundled synthetic Hansard-style sample — runs offline, no data download.
corpus = pcd.load_hansard_sample()
immigration = corpus.slice(topic="immigration")
# Which words separate the humanising and criminalising frames?
keyness = pcd.compare(
immigration.slice(frame="humanising"),
immigration.slice(frame="criminalising"),
).keyness(min_count=3)
keyness.plot() # volcano plot — picture the result
# keyness.table.head(10) # or look at the ranked table directly
# keyness.explain("criminal") # KWIC concordances showing the textual evidence
```
That's the core workflow in four steps: load a corpus, slice it,
compare two slices, plot the result. Every other analytical method
(collocation shifts, semantic drift, temporal trajectories, changepoint
detection, causal-impact analysis, forecasting, co-occurrence networks,
N-way keyness) follows the same shape. See
[the showcase notebook](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)
for the full feature tour, or the cheat sheet below for one-line API previews.
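For intuition about what a keyness score measures, the sketch below computes the two-term log-likelihood in the form used by Rayson's LL Wizard, plus a simple log ratio as an effect size. It is a hand-rolled standard-library illustration under stated assumptions: the function names and argument order are invented for this example, and zero counts in the log ratio are not smoothed here.

```python
import math


def rayson_ll(a: int, b: int, c: int, d: int) -> float:
    """Two-term log-likelihood for a term seen `a` times in corpus A (size `c`)
    and `b` times in corpus B (size `d`)."""
    e1 = c * (a + b) / (c + d)  # expected count in A if the corpora did not differ
    e2 = d * (a + b) / (c + d)  # expected count in B if the corpora did not differ
    total = 0.0
    for observed, expected in ((a, e1), (b, e2)):
        if observed > 0:  # the 0 * ln(0) term is taken as 0
            total += observed * math.log(observed / expected)
    return 2 * total


def log_ratio(a: int, b: int, c: int, d: int) -> float:
    """Binary log of the relative-frequency ratio (assumes a > 0 and b > 0)."""
    return math.log2((a / c) / (b / d))


# A term occurring 30 times in 1,000 tokens of A and 10 times in 1,000 tokens of B.
ll = rayson_ll(30, 10, 1000, 1000)
lr = log_ratio(30, 10, 1000, 1000)
print(round(ll, 3), round(lr, 3))
```

The two numbers answer different questions: the log-likelihood says how surprising the difference is given the counts, while the log ratio says how large the difference is (here the term is three times as frequent in A, about 1.58 doublings).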
### Cheat sheet — every analytical surface in one block
```python
# Compare verbs (return Result objects; methods exposed vary by Result)
pcd.compare(a, b).keyness() # default formula="rayson" (LL Wizard)
pcd.compare(a, b).keyness(formula="dunning") # full 4-cell G² (Dunning 1993; same family as quanteda / NLTK, edge-case tolerance not certified)
pcd.compare(a, b).keyness(ci="bootstrap", n_boot=999) # adds g2_ci_lower / g2_ci_upper columns
pcd.compare(a, b).collocation_shift("immigrant")
pcd.compare(a, b).semantic_shift("immigrant", embedder=pcd.SBERTEmbedder()) # [semantic]
# SBERTEmbedder downloads a sentence-transformers model on first call;
# use pcd.HashEmbedder() for offline / deterministic-test settings.
# Reference-baseline keyness (bundled or user-built)
pcd.against_baseline(corpus, "gutenberg_fiction") # vs bundled 19th-c. fiction baseline
pcd.against_baseline(corpus, pcd.baseline_from_corpus(reference_corpus)) # vs your own reference
# Sub-corpus balancing — Coarsened Exact Matching before keyness
m = pcd.match(a, b, on=["year", "party"], seed=0) # balances A and B on covariates
pcd.compare(m.a_matched, m.b_matched).keyness() # like-for-like comparison
# Lexical diversity (TTR, MATTR, MTLD, HD-D) — pooled and over time
pcd.lexical_diversity(corpus) # pooled corpus-level values
pcd.lexical_diversity(corpus, freq="Y", ci="bootstrap", n_boot=199) # per-year trajectory + CIs
# Track over time (requires [temporal] for the changepoint + ITS + forecast + causal_impact methods).
# Note: ITS / causal_impact require sufficient pre/post-event periods to fit (min_pre_periods=15,
# min_post_periods=8 by default); the bundled Hansard sample is too small to exercise these
# lines literally -- they are shown here as API previews. See examples/jss_case_study.ipynb
# for a full-corpus run.
tr = pcd.track(corpus, "immigrant").over_time(freq="Y")
tr.changepoints() # offline PELT
tr.changepoints_online(hazard=1/24) # Bayesian online (Adams & MacKay 2007)
tr.burstiness() # Kleinberg 1999 multi-state HMM — burst-intensity states
# tr.interrupted_time_series(event_date="2016") # segmented OLS [needs >=15 pre-periods]
# tr.causal_impact(event_date="2016") # Bayesian counterfactual (Brodersen 2015) [needs >=15 pre-periods]
tr.forecast(horizon=4) # 4 periods at the over_time freq (state-space ETS)
# Before / after a known event
pcd.compare.before_after(corpus, event_date="2016-06-23").keyness()
# N-way (≥ 2 corpora) — the four corpora `a, b, c, d` are illustrative placeholders
# (the cheat sheet's `a, b` from the keyness lines above; you supply `c, d`).
# pcd.keyness_multi([a, b, c, d], labels=["A", "B", "C", "D"])
# The discourse as a graph
pcd.cooccurrence_network(corpus, top_n=30).plot()
# Word-sense induction (BYO embeddings) — audit a hand-built sense classifier
# with an unsupervised second opinion. [semantic] extra for scikit-learn.
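# `df` is a pandas DataFrame with a "text" column (plus "regex_label" and "year" for the lines below).
X = pcd.SBERTEmbedder().encode(df["text"].tolist()) # or any (n, d) matrix you own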
senses = pcd.induce_senses(df, X, k=3) # k=None -> silhouette-selected
senses.agreement_with(df["regex_label"]).summary() # ARI / V-measure vs your buckets
senses.leakage_audit(df["regex_label"], k=20) # records whose geometry disputes the label
senses.share_over_time(freq="Y") # computed sense-fraction trajectory
# Sense drift over time — detect *and explain* when a corpus's sense
# distribution changes (concept-drift + lexical-semantic-change methods).
drift = pcd.sense_drift(df, X, time_col="year", reference=range(2000, 2010), k=3)
drift.summary() # onset, change type, distinctive terms
drift.change_type # "emergence" | "frequency_shift" | "broadening"
drift.drift_terms # what drove the drift (log-ratio vs reference)
drift.plot() # margin density + JSD over time, drift flagged
# For inference, calibrate the flag threshold against a label-shuffle null
# (removes the out-of-sample bias of the in-sample chart) + get a p-value:
drift = pcd.sense_drift(df, X, time_col="year", reference=range(2000, 2010),
k=3, n_permutations=50)
drift.p_value # permutation p (max margin vs shuffled null)
# The fall-off hunt (mirror of emergence): which senses *decline*, and is it
# obsolescence (absolute count falls) or just dilution (share falls, count holds)?
drift.decline_report() # per sense: verdict + early/late share & count + terms
drift.sense_trajectories() # per-period per-sense count + share (plot-ready)
# Three story-carrying charts (the [viz] extra):
drift.plot() # margin density + calibrated threshold + p-value
drift.plot_composition() # stacked-area sense share over time — the takeover, seen
drift.plot_decline() # slopegraph early→late, coloured obsolescence/dilution/rising
```
See [`examples/pycorpdiff_showcase.ipynb`](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)
for a walkthrough on the synthetic Hansard-style corpus exercising
every analytical surface.
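Two of the collocation measures listed above, PMI and logDice, reduce to short formulas over co-occurrence counts. The sketch below is a standard-library illustration using the textbook definitions; the function names and argument order are assumptions for this example and need not match the library's `collocation` module.

```python
import math


def pmi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Pointwise mutual information: log2 of observed over expected co-occurrence."""
    return math.log2(f_xy * n / (f_x * f_y))


def logdice(f_xy: int, f_x: int, f_y: int) -> float:
    """logDice (Rychly 2008): 14 + log2 of the Dice coefficient."""
    return 14 + math.log2(2 * f_xy / (f_x + f_y))


# Node and collocate each occur 16 times and co-occur 8 times in 1,024 tokens.
print(pmi(8, 16, 16, 1024))   # 5.0
print(logdice(8, 16, 16))     # 13.0

# Doubling the corpus size with the same counts raises PMI but leaves logDice unchanged.
print(pmi(8, 16, 16, 2048))   # 6.0
```

The last line shows why logDice is often preferred when comparing collocations across corpora of different sizes: it depends only on the three frequencies, not on the total token count.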
## Installation
```bash
pip install pycorpdiff # lexical-comparative core (MIT)
pip install "pycorpdiff[viz]" # + altair / matplotlib / networkx
pip install "pycorpdiff[semantic]" # + sentence-transformers
pip install "pycorpdiff[temporal]" # + ruptures / statsmodels
pip install "pycorpdiff[notebooks]" # + jupyter / vl-convert
pip install "pycorpdiff[all]" # everything MIT-compatible
pip install "pycorpdiff[all,showcase]" # + pysofra (GPL-3.0-or-later) for the JAMA-style showcase
```
The base install's direct runtime dependencies are `numpy`, `pandas`,
`scipy`, and `pyarrow`; optional extras land per analytical layer so
you only pay for what you use. `[showcase]` is broken out separately
because `pysofra` is GPL-3.0-or-later — pure `pycorpdiff` use without
that extra remains MIT-only.
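One common way for a package to keep heavy dependencies opt-in is to import them lazily and fail with a message that names the extra to install. The sketch below shows that general pattern; the `require` helper is hypothetical and is not claimed to be how `pycorpdiff` implements its extras.

```python
import importlib
from types import ModuleType


def require(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency, or raise an error naming the extra to install."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is needed for this feature; "
            f'try: pip install "pycorpdiff[{extra}]"'
        ) from exc


# A stdlib module stands in for an optional dependency so the sketch runs anywhere.
json_module = require("json", extra="viz")
print(json_module.dumps({"ok": True}))
```

With this pattern the import cost and the install requirement are only incurred by the code path that actually needs the dependency.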
To work from source:
```bash
git clone https://github.com/jturner-uofl/pycorpdiff
cd pycorpdiff
pip install -e ".[dev]"
pytest -q
```
## Cross-validation receipts
The math is checked against standard tools by automated tests. The
fast tier runs on every push (matrix CI); the slow tier needs heavy
optional dependencies (NLTK, Scattertext, Stanford SNAP downloads)
and runs on main pushes only.
Fast tier:
- **Rayson's LL Wizard** — hand-derived contingency-table reference
triples ([`tests/integration/test_crossval_rayson.py`](https://github.com/jturner-uofl/pycorpdiff/blob/main/tests/integration/test_crossval_rayson.py))
Slow tier:
- **NLTK** `BigramAssocMeasures` — PMI + t-score agreement to ≤ 1e-12
on every adjacent bigram
- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012
US Conventions corpus
- **HistWords (Hamilton et al. 2016)** — known-shifter / stable-word
sanity check on Stanford SNAP COHA decade embeddings (skips
gracefully if the archive isn't reachable)
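The shape of such an agreement check can be sketched without any external dependency. In the example below, two algebraically equivalent PMI formulations stand in for "our implementation" and "the reference tool", and the test asserts agreement within 1e-12 on every adjacent bigram of a toy text. The function names and the toy text are assumptions for illustration, and using the token count as the total `n` is a simplification; the real tests compare against NLTK itself.

```python
import math
from collections import Counter

tokens = "the cat sat on the mat and the cat ran".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))  # adjacent pairs only
n = len(tokens)


def pmi_from_counts(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """PMI written as a single ratio of counts."""
    return math.log2(f_xy * n / (f_x * f_y))


def pmi_from_probs(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """The same quantity written as a difference of log probabilities."""
    return math.log2(f_xy / n) - math.log2(f_x / n) - math.log2(f_y / n)


# Largest disagreement between the two formulations over every observed bigram.
max_gap = max(
    abs(
        pmi_from_counts(f_xy, unigrams[x], unigrams[y], n)
        - pmi_from_probs(f_xy, unigrams[x], unigrams[y], n)
    )
    for (x, y), f_xy in bigrams.items()
)
assert max_gap <= 1e-12
print(f"max gap over {len(bigrams)} bigrams: {max_gap:.2e}")
```

Checking every bigram, rather than a handful of hand-picked ones, is what catches edge cases such as hapax pairs or very frequent function words.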
## Citation
If you use `pycorpdiff` in academic work, please cite the software via
the `CITATION.cff` file in this repository — GitHub renders a "Cite this
repository" widget directly from it.
## License
MIT — see [LICENSE](https://github.com/jturner-uofl/pycorpdiff/blob/main/LICENSE).
## Case studies and demos (rendered)
GitHub's in-browser notebook renderer is unreliable on larger notebooks
with embedded SVG outputs. The links below point to the **pre-rendered
HTML artefacts** (the canonical read versions) and to nbviewer fallbacks
for the `.ipynb` source. Notebook sources still live under `examples/`
for re-execution.
- **Asylum case study: lexicalising asylum in UK Parliament, 2010-2023.**
[📊 rendered HTML](https://raw.githack.com/jturner-uofl/pycorpdiff/main/docs/rendered/jss_case_study.html)
· [nbviewer](https://nbviewer.org/github/jturner-uofl/pycorpdiff/blob/main/examples/jss_case_study.ipynb)
· [.ipynb source](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/jss_case_study.ipynb)
- **Full feature tour (showcase).**
[📊 rendered HTML](https://raw.githack.com/jturner-uofl/pycorpdiff/main/docs/rendered/pycorpdiff_showcase.html)
· [nbviewer](https://nbviewer.org/github/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)
· [.ipynb source](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)
- **Tutorial.**
[📊 rendered HTML](https://raw.githack.com/jturner-uofl/pycorpdiff/main/docs/rendered/pycorpdiff_tutorial.html)
· [.ipynb source](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_tutorial.ipynb)
- **Hansard demo.**
[📊 rendered HTML](https://raw.githack.com/jturner-uofl/pycorpdiff/main/docs/rendered/hansard_demo.html)
· [.ipynb source](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/hansard_demo.ipynb)
## Further reading
- [`docs/design.md`](https://github.com/jturner-uofl/pycorpdiff/blob/main/docs/design.md) — three-layer architecture
- [`docs/statistical-methods.md`](https://github.com/jturner-uofl/pycorpdiff/blob/main/docs/statistical-methods.md) — every metric's formula + citation
- [`docs/rendered/`](https://github.com/jturner-uofl/pycorpdiff/tree/main/docs/rendered) — catalogue of static HTML renders for offline viewing