{"id":50203477,"url":"https://github.com/jturner-uofl/pycorpdiff","last_synced_at":"2026-06-10T00:00:42.151Z","repository":{"id":362111157,"uuid":"1250239417","full_name":"jturner-uofl/pycorpdiff","owner":"jturner-uofl","description":"Comparative corpus analysis for Python: keyness, collocations, semantic shift, temporal trajectories with changepoints + causal inference.","archived":false,"fork":false,"pushed_at":"2026-06-09T21:25:24.000Z","size":49743,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-06-09T22:10:02.402Z","etag":null,"topics":["computational-social-science","corpus-linguistics","digital-humanities","discourse-analysis","keyness","nlp","python","semantic-shift","text-analysis"],"latest_commit_sha":null,"homepage":"https://pypi.org/project/pycorpdiff/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jturner-uofl.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-05-26T12:44:05.000Z","updated_at":"2026-06-09T21:25:28.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/jturner-uofl/pycorpdiff","commit_stats":null,"previous_names":["jturner-uofl/pycorpdiff"],"tags_count":14,"template":false,"template_full_name":null,"purl":"pkg:github/jturner-uofl/pycorpdiff","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jturner-uofl%2Fpycorpdiff","ta
gs_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jturner-uofl%2Fpycorpdiff/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jturner-uofl%2Fpycorpdiff/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jturner-uofl%2Fpycorpdiff/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jturner-uofl","download_url":"https://codeload.github.com/jturner-uofl/pycorpdiff/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jturner-uofl%2Fpycorpdiff/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34130642,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-09T02:00:06.510Z","response_time":63,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["computational-social-science","corpus-linguistics","digital-humanities","discourse-analysis","keyness","nlp","python","semantic-shift","text-analysis"],"created_at":"2026-05-26T00:01:29.560Z","updated_at":"2026-06-10T00:00:42.029Z","avatar_url":"https://github.com/jturner-uofl.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# pycorpdiff\n\n[![PyPI](https://img.shields.io/pypi/v/pycorpdiff.svg)](https://pypi.org/project/pycorpdiff/)\n[![Python 
versions](https://img.shields.io/pypi/pyversions/pycorpdiff.svg)](https://pypi.org/project/pycorpdiff/)\n[![CI](https://github.com/jturner-uofl/pycorpdiff/actions/workflows/ci.yml/badge.svg)](https://github.com/jturner-uofl/pycorpdiff/actions/workflows/ci.yml)\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)\n\n**Comparative corpus analysis for modern Python workflows.**\n\n`pycorpdiff` is the **missing comparative layer** between R's\n[`quanteda`](https://quanteda.io/), the closed-source SketchEngine\nplatform, and the fragmented Python NLP stack\n(`nltk`/`spaCy`/`gensim`/`sentence-transformers`). Three public verbs\n— `compare(a, b)`, `track(c, term)`, `compare.before_after(c, event)` —\nconsolidate keyness, collocations, dispersion, temporal trajectories,\nchangepoint detection, interrupted time series, causal-impact analysis,\nforecasting, online changepoint detection, and embedding-based semantic\nshift under a single notebook-native API. Keyness and collocation\nresults carry their own KWIC evidence: `.explain(term)` returns the\nsource-text concordances behind any ranked term.\n\nThe package answers the questions corpus linguistics, digital humanities,\nand computational social science routinely have:\n\n- *How does corpus A differ from corpus B?* — `compare(a, b).keyness()`\n- *How has discourse around X evolved over time?* — `track(c, \"x\").over_time()`\n- *What did \"migrant\" mean in 2005 vs 2023?* — `compare(...).semantic_shift(\"migrant\", embedder=...)`\n- *Did this event actually shift the conversation?* — `track(...).causal_impact(event_date=...)`\n- *Where is the discourse heading?* — `track(...).forecast(horizon=4)`\n\n`pycorpdiff` is positioned as **orchestration**, not reinvention.\nTokenizers (`spaCy`, `Stanza`, `jieba`, `fugashi`) and embedders (any\n`SBERT`-compatible model) plug in via two `typing.Protocol` extension\npoints — one-line adapters, no plugin registry. 
The base install's\ndirect runtime dependencies are `numpy`, `pandas`, `scipy`, and\n`pyarrow`; everything else is opt-in via extras.\n\n\u003e **Status: alpha (0.1.0a32).** Public API is stable for the features\n\u003e described below; on PyPI as `pip install pycorpdiff`. Alpha releases\n\u003e are intentionally rapid (audit-driven), each shipping fixes and tests\n\u003e behind the published version; dependency pins will tighten at beta.\n\n## The three-layer architecture\n\n| Layer | Purpose | Key surface |\n|---|---|---|\n| **1 — Ingestion + `Corpus`** | get text in, slice it, hash it | `from_dataframe`, `read_csv`, `read_parquet`, `read_txt`, `read_duckdb`, `from_huggingface`, `fetch_hansard`, `Corpus.slice/by_time/__hash__/doc_term_counts(_sparse)/to_polars` |\n| **2 — Pure math** | statistics with no I/O | `keyness.{log_likelihood,chi_squared,log_ratio,percent_diff,bayes_factor,permutation_pvalues,keyness_multi,juilland_d,benjamini_hochberg}`; `collocation.{logdice,pmi,t_score,mi_three,collocation_shift,cooccurrence_network}`; `semantic.{HashEmbedder,SBERTEmbedder,semantic_trajectory,neighborhood_drift,induce_senses,sense_drift}`; `temporal.{changepoints,interrupted_time_series,forecast,causal_impact,bocpd}` |\n| **3 — Verbs + Results** | public API | `compare`, `track`, `compare.before_after`, `keyness_multi`, plus 9 frozen-dataclass Result types each implementing the relevant subset of `.to_df() / .plot() / .explain() / .summary() / .to_html() / .to_json()` |\n\n## Quick start\n\n```bash\npip install \"pycorpdiff[viz]\"\n```\n\n```python\nimport pycorpdiff as pcd\n\n# Bundled synthetic Hansard-style sample — runs offline, no data download.\ncorpus = pcd.load_hansard_sample()\nimmigration = corpus.slice(topic=\"immigration\")\n\n# Which words separate the humanising and criminalising frames?\nkeyness = pcd.compare(\n    immigration.slice(frame=\"humanising\"),\n    immigration.slice(frame=\"criminalising\"),\n).keyness(min_count=3)\n\nkeyness.plot()          
      # volcano plot — picture the result\n# keyness.table.head(10)      # or look at the ranked table directly\n# keyness.explain(\"criminal\") # KWIC concordances showing the textual evidence\n```\n\nThat's the entire surface in five lines: load a corpus, slice it,\ncompare two slices, plot the result. Every other analytical method —\ncollocation shifts, semantic drift, temporal trajectories, changepoint\ndetection, causal-impact analysis, forecasting, co-occurrence networks,\nN-way keyness — follows the same shape. See\n[the showcase notebook](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)\nfor the full feature tour, or the cheat sheet below for one-line API previews.\n\n### Cheat sheet — every analytical surface in one block\n\n```python\n# Compare verbs (returns Result objects; methods exposed vary by Result)\npcd.compare(a, b).keyness()                                                   # default formula=\"rayson\" (LL Wizard)\npcd.compare(a, b).keyness(formula=\"dunning\")                                  # full 4-cell G² (Dunning 1993; same family as quanteda / NLTK, edge-case tolerance not certified)\npcd.compare(a, b).keyness(ci=\"bootstrap\", n_boot=999)                         # adds g2_ci_lower / g2_ci_upper columns\npcd.compare(a, b).collocation_shift(\"immigrant\")\npcd.compare(a, b).semantic_shift(\"immigrant\", embedder=pcd.SBERTEmbedder())   # [semantic]\n# SBERTEmbedder downloads a sentence-transformers model on first call;\n# use pcd.HashEmbedder() for offline / deterministic-test settings.\n\n# Reference-baseline keyness (bundled or user-built)\npcd.against_baseline(corpus, \"gutenberg_fiction\")                             # vs bundled 19th-c. 
fiction baseline\npcd.against_baseline(corpus, pcd.baseline_from_corpus(reference_corpus))      # vs your own reference\n\n# Sub-corpus balancing — Coarsened Exact Matching before keyness\nm = pcd.match(a, b, on=[\"year\", \"party\"], seed=0)                             # balances A and B on covariates\npcd.compare(m.a_matched, m.b_matched).keyness()                               # like-for-like comparison\n\n# Lexical diversity (TTR, MATTR, MTLD, HD-D) — pooled and over time\npcd.lexical_diversity(corpus)                                                 # pooled corpus-level values\npcd.lexical_diversity(corpus, freq=\"Y\", ci=\"bootstrap\", n_boot=199)           # per-year trajectory + CIs\n\n# Track over time (requires [temporal] for the changepoint + ITS + forecast + causal_impact methods).\n# Note: ITS / causal_impact require sufficient pre/post-event periods to fit (min_pre_periods=15,\n# min_post_periods=8 by default); the bundled Hansard sample is too small to exercise these\n# lines literally -- they are shown here as API previews. 
See examples/jss_case_study.ipynb\n# for a full-corpus run.\ntr = pcd.track(corpus, \"immigrant\").over_time(freq=\"Y\")\ntr.changepoints()                                  # offline PELT\ntr.changepoints_online(hazard=1/24)                # Bayesian online (Adams \u0026 MacKay 2007)\ntr.burstiness()                                    # Kleinberg 1999 multi-state HMM — burst-intensity states\n# tr.interrupted_time_series(event_date=\"2016\")    # segmented OLS [needs \u003e=15 pre-periods]\n# tr.causal_impact(event_date=\"2016\")              # Bayesian counterfactual (Brodersen 2015) [needs \u003e=15 pre-periods]\ntr.forecast(horizon=4)                             # 4 periods at the over_time freq (state-space ETS)\n\n# Before / after a known event\npcd.compare.before_after(corpus, event_date=\"2016-06-23\").keyness()\n\n# N-way (≥ 2 corpora) — the four corpora `a, b, c, d` are illustrative placeholders\n# (the cheat sheet's `a, b` from the keyness lines above; you supply `c, d`).\n# pcd.keyness_multi([a, b, c, d], labels=[\"A\", \"B\", \"C\", \"D\"])\n\n# The discourse as a graph\npcd.cooccurrence_network(corpus, top_n=30).plot()\n\n# Word-sense induction (BYO embeddings) — audit a hand-built sense classifier\n# with an unsupervised second opinion. 
[semantic] extra for scikit-learn.\nX = pcd.SBERTEmbedder().encode(df[\"text\"].tolist())        # or any (n, d) matrix you own\nsenses = pcd.induce_senses(df, X, k=3)                     # k=None -\u003e silhouette-selected\nsenses.agreement_with(df[\"regex_label\"]).summary()         # ARI / V-measure vs your buckets\nsenses.leakage_audit(df[\"regex_label\"], k=20)              # records whose geometry disputes the label\nsenses.share_over_time(freq=\"Y\")                           # computed sense-fraction trajectory\n\n# Sense drift over time — detect *and explain* when a corpus's sense\n# distribution changes (concept-drift + lexical-semantic-change methods).\ndrift = pcd.sense_drift(df, X, time_col=\"year\", reference=range(2000, 2010), k=3)\ndrift.summary()                                            # onset, change type, distinctive terms\ndrift.change_type                                          # \"emergence\" | \"frequency_shift\" | \"broadening\"\ndrift.drift_terms                                          # what drove the drift (log-ratio vs reference)\ndrift.plot()                                               # margin density + JSD over time, drift flagged\n\n# For inference, calibrate the flag threshold against a label-shuffle null\n# (removes the out-of-sample bias of the in-sample chart) + get a p-value:\ndrift = pcd.sense_drift(df, X, time_col=\"year\", reference=range(2000, 2010),\n                        k=3, n_permutations=50)\ndrift.p_value                                              # permutation p (max margin vs shuffled null)\n\n# The fall-off hunt (mirror of emergence): which senses *decline*, and is it\n# obsolescence (absolute count falls) or just dilution (share falls, count holds)?\ndrift.decline_report()                                     # per sense: verdict + early/late share \u0026 count + terms\ndrift.sense_trajectories()                                 # per-period per-sense count + share (plot-ready)\n\n# Three story-carrying 
charts (the [viz] extra):\ndrift.plot()                                               # margin density + calibrated threshold + p-value\ndrift.plot_composition()                                   # stacked-area sense share over time — the takeover, seen\ndrift.plot_decline()                                       # slopegraph early→late, coloured obsolescence/dilution/rising\n```\n\nSee [`examples/pycorpdiff_showcase.ipynb`](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)\nfor a walkthrough on the synthetic Hansard-style corpus exercising\nevery analytical surface.\n\n## Installation\n\n```bash\npip install pycorpdiff                       # lexical-comparative core (MIT)\npip install \"pycorpdiff[viz]\"                # + altair / matplotlib / networkx\npip install \"pycorpdiff[semantic]\"           # + sentence-transformers\npip install \"pycorpdiff[temporal]\"           # + ruptures / statsmodels\npip install \"pycorpdiff[notebooks]\"          # + jupyter / vl-convert\npip install \"pycorpdiff[all]\"                # everything MIT-compatible\npip install \"pycorpdiff[all,showcase]\"       # + pysofra (GPL-3.0-or-later) for the JAMA-style showcase\n```\n\nThe base install's direct runtime dependencies are `numpy`, `pandas`,\n`scipy`, and `pyarrow`; optional extras land per analytical layer so\nyou only pay for what you use. `[showcase]` is broken out separately\nbecause `pysofra` is GPL-3.0-or-later — pure `pycorpdiff` use without\nthat extra remains MIT-only.\n\nTo work from source:\n\n```bash\ngit clone https://github.com/jturner-uofl/pycorpdiff\ncd pycorpdiff\npip install -e \".[dev]\"\npytest -q\n```\n\n## Cross-validation receipts\n\nThe math is checked against standard tools by automated test. 
The\nfast tier runs on every push (matrix CI); the slow tier needs heavy\noptional dependencies (NLTK, Scattertext, Stanford SNAP downloads)\nand runs on main pushes only.\n\nFast tier:\n\n- **Rayson's LL Wizard** — hand-derived contingency-table reference\n  triples ([`tests/integration/test_crossval_rayson.py`](https://github.com/jturner-uofl/pycorpdiff/blob/main/tests/integration/test_crossval_rayson.py))\n\nSlow tier:\n\n- **NLTK** `BigramAssocMeasures` — PMI + t-score agreement to ≤ 1e-12\n  on every adjacent bigram\n- **Scattertext (Kessler 2017)** — behavioural agreement on the 2012\n  US Conventions corpus\n- **HistWords (Hamilton et al. 2016)** — known-shifter / stable-word\n  sanity check on Stanford SNAP COHA decade embeddings (skips\n  gracefully if the archive isn't reachable)\n\n## Citation\n\nIf you use `pycorpdiff` in academic work, please cite the software via\nthe `CITATION.cff` file in this repository — GitHub renders a \"Cite this\nrepository\" widget directly from it.\n\n## License\n\nMIT — see [LICENSE](https://github.com/jturner-uofl/pycorpdiff/blob/main/LICENSE).\n\n## Case studies and demos (rendered)\n\nGitHub's in-browser notebook renderer is unreliable on larger notebooks\nwith embedded SVG outputs. The links below point to the **pre-rendered\nHTML artefacts** (the canonical read versions) and to nbviewer fallbacks\nfor the `.ipynb` source. 
Notebook sources still live under `examples/`\nfor re-execution.\n\n- **Asylum case study — lexicalising asylum in UK Parliament, 2010-2023.**\n  [📊 rendered HTML](https://raw.githack.com/jturner-uofl/pycorpdiff/main/docs/rendered/jss_case_study.html)\n  · [nbviewer](https://nbviewer.org/github/jturner-uofl/pycorpdiff/blob/main/examples/jss_case_study.ipynb)\n  · [.ipynb source](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/jss_case_study.ipynb)\n- **Full feature tour (showcase).**\n  [📊 rendered HTML](https://raw.githack.com/jturner-uofl/pycorpdiff/main/docs/rendered/pycorpdiff_showcase.html)\n  · [nbviewer](https://nbviewer.org/github/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)\n  · [.ipynb source](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_showcase.ipynb)\n- **Tutorial.**\n  [📊 rendered HTML](https://raw.githack.com/jturner-uofl/pycorpdiff/main/docs/rendered/pycorpdiff_tutorial.html)\n  · [.ipynb source](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/pycorpdiff_tutorial.ipynb)\n- **Hansard demo.**\n  [📊 rendered HTML](https://raw.githack.com/jturner-uofl/pycorpdiff/main/docs/rendered/hansard_demo.html)\n  · [.ipynb source](https://github.com/jturner-uofl/pycorpdiff/blob/main/examples/hansard_demo.ipynb)\n\n## Further reading\n\n- [`docs/design.md`](https://github.com/jturner-uofl/pycorpdiff/blob/main/docs/design.md) — three-layer architecture\n- [`docs/statistical-methods.md`](https://github.com/jturner-uofl/pycorpdiff/blob/main/docs/statistical-methods.md) — every metric's formula + citation\n- [`docs/rendered/`](https://github.com/jturner-uofl/pycorpdiff/tree/main/docs/rendered) — catalogue of static HTML renders for offline 
viewing\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjturner-uofl%2Fpycorpdiff","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjturner-uofl%2Fpycorpdiff","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjturner-uofl%2Fpycorpdiff/lists"}