https://github.com/musharna/ghostcite

Deterministic, no-LLM CLI that catches ghost citations — when the author/year you cited doesn't match the DOI's CrossRef record (+ retraction flags).
Host: GitHub
URL: https://github.com/musharna/ghostcite
Owner: musharna
License: mit
Created: 2026-06-08T06:25:01.000Z (9 days ago)
Default Branch: main
Last Pushed: 2026-06-08T18:03:42.000Z (9 days ago)
Last Synced: 2026-06-08T20:06:14.755Z (9 days ago)
Topics: bibtex, citations, cli, crossref, doi, research-integrity
Language: Python
Homepage: https://pypi.org/project/ghostcite/
Size: 357 KB
Stars: 0
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Code of conduct: CODE_OF_CONDUCT.md
- Citation: CITATION.cff
- Security: SECURITY.md

README

# ghostcite



[![PyPI](https://img.shields.io/pypi/v/ghostcite.svg)](https://pypi.org/project/ghostcite/)

[![CI](https://github.com/musharna/ghostcite/actions/workflows/ci.yml/badge.svg)](https://github.com/musharna/ghostcite/actions/workflows/ci.yml)

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](LICENSE)

[![Python 3.9+](https://img.shields.io/badge/python-3.9%2B-blue.svg)](https://www.python.org/downloads/)

**Catch ghost citations — right DOI, wrong author.**



`ghostcite` is a deterministic, **no-LLM** command-line tool that cross-checks a
bibliography's _claimed_ author and year against CrossRef's canonical record for
each DOI. It catches the dominant ghost-citation failure mode — a reference whose
cited authorship doesn't match the paper the DOI actually points to — and flags
retracted or expression-of-concern works along the way.

## The problem

LLM-assisted writing (and plain copy-paste drift) routinely produces references
that _look_ right but attribute the cited DOI to the wrong authors or year. A
manuscript cites "Li et al. 2024," but DOI `10.3390/plants13060869` is actually
**Chen et al.** A reviewer catches it; an automated check catches it first.

> Does the metadata you wrote for this citation match what CrossRef says the DOI actually is?

No model, no API key, no download — just CrossRef's REST API and a comparison.

## Install

```bash
pip install ghostcite          # into the current environment
pipx install ghostcite         # isolated CLI install (recommended)
uv tool install ghostcite      # if you use uv
```

## Usage

```bash
ghostcite refs.bib                         # check a BibTeX file (or .md / DOI list)
ghostcite refs.bib --cross-check pubmed    # corroborate against PubMed
ghostcite refs.bib --json                  # machine-readable output (for CI)
ghostcite refs.bib --fail-on author,year,retraction   # tune the CI gate
cat refs.bib | ghostcite -                 # read from stdin
```

Input format is auto-detected (BibTeX, Markdown reference list, or bare DOI list);
override with `--format {auto,bibtex,markdown,doi}`.

**Real example** — `refs.bib` cites "Li (2024)" for a DOI CrossRef says is Chen:

```text
$ ghostcite refs.bib
ghostcite: 1 entries, 1 with DOIs
  ✗ A  L1  Li (2024)  →  DOI resolves to Chen (2024) — possibly wrong DOI  [10.3390/plants13060869]
  1 A
$ echo $?
1
```

### All flags & the anatomy of a finding

```text
  ✗ A   L1    Li (2024)        →  DOI resolves to Chen (2024)…   [10.3390/plants13060869]
  │ │   │     │                    │                               │
  │ │   │     │                    │                               └─ DOI that was checked
  │ │   │     │                    └─ what CrossRef actually records
  │ │   │     └─ what you cited (claimed first author + year)
  │ │   └─ source line in your bibliography
  │ └─ tier: A author · B year · C cosmetic · R retraction · U unresolvable
  └─ glyph: ✗ fails CI · ⚠ retraction · · informational
```

- **`--cross-check pubmed`** — adds PubMed/NCBI as a _second source of truth_.
  When PubMed backs CrossRef a finding is annotated `↳ corroborated by PubMed`;
  when PubMed instead agrees with what you _cited_, it's flagged as a CrossRef↔PubMed
  conflict (the tier is kept so you don't silently trust either source). PubMed can
  also _raise_ a finding CrossRef missed, or supply a record for a DOI absent from
  CrossRef. Optional `--ncbi-email` / `--ncbi-api-key` (or `NCBI_EMAIL` /
  `NCBI_API_KEY`) follow NCBI E-utilities etiquette and unlock a higher rate limit;
  neither is required.
- **`--max-rps`** — cap outbound requests per second. ghostcite already
  self-throttles to CrossRef's advertised rate limit (read from the response
  headers); `--max-rps` lets you be _more_ conservative (the stricter of the two wins).
- **`--color {auto,always,never}`** — colorize the tier glyphs. `auto` (default)
  colorizes only on a TTY. [`NO_COLOR`](https://no-color.org/) is honored and wins
  even over `always`. `--json` output is never colorized.
- **stdin (`-`)** — pass `-` as the filename to read from stdin, e.g.
  `cat refs.bib | ghostcite -` or `ghostcite - --format doi < dois.txt`.
- **`--dry-run`** — parse + classify + count only, no network.
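
The "stricter of the two wins" rule for `--max-rps` can be sketched as below. This is an illustration under assumptions, not ghostcite's actual code: the function names are hypothetical, and the sketch assumes the `X-Rate-Limit-Limit` and `X-Rate-Limit-Interval` headers that CrossRef documents, with the interval given in seconds (for example `1s`).

```python
from typing import Mapping, Optional


def parse_interval_seconds(value: str) -> float:
    """Parse an interval such as '1s' into seconds (assumes a trailing 's')."""
    return float(value.strip().rstrip("s"))


def min_delay(headers: Mapping[str, str], max_rps: Optional[float] = None) -> float:
    """Return the minimum pause between requests, in seconds.

    `headers` is a plain mapping here, so the lookup is case-sensitive;
    a real HTTP client usually exposes a case-insensitive mapping.
    """
    delay = 0.0
    limit = headers.get("X-Rate-Limit-Limit")
    interval = headers.get("X-Rate-Limit-Interval")
    if limit and interval:
        # e.g. 50 requests per 1s -> at least 0.02s between requests
        delay = parse_interval_seconds(interval) / float(limit)
    if max_rps:
        # The user cap can only slow things down, never speed them up.
        delay = max(delay, 1.0 / max_rps)
    return delay
```

With an advertised limit of 50 requests per second, `--max-rps 10` would stretch the pause to 0.1 s, while `--max-rps 100` would leave the advertised 0.02 s in place.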

See [`examples/`](examples/) for ready-to-run sample inputs and captured output.

## How it works

```mermaid
flowchart TD
    A["Citation: claimed author + year (+ DOI)"] --> B{"Has DOI?"}
    B -- yes --> C["GET CrossRef /works/{DOI}"]
    B -- no --> D["CrossRef bibliographic search<br/>(low-confidence)"]
    C --> E{"DOI resolves?"}
    E -- no --> U["Tier U — unresolvable"]
    E -- yes --> F["Compare claimed vs. canonical record"]
    D --> F
    F --> G{"First-author surname matches?"}
    G -- no --> TA["Tier A — author mismatch"]
    G -- yes --> H{"Year matches?"}
    H -- no --> TB["Tier B — year mismatch"]
    H -- yes --> OK["OK"]
    C --> R{"Retracted / expression of concern?"}
    R -- yes --> TR["Tier R — retraction (orthogonal)"]
    F -. "--cross-check pubmed" .-> P["PubMed second opinion"]
```

No language model is involved at any step. ghostcite resolves each DOI at CrossRef
(and optionally PubMed), then does a pure, deterministic comparison of the claimed
first-author surname (Unicode-folded, punctuation-stripped) and year against the
canonical record, plus a retraction / expression-of-concern check. Only the HTTP
client touches the network, via CrossRef's polite pool (a descriptive `User-Agent`
with the project URL, never a personal email).
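
A minimal sketch of that comparison, assuming a record shaped like the `message` object CrossRef returns from `/works/{DOI}` (an `author` list with `family` fields and `issued.date-parts`). The function names, the exact folding rules, and the order of the checks are assumptions for illustration; ghostcite's own implementation may differ.

```python
import unicodedata
from typing import Any, Mapping


def fold(name: str) -> str:
    """Strip diacritics, punctuation and whitespace, then case-fold."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Combining marks and punctuation are not alphanumeric, so they drop out.
    return "".join(ch for ch in decomposed if ch.isalnum()).casefold()


def classify(claimed_surname: str, claimed_year: int, record: Mapping[str, Any]) -> str:
    """Return 'A', 'B', 'C', 'U' or 'OK' for one citation (hypothetical helper)."""
    families = [a.get("family", "") for a in record.get("author", [])]
    if not families:
        return "U"  # no author metadata to compare against
    if fold(claimed_surname) not in {fold(f) for f in families}:
        return "A"  # author mismatch
    date_parts = record.get("issued", {}).get("date-parts") or [[None]]
    year = date_parts[0][0] if date_parts[0] else None
    if year != claimed_year:
        return "B"  # year mismatch
    # Matches only after folding (e.g. Bürger vs Burger) -> cosmetic.
    return "OK" if claimed_surname in families else "C"


record = {"author": [{"family": "Chen"}], "issued": {"date-parts": [[2024]]}}
result = classify("Li", 2024, record)  # "A": Li is not among the record's authors
```

Because the function is pure and takes the record as an argument, it can be tested without any network access, which matches the document's point that only the HTTP client touches the network.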

### Severity tiers, input formats & exit codes

| Tier   | Meaning                                                               | Fails CI?                       |
| ------ | --------------------------------------------------------------------- | ------------------------------- |
| **A**  | author-mismatch — claimed first author isn't in CrossRef's authors    | Yes                             |
| **B**  | year-mismatch — author matches, claimed year differs                  | Yes                             |
| **C**  | cosmetic — matches only after diacritic/initials fold (Bürger≈Burger) | No (info)                       |
| **R**  | retraction / expression-of-concern per CrossRef                       | Yes (fires regardless of A/B/C) |
| **U**  | unresolvable — DOI 404s, or no-DOI entry search was inconclusive      | No (warn)                       |
| **OK** | first author + year match                                             | —                               |

When the claimed title also diverges strongly from CrossRef's title, a Tier A
finding is annotated **"possibly wrong DOI entirely"** to distinguish a wrong-author
citation from a wrong-DOI one.
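
The source does not say how title divergence is measured. One plausible way to sketch it is a string-similarity ratio with a cutoff; both the use of `difflib.SequenceMatcher` and the `0.5` threshold below are assumptions, and the function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def title_similarity(claimed: str, canonical: str) -> float:
    """Similarity in [0, 1] after lowercasing and collapsing whitespace."""
    def norm(s: str) -> str:
        return " ".join(s.lower().split())
    return SequenceMatcher(None, norm(claimed), norm(canonical)).ratio()


def annotate_tier_a(claimed_title: str, canonical_title: str,
                    threshold: float = 0.5) -> Optional[str]:
    """Return the extra annotation for a Tier A finding, or None.

    The threshold is illustrative; a real tool would need to tune it
    against subtitle differences, truncation and translated titles.
    """
    if title_similarity(claimed_title, canonical_title) < threshold:
        return "possibly wrong DOI entirely"
    return None
```

A Tier A finding whose titles still agree points to a wrong byline on the right paper; one whose titles also disagree suggests the DOI itself belongs to a different work.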

| Format       | Detection                                       | Yields claimed author/year?            |
| ------------ | ----------------------------------------------- | -------------------------------------- |
| **BibTeX**   | `@article{…}` / `@…{…}` entries                 | Yes (`author`, `year`, `doi`, `title`) |
| **Markdown** | bullet refs `- **AuthorList (YYYY).** … 10.x …` | Yes                                    |
| **DOI list** | newline-delimited bare DOIs / `doi:` / DOI URLs | No — lookup + retraction sweep only    |
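
The detection column above can be approximated with a few regular expressions. This is a rough sketch, not ghostcite's detector: the patterns, the order in which they are tried, and the function names are all assumptions.

```python
import re

# Illustrative patterns only; real-world inputs are messier than this.
BIBTEX_RE = re.compile(r"^\s*@\w+\s*\{", re.MULTILINE)
MARKDOWN_RE = re.compile(r"^\s*[-*]\s+\*\*.+\(\d{4}\)", re.MULTILINE)
DOI_RE = re.compile(r"10\.\d{4,9}/\S+")
DOI_PREFIX_RE = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def detect_format(text: str) -> str:
    """Guess 'bibtex', 'markdown', 'doi' or 'unknown' from the input text."""
    if BIBTEX_RE.search(text):
        return "bibtex"
    if MARKDOWN_RE.search(text):
        return "markdown"
    if DOI_RE.search(text):
        return "doi"
    return "unknown"


def normalize_doi(raw: str) -> str:
    """Reduce 'doi:10.x/y' or 'https://doi.org/10.x/y' to a bare, lowercase DOI."""
    return DOI_PREFIX_RE.sub("", raw.strip()).lower()
```

BibTeX is tried first because BibTeX and Markdown entries both contain DOIs, so the bare-DOI pattern only makes sense as a fallback.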

| Exit code | Meaning                                            |
| --------- | -------------------------------------------------- |
| `0`       | clean — no findings at or above the fail threshold |
| `1`       | findings present at/above the threshold            |
| `2`       | tool error (network down, unparseable input, …)    |

`--fail-on` (default `author,year,retraction`) selects which tiers force exit `1`;
`--fail-on none` runs as a passive reporter. Tiers `C` and `U` never force exit `1`.
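
The gate described above amounts to a set intersection between the tiers that were found and the tiers selected by `--fail-on`. The sketch below is an assumption about how that could be expressed, with hypothetical names; it does not model exit code `2`, which signals a tool error rather than a finding.

```python
from typing import Iterable, Set

# Maps --fail-on names to tier letters; C and U are deliberately absent
# because they never force a failing exit.
TIER_NAMES = {"author": "A", "year": "B", "retraction": "R"}


def parse_fail_on(value: str) -> Set[str]:
    """Turn 'author,year' into {'A', 'B'}; 'none' yields an empty set."""
    if value.strip().lower() == "none":
        return set()
    tiers = set()
    for name in value.split(","):
        key = name.strip().lower()
        if key not in TIER_NAMES:
            raise ValueError(f"unknown --fail-on value: {name!r}")
        tiers.add(TIER_NAMES[key])
    return tiers


def exit_code(found_tiers: Iterable[str], fail_on: str = "author,year,retraction") -> int:
    """Return 1 if any found tier is selected by fail_on, else 0."""
    return 1 if set(found_tiers) & parse_fail_on(fail_on) else 0
```

For example, a run that finds only Tier B would exit `0` under `--fail-on author,retraction` and `1` under the default.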

## Use it in CI

A clean run is quiet and exits `0`.



Drop in the composite **GitHub Action**:

```yaml
- uses: musharna/ghostcite@v1
  with:
    paths: paper/refs.bib
    fail-on: "author,year,retraction"
```

…or the **[pre-commit](https://pre-commit.com/) hook**:

```yaml
repos:
  - repo: https://github.com/musharna/ghostcite
    rev: v0.1.0
    hooks:
      - id: ghostcite
        args: [paper/references.bib, --fail-on, "author,year,retraction"]
```

Either way, a finding at or above the `--fail-on` threshold returns a non-zero
exit, blocking the merge or commit before submission.

## Scope & limitations

`ghostcite` checks **metadata correctness** (does the DOI's record match what you
wrote), not claim support (does the source actually _say_ what your prose claims —
a separate, LLM-based concern). It does no auto-fixing and no citation-style
linting. CrossRef is the source of truth; `--cross-check pubmed` adds PubMed as an
optional second opinion.

- CrossRef stores particle surnames inconsistently (`van der Berg` vs `Berg`), so a
  correctly cited prefixed surname can occasionally produce a Tier A false positive.
- No-DOI entries are resolved by best-effort bibliographic search and flagged
  low-confidence; treat those as hints, not verdicts.
- Some preprints, datasets, and protocols carry no author metadata in CrossRef and
  surface as Tier U rather than a mismatch.

### Related work & FAQ

ghostcite's niche is **deterministic, no-LLM, CLI-first** checking focused on the
**byline-mismatch** failure mode (right DOI, wrong author/year) plus **retraction**
flagging — built to run unattended in CI.

| Tool                                                            | What it does                                | How ghostcite differs                                                       |
| --------------------------------------------------------------- | ------------------------------------------- | --------------------------------------------------------------------------- |
| [RefChecker](https://github.com/markrussinovich/refchecker)     | LLM-powered web-search reference validator  | ghostcite is no-LLM, deterministic, and CI-safe (no model, no API key)      |
| claude-skill-citation-checker                                   | A Claude Code skill for an LLM agent        | ghostcite is a standalone CLI + Action — no agent or LLM host needed        |
| [BibTeX Verifier](https://merfanian.github.io/Bibtex-Verifier/) | In-browser BibTeX checker                   | ghostcite is scriptable from the CLI and also flags retractions             |
| [CERCA](https://github.com/lidianycs/cerca)                     | Java / AGPL citation checker                | ghostcite is Python / MIT / `pip install`-able                              |
| [scite Reference Check](https://scite.ai/)                      | Commercial, PDF-oriented, retraction focus  | ghostcite is free / open-source, BibTeX-native, and catches byline mismatch |
| [doimgr](https://github.com/dotcs/doimgr)                       | Formats and manages DOIs (doesn't validate) | ghostcite verifies byline and retraction status, not just formatting        |

**Does it call an LLM?** No — a deterministic comparison of the metadata you wrote
against CrossRef's (and optionally PubMed's) canonical record. No model, no prompt,
no API key required.

**Will it hit rate limits?** It self-throttles to CrossRef's advertised rate limit
(read from the live response headers); use `--max-rps` to be more conservative.

**Does it catch fabricated DOIs?** Indirectly — a DOI that 404s at CrossRef
surfaces as Tier U. The core check is byline-vs-DOI _consistency_, so it catches the
common case of a real DOI attached to the wrong citation.

## License

MIT — see [LICENSE](LICENSE).