https://github.com/soxoj/kronikier
๐๏ธ Get historical contacts for a website from web.archive.org snapshots
https://github.com/soxoj/kronikier
cdx-api contact-extraction contact-mining domain-research email-extraction historical-data internet-archive investigative-journalism osint osint-python osint-tool phone-number-extraction phone-numbers wayback-archive wayback-archiver wayback-machine web-archive
Last synced: 6 days ago
JSON representation
๐๏ธ Get historical contacts for a website from web.archive.org snapshots
- Host: GitHub
- URL: https://github.com/soxoj/kronikier
- Owner: soxoj
- License: mit
- Created: 2026-05-29T18:04:15.000Z (27 days ago)
- Default Branch: main
- Last Pushed: 2026-06-13T19:37:12.000Z (12 days ago)
- Last Synced: 2026-06-13T21:24:18.262Z (11 days ago)
- Topics: cdx-api, contact-extraction, contact-mining, domain-research, email-extraction, historical-data, internet-archive, investigative-journalism, osint, osint-python, osint-tool, phone-number-extraction, phone-numbers, wayback-archive, wayback-archiver, wayback-machine, web-archive
- Language: Python
- Homepage: https://kronikier.soxoj.com
- Size: 150 KB
- Stars: 50
- Watchers: 0
- Forks: 7
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- License: LICENSE
Awesome Lists containing this project
README
# kronikier
**๐๏ธ Get historical contacts for a website from web.archive.org snapshots**
---
_OSINT tool that mines **historical** contacts (email, phone numbers) for a
domain out of **web.archive.org** snapshots._
_Built for investigations where the current site no longer shows contact details (or shows
different ones) but earlier versions are preserved in the archive._
[](https://asciinema.org/a/kZErdPENZJlnSEjA)
**WEB VERSION:** [kronikier-web](https://github.com/soxoj/kronikier-web).
**Full docs:** [docs/USAGE.md](docs/USAGE.md) ยท
[docs/ARCHITECTURE.md](docs/ARCHITECTURE.md).
---
## Quick start
```bash
pip install -e .
```
```bash
# Default scan: 5-minute timeout, contact-URL filter on.
kronikier theranos.com
# How did one specific page evolve across snapshots?
kronikier --single-url https://www.theranos.com/contact-us
# Long-running thorough scan, every URL the host has.
kronikier wirecard.com --exhaustive
# Batch a list of targets.
kronikier --targets-file targets.txt
```
The first ever invocation runs a one-time latency calibration (~3-5 s,
cached for 14 days). Every run writes a CSV next to the current
directory and prints a summary table to the terminal:
```
Contact First seen Last seen
โโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโ โโโโโโโโโโ
email info@theranos.com 2014-02-08 2015-02-24
phone +1 855-843-7200 2014-09-02 2015-11-18
CSV saved: ./theranos.com_20260529_200437.csv
[*] Cache: 3 hit / 0 miss (saved 3 fetches on web.archive.org)
```
That's it for getting started. See [docs/USAGE.md](docs/USAGE.md) for the
full CLI reference, batch-input format, snapshot-cache controls, and
troubleshooting.
---
## What you get
Every run produces:
- **A monospace summary table** on the terminal โ first/last seen
timestamps per contact, ordered chronologically.
- **A CSV file** with nine columns:
`kind, value, value_human, value_raw, first_seen, last_seen,
sightings_count, first_archive_url, last_archive_url`. UTF-8 with BOM
so Excel opens non-ASCII content cleanly.
- **A live contact feed** in the terminal (`+ email โฆ`, `+ phone โฆ`) as
results stream in. `-v` adds the snapshot date + URL beneath each one.
> **Phone numbers are reconstructed, not copied.** The `value` column is
> a normalised E.164 form produced by libphonenumber from whatever
> digits and punctuation appeared on the page โ there's a guess about
> country code, trunk prefix, and grouping involved. Most of the time
> it's right, but on weird formats (truncated numbers, ambiguous
> regions, OCR-style typos) it can land on a plausible-but-wrong number.
> **If a phone in the `value` column looks off, always cross-check
> `value_raw` to see exactly how it appeared on the page** โ that's the
> ground truth.
`--json` switches the table to a machine-readable JSON object on stdout;
CSV is still written and the path is reported on stderr.
Empty results print `No contacts found.` and skip the CSV entirely.
---
## When this is the right tool
The Wayback Machine is often the only source of authoritative owner
information when a site:
- is dead;
- was re-registered and the contacts scrubbed;
- changed editorial team and removed the previous people in charge;
- surfaced in a dubious story and the phone numbers vanished overnight.
For everything else (a live site whose contacts page works), just open
the live site.
---
## Snapshot cache
Wayback snapshots are immutable, so rerunning the same scan can answer
without spending more IA bytes. The CLI keeps an on-disk cache, **on by
default**, at `~/.cache/kronikier/snapshots/`. One HTML file per
snapshot, browsable on disk, grouped by host.
Disable with `--no-cache`. Wipe with `--clear-cache`.
Full layout, env-var override, and best-effort semantics are in
[docs/USAGE.md](docs/USAGE.md#snapshot-cache).
---
## Rate limits and ethics
- Default `--rate 4` (4 req/sec to wayback) is polite. Don't raise it
without a reason โ the archive serves everyone.
- All data extracted is already public. The tool doesn't bypass
paywalls, crawler blocks, or any access control.
- Contacts in the archive may belong to people no longer associated
with the domain. The timeline columns (`first_seen` / `last_seen`)
are there so you can judge what was current when.
- Use only within the scope of legal investigations.
---
## OSINT precedents
The "recover contacts from archives" technique is documented across
many published investigations. A few:
- **John Carreyrou / WSJ โ Theranos.** Material from archived
`theranos.com` used as evidence after the company shut down.
- **Brian Krebs (KrebsOnSecurity).** Multiple investigations of malware
and fraud site operators identified via archived contact emails on
early versions of their landing pages.
- **OCCRP shell-company investigations.** "Contact" pages of shell
sites routinely name the human beneficiary before being scrubbed.
- **Bellingcat.** Applied the same technique across multiple
investigations.
The end-to-end test suite (`pytest -m e2e`) hits the live wayback for
`theranos.com` and `enron.com` and asserts that the canonical contacts
of both still surface โ the OSINT use case is the regression test.
## SOWEL classification
OSINT techniques used:
- [SOTL-7.1. Check Archives](https://sowel.soxoj.com/check-archives)
- [SOTL-22.5. Extract Contacts From Page Text](https://sowel.soxoj.com/page-text-contacts)
## License
[MIT](LICENSE).