{"id":50139857,"url":"https://github.com/mqudsi/epstein-ocr","last_synced_at":"2026-05-24T00:33:55.115Z","repository":{"id":337135408,"uuid":"1151842805","full_name":"mqudsi/epstein-ocr","owner":"mqudsi","description":null,"archived":false,"fork":false,"pushed_at":"2026-03-29T17:07:19.000Z","size":1043,"stargazers_count":132,"open_issues_count":0,"forks_count":14,"subscribers_count":5,"default_branch":"master","last_synced_at":"2026-05-24T00:33:47.491Z","etag":null,"topics":["base64","cnn","epstein","ml","ocr","pytorch","rust"],"latest_commit_sha":null,"homepage":"https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mqudsi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":"mqudsi","custom":"https://mqudsi.com/donate/"}},"created_at":"2026-02-07T01:07:14.000Z","updated_at":"2026-05-15T22:44:00.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/mqudsi/epstein-ocr","commit_stats":null,"previous_names":["mqudsi/monospace-ocr","mqudsi/epstein-ocr"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mqudsi/epstein-ocr","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mqudsi%2Fepstein-ocr","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mqudsi%2Fepstein-ocr/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/Gi
tHub/repositories/mqudsi%2Fepstein-ocr/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mqudsi%2Fepstein-ocr/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mqudsi","download_url":"https://codeload.github.com/mqudsi/epstein-ocr/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mqudsi%2Fepstein-ocr/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33417487,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-23T22:14:44.296Z","status":"ssl_error","status_checked_at":"2026-05-23T22:14:43.778Z","response_time":53,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["base64","cnn","epstein","ml","ocr","pytorch","rust"],"created_at":"2026-05-24T00:33:54.281Z","updated_at":"2026-05-24T00:33:55.100Z","avatar_url":"https://github.com/mqudsi.png","language":"Python","funding_links":["https://github.com/sponsors/mqudsi","https://mqudsi.com/donate/"],"categories":[],"sub_categories":[],"readme":"# Project Summary\n\nIf you're new here, this project was started in response to [an effort to extract some unredacted content in the Epstein archives](https://neosmart.net/blog/recreating-epstein-pdfs-from-raw-encoded-attachments/). 
The CNN in this repo was used to successfully exfiltrate [DBC12.pdf](https://archive.org/details/dbc-12-one-page-invite-with-reply) from [EFTA00400459](https://archive.org/details/efta-00400459_pages).\n\nYou can read about this code/approach here: [EFTA00400459 has been cracked, DBC12.pdf liberated](https://neosmart.net/blog/efta00400459-has-been-cracked-dbc12-pdf-liberated/).\n\nThe code in this project runs against the images extracted from the PDF with `pdfimages`; you can download [an archive containing them here](https://archive.org/details/efta-00400459_pages).\n\n### Basic Usage Info\n\nExpects `../EFTA00400459-{000..=075}_2x.png` to exist.\n\n* Run `./train.sh` to train from train_top.txt and train_bot.txt, which correspond to page-001_2x.png\n* Run `./run.sh` to OCR all pages and generate recovered.pdf\n\nTrains from the top and the bottom of page-001 non-contiguously to capture vertical drift.\nMemorizes the grid location and reuses it for subsequent pages (non-training runs) to prevent pixel shifts.\n\nIn training runs with `-d`/`--debug`, generates a debug view that lets you see if you mistyped anything by showing the greatest outliers compared to the rest of the members assigned to the bucket:\n\n![Typo sanity checking when training](./img/training-no-typos.png)\n\nIn inference runs, generates a debug view (when `-d` is in use with no `-q`/`--quiet`) that shows the max outliers compared to the rest of the characters in the image. When `-o`/`--output` is specified, the debug view is saved to `\u003cbasename\u003e-proof.png` so you can inspect it later.\n\n![Post-inference analysis](./img/inference-analysis.png)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmqudsi%2Fepstein-ocr","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmqudsi%2Fepstein-ocr","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmqudsi%2Fepstein-ocr/lists"}