{"id":51436478,"url":"https://github.com/davidmalko87/whispergram","last_synced_at":"2026-07-05T07:02:08.421Z","repository":{"id":367177150,"uuid":"1277489793","full_name":"davidmalko87/whispergram","owner":"davidmalko87","description":"Local, offline transcriber for Telegram \u0026 Instagram chat exports — voice/video notes via Whisper (faster-whisper), screenshots via OCR, photos/stickers/GIFs via a local vision model. Interactive menu or CLI; merges everything into one chronological, LLM-ready Markdown file. No cloud, no API key.","archived":false,"fork":false,"pushed_at":"2026-07-04T19:28:23.000Z","size":130,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"master","last_synced_at":"2026-07-04T21:05:42.968Z","etag":null,"topics":["chat-export","cli","direct-messages","faster-whisper","image-captioning","instagram","instagram-dm","llm","ocr","offline","privacy","python","speech-to-text","telegram","telegram-export","transcription","voice-messages","voice-notes","voice-to-text","whisper"],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/davidmalko87.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-06-23T00:10:16.000Z","updated_at":"2026-07-04T19:28:04.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/davidmalko87/whispergram","commit_stats":null,"previous_names":["davidmalko87/whispergram"],"tags_count":6,"template":false,"template_full_name":null,"purl":"pkg:github/davidmalko87/whispergram","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidmalko87%2Fwhispergram","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidmalko87%2Fwhispergram/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidmalko87%2Fwhispergram/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidmalko87%2Fwhispergram/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/davidmalko87","download_url":"https://codeload.github.com/davidmalko87/whispergram/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/davidmalko87%2Fwhispergram/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":35145900,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-07-05T02:00:06.290Z","response_time":100,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["chat-export","cli","direct-messages","faster-whisper","image-captioning","instagram","instagram-dm","llm","ocr","offline","privacy","python","speech-to-text","telegram","telegram-export","transcription","voice-messages","voice-notes","voice-to-text","whisper"],"created_at":"2026-07-05T07:02:07.105Z","updated_at":"2026-07-05T07:02:08.399Z","avatar_url":"https://github.com/davidmalko87.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# whispergram\n\n[![CI](https://github.com/davidmalko87/whispergram/actions/workflows/ci.yml/badge.svg?branch=master)](https://github.com/davidmalko87/whispergram/actions/workflows/ci.yml)\n[![PyPI version](https://img.shields.io/pypi/v/whispergram.svg)](https://pypi.org/project/whispergram/)\n[![PyPI downloads](https://img.shields.io/pypi/dm/whispergram.svg)](https://pypi.org/project/whispergram/)\n[![Python](https://img.shields.io/pypi/pyversions/whispergram.svg?logo=python\u0026logoColor=white)](https://pypi.org/project/whispergram/)\n[![License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)\n[![Platform](https://img.shields.io/badge/platform-Windows%20%7C%20macOS%20%7C%20Linux-lightgrey.svg)](#)\n[![Chats](https://img.shields.io/badge/chats-Telegram%20%7C%20Instagram-success.svg)](#2-export-your-chat)\n[![Offline](https://img.shields.io/badge/100%25-local%20%26%20offline-success.svg)](#%EF%B8%8F-privacy)\n[![Round-trip](https://img.shields.io/badge/round--trip-validated-success.svg)](#-round-trip-validated)\n[![Last commit](https://img.shields.io/github/last-commit/davidmalko87/whispergram.svg)](https://github.com/davidmalko87/whispergram/commits/master)\n[![GitHub issues](https://img.shields.io/github/issues/davidmalko87/whispergram.svg)](https://github.com/davidmalko87/whispergram/issues)\n\n\u003e **Your Telegram _or_ Instagram chat — voice, video *and* photos — as one searchable transcript, fully local.**\n\u003e Point it at a **Telegram** export or an **Instagram** DM export and it transcribes voice \u0026 video\n\u003e notes with Whisper ([faster-whisper](https://github.com/SYSTRAN/faster-whisper)),\n\u003e **reads text from screenshots with OCR**, and **captions photo/sticker/GIF scenes with a local\n\u003e vision model** — all merged into one chronological, LLM-ready file. **100% offline, no API key, no\n\u003e cloud.** The platform is auto-detected; the same pipeline serves both.\n\nEvery line is tagged by sender and timestamp — voice, video **and photos** turned into readable text:\n\n```\n[2026-06-20 12:33] Alex (voice 14s): just finished the auth flow, take a look\n[2026-06-20 12:35] You: nice, send the diagram\n[2026-06-20 12:46] Alex (photo, described): a hand-drawn architecture diagram on a whiteboard | text: Login -\u003e API -\u003e DB\n[2026-06-20 12:47] You (video-note 6s): looks great, let's ship it\n[2026-06-20 12:47] Alex (sticker 👍)\n```\n\n\u003e Photos become text two ways: a local vision model captions the scene (on by default), and `--ocr` reads any text in the image.\n\n---\n\n## Why?\n\nAn audio-heavy chat is unreadable and unsearchable — you cannot grep a voice note, and you cannot\nhand a folder of `.ogg`/`.opus`/`.m4a` files to an LLM. This is true whether the chat lives in\n**Telegram** or **Instagram DMs**, and the built-in options are worse: Telegram Premium transcribes\none message at a time by hand, Instagram has no bulk transcription at all, and cloud speech APIs\nupload your private audio to a third party.\n\n**whispergram** transcribes **every** voice and video note in one pass, entirely on your own machine,\nand weaves them back into the text timeline as a single file you can read, search, or feed to a\nmodel. It reads both **Telegram Desktop JSON exports** and **Instagram \"Download your information\"\nmessage exports** — no flag, the format is detected for you.\n\n---\n\n## Features\n\n| Feature | Description |\n|---|---|\n| **Two platforms, one pipeline** | Reads **Telegram** exports *and* **Instagram** DM exports — auto-detected, merged the same way, into the same `[time] sender` format |\n| **Voice + video notes** | Both voice messages and round video notes are transcribed inline with the text — on **both** platforms, by default |\n| **One merged file** | A single chronological `merged_chat.md`, every line tagged `[time] sender` |\n| **100% local \u0026 offline** | faster-whisper runs on your machine — no upload, no API key, no account |\n| **Lossless mapping** | Stickers, photos, animations/GIFs, documents, music, locations, polls, contacts and shared Reels appear as markers — nothing content-bearing is dropped |\n| **Handles missing media** | Notes excluded from the export are clearly marked `[not exported]`, never fed to the model |\n| **All text shapes** | Reconstructs plain, rich, and entity-based message text (links, mentions, custom emoji) |\n| **Instagram encoding repair** | Instagram mangles non-Latin text (mojibake); whispergram repairs it so Ukrainian/Russian and emoji read correctly, and merges paginated `message_*.json` files chronologically |\n| **Dry-run** | Preview the full merge with `--dry-run` — no model download, no GPU, instant |\n| **GPU or CPU** | CUDA with automatic CPU fallback; a one-command Windows CUDA fix is built in |\n| **Auto-detect** | Finds the export JSON (any filename), the platform, and the language per file |\n| **Regular videos** | `--video-files` also transcribes ordinary video files' audio, not just round notes |\n| **Photo OCR** | `--ocr` pulls text out of photos with local Tesseract — great for screenshots |\n| **Photo/sticker/GIF descriptions** | Captioned automatically by the best installed local model — BLIP (`[describe]`) for photos, or Qwen2-VL (`[describe-hq]`) for photos + stickers + GIFs |\n| **Resumable** | Progress is cached per file — close the terminal or crash, then re-run and it continues where it left off |\n| **Queue chats** | Transcribe many exports (Telegram and/or Instagram, mixed) in one command — models load once; `--out-dir` collects the results |\n| **Interactive menu (default)** | In a terminal, `whispergram` opens a picker of all your Telegram **and** Instagram chats — choose which to transcribe with a best-models preset, no flags to remember. `--no-menu` (or any action flag, or a non-interactive/cron run) transcribes directly |\n| **Progress bar** | Live `done/total` + ETA per chat |\n| **Round-trip verified** | Rich synthetic exports run through the full pipeline and are diffed line-for-line; validated against real Telegram **and** Instagram exports (see below); 128 offline tests on the Python 3.9–3.13 CI matrix |\n\n---\n\n## Quick Start\n\n### 1. Install\n\n**Via PyPI (recommended):**\n\n```bash\npip install whispergram\n```\n\n**Or clone for development:**\n\n```bash\ngit clone https://github.com/davidmalko87/whispergram.git\ncd whispergram\npip install -r requirements.txt\n```\n\nYou also need **ffmpeg** on your PATH:\n\n```bash\n# Linux:  sudo apt install ffmpeg\n# macOS:  brew install ffmpeg\n# Windows: choco install ffmpeg   (or: winget install Gyan.FFmpeg)\n```\n\n### 2. Export your chat\n\nwhispergram reads **either** a Telegram export **or** an Instagram DM export. Grab whichever you\nhave — you don't tell whispergram which platform it is, it detects it.\n\n#### 2a. From Telegram\n\nTelegram **Desktop** → open the chat → ⋮ menu → **Export chat history**. In the dialog:\n\n- **Format: JSON** (required — whispergram reads the JSON export, not the HTML one).\n- Tick the media you want whispergram to use:\n\n| Export option | Tick it? | What whispergram does with it |\n|---|---|---|\n| **Voice messages** | ✅ | Transcribed — the core feature |\n| **Video messages** | ✅ | Round video notes — transcribed |\n| **Photos** | ✅ for captions / `--ocr` | Scene-captioned and/or OCR'd; without it, photos show as a plain `(photo)` |\n| **Videos** | optional, for `--video-files` | Regular videos — their audio is transcribed |\n| **Stickers** | for `--describe-hq` | `(sticker 😅)` comes from JSON; tick to let `--describe-hq` caption the image too |\n| **GIFs** | for `--describe-hq` | `(animation)` comes from JSON; tick to let `--describe-hq` caption it (multi-frame) |\n| **Files** | ⬜ not needed | Shown as `(file: report.pdf)` from the JSON metadata |\n\n\u003e **⚠️ Drag the \"Size limit\" slider up.** It defaults to **8 MB**, and any file larger than that is\n\u003e **not** downloaded — those messages come out as `[not exported]`. Voice notes are tiny, but video\n\u003e notes, videos, and hi-res photos routinely exceed 8 MB, so raise the slider (toward the max) to be\n\u003e sure your media actually lands in the export. *(This is the usual reason notes show as `[not exported]`.)*\n\nYou get a folder with a `.json` file plus `voice_messages/`, `round_video_messages/`, `photos/` …\nsubfolders for whatever you ticked.\n\n#### 2b. From Instagram\n\nInstagram → **Settings → Accounts Center → Your information and permissions → Download your\ninformation** (also reachable on the web at *accountscenter.instagram.com*). Then:\n\n- Choose **Some of your information → Messages** (you don't need your whole account).\n- **Format: JSON** (required — not HTML).\n- **Media quality: High** (so the voice-note audio and photos are actually included).\n- Request the download, wait for the email, and unzip it.\n\nInside you get a tree like this — **each conversation is its own folder** under `inbox/`:\n\n```\nyour_instagram_activity/\n└── messages/\n    └── inbox/\n        ├── alex_17842…/          ← one conversation\n        │   ├── message_1.json    ← (large chats are paginated: message_1, message_2, …)\n        │   ├── audio/            ← voice notes  (.mp4/.m4a)\n        │   ├── photos/\n        │   └── videos/\n        └── maria_15412…/\n            └── …\n```\n\nPoint whispergram at a **single conversation folder** (the one that contains `message_1.json`).\n\n\u003e **Instagram specifics** (all handled for you):\n\u003e - **Voice notes transcribe by default** — no extra flag. Instagram stores them as `audio_files`,\n\u003e   which whispergram maps to voice messages.\n\u003e - Thread folders are named after the person + a numeric id; the human-readable **chat name comes\n\u003e   from the thread's `title`** and is used as the output filename.\n\u003e - **Shared Reels/posts** appear as `[shared reel/post by \u003cauthor\u003e: \u003clink\u003e]` — Instagram doesn't put\n\u003e   the Reel's video in the export, only the link, so there's no audio to transcribe.\n\u003e - **End-to-end-encrypted chats aren't in the standard export** — they need Instagram's separate\n\u003e   *encrypted-chat* download.\n\n### 3. Run\n\n**Just run `whispergram`** — in a terminal it **opens the interactive picker by default** (no flags to\nremember). Run it in a folder that holds your exports; it looks **recursively**, so a single chat\nfolder, a parent folder full of `ChatExport_*`, or an Instagram `your_instagram_activity` root all\nwork (scanning a real 260-thread inbox takes ~4 s):\n\n```bash\nwhispergram            # or, without installing: python whispergram.py\n```\n\nYou get a picker like this:\n\n```\n==================================================\n  whispergram v1.4.0\n  Local, offline Telegram \u0026 Instagram transcriber\n  by David Malko - github.com/davidmalko87/whispergram\n==================================================\n\nFound 3 chat(s):\n\n    #  platform  voice photo video  dates                   name\n    1  Telegram    141    16     0  2026-06-30..2026-07-01  Alex\n    2  Instagram    30     8     2  2026-05-01..2026-06-20  Maria\n    3  Telegram      4    98    12  2024-10-28..2025-04-10  Work\n\n  Which chats? (e.g. 1,3-5 or 'all') [all]:\n```\n\nIt lists every Telegram **and** Instagram chat it finds with platform, name, **date range** and\nvoice/photo/video counts (voice-heavy first, or `--sort messages`/`recent`/`name`) — so same-named\nexports are easy to tell apart — lets you pick which to do (`1,3-5` or `all`), and offers a one-keystroke\npreset — **\"Everything, best models\"** is the recommended default (transcribe voice+video, describe\nphotos/stickers/GIFs, OCR). That's the simplest way to \"transcribe everything with the best models\"\nwithout learning the flags below. (`--menu` still forces the picker, but you rarely need it.)\n\n**Skip the picker and transcribe directly** — this happens **automatically when there's no terminal**\n(a cron job or a pipe, so scripts never block), or on demand:\n\n```bash\nwhispergram --no-menu                                           # bare direct run, with defaults\nwhispergram \"path/to/ChatExport_2026-06-20\" --ocr --lang uk     # any action flag also runs direct\nwhispergram \"your_instagram_activity/messages/inbox/alex_17842…\" --no-menu\n```\n\nPassing any transcription flag (`--ocr`, `--lang`, `--describe-hq`, `--video-files`, …) is read as\n\"I've already chosen — just run it,\" so the picker doesn't open. The result is `merged_chat.md` in the\nexport folder (or use `--out` / `--out-dir`, below).\n\n**Best quality for Instagram** (lots of photos, Reels and GIFs) — install the HQ describer and it's\nused automatically:\n\n```bash\npip install -U \"whispergram[describe-hq]\"\nwhispergram \"your_instagram_activity/messages/inbox/alex_17842…\" --out-dir \"C:\\merged\"\n# voice notes → transcribed, photos/videos/GIFs → described, Reels → [shared reel/post …] markers\n```\n\n### Best quality (use your GPU)\n\nAudio and video already use the most accurate model — Whisper **large-v3** — on your GPU by default.\nFor the best **photo, sticker and GIF** captions, just install the HQ extra — it's then used\n**automatically, no flag** — and, for speed, put torch on your GPU:\n\n```bash\npip install -U \"whispergram[describe-hq]\"\n# optional, for GPU-fast captions (match your CUDA, e.g. cu121/cu124):\npip install torch torchvision --index-url https://download.pytorch.org/whl/cu124\nwhispergram        # auto-uses large-v3 + Qwen2-VL; add --ocr --ocr-lang ukr+rus+eng for screenshot text\n```\n\nThat runs **large-v3** (audio/video) + **Qwen2-VL** (photos, stickers, GIFs). ⚠️ **On Windows, a CUDA\nbuild of torch can clash with faster-whisper's GPU** (cuDNN) — see [GPU on Windows](#gpu-cuda-setup)\nfor the two reliable setups before installing CUDA torch.\n\n### Queue \u0026 resume\n\nPass **several export folders** (Telegram, Instagram, or a mix) to transcribe them back-to-back — the\nmodels load **once** and are reused, and sequential is safe for your GPU:\n\n```bash\nwhispergram \"ChatExport_Anastasia\" \"inbox/olha_15900…\" \"ChatExport_Work\" --out-dir \"C:\\merged\"\n# -\u003e C:\\merged\\Anastasia.md, C:\\merged\\Olha.md, C:\\merged\\Work.md\n```\n\nRuns are **resumable**: each transcript/caption is cached to `.whispergram_cache.json` in the export\nfolder as it's produced, so if you close the terminal or it crashes, just **run it again** — finished\nfiles are skipped and it continues where it left off. A progress bar shows `done/total` + ETA:\n\n```\n 60%|████████████        | 28/47 [02:14\u003c01:31], audio_28.ogg\n```\n\nIf two chats share a name, the second is saved as `Work (2).md` rather than overwriting the first,\nand a folder that fails (e.g. a corrupt export) is skipped so the rest of the queue still runs.\n`--no-cache` disables the cache; `--out FILE` sets a custom path for a single folder (it can't be\ncombined with `--out-dir`).\n\n---\n\n## Example output\n\n```\n[2026-06-20 12:33] Alex: did you get the files?\n[2026-06-20 12:34] Alex (voice 6s): one sec, recording the summary now ...\n[2026-06-20 12:35] You (photo, described): a screenshot of a calendar app | text: Sprint review - Fri 15:00\n[2026-06-20 12:36] Alex (video-note 8s): [not exported]\n[2026-06-20 12:36] You (sticker 😅)\n[2026-06-20 12:37] Alex: [shared reel/post by @some.creator: https://www.instagram.com/reel/…]\n```\n\n\u003e Photo captioning is automatic once `whispergram[describe]` is installed; add `--ocr` for the in-image text, or `--no-describe` for a plain `(photo)` marker. The last line shows an Instagram shared Reel — link only, since the video isn't in the export.\n\n---\n\n## How each message appears\n\n| Message type | In the merged file |\n|---|---|\n| Text | `[time] sender: message text` |\n| Voice note | `[time] sender (voice 12s): \u003ctranscript\u003e` |\n| Round video note | `[time] sender (video-note 8s): \u003ctranscript\u003e` |\n| Voice/video note **with caption** | `[time] sender (voice 12s): \u003ctranscript\u003e \\| caption: \u003ctext\u003e` |\n| Voice/video not downloaded | `[time] sender (voice 12s): [not exported]` |\n| Sticker | `[time] sender (sticker 😅)` |\n| Photo, `--no-describe` | `[time] sender (photo): caption` (plain marker, no captioning) |\n| Animation / GIF | `[time] sender (animation)` |\n| Document | `[time] sender (file: report.pdf): caption` |\n| Location / poll / contact | `[time] sender (location)` · `(poll)` · `(contact)` |\n| Music / audio file | `[time] sender (audio: Artist - Title)` — transcribe with `--audio-files` |\n| Regular video file | `[time] sender (video)` — transcribe the audio with `--video-files` |\n| Instagram shared Reel/post | `[time] sender: [shared reel/post by \u003cauthor\u003e: \u003clink\u003e]` |\n| Photo (default, `[describe]` installed) | `[time] sender (photo, described): a caption of the scene` |\n| Photo + `--ocr` | `[time] sender (photo, described): \u003cscene\u003e \\| text: \u003ctext found in the image\u003e` |\n| Photo + `--ocr --no-describe` | `[time] sender (photo, text): \u003ctext found in the image\u003e` |\n| Sticker / GIF + `--describe-hq` | `[time] sender (sticker 😅, described): …` · `(animation, described): …` |\n\nMarkers can be turned off with `--no-media-markers` (voice/video notes are always transcribed).\n\n---\n\n## Describe modes: photos, stickers \u0026 GIFs\n\nImage captioning is opt-in via an extra. **The best installed describer is used automatically — no\nflag needed:**\n\n| Mode | How to enable | What it captions | Model | Size | Speed |\n|---|---|---|---|---|---|\n| **Off** | `--no-describe` | nothing (media shown as markers) | — | — | instant |\n| **Light** | `pip install whispergram[describe]` | **photos** | BLIP-large | ~1.9 GB | fast on CPU |\n| **High-quality (auto)** | `pip install whispergram[describe-hq]` | **photos + stickers + GIFs** (GIFs multi-frame) | Qwen2-VL-2B | ~4.4 GB | slow on CPU / fast on GPU |\n\n- Install the quality you want, then just run `whispergram`: if `[describe-hq]` is present it's used\n  automatically (and captions **stickers + GIFs**); otherwise BLIP captions photos. `--describe-hq`\n  forces HQ; `--no-describe` turns captioning off.\n- **HQ (Qwen2-VL)** is markedly better on cartoons, characters and *actions*, and reads GIFs several\n  frames at a time so it catches motion. **BLIP** is a quick photo gist (rough on cartoons).\n- Add `--ocr` to also pull any in-image text. Everything is local; captions are best-effort, never\n  literal fact. To run the models on your GPU, see [GPU setup](#gpu-cuda-setup).\n\n---\n\n## ✅ Round-trip Validated\n\nA faithful merge is only proven once it has been run end-to-end and the output diffed back against\n**every** message type — structural validity alone is not enough. whispergram is validated on **both\nplatforms** against **real, private exports** (measured locally; the counts below are aggregates, no\ncontent), plus a synthetic fixture that guards the same lossless mapping in CI.\n\n**Telegram** — a live, audio-heavy 770-message chat, every dimension diffed against the source JSON:\n\n| Dimension | In export | In merged file | Result |\n|---|---|---|---|\n| Voice notes (downloaded) | 4 | 4 transcribed | ✅ |\n| Round video notes (not downloaded) | 5 | 5 `[not exported]` | ✅ |\n| Other media (stickers, photos, animations, videos, audio, …) | 107 | 107 markers | ✅ |\n| Text messages | 654 | 654 | ✅ |\n| **Messages dropped** | — | **0** | ✅ |\n\nThe same invariant was re-checked on a much larger real Telegram export — **20,156 messages →\n20,136 transcript lines** (the remaining 20 are service/empty events, all accounted for), with\n**1,925** voice/video notes transcribed and **2,050** media items described. **Zero messages\ndropped.**\n\n**Instagram** — a real DM thread, normalized (paginated `message_*.json` merged, mojibake repaired)\nand diffed the same way:\n\n| Dimension | In thread | In merged file | Result |\n|---|---|---|---|\n| Voice notes | 141 | 141 transcribed (default flags) | ✅ |\n| Photos | 8 | 8 described | ✅ |\n| Text messages | 1,711 | 1,711 | ✅ |\n| **Messages dropped** | — | **0** | ✅ |\n\nIn every case the per-type counts match the source exactly, not-exported notes are never sent to the\nmodel, and the progress-bar total equals the real work performed. (An earlier version silently\ndropped media items — every sticker, photo, and caption-less item — leaving misleading gaps. The\nround-trip is what surfaced it.)\n\n\u003e Those exports are private, so the counts were measured locally and are not reproducible from this\n\u003e repo. The synthetic export under [`tests/fixtures/`](tests/fixtures/) reproduces the same lossless\n\u003e mapping across every media type and guards it automatically in CI.\n\n---\n\n## Known Limitations\n\nThese follow from the **export formats** (Telegram's and Instagram's) and from speech recognition\nitself — not from a lack of effort in the tool:\n\n| Area | Status | Notes |\n|---|---|---|\n| Telegram round video notes | Audio only, if downloaded | Telegram often excludes the binary above the size limit; those show `[not exported]` |\n| Instagram shared Reels/posts | Link only | Instagram doesn't include the Reel's video in the export, so there's nothing to transcribe — rendered as `[shared reel/post by \u003cauthor\u003e: \u003clink\u003e]` |\n| Instagram encrypted chats | Not in the export | End-to-end-encrypted conversations need Instagram's separate *encrypted-chat* download |\n| Music / `audio_file` (Telegram) | Off by default | Opt in with `--audio-files`; songs are otherwise not run through ASR. (Instagram voice notes are *not* affected — they transcribe by default.) |\n| Photo OCR | Text-in-image only | `--ocr` reads visible text (great for screenshots), not a description of the scene; needs Tesseract + language packs |\n| Photo/sticker/GIF descriptions | Best-effort, local | Captions are a short, English scene *gist*, not literal fact; local models caption cartoons/memes roughly (`--describe-hq` is much better but heavier); `--no-describe` to skip |\n| Speaker labels | Sender only | Each note is attributed to its sender; no in-audio diarization |\n| Timestamps | Minute resolution | Both platforms are rendered to `YYYY-MM-DD hh:mm`; seconds are not shown |\n| Reactions / edits / replies | Not represented | The merged file is a clean reading transcript, not a full forensic dump |\n| Transcription accuracy | Model-dependent | `large-v3` is best for uk/ru; `--lang` forces a language if auto-detect slips |\n\n---\n\n## Options\n\n```bash\nwhispergram --device cpu --model large-v3-turbo   # no GPU, fast\nwhispergram --compute-type int8_float16           # fit large-v3 on a small (\u003c=4 GB) GPU\nwhispergram --lang uk                             # force a language\nwhispergram --dry-run                             # preview the merge, no transcription\nwhispergram --audio-files                         # also transcribe music/long audio files (Telegram)\nwhispergram --video-files                         # also transcribe regular videos' audio\nwhispergram --ocr --ocr-lang ukr+rus+eng          # read text from photos (local Tesseract)\nwhispergram --no-describe                         # skip photo scene captions\nwhispergram --describe-hq                         # better captions + describe stickers/GIFs (Qwen2-VL)\nwhispergram --offline                             # zero network calls (use cached models only)\nwhispergram --out result.md                       # custom output path\n```\n\n| Flag | Default | Notes |\n|---|---|---|\n| `--menu` | auto | interactive picker (scan a folder, choose chats + preset). **On by default in a terminal**; forced by `--menu` |\n| `--no-menu` | off | skip the picker even in a terminal — transcribe directly with flags/defaults (for scripts or a quick run) |\n| `--sort` | `voice` | menu order: `voice`, `messages`, `recent` (last message), or `name` |\n| `--device` | `cuda` | `cuda` or `cpu`; auto-falls back to CPU if the GPU fails |\n| `--model` | `large-v3` | try `large-v3-turbo` or `medium` if CPU is slow |\n| `--compute-type` | `auto` | `auto`, `float16`, `int8_float16`, `int8`, `float32`. `auto` = int8 on CPU, float16 on GPU (auto **int8_float16 on low-VRAM GPUs** so large-v3 fits). Use `int8_float16` if a GPU run hangs at `0%` on a ≤4 GB card |\n| `--lang` | auto | force a code like `uk`, `ru`, `en` if auto-detect mislabels |\n| `--batch-size` | 0 | `N`\u003e1 batches segments for a big **GPU** speedup; 0 = sequential (best quality) |\n| `--out` | `merged_chat.md` | output file for a **single** folder (mutually exclusive with `--out-dir`) |\n| `--out-dir` | off | collect each queued folder's transcript here as `\u003cchat name\u003e.md` |\n| `--no-cache` | off | don't read/write the per-folder `.whispergram_cache.json` resume cache |\n| `--audio-files` | off | also transcribe Telegram `audio_file` messages (music, long memos) |\n| `--video-files` | off | also transcribe regular video files' audio track |\n| `--ocr` | off | extract text from photos with local Tesseract OCR |\n| `--ocr-lang` | `eng` | Tesseract language(s), e.g. `ukr+rus+eng` |\n| `--no-describe` | off | skip photo scene captions (on by default when `[describe]` is installed) |\n| `--describe-model` | `blip-large` | BLIP caption model id; use `...-base` for faster/lighter |\n| `--describe-hq` | off | high-quality describer (Qwen2-VL) + captions stickers/GIFs; needs `[describe-hq]` |\n| `--offline` | off | use only cached models; make zero network calls |\n| `--no-media-markers` | off | omit `(sticker)` / `(photo)` / `(file)` markers |\n| `--dry-run` | off | map the chat without loading a model or transcribing |\n| `--setup-cuda-windows` | — | copy CUDA DLLs next to ctranslate2, then exit (Windows GPU fix) |\n\n---\n\n## GPU (CUDA) setup\n\n**Linux / macOS:** with a working CUDA install it runs as-is on `--device cuda`.\n\n**Windows** — the common pitfall is `RuntimeError: Library cublas64_12.dll is not found`:\n\n1. Install the CUDA runtime wheels (no full CUDA Toolkit needed):\n   ```bash\n   pip install nvidia-cublas-cu12 nvidia-cudnn-cu12\n   pip install -U \"ctranslate2\u003e=4.5\"\n   ```\n2. If it *still* can't find the DLL, copy them next to CTranslate2 (the reliable fix):\n   ```bash\n   python whispergram.py --setup-cuda-windows\n   ```\n3. Or skip the GPU entirely: `--device cpu --model large-v3-turbo`.\n\n\u003e CTranslate2 loads cuBLAS/cuDNN lazily in native code that ignores `os.add_dll_directory`,\n\u003e which is why placing the DLLs inside the package dir is the dependable solution.\n\n**Low-VRAM GPUs (≤ 4 GB) — a run that hangs at `0%`.** `large-v3` in float16 (~3 GB weights plus\ncuDNN 9 workspace) can fail to fit a 4 GB card, and CTranslate2 then **hangs** at\n`Model: large-v3 on cuda (float16)` / `0%` instead of erroring. whispergram handles this\nautomatically: on a GPU with little free VRAM it loads **`int8_float16`** (~1.6 GB, near-identical\nquality) and prints a one-line note, so `large-v3` fits and runs on the GPU. Force it any time with\n`--compute-type int8_float16` (or `--compute-type float16` to opt out). This is why the default\n`--compute-type auto` \"just works\" on small GPUs.\n\n**GPU for photo/sticker/GIF captions is separate from Whisper's.** The describe models (BLIP /\nQwen2-VL) use PyTorch, and `pip install` fetches the **CPU** build by default — so captioning runs on\nthe CPU even when Whisper is on your GPU. For fast captioning (especially `--describe-hq`), install a\nCUDA build of torch:\n\n```bash\npip install torch torchvision --index-url https://download.pytorch.org/whl/cu121   # match your CUDA\n```\n\nwhispergram auto-detects CUDA and moves the caption model to the GPU — no flag needed.\n\n\u003e **⚠️ Windows: a CUDA torch can clash with Whisper-on-GPU.** A CUDA build of torch bundles its own\n\u003e cuDNN, which can collide with the cuDNN that faster-whisper (CTranslate2) uses — surfacing as\n\u003e `OSError: [WinError 127] … cudnn_*.dll` on startup. Both can't reliably share the GPU out of the\n\u003e box, so pick one of these stable setups:\n\u003e - **Whisper on GPU + captions on CPU** (default, recommended): keep the **CPU** build of torch\n\u003e   — `pip install torch torchvision --index-url https://download.pytorch.org/whl/cpu`. Fast audio,\n\u003e   slower captions.\n\u003e - **Captions on GPU + Whisper on CPU**: `pip uninstall nvidia-cudnn-cu12 nvidia-cublas-cu12`,\n\u003e   install a CUDA torch, and run with `--device cpu`. Fast captions, slower audio.\n\u003e\n\u003e whispergram prints this guidance if it hits the conflict.\n\n**Both on the GPU — two passes (fastest for big batches).** Because runs are resumable, you can do\neach heavy step on the GPU in turn without the two libraries ever colliding:\n\n```bash\n# Pass 1 — transcribe everything on the GPU (CPU torch + faster-whisper GPU), no captions:\nwhispergram \u003cfolders\u003e --out-dir DIR --no-describe\n\n# switch torch to CUDA (captions-on-GPU setup):\npip uninstall -y nvidia-cudnn-cu12 nvidia-cublas-cu12\npip install torch torchvision --index-url https://download.pytorch.org/whl/cu124   # match your CUDA\n\n# Pass 2 — caption on the GPU; Whisper runs on CPU but every transcript is already cached, so it's\n# instant and only the captioning does work:\nwhispergram \u003cfolders\u003e --out-dir DIR --describe-hq --device cpu\n```\n\nBoth expensive stages run on the GPU, the resume cache means no transcription is repeated, and the\ncuDNN clash never happens because only one library touches the GPU per pass.\n\n### Ukrainian / Russian OCR\n\n`--ocr` needs the Tesseract binary (auto-found on Windows since 0.8.2) **and** the language packs.\nOn Windows: `winget install UB-Mannheim.TesseractOCR`, then add the `ukr`/`rus` data — either re-run\nthat installer and tick them, or drop `ukr.traineddata` + `rus.traineddata`\n([tessdata_best](https://github.com/tesseract-ocr/tessdata_best)) into a folder and point\n`TESSDATA_PREFIX` at it. Verify with `tesseract --list-langs`.\n\nIn the **menu**, when OCR is enabled you don't have to remember the codes: it shows a numbered\nshortlist of common languages (English, Ukrainian, Russian, German, French, …) — pick `1,2,3` — and\nlinks the full ~100-language list. On the **CLI**, pass them with `--ocr-lang`, joined by `+`\n(e.g. `--ocr-lang ukr+rus+eng`). The codes are Tesseract's 3-letter names (`ukr`, `rus`, `eng`, `deu`,\n…); the complete list is in [tessdata_best](https://github.com/tesseract-ocr/tessdata_best).\n\n---\n\n## FAQ\n\n**How do I transcribe Telegram voice messages?**\nExport the chat from Telegram Desktop as JSON (with voice messages), then run `whispergram` in the\nexport folder. Every voice note is transcribed with Whisper and merged into the text chat.\n\n**How do I transcribe Instagram DMs / voice messages?**\nDownload your Instagram information (Accounts Center → *Download your information* → **Messages**,\nformat **JSON**, media quality **High**), unzip it, and point whispergram at a single conversation\nfolder under `messages/inbox/` (the one with `message_1.json`). It's auto-detected — **no flag** — and\nInstagram voice notes are transcribed **by default**. Photos/videos/GIFs are described (install\n`whispergram[describe-hq]` for the best captions), and shared Reels appear as link markers.\n\n**Do I have to tell it which platform the export is?**\nNo. whispergram detects Telegram vs Instagram from the folder's files (`result.json` with Telegram's\nschema vs `message_1.json` with Instagram's), so the same command works for both, including in a\nmixed `--menu` scan or a mixed queue.\n\n**Is it private / offline? Does my audio leave my machine?**\nYes. Transcription, captioning and OCR run locally and need no account or API key. The tool makes no\nnetwork calls **with your data** — your chat audio, photos and transcripts never leave your machine.\nThe only network use is a **one-time download of the model weights** (public files) from Hugging\nFace; usage telemetry is **off by default**, and `--offline` forces cache-only with **zero** network\ncalls once the models are downloaded.\n\n**Do I need a GPU?**\nNo. It runs on CPU (`--device cpu`); use `--model large-v3-turbo` for speed. A CUDA GPU is faster.\n\n**Does it handle round video messages / video notes?**\nYes — round `video_message` notes are transcribed from their audio, just like voice notes. Regular\nvideo files are transcribed too with `--video-files`.\n\n**Can it read text from photos / screenshots?**\nYes — `--ocr` runs local Tesseract over photos and drops the extracted text inline as\n`(photo, text): ...` (ideal for screenshots).\n\n**Can it describe what's *in* a photo, not just the text?**\nYes, and it's **automatic**: once you `pip install whispergram[describe]`, photos are captioned by a\nsmall local model (BLIP via transformers — uses your GPU if you have one, else CPU)\nwith no flag needed. It composes with `--ocr` to give both the scene and the in-image text. Captions\nare a short, English, best-effort gist. Pass `--no-describe` to turn it off, or `--describe-model\nSalesforce/blip-image-captioning-base` for a faster/lighter model. The BLIP-large model (~1.9 GB)\ndownloads once on the first photo, then stays offline.\n\n**Can it describe stickers and GIFs too?**\nYes, with `--describe-hq` (`pip install whispergram[describe-hq]`). That switches to a stronger model\n(Qwen2-VL) that's much better on cartoons and *actions*, and it reads GIFs **multi-frame** to catch\nthe motion — e.g. `(animation, described): a character in a suit walking into an arena`. It's heavier\n(~4.4 GB; slow on CPU, fast on GPU), and cartoon/meme captions are still best-effort, never exact.\n\n**Why is some Instagram text garbled in other tools but fine here?**\nInstagram's JSON export double-encodes non-Latin text (mojibake). whispergram repairs the encoding\nso Ukrainian/Russian and emoji read correctly, and merges paginated `message_*.json` files in\nchronological order.\n\n**Which languages work?**\nAny language Whisper supports. `large-v3` handles Ukrainian and Russian well; use `--lang uk` (or\n`ru`, `en`, …) to force one if auto-detection slips.\n\n**How is this different from Telegram Premium's transcription?**\nPremium transcribes one message at a time, by hand, in the app. whispergram transcribes the\n**entire** chat in one pass, offline, and produces a single searchable file — and it also does\nInstagram, which has no built-in transcription at all.\n\n---\n\n## Project Structure\n\n```\nwhispergram/\n├── whispergram.py             # The tool: text reconstruction, mapping, transcription, CLI\n├── requirements.txt           # Runtime dependency (faster-whisper)\n├── pyproject.toml             # Packaging + ruff + pytest configuration\n├── CHANGELOG.md\n├── CONTRIBUTING.md\n├── LICENSE\n├── README.md\n│\n├── .github/\n│   ├── workflows/\n│   │   ├── ci.yml             # ruff + pytest on Python 3.9–3.13 (no transcription deps)\n│   │   └── publish.yml        # tag v* → verify version → build → PyPI (trusted publishing)\n│   ├── ISSUE_TEMPLATE/\n│   └── dependabot.yml\n│\n└── tests/\n    ├── test_whispergram.py    # 128 offline tests — no model download or GPU required\n    └── fixtures/\n        └── sample_export/\n            └── result.json    # synthetic export (safe to commit; used by tests + CI)\n```\n\n---\n\n## ⚠️ Privacy\n\nThis tool processes **private conversations**, and the transcripts it produces are just as\nsensitive as the audio. Two rules:\n\n- **Nothing leaves your machine.** Transcription, captioning and OCR are fully local; the tool makes\n  no network calls with your data and needs no credentials. The only network use is a one-time\n  download of public model weights from Hugging Face — telemetry is off by default, and `--offline`\n  guarantees zero network calls once the models are cached.\n- **Never commit your exports or transcripts.** The included `.gitignore` blocks chat data by default\n  — Telegram (`result.json`, `ChatExport_*/`), Instagram (`your_instagram_activity/`), all audio and\n  media (including `round_video_messages/` and `*_thumb.jpg`), and every `merged_chat.md`. Keep it.\n  Build your repo in a folder **separate** from any export, keep any `--out` path **inside** the\n  export folder, and run `git status` before pushing to confirm only code is staged. The only data\n  file in this repo is the synthetic fixture under `tests/fixtures/`.\n\n---\n\n## Requirements\n\n- Python **3.9+**\n- [ffmpeg](https://ffmpeg.org/) on your PATH\n- [`faster-whisper`](https://pypi.org/project/faster-whisper/) \u003e= 1.0 (`pip install -r requirements.txt`)\n- For NVIDIA GPU on Windows: `nvidia-cublas-cu12`, `nvidia-cudnn-cu12`, `ctranslate2\u003e=4.5`\n- For `--ocr` (optional): the [Tesseract](https://github.com/tesseract-ocr/tesseract) binary on your PATH (with language packs, e.g. `ukr`, `rus`) plus `pip install whispergram[ocr]`\n- For photo descriptions (optional): `pip install whispergram[describe]` (transformers + torch — prebuilt wheels, no compiler; uses your GPU if present). Captioning is then automatic; the ~1.9 GB BLIP-large model downloads once on the first photo, then runs offline. Use `--describe-model Salesforce/blip-image-captioning-base` for a lighter model, or `--no-describe` to turn it off\n- For high-quality captions + sticker/GIF describe (optional): `pip install whispergram[describe-hq]` (adds `torchvision`) and pass `--describe-hq`. Uses Qwen2-VL (~4.4 GB, slow on CPU / fast on GPU)\n\n\u003e The test suite needs none of the above — only `ruff` and `pytest`.\n\n---\n\n## Changelog\n\nSee [CHANGELOG.md](CHANGELOG.md) for the full version history.\n\n## Contributing\n\nSee [CONTRIBUTING.md](CONTRIBUTING.md) for the development setup, the privacy rule, and the\nversioning / release policy.\n\n## License\n\n[MIT](LICENSE)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidmalko87%2Fwhispergram","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdavidmalko87%2Fwhispergram","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdavidmalko87%2Fwhispergram/lists"}