https://github.com/exbuf/peekdocs
Document search and analysis across 100+ file types — offline, private, OCR-enabled, with highlighted reports, regex, Boolean, fuzzy, proximity, wildcard, and search suites. Windows, macOS, Linux. GUI, CLI, and Python API. Free and open-source (MIT).
- Host: GitHub
- URL: https://github.com/exbuf/peekdocs
- Owner: exbuf
- License: mit
- Created: 2026-03-08T01:35:10.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2026-06-24T16:43:28.000Z (7 days ago)
- Last Synced: 2026-06-24T17:12:47.904Z (7 days ago)
- Topics: document-search, docx, file-search, offline, pdf-search, pii-scanner, privacy, python-cli, python-gui, search-tool, text-search
- Language: Python
- Homepage: https://robertdschoening.com
- Size: 30.7 MB
- Stars: 1
- Watchers: 0
- Forks: 1
- Open Issues: 7
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- Contributing: CONTRIBUTING.md
- License: LICENSE
- Security: SECURITY.md
- Notice: NOTICE
👀 peekdocs
Actively maintained — last reviewed June 2026.
🌍 🇪🇸 🇫🇷 🇩🇪 🇯🇵 🇨🇳 🇧🇷 GUI available in 7 languages (partial UI translation). Short introductions in each language follow.
### 🇪🇸 Español
**Tienes archivos. Necesitas encontrar algo en ellos.**
peekdocs es un banco de trabajo de búsqueda local que hace exactamente eso en más de 100 formatos de archivo — Word, PDF, Excel, correo electrónico, documentos escaneados, archivos comprimidos, código fuente — sin subir nada a ningún lugar. GUI, CLI y API de Python. Funciona en Windows, macOS y Linux. Gratuito y de código abierto bajo la Licencia MIT.
Diseñado para personas que prefieren herramientas locales, transparentes y deterministas. Sin nube, sin telemetría, sin llamadas de red.
**Flujo de trabajo típico:** Buscar en una carpeta de documentos de formato mixto → revisar coincidencias en la Vista previa de resultados → generar un informe DOCX o HTML resaltado → guardar la búsqueda → añadirla a un Conjunto de Búsqueda → programarla semanalmente.
**El flujo principal de trabajo está disponible en español** — la pantalla principal, los botones de búsqueda, las opciones de búsqueda avanzada y los mensajes de estado más comunes. Las ventanas de ayuda, los diálogos detallados, los mensajes del CLI y los informes de salida permanecen en inglés más abajo.
*Los términos legales (Licencia MIT, garantía, licencias de dependencias) son vinculantes solo en inglés.*
### 🇫🇷 Français
**Vous avez des fichiers. Vous devez y trouver quelque chose.**
peekdocs est un atelier de recherche locale qui fait exactement cela à travers plus de 100 formats de fichiers — Word, PDF, Excel, e-mail, documents numérisés, archives, code source — sans rien téléverser nulle part. GUI, CLI et API Python. Fonctionne sous Windows, macOS et Linux. Gratuit et open source sous licence MIT.
Conçu pour les personnes qui préfèrent les outils locaux, transparents et déterministes. Pas de cloud, pas de télémétrie, pas d'appels réseau.
**Flux de travail typique :** Rechercher dans un dossier de documents de formats mixtes → examiner les correspondances dans l'Aperçu des résultats → générer un rapport DOCX ou HTML surligné → enregistrer la recherche → l'ajouter à une Suite de recherche → la planifier chaque semaine.
**Le flux de travail principal est disponible en français** — la page principale, les boutons de recherche, les options de recherche avancées et les messages de statut les plus courants. Les fenêtres d'aide, les dialogues détaillés, les messages CLI et les rapports de sortie restent en anglais ci-dessous.
*Les termes juridiques (Licence MIT, garantie, licences des dépendances) font foi uniquement en anglais.*
### 🇩🇪 Deutsch
**Sie haben Dateien. Sie müssen etwas darin finden.**
peekdocs ist eine lokale Such-Werkbank, die genau das über 100+ Dateiformate hinweg leistet — Word, PDF, Excel, E-Mail, gescannte Dokumente, Archive, Quellcode — ohne irgendetwas irgendwohin hochzuladen. GUI, CLI und Python-API. Läuft unter Windows, macOS und Linux. Kostenlos und Open Source unter MIT-Lizenz.
Entwickelt für Menschen, die lokale, transparente und deterministische Werkzeuge bevorzugen. Keine Cloud, keine Telemetrie, keine Netzwerkaufrufe.
**Typischer Arbeitsablauf:** Einen Ordner mit gemischten Dokumenten durchsuchen → Treffer in der Ergebnis-Vorschau prüfen → einen hervorgehobenen DOCX- oder HTML-Bericht erstellen → die Suche speichern → sie zu einer Such-Suite hinzufügen → wöchentlich planen.
**Der Haupt-Arbeitsablauf ist auf Deutsch verfügbar** — die Hauptseite, die Such-Schaltflächen, die erweiterten Suchoptionen und die häufigsten Status-Meldungen. Hilfe-Fenster, detaillierte Dialoge, CLI-Meldungen und Ausgabe-Berichte bleiben auf Englisch weiter unten.
*Rechtliche Bedingungen (MIT-Lizenz, Gewährleistung, Abhängigkeitslizenzen) sind nur in englischer Sprache verbindlich.*
### 🇯🇵 日本語
**ファイルがあります。その中から何かを見つける必要があります。**
peekdocs はまさにそれを行うローカルな検索ワークベンチで、Word、PDF、Excel、メール、スキャンドキュメント、アーカイブ、ソースコードなど 100 以上のファイル形式を、どこにもアップロードせずに検索します。GUI、CLI、Python API として利用できます。Windows、macOS、Linux で動作します。MIT ライセンスの下で無料・オープンソース。
ローカル、透明性のある、決定論的なツールを好む人のために構築されています。クラウドなし、テレメトリーなし、ネットワーク通信なし。
**典型的なワークフロー:** 混合形式のドキュメントフォルダを検索 → 結果プレビューで一致箇所を確認 → ハイライト付きの DOCX または HTML レポートを生成 → 検索を保存 → 検索スイートに追加 → 毎週スケジュール実行。
**主要なワークフロー (メインページ、検索ボタン、詳細検索オプション、一般的なステータスメッセージ) は日本語で利用できます。** ヘルプウィンドウ、詳細ダイアログ、CLI メッセージ、出力レポートは英語のままです。詳細は下の英語版をご覧ください。
*法的条件 (MIT ライセンス、保証、依存ライブラリのライセンス) は英語版のみが正式なものです。*
### 🇨🇳 简体中文
**您有文件。您需要在其中找到某些内容。**
peekdocs 是一款本地搜索工作台,正是为此而生 — 可在 100 多种文件格式中搜索(Word、PDF、Excel、电子邮件、扫描文档、归档、源代码),不会将任何内容上传到任何地方。提供 GUI、CLI 和 Python API。可在 Windows、macOS 和 Linux 上运行。基于 MIT 许可证免费开源。
为偏好本地、透明、确定性工具的人士而构建。无云端、无遥测、无网络调用。
**典型工作流程:** 搜索混合格式的文档文件夹 → 在结果预览中查看匹配项 → 生成高亮显示的 DOCX 或 HTML 报告 → 保存搜索 → 将其添加到搜索套件 → 安排每周运行。
**主要工作流程已提供简体中文版本 — 主页面、搜索按钮、高级搜索选项以及最常见的状态消息。** 帮助窗口、详细对话框、CLI 消息和输出报告仍为英文。详细信息请参见下方英文版。
*法律条款(MIT 许可证、保修、依赖项许可)仅以英文版本为准。*
### 🇧🇷 Português brasileiro
**Você tem arquivos. Você precisa encontrar algo neles.**
O peekdocs é uma bancada de trabalho de busca local que faz exatamente isso em mais de 100 tipos de arquivos — Word, PDF, Excel, e-mail, documentos digitalizados, arquivos compactados, código-fonte — sem enviar nada para lugar nenhum. GUI, CLI e API Python. Funciona em Windows, macOS e Linux. Software livre e de código aberto sob a Licença MIT.
Feito para quem prefere ferramentas locais, transparentes e determinísticas. Sem nuvem, sem telemetria, sem chamadas de rede.
**Fluxo de trabalho típico:** Pesquisar uma pasta de documentos de formatos mistos → inspecionar correspondências na Pré-visualização de Resultados → gerar um relatório DOCX ou HTML destacado → salvar a pesquisa → adicioná-la a um Conjunto de Pesquisa → agendá-la semanalmente.
**O fluxo de trabalho principal está disponível em português brasileiro — página principal, botões de pesquisa, Opções Avançadas de Pesquisa e as mensagens de status mais comuns.** Janelas de ajuda, diálogos detalhados, mensagens do CLI e relatórios de saída permanecem em inglês. Veja abaixo a versão em inglês para detalhes completos.
*Os termos legais (Licença MIT, garantia, licenciamento de dependências) são oficialmente válidos apenas em inglês.*
### You have files. You need to find something in them.
peekdocs is a local search workbench that does exactly that across 100+ file types — Word, PDF, Excel, email, scanned documents, archives, source code — without uploading anything anywhere. GUI, CLI, and Python API. Runs on Windows, macOS, and Linux. Free and open-source under the MIT License.
Built for people who prefer private, transparent, deterministic tools. No cloud, no telemetry, no network calls.
**Typical workflow:** Search a folder of mixed-format documents → inspect matches in the Results Preview → generate a highlighted DOCX or HTML report → save the search → add it to a Search Suite → schedule it weekly.
## Watch peekdocs in action

*A ~46-second walkthrough as a looping GIF: peekdocs searches for `budget` across a 10,411-file folder and reports back in 3.17 seconds\*, with matches highlighted in yellow in the preview pane. The clip then opens the **File Types** and **Categories** charts to show the breadth of what was searched in that single pass — PDFs, Word and Excel docs, slides, emails, e-books, OCR'd images, archives, source code, and plain text. \* MacBook M4 Pro*
Free · Open-Source (MIT License) · No Cloud · Private · Easy to Use
Windows · macOS · Linux | GUI · CLI · Python API
## Feature Highlights
A workbench for document collections: search them, characterize them through built-in analysis tools, produce highlighted reports, monitor folders live via `--watch`, and drive it all through whichever interface fits — GUI, CLI, or Python API.
- **100+ file types in one query** — Word, PDF, Excel, email, source code, archives, scanned PDFs (OCR), and more, searched simultaneously.
- **Local-only by design** — no network calls, no telemetry, no account; runs with your normal user permissions on Windows, macOS, and Linux.
- **Search depth beyond grep** — 20-form Search Wizard, regex collections, Boolean / fuzzy / proximity / inverse / range, plus a long-running `--watch` mode for live folder monitoring.
- **Built-in analysis and reporting** — Duplicate Finder, File Inventory, Age Distribution, Change Tracking; highlighted reports in DOCX / HTML / PDF and machine-readable CSV / JSON / NDJSON.
- **Repeatable workflows** — Saved Searches, Search Suites, Regex Collections, Schedule Search, Search History, and Diff Snapshots compose into one workflow system.
- **Same engine across GUI, CLI, and Python API** — schemas are shared, so a search you build in the GUI today drives from a Python script or cron job tomorrow with identical results.
- **Polished GUI** — yellow-highlighted matches in the preview and the reports, tooltips on every control, dark/light/system theme, adjustable text size, and contextual `?` help popups throughout.
- **Works in any language** — GUI workflow translated into 7 languages (partial; contributions reviewed by native speakers are welcome). Like most modern search tools, peekdocs supports Unicode-based exact-character matching for searching documents in any language (no stemming or word segmentation; works equally for English prose, Chinese text, code identifiers, account numbers).
*Detail and caveats on each capability live in the [Features](#features) section below.*
> **Local-only by design.** No network calls, no telemetry, no cloud, no account. peekdocs runs entirely on your machine with your normal user permissions — no admin or root required, and it works fine on air-gapped systems with no internet connection.
> **Why local?** Most people have at least some documents they would rather not hand to a third party — drafts, work-in-progress, personal correspondence, financial paperwork. peekdocs is local-only because that's the only way the answer to "where does this go?" stays "nowhere — it stayed on my machine." The tradeoff is real: peekdocs doesn't summarize, doesn't answer questions about your documents, doesn't infer meaning. Those are jobs cloud AI tools do well; peekdocs is for finding exact text in a lot of files, repeatably, on your own machine.
> **Transparency over magic.** If a file wasn't searched, peekdocs tells you why. If OCR couldn't extract text, you'll know. If a report was created, you'll know where it is. peekdocs favors observable behavior over hidden processing.
**Quick install**
1. **No Python?** [Download the standalone app](#option-a-standalone-download-no-python-needed) — the GUI and CLI binaries are separate downloads; pick what you need.
2. **Have Python 3.10+?** A single command installs everything — the GUI, the CLI, and the Python API:
```bash
pipx install git+https://github.com/exbuf/peekdocs.git
```
*(Already installed? Upgrade with `pipx upgrade peekdocs`.)*
See [Installation](#installation) below for per-platform notes, the `pip` alternative, upgrade, and uninstall.
> **Windows tip:** if this fails with an SSL / SNI / certificate error in **Command Prompt**, try the same command in **PowerShell** instead. See [docs/INSTALLATION.md → Windows cmd.exe SSL / SNI / certificate errors](docs/INSTALLATION.md#windows-cmd-ssl) for the diagnosis and fix.
**What running peekdocs looks like:**
```bash
# Search from the terminal — peekdocs searches the current directory,
# so cd to the folder you want first
cd ~/Documents
peekdocs "budget"
# Found 47 match(es) in 12 file(s). Files searched: 238 (142.50 MB).
# 2024_tax_return_summary.pdf: 8
# quarterly_report_Q1.docx: 6
# vendor_contract_2024.pdf: 5 ...
# Search with the GUI
peekdocs-gui
```
```python
# Search from the Python API: pass a real path (no shell ~ expansion here)
import os
from peekdocs import search

results = search(["budget"], directory=os.path.expanduser("~/Documents"))
for match in results.matches:
    print(f"{match.filename}:{match.line_num} {match.text}")
```
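The CLI output above lists a match count per file. A minimal sketch of building the same tally from API results, assuming only the `filename` attribute shown in the example. `count_by_file` is a hypothetical helper, not part of peekdocs, and the stand-in objects replace a real `search()` call so the snippet runs without peekdocs installed:

```python
from collections import Counter
from types import SimpleNamespace


def count_by_file(matches):
    """Tally matches per file, most matches first (hypothetical helper)."""
    return Counter(m.filename for m in matches).most_common()


# Stand-in objects with the same attribute names as the example above;
# with peekdocs installed you would pass results.matches instead.
sample = [
    SimpleNamespace(filename="a.pdf", line_num=3, text="budget 2024"),
    SimpleNamespace(filename="b.docx", line_num=9, text="draft budget"),
    SimpleNamespace(filename="a.pdf", line_num=7, text="budget review"),
]

for name, count in count_by_file(sample):
    print(f"{name}: {count}")
```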
## Contents
- [Watch peekdocs in action](#watch-peekdocs-in-action)
- [Feature Highlights](#feature-highlights)
- [CLI at a Glance](#cli-at-a-glance)
- [Who Is It For?](#who-is-it-for)
- [Features](#features)
- [Supported File Types](#supported-file-types)
- [Installation](#installation)
- [Prerequisites](#prerequisites)
- [Quick Start](#quick-start)
- [Documentation](#documentation)
- [Why peekdocs?](#why-peekdocs)
- [What peekdocs Is Not](#what-peekdocs-is-not)
- [Performance](#performance)
- [Platform Notes](#platform-notes)
- [Preparing Documents](#preparing-your-documents-for-searching)
- [Questions and troubleshooting](#questions-and-troubleshooting)
- [Glossary](#glossary)
- [For IT and Security Teams](#for-it-and-security-teams)
- [Testing](#testing)
- [Contributing](#contributing)
- [Author](#author)
- [Disclaimer](#disclaimer)
- [License](#license)
## CLI at a Glance
```bash
# Recursive search for "budget"
peekdocs -r budget
# Preflight: how many files would this search touch, and how big?
peekdocs --dry-run -r ~/Documents budget
# Regex pattern, piped through jq for the match count
peekdocs --stdout -x "\d{3}-\d{4}" | jq '.matches_found'
# Run a saved Search Suite by name
peekdocs --suite "Code hygiene"
```
`peekdocs -h` shows every flag, file type, and regex pattern. The [User Guide](docs/USER_GUIDE.md) covers the CLI in full.
> **Pointing peekdocs at your whole home directory or `/` is slow** — even with `--dry-run`. Tree walks across `~/Library`, every git repo, every `node_modules`, every Python venv, and every browser cache can easily mean hundreds of thousands of files; the enumeration phase alone can run 5–10+ minutes before any content is read. Press **Ctrl+C** to cancel at any time. Narrow the path (`peekdocs -r ~/Documents budget`) or restrict file types (`peekdocs -r -t pdf,docx,xlsx ~ budget`) to cut the corpus to seconds. During long runs, peekdocs prints `Scanning files (this may take a while on large folders)...` to stderr while enumerating, then switches to a live `[██░░] 12345/89201 file.pdf` progress bar once content reads begin.
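To get a feel for why narrowing the path matters, the corpus size can be estimated independently of peekdocs. The sketch below walks a tree with the standard library and prunes a few heavy directory names; the `SKIP_DIRS` set and the `preflight` helper are illustrative assumptions, not peekdocs behavior or options:

```python
import os
import tempfile

# Directory names to prune; picked for illustration from the examples above.
SKIP_DIRS = {"node_modules", ".git", ".venv", "Library"}


def preflight(root, skip_dirs=SKIP_DIRS):
    """Return (file_count, total_bytes) under root, skipping pruned dirs."""
    count = total = 0
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            try:
                total += os.path.getsize(os.path.join(dirpath, name))
                count += 1
            except OSError:
                continue  # unreadable or vanished file; skip it
    return count, total


# Demo on a throwaway tree: one document plus a pruned node_modules folder.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "docs"))
    os.makedirs(os.path.join(root, "node_modules", "pkg"))
    with open(os.path.join(root, "docs", "a.txt"), "w") as fh:
        fh.write("budget")
    with open(os.path.join(root, "node_modules", "pkg", "b.js"), "w") as fh:
        fh.write("x" * 1000)
    demo_count, demo_bytes = preflight(root)
    print(demo_count, demo_bytes)
```

For the real preflight, `peekdocs --dry-run` remains the authoritative answer, since it applies peekdocs's own file-type and size rules.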
## Who Is It For?
peekdocs is built for anyone who has files and needs to find something in them, across many kinds of files at once (Word, PDF, Excel, email, scanned documents, archives, and the rest of the 100+ supported types), entirely on your own computer.
**A few examples of what people could do with it:**
- **Home user** — Find a tax document from any of the last seven years across mixed folders.
- **Office worker** — Find all invoices over $10,000 from 2024. *(fully worked, GUI and CLI, in [User Guide → Example 8](docs/USER_GUIDE.md#example-8-real-world-workflow--invoices-over-10000-from-2024))*
- **IT consultant** — Search a folder of client documents for a set of terms.
- **Sysadmin** — Search 20 GB of log files for a request ID across mixed archives.
- **Developer** — Run a regex collection against a source tree and generate JSON.
- **Engineer** — Search 200 datasheets for a part number across PDFs and scanned drawings.
- **Researcher** — Search 3,000 PDFs and export highlighted results.
- **Small business owner** — Find vendor contracts expiring in the next 90 days.
*The audiences and scenarios above describe possible uses of peekdocs. peekdocs is provided "as is" under the [MIT License](LICENSE), without warranty of any kind, express or implied.*
### What makes peekdocs distinctive
The combination of **local + privacy-first + grep-like power + OCR + regex workflows + reporting + automation** across heterogeneous document collections is unusual. peekdocs delivers all of them in one tool.
Detailed use cases by role:
- **Home users** — tax returns, insurance policies, receipts, warranties, estate documents, email archives. Once installed, type your keyword(s), click Run Standard Search, done. No configuration, no manual.
- **Small businesses** — find information across contracts, invoices, reports, and correspondence. Save searches by name and reload them later. Search across vendor contracts for specific terms, pricing, or expiration dates.
- **Documentation teams and tech writers** — search for outdated references, inconsistent terminology, deprecated product names, or specific version numbers across an entire documentation set. Verify consistency across Word docs, PDFs, HTML exports, and Markdown files in a single search.
- **Researchers** — search across hundreds of downloaded journal articles (PDF), interview transcripts, survey responses, field notes, and datasets for a specific term, author, citation, or data point. OCR reads scanned source materials and historical documents. The highlighted Word report doubles as an annotated bibliography.
- **Engineers** — search hundreds of datasheets, design reviews, test reports, and failure analyses for a specific component value, part number, or tolerance. Find which documents reference a standard (MIL-STD-810, IEC 61508, ISO 9001). Search old design reviews and trade studies to find why a decision was made years ago. Locate error codes and symptoms across equipment manuals and maintenance logs. OCR reads scanned engineering drawings and handwritten notes. The highlighted Word report can be attached to a design review or emailed directly. Supported engineering formats: .m (MATLAB), .v .vhd .vhdl .sv (Verilog/VHDL/SystemVerilog), .cir .sp .spice (SPICE netlists), .dxf (AutoCAD interchange), .vsdx (Visio diagrams), .cmake (CMake build files)
- **Data researchers** — search hundreds of CSV and Excel files for a specific value, account number, or outlier. Cross-reference interview transcripts, survey responses, and field notes for the same keyword to triangulate findings. Literature review: search 500 downloaded PDFs for a method name, author, or statistical technique. Find which analysis scripts reference a specific dataset, parameter, or threshold.
- **AI/ML engineers** — search training logs for specific metrics, hyperparameters, or error messages across experiment runs. Find every reference to a model name, checkpoint path, or dataset version across scripts, configs, and documentation. peekdocs reads Jupyter notebooks (`.ipynb`), JSONL training data (`.jsonl`), Scala Spark pipelines (`.scala`), and all common config formats. Search across READMEs, docstrings, and markdown files for outdated model names or deprecated API versions.
- **Programmers** — peekdocs covers the documents that live outside the source tree: legacy specs and requirements in Word/PDF, email archives from past projects, vendor documentation and SDK guides in PDF, archived releases inside `.zip` / `.7z` files, scanned whiteboard photos (OCR), old project logs and meeting notes. A developer who needs to find *"what did the client say about the authentication requirement in 2019"* can pull the answer out of a `.docx` email attachment buried in a `.zip` archive without unpacking anything. One pipx command and you're running in seconds — CLI, GUI, or Python API (see [Option B](#option-b-quick-install-with-pipx-for-python-users)).
Also useful for **searching across entire codebases** — find every file that references a function, variable, endpoint, or error message in all source files across all folders at once. Use Lines Before/After to see the full function or block surrounding each match, not just the matching line. peekdocs handles 40+ source-code and shell-script extensions; see [Supported File Types](#supported-file-types) for the full list.
- **More for programmers** — find every TODO, FIXME, and HACK across all your projects at once, not just the one open in your IDE. Pre-upgrade audit: search all repos for a deprecated API or library before upgrading. Search log files for error patterns or request IDs across gigabytes of `.log` files. Search config files (`.yaml`, `.toml`, `.json`, `.ini`, `.properties`, `.conf`) and build files (`.gradle`, `.cmake`) to find where a setting, port, or environment variable is referenced. Multi-repo search: point peekdocs at a parent folder containing all your repos and search everything at once.
- **Email archives** — search exported email files (.eml, .msg, .pst, .mbox) for old correspondence, attachments, and contacts. peekdocs reads each format natively.
The full per-feature breakdown lives in the **[Features](#features)** section below: search modes, reporting, analysis tools, automation, privacy. The [Feature Highlights](#feature-highlights) up top is the executive summary; the Features section is the detailed reference.
## Features
peekdocs has **three search modes**, each writing its own clearly named family of reports next to your documents, so reports from different modes never collide:
| Mode | How to run | Reports |
|------|-----------|---------|
| **Standard Search** | Blue **Run Standard Search** button on the main screen, or `peekdocs <terms>` | `peekdocs_standard_results.{txt,docx,csv,json,pdf,html}` |
| **Regex Search** | Orange **Regex Search** button on the main screen (opens the regex popup; its own Run Regex Search button executes the collection), or `peekdocs --regex-collection NAME` | `peekdocs_regex_results.{txt,docx}` |
| **Suite** (group of saved searches) | Green **Search Suites** button on the main screen (opens the suite popup; its own Run Search Suite button executes the selected suite), or `peekdocs --suite NAME` | `peekdocs_suite_results.{txt,docx,html,csv,json}` |
> *The "mode" is the workflow, not the flag set. A one-off `peekdocs -x "pattern"` (or `-z`, `-w`, `-W`) is a Standard Search with a regex/fuzzy/wildcard flag and writes `peekdocs_standard_results.*`. Only the dedicated Regex Search workflow — the GUI popup or `--regex-collection` — produces `peekdocs_regex_results.*`.*
All three share the same engine, flags, and 100+ file-type support. The matching `peekdocs_<mode>_results.*` naming means a Regex run never overwrites a Standard run (and vice versa), and `peekdocs --clear` / **Clear Files** can find them by prefix. Within a mode, each run overwrites the previous report; add `--timestamp` (CLI) or check **Timestamp** in Advanced Search Options (GUI) to append `_YYYYMMDD_HHMMSS` so every run is preserved. The **Schedule Search** dialog enables timestamping by default for cron / Task Scheduler use.
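As an illustration of the `_YYYYMMDD_HHMMSS` suffix, here is a small sketch of how such a name could be formed. The `timestamped_name` helper is hypothetical, and placing the suffix just before the file extension is an assumption; the text above only says the suffix is appended:

```python
from datetime import datetime
from pathlib import Path


def timestamped_name(filename, when=None):
    """Insert _YYYYMMDD_HHMMSS before the extension (hypothetical helper)."""
    when = when or datetime.now()
    path = Path(filename)
    return f"{path.stem}_{when.strftime('%Y%m%d_%H%M%S')}{path.suffix}"


example = timestamped_name(
    "peekdocs_standard_results.docx", datetime(2026, 6, 24, 16, 43, 28)
)
print(example)  # peekdocs_standard_results_20260624_164328.docx
```

Because the fields run from year down to second, sorting such names alphabetically also sorts the runs chronologically.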
> **Naming convention — no exceptions.** Every file peekdocs creates uses the `peekdocs_` prefix (visible outputs like the reports above, the error log, tools-menu outputs, and release binaries — which use the dash variant `peekdocs-`) or the `.peekdocs` prefix (hidden user-state / per-folder dotfiles: `~/.peekdocsrc`, `~/.peekdocs_history.json`, `.peekdocs_collection.json`, `.peekdocs.db`, etc.). Anything in your folders that doesn't start with one of these two prefixes was not created by peekdocs. For the per-file inventory — what each file contains, sensitivity rating, and how to clean it up — see [docs/SECURITY.md](docs/SECURITY.md).
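The convention makes peekdocs output easy to spot with a few lines of standard-library Python. This is a sketch under the prefixes stated above; `peekdocs_files` is a hypothetical helper, and for actual cleanup the built-in `peekdocs --clear` / **Clear Files** is the supported route:

```python
import os
import tempfile

# Prefixes named in the convention above.
PREFIXES = ("peekdocs_", "peekdocs-", ".peekdocs")


def peekdocs_files(folder):
    """Return sorted names in folder that start with a peekdocs prefix."""
    return sorted(
        entry.name
        for entry in os.scandir(folder)
        if entry.is_file() and entry.name.startswith(PREFIXES)
    )


# Demo on a throwaway folder with two peekdocs-style names and one other file.
with tempfile.TemporaryDirectory() as folder:
    for name in ("peekdocs_errors.log", ".peekdocs.db", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    found = peekdocs_files(folder)
    print(found)
```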
#### Search & discovery
- **100+ file types** — Word, PDF, Excel, PowerPoint, emails (.eml, .msg, .pst, .mbox), archives (.zip, .7z, .rar), source code (Python, C/C++, Java, Go, Rust, and more), engineering files (MATLAB, Verilog, VHDL, SPICE, DXF, Visio), Apple Pages/Numbers/Keynote, calendars (.ics), contacts (.vcf), e-books, HTML, and more. **Note:** `.pst` requires `libpff-python` (no Windows wheel) and `.rar` requires the `unrar` tool — see [Prerequisites](#prerequisites)
- **Search modes** — plain keywords, AND/OR, Boolean expressions, regex, wildcards, fuzzy matching, whole-word, word proximity, line proximity
- **Range queries** — filter by dollar amounts, dates, percentages, ages, file sizes
- **OCR** — search scanned PDFs and images (requires Tesseract)
- **Multi-folder search** — search across multiple folders at once, with optional recursive searching into subfolders. Click **+Folder** to add folders, or type semicolon-separated paths. Results are combined from all folders
- **Inverse search** — find files that are *missing* required content
- **Search Wizard** — guided search builder with 20 pre-built search types (phone, email, dollar range, date range, Boolean, fuzzy, and more) plus a regex pattern builder with 35 named patterns across 6 categories — no flags or regex knowledge needed
- **▶ Save / ▶ Reload** — save a configured search by name and reload it later with one click
- **Recent searches** — your last 10 searches are remembered for re-use. Each entry captures the **FULL** search context (terms + folder + every Advanced Search Options setting), so selecting one from the **▼ Recent** popup restores all of those in one click. With the search bar focused, press **↑** / **↓** to walk through the same list — the arrow shortcut copies only the search-terms text into the bar (leaving your current Advanced options untouched), so use the arrows when you want to reuse the wording with the current settings, and the **Recent** popup when you want the whole configuration back. **▶ Save** is for keeping a configuration permanently under a name, beyond the 10-entry rolling Recent window
- **Search index** — optional SQLite FTS5 index for faster repeated searches
- **Works in any language** — Unicode-based text handling; searches documents in any language with exact character-sequence matching (no stemming or word segmentation). Documentation is English-only. The GUI ships partial UI translation in seven languages (English, Español, Français, Deutsch, 日本語, 简体中文, Português brasileiro) for the search workflow (see *Works in any language* in the Feature Highlights above), but help popups, dialogs, the CLI banner, and reports remain English. The PDF report uses a Latin-1 font, so non-Latin text shows as `?` in `.pdf` only; use `.docx`, `.html`, `.txt`, `.json`, or `.csv` for non-Latin content.
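What "exact character-sequence matching" and the Latin-1 caveat mean in practice can be sketched in plain Python. This is not peekdocs's matcher: `find_exact` and `fits_latin1` are hypothetical helpers, and the sketch is case-sensitive, which is an assumption rather than documented peekdocs behavior:

```python
def find_exact(needle, text):
    """Return every start offset of needle in text (exact code points)."""
    offsets, start = [], text.find(needle)
    while start != -1:
        offsets.append(start)
        start = text.find(needle, start + 1)
    return offsets


def fits_latin1(text):
    """True if every character can be encoded in Latin-1."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


# No word segmentation needed: the sequence is found inside running text.
print(find_exact("预算", "2024年预算报告和预算表"))  # [5, 10]
# No stemming: "run" matches inside "running" but never matches "ran".
print(find_exact("run", "She ran; they are running."))  # [18]
print(fits_latin1("budget café"), fits_latin1("预算"))  # True False
```

A check like `fits_latin1` is one way to decide up front whether the `.pdf` report is a sensible output format for a given set of search terms.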
#### Reporting
- **Highlighted reports** — results saved to `.docx` and `.pdf` with yellow-highlighted matches, `.txt` with full context, and optional CSV and JSON output
- **Results preview** — see matches inline in the GUI with highlighted terms. **View Text** on any matched file shows the file's full extracted text with every match highlighted, without opening external software. Double-click any file to open in its native application; click **DOCX**, **HTML**, or **PDF** to open the highlighted multi-file report
- **HTML export** — no Word or LibreOffice? Enable HTML output and the highlighted report opens in any browser. The file is stored locally — nothing is uploaded, and it's easy to share by email
- **Desktop notification on complete** — opt-in checkbox in Advanced Search Options. When a Standard / Suite / Regex run finishes, fires a native desktop notification (macOS Notification Center, Windows toast, Linux libnotify) with the match count, file count, and elapsed time. Suppressed when the peekdocs window is focused — if you can already see the result, no notification fires. No data leaves the machine
#### Analysis
- **Collection Summary** — one-page consolidated overview of the search folder: total file count and size, oldest/newest file, top file types, age histogram, top 10 largest files, recent-activity counts, unsearchable breakdown, and empty-file count — all in a single fast pass
- **File Inventory** — instant summary of every file in a folder: total count, size breakdown by type, oldest and newest files
- **Duplicate Finder** — finds identical files by content (not just name), shows how much space is wasted by extra copies
- **Large Files** — shows the 50 biggest files so you can reclaim disk space
- **Empty Files** — finds zero-byte files: failed downloads, placeholders, junk
- **File Age Distribution** — histogram of how recently files were modified, in six buckets from 0–6 months out to 10+ years. Useful for archives, document collections, and personal files — surfaces stale folders at a glance and shows what fraction of a collection is recent activity vs. long-untouched material
- **Recent Changes** — which files were modified in the last 7, 30, or 90 days
- **Protected Files** — detects password-protected PDFs, Word/Excel/PowerPoint, ZIP/7z/RAR archives that peekdocs can't search
- **Unsearchable Files** — categorizes every file peekdocs cannot search (unsupported types, oversized, empty, hidden / OS metadata, peekdocs-created) with counts and per-category file lists. Answers "what fraction of this folder is even searchable?" before you run a search
- **Bookmarks** — pin files from search results for quick access later
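Finding duplicates "by content, not just name" usually comes down to hashing file bytes. The following sketch shows that general technique with SHA-256. It is not taken from the peekdocs source, and `file_digest` and `find_duplicates` are hypothetical names:

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files are never fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Return lists of relative paths whose contents are byte-identical."""
    by_digest = defaultdict(list)
    for dirpath, _dirs, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(os.path.relpath(path, root))
    return sorted(sorted(group) for group in by_digest.values() if len(group) > 1)


# Demo: two files share content under different names, one differs.
with tempfile.TemporaryDirectory() as root:
    for name, body in (("a.txt", "same"), ("copy.txt", "same"), ("b.txt", "other")):
        with open(os.path.join(root, name), "w") as fh:
            fh.write(body)
    groups = find_duplicates(root)
    print(groups)  # [['a.txt', 'copy.txt']]
```

On large collections, grouping by file size first and hashing only files that share a size avoids reading most of the data.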
#### Automation & integration
- **Search Suites** — group saved searches into a named suite and run them all at once (green **Search Suites** button on the main screen)
- **Repeatable workflows** — Saved Searches, Search Suites, Regex Collections, Schedule Search, Search History, and Diff Snapshots compose into a workflow system: define a search by name; group related searches into a suite; reuse pattern sets via Regex Collections; schedule a suite to run on a cadence; audit every run via Search History; compare today's run against last week's via Diff Snapshots.
- **Search History** — automatic diary of every search you run: date, terms, match count, file count, elapsed time
- **Diff Snapshots** — compare two saved scans to see what files are new, changed, removed, or unchanged between them
- **Schedule Search** — generates a ready-to-paste cron (Mac/Linux) or Task Scheduler (Windows) command to run any saved search suite or regex collection on a schedule. Step-by-step instructions walk you through pasting it into the scheduler
- **Indexes** — build, refresh, or delete the optional search index that makes repeated searches dramatically faster
- **Three interfaces** — terminal CLI, point-and-click GUI (`peekdocs-gui`), Python API
- **Cross-platform** — Windows, macOS, Linux
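The four categories that Diff Snapshots reports (new, changed, removed, unchanged) can be illustrated with two dictionaries. This is a conceptual sketch only: the `{path: content_hash}` snapshot shape and the `diff_snapshots` helper are assumptions for illustration, not peekdocs's snapshot format:

```python
def diff_snapshots(old, new):
    """Classify paths given two {path: content_hash} snapshots."""
    shared = old.keys() & new.keys()
    return {
        "new": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(p for p in shared if old[p] != new[p]),
        "unchanged": sorted(p for p in shared if old[p] == new[p]),
    }


# Hypothetical snapshots; the hash strings are placeholders.
last_week = {"a.pdf": "h1", "b.docx": "h2", "c.txt": "h3"}
today = {"a.pdf": "h1", "b.docx": "h9", "d.xlsx": "h4"}

report = diff_snapshots(last_week, today)
for category, paths in report.items():
    print(f"{category}: {paths}")
```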
#### Privacy & transparency
- **Offline and private** — your documents never leave your computer. peekdocs never uploads, transmits, alters, moves, or deletes your files. No cloud, no accounts, no subscriptions. Everything runs locally and stays local
- **Read-only** — peekdocs never modifies, moves, or deletes your files. It does create its own output files (reports, indexes, settings) and can delete those when you ask (e.g., Tools → Clear Files, Tools → Indexes → Delete Index(es))
- **Delete on Close** — one checkbox automatically deletes every result file and the search index across the session when you close peekdocs. Saved reports, saved searches, settings, and bookmarks are preserved
- **Safe defaults** — files over 100 MB are skipped automatically to prevent slow searches and memory issues; archives that would expand past 500 MB are skipped to prevent archive bombs. Adjust **Max File Size** in Advanced Search Options or set it to 0 for no limit
- **Excluded Files view** — after each search, see exactly which files were skipped and why (unsupported type, oversized, hidden, etc.) — no guessing what was missed
- **Error Log** — opens `peekdocs_errors.log` to see any files that couldn't be read and why (corrupt, locked, password-protected, etc.)
- **Clear Files** — selectively delete peekdocs's output files (reports, error log, saved searches, index) from the current folder
- **Clean Folder** — same idea for any other folder, in case peekdocs files were generated elsewhere
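The size limits under **Safe defaults** can be pictured with a short sketch. This is not peekdocs's actual code: the function names and constants are assumptions chosen to mirror the 100 MB and 500 MB defaults described above, and only `.zip` archives are handled. It reads the uncompressed sizes declared in the archive's central directory, so nothing is extracted to make the decision.

```python
import os
import zipfile

MAX_FILE_BYTES = 100 * 1024 * 1024      # assumed to mirror the 100 MB default
MAX_EXPANDED_BYTES = 500 * 1024 * 1024  # assumed to mirror the 500 MB archive cap


def declared_expanded_size(zip_path):
    """Sum the uncompressed sizes listed in a zip's central directory."""
    with zipfile.ZipFile(zip_path) as zf:
        return sum(info.file_size for info in zf.infolist())


def should_skip(path, max_file=MAX_FILE_BYTES, max_expanded=MAX_EXPANDED_BYTES):
    """Return a reason string if the file should be skipped, else None.

    A max_file of 0 means "no limit", like the Max File Size setting.
    """
    if max_file and os.path.getsize(path) > max_file:
        return "oversized"
    if zipfile.is_zipfile(path) and declared_expanded_size(path) > max_expanded:
        return "archive expands past limit"
    return None
```

Declared sizes can be forged, so a hardened implementation would also count bytes while extracting and stop once the cap is exceeded.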
### Supported File Types
| Category | Formats |
|----------|---------|
| **Documents** | .doc .docx .epub .html .key .md .odp .odt .pages .pdf .ppt .pptx .rst .rtf .tex |
| **Spreadsheets** | .csv .numbers .ods .tsv .xls .xlsx |
| **Email** | .eml .mbox .msg .pst (`.pst` requires `libpff-python` — no Windows wheel; see [Troubleshooting](docs/TROUBLESHOOTING.md)) |
| **Archives** | .7z .bz2 .gz .rar .tar .tgz .zip (`.rar` requires the `unrar` tool — see [Prerequisites](#prerequisites)) |
| **Calendar/Contacts** | .ics .vcf |
| **Source Code** | .asm .bat .c .cmake .cpp .cs .css .f .f90 .go .gradle .h .hpp .java .js .kt .lua .pl .ps1 .py .r .rb .rs .s .scala .scss .sh .swift .tcl .ts .vb |
| **Engineering** | .cir .dxf .m .sp .spice .sv .v .vhd .vhdl .vsdx |
| **Data/Config** | .cfg .conf .dockerfile .env .graphql .gql .ini .json .jsonl .log .makefile .ndjson .properties .proto .sql .tf .toml .txt .xml .yaml .yml |
| **Notebooks** | .ipynb (Jupyter) |
| **Images (OCR)** | .bmp .jpg .jpeg .png .tif .tiff (requires `-O` flag) |
**Note:** Apple Numbers (.numbers) and Keynote (.key) files created with recent versions of iWork use a protobuf-based internal format. peekdocs extracts whatever readable text exists inside these files, which may be partial. Older iWork files extract fully. Apple Pages (.pages) is fully supported.
## Installation
[Prerequisites](#prerequisites) · [Option A: Standalone Download](#option-a-standalone-download-no-python-needed) · [Option B: pipx (for Python users)](#option-b-quick-install-with-pipx-for-python-users) · [Upgrading](#upgrading)
> **Cautious about installing?** See [docs/INSTALL_SAFETY.md](docs/INSTALL_SAFETY.md) — plain-English explanation of what peekdocs does and doesn't do, what the SmartScreen / Gatekeeper warnings actually mean, and five ways to verify the download yourself before you run it (checksum match, VirusTotal scan, network monitor, source-code grep, sandbox install).
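
The checksum match mentioned above takes only a few lines of Python. This is a minimal sketch: `sha256_of` and `matches_published` are helper names chosen here, and the expected digest is a placeholder you would copy from wherever the project publishes its checksums (that location is not stated in this README).

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so a large download need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """Compare digests case-insensitively, ignoring surrounding whitespace."""
    return sha256_of(path) == expected_hex.strip().lower()
```

Call `matches_published("peekdocs-gui-linux", published_digest)` with the digest string you copied; a `False` result means the file differs from the published one and should not be run.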
### Prerequisites
*Using Option A (standalone download)? Skip this section — no prerequisites needed.*
| Requirement | Why | How |
|---|---|---|
| **Python 3.10+** | Required for Option B and source install | macOS: `brew install python` (or [python.org](https://www.python.org/downloads/)). Windows: [python.org](https://www.python.org/downloads/), check "Add Python to PATH". Linux: `sudo apt install python3-venv python3-pip python3-tk`. Per-platform deep dives in [docs/INSTALLATION.md](docs/INSTALLATION.md) |
| **Tkinter** | GUI only (CLI works without it) | Windows: included. macOS Homebrew: `brew install python-tk@`. Linux: covered by `python3-tk` above |
| **pipx** | Recommended over `pip` for Option B | `pip install pipx` (Windows) · `brew install pipx` (macOS) · `sudo apt install pipx` (Linux). Then `pipx ensurepath` and reopen your terminal |
| **Tesseract** (optional) | OCR for scanned PDFs and images | `brew install tesseract` · Windows [installer](https://github.com/UB-Mannheim/tesseract/wiki) · `sudo apt install tesseract-ocr` |
| **UnRAR** (optional) | Search inside `.rar` archives | `brew install unrar` · WinRAR · `sudo apt install unrar` |
| **libpff-python** (optional) | Search inside Outlook `.pst` archives (no Windows wheel) | macOS/Linux: `pip install libpff-python`. Windows: convert `.pst` to `.mbox` — see [TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md) |
**Everything else installs automatically.** `pipx install` (or `pip install`) downloads the 18 Python libraries peekdocs needs (PDF reader, Word/Excel/PowerPoint parsers, email reader, and more) plus their transitive dependencies — typically around 200 packages and a few hundred megabytes of disk space. See [Dependencies](docs/USER_GUIDE.md#dependencies) for the full list and what each one does.
### Option A: Standalone Download (no Python needed)
Pick this if you don't have Python installed or don't want to install it. No setup — just download and run. (If you already have Python set up, [Option B](#option-b-quick-install-with-pipx-for-python-users) is one command, gives you the CLI and Python API alongside the GUI, and starts noticeably faster — especially on macOS.)
The GUI and CLI standalones are **separate downloads**. Grab whichever fits how you'll use peekdocs — or both. The GUI is the click-driven interface for interactive search and report viewing; the CLI is for scripting from the terminal, running on a schedule (cron / Task Scheduler), and piping JSON output into other tools. They're independent — installing one doesn't require the other.
*Why two binaries instead of one?* Each standalone is built with PyInstaller, which freezes its own Python interpreter and every dependency into a single executable. A PyInstaller bundle has one entry point — it can't be both a GUI launcher and a CLI without one carrying the other's weight (the CLI would haul tkinter / customtkinter it never uses; the GUI would carry CLI-only argument-parsing surface). Splitting them keeps each binary small and lets each ship independently. The [pipx / pip install path](#option-b-quick-install-with-pipx-for-python-users) doesn't have this constraint — it drops both `peekdocs` and `peekdocs-gui` console scripts into one shared venv from a single command.
**Direct GUI downloads** (always the latest release):
| Platform | Download | After download |
|---|---|---|
| Windows | [**peekdocs-gui-windows.exe**](https://github.com/exbuf/peekdocs/releases/latest/download/peekdocs-gui-windows.exe) | Double-click to run. **First launch:** Windows SmartScreen blocks the .exe with "Windows protected your PC" — click **More info** (small link near the top of the dialog) → **Run anyway** (the button that appears). This is expected for unsigned open-source software and does not indicate the app is unsafe. |
| macOS | [**peekdocs-gui-macos.zip**](https://github.com/exbuf/peekdocs/releases/latest/download/peekdocs-gui-macos.zip) | Unzip, open `peekdocs-gui.app`. **First launch:** macOS Gatekeeper shows a dialog with only **Done** / **Move to Trash** (no Open button). Two ways to bypass — both expected for unsigned open-source software, neither indicates the app is unsafe: (1) **System Settings UI:** open **System Settings → Privacy & Security**, scroll down to the message `"peekdocs-gui.app" was blocked because it is not from an identified developer`, click **Open Anyway**, then re-launch the app and click **Open** in the confirmation dialog. (2) **Terminal one-liner:** `xattr -dr com.apple.quarantine ~/Downloads/peekdocs-gui.app`, then double-click. Each new download (including upgrades) re-triggers the warning. |
| Linux | [**peekdocs-gui-linux**](https://github.com/exbuf/peekdocs/releases/latest/download/peekdocs-gui-linux) | In the download folder (typically `~/Downloads`): `cd ~/Downloads && chmod +x peekdocs-gui-linux && ./peekdocs-gui-linux`. No first-launch security prompt on Linux. |
Why the warnings appear and the full per-platform bypass walkthrough: [First-launch security warnings](#first-launch-security) below.
**Direct CLI downloads** (always the latest release):
| Platform | Download | After download |
|---|---|---|
| Windows | [**peekdocs-cli-windows.exe**](https://github.com/exbuf/peekdocs/releases/latest/download/peekdocs-cli-windows.exe) | In PowerShell: `cd $HOME\Downloads`, then `.\peekdocs-cli-windows.exe --version` (PowerShell needs the `.\` prefix). In cmd.exe: `cd %USERPROFILE%\Downloads`, then `peekdocs-cli-windows.exe --version` (bare name works; `$HOME` is a PowerShell variable and does not expand in cmd.exe). **First launch:** SmartScreen may block the .exe — click **More info** → **Run anyway**. For global access from any terminal, see **Windows: make `peekdocs` work from any terminal** below the table. PowerShell-specific `--%` token and `.rar`/`.pst` limitations: [docs/INSTALLATION.md → CLI on Windows footnotes](docs/INSTALLATION.md#cli-on-windows-footnotes). |
| macOS | [**peekdocs-cli-macos.zip**](https://github.com/exbuf/peekdocs/releases/latest/download/peekdocs-cli-macos.zip) | Safari auto-unzips → a `peekdocs/` **folder** (the binary is `peekdocs/peekdocs`; the folder also contains `_internal/` with the bundled Python and libraries). `cd ~/Downloads && xattr -dr com.apple.quarantine peekdocs && ./peekdocs/peekdocs --version`. For global access from any terminal: `sudo mv peekdocs /usr/local/lib/peekdocs && sudo ln -s /usr/local/lib/peekdocs/peekdocs /usr/local/bin/peekdocs && sudo xattr -dr com.apple.quarantine /usr/local/lib/peekdocs` so `peekdocs "query" /path` works from any terminal session. **The post-move `xattr` matters** — without it Gatekeeper re-verifies on every launch. The folder distribution replaces the older single-binary one because PyInstaller `--onedir` mode skips the per-invocation self-extraction cost (~5–7s for an unsigned `--onefile` CLI on macOS dropped to ~1–2s). |
| Linux | [**peekdocs-cli-linux**](https://github.com/exbuf/peekdocs/releases/latest/download/peekdocs-cli-linux) | In the download folder: `cd ~/Downloads && chmod +x peekdocs-cli-linux && ./peekdocs-cli-linux --version`. Optionally `sudo mv peekdocs-cli-linux /usr/local/bin/peekdocs` for global access. |
> **Running the CLI from the download folder — the `./` / `.\` prefix rule.** When you run a downloaded executable from the same folder you're sitting in, most shells require an explicit prefix telling them "look here, not on `PATH`":
> - **macOS:** `./peekdocs/peekdocs --version` — the unzip produces a folder; the launcher is one level inside (forward slash + dot, then into the folder)
> - **Linux:** `./peekdocs-cli-linux --version` (forward slash + dot)
> - **Windows PowerShell:** `.\peekdocs-cli-windows.exe --version` (backslash + dot)
> - **Windows cmd.exe:** `peekdocs-cli-windows.exe --version` (bare name works; cmd.exe includes the current directory in its search by default)
>
> The reason: shells search `$PATH` (`$env:Path` on Windows) for executables, and the current directory isn't on `PATH` by default on macOS / Linux / PowerShell (a security default — prevents accidentally running a malicious binary in a folder you `cd`'d into). The `./` or `.\` prefix overrides that. Once you've installed the binary to a folder that *is* on `PATH` (`/usr/local/bin` on macOS / Linux, `$HOME\bin` on Windows after the steps below), the prefix becomes unnecessary and `peekdocs ...` works from any directory.
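
The lookup rule in the note above can be modelled in a few lines. This is a simplified sketch of how POSIX shells and PowerShell resolve a command name, not any shell's real implementation; `resolve` is a name chosen here, and cmd.exe's extra current-directory search is deliberately left out.

```python
import os


def resolve(command, path_dirs, cwd):
    """Return the file a shell would run for `command`, or None.

    A name containing a path separator (./tool, /abs/tool) is taken relative
    to `cwd` and PATH is ignored; a bare name is searched only in
    `path_dirs`, in order.
    """
    has_separator = os.sep in command or bool(os.altsep and os.altsep in command)
    if has_separator:
        candidates = [os.path.join(cwd, command)]
    else:
        candidates = [os.path.join(d, command) for d in path_dirs]
    for candidate in candidates:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return os.path.normpath(candidate)
    return None
```

With the binary sitting in the download folder, the bare name resolves to nothing until that folder (or the folder you moved the binary to) appears in `path_dirs`, while the `./` form finds it immediately.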
**Windows: make `peekdocs` work from any terminal.** Rename the CLI to `peekdocs.exe`, move it to `$HOME\bin`, and add that folder to your user `PATH`. Run this in PowerShell from the download folder:
```powershell
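Rename-Item peekdocs-cli-windows.exe peekdocs.exe
New-Item -ItemType Directory -Force -Path "$HOME\bin" | Out-Null
Move-Item peekdocs.exe "$HOME\bin\"
# Read the user-scope PATH only; $env:Path is the merged machine + user value,
# and writing it back would copy every machine entry into your user PATH.
$userPath = [Environment]::GetEnvironmentVariable("Path", "User")
[Environment]::SetEnvironmentVariable("Path", "$userPath;$HOME\bin", "User")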
```
Open a fresh PowerShell window afterward; `peekdocs --version` then works from any directory.
Or browse the [**Releases page**](https://github.com/exbuf/peekdocs/releases/latest) for older versions, the full asset list (all six GUI + CLI binaries side by side), or release notes. *On the GitHub repo page, "Releases" is in the right sidebar under "About" — it's easy to miss if you're not looking for it.*
**\* First-launch security warnings (one-time, per platform).** Free, open-source software that hasn't paid for an OS-vendor code-signing certificate triggers a warning on first launch. This is normal and does not mean the software is unsafe.
- **Windows (SmartScreen):** Click **More info** → **Run anyway**.
- **macOS (Gatekeeper):** Recent macOS (Sequoia / Sonoma) shows a warning dialog with only **Done** and **Move to Trash** — no **Open** button. The bypass:
   1. Click **Done** to dismiss the warning.
   2. Open **System Settings → Privacy & Security**, scroll down to *"peekdocs-gui.app was blocked..."*, and click **Open Anyway**.
   3. Re-launch the app and click **Open** in the final confirm dialog.

   From then on a regular double-click on *that copy* works. **Each new download (including upgrades) re-triggers the warning** — the trust is per downloaded file, not per app. The one-line terminal alternative is faster if you upgrade often: `xattr -dr com.apple.quarantine ~/Downloads/peekdocs-gui.app`. Full walkthrough: [docs/INSTALLATION.md → macOS first-launch Gatekeeper](docs/INSTALLATION.md#macos-gatekeeper). *Note: Safari auto-unzips downloaded `.zip` files, so you'll see `peekdocs-gui.app` directly in Downloads rather than the `peekdocs-gui-macos.zip` you clicked — no extra unzip step.*
- **Linux:** Open a terminal in the folder where the file landed (typically `~/Downloads`), then `chmod +x peekdocs-gui-linux && ./peekdocs-gui-linux`. The `./` prefix is required because the current directory is not on `$PATH` by default — `./` tells the shell "run the file in *this* folder." If you moved the file elsewhere, `cd` there first or run it by absolute path (`/path/to/peekdocs-gui-linux`).
**Upgrading.** No need to uninstall the old version first — just download the new version from the same direct download links above and overwrite the existing file (GUI, CLI, or both — whichever you use). Your settings and saved searches live in your home directory, not in the executable — nothing is lost. See [Uninstalling](#uninstalling) below for full removal instructions.
**No dependency breakage.** The standalone bundles Python, all libraries, and peekdocs into one self-contained download (a single executable on Windows and Linux, an app bundle or folder on macOS) frozen at versions that were tested together — nothing external to upgrade, conflict, or break.
**Safe for your computer.** No installation option (standalone, pipx, or source) modifies your existing Python, installs system services, writes to the registry, or interferes with any other program.
---
*Done with Option A? Skip ahead to [Quick Start](#quick-start). If you have Python installed, Option B below is the better path — one command, faster startup, and you get the CLI and Python API alongside the GUI.*
### Option B: Quick Install with pipx (for Python users)
If you already have Python set up — or you want the CLI and Python API alongside the GUI — one command installs everything. Works the same on every OS.
```bash
pipx install git+https://github.com/exbuf/peekdocs.git # recommended (isolated venv)
# — or —
pip install git+https://github.com/exbuf/peekdocs.git # if you prefer pip
```
These are the **first-time install** commands. To upgrade later, use `pipx upgrade peekdocs` (or `pip install --upgrade git+https://github.com/exbuf/peekdocs.git`). `pipx upgrade` is cleaner than `pipx install --force` — it replaces the package's contents in place instead of leaving stale `.dist-info` directories around (which can desync the reported version from the running code).
After install, `peekdocs` and `peekdocs-gui` work from any terminal, any folder, every time — even after restarting your computer. pipx manages the underlying virtual environment for you (pip drops the package into whichever Python environment you used). To uninstall completely: `pipx uninstall peekdocs` (or `pip uninstall peekdocs`). See the [User Guide](docs/USER_GUIDE.md#will-peekdocs-affect-my-existing-python-installation) for what is and isn't preserved across upgrades.
**GUI prerequisite** — only if you'll use `peekdocs-gui`:
- **macOS Homebrew Python:** `brew install python-tk@3.14` (match the version suffix of your installed `python@` formula)
- **Linux:** `sudo apt install python3-tk`
- **Windows / python.org macOS installer:** already included — nothing to do
**Niche cases** (macOS python3.13 selection, no-git ZIP install, Windows pipx fallback, source install for contributors) are documented in [docs/INSTALLATION.md](docs/INSTALLATION.md).
### Upgrading
Your saved searches, settings, indexes, and reports are stored outside the peekdocs installation — in your home directory and your document folders. Upgrading replaces only the code. These files are **never overwritten** by an upgrade:
- `~/.peekdocsrc` — your saved settings and preferences
- `~/.peekdocs_history.json` — your search history
- `~/.peekdocs_bookmarks.json` — your bookmarks
- `.peekdocs_collection.json` (in each search folder) — your saved searches and search suites
- `.peekdocs.db` (in each search folder) — your search index
- `peekdocs_report_*`, `peekdocs_accumulated_*` files — your saved reports
How to upgrade depends on which install method you used:
- **Standalone (Option A):** download the new file from the [Releases page](https://github.com/exbuf/peekdocs/releases/latest) and replace the old one. **No need to uninstall first.**
- **pipx (Option B):** `pipx upgrade peekdocs` — replaces the package contents in place without leaving stale `.dist-info` directories behind. (`pipx install --force git+…` also works but can accumulate stale dist-info entries that desync the reported version from the running code; `pipx uninstall peekdocs && pipx install git+…` is the nuclear option if you ever hit that.) **Windows note:** if either upgrade method fails with "Access is denied" on `.pyd` / `.dll` / `python.exe` files, the existing venv is being held open by a running peekdocs process (or a terminal sitting inside the venv folder). See [pipx upgrade on Windows: locked files](docs/TROUBLESHOOTING.md#pipx-upgrade-windows-locked-files) for the recovery walkthrough. macOS and Linux aren't affected — they let a running process keep using a file that's been replaced.
- **Source install:** `cd peekdocs && git pull && pip install -e .` (see [CONTRIBUTING.md](CONTRIBUTING.md#development-setup)).
- **Niche paths** (no-git ZIP, Windows pip fallback): see [docs/INSTALLATION.md](docs/INSTALLATION.md).
### Uninstalling
peekdocs doesn't use a system installer — no registry entries, no system services, no kernel extensions. "Uninstalling" just means deleting the executable (standalone) or the Python package (pipx / pip). Your settings, history, bookmarks, saved searches, and indexes are stored in your home directory and search folders — **they persist after uninstall** so you can reinstall later and pick up where you left off. To wipe those too, see the *factory reset* paragraph at the end of this section.
How to uninstall depends on which install method you used:
- **Standalone (Option A):**
  - **Windows:** delete `peekdocs-gui-windows.exe` and/or `peekdocs-cli-windows.exe` from wherever you saved them (Downloads, Desktop, a folder on `PATH`, etc.).
  - **macOS:** drag `peekdocs-gui.app` from Finder to the Trash. If you installed the CLI globally with the commands in Option A, remove both the symlink and the folder it points to: `sudo rm /usr/local/bin/peekdocs && sudo rm -rf /usr/local/lib/peekdocs`.
  - **Linux:** delete `peekdocs-gui-linux` and/or `peekdocs-cli-linux` from wherever you put them. If you moved the CLI onto `PATH`, remove it from there, e.g. `sudo rm /usr/local/bin/peekdocs`.
- **pipx (Option B):** `pipx uninstall peekdocs` — removes the isolated venv cleanly.
- **pip:** `pip uninstall peekdocs` — removes the package from whichever Python environment you installed into.
- **Source install:** `pip uninstall peekdocs` from inside the venv you used. Then `rm -rf` the cloned repo folder if you no longer need it.
**Factory reset (complete wipe).** The files listed under [Upgrading](#upgrading) above are intentionally preserved by uninstall. If you also want those gone — settings, search history, bookmarks, saved searches, indexes, saved reports — delete them manually:
```bash
# macOS / Linux
rm -f ~/.peekdocsrc ~/.peekdocs_history.json ~/.peekdocs_bookmarks.json
rm -rf ~/peekdocs_reports
# Plus, in each folder you ever searched:
# rm -f .peekdocs_collection.json .peekdocs.db .peekdocs.db-wal .peekdocs.db-shm
```
```powershell
# Windows PowerShell
Remove-Item $HOME\.peekdocsrc, $HOME\.peekdocs_history.json, $HOME\.peekdocs_bookmarks.json -ErrorAction SilentlyContinue
Remove-Item $HOME\peekdocs_reports -Recurse -ErrorAction SilentlyContinue
# Plus, in each folder you ever searched, remove .peekdocs_collection.json and .peekdocs.db*
```
After that combination, no trace of peekdocs remains on your machine.
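The per-folder step is the tedious part if you have searched many directories. The helper below is a minimal sketch for listing the leftover files under a root folder. The function name is an assumption, the file names are the ones listed in this section, and it only reports by default so you can review the list before deleting anything.

```python
import os

# Per-folder files named in this section (index plus its SQLite side files).
LEFTOVER_NAMES = {
    ".peekdocs_collection.json",
    ".peekdocs.db",
    ".peekdocs.db-wal",
    ".peekdocs.db-shm",
}


def find_leftovers(root, delete=False):
    """Walk `root` and return peekdocs per-folder files; remove them if asked."""
    found = []
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if name in LEFTOVER_NAMES:
                path = os.path.join(folder, name)
                found.append(path)
                if delete:
                    os.remove(path)
    return sorted(found)
```

Run `find_leftovers("/path/to/your/documents")` first to inspect the list, then repeat with `delete=True` once you are satisfied.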
## Quick Start
**Want a quick demo first?** Clone this repo and try peekdocs on the bundled samples: `cd samples/engineering_test && peekdocs BUILD -r` returns 29 hits across multiple source-code and engineering file types (the corpus spans 41 extensions in total). No setup beyond installing peekdocs.
### GUI
```bash
peekdocs-gui
```
On first launch, the GUI opens with a **Getting Started** tab that walks you through your first search. Close it when you're ready to dive in, or skip it and follow these four steps:
1. Click **Browse** to select a folder (or **Single File** to search a specific file)
2. Type your search terms
3. Click **Run Standard Search**
4. View highlighted matches in the preview pane. To also save a Word report, check **DOCX** in Advanced Search Options before searching (or **HTML**, **PDF**, etc.).
The search bar covers the common case — type your keywords and click **Run Standard Search**. For more advanced searches, you have two choices: configure **Advanced Search Options** yourself (regex, fuzzy, Boolean, range queries, and all other settings) — click the **▶ Advanced Search Options** header to expand the inline panel in the left pane — or let the **Search Wizard** do it for you (blue **Search Wizard** button on the main page, between Run Standard Search and Search Suites): pick a search type from 20 pre-built forms, fill in your values, and click Apply. The wizard also has a separate regex pattern builder with 35 named patterns across 6 categories; it configures Advanced Search Options automatically. The green **Search Suites** button (run a group of saved searches together) lives on the main screen next to Run Standard Search. The **Tools** menu in the upper-right also includes **Schedule Search**, which generates a ready-to-paste cron / Task Scheduler command rather than installing the schedule for you.
The Search tab is split horizontally into a scrollable controls column on the left and a results-preview column on the right, with a draggable sash between them. The right pane carries the search-results headline (files searched · matches · elapsed time), Matched / Excluded count buttons, a Chart popup, and the matches themselves. The left pane carries Steps 1–4, the status row, the report-open buttons, and the collapsible Advanced Search Options panel. The split opens with a slight bias toward the left pane (52%) so the five-wide output-format checkbox row fits at first paint; drag the blue sash to rebalance.
**If buttons overlap or text looks too large**, use the **Text Size** dropdown on the bottom-right toolbar to adjust (Normal is recommended).
### Terminal
If you used Option A (standalone download) or Option B (pipx), peekdocs is always ready — just open any terminal. If you used the source install for contributors, navigate to the cloned repo folder and activate the virtual environment first:
```bash
cd /path/to/peekdocs # the folder containing pyproject.toml
source venv/bin/activate # macOS/Linux (you'll see (venv) in your prompt)
venv\Scripts\activate # Windows
```
**Tip:** Type `peekdocs` with no arguments to see a handy cheat sheet of all search modes, common options, and cleanup commands — right above your command prompt. Type `peekdocs -h` for the full reference with all flags, file types, and regex patterns.
Then navigate to your documents and search:
```bash
cd /path/to/your/documents
peekdocs budget # search for "budget"
peekdocs budget revenue # OR search (any term)
peekdocs -a budget revenue # AND search (both terms)
peekdocs -r budget # include subfolders
peekdocs -t pdf,docx budget # only PDFs and Word docs
peekdocs -x "\d{3}-\d{2}-\d{4}" # regex (9-digit ID with dashes)
peekdocs -e "(budget OR revenue) AND NOT draft" # Boolean expression
peekdocs -R amount:1000..5000 budget # range query
peekdocs -R date:2024-01-01..2024-12-31 invoice # date range (also accepts 01/01/2024 format)
peekdocs -P 3 budget acme # line proximity (terms within 3 lines)
peekdocs --open docx budget # search and auto-open the .docx report
peekdocs --open html budget # auto-generate HTML and open in your browser
peekdocs --open csv budget # auto-generate CSV and open in Excel/LibreOffice
peekdocs --open pdf budget # auto-generate PDF and open in a PDF viewer
peekdocs --open json budget # auto-generate JSON and open in a text editor
peekdocs -sa archive --open docx budget # append to accumulated report and open it
peekdocs -sa archive --open html budget # append and open accumulated report in browser
peekdocs --clear # delete peekdocs_*_results* files in current directory
peekdocs --clear-all # delete all peekdocs output files (results, saved reports, index)
```
**No matches?** First search not turning anything up is common. Try `-r` to include subfolders, `-z` for typo-tolerance, drop `-W` if you had whole-word on (it excludes partial matches like "logger" when searching "log"), or check whether your search terms actually appear in those files by opening one manually. Run `peekdocs --list-files` to confirm peekdocs sees the files you expect.
**Why doesn't the OR match count add up?** OR mode counts each matching line ONCE, even when more than one of your terms appears on it. So if `bowling` alone finds 342 matches and `tunick` alone finds 23, an OR search for `bowling tunick` will return *fewer* than 365 whenever some lines mention both words. For example, if the OR total is 350, that means 15 lines contain both terms — inclusion-exclusion: `|A ∪ B| = |A| + |B| − |A ∩ B|`. To list those overlap lines, re-run with `-a` (AND mode) — it returns exactly the intersection. The same explanation lives inside the GUI under **Advanced Search Options → ? help → Match counting in OR mode**.
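The arithmetic above can be checked on a toy example. The sketch below is not peekdocs code; it counts matching lines with plain case-insensitive substring tests (an assumption about matching that is enough to show the counting rule) and verifies the inclusion-exclusion identity.

```python
def count_lines(lines, terms, mode="or"):
    """Count lines containing any (mode="or") or all (mode="and") of the terms."""
    combine = any if mode == "or" else all
    return sum(
        1 for line in lines
        if combine(term.lower() in line.lower() for term in terms)
    )


lines = [
    "Bowling league schedule",
    "Tunick signed the lease",
    "Bowling night with Tunick",  # contains both terms, counted once in OR mode
    "Unrelated memo",
]

only_a = count_lines(lines, ["bowling"])                      # 2
only_b = count_lines(lines, ["tunick"])                       # 2
either = count_lines(lines, ["bowling", "tunick"])            # 3, not 4
both = count_lines(lines, ["bowling", "tunick"], mode="and")  # 1

# |A ∪ B| = |A| + |B| − |A ∩ B|
assert either == only_a + only_b - both
```

Applied to the figures in the paragraph above, 342 + 23 − 350 = 15 overlapping lines.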
If you used the source install, you'll see `(venv)` before each command in your terminal — that's normal and means the virtual environment is active.
Results are saved to `peekdocs_standard_results.txt` in the current directory — the same folder your terminal is in when you run the search. **The .txt report is always written and cannot be disabled** because the GUI's Results Preview pane and the Matched Files popup both parse it; the matplotlib match-heatmap and other downstream views all read from it too. Every other format is opt-in: `peekdocs_standard_results.docx` (the highlighted Word report) is produced when **DOCX** is checked under **Advanced Search Options → Output formats** in the GUI, or when `-o docx` is passed on the CLI. CSV / JSON / PDF / HTML work the same way — opt in via the GUI checkbox or `-o csv,json,pdf,html`. A typical CLI invocation that produces TXT + DOCX is `peekdocs -o docx budget`; to also write HTML, `peekdocs -o docx,html budget`.
**All result files are overwritten each time you run a new search.** To keep previous results, use `-s my_report` to save a named copy (saved as `peekdocs_report_my_report.txt/.docx` so peekdocs never searches its own reports), or `--timestamp` to add a date/time stamp to each filename so nothing is ever overwritten.
The `.docx` report opens in whatever app you've set as your OS default for `.docx` files — Microsoft Word or [LibreOffice](https://www.libreoffice.org/download/download-libreoffice/) (free) are common choices. The `.txt` report works on any computer with no extra software.
To clean up output files: `peekdocs --clear` (deletes results files) or `peekdocs --clear-all` (deletes results, saved reports, error log, and index). Neither touches your saved searches or settings.
Run `peekdocs -h` for the full flag reference with examples. The complete flag list with detailed descriptions is in the [User Guide](docs/USER_GUIDE.md#flag-use-summary). All flags can be combined freely except: regex (`-x`), fuzzy (`-z`), and wildcard (`-w`) are mutually exclusive (pick one); and expression mode (`-e`) cannot be combined with AND (`-a`), exclude (`-n`), or proximity (`-p`) since those are built into the expression syntax.
### Python API
```python
from peekdocs import search

if __name__ == "__main__":
    # Search for either term in every supported file under the directory.
    result = search(["budget", "revenue"], directory="/path/to/docs")
    print(f"Found {len(result.matches)} matches in {len(result.files_searched)} files")
    for match in result.matches:
        print(f"  {match.filename}:{match.line_num}: {match.text}")
```
The `if __name__ == "__main__":` guard is **required** — peekdocs uses `multiprocessing` internally, and on macOS and Windows child processes re-import the calling script. Without the guard, the script will crash with `RuntimeError` on those platforms. See the [API Reference](docs/API.md) for all parameters and options.
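Once you have a result, ordinary Python is enough to reshape it. The helper below is a sketch that relies only on the `filename`, `line_num`, and `text` attributes used in the example above. `group_by_file` is a name introduced here, not part of the peekdocs API, and the stand-in objects exist only so the snippet runs without a document folder.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_file(matches):
    """Map each filename to its (line_num, text) pairs, sorted by line number."""
    grouped = defaultdict(list)
    for match in matches:
        grouped[match.filename].append((match.line_num, match.text))
    return {name: sorted(hits) for name, hits in grouped.items()}


if __name__ == "__main__":
    # Stand-ins for result.matches; real code would pass search(...).matches.
    demo = [
        SimpleNamespace(filename="q3.docx", line_num=12, text="budget approved"),
        SimpleNamespace(filename="q1.pdf", line_num=4, text="revenue up"),
        SimpleNamespace(filename="q3.docx", line_num=3, text="draft budget"),
    ]
    for name, hits in group_by_file(demo).items():
        print(name, hits)
```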
---
**Stuck?** Run `peekdocs --check` first — or, if you're using the GUI, open **Tools → System Check** for the same diagnostic in a window. Both verify Python, dependencies, Tesseract, SQLite, and free disk space and tell you what's missing. If the check looks clean but you're still hitting issues, see [FAQ & Troubleshooting](docs/TROUBLESHOOTING.md) for common questions and fixes across Windows, macOS, and Linux.
## Documentation
| Document | Description |
|----------|-------------|
| [User Guide](docs/USER_GUIDE.md) | Complete reference — GUI, CLI flags, search modes, indexing, file reference |
| [Walkthroughs](docs/WALKTHROUGHS.md) | Seven annotated screenshot tours — same search across three interfaces, Advanced Search Options, Regex Search, Search Suites, Diff Snapshots, Schedule Search, and `peekdocs --check` |
| [Installation](docs/INSTALLATION.md) | Per-platform Python prerequisites, optional tools (Tesseract, UnRAR, libpff-python), CLI-on-Windows footnotes, and less-common install paths |
| [API Reference](docs/API.md) | Python library API — `search()` function, parameters, return values |
| [Glossary](docs/GLOSSARY.md) | 85 peekdocs terms: FTS5, regex modes, deterministic, exit codes, Tesseract, jq, SIEM, MSP, network calls, and more |
| [FAQ & Troubleshooting](docs/TROUBLESHOOTING.md) | Common questions and solutions for Windows, macOS, and Linux |
| [Security architecture](docs/SECURITY.md) | Deep dive for IT and Security teams — data architecture, per-file sensitivity notes, and limitations outside the application's control |
| [Reporting security issues](SECURITY.md) | Vulnerability-reporting policy — preferred channel, supported versions, scope, expected response timing |
| [Changelog](CHANGELOG.md) | Version history and release notes |
| [Contributing](CONTRIBUTING.md) | How to report bugs, suggest features, and submit code |
## Why peekdocs?
Every search tool — `grep`, OS file search, cloud AI assistants, enterprise search software — matches text at its core. The differences are in what each one can read, how it presents results, what stays private, and what you can do with the output.
If all you need is to find a word in a plain text file, many search tools work well. If you want to *see inside your own files* — across 100+ file formats, with context, in a report you can share, without uploading anything — that's what peekdocs was built for.
### Why Is peekdocs a Search and *Analysis* Tool?
peekdocs is a search tool because it helps you find information across PDFs, Office documents, email archives, source code, scanned documents, and 100+ other file types. It is also an *analysis* tool because it helps you characterize document collections, not just search them. Features such as Duplicate Finder, File Inventory, Large Files, Recent Changes, Protected Files, Diff Snapshots, Bookmarks, and Search History reveal patterns, changes, and characteristics within your files. peekdocs does not interpret results, assign risk scores, or make decisions for you; instead, it gathers and organizes information so you can analyze it yourself. In that sense, peekdocs goes beyond answering "Where is this?" and also helps answer "What do I have?", "What changed?", "What is duplicated?", and "What is taking up the most space?"
**Compared with built-in OS search (Windows Search, macOS Spotlight, Linux file managers).** OS search is convenient for everyday file discovery. peekdocs is purpose-built for document-search workflows across mixed-format collections, including `.pst`, `.msg`, `.7z`, `.rar`, `.odt`, `.eml`, `.mbox`, Jupyter notebooks, and scanned PDFs. Results show *where* each match occurs (filename, line number, surrounding context). You can run searches in Boolean, fuzzy, regex, proximity, or range mode, save them by name, group them into suites, and produce highlighted `.docx`, `.pdf`, and `.html` reports you can save or share. The index is yours to build and refresh on demand, and the same searches work across the GUI, CLI, and Python API.
**Compared with cloud AI document tools.** Cloud AI tools excel at summarization, question answering, semantic search, and extracting meaning from large document collections, and they are often the right choice for those tasks. peekdocs serves a different purpose: it runs entirely on your computer. For keyword, pattern, date, amount, regex, fuzzy, and proximity searches across mixed-format folders, peekdocs delivers deterministic and repeatable results while keeping your documents local.
peekdocs processes the whole folder in one local pass with no upload step — same engine whether the folder has dozens of files or many thousands. It reads 100+ file types natively, including archives (`.zip`, `.7z`, `.rar`) and Outlook email containers (`.pst`, `.msg`, `.mbox`) opened in place, and OCRs scanned PDFs and images when you enable the `-O` flag. The size of the corpus, the connection speed, and the formats involved are not constraints peekdocs has to plan around — it works on whatever's on disk, however large, in any of the formats it supports.
peekdocs's JSON output is also the deterministic keyword-retrieval half of a fully-local privacy-preserving LLM workflow. Use peekdocs to narrow a 10,000-file corpus to the 30 files containing the exact terms, dates, or regex patterns you care about, then feed those 30 (not the 10,000) to a local model — Llama 3, Mistral, Gemma, or whatever you run via [Ollama](https://ollama.com), [llama.cpp](https://github.com/ggml-org/llama.cpp), or [LM Studio](https://lmstudio.ai) — for summarization or Q&A. peekdocs doesn't produce embeddings; it returns precise file paths, line numbers, and optional SHA-256 content fingerprints — the structured inputs a local LLM needs to ground its citations against your actual source files. Nothing leaves your machine.
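As a rough illustration of that hand-off, the sketch below reduces a list of match records to the distinct files worth passing to a local model. The record layout (a JSON array of objects with `file` and `line` keys), the sample data, and the helper name `files_with_matches` are all assumptions made for this example; check the actual peekdocs JSON output for the real field names.

```python
import json

# Hypothetical sample only: the real peekdocs JSON schema may differ.
# The field names "file" and "line" are assumptions for illustration.
SAMPLE_JSON = """
[
  {"file": "contracts/acme.pdf", "line": 12},
  {"file": "contracts/acme.pdf", "line": 40},
  {"file": "email/2024.mbox", "line": 7}
]
"""

def files_with_matches(json_text, file_key="file"):
    """Return the distinct file paths in first-seen order, dropping duplicates."""
    records = json.loads(json_text)
    seen = {}
    for record in records:
        seen.setdefault(record[file_key], None)  # dicts preserve insertion order
    return list(seen)

shortlist = files_with_matches(SAMPLE_JSON)
print(shortlist)
```

Only the shortlisted files would then be read and passed to the local model, which keeps the prompt small and the citations traceable to exact paths.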
**Compared with `grep`.** For plain-text search in a terminal, `grep` is excellent — use it. peekdocs is built for mixed-format document collections (PDF, Word, Excel, PowerPoint, email, OCR-able scans), with highlighted reports, saved searches, search suites, regex collections, indexing, a GUI, and a Python API. Both can live in your toolkit; they're designed for different jobs.
| Capability | grep | peekdocs |
|---|---|---|
| Plain text files (.txt, .log, .csv) | Yes | Yes |
| PDF text extraction | Requires external conversion (`pdftotext`) | Built in |
| Word documents (.docx) | Requires external conversion | Built in |
| Excel spreadsheets (.xlsx) | Requires external conversion | Built in |
| PowerPoint presentations (.pptx) | Requires external conversion | Built in |
| Email files and archives (.eml, .msg, .mbox, .pst) | Requires external conversion | Built in |
| OCR (scanned PDFs and images) | Requires external OCR pipeline | Built in (`-O`) |
| EPUB, RTF, ODT, ODS, ODP, archives | Format-specific tools required | Built in |
| Source code (40+ extensions) | Yes | Yes |
| Highlighted .docx / .pdf / .html reports | No | Yes |
| CSV and JSON export | Requires scripting | Built in (`-o csv,json`) |
| Boolean expressions | Requires shell composition | Yes (`-e "A AND (B OR C)"`) |
| Proximity search | Requires custom scripting | Yes (`-p 5`) |
| Fuzzy / typo-tolerant matching | Requires specialized tools | Yes (`-z`) |
| Range queries (amounts, dates) | Requires custom scripting | Yes (`-R amount:1000..5000`) |
| Saved searches and suites | No | Yes |
| Regex collections (batch pattern sets) | Requires scripting | Built in (`--regex-collection`) |
| Search index with on-demand refresh | Requires separate indexing tool | Built in (`--index`) |
| Consistent behavior across Windows, macOS, and Linux | Varies (GNU vs BSD grep) | Same flags on all three platforms |
| GUI | No | Yes |
| Python API | No | Yes |
## What peekdocs Is Not
> **In one line:** peekdocs is a search utility — not a judgment engine, not a compliance certifier, not a forensic platform, not a threat-assessment tool.
peekdocs is a general-purpose local text-search application. To set honest expectations, here are the things it is **not**, alongside the kind of tool you would reach for instead:
- **Not a security or threat-detection product.** peekdocs matches the text patterns you give it. It does not score risk, classify findings, recognize malware, or judge whether a match is good or bad — that's your call. For threat detection, reach for a dedicated security product.
- **Not a substitute for human review.** peekdocs surfaces matches; it does not decide which matches matter. Treat its output as a starting point for code review, document review, or whatever judgment task brought you here.
- **Not a forensic or evidence-collection system.** The optional SHA-256 with `--hash` is a content fingerprint for snapshot comparison, not notarized, tamper-evident, or court-admissible evidence handling. For chain-of-custody workflows, reach for a dedicated forensic suite.
- **Not an AI or summarization tool.** peekdocs does not infer, summarize, paraphrase, answer questions, or reason about what your documents say. It finds matches; that's it. For summarization or question-answering, use an LLM-based system.
- **Not a file manager or backup tool.** peekdocs reads your files; it never moves, modifies, renames, syncs, archives, or version-controls them. It writes its own report and state files — every one named with the `peekdocs_` prefix (visible outputs) or `.peekdocs` prefix (hidden user-state / per-folder dotfiles), with no exceptions — and nothing else.
- **Not networked.** peekdocs operates only on files mounted as local paths. It does not crawl websites, hit APIs, read SharePoint or Confluence over a network, or talk to a remote search index. A mapped network drive that appears as a regular folder works; everything else does not.
- **Not a search-index server or enterprise document platform.** peekdocs runs as a single-user CLI / GUI / library on one machine. It does not host a shared indexable corpus for many users, manage permissions or roles, version content, or expose an HTTP API for other systems to query. For multi-user document management, reach for Elasticsearch / OpenSearch / Solr (search servers) or SharePoint / M-Files / Documentum / Box (enterprise document platforms).
- **Not a high-assurance or safety-critical tool.** peekdocs is offered under the MIT License "as is" without warranty. It is not designed for environments where an incorrect or missed match could cause significant harm. Users remain solely responsible for how they use and interpret its output.
For what peekdocs *is*, see [Feature Highlights](#feature-highlights) and the [User Guide](docs/USER_GUIDE.md).
## Performance
**Test machine:** MacBook Pro, Apple M-series, 24 GB RAM, SSD, Python 3.13. peekdocs used 7 of 14 cores (its default is half; adjustable in Advanced Search Options). Your results will vary depending on CPU, RAM, disk type (SSD vs hard drive), and whether files are local or on a network drive.
### Mixed-format test (realistic documents)
The file mix represents a typical home or small business folder:
| File type | % of files | Examples |
|-----------|--:|-----|
| PDF | 35% | Bank statements, receipts, tax forms, manuals |
| Word (.docx) | 25% | Letters, resumes, reports, contracts |
| Plain text (.txt, .csv, .log) | 15% | Notes, data exports, logs |
| Excel (.xlsx) | 10% | Budgets, lists, financial records |
| Email (.eml) | 8% | Exported correspondence |
| PowerPoint (.pptx) | 5% | Presentations |
| Other (.html, .rtf) | 2% | Saved web pages, legacy docs |
**Results (files stored locally on SSD).** Each test folder contained the mix of file types shown above. Individual file sizes varied (PDFs 50–500 KB, Word docs 20–200 KB, text files 1–50 KB, etc.). "Total size" is the entire folder.
| Files | Total folder size | Search time |
|------:|-----------:|------------:|
| **1,000** | 13 MB | **~1 second** (no index) |
| **10,000** | 133 MB | **~5 seconds** (no index) |
| **50,000** | 663 MB | **~22 seconds** (no index) |
| **105 real Word docs** | 1,878 MB | **~4 seconds** without index, **0.24 seconds** with index |
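10× more files doesn't mean 10× longer: in the table above, going from 1,000 to 10,000 files took about 5× as long (~1 second vs. ~5 seconds). peekdocs processes files in parallel across multiple CPU cores.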
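The scaling is easier to see as throughput. The short calculation below uses only the rounded timings from the mixed-format table, so treat the output as approximate; the helper name `throughput` is just for this example.

```python
# (files, seconds) pairs copied from the mixed-format results table; timings are approximate.
RESULTS = [(1_000, 1.0), (10_000, 5.0), (50_000, 22.0)]

def throughput(files, seconds):
    """Files processed per second."""
    return files / seconds

base_files, base_seconds = RESULTS[0]
for files, seconds in RESULTS:
    print(
        f"{files:>6,} files: {throughput(files, seconds):,.0f} files/s "
        f"({files / base_files:.0f}x the files, {seconds / base_seconds:.0f}x the time)"
    )
```

By these rounded figures, throughput rises from roughly 1,000 files per second at 1,000 files to roughly 2,300 files per second at 50,000 files.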
### Plain-text stress test
We also tested with small .txt files (~113 bytes each) to see how peekdocs handles extreme file counts:
| Files | Search time |
|------:|------------:|
| 10,000 | 1.4 seconds |
| 50,000 | 4.1 seconds |
| **1,000,000** | **90 seconds** |
**What does testing 1,000,000 files prove?** These were tiny text files (~113 bytes each), not real documents — nobody has a million small .txt files. The test confirms that peekdocs doesn't crash, doesn't run out of memory, and produces correct results at extreme scale. It's a stress test of the software's stability, not a realistic performance benchmark. The mixed-format results above are what real-world performance looks like.
### Should you build an index?
Direct search is fast enough for most folders — just click Run Standard Search. An index helps when you have large files or search the same folder repeatedly:
| Situation | Index helps? | Why |
|-----------|:-----------:|-----|
| Large files (PDFs, Word, Excel) | **Yes** | Skips expensive parsing: roughly 17× faster by the rounded timings of the 105-Word-doc test in the mixed-format results above (~4 seconds without an index, 0.24 seconds with one) |
| Same folder searched repeatedly | **Yes** | Pre-pays parsing cost once |
| Files on a network drive | **Yes** | Reads local index instead of files over the network |
| Small files, small folder | **No** | Direct search is already fast enough |
| One-time search you won't repeat | **No** | Build time won't be recouped |
To try it: open **Tools → Indexes** and click **Build Index(es)**, or run `peekdocs --index`.
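The "pre-pays parsing cost once" idea can be sketched with Python's built-in `sqlite3` module. This is a minimal illustration, not peekdocs's actual index: the `docs` table, its columns, and the helper names are assumptions, and the real `.peekdocs.db` layout may differ.

```python
import sqlite3

def build_index(conn, extracted):
    """Store already-extracted text so later searches skip the parsing step."""
    conn.execute("CREATE TABLE IF NOT EXISTS docs (path TEXT PRIMARY KEY, body TEXT)")
    conn.executemany(
        "INSERT OR REPLACE INTO docs (path, body) VALUES (?, ?)", extracted.items()
    )
    conn.commit()

def search_index(conn, term):
    """Return paths whose stored text contains the term (ASCII case-insensitive)."""
    rows = conn.execute(
        "SELECT path FROM docs WHERE instr(lower(body), lower(?)) > 0 ORDER BY path",
        (term,),
    )
    return [path for (path,) in rows]

# In-memory database and made-up text standing in for parsed documents.
conn = sqlite3.connect(":memory:")
build_index(conn, {"a.docx": "Invoice total due", "b.pdf": "Meeting notes"})
print(search_index(conn, "invoice"))
```

The expensive step (turning a PDF or Word file into text) happens once in `build_index`; every later call to `search_index` only queries the stored text.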
### First-run timing and the banner notice
Unless indexing is turned off, the first time peekdocs searches a folder it builds the search index by reading every file once. This can take from a few seconds (small folders) to a few minutes (thousands of files, large PDFs, or scanned documents). Every search after that uses the index and is much faster, typically well under a second once the index is warm.
To make this expectation clear up front, peekdocs prints a short notice in the CLI banner when the search folder has no index yet:
```
Note: no search index for this folder yet — the first search builds
one (may take longer); subsequent searches are much faster.
Use --no-index to skip indexing entirely.
```
The notice is shown only when it's relevant, so quiet mode, machine-readable output, and non-search commands keep their existing output unchanged:
| Scenario | Notice shown? |
|---|:---:|
| Cold folder (no `.peekdocs.db`) — interactive search | ✓ shown |
| Warm folder (index exists) | — not shown |
| `--no-index` flag passed | — not shown |
| Non-search command (`--check`, `--runs`, `--diff`, `--list-files`, `--clear*`, `--index*`, `--config`) | — not shown |
| Quiet mode (`-q` or `-qq`) — banner suppressed entirely | — not shown |
| `--stdout` JSON output mode — JSON pipeline stays clean | — not shown |
| `--runs --json` / `--diff --json` — machine-parsed output stays clean | — not shown |
Folder detection is `-d`/`--directory`-aware, so running `peekdocs -d /some/other/folder TODO` checks that folder, not the current directory.
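The rules in the table can be summarized as a single predicate. The sketch below is an illustration of that decision, not peekdocs's source code: the function name `should_show_notice` and its parameters are hypothetical, and only the `.peekdocs.db` file name comes from the table.

```python
import tempfile
from pathlib import Path

INDEX_FILENAME = ".peekdocs.db"  # index file name as given in the table above

def should_show_notice(folder, *, no_index=False, quiet=False,
                       json_output=False, is_search=True):
    """Return True only for a search on a folder that has no index yet."""
    if not is_search or no_index or quiet or json_output:
        return False
    return not (Path(folder) / INDEX_FILENAME).exists()

with tempfile.TemporaryDirectory() as folder:
    print(should_show_notice(folder))               # cold folder
    print(should_show_notice(folder, quiet=True))   # quiet mode suppresses it
    (Path(folder) / INDEX_FILENAME).touch()
    print(should_show_notice(folder))               # warm folder
```

Passing the target folder explicitly mirrors the `-d`/`--directory` behavior: the check runs against the folder being searched, not the current directory.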
If you'd rather avoid indexing entirely, add `--no-index` to your CLI command or uncheck **Use Index** in the GUI. Searches will then read files directly each time — fine for one-off searches, slower for repeated searches in the same folder. See the [Why is my first search slow but later searches are fast?](docs/TROUBLESHOOTING.md) FAQ entry for additional notes including the `2>/dev/null` idiom for absolutely silent automation.
**Cold-cache first search even with the index already built.** Once the index exists, a fresh terminal session's first search is still slower than the next — typically a few seconds vs. half a second — and there's no rebuild involved. That's the OS filesystem cache being cold for the `.peekdocs.db` file (often hundreds of MB), Python interpreter startup paid by each fresh invocation, and the `refresh_index` `os.stat()` pass hitting disk on its first walk. After the first search in a session, peekdocs is sub-second. The same FAQ entry above covers this in more detail along with a way to pre-warm the cache via a scheduled job.
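One generic way to warm the OS filesystem cache is to read the index file once from start to end and discard the data. The sketch below assumes that approach; the FAQ's recommended method may differ, and the helper name `prewarm` is hypothetical.

```python
from pathlib import Path

def prewarm(path, chunk_size=8 * 1024 * 1024):
    """Read a file end to end and discard the data; return the bytes read."""
    total = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            total += len(chunk)
    return total

# Assumes the index sits in the current folder; adjust the path as needed.
index_file = Path.cwd() / ".peekdocs.db"
if index_file.exists():
    print(f"read {prewarm(index_file):,} bytes")
```

Run from a scheduled job shortly before you normally start searching, a script like this pays the disk-read cost ahead of time. It does not remove the Python startup cost of each fresh invocation.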
**Network folders:** If your files are on a network drive, searches will be slower because every file must be read over the network. Building an index is strongly recommended — the first build is slow, but all subsequent searches query the local index instead.
**Why Python?** Python was chosen because it has mature, well-established libraries for every file format peekdocs supports: PyMuPDF for PDFs, python-docx for Word, openpyxl for Excel, python-pptx for PowerPoint, and dozens more. In C++ or Rust, equivalent libraries either don't exist or would require years of integration work. Python also runs on Windows, macOS, and Linux without recompilation, installs with a single `pip` command (no compiling from source), and produces readable open-source code that anyone can inspect or extend. The Python API means any Python programmer can call peekdocs directly from their own scripts. As for speed: the performance-critical work (PDF decoding, ZIP decompression, regex matching) is handled by C-backed libraries under the hood. Python orchestrates; C does the heavy lifting. Because peekdocs uses multiprocessing (separate OS processes, not threads), Python's GIL (Global Interpreter Lock, which lets only one thread per process run Python code at a time) does not limit parallel file processing.
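The multiprocessing pattern described above can be sketched with the standard library. This is a simplified illustration for plain-text files, not peekdocs's implementation; the function names are hypothetical, and the half-the-cores default is taken from the test-machine note in the Performance section.

```python
import os
import sys
from concurrent.futures import ProcessPoolExecutor

def default_workers(cpu_count=None):
    """Half the available cores, but never fewer than one."""
    cores = cpu_count or os.cpu_count() or 1
    return max(1, cores // 2)

def count_matches(args):
    """Worker: count the lines containing the term in one text file."""
    path, term = args
    with open(path, encoding="utf-8", errors="replace") as handle:
        return path, sum(term in line for line in handle)

def parallel_search(paths, term):
    """Fan files out to separate processes; each has its own interpreter and GIL."""
    with ProcessPoolExecutor(max_workers=default_workers()) as pool:
        return dict(pool.map(count_matches, [(p, term) for p in paths]))

# Usage: python sketch.py TERM FILE [FILE ...]
if __name__ == "__main__" and len(sys.argv) > 2:
    print(parallel_search(sys.argv[2:], sys.argv[1]))
```

Because each worker is a separate OS process, one slow file does not block the others, and the `__main__` guard keeps worker processes from re-running the search when they import the script.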
## Platform Notes
**Tested on:** macOS (development machine), Windows 10/11, and Linux Mint 22.3 (Cinnamon) in a VirtualBox VM on Windows. The CLI and GUI work on all three platforms.
- **High-DPI displays (4K monitors)** — if buttons overlap or text looks too large, use the **Text Size** dropdown on the bottom-right toolbar to adjust. Normal is recommended for most screens
- **Antivirus software (Windows)** — some antivirus programs flag Python scripts as suspicious. If peekdocs is blocked, add your Python installation or the peekdocs folder to your antivirus allow list
- **Files locked by other programs (Windows)** — Windows locks files that are open in another program. If peekdocs reports "permission denied" on a file, close the program that has it open and search again. Errors are logged to `peekdocs_errors.log`
- **Corporate firewalls** — if `pip` or `pipx` can't download packages, use the [Standalone Download](#option-a-standalone-download-no-python-needed) (no Python, no network needed beyond the initial download) or the [ZIP-based pipx install](docs/INSTALLATION.md#no-git-install-from-a-downloaded-zip) documented in `docs/INSTALLATION.md`
- **macOS file picker vs Windows** — on macOS, the file picker includes a preview panel; on Windows, it does not — this is an OS difference, not peekdocs
- **Linux GUI requires python3-tk** — the CLI works without it, but `peekdocs-gui` needs tkinter. Install with `sudo apt install python3-tk` (see [Prerequisites](#prerequisites))
### File Handling
peekdocs handles a wide range of real-world file issues automatically on all platforms:
| Issue | Windows | macOS | Linux | What happens |
|-------|:-------:|:-----:|:-----:|-------------|
| Word/Excel lock files (`~$`) | Yes | Yes | Rare | Silently skipped |
| System files (Thumbs.db, .DS_Store) | Yes | Yes | — | Silently skipped |
| Temp files (`~`) | Yes | Yes | Yes | Silently skipped |
| Symlinks | Rare | Yes | Yes | Silently skipped |
| Password-protected archives | Yes | Yes | Yes | Reported with clear message |
| Cloud-only placeholders (OneDrive, iCloud) | Yes | Yes | Rare | Reported: "download the file first" |
| Path length limit (260 chars) | Yes | — | — | Files in archives silently skipped |
| Raw .gz files (not tar) | Yes | Yes | Yes | Decompressed and searched |
| SSL .key files | Yes | Yes | Yes | Detected as non-Keynote, skipped |
| BOM in text files | Common | Rare | Rare | Stripped automatically |
| macOS resource forks (`._`) | — | Yes | — | Silently skipped |
| Named pipes / sockets | — | Possible | Yes | Detected via stat(), skipped |
| Virtual filesystems (/proc, /sys) | — | — | Yes | Excluded from recursive search |
| Corrupted files | Yes | Yes | Yes | Logged to error log, search continues |
See [File-handling details by platform](docs/USER_GUIDE.md#file-handling-details-by-platform) in the User Guide for the reasoning behind each row and platform-specific behavior. For installation and runtime gotchas, see [TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md).
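The "Detected via stat(), skipped" rows rely on a standard technique: ask the OS what kind of filesystem object a path is before trying to open it. A minimal sketch, assuming `os.lstat` is used so that symlinks are not followed (the helper name `is_regular_file` is hypothetical, and peekdocs's own checks may differ):

```python
import os
import stat
import tempfile

def is_regular_file(path):
    """True only for regular files; pipes, sockets, directories, and symlinks are rejected."""
    try:
        mode = os.lstat(path).st_mode  # lstat: do not follow symlinks
    except OSError:
        return False  # missing or unreadable entries are treated as not searchable
    return stat.S_ISREG(mode)

with tempfile.TemporaryDirectory() as folder:
    regular = os.path.join(folder, "notes.txt")
    with open(regular, "w", encoding="utf-8") as handle:
        handle.write("hello")
    print(is_regular_file(regular))  # a normal file
    print(is_regular_file(folder))   # a directory is not a regular file
```

Checking the type first matters because opening a named pipe for reading can block until another process writes to it.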
## Preparing Your Documents for Searching
Most digital files (PDFs from banks, Word docs, emails, spreadsheets) are already searchable — just point peekdocs at the folder and search. No preparation needed.
**For paper documents** (tax returns, receipts, old letters), you'll need to scan them first:
1. **Scan at 300 DPI** — this is the sweet spot for text recognition. Lower resolutions produce poor OCR results. Most scanners default to 300 DPI.
2. **Save as searchable PDF** — modern scanners with built-in OCR (like the Fujitsu ScanSnap) automatically embed a text layer in the PDF. peekdocs reads these directly — no OCR flag needed.
3. **If your scanner doesn't have OCR** — save as PDF, JPG, or PNG. peekdocs can still search these using its OCR feature (enable the OCR checkbox in the GUI or use the `-O` flag in the CLI). Requires [Tesseract](https://github.com/UB-Mannheim/tesseract/wiki) to be installed.
4. **Already have image-only PDFs?** If you have a backlog of scans without a text layer, [ocrmypdf](https://github.com/ocrmypdf/OCRmyPDF) (free, open-source, runs locally) adds a text layer in place. Install with `brew install ocrmypdf` (macOS), `pipx install ocrmypdf` (Windows), or `sudo apt install ocrmypdf` (Linux), then run `ocrmypdf input.pdf input.pdf` (same path twice = convert in place). Batch a folder with `for f in *.pdf; do ocrmypdf --skip-text "$f" "$f"; done` — `--skip-text` leaves already-searchable PDFs alone, so it's safe to re-run. Once converted, peekdocs finds them instantly without the `-O` flag. peekdocs itself never modifies your PDFs; ocrmypdf is a separate tool you opt into for permanent conversion.
5. **Organize by topic, not by date** — folders like `Tax Returns`, `Insurance`, `Receipts` make it easier to target searches. But peekdocs also works fine with one big folder and recursive search.
6. **Phone camera works too** — take a photo of a document and save it as JPG or PNG. peekdocs can OCR it. For best results, photograph in good lighting with the document flat and square in the frame.
**Consider going paperless.** Scanned PDFs are widely accepted for tax and financial records — the IRS has accepted digital records since 1997, and banks, brokerages, and the IRS itself deliver documents as PDFs. Scan your paper receipts and tax returns, then organize them into folders. Once digitized, peekdocs can search years of documents in seconds — no more digging through shoeboxes. (Consult your tax advisor for your specific situation.)
**Tip:** Before selling or donating a computer, search your entire documents folder for sensitive data — passwords, account numbers, and personal information you may have forgotten about.
## Questions and Troubleshooting
Common questions, installation gotchas, and platform-specific issues are collected in **[docs/TROUBLESHOOTING.md](docs/TROUBLESHOOTING.md)** — ~90 entries covering search behavior, indexes, OCR, scheduling, email archives, network drives, uninstall steps, PDF report caveats, and more.
Quick diagnostic: run `peekdocs --check` (CLI) or open **Tools → System Check** (GUI). Both report your Python version, dependency status, Tesseract availability, SQLite version, and free disk space — most install-time issues resolve there.
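If peekdocs itself will not start, some of the same facts can be gathered with the Python standard library alone. The sketch below is not the `--check` implementation; the function name `environment_report` and the report keys are made up for this example, and dependency status is not covered.

```python
import platform
import shutil
import sqlite3

def environment_report(folder="."):
    """Collect basic environment facts using only the standard library."""
    usage = shutil.disk_usage(folder)
    return {
        "python": platform.python_version(),
        "sqlite": sqlite3.sqlite_version,
        "tesseract": shutil.which("tesseract"),  # None when Tesseract is not on PATH
        "free_gb": round(usage.free / 1e9, 1),
    }

for key, value in environment_report().items():
    print(f"{key}: {value}")
```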
Found a bug or have a feature idea? [Open an issue on GitHub](https://github.com/exbuf/peekdocs/issues).
## Glossary
The full glossary of peekdocs terms lives in **[docs/GLOSSARY.md](docs/GLOSSARY.md)**: 85 entries in all, covering FTS5, regex modes, deterministic, exit codes, Tesseract, jq, SIEM, MSP technician, and more, plus a list of common Python networking libraries peekdocs deliberately does *not* use.
## For IT and Security Teams
If you're evaluating peekdocs for your organization, here are the answers to the questions your security team will ask:
| Question | Answer |
|----------|--------|
| **Does it send data anywhere?** | No. peekdocs has no network calls, no telemetry, no tracking, no analytics, no phone-home. It never connects to the internet. All processing happens locally on the user's machine. |
| **Does it store what it finds?** | Yes — results are written to disk as a `.txt` report (always written, used internally by the GUI preview pane and chart views). Optional formats — DOCX, CSV, JSON, PDF, HTML — are opt-in via the Advanced Search Options checkboxes or `-o docx,csv,json,pdf,html` on the CLI. These files contain matched text from your documents. Use **Delete on Close** to automatically remove them when you close the app, or **Wipe Session** (Tools → Clear Files) to remove them immediately. If your search folder is cloud-synced, peekdocs automatically redirects reports to a safe local folder (`~/peekdocs_reports`) so no report files are uploaded by the syncing service. |
| **What about the search index?** | The optional search index (`.peekdocs.db`) is a SQLite database that contains the extracted text of every indexed file — this means it holds a searchable copy of your document content, including any sensitive data in those documents. Treat the index file with the same care as the documents themselves. The index is never required (uncheck "Index" to search files directly), and **Wipe Session** (Tools → Clear Files) deletes the index along with all result files, preview content, and search history. If you index a folder containing sensitive documents, consider deleting the index when you're done. |
| **Can it access files the user can't?** | No. peekdocs runs with the user's own file permissions. It cannot read files the user doesn't already have access to. It does not elevate privileges or bypass OS security. |
| **What kind of tool is it?** | A general-purpose local text search application. It reads documents you point it at, reports what it found, and writes nothing else. See [Disclaimer](#disclaimer). |
| **What does it install?** | Python packages only — no system services, no drivers, no registry entries, no background processes. It runs when launched and stops when closed. |
| **Can it modify or delete user files?** | No. peekdocs only reads user files. It creates its own report and state files — every one named with the `peekdocs_` prefix (visible outputs) or `.peekdocs` prefix (hidden user-state / per-folder dotfiles), with no exceptions — but never modifies, moves, or deletes any user documents. |
| **Is the source code available?** | Yes. Fully open-source under the MIT License. Available for audit at [github.com/exbuf/peekdocs](https://github.com/exbuf/peekdocs). |
| **How is it installed?** | Via `pipx` from the public GitHub source (`pipx install git+https://github.com/exbuf/peekdocs.git`; upgrade with `pipx upgrade peekdocs`) — fully auditable, no unsigned executables required. (PyPI upload is planned.) |
*For the deep dive — every file peekdocs writes (path, contents, sensitivity rating, cleanup), plus a documented list of risks that are outside the application's control (process arguments, swap space, force-kill, backup software, etc.) — see **[docs/SECURITY.md](docs/SECURITY.md)**. To report a suspected vulnerability, see **[SECURITY.md](SECURITY.md)** at the repository root.*
## Testing
**Unit tests** — 648 pytest tests that verify correctness: exact match counts, error messages, edge cases, argument validation, regex patterns, expression parsing, range queries, and more.
```bash
pytest tests/ -v
```
**Integration test** — end-to-end runs of every search mode and flag combination. Verifies that flag combinations run without crashing, all output formats are generated, file type coverage across 100+ sample files is reported, and match counts are confirmed stable. Results are saved to `peekdocs_global_test_results.txt`. The bash script is run on macOS and Linux, the PowerShell script on Windows, before each release. See the script headers for details.
```bash
cd samples/test-files
bash peekdocs_global_test_unix.sh "test file for peekdocs" # macOS / Linux
# Windows: powershell -ExecutionPolicy Bypass -File peekdocs_global_test_windows.ps1 "test file for peekdocs"
```
## Contributing
Ideas, bug reports, and pull requests are welcome. See [CONTRIBUTING.md](CONTRIBUTING.md) for details. PRs require a [Developer Certificate of Origin](https://developercertificate.org/) sign-off, which takes a single flag (`git commit -s`); the full how-to is in CONTRIBUTING.md.
If peekdocs saves you time, star the repo and share feedback — it helps others discover the tool.
## Author
Built by [Robert D. Schoening](https://robertdschoening.com) — electrical engineer, U.S. software patent holder, and independent developer. Developed with assistance from [Claude Code](https://claude.ai/code) by Anthropic. All architecture, review, testing, and maintenance performed by the author.
**Why I built it.** I built peekdocs to solve a problem I had myself: searching large collections of mixed-format documents locally, privately, and efficiently. It also became an opportunity to learn AI-assisted software development and explore what a single developer can build with today's tools. After relying on it in my own workflow, I decided to share it as free and open-source software under the MIT License.
## Disclaimer
peekdocs is provided as a general-purpose local text-search tool under the [MIT License](LICENSE), offered "as is" without warranty of any kind.
Regex Search performs pattern matching against text. Results depend entirely on the patterns the user supplies, and may include false positives or miss content that does not match those patterns. Review results in context before making decisions.
The tool is not designed or intended for high-assurance or safety-critical use cases. Users remain solely responsible for how they use and interpret its output.
## License
Copyright (c) 2026 Robert D. Schoening. peekdocs's own source code is licensed under the [MIT License](LICENSE).
### Note on dependencies
peekdocs depends on a number of third-party Python libraries, each with its own license. **End users running peekdocs are not affected by this** — the AGPL and similar copyleft terms govern distribution and modification, not use. A user who installs peekdocs to search their own documents triggers no obligations.
peekdocs's dependency tree includes a mix of permissive (MIT / BSD / Apache 2.0 / ISC / CC0) and copyleft (LGPL / GPL / AGPL) licenses. **The most significant ones to be aware of are:**
- **[PyMuPDF](https://github.com/pymupdf/PyMuPDF) (the PDF reader)** — AGPL v3 or a commercial license from [Artifex Software](https://artifex.com/licensing/)
- **[EbookLib](https://github.com/aerkalov/ebooklib) (the EPUB reader)** — AGPL v3 (no documented commercial-license alternative)
- **[extract-msg](https://github.com/TeamMsgExtractor/msg-extractor) (Outlook `.msg` email reader)** — GPL
- **[py7zr](https://py7zr.readthedocs.io/), [fpdf2](https://py-pdf.github.io/fpdf2/), and the optional [libpff-python](https://github.com/libyal/libpff)** — LGPL (weak copyleft, generally permits proprietary use through dynamic linking)
For the full per-library license listing — including every direct dependency declared in `pyproject.toml`, grouped by license category, with upstream links — see [`THIRD_PARTY_NOTICES.md`](THIRD_PARTY_NOTICES.md).
**Developers integrating peekdocs into derivative work should be aware that the dependency chain transitively carries AGPL / GPL / LGPL obligations.** Three common scenarios:
- **Your derivative work is open-source under an AGPL-compatible license.** Straightforward — all licenses coexist.
- **Your derivative work is closed-source or under a permissive license that's incompatible with AGPL** (MIT, BSD, Apache 2.0, etc.). You have three practical options: (a) accept that the combined work falls under AGPL terms, (b) acquire a commercial PyMuPDF license from Artifex Software for the PDF-reader piece *and* avoid the `.epub` reading code path entirely (since EbookLib has no commercial-license alternative), or (c) vendor or replace these libraries with permissively-licensed alternatives where your use case allows.
- **Internal-only company use without distribution.** Generally fine. Copyleft obligations are triggered by distribution / conveyance, not by internal use.
peekdocs makes no representations about license compatibility in your downstream context — consult your own counsel for derivative-work questions.