{"id":27102439,"url":"https://github.com/shredengineer/archive-agent","last_synced_at":"2026-02-23T23:20:00.821Z","repository":{"id":286321600,"uuid":"957420315","full_name":"shredEngineer/Archive-Agent","owner":"shredEngineer","description":"Archive Agent: Smart Indexer with RAG Engine","archived":false,"fork":false,"pushed_at":"2025-04-05T17:32:50.000Z","size":536,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-05T18:25:03.533Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"gpl-3.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/shredEngineer.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-03-30T10:28:44.000Z","updated_at":"2025-04-05T17:32:53.000Z","dependencies_parsed_at":"2025-04-05T18:36:49.675Z","dependency_job_id":null,"html_url":"https://github.com/shredEngineer/Archive-Agent","commit_stats":null,"previous_names":["shredengineer/archive-agent"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shredEngineer%2FArchive-Agent","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shredEngineer%2FArchive-Agent/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shredEngineer%2FArchive-Agent/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/shredEngineer%2FArchive-Agent/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owne
rs/shredEngineer","download_url":"https://codeload.github.com/shredEngineer/Archive-Agent/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247505859,"owners_count":20949911,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-04-06T15:38:32.461Z","updated_at":"2026-02-23T23:20:00.811Z","avatar_url":"https://github.com/shredEngineer.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"![Archive Agent Logo](archive_agent/assets/Archive-Agent-800x300.png)\n\n---\n\n# Archive Agent\n\n*An intelligent file indexer with powerful AI search (RAG engine), automatic OCR, and a seamless MCP interface.*\n\n![GitHub Release](https://img.shields.io/github/v/release/shredEngineer/Archive-Agent)\n![GitHub License](https://img.shields.io/github/license/shredEngineer/Archive-Agent)\n[![Listed on RAGHub](https://img.shields.io/badge/RAGHub-listed-green)](https://github.com/Andrew-Jang/RAGHub?tab=readme-ov-file#rag-projects)\n[![Verified on MCPHub](https://img.shields.io/badge/MCPHub-verified-green)](https://mcphub.com/mcp-servers/shredEngineer/Archive-Agent)\n[![Listed on MCP.so](https://img.shields.io/badge/MCP.so-listed-green)](https://mcp.so/server/Archive-Agent/shredEngineer)\n[![Verified on MseeP](https://mseep.ai/badge.svg)](https://mseep.ai/app/499d8d83-02c8-4c9b-9e4f-8e8391395482)\n[![Ask DeepWiki](https://deepwiki.com/badge.svg)](https://deepwiki.com/shredEngineer/Archive-Agent)\n\n**Archive Agent** brings RAG to your command line and connects to your tools via MCP — 
it's *not* a chatbot.\n\n---\n\n## Find what you need with natural language\n\n- **Unlock your documents with semantic AI search \u0026 query**\n- Files are split using [semantic chunking with context headers](#how-smart-chunking-works) and committed to a local database.\n- [RAG engine](#how-chunks-are-retrieved)**¹** uses [reranking and expanding](#how-chunks-are-reranked-and-expanded) of retrieved chunks\n \n**¹** *[Retrieval Augmented Generation](https://en.wikipedia.org/wiki/Retrieval-augmented_generation) is the method of matching pre-made snippets of information to a query.*\n\n---\n\n## Natively index your documents on-device\n\n- **Includes local AI file system indexer**\n- Natively ingests [PDFs, images, Markdown, plaintext, and more…](#which-files-are-processed)\n- [Selects and tracks files using patterns](#how-files-are-selected-for-tracking) like `~/Documents/*.pdf` \n- Transcribes images using [automatic OCR](#ocr-strategies) (experimental) and entity extraction\n- Changes are automatically synced to a local [Qdrant](https://qdrant.tech/) vector database.\n\n---\n\n## Your AI, Your Choice\n\n- **Supports many AI providers and MCP** \n- [OpenAI](https://platform.openai.com/docs/overview) or compatible API ¹ for best performance\n- [OpenRouter](https://openrouter.ai/) for access to 400+ models from all providers\n- [Ollama](https://ollama.com/) and [LM Studio](https://lmstudio.ai/) for best privacy (local LLM)\n- **Integrates with your workflow** via a built-in [MCP](https://modelcontextprotocol.io/introduction) server.\n\n\u003csmall\u003e**¹** Includes [xAI / Grok](https://x.ai/api) and [Claude](https://docs.anthropic.com/en/api/openai-sdk) OpenAI-compatible APIs.\nSimply adjust the URL [settings](#archive-agent-settings) and overwrite `OPENAI_API_KEY`.\u003c/small\u003e\n\n---\n\n## Scalable Performance\n\n- **Fully resumable parallel processing**\n- Processes multiple files at once using optimized multi-threading.  
\n- Uses AI cache and generous request retry logic for all network requests.\n- Leverages AI structured output with high-quality prompts and schemas.\n\n---\n\n## Architecture\n\n([If you can't see the diagram below, view it on Mermaid.live](https://mermaid.live/edit#pako:eNqFU9tu2kAQ_ZXRSnkDamwDiRUhEYcQEmgT6EVqycPanpgV9q61XieQKP_evUBuqtQX5J1z5sycmeGZpCJDEpFc0moNs8WKrzjA0RFMeY61YoLDDauwYBwNUDeJY77CJgow-nOaDL9Lmm6A8gxiUZZMwQUrsD79kgxPEzlcNlUlpKrhRop7A9xBuz2EM5Pp1D7yR40SJVUshW_xApZKUoX5ziDTkuYIY66Y2sF4q5HUdHLnWjmzsrGRXWJJuVGI1w3fMJ4ftB-ZWusmucKtgkukGcraxGfaZg0VSjgrRLqJoOt5Jv5LyMzFrVIEvufty8W23LkpNy4TzBzj1cVcj7eIwBRqo8Ez3UY7aBdU5mgIPzFVQsKSPWEEgTfw97rnVndsbWgCftKdiZQWcJtJbRAyqmhCa7SpyLPXJd42KHf_XqCFXKkLU2RUb0zM7tQUcduZvPl6D_7X2d7ExIpcGpEFKsnwAeErUmmW_dHPMjUe54xH4HW6JuJwmNOtmctecOwE3ePSPqZOXQ9iA8kOFljgA-UpHpTf6xzvZaY288qa21buYu0xfEpa0Iw1tT6Dfd6Vzbs2eRPkaE4SRrx-RGlH5kjXljRzJPUON8KMQzybtmDyQ__M45vDykhL_wVZRiIlG2yREmVJzZM8G8KKqDWWuCKR_syo3KzIir_oHN37byHKQ5oUTb4m0T0tav1qKn0YeM6o3vgbRRdDGYuGKxKdWAUSPZMtifqDTs_vhqHve4HX94OwRXYk8sOwc-ydhN7xwO8O-v0weGmRJ1uz2wl6vYEX9sLeySAcBN3w5S8sgV42))\n\n```mermaid\ngraph LR\n\n  %% Ingestion Pipeline\n  subgraph Ingestion\n    A[\u003cb\u003eTrack and Commit Files\u003c/b\u003e\u003cbr\u003eSupports Profiles] --\u003e B[\u003cb\u003eIngest Files\u003c/b\u003e\u003cbr\u003eAutomatic OCR Strategy\u003cbr\u003eImage Entity Extraction]\n    B --\u003e C[\u003cb\u003eSemantic Chunking\u003c/b\u003e\u003cbr\u003ewith Context Headers\u003cbr\u003eLines per Block: 100\u003cbr\u003eWords per Chunk: 200]\n    C --\u003e D[\u003cb\u003eEmbed Chunks\u003c/b\u003e\u003cbr\u003eModel: text-embedding-3-large\u003cbr\u003eVector Size: 3072]\n    D --\u003e E[\u003cb\u003eStore Chunks\u003c/b\u003e\u003cbr\u003eLocal Qdrant database]\n  end\n\n  %% Query Pipeline\n  subgraph Query\n    F[\u003cb\u003eAsk Question\u003c/b\u003e] --\u003e G[\u003cb\u003eEmbed Question\u003c/b\u003e\u003cbr\u003eModel: text-embedding-3-large]\n    G --\u003e 
H[\u003cb\u003eRetrieve Nearest Chunks\u003c/b\u003e\u003cbr\u003eScore Min: 0.1\u003cbr\u003eChunks Max: 30]\n    E --\u003e H\n    H --\u003e I[\u003cb\u003eRerank by Relevance\u003c/b\u003e\u003cbr\u003eChunks Max: 8]\n    I --\u003e J[\u003cb\u003eExpand Context\u003c/b\u003e\u003cbr\u003eChunks Radius: 1]\n    J --\u003e K[\u003cb\u003eGenerate Answer\u003c/b\u003e]\n    K --\u003e L[\u003cb\u003eGet Answer\u003c/b\u003e\u003cbr\u003ein CLI, GUI, MCP]\n  end\n```\n\n---\n\n## Just getting started?\n\n- 👉 [Install Archive Agent on Linux](#install-archive-agent)\n- 👉 [Run Archive Agent](#run-archive-agent)\n- 👉 [MCP Tools](#mcp-tools)\n- 👉 [Update Archive Agent](#update-archive-agent)\n\n---\n\n## Documentation\n\n\u003c!-- TOC --\u003e\n* [Archive Agent](#archive-agent)\n  * [Find what you need with natural language](#find-what-you-need-with-natural-language)\n  * [Natively index your documents on-device](#natively-index-your-documents-on-device)\n  * [Your AI, Your Choice](#your-ai-your-choice)\n  * [Scalable Performance](#scalable-performance)\n  * [Architecture](#architecture)\n  * [Just getting started?](#just-getting-started)\n  * [Documentation](#documentation)\n  * [Supported OS](#supported-os)\n  * [Install Archive Agent](#install-archive-agent)\n    * [Ubuntu / Linux Mint](#ubuntu--linux-mint)\n  * [AI provider setup](#ai-provider-setup)\n    * [OpenAI provider setup](#openai-provider-setup)\n    * [OpenRouter provider setup](#openrouter-provider-setup)\n    * [Ollama provider setup](#ollama-provider-setup)\n    * [LM Studio provider setup](#lm-studio-provider-setup)\n  * [Which files are processed](#which-files-are-processed)\n  * [How files are processed](#how-files-are-processed)\n  * [OCR strategies](#ocr-strategies)\n  * [How smart chunking works](#how-smart-chunking-works)\n  * [How chunk references work](#how-chunk-references-work)\n  * [How chunks are retrieved](#how-chunks-are-retrieved)\n  * [How chunks are reranked and 
expanded](#how-chunks-are-reranked-and-expanded)\n  * [How answers are generated](#how-answers-are-generated)\n  * [How files are selected for tracking](#how-files-are-selected-for-tracking)\n  * [Run Archive Agent](#run-archive-agent)\n  * [Quickstart on the command line (CLI)](#quickstart-on-the-command-line-cli)\n  * [CLI command reference](#cli-command-reference)\n    * [See list of commands](#see-list-of-commands)\n    * [Create or switch profile](#create-or-switch-profile)\n    * [Open current profile config in nano](#open-current-profile-config-in-nano)\n    * [Add included patterns](#add-included-patterns)\n    * [Add excluded patterns](#add-excluded-patterns)\n    * [Remove included / excluded patterns](#remove-included--excluded-patterns)\n    * [List included / excluded patterns](#list-included--excluded-patterns)\n    * [Resolve patterns and track files](#resolve-patterns-and-track-files)\n    * [List tracked files](#list-tracked-files)\n    * [List changed files](#list-changed-files)\n    * [Commit changed files to database](#commit-changed-files-to-database)\n    * [Combined track and commit](#combined-track-and-commit)\n    * [Search your files](#search-your-files)\n    * [Query your files](#query-your-files)\n    * [Launch Archive Agent GUI](#launch-archive-agent-gui)\n    * [Start MCP Server](#start-mcp-server)\n  * [MCP Tools](#mcp-tools)\n  * [Update Archive Agent](#update-archive-agent)\n    * [Archive Agent settings](#archive-agent-settings)\n      * [Profile configuration](#profile-configuration)\n    * [Watchlist](#watchlist)\n    * [AI cache](#ai-cache)\n  * [Qdrant database](#qdrant-database)\n  * [Developer's guide](#developers-guide)\n    * [Important modules](#important-modules)\n    * [Network and Retry Handling](#network-and-retry-handling)\n    * [Code testing and analysis](#code-testing-and-analysis)\n    * [Run Qdrant with in-memory storage](#run-qdrant-with-in-memory-storage)\n  * [Tools](#tools)\n    * [Rename file paths in chunk 
metadata](#rename-file-paths-in-chunk-metadata)\n    * [Remove file paths from context headers](#remove-file-paths-from-context-headers)\n  * [Known issues](#known-issues)\n  * [Licensed under GNU GPL v3.0](#licensed-under-gnu-gpl-v30)\n  * [Collaborators welcome](#collaborators-welcome)\n\u003c!-- TOC --\u003e\n\n---\n\n## Supported OS\n\n**Archive Agent** has been tested with these configurations:\n\n- **Ubuntu 24.04** (PC x64)\n- **Ubuntu 22.04** (PC x64)\n\nIf you've successfully installed and tested **Archive Agent** with a different setup, please let me know and I'll add it here! \n\n---\n\n## Install Archive Agent\n\nPlease install these requirements before proceeding:\n\n- [Docker](https://docs.docker.com/engine/install/) *(for running Qdrant server)*\n- [Python](https://www.python.org/downloads/) **\u003e= 3.10** *(core runtime)* (usually already installed)\n\n### Ubuntu / Linux Mint\n\nThis installation method should work on any Linux distribution derived from Ubuntu (e.g. Linux Mint). 
\n\nTo install **Archive Agent** into a directory of your choice, change to that directory and run this once:\n\n```bash\ngit clone https://github.com/shredEngineer/Archive-Agent\ncd Archive-Agent\nchmod +x install.sh\n./install.sh\n```\n\nThe `install.sh` script will execute the following steps:\n- Download and install `uv` (used for Python environment management)\n- Install the custom Python environment\n- Install the `spaCy` model for natural language processing (pre-chunking)\n- Install `pandoc` (used for document parsing)\n- Download and install the Qdrant Docker image with persistent storage and auto-restart\n- Install a global `archive-agent` command for the current user\n\n**Archive Agent is now installed!**\n\n👉 **Please complete the [AI provider setup](#ai-provider-setup) next.**  \n(Afterward, you'll be ready to [Run Archive Agent](#run-archive-agent)!)\n\n---\n\n## AI provider setup\n\n**Archive Agent** lets you choose between different AI providers:\n\n- Remote APIs *(higher performance and cost, less privacy)*:\n  - **OpenAI**: Requires an OpenAI API key.\n  - **OpenRouter**: Requires an OpenRouter API key. Access to 400+ models.\n\n- Local APIs *(lower performance and cost, best privacy)*:\n  - **Ollama**: Requires Ollama running locally.\n  - **LM Studio**: Requires LM Studio running locally.\n\n💡 **Good to know:** You will be prompted to choose an AI provider at startup (see [Run Archive Agent](#run-archive-agent)).\n\n📌 **Note:** You *can* customize the specific **models** used by the AI provider in the [Archive Agent settings](#archive-agent-settings). 
However, you *cannot* change the AI provider of an *existing* profile, as the embeddings will be incompatible; to choose a different AI provider, create a new profile instead.\n\n### OpenAI provider setup\n\nIf the OpenAI provider is selected, **Archive Agent** requires the OpenAI API key.\n\nTo export your [OpenAI API key](https://platform.openai.com/api-keys), replace `sk-...` with your actual key and run this once:\n\n```bash\necho \"export OPENAI_API_KEY='sk-...'\" \u003e\u003e ~/.bashrc \u0026\u0026 source ~/.bashrc\n```\n\nThis will persist the export for the current user.\n\n💡 **Good to know:** [OpenAI won't use your data for training.](https://platform.openai.com/docs/guides/your-data)\n\n### OpenRouter provider setup\n\nIf the OpenRouter provider is selected, **Archive Agent** requires an OpenRouter API key.\n\n[OpenRouter](https://openrouter.ai/) provides a unified API to access 400+ models from many providers (OpenAI, Google, Anthropic, Meta, and more) through a single endpoint.\n\nTo export your [OpenRouter API key](https://openrouter.ai/settings/keys), replace `sk-or-...` with your actual key and run this once:\n\n```bash\necho \"export OPENROUTER_API_KEY='sk-or-...'\" \u003e\u003e ~/.bashrc \u0026\u0026 source ~/.bashrc\n```\n\nThis will persist the export for the current user.\n\nWith the default [Archive Agent Settings](#archive-agent-settings), these OpenRouter models are used:\n\n| Task   | Default Model                   | Input/Output Cost          |\n|--------|---------------------------------|----------------------------|\n| Chunk  | `google/gemini-2.5-flash-lite`  | $0.10 / $0.40 per M tokens |\n| Rerank | `google/gemini-2.5-flash-lite`  | $0.10 / $0.40 per M tokens |\n| Query  | `google/gemini-2.5-flash`       | $0.30 / $2.50 per M tokens |\n| Vision | `google/gemini-2.5-flash`       | $0.30 / $2.50 per M tokens |\n| Embed  | `openai/text-embedding-3-large` | $0.13 per M tokens         |\n\n💡 **Good to know:** You can customize the models in 
the [Archive Agent settings](#archive-agent-settings). OpenRouter supports [structured outputs](https://openrouter.ai/docs/guides/features/structured-outputs), [embeddings](https://openrouter.ai/docs/api/reference/embeddings), and [vision](https://openrouter.ai/docs/guides/overview/multimodal/overview) across many models. Browse all available models at [openrouter.ai/models](https://openrouter.ai/models).\n\n### Ollama provider setup\n\nIf the Ollama provider is selected, **Archive Agent** requires Ollama running at `http://localhost:11434`.\n\n- [How to install Ollama.](https://ollama.com/download)\n\nWith the default [Archive Agent Settings](#archive-agent-settings), these Ollama models are expected to be installed: \n\n```bash\nollama pull llama3.1:8b             # for chunk/rerank/query\nollama pull llava:7b-v1.6           # for vision\nollama pull nomic-embed-text:v1.5   # for embed\n```\n\n💡 **Good to know:** Ollama also works without a GPU.\nAt least 32 GiB RAM is recommended for smooth performance.\n\n### LM Studio provider setup\n\nIf the LM Studio provider is selected, **Archive Agent** requires LM Studio running at `http://localhost:1234`.\n\n- [How to install LM Studio.](https://lmstudio.ai/download)\n\nWith the default [Archive Agent Settings](#archive-agent-settings), these LM Studio models are expected to be installed: \n\n```bash\nmeta-llama-3.1-8b-instruct              # for chunk/rerank/query\nllava-v1.5-7b                           # for vision\ntext-embedding-nomic-embed-text-v1.5    # for embed\n```\n\n💡 **Good to know:** LM Studio also works without a GPU.\nAt least 32 GiB RAM is recommended for smooth performance.\n\n---\n\n## Which files are processed\n\n**Archive Agent** currently supports these file types:\n- Text:\n  - Plaintext: `.txt`, `.md`, `.markdown`\n  - Documents:\n    - ASCII documents: `.html`, `.htm` (images not supported)\n    - Binary documents: `.odt`, `.docx` (including images)\n  - PDF documents: `.pdf` (including images; 
see [OCR strategies](#ocr-strategies))\n- Images: `.jpg`, `.jpeg`, `.png`, `.gif`, `.webp`, `.bmp`\n\n📌 **Note:** Images in HTML documents are currently not supported.\n\n📌 **Note:** Legacy `.doc` files are currently not supported.\n\n📌 **Note:** Unsupported files are tracked but not processed.\n\n---\n\n## How files are processed\n\nUltimately, **Archive Agent** decodes everything to text like this:\n- Plaintext files are decoded as UTF-8.\n- Documents are converted to plaintext, and their images are extracted.\n- PDF documents are decoded according to the OCR strategy.\n- Images are decoded to text using AI vision.\n  - Uses OCR, entity extraction, or both combined (default).\n  - The vision model will reject unintelligible images.\n  - *Entity extraction* extracts structured information from images.\n  - Structured information is formatted as an image description.\n\nSee [Archive Agent settings](#archive-agent-settings): `image_ocr`, `image_entity_extract`\n\n**Archive Agent** processes files with optimized performance:\n- **Surgical Synchronization**:\n  - The PDF analysis phase is serialized (due to PyMuPDF threading limitations).\n  - All other phases (vision, chunking, embedding) run in parallel for maximum performance.\n- **Vision operations** are parallelized across images and pages within and across files.\n- **Embedding operations** are parallelized across text chunks and files.\n- **Smart chunking** uses sequential processing due to carry mechanism dependencies.\n\nSee [Archive Agent settings](#archive-agent-settings): `max_workers_ingest`, `max_workers_vision`, `max_workers_embed`\n\n---\n\n## OCR strategies\n\n**Archive Agent** supports different OCR strategies for PDF documents:\n\n- `strict` OCR strategy (**recommended**):\n  - PDF OCR text layer is *ignored*.\n  - PDF pages are treated as images and processed with OCR only.\n  - **Expensive and slow, but more accurate.**\n\n- `relaxed` OCR strategy:\n  - PDF OCR text layer is extracted.\n  - PDF 
foreground images are decoded with OCR, but background images are *ignored*.\n  - **Cheap and fast, but less accurate.**\n\n- `auto` OCR strategy:\n  - Attempts to select the best OCR strategy for each page, based on the number of characters extracted from the PDF OCR text layer, if any.\n  - Decides based on `ocr_auto_threshold`, the minimum number of characters for `auto` OCR strategy to resolve to `relaxed` instead of `strict`.\n  - **Trade-off between cost, speed, and accuracy.**\n\n⚠️ **Warning:** The `auto` OCR strategy is still experimental.\nPDF documents often contain small/scattered images related to page style/layout which cause overhead while contributing little information or even cluttering the result.\n\n💡 **Good to know:** You will be prompted to choose an OCR strategy at startup (see [Run Archive Agent](#run-archive-agent)).\n\n---\n\n## How smart chunking works\n\n**Archive Agent** processes decoded text like this:\n- Decoded text is sanitized and split into sentences.\n- Sentences are grouped into reasonably-sized blocks.\n- **Each block is split into smaller chunks using an AI model.**\n  - Block boundaries are handled gracefully (last chunk carries over).\n- Each chunk is prefixed with a *context header* (improves search).\n- Each chunk is turned into a vector using AI embeddings.\n- Each vector is turned into a *point* with file metadata.\n- Each *point* is stored in the Qdrant database.\n\nSee [Archive Agent settings](#archive-agent-settings): `chunk_lines_block`, `chunk_words_target`\n\n💡 **Good to know:** This **smart chunking** improves the accuracy and effectiveness of the retrieval.\n\n📌 **Note:** In rare cases where a chunk exceeds the embedding model's token limit (typically 8192 tokens), **Archive Agent** automatically truncates it as a last resort with progressive 10% reductions (up to 10 attempts) until it fits. 
\n\n📌 **Note:** Splitting into sentences may take some time for huge documents.\nThere is currently no way to show progress for this step.\n\n---\n\n## How chunk references work\n\nTo ensure that every chunk can be traced back to its origin, **Archive Agent** maps the text contents of each chunk to the corresponding line numbers or page numbers of the source file.\n\n- Line-based files (e.g., `.txt`) use the range of line numbers as the reference.\n- Page-based files (e.g., `.pdf`) use the range of page numbers as the reference.\n\n📌 **Note:** References are only *approximate* because paragraphs and sentences are split and joined during the chunking process.\n\n---\n\n## How chunks are retrieved\n\n**Archive Agent** retrieves chunks related to your question like this:\n- The question is turned into a vector using AI embeddings.\n- Points with similar vectors are retrieved from the Qdrant database.\n- Only chunks of points with a sufficient score are kept.\n- If `retrieve_knee_enable` is enabled, an adaptive cutoff trims low-relevance chunks when there is a clear score drop-off.\n\nSee [Archive Agent settings](#archive-agent-settings): `retrieve_score_min`, `retrieve_chunks_max`, `retrieve_knee_enable`, `retrieve_knee_sensitivity`, `retrieve_knee_min_chunks`\n\n📌 **Note:** The adaptive cutoff uses the Kneedle algorithm on the sorted similarity scores. 
Set a higher\n`retrieve_knee_sensitivity` to make the cutoff more conservative, and use `retrieve_knee_min_chunks`\nto enforce a minimum floor.\n\n💡 **Tuning tips:**\n- If retrieval feels too short or misses context, increase `retrieve_knee_min_chunks` (e.g., `3`–`5`) or raise `retrieve_knee_sensitivity`.\n- If retrieval feels too long or noisy, lower `retrieve_knee_sensitivity` slightly (e.g., `0.8`–`1.0`) or reduce `retrieve_knee_min_chunks`.\n- If you want to disable the adaptive cutoff entirely, set `retrieve_knee_enable` to `false`.\n\n---\n\n## How chunks are reranked and expanded\n\n**Archive Agent** filters the retrieved chunks like this:\n\n- Retrieved chunks are reranked by relevance to your question.\n- Only the top relevant chunks are kept (the other chunks are discarded).\n- Each selected chunk is expanded to get a larger context from the relevant documents.\n\nSee [Archive Agent settings](#archive-agent-settings): `rerank_chunks_max`, `expand_chunks_radius`\n\n---\n\n## How answers are generated\n\n**Archive Agent** answers your question using the reranked and expanded chunks like this:\n- The LLM receives the chunks as context for the question.\n- The LLM's answer is returned as structured output and formatted.\n\n💡 **Good to know:** **Archive Agent** uses an answer template that aims to be universally helpful.\n\n---\n\n## How files are selected for tracking\n\n**Archive Agent** uses *patterns* to select your files:\n\n- Patterns can be actual file paths.\n- Patterns can be paths containing wildcards that resolve to actual file paths.\n\n\n- 💡 **Patterns must be specified as (or resolve to) *absolute* paths, e.g. 
`/home/user/Documents/*.txt` (or `~/Documents/*.txt`).**\n\n\n- 💡 **Use the wildcard `*` to match any file in the given directory.**\n\n\n- 💡 **Use the wildcard `**` to match any files and zero or more directories, subdirectories, and symbolic links to directories.**\n\nThere are *included patterns* and *excluded patterns*:\n\n- The set of resolved excluded files is removed from the set of resolved included files.\n- Only the remaining set of files (included but not excluded) is tracked by **Archive Agent**. \n- Hidden files are always ignored!\n\nThis approach gives you the best control over the specific files or file types to track.\n\n---\n\n## Run Archive Agent\n\n💡 **Good to know:** At startup, you will be prompted to choose the following:\n- **Profile name**\n- **AI provider** (see [AI Provider Setup](#ai-provider-setup))\n- **OCR strategy** (see [OCR strategies](#ocr-strategies))\n\nScreenshot of **command-line** interface (CLI):\n\n![](archive_agent/assets/Screenshot-CLI.png)\n\n---\n\n## Quickstart on the command line (CLI)\n\nFor example, to [track](#how-files-are-selected-for-tracking) your documents and images, run this:\n\n```bash\narchive-agent include \"~/Documents/**\" \"~/Images/**\"\narchive-agent update\n```\n\nTo start the GUI, run this:\n\n```bash\narchive-agent \n```\n\nOr, to ask questions from the command line:\n\n```bash\narchive-agent query \"Which files mention donuts?\"\n```\n\n---\n\n## CLI command reference\n\n### See list of commands\n\nTo see the list of supported commands, run this:\n\n```bash\narchive-agent\n```\n\n### Create or switch profile\n\nTo switch to a new or existing profile, run this:\n\n```bash\narchive-agent switch \"My Other Profile\"\n```\n\n📌 **Note:** **Always use quotes** for the profile name argument,\n**or skip it** to get an interactive prompt.\n\n💡 **Good to know:** Profiles are useful to manage *independent* Qdrant collections (see [Qdrant database](#qdrant-database)) and [Archive Agent 
settings](#archive-agent-settings).\n\n### Open current profile config in nano\n\nTo open the current profile's config (JSON) in the `nano` editor, run this:\n\n```bash\narchive-agent config\n```\n\nSee [Archive Agent settings](#archive-agent-settings) for details.\n\n### Add included patterns\n\nTo add one or more included [patterns](#how-files-are-selected-for-tracking), run this:\n\n```bash\narchive-agent include \"~/Documents/*.txt\"\n```\n\n📌 **Note:** **Always use quotes** for the pattern argument (to prevent your shell's wildcard expansion),\n**or skip it** to get an interactive prompt.\n\n### Add excluded patterns\n\nTo add one or more excluded [patterns](#how-files-are-selected-for-tracking), run this:\n\n```bash\narchive-agent exclude \"~/Documents/*.txt\"\n```\n\n📌 **Note:** **Always use quotes** for the pattern argument (to prevent your shell's wildcard expansion),\n**or skip it** to get an interactive prompt.\n\n### Remove included / excluded patterns\n\nTo remove one or more previously included / excluded patterns, run this:\n\n```bash\narchive-agent remove \"~/Documents/*.txt\"\n```\n\n📌 **Note:** **Always use quotes** for the pattern argument (to prevent your shell's wildcard expansion),\n**or skip it** to get an interactive prompt.\n\n### List included / excluded patterns\n\nTo see the list of included / excluded patterns, run this: \n\n```bash\narchive-agent patterns\n```\n\n### Resolve patterns and track files\n\nTo resolve all patterns and track changes to your files, run this:\n\n```bash\narchive-agent track\n```\n\n### List tracked files\n\nTo see the list of tracked files, run this: \n\n```bash\narchive-agent list\n```\n\n📌 **Note:** Don't forget to `track` your files first.\n\n### List changed files\n\nTo see the list of changed files, run this: \n\n```bash\narchive-agent diff\n```\n\n📌 **Note:** Don't forget to `track` your files first.\n\n### Commit changed files to database\n\nTo sync changes to your files with the Qdrant database, run 
this:\n\n```bash\narchive-agent commit\n```\n\nTo see additional information (vision, chunking, embedding), pass the `--verbose` option.\n\nTo bypass the [AI cache](#ai-cache) (vision, chunking, embedding) for this commit, pass the `--nocache` option.\n\nTo automatically confirm deleting untracked files from the database, pass the `--confirm-delete` option.\n\n💡 **Good to know:** Changes are triggered by:\n- File added\n- File removed\n- File changed:\n  - Different file size\n  - Different modification date\n\nThe Qdrant database is updated after all files have been ingested. \n\n📌 **Note:** Don't forget to `track` your files first.\n\n### Combined track and commit\n\nTo `track` and then `commit` in one go, run this:\n\n```bash\narchive-agent update\n```\n\nTo see additional information (vision, chunking, embedding), pass the `--verbose` option.\n\nTo bypass the [AI cache](#ai-cache) (vision, chunking, embedding) for this commit, pass the `--nocache` option.\n\nTo automatically confirm deleting untracked files from the database, pass the `--confirm-delete` option.\n\n### Search your files\n\n```bash\narchive-agent search \"Which files mention donuts?\"\n```\n\nLists files relevant to the question.\n\n📌 **Note:** **Always use quotes** for the question argument, **or skip it** to get an interactive prompt.\n\nTo see additional information (embedding, retrieval, reranking), pass the `--verbose` option.\n\nTo bypass the [AI cache](#ai-cache) (embedding, reranking) for this search, pass the `--nocache` option.\n\n### Query your files\n\n```bash\narchive-agent query \"Which files mention donuts?\"\n```\n\nAnswers your question using RAG.\n\n📌 **Note:** **Always use quotes** for the question argument, **or skip it** to get an interactive prompt.\n\nTo see additional information (embedding, retrieval, reranking, querying), pass the `--verbose` option.\n\nTo bypass the [AI cache](#ai-cache) (embedding, reranking) for this query, pass the `--nocache` option.\n\nTo save the 
query results to a JSON file, run either:\n\n- `--to-json` with a specific filename:\n  ```bash\n  archive-agent query \"Which files mention donuts?\" --to-json answer.json\n  ```\n\n- `--to-json-auto [DIR]` to auto-generate a clean filename from the question\n  (max 160 chars, truncated with `[...]` if needed)\n  and write to directory `DIR` if provided (defaults to current directory `.`; creates directories in path if not existing):\n  ```bash\n  archive-agent query \"Which files mention donuts?\" --to-json-auto Output/\n  # Creates: Output/Which_files_mention_donuts_.json\n  ```\n\n📌 **Note:** As of **Archive Agent** v12.2.0, a corresponding Markdown file (`.md`) containing the answer is also created when using the `--to-json` or `--to-json-auto` options. (There is currently no way to opt out of this.)   \n\n### Launch Archive Agent GUI\n\nTo launch the **Archive Agent** GUI in your browser, run this:\n\n```bash\narchive-agent gui\n```\n\nTo see additional information (embedding, retrieval, reranking, querying), pass the `--verbose` option.\n\nTo bypass the [AI cache](#ai-cache) (embedding, reranking) for this query, pass the `--nocache` option.\n\nTo save the query results to JSON files, run this:\n\n- `--to-json-auto [DIR]` to auto-generate clean filenames from the questions\n  (max 160 chars, truncated with `[...]` if needed)\n  and write to directory `DIR` if provided (defaults to current directory `.`; creates directories in path if not existing):\n  ```bash\n  archive-agent gui --to-json-auto Output/\n  ```\n\n📌 **Note:** As of **Archive Agent** v12.2.0, corresponding Markdown files (`.md`) containing the answers are also created when using the `--to-json-auto` option. (There is currently no way to opt out of this.)   
\n\n📌 **Note:** Press `CTRL+C` in the console to close the GUI server.\n\n### Start MCP Server\n\nTo start the **Archive Agent** MCP server, run this:\n\n```bash\narchive-agent mcp\n```\n\nTo see additional information (embedding, retrieval, reranking, querying), pass the `--verbose` option.\n\nTo bypass the [AI cache](#ai-cache) (embedding, reranking) for this query, pass the `--nocache` option.\n\nTo save the query results to JSON files, run this:\n\n- `--to-json-auto [DIR]` to auto-generate clean filenames from the questions\n  (max 160 chars, truncated with `[...]` if needed)\n  and write to directory `DIR` if provided (defaults to current directory `.`; creates directories in path if not existing):\n  ```bash\n  archive-agent mcp --to-json-auto Output/\n  ```\n\n📌 **Note:** As of **Archive Agent** v12.2.0, corresponding Markdown files (`.md`) containing the answers are also created when using the `--to-json-auto` option. (There is currently no way to opt out of this.)\n\n📌 **Note:** Press `CTRL+C` in the console to close the MCP server.\n\n💡 **Good to know:** Use these MCP configurations to let your IDE or AI extension automate **Archive Agent**:\n\n- [`.vscode/mcp.json`](.vscode/mcp.json) for [GitHub Copilot agent mode (VS Code)](https://code.visualstudio.com/blogs/2025/02/24/introducing-copilot-agent-mode): \n- [`.roo/mcp.json`](.roo/mcp.json) for [Roo Code (VS Code extension)](https://marketplace.visualstudio.com/items?itemName=RooVeterinaryInc.roo-cline)\n\n---\n\n## MCP Tools\n\n**Archive Agent** exposes these tools via MCP:\n\n| MCP tool            | Equivalent CLI command(s) | Argument(s) | Implementation | Description                                     |\n|---------------------|---------------------------|-------------|----------------|-------------------------------------------------|\n| `get_patterns`      | `patterns`                | None        | Synchronous    | Get the list of included / excluded patterns.   
|\n| `get_files_tracked` | `track` and then `list`   | None        | Synchronous    | Get the list of tracked files.                  |\n| `get_files_changed` | `track` and then `diff`   | None        | Synchronous    | Get the list of changed files.                  |\n| `get_chunk_headers` | None                      | `file_path` | Asynchronous   | Get list of chunk headers for a file.           |\n| `get_search_result` | `search`                  | `question`  | Asynchronous   | Get the list of files relevant to the question. |\n| `get_answer_rag`    | `query`                   | `question`  | Asynchronous   | Get answer to question using RAG.               |\n\n📌 **Note:** These commands are **read-only**, preventing the AI from changing your Qdrant database.\n\n💡 **Good to know:** Just type `#get_answer_rag` (e.g.) in your IDE or AI extension to call the tool directly.\n\n💡 **Good to know:** The `#get_answer_rag` output follows the `QuerySchema` format defined in [`AiQuery.py`](archive_agent/ai/query/AiQuery.py).\n\n💡 **Good to know:** The `#get_chunk_headers` tool provides a quick overview of a document's structure by returning all chunk headers (semantic summaries) for a given file. 
This is useful for understanding document contents without retrieving full text.\n\n---\n\n## Update Archive Agent\n\nThis step is not immediately needed if you just installed Archive Agent.\nHowever, to get the latest features, you should update your installation regularly.\n\nTo update your **Archive Agent** installation, run this in the installation directory:\n\n```bash\n./update.sh\n```\n\n📌 **Note:** If updating doesn't work, try removing the installation directory and then [Install Archive Agent](#install-archive-agent) again.\nYour config and data are safely stored in another place;\nsee [Archive Agent settings](#archive-agent-settings) and [Qdrant database](#qdrant-database) for details.\n\n💡 **Good to know:** To also update the Qdrant docker image, run this:\n\n```bash\nsudo ./manage-qdrant.sh update\n```\n\n---\n\n### Archive Agent settings\n\n**Archive Agent** settings are organized as profile folders in `~/.archive-agent-settings/`.\n\nE.g., the `default` profile is located in `~/.archive-agent-settings/default/`.\n\nThe currently used profile is stored in `~/.archive-agent-settings/profile.json`.\n\n📌 **Note:** To delete a profile, simply delete the profile folder.\nThis will not delete the Qdrant collection (see [Qdrant database](#qdrant-database)).\n\n#### Profile configuration\n\nThe profile configuration is contained in the profile folder as `config.json`.\n\n💡 **Good to know:** Use the `config` CLI command to open the current profile's config (JSON) in the `nano` editor (see [Open current profile config in nano](#open-current-profile-config-in-nano)).\n\n💡 **Good to know:** Use the `switch` CLI command to switch to a new or existing profile (see [Create or switch profile](#create-or-switch-profile)).\n\n| Key                         | Description                                                                                       
|\n|-----------------------------|---------------------------------------------------------------------------------------------------|\n| `config_version`            | Config version                                                                                    |\n| `mcp_server_host`           | MCP server host (default `http://127.0.0.1`; set to `http://0.0.0.0` to expose in LAN)            |\n| `mcp_server_port`           | MCP server port (default `8008`)                                                                  |\n| `ocr_strategy`              | OCR strategy in [`DecoderSettings.py`](archive_agent/config/DecoderSettings.py)                   |\n| `ocr_auto_threshold`        | Minimum number of characters for `auto` OCR strategy to resolve to `relaxed` instead of `strict`  |\n| `image_ocr`                 | Image handling: `true` enables OCR, `false` disables it.                                          |\n| `image_entity_extract`      | Image handling: `true` enables entity extraction, `false` disables it.                            
|\n| `chunk_lines_block`         | Number of lines per block for chunking                                                            |\n| `chunk_words_target`        | Target number of words per chunk                                                                  |\n| `qdrant_server_url`         | URL of the Qdrant server                                                                          |\n| `qdrant_collection`         | Name of the Qdrant collection                                                                     |\n| `retrieve_score_min`        | Minimum similarity score of retrieved chunks (`0`...`1`)                                          |\n| `retrieve_chunks_max`       | Maximum number of retrieved chunks                                                                |\n| `retrieve_knee_enable`      | Adaptive cutoff for retrieval (`true` enables knee-based cutoff, `false` disables it)             |\n| `retrieve_knee_sensitivity` | Knee detection sensitivity (Kneedle `S` parameter; higher = more conservative)                    |\n| `retrieve_knee_min_chunks`  | Minimum number of chunks to keep when adaptive cutoff is applied                                  |\n| `rerank_chunks_max`         | Number of top chunks to keep after reranking                                                      |\n| `expand_chunks_radius`      | Number of preceding and following chunks to prepend and append to each reranked chunk             |\n| `max_workers_ingest`        | Maximum number of files to process in parallel, creating one thread for each file                 |\n| `max_workers_vision`        | Maximum number of parallel vision requests **per file**, creating one thread per request          |\n| `max_workers_embed`         | Maximum number of parallel embedding requests **per file**, creating one thread per request       |\n| `ai_provider`               | AI provider in [`ai_provider_registry.py`](archive_agent/ai_provider/ai_provider_registry.py)     |\n| 
`ai_server_url`             | AI server URL                                                                                     |\n| `ai_model_chunk`            | AI model used for chunking                                                                        |\n| `ai_model_embed`            | AI model used for embedding                                                                       |\n| `ai_model_rerank`           | AI model used for reranking                                                                       |\n| `ai_model_query`            | AI model used for queries                                                                         |\n| `ai_model_vision`           | AI model used for vision (`\"\"` disables vision)                                                   |\n| `ai_vector_size`            | Vector size of embeddings (used for Qdrant collection)                                            |\n| `ai_temperature_query`      | Temperature of the query model (ignored for GPT-5)                                                |\n\n📌 **Note:** When using GPT-5 (default as of **Archive Agent** v14.0.0), `ai_temperature_query` is ignored.\nGPT-5 reasoning effort and verbosity are currently not available in the configuration,\nbut may be customized directly inside `OpenAiProvider.py`.\n\n📌 **Note:** Since `max_workers_vision` and `max_workers_embed` requests are processed in parallel **per file**,\nand `max_workers_ingest` files are processed in parallel, the total number of requests multiplies quickly.\nAdjust according to your system resources and in alignment with your AI provider's rate limits.\n\n### Watchlist\n\nThe profile watchlist is contained in the profile folder as `watchlist.json`.\n\nThe watchlist is managed by these commands only:\n\n- `include` / `exclude` / `remove`\n- `track` / `commit` / `update`\n\n### AI cache\n\nEach profile folder also contains an `ai_cache` folder.\n\nThe AI cache ensures that, in a given profile:\n- The same 
image is only OCR-ed once.\n- The same text is only chunked once.\n- The same text is only embedded once.\n- The same combination of chunks is only reranked once.\n\nThis way, **Archive Agent** can quickly resume where it left off if a commit was interrupted.\n\nTo bypass the AI cache for a single commit, pass the `--nocache` option to the `commit` or `update` command\n(see [Commit changed files to database](#commit-changed-files-to-database) and [Combined track and commit](#combined-track-and-commit)).\n\n💡 **Good to know:** Queries are never cached, so you always get a fresh answer. \n\n📌 **Note:** To clear the entire AI cache, simply delete the profile's cache folder.\n\n📌 **Technical Note:** **Archive Agent** keys the cache using a composite hash made from the text/image bytes and the AI model names for chunking, embedding, reranking, and vision.\nCache keys are deterministic and change whenever you change the *chunking*, *embedding* or *vision* AI model names.\nSince cache entries are retained forever, switching back to a prior combination of AI model names will again access the \"old\" keys.  \n\n---\n\n## Qdrant database\n\nThe [Qdrant](https://qdrant.tech/) database is stored in `~/.archive-agent-qdrant-storage/`.\n\n📌 **Note:** This folder is created by the Qdrant Docker image running as root.\n\n💡 **Good to know:** Visit your [Qdrant dashboard](http://localhost:6333/dashboard#/collections) to manage collections and snapshots.\n\n---\n\n## Developer's guide\n\n**Archive Agent** was written from scratch for educational purposes (on either end of the software).\n\n💡 **Good to know:** Tracking the `test_data/` folder gets you started with *some* kind of test data. 
\n\n### Important modules\n\nTo get started, check out these epic modules:\n\n- Files are processed in [`archive_agent/data/FileData.py`](archive_agent/data/FileData.py)\n- The app context is initialized in [`archive_agent/core/ContextManager.py`](archive_agent/core/ContextManager.py)\n- The default config is defined in [`archive_agent/config/ConfigManager.py`](archive_agent/config/ConfigManager.py)  \n- The CLI commands are defined in [`archive_agent/__main__.py`](archive_agent/__main__.py)\n- The commit logic is implemented in [`archive_agent/core/CommitManager.py`](archive_agent/core/CommitManager.py)\n- The CLI verbosity is handled in [`archive_agent/core/CliManager.py`](archive_agent/core/CliManager.py)\n- The GUI is implemented in [`archive_agent/core/GuiManager.py`](archive_agent/core/GuiManager.py)\n- The AI API prompts for chunking, embedding, vision, and querying are defined in [`archive_agent/ai/AiManager.py`](archive_agent/ai/AiManager.py) \n- The AI provider registry is located in [`archive_agent/ai_provider/ai_provider_registry.py`](archive_agent/ai_provider/ai_provider_registry.py)\n\nIf you miss something or spot bad patterns, feel free to contribute and refactor!\n\n### Network and Retry Handling\n\nArchive Agent implements comprehensive retry logic with exponential backoff and intelligent failure detection:\n\n- **AI Provider Operations**: 10 retries with exponential backoff (max 60s delay) for network timeouts and API errors\n- **Database Operations**: 10 retries with exponential backoff (max 10s delay) for Qdrant connection issues\n- **Schema Validation**: 10 retry attempts for AI response parsing failures with cache invalidation on each failure\n- **MAX_TOKENS Detection**: Instant skip when model hits token limit (no wasted retries on guaranteed truncation)\n- **Block-Level Resilience**: Failed chunking blocks are skipped with CRITICAL logs, file processing continues\n- **Dual-Layer Strategy**: Network-level retries handle infrastructure 
failures, schema-level retries handle AI response quality\n\nAll parallel processors (chunking, vision, embedding) properly propagate fatal errors instead of silently continuing with corrupted data.\n\n### Code testing and analysis\n\nTo run unit tests, check types, and check style, run this:\n\n```bash\n./audit.sh\n```\n\n### Run Qdrant with in-memory storage\n\nTo run Qdrant with in-memory storage (e.g., in OpenAI Codex environment where Docker is not available),\nexport this environment variable before running `install.sh` and `archive-agent`:\n\n```bash\nexport ARCHIVE_AGENT_QDRANT_IN_MEMORY=1\n``` \n\n- The environment variable is checked by `install.sh` to skip `manage-qdrant.sh`.\n- The environment variable is checked by `QdrantManager.py` to ignore server URL and use in-memory storage instead.\n\n📌 **Note:** Qdrant in-memory storage is volatile (not persisted to disk).\n\n---\n\n## Tools\n\n### Rename file paths in chunk metadata\n\nTo bulk-rename file paths in chunk metadata in the currently active Qdrant collection, run this:\n\n```bash\ncd tools/\n\n./qdrant-rename-paths-in-chunk-metadata.py\n# OR\nuv run python qdrant-rename-paths-in-chunk-metadata.py\n```\n\nUseful after moving files or renaming folders when you don't want to run the `update` command again.\n\n📌 **Note:**\n- This tool modifies the Qdrant database directly — **ensure you have backups if working with critical data**.\n- This tool will **not** update the tracked files. 
You need to update your watchlist (see [Archive Agent settings](#archive-agent-settings)) using manual search and replace.\n\n### Remove file paths from context headers\n\n**Archive Agent** \u003c v11.0.0 included file paths in the chunk context headers; this was a bad design decision that led to skewed retrieval.\n\nTo bulk-remove all file paths in context headers in the currently active Qdrant collection, run this:\n\n```bash\ncd tools/\n\n./qdrant-remove-paths-from-chunk-headers.py\n# OR\nuv run python qdrant-remove-paths-from-chunk-headers.py\n```\n\n📌 **Note:**\n- This tool modifies the Qdrant database directly — **ensure you have backups if working with critical data**.\n\n---\n\n## Known issues\n\n- [ ] While `track` initially reports a file as *added*, subsequent `track` calls report it as *changed*. \n\n\n- [ ] Removing and restoring a tracked file in the tracking phase is currently not handled properly:\n  - Removing a tracked file sets `{size=0, mtime=0, diff=removed}`.\n  - Restoring a tracked file sets `{size=X, mtime=Y, diff=added}`.\n  - Because `size` and `mtime` were cleared, we lost the information to detect a restored file.\n\n\n- [ ] Unprocessable files are tracked in the watchlist (and attempted to be deleted from the Qdrant database when untracked again)\n\n\n- [ ] Unprocessable files are shown in the final statistics as being \"updated in Qdrant database\"\n\n\n- [ ] AI vision is employed on empty images as well, even though they could be easily detected locally and skipped. \n\n\n- [ ] PDF vector images may not convert as expected, due to missing tests. (Using `strict` OCR strategy would certainly help in the meantime.) \n\n\n- [ ] Binary document page numbers (e.g., `.docx`) are not supported yet; Microsoft Word document support is experimental.\n\n\n- [ ] References are only *approximate* due to paragraph/sentence splitting/joining in the chunking process.\n\n\n- [ ] AI cache does not handle `AiResult` schema migration yet. 
(If you encounter errors, passing the `--nocache` flag or deleting all AI cache folders would certainly help in the meantime.)\n\n\n- [ ] Rejected images (e.g., due to OpenAI content filter policy violation) from PDF pages in `strict` OCR mode are currently left empty instead of resorting to text extracted from PDF OCR layer (if any).\n\n\n- [ ] The spaCy model `en_core_web_md` used for sentence splitting is only suitable for English source text. Multilingual support is missing at the moment.\n\n\n- [ ] HTML document images are not supported.\n\n\n- [ ] The behavior of handling unprocessable files is not customizable yet. Should the user be prompted? Should the entire file be rejected? **Unprocessable images are currently tolerated and replaced by `[Unprocessable image]`.** \n\n\n- [ ] GPT-5 reasoning effort and verbosity are currently not available in the configuration.\n\n\n- [ ] GUI sometimes doesn't react on first button click, needs a second one. (Should migrate to use NiceGUI instead of Streamlit.)\n\n- [ ] Test coverage gaps (core retrieval path): `archive_agent/db/QdrantManager.py` ~54% coverage; `archive_agent/ai/query/AiQuery.py` ~57% coverage; `archive_agent/core/CliManager.py` ~37% coverage. Missing areas include reference repair paths, CLI formatting branches, and edge-case retrieval flows.\n\n---\n\n## Licensed under GNU GPL v3.0\n\nCopyright © 2025 Dr.-Ing. Paul Wilhelm \u003c[paul@wilhelm.dev](mailto:paul@wilhelm.dev)\u003e\n\n```\nThis program is free software: you can redistribute it and/or modify\nit under the terms of the GNU General Public License as published by\nthe Free Software Foundation, either version 3 of the License, or\n(at your option) any later version.\n```\n\nSee [LICENSE](LICENSE) for details.\n\n---\n\n## Collaborators welcome\nYou are invited to contribute to this open source project! 
Feel free to [file issues](https://github.com/shredEngineer/Archive-Agent/issues) and [submit pull requests](https://github.com/shredEngineer/Archive-Agent/pulls) anytime.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshredengineer%2Farchive-agent","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fshredengineer%2Farchive-agent","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fshredengineer%2Farchive-agent/lists"}