https://github.com/mantisfury/arkhammirror
Local-first AI-powered document intelligence platform for investigative journalism
- Host: GitHub
- URL: https://github.com/mantisfury/arkhammirror
- Owner: mantisfury
- License: MIT
- Created: 2025-11-25T01:15:38.000Z (2 months ago)
- Default Branch: main
- Last Pushed: 2025-11-28T13:15:14.000Z (2 months ago)
- Last Synced: 2025-11-28T13:34:08.193Z (2 months ago)
- Topics: computer-vision, data-visualization, edge-ai, embeddings, local-llm, offline-ai, open-source, osint, palantir-alternative, privacy-first, python, rag, sqlite
- Language: Python
- Homepage:
- Size: 4.61 MB
- Stars: 3
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- Funding: .github/FUNDING.yml
- License: LICENSE
- Roadmap: ROADMAP.md
README
# ArkhamMirror

> **Connect the dots without connecting to the cloud.**
### The insane part
I am not a developer. I can't read code. I can't write code. I honestly have no idea what I'm doing.
About a week ago I got angry that pretty much every investigative document tool forces journalists to upload sensitive leaks to the cloud, most of them cost money, and most of them kill your privacy.
So I opened free-tier Claude, Gemini, Qwen, GPT, and Grok tabs and said: "I don't want to pay for cloud services. I don't want to be rate limited. You are my dev team. Build me a 100% local version. MIT license. Oh yeah, one more thing. $0 budget."
A few days of pushing those tools as far as they would go, and this exists. It started as a personal project to use and keep for myself, but my "AI dev team" convinced me it needed to be shared with the world, so here we are.
If a complete non-coder can ship this in a week, imagine what you can do.
**ArkhamMirror** is a local-first, air-gapped investigation platform for journalists or anyone else looking for the truth in documents. It ingests complex documents (PDFs, images, handwriting), extracts text using hybrid OCR (PaddleOCR + Qwen-VL), and enables semantic search, anomaly detection, and "chat with your data" capabilities, all running 100% locally on your hardware.

[Watch the Deep Dive on YouTube](https://www.youtube.com/watch?v=HcjcKnEzPww)
> *Listen to the AI-generated Deep Dive (via NotebookLM) explaining how ArkhamMirror works and why it matters.*
## Features
* **Hybrid OCR Engine**: Automatically switches between fast CPU-based OCR (PaddleOCR) and smart GPU-based Vision LLMs (Qwen-VL) for complex layouts and handwriting.
* **Multi-Format Ingestion**: Supports **PDF, DOCX, TXT, EML, MSG, and Images**. Automatically converts all formats to standardized PDFs for processing.
* **Semantic Search**: Find documents based on meaning, not just keywords, using hybrid vector search (Dense + Sparse embeddings).
* **Entity Extraction (NER)**: Automatically identifies People, Organizations, and Locations, with noise filtering and deduplication.
* **Local-First Privacy**: Designed to run with local LLMs (via LM Studio) and local vector stores. No data leaves your machine.
* **Anomaly Detection**: Automatically flags suspicious language ("confidential", "shred", "off the books") and visual anomalies; a small sketch of the keyword side follows this list.
* **Resilient Pipeline**: Includes "Retry Missing Pages" functionality to recover from partial failures without re-processing entire documents.
* **Investigative Lens**: AI-powered analysis modes:
* **General Summary**: What is this document about?
* **Motive Detective**: What is the author trying to hide?
* **Timeline Analyst**: Extract chronological events.
* **Cluster Analysis**: Visualize how documents group together by topic.
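
The keyword half of anomaly detection is easy to picture. The snippet below is a minimal, hypothetical sketch of that idea, not ArkhamMirror's actual code: the watch-list and the `flag_suspicious_chunks` helper are invented for illustration, and the real pipeline also scores visual anomalies.

```python
# Illustrative only: a tiny keyword-based flagger in the spirit of the feature above.
SUSPICIOUS_TERMS = ["confidential", "shred", "off the books", "destroy after reading"]

def flag_suspicious_chunks(chunks):
    """Return the chunks that contain a watch-listed phrase, with what matched."""
    hits = []
    for idx, text in enumerate(chunks):
        lowered = text.lower()
        matched = [term for term in SUSPICIOUS_TERMS if term in lowered]
        if matched:
            hits.append({"chunk": idx, "terms": matched, "preview": text[:120]})
    return hits

if __name__ == "__main__":
    sample = [
        "Quarterly totals attached for review.",
        "Keep this off the books and shred the originals.",
    ]
    for hit in flag_suspicious_chunks(sample):
        print(hit)
```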
## Tech Stack
* **Frontend**: Streamlit
* **Backend**: Python, SQLAlchemy
* **Database**: PostgreSQL (Metadata), Qdrant (Vectors), Redis (Queue)
* **AI/ML**:
* **OCR**: PaddleOCR, Qwen-VL-Chat (via LM Studio)
* **NER**: spaCy (en_core_web_sm)
* **Embeddings**: BAAI/bge-large-en-v1.5
* **LLM**: Qwen-VL-Chat / Llama 3 (via LM Studio); a usage sketch for these local pieces follows this list.
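
All of these layers run on your own machine. As a rough usage sketch (not ArkhamMirror's internal code): spaCy handles entity extraction, sentence-transformers produces the dense vectors that would normally land in Qdrant, and LM Studio exposes an OpenAI-compatible endpoint on localhost. The model identifier passed to LM Studio is a placeholder; use whatever name your loaded model reports.

```python
import spacy
from sentence_transformers import SentenceTransformer
from openai import OpenAI

text = "Captain Silver met the shipping board in Rotterdam."

# NER: spaCy with the small English model listed above.
nlp = spacy.load("en_core_web_sm")
entities = [(ent.text, ent.label_) for ent in nlp(text).ents]

# Embeddings: the dense model named above; vectors would normally be stored in Qdrant.
embedder = SentenceTransformer("BAAI/bge-large-en-v1.5")
vector = embedder.encode(text)

# LLM: LM Studio serves an OpenAI-compatible API on localhost:1234.
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
reply = client.chat.completions.create(
    model="local-model",  # placeholder: use the identifier shown in LM Studio
    messages=[{"role": "user", "content": f"Summarize: {text}"}],
)

print(entities)
print(len(vector), "dimensions")
print(reply.choices[0].message.content)
```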
## Installation
**New to ArkhamMirror?** Check out our [User Guide for Journalists](docs/USER_GUIDE.md) - a step-by-step tutorial with screenshots and troubleshooting tips!
### Prerequisites
* **Docker Desktop** (for DB, Redis, Qdrant)
* **Python 3.10+**
* **LM Studio** (for local LLM inference) running Qwen-VL-Chat
### Quick Start
1. **Clone the Repository**
```bash
git clone https://github.com/YourUsername/ArkhamMirror.git
cd ArkhamMirror/arkham_mirror
```
2. **Start Infrastructure**
```bash
docker compose up -d
```
3. **Setup Python Environment**
```bash
python -m venv venv
.\venv\Scripts\activate # Windows
# source venv/bin/activate # Linux/Mac
# Option 1: Standard Installation (Recommended)
# Includes BGE-M3 (multilingual) and all features
pip install -r requirements-standard.txt
# Option 2: Minimal Installation
# Lightweight (English-only), saves ~2GB disk space
# pip install -r requirements-minimal.txt
```
**Note**: If you choose the Minimal installation, you must update `config.yaml` to use the `minilm-bm25` provider. See [EMBEDDING_PROVIDERS.md](docs/EMBEDDING_PROVIDERS.md) for details.
4. **Configure Environment**
Copy `.env.example` to `.env` (or create one):
```env
DATABASE_URL=postgresql://anom:anompass@localhost:5435/anomdb
QDRANT_URL=http://localhost:6343
REDIS_URL=redis://localhost:6380
LM_STUDIO_URL=http://localhost:1234/v1
```
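
If you want to sanity-check the stack before launching the UI, a short script like the one below can confirm that the three Docker services answer on the ports above. This is an optional convenience sketch, not part of the project; it assumes the `sqlalchemy`, `qdrant-client`, and `redis` Python packages are importable (install them with pip if they are not already pulled in by the requirements files).

```python
import os

import redis
from qdrant_client import QdrantClient
from sqlalchemy import create_engine, text

# Fall back to the defaults shown in the .env example above.
db_url = os.getenv("DATABASE_URL", "postgresql://anom:anompass@localhost:5435/anomdb")
qdrant_url = os.getenv("QDRANT_URL", "http://localhost:6343")
redis_url = os.getenv("REDIS_URL", "redis://localhost:6380")

# PostgreSQL: run a trivial query through SQLAlchemy.
with create_engine(db_url).connect() as conn:
    conn.execute(text("SELECT 1"))
print("PostgreSQL OK")

# Qdrant: list collections (an empty list is fine on a fresh install).
print("Qdrant OK:", QdrantClient(url=qdrant_url).get_collections())

# Redis: PING should return True.
print("Redis OK:", redis.from_url(redis_url).ping())
```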
5. **Run the Application**
```bash
streamlit run streamlit_app/Search.py
```
6. **Start Background Workers**
Click the **"Spawn Worker"** button in the Streamlit sidebar to launch workers.
## Tutorial Data
New to ArkhamMirror? We've included a "Phantom Shipping" tutorial case to help you get started.
1. **Generate Data**:
Run the generator script:
```bash
python scripts/generate_sample_data.py
```
This creates a set of realistic evidence files (PDF, DOCX, EML, Image) in `data/tutorial_case`.
2. **Ingest**:
Drag and drop these files into the **"Upload Files"** area in the Streamlit sidebar.
3. **Investigate**:
Search for "C-999" or "Captain Silver" to see how the system links information across different file types.
## Configuration
ArkhamMirror uses a `config.yaml` file for system settings. You can configure:
* **OCR Engine**: Choose between `paddle` (fast) or `qwen` (smart).
* **LLM Provider**: Connect to LM Studio, OpenAI, or local models.
* **Hardware**: Toggle GPU usage for OCR and Embeddings.
See `config.yaml` for all available options.
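
As a rough illustration of how such a file is consumed, the snippet below loads a YAML config and picks an OCR path from it. The key names (`ocr.engine`, `ocr.use_gpu`) are hypothetical placeholders for this sketch; check the shipped `config.yaml` for the real structure.

```python
import yaml

# Hypothetical keys for illustration; the real config.yaml defines its own layout.
with open("config.yaml", "r", encoding="utf-8") as fh:
    config = yaml.safe_load(fh)

engine = config.get("ocr", {}).get("engine", "paddle")   # "paddle" (fast) or "qwen" (smart)
use_gpu = config.get("ocr", {}).get("use_gpu", False)

if engine == "paddle":
    print(f"Using PaddleOCR (GPU={use_gpu})")
else:
    print(f"Routing pages to the Qwen-VL vision model via LM Studio (GPU={use_gpu})")
```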
## Support This Project
ArkhamMirror is **free and open source**, built to empower journalists, researchers, and investigators. Your support helps us:
* Cover GPU compute costs for AI/OCR processing
* Maintain and improve the platform
* Build new features requested by the community
* Create better documentation and tutorials
### Ways to Support
* **[GitHub Sponsors](https://github.com/sponsors/mantisfury)** - Zero fees, recurring or one-time support
* **[Ko-fi](https://ko-fi.com/arkhammirror)** - Quick one-time donations
**Every contribution matters!** Even $5 helps keep the servers running and the code flowing.
Thank you to our amazing sponsors! [View all sponsors](SPONSORS.md)
## Contributing
We welcome contributions! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines on how to report bugs, suggest features, and submit pull requests.
## License
This project is licensed under the MIT License.