{"id":31227435,"url":"https://github.com/mugiwara555343/jsonify2ai","last_synced_at":"2026-03-01T02:02:08.930Z","repository":{"id":310554632,"uuid":"1037345333","full_name":"Mugiwara555343/jsonify2ai","owner":"Mugiwara555343","description":"Engineered an end-to-end pipeline: file ingestion → JSONL normalization → embeddings → Qdrant vector indexing → semantic search \u0026 LLM “ask” API","archived":false,"fork":false,"pushed_at":"2026-02-17T19:50:50.000Z","size":44487,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-02-18T00:48:07.763Z","etag":null,"topics":["ai","api","app","bash","containers","cpu","docker","docker-compose","env","go","gpu","json","makefile","powershell","python","qdrant","qdrant-vector-database","website","yml"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Mugiwara555343.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":"SECURITY.md","support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-08-13T12:38:54.000Z","updated_at":"2026-02-17T19:50:54.000Z","dependencies_parsed_at":"2025-08-18T21:31:57.668Z","dependency_job_id":"480e08c0-71c7-4b91-8e80-d26c29c73a2d","html_url":"https://github.com/Mugiwara555343/jsonify2ai","commit_stats":null,"previous_names":["mugiwara555343/jsonify2ai"],"tags_count":10,"template":false,"template_full_name":null,"purl":"pkg:github/Mugiwara555343/jsonify2ai","repository_url":"htt
ps://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mugiwara555343%2Fjsonify2ai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mugiwara555343%2Fjsonify2ai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mugiwara555343%2Fjsonify2ai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mugiwara555343%2Fjsonify2ai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Mugiwara555343","download_url":"https://codeload.github.com/Mugiwara555343/jsonify2ai/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Mugiwara555343%2Fjsonify2ai/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29958394,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-03-01T01:47:18.291Z","status":"online","status_checked_at":"2026-03-01T02:00:07.437Z","response_time":124,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai","api","app","bash","containers","cpu","docker","docker-compose","env","go","gpu","json","makefile","powershell","python","qdrant","qdrant-vector-database","website","yml"],"created_at":"2025-09-22T04:08:45.416Z","updated_at":"2026-03-01T02:02:08.900Z","avatar_url":"https://github.com/Mugiwara555343.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n  \u003cimg 
src=\"docs/jsonify2ai_logo.png\" alt=\"Jsonify2AI logo\" width=\"165\"/\u003e\n\u003c/p\u003e\n\n\u003ch3 align=\"center\"\u003eYour data. Your models. Your machine.\u003c/h3\u003e\n\n\u003cp align=\"center\"\u003e\n  A local-first RAG engine that transforms messy files into searchable, AI-synthesized intelligence, running entirely on your hardware with zero cloud dependency.\n\u003c/p\u003e\n\n---\n\n[![MIT License](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)\n![OS](https://img.shields.io/badge/tested-Linux%20%7C%20macOS%20%7C%20Windows-lightblue)\n[![CI](https://github.com/Mugiwara555343/jsonify2ai/actions/workflows/ci.yml/badge.svg)](https://github.com/Mugiwara555343/jsonify2ai/actions/workflows/ci.yml)\n[![GitHub](https://img.shields.io/badge/GitHub-Follow-black?logo=github)](https://github.com/Mugiwara555343)\n\n---\n\n## 🔒 Why This Exists: Data Sovereignty\n\nMost AI tools require sending your documents to a cloud API. Every query, every file, every conversation, routed through someone else's servers.\n\n**Jsonify2ai takes the opposite approach:**\n\n- **Nothing leaves your machine.** Every embedding, every search query, every LLM response is computed locally.\n- **No API keys required.** No OpenAI, no cloud credits, no usage caps. You own the entire pipeline.\n- **Full provenance.** Every chunk is SHA-256 hashed, timestamped, and traceable back to its source file.\n- **Deterministic \u0026 idempotent.** Re-ingesting the same file produces zero duplicates — guaranteed by UUID5 deterministic document IDs.\n\n---\n\n## ⚡ The 400K Milestone\n\nMost local RAG implementations break down after a few pages. 
Jsonify2ai was engineered to handle **massive local datasets** without precision loss:\n\n| Metric | Value |\n|--------|-------|\n| **Tested corpus size** | 400,000+ characters (~100K tokens) |\n| **Embedding model** | `nomic-embed-text` (768 dimensions) |\n| **Vector similarity** | Cosine distance via Qdrant |\n| **Chunk strategy** | 800-char sliding window, 100-char overlap, whitespace-aware cuts |\n| **Deduplication** | SHA-256 file hash → UUID5 deterministic IDs |\n| **Batch ingestion** | 64 embeddings / 128 upserts per batch |\n\nThe pipeline is designed so that ingestion never blocks retrieval. You can search and ask questions while new documents are still being indexed.\n\n---\n\n## 🧠 Architecture: Spine-and-Worker\n\nThe system is built as a distributed microservice stack, a **Go API spine** for stability and concurrency, with a **Python worker** for the heavy AI/RAG logic.\n\n| Service | Stack | Role |\n|---------|-------|------|\n| **API** | Go / Gin | Auth, rate limiting, CORS, reverse-proxy to Worker |\n| **Worker** | Python / FastAPI | Ingestion, chunking, embedding, semantic search, LLM synthesis |\n| **Qdrant** | Vector DB | 768-dim cosine similarity search with payload indexing |\n| **Web UI** | React / Vite / TypeScript | Dark-mode interface with drag-and-drop, ask panel, document drawer |\n| **Watcher** | Python daemon | Auto-ingests files dropped into `data/dropzone/` |\n\nAll five services are orchestrated via Docker Compose with health checks on every container.\n\n---\n\n## 🛠 Features\n\n### Ingestion\n\n- 📄 **Multi-format support** — TXT, MD, CSV, TSV, JSON, JSONL, HTML, PDF, DOCX\n- 🎤 **Audio transcription** — WAV, MP3, M4A, FLAC, OGG via Whisper *(optional module)*\n- 🖼️ **Image captioning** — BLIP-based caption extraction *(optional module)*\n- 💬 **ChatGPT export parsing** — Dedicated parser for conversation-aware chunking\n- 🗣️ **Transcript detection** — Auto-detects and structures dialogue formats\n- 📂 **Dropzone watcher** — Daemon 
auto-ingests new files with configurable polling interval\n\n### Search \u0026 Retrieval\n\n- 🔍 **Semantic search** — Embedding-based similarity search across your entire corpus\n- 🎯 **Scoped queries** — Filter by document, file path, content kind, or time range\n- 🤖 **LLM synthesis** — Ollama-backed \"Ask\" mode with multi-source cross-referencing\n- 🔧 **Model selection** — Choose any Ollama model for synthesis directly from the UI\n- 📊 **Configurable depth** — Adjustable retrieval depth (`k`) for precision vs. breadth\n\n### Data Management\n\n- 📋 **Document inventory** — Browse, inspect, and manage all indexed documents\n- 🗑️ **Deletion** — Remove documents from both chunk and image collections\n- 📦 **Export** — Download indexed data as JSONL or ZIP archives\n- 🔐 **Auth** — Bearer token authentication with `local` (open) and `strict` modes + rate limiting\n\n### Developer Experience\n\n- 🖥️ **Dark-mode-first UI** — Markdown-native rendering with theme toggle\n- 📡 **Full-stack health checks** — `/health/full` verifies the entire API → Worker → Qdrant chain\n- 📝 **Telemetry** — Structured JSONL logging with rotation and ingest activity tracking\n- 🧪 **Smoke tests** — End-to-end and pre-commit validation scripts\n\n---\n\n## ⚡ Quickstart\n\n### Prerequisites\n\n- [Docker](https://www.docker.com/) and Docker Compose\n- [Ollama](https://ollama.com/) running locally (for LLM synthesis)\n\n**Pull the embedding model:**\n\n```bash\nollama pull nomic-embed-text\n```\n\n### 1. Clone \u0026 Start\n\n```bash\ngit clone https://github.com/Mugiwara555343/jsonify2ai.git\ncd jsonify2ai\ndocker compose up --build -d\n```\n\n### 2. Access\n\n| Endpoint | URL |\n|----------|-----|\n| **Web UI** | [http://localhost:5173](http://localhost:5173) |\n| **API** | [http://localhost:8082](http://localhost:8082) |\n| **Qdrant Dashboard** | [http://localhost:6333/dashboard](http://localhost:6333/dashboard) |\n\n### 3. 
Verify\n\n```bash\n# Check full-stack health\ncurl http://localhost:8082/health/full\n\n# Expected: {\"ok\":true,\"api\":true,\"worker\":true}\n```\n\n### 4. Ingest Your First File\n\nDrop any supported file into `data/dropzone/` — the watcher daemon picks it up automatically. Or drag-and-drop directly in the Web UI.\n\n---\n\n## 🗺️ Roadmap\n\n\u003e Features that exist in code but are not yet polished, or are planned for future development.\n\n| Feature | Status |\n|---------|--------|\n| Audio transcription (Whisper STT) | ⚙️ Code complete, requires optional deps (`requirements.audio.txt`) |\n| Image captioning (BLIP) | ⚙️ Code complete, behind `IMAGES_CAPTION` flag |\n| Hybrid search (vector + keyword) | 🔬 Qdrant text indexes created, full hybrid ranking in progress |\n| Context depth slider in UI | 📐 Backend `k` parameter wired, UI slider planned |\n| Multi-collection federation | 📐 Separate chunk/image collections exist, unified query coming |\n| Streaming LLM responses | 📐 Planned |\n\n---\n\n## 📖 Philosophy\n\nThis project is the result of a **7-month solo intensive** at the intersection of local AI and data privacy. 
It represents a belief that consumer hardware, not cloud APIs, should be the default substrate for personal AI.\n\nBy treating LLMs as architectural partners rather than syntax generators, the focus has been on **system design, reliability, and scalability** over manual implementation.\n\nThe goal: prove that a single builder, using the right tools, can deploy production-grade AI infrastructure that rivals enterprise cloud solutions in accuracy, while keeping every byte on the user's machine.\n\n---\n\n## 📂 Documentation\n\n- **[API Reference](docs/API.md)** — Endpoints, request/response schemas, auth\n- **[Architecture Deep-Dive](docs/ARCHITECTURE.md)** — Service topology, data flow, design decisions\n- **[Data Model](docs/DATA_MODEL.md)** — Chunk schema, Qdrant collections, payload structure\n- **[Contracts](docs/contracts.md)** — API/Worker interface contracts\n- **[Golden Path](docs/golden_path.md)** — End-to-end verification runbook\n\n---\n\n## ⚖️ License\n\nMIT — Hack it, extend it, keep it local.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmugiwara555343%2Fjsonify2ai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmugiwara555343%2Fjsonify2ai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmugiwara555343%2Fjsonify2ai/lists"}