# PageIndex

[![CI](https://github.com/NP-compete/pageindex/actions/workflows/ci.yml/badge.svg)](https://github.com/NP-compete/pageindex/actions/workflows/ci.yml)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Code style: ruff](https://img.shields.io/badge/code%20style-ruff-000000.svg)](https://github.com/astral-sh/ruff)

**Vectorless, reasoning-based RAG using hierarchical document indexing with Vertex AI**

> *Why chunk and embed when you can reason and structure?*

PageIndex builds semantic tree structures from documents without embeddings or a vector database. Rather than chunking and embedding text, it uses LLM reasoning to extract each document's hierarchical structure, making navigation and retrieval more intuitive.

> **Note:** This is an independent implementation inspired by the [PageIndex framework](https://pageindex.ai/) by [VectifyAI](https://github.com/VectifyAI/PageIndex).
> While the original uses OpenAI, this implementation uses **Google Vertex AI (Gemini)** and adds features like batch processing, repository indexing, and CLI tooling.

<p align="center">
  <img src="https://img.shields.io/badge/PDF-Supported-green" alt="PDF">
  <img src="https://img.shields.io/badge/Markdown-Supported-green" alt="Markdown">
  <img src="https://img.shields.io/badge/DOCX-Via%20Docling-blue" alt="DOCX">
  <img src="https://img.shields.io/badge/Repository-Indexing-purple" alt="Repo">
</p>

## Features

- **PDF Processing** - Extracts table of contents, detects document structure, and builds hierarchical trees with page-level precision
- **Markdown Processing** - Parses header hierarchy into navigable tree structures
- **Batch Processing** - Process entire folders of documents concurrently
- **Repository Indexing** - Generate semantic summaries for codebases
- **Format Conversion** - Convert DOCX, PPTX, HTML, and images via [docling](https://github.com/DS4SD/docling)

## When NOT to Use PageIndex

PageIndex excels at structured, hierarchical documents but isn't the right tool for every use case:

| Use Case | Why PageIndex May Not Be Ideal | Better Alternative |
|----------|-------------------------------|-------------------|
| **Short documents** (< 10 pages) | Overhead of tree construction isn't worth it | Direct LLM context or simple chunking |
| **Unstructured content** (chat logs, social media) | No inherent hierarchy to extract | Vector search with semantic embeddings |
| **High-volume real-time queries** | LLM reasoning per query adds latency | Pre-computed vector indices |
| **Keyword/exact match search** | PageIndex focuses on semantic structure | Full-text search (Elasticsearch, etc.) |
| **Frequently updated documents** | Tree must be regenerated on each change | Incremental vector indexing |
| **Multi-document corpus search** | Designed for single-document navigation | Vector DB with cross-document retrieval |
| **Cost-sensitive applications** | Each indexing run uses LLM API calls | One-time embedding generation |

### PageIndex Shines When:

- Documents have **clear hierarchical structure** (reports, manuals, textbooks, legal docs)
- You need **explainable, traceable retrieval** with section/page references
- **Accuracy matters more than speed** (financial analysis, compliance, research)
- Documents are **long** (50+ pages) where vector chunking loses context
- You want **human-like navigation** through complex documents

## Installation

Clone the repository and install in editable mode:

```bash
git clone https://github.com/NP-compete/pageindex.git
cd pageindex
pip install -e .
```

With document conversion support:

```bash
pip install -e ".[docling]"
```

For development:

```bash
pip install -e ".[dev]"
```

## Quick Start

### CLI Usage

Process a PDF:

```bash
pageindex pdf document.pdf --project-id your-gcp-project
```

Process a Markdown file:

```bash
pageindex md document.md --project-id your-gcp-project
```

Process all documents in a folder:

```bash
pageindex folder ./docs --project-id your-gcp-project
```

Index a code repository:

```bash
pageindex repo ./my-project --project-id your-gcp-project
```

### Python API

```python
import asyncio

from pageindex import (
    PageIndexConfig,
    page_index,
    md_to_tree,
    process_folder_sync,
    index_repository_sync,
)

# Process a PDF
result = page_index(
    "document.pdf",
    project_id="your-gcp-project",
    model="gemini-1.5-flash",
)

# Process Markdown
config = PageIndexConfig(project_id="your-gcp-project")
result = asyncio.run(md_to_tree("document.md", config=config))

# Batch process a folder
result = process_folder_sync(
    "./docs",
    project_id="your-gcp-project",
    max_concurrent=5,
)

# Index a repository
result = index_repository_sync(
    "./my-project",
    project_id="your-gcp-project",
    add_summaries=True,
)
```

## Configuration

Set your Google Cloud project ID via environment variable:

```bash
export PAGEINDEX_PROJECT_ID=your-gcp-project
```

Or pass it directly to commands and functions.

### CLI Options

| Option | Description | Default |
|--------|-------------|---------|
| `--project-id`, `-p` | Google Cloud project ID | `PAGEINDEX_PROJECT_ID` env |
| `--location`, `-l` | Vertex AI location | `us-central1` |
| `--model`, `-m` | Gemini model | `gemini-1.5-flash` |
| `--output`, `-o` | Output file/directory | `./results/` |
| `--add-summary/--no-summary` | Generate node summaries | varies by command |
| `--add-text/--no-text` | Include full text in nodes | `--no-text` |
| `--add-node-id/--no-node-id` | Add hierarchical node IDs | `--add-node-id` |

### PDF-Specific Options

| Option | Description | Default |
|--------|-------------|---------|
| `--toc-check-pages` | Pages to scan for TOC | `20` |
| `--max-pages-per-node` | Max pages before splitting | `10` |
| `--max-tokens-per-node` | Max tokens before splitting | `20000` |

### Folder Processing Options

| Option | Description | Default |
|--------|-------------|---------|
| `--max-concurrent`, `-c` | Concurrent processing tasks | `5` |
| `--convert/--no-convert` | Convert unsupported formats | `--convert` |
| `--docling-serve-url` | Remote docling-serve API URL | None |
| `--docling-serve-timeout` | API timeout (seconds) | `300` |

### Repository Indexing Options

| Option | Description | Default |
|--------|-------------|---------|
| `--summaries/--no-summaries` | Generate directory summaries | `--summaries` |
| `--include`, `-i` | File patterns to include | See defaults |
| `--exclude`, `-e` | Patterns to exclude | See defaults |
| `--max-depth` | Tree display depth | `4` |

## Output Format

PageIndex outputs JSON with a hierarchical structure:

```json
{
  "doc_name": "example",
  "doc_description": "A technical guide covering...",
  "structure": [
    {
      "title": "Introduction",
      "node_id": "0001",
      "summary": "Overview of the document...",
      "start_index": 1,
      "end_index": 5,
      "nodes": [
        {
          "title": "Background",
          "node_id": "0001.0001",
          "summary": "Historical context...",
          "start_index": 2,
          "end_index": 4
        }
      ]
    }
  ]
}
```
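To give a feel for consuming this output, here is a minimal sketch that flattens the tree above into an indented outline with page ranges. The field names follow the example; `outline` is an illustrative helper, not part of the PageIndex API, and the exact schema can vary with options such as `--add-text`:

```python
import json

# The example document tree from the "Output Format" section.
doc = json.loads("""
{
  "doc_name": "example",
  "structure": [
    {"title": "Introduction", "node_id": "0001",
     "start_index": 1, "end_index": 5,
     "nodes": [
       {"title": "Background", "node_id": "0001.0001",
        "start_index": 2, "end_index": 4}
     ]}
  ]
}
""")

def outline(nodes, depth=0):
    """Flatten nested nodes into indented 'title [node_id] (pages)' lines."""
    lines = []
    for node in nodes:
        lines.append(
            "  " * depth
            + f"{node['title']} [{node['node_id']}] "
            + f"(pp. {node['start_index']}-{node['end_index']})"
        )
        # Child sections live in the optional "nodes" list.
        lines.extend(outline(node.get("nodes", []), depth + 1))
    return lines

print("\n".join(outline(doc["structure"])))
```

Because every node carries `start_index`/`end_index`, a retrieval step can hand an LLM just the page span of the section it reasoned its way to.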
## Document Conversion

PageIndex supports converting various formats to Markdown using docling:

**Supported formats:** DOCX, PPTX, XLSX, HTML, PNG, JPG, TIFF, BMP

### Using docling-serve (recommended for production)

```bash
# Start docling-serve
docker run -p 5001:5001 quay.io/docling-project/docling-serve

# Process with remote conversion
pageindex folder ./docs --docling-serve-url http://localhost:5001
```

### Using local docling

```bash
pip install -e ".[docling]"
pageindex folder ./docs --convert
```

## How It Works

### PDF Processing Pipeline

1. **TOC Detection** - Scans initial pages for table of contents
2. **Structure Extraction** - Uses LLM to extract hierarchical structure from TOC or content
3. **Page Mapping** - Maps logical sections to physical page numbers
4. **Verification** - Validates extracted structure against actual content
5. **Large Node Splitting** - Recursively splits oversized sections
6. **Summary Generation** - Optionally generates summaries for each node

### Markdown Processing

1. **Header Extraction** - Parses markdown headers (H1-H6)
2. **Tree Building** - Constructs hierarchy based on header levels
3. **Tree Thinning** - Optionally merges small nodes
4. **Summary Generation** - Optionally summarizes each section
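Steps 1-2 above can be sketched in a few lines of plain Python. This is a stand-alone illustration, not the library's internal code (`build_tree` and its node dicts are hypothetical names), and it deliberately ignores edge cases like `#` lines inside fenced code blocks:

```python
import re

# ATX headers: 1-6 '#' characters followed by whitespace and a title.
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")

def build_tree(markdown: str) -> list:
    """Parse H1-H6 headers and nest nodes by header level."""
    root = {"level": 0, "nodes": []}
    stack = [root]  # open ancestors, shallowest first
    for line in markdown.splitlines():
        m = HEADER.match(line)
        if not m:
            continue
        level, title = len(m.group(1)), m.group(2).strip()
        # Pop back to the nearest ancestor shallower than this header.
        while stack[-1]["level"] >= level:
            stack.pop()
        node = {"title": title, "level": level, "nodes": []}
        stack[-1]["nodes"].append(node)
        stack.append(node)
    return root["nodes"]

tree = build_tree("# Intro\n## Background\n## Scope\n# Methods\n")
```

Here `tree` nests `Background` and `Scope` under `Intro`, with `Methods` as a second top-level node; thinning and summarization (steps 3-4) then operate on this structure.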
### Repository Indexing

1. **Directory Scanning** - Walks repository respecting include/exclude patterns
2. **Context Building** - Reads README files and key entry points
3. **Summary Generation** - Uses LLM to summarize each directory's purpose
4. **Tree Construction** - Builds navigable directory tree with metadata

## Requirements

- Python 3.10+
- Google Cloud project with Vertex AI API enabled
- Authentication via `gcloud auth application-default login` or a service account

## Related Projects

- **[PageIndex by VectifyAI](https://github.com/VectifyAI/PageIndex)** - The original PageIndex framework for vectorless, reasoning-based RAG using OpenAI
- **[PageIndex.ai](https://pageindex.ai/)** - Commercial platform for human-like document AI by VectifyAI

## License

MIT - see [LICENSE](LICENSE) for details.

## Contributing

Contributions are welcome! Please see [CONTRIBUTING.md](CONTRIBUTING.md) for guidelines.

## Acknowledgments

This project is inspired by the [PageIndex framework](https://pageindex.ai/) developed by [VectifyAI](https://github.com/VectifyAI). Their research on vectorless, reasoning-based RAG demonstrates that **similarity ≠ relevance**: true document retrieval requires reasoning, not just embedding similarity.

## Author

Soham Dutta ([@NP-compete](https://github.com/NP-compete))