{"id":30629535,"url":"https://github.com/itshivams/persona-driven-document-intelligence","last_synced_at":"2025-08-30T20:24:10.952Z","repository":{"id":307256028,"uuid":"1027102781","full_name":"itshivams/Persona-Driven-Document-Intelligence","owner":"itshivams","description":"Persona-Driven Document Intelligence – A lightweight, CPU-only system that intelligently extracts and ranks document sections based on user persona and task context.","archived":false,"fork":false,"pushed_at":"2025-07-27T10:22:41.000Z","size":18,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-08-29T23:52:53.995Z","etag":null,"topics":["adobe-hackathon","document-summarization","nlp","sentence-transformers"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/itshivams.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-27T10:17:22.000Z","updated_at":"2025-07-29T06:30:27.000Z","dependencies_parsed_at":"2025-07-30T09:49:47.773Z","dependency_job_id":"bb861597-7068-4dbf-91ac-d4b5f2ca8556","html_url":"https://github.com/itshivams/Persona-Driven-Document-Intelligence","commit_stats":null,"previous_names":["itshivams/persona-driven-document-intelligence"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/itshivams/Persona-Driven-Document-Intelligence","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itshivams%2FPersona-Driven-Document-Intelligence","tags_url":"https:/
/repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itshivams%2FPersona-Driven-Document-Intelligence/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itshivams%2FPersona-Driven-Document-Intelligence/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itshivams%2FPersona-Driven-Document-Intelligence/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/itshivams","download_url":"https://codeload.github.com/itshivams/Persona-Driven-Document-Intelligence/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/itshivams%2FPersona-Driven-Document-Intelligence/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272901406,"owners_count":25012304,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-30T02:00:09.474Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["adobe-hackathon","document-summarization","nlp","sentence-transformers"],"created_at":"2025-08-30T20:24:07.841Z","updated_at":"2025-08-30T20:24:10.943Z","avatar_url":"https://github.com/itshivams.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Persona‑Driven Document Intelligence\n\nThis repository implements a **generic, offline, CPU‑only** pipeline to extract, rank, and summarize the most relevant 
sections from a collection of PDFs, customized to any persona and their specific task.\n\n## Features\n\n* **Modular stages**: ingestion → chunking → embedding → ranking → summarization → output\n* **Semantic embeddings**: uses Sentence‑Transformers `all‑MiniLM‑L6‑v2` for lightweight, 384‑dim vectors\n* **Robust ranking**:\n\n  * **Cosine similarity** for semantic relevance\n  * **Static keyword boosts** (configurable per domain)\n  * **Dynamic corpus boosts** (auto‑learned top tokens)\n  * **Brevity bonus** for concise sections\n  * **Heading penalty** to de‑prioritize generic titles\n  * **Soft diversity** to balance coverage across documents\n* **Abstractive summaries**: integrates `t5‑small` for fluent, paragraph‑style `refined_text`\n* **High performance**: \u003c 1 GB image, CPU‑only, end‑to‑end \u003c 60 s for 3–5 PDFs (20 pages each)\n* **Domain‑agnostic**: easily swap static buckets and persona/task definitions via JSON; no code changes\n\n## Getting Started\n\n### Prerequisites\n\n* Docker (Engine ≥ 20.10)\n* Linux/macOS/Windows WSL2\n\n### Project Structure\n\n```\n├── Dockerfile\n├── requirements.txt\n├── README.md                # This file\n├── approach_explanation.md  # Detailed methodology\n├── src/\n│   ├── main.py              # Entry point\n│   ├── ingestion/pdf_loader.py\n│   ├── chunker/chunker.py\n│   ├── models/embedder.py\n│   ├── models/ranker.py\n│   ├── models/summarizer.py\n│   └── output/formatter.py\n└── sample_input/\n    ├── docs/                # PDF files\n    ├── persona.json         # Persona metadata\n    └── job.json             # Job‑to‑be‑done metadata\n```\n\n### Building the Docker Image\n\n```bash\ndocker build -t persona_doc_intel .\n```\n\n### Running the Pipeline\n\n1. 
**Prepare input**:\n\n   * Place your PDF files under `my_input/docs/`\n   * Define `my_input/persona.json`:\n\n     ```json\n     {\"persona\": \"Your Persona Title\"}\n     ```\n   * Define `my_input/job.json`:\n\n     ```json\n     {\"job_to_be_done\": \"Specific task description for the persona.\"}\n     ```\n\n2. **Run**:\n\n```bash\ndocker run --rm \\\n  -v \"$PWD/my_input:/input\" \\\n  -v \"$PWD/my_results:/output\" \\\n  persona_doc_intel \\\n  --input /input --output /output/results.json --top_k 10\n```\n\n3. **Output**:\n\n   * Check `my_results/results.json` for the final structured JSON:\n\n     ```json\n     {\n       \"metadata\": {...},\n       \"extracted_sections\": [...],\n       \"subsection_analysis\": [...]\n     }\n     ```\n\n## Customization\n\n* **Persona/Job**: edit `persona.json` and `job.json` to any role and task.\n* **Static buckets**: modify `STATIC_BUCKETS` in `src/models/ranker.py` to tune domain themes.\n* **Summary length**: tweak `max_len` and `min_len` in `src/models/summarizer.py`.\n\n## Performance \u0026 Limitations\n\n* Designed for **small to medium** PDF collections (3–10 docs, up to \\~100 pages total).\n* **Scalability**: embedding and summarization are batchable but CPU‑bound; expect linear time with document size.\n* **Robustness**: non‑PDF or corrupted files are skipped with a warning.\n\n---\n## Our Team\nWe are a cross-functional team of machine learning engineers, NLP researchers, full-stack developers, and software architects passionate about document intelligence. 
Our mission is to make complex document structures easily interpretable by building accurate, scalable, and user-friendly PDF outline extraction systems powered by AI.\n\n- [Shivam](https://github.com/itshivams)\n- [Ritik Gupta](https://github.com/ritikgupta06)\n- [Sanskar Soni](https://github.com/sunscar-sony)\n\n\n## GitHub Repository\nYou can find the complete source code for the project on GitHub:\n[GitHub Repository](https://github.com/itshivams/Persona-Driven-Document-Intelligence/)\n\n## Acknowledgment\nSpecial thanks to Adobe India for organizing this hackathon.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fitshivams%2Fpersona-driven-document-intelligence","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fitshivams%2Fpersona-driven-document-intelligence","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fitshivams%2Fpersona-driven-document-intelligence/lists"}