https://github.com/quantumcoderrr/adobe-india-hackathon25
- Host: GitHub
- URL: https://github.com/quantumcoderrr/adobe-india-hackathon25
- Owner: QuantumCoderrr
- Created: 2025-07-24T16:20:24.000Z (7 months ago)
- Default Branch: main
- Last Pushed: 2025-07-25T14:43:42.000Z (7 months ago)
- Last Synced: 2025-07-25T16:40:34.524Z (7 months ago)
- Language: Python
- Size: 24 MB
- Stars: 0
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# Adobe India Hackathon 2025: "Connecting the Dots"
## Rethink Reading. Rediscover Knowledge.
Imagine a world where PDFs aren't just passive documents but intelligent, interactive companions that **understand structure**, **connect ideas**, and **respond meaningfully**. That's the mission of Adobe's *Connecting the Dots* challenge, and this repository is our response to it.
---
## Problem Statement
In an era where we're flooded with digital documents, the real power lies not in reading more but in reading smarter. Adobe's challenge asked us to:
- Extract intelligent outlines from PDFs (**Challenge 1A**)
- Identify section-specific content based on user personas (**Challenge 1B**)
- Do it all with lightweight models, on-device, and with high accuracy
- Wrap everything in reproducible, portable Docker containers
---
## Solutions Overview
### Challenge 1A: Structured PDF Outline Extraction
**Objective**: Build a Python script that processes a directory of PDFs and returns JSON-formatted outlines capturing headings, structure, and page numbers.
- Built with `PyMuPDF` (fitz) for PDF parsing
- Input/output via CLI arguments
- Packaged in a Docker container for seamless execution
- Outputs: Individual `.json` files per PDF with structural metadata

Folder: [`Challenge_1a`](./Challenge_1a)

Includes:
- `process_pdfs.py`
- `Dockerfile`
- `sample_dataset/` (PDFs)
- `output/` (Generated JSON files)
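
As a rough illustration of the outline step, the sketch below treats the most common font size as body text and maps the larger sizes to heading levels. The heuristic, the function names (`extract_spans`, `build_outline`), and the `level`/`text`/`page` keys are assumptions for illustration and are not taken from `process_pdfs.py`; only `fitz.open` and `page.get_text("dict")` are PyMuPDF's own calls.

```python
import json
from collections import Counter


def extract_spans(pdf_path):
    """Collect text, font size, and page number for every text span in a PDF."""
    import fitz  # PyMuPDF; imported lazily so the pure logic below runs without it

    spans = []
    with fitz.open(pdf_path) as doc:
        for page_number, page in enumerate(doc, start=1):
            for block in page.get_text("dict")["blocks"]:
                # Image blocks carry no "lines" key, so fall back to an empty list.
                for line in block.get("lines", []):
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if text:
                            spans.append(
                                {"text": text, "size": round(span["size"], 1), "page": page_number}
                            )
    return spans


def build_outline(spans, max_levels=3):
    """Assumed heuristic: most common size is body text, larger sizes are headings."""
    if not spans:
        return []
    body_size = Counter(s["size"] for s in spans).most_common(1)[0][0]
    heading_sizes = sorted({s["size"] for s in spans if s["size"] > body_size}, reverse=True)
    level_by_size = {size: f"H{i}" for i, size in enumerate(heading_sizes[:max_levels], start=1)}
    return [
        {"level": level_by_size[s["size"]], "text": s["text"], "page": s["page"]}
        for s in spans
        if s["size"] in level_by_size
    ]


# Hand-written spans stand in for extract_spans("some.pdf") so the sketch runs anywhere.
sample = [
    {"text": "Introduction", "size": 18.0, "page": 1},
    {"text": "Body text one.", "size": 10.0, "page": 1},
    {"text": "Background", "size": 14.0, "page": 1},
    {"text": "Body text two.", "size": 10.0, "page": 2},
    {"text": "Body text three.", "size": 10.0, "page": 2},
]
outline = build_outline(sample)
print(json.dumps(outline, indent=2))
```

Font size alone is a weak signal (bold body text or large captions can mislead it), so a real extractor would likely combine it with position, font flags, or numbering patterns.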
---
### Challenge 1B: Persona-Driven Section Extraction
**Objective**: For a given set of PDFs and a `challenge1b_input.json`, extract and rank the top 5 most relevant sections based on a specified user persona.
- Used `sentence-transformers` (MiniLM) for semantic embeddings
- Applied cosine similarity (via scikit-learn) to rank sections
- Output format aligned with provided sample files
- Docker-ready, CPU-efficient, <1 GB

Folder: [`Challenge_1b`](./Challenge_1b)

Includes:
- `process_documents.py`
- `Dockerfile`
- Collections 1 to 3 with:
  - PDFs
  - Input prompts
  - Output JSONs (predicted sections)
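
The ranking step described above can be sketched as follows. This is a minimal illustration, not the code in `process_documents.py`: the function names, the `rank`/`section`/`score` keys, and the model name `all-MiniLM-L6-v2` are assumptions, and a plain-Python cosine is swapped in for scikit-learn's `cosine_similarity` so the ranking logic runs without extra dependencies.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sections(query_vec, sections, section_vecs, top_k=5):
    """Return the top_k sections by similarity to the query, with 1-based ranks."""
    scored = sorted(
        zip(sections, (cosine(query_vec, v) for v in section_vecs)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return [
        {"rank": rank, "section": section, "score": round(score, 4)}
        for rank, (section, score) in enumerate(scored[:top_k], start=1)
    ]


def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Embed texts with sentence-transformers (the model name is an assumption)."""
    from sentence_transformers import SentenceTransformer  # lazy: heavy dependency

    return SentenceTransformer(model_name).encode(texts).tolist()


# Toy 2-D vectors stand in for real embeddings so the ranking logic runs anywhere.
sections = ["Packing tips", "Local cuisine", "Tax law"]
vectors = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
ranked = rank_sections([1.0, 0.0], sections, vectors, top_k=2)
print(ranked)
```

With real data, the query vector would come from `embed` applied to the persona text and the section vectors from `embed` applied to the section texts. Because the containers run with `--network none`, the model weights would need to be available inside the image rather than downloaded at run time.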
---
## Docker Instructions (For Judges)
Each challenge can be run independently in Docker.
### Build Image
```bash
docker build --platform linux/amd64 -t adobe_round1a ./Challenge_1a
docker build --platform linux/amd64 -t adobe_round1b ./Challenge_1b
```
### Run Container
```bash
docker run --rm \
  -v "$(pwd)/Challenge_1a/input:/app/input" \
  -v "$(pwd)/Challenge_1a/output:/app/output" \
  --network none \
  adobe_round1a
```
```bash
docker run --rm \
  -v "$(pwd)/Challenge_1b/input:/app/input" \
  -v "$(pwd)/Challenge_1b/output:/app/output" \
  --network none \
  adobe_round1b
```
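
For orientation, here is one way an entrypoint script could connect the mounted directories to per-PDF JSON output. It is a sketch under assumptions: the `--input`/`--output` flag names, the helper names, and the placeholder `process_pdf` callback are hypothetical. The README only states that I/O goes through CLI arguments and that the containers mount `/app/input` and `/app/output`.

```python
import argparse
import json
import tempfile
from pathlib import Path


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Process every PDF in a directory.")
    # Defaults mirror the container mount points used in the docker run commands.
    parser.add_argument("--input", type=Path, default=Path("/app/input"))
    parser.add_argument("--output", type=Path, default=Path("/app/output"))
    return parser.parse_args(argv)


def process_directory(input_dir, output_dir, process_pdf):
    """Write one <name>.json per <name>.pdf and return the written paths."""
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        result = process_pdf(pdf_path)
        out_path = output_dir / f"{pdf_path.stem}.json"
        out_path.write_text(json.dumps(result, indent=2), encoding="utf-8")
        written.append(out_path)
    return written


# Demo in a temporary directory; the callback is a stand-in for real PDF processing.
with tempfile.TemporaryDirectory() as tmp:
    args = parse_args(["--input", f"{tmp}/input", "--output", f"{tmp}/output"])
    args.input.mkdir()
    (args.input / "sample.pdf").write_bytes(b"%PDF-1.4")  # placeholder, never parsed
    written = process_directory(args.input, args.output, lambda p: {"source": p.name})
    names = [p.name for p in written]
    payload = json.loads(written[0].read_text(encoding="utf-8"))
print(names, payload)
```

Defaulting the flags to the mount points means the container can run with no arguments, while the same script remains usable outside Docker by passing explicit paths.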
## Team
- **Sandip Ghosh**: [GitHub: @QuantumCoderrr](https://github.com/QuantumCoderrr)
- **Sandhita Poddar**: [GitHub: @CelestialCoderrr](https://github.com/CelestialCoderrr)

Together, we built something that doesn't just read PDFs; it *understands* them.