
https://github.com/quantumcoderrr/adobe-india-hackathon25



# 🚀 Adobe India Hackathon 2025 – "Connecting the Dots"

## 🔍 Rethink Reading. Rediscover Knowledge.

Imagine a world where PDFs aren't just passive documents but intelligent, interactive companions that **understand structure**, **connect ideas**, and **respond meaningfully**. That's the mission of Adobe's *Connecting the Dots* challenge, and this repository is our response to it.

---

## 📌 Problem Statement

In an era where we're flooded with digital documents, the real power lies not in reading more but in reading smarter. Adobe's challenge asked us to:
- ✅ Extract intelligent outlines from PDFs (**Challenge 1A**)
- ✅ Identify section-specific content based on user personas (**Challenge 1B**)
- 🧠 Do it all with lightweight models, on-device, and with high accuracy
- 📦 Wrap everything in reproducible, portable Docker containers

---

## 🧠 Solutions Overview

### 🔹 Challenge 1A – Structured PDF Outline Extraction

**Objective**: Build a Python script that processes a directory of PDFs and returns JSON-formatted outlines capturing headings, structure, and page numbers.

- 🛠 Built with `PyMuPDF` (fitz) for PDF parsing
- 📂 Input/output via CLI arguments
- 🐳 Packaged in a Docker container for seamless execution
- 📄 Outputs: Individual `.json` files per PDF with structural metadata
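
One plausible way to turn PyMuPDF text spans into an outline is a font-size heuristic: treat the most common size as body text and map the larger sizes to heading levels. The sketch below illustrates that idea only; it is not the repository's actual `process_pdfs.py`. The function names `iter_spans` and `build_outline`, the `H1`..`H3` labels, and the output keys are all assumptions made for illustration.

```python
from collections import Counter


def iter_spans(pdf_path):
    """Yield (page_number, font_size, text) for every non-empty text span."""
    import fitz  # PyMuPDF; imported lazily so build_outline works without it

    doc = fitz.open(pdf_path)
    try:
        for page in doc:
            for block in page.get_text("dict")["blocks"]:
                for line in block.get("lines", []):  # image blocks have no lines
                    for span in line["spans"]:
                        text = span["text"].strip()
                        if text:
                            yield page.number + 1, round(span["size"], 1), text
    finally:
        doc.close()


def build_outline(spans, max_levels=3):
    """Label spans set in the largest non-body font sizes as H1..Hn (assumed scheme)."""
    spans = list(spans)
    if not spans:
        return []
    # The size that covers the most characters is assumed to be body text.
    weights = Counter()
    for _, size, text in spans:
        weights[size] += len(text)
    body_size = weights.most_common(1)[0][0]
    heading_sizes = sorted((s for s in weights if s > body_size), reverse=True)
    level_of = {s: f"H{i}" for i, s in enumerate(heading_sizes[:max_levels], start=1)}
    return [
        {"level": level_of[size], "text": text, "page": page}
        for page, size, text in spans
        if size in level_of
    ]
```

Calling `build_outline(iter_spans("some.pdf"))` would give a list that can be dumped straight to JSON. Real documents often need extra signals (bold flags, numbering patterns, position on the page), so font size alone should be read as a starting point.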

📁 Folder: [`Challenge_1a`](./Challenge_1a)

➡️ Includes:
- `process_pdfs.py`
- `Dockerfile`
- `sample_dataset/` (PDFs)
- `output/` (Generated JSON files)
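
To make the "one JSON per PDF" flow concrete, here is a minimal sketch of a directory-processing CLI. The `--input` and `--output` flag names and the `extract_outline` placeholder are hypothetical and may not match the real script; the `/app/input` and `/app/output` defaults simply mirror the Docker mount points used in this README.

```python
import argparse
import json
from pathlib import Path


def extract_outline(pdf_path):
    """Placeholder extractor (hypothetical); a real one would parse the PDF."""
    return {"title": pdf_path.stem, "outline": []}


def process_directory(input_dir, output_dir, extractor=extract_outline):
    """Write one <name>.json per <name>.pdf and return the written paths."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        out_path = output_dir / f"{pdf_path.stem}.json"
        out_path.write_text(json.dumps(extractor(pdf_path), indent=2), encoding="utf-8")
        written.append(out_path)
    return written


def main(argv=None):
    """Parse the (assumed) CLI flags and process every PDF in the input folder."""
    parser = argparse.ArgumentParser(description="Extract outlines from a folder of PDFs.")
    parser.add_argument("--input", default="/app/input")
    parser.add_argument("--output", default="/app/output")
    args = parser.parse_args(argv)
    for path in process_directory(args.input, args.output):
        print(f"wrote {path}")
```

From a script entry point you would call `main()`; passing an explicit list such as `main(["--input", "pdfs", "--output", "out"])` makes the function easy to exercise in tests.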

---

### 🔹 Challenge 1B – Persona-Driven Section Extraction

**Objective**: For a given set of PDFs and a `challenge1b_input.json`, extract and rank the top 5 most relevant sections based on a specified user persona.

- 🤖 Used `sentence-transformers` (MiniLM) for semantic embeddings
- 📊 Applied cosine similarity (via scikit-learn) for ranking sections
- 🧾 Output format aligned with provided sample files
- 🐳 Docker-ready, CPU-efficient, <1GB
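
The ranking step can be illustrated without any model: score each section vector against the persona/query vector and keep the top 5. The sketch below swaps in a pure-Python cosine similarity for the scikit-learn call and toy vectors for real MiniLM embeddings, so it shows the technique rather than the repository's code. The function names and the `rank`/`title`/`score` keys are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sections(query_vec, sections, top_k=5):
    """Rank (title, vector) pairs by similarity to query_vec, best first."""
    scored = [(cosine_similarity(query_vec, vec), title) for title, vec in sections]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"rank": rank, "title": title, "score": round(score, 4)}
        for rank, (score, title) in enumerate(scored[:top_k], start=1)
    ]


# Toy 2-D vectors standing in for sentence embeddings.
example = rank_sections(
    [1.0, 0.0],
    [("Intro", [1.0, 0.0]), ("Appendix", [0.0, 1.0]), ("Methods", [1.0, 1.0])],
    top_k=2,
)
```

In a real pipeline the vectors would come from encoding the persona description and each section's text with the same embedding model, so that the scores are comparable.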

📁 Folder: [`Challenge_1b`](./Challenge_1b)

➡️ Includes:
- `process_documents.py`
- `Dockerfile`
- Collections 1–3 with:
  - PDFs
  - Input prompts
  - Output JSONs (predicted sections)

---

## 🐳 Docker Instructions (For Judges)

Each challenge can be run independently in Docker.

### 🏗 Build Image

```bash
docker build --platform linux/amd64 -t adobe_round1a ./Challenge_1a
docker build --platform linux/amd64 -t adobe_round1b ./Challenge_1b
```

### ▶️ Run Container
```bash
# Challenge 1A (paths quoted so directories containing spaces still mount)
docker run --rm \
  -v "$(pwd)/Challenge_1a/input:/app/input" \
  -v "$(pwd)/Challenge_1a/output:/app/output" \
  --network none \
  adobe_round1a
```

```bash
# Challenge 1B
docker run --rm \
  -v "$(pwd)/Challenge_1b/input:/app/input" \
  -v "$(pwd)/Challenge_1b/output:/app/output" \
  --network none \
  adobe_round1b
```
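
If you prefer to launch the containers from a script, the same `docker run` invocation can be assembled as an argument list. This is a small convenience sketch, not part of the repository; the helper name `docker_run_command` is hypothetical, and it only reuses the flags and mount points shown above.

```python
from pathlib import Path


def docker_run_command(challenge_dir, image, root="."):
    """Build the docker run argument list for one challenge folder."""
    base = Path(root).resolve() / challenge_dir
    return [
        "docker", "run", "--rm",
        "-v", f"{base / 'input'}:/app/input",    # PDFs (and prompts) go in here
        "-v", f"{base / 'output'}:/app/output",  # generated JSON comes out here
        "--network", "none",                     # run fully offline
        image,
    ]


# To execute it, pass the list to subprocess:
#   import subprocess
#   subprocess.run(docker_run_command("Challenge_1a", "adobe_round1a"), check=True)
```

Passing a list rather than a shell string avoids quoting problems when the repository path contains spaces.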

## 🧑‍💻 Team

- **Sandip Ghosh**: [GitHub: @QuantumCoderrr](https://github.com/QuantumCoderrr)
- **Sandhita Poddar**: [GitHub: @CelestialCoderrr](https://github.com/CelestialCoderrr)

Together, we built something that doesn't just read PDFs; it *understands* them.