{"id":30739132,"url":"https://github.com/ap-dev-github/adobe-hackathon-round1a-winners","last_synced_at":"2025-09-03T22:45:30.615Z","repository":{"id":306913037,"uuid":"1027607255","full_name":"ap-dev-github/adobe-hackathon-round1a-Winners","owner":"ap-dev-github","description":"Submission for the Round 1A of Adobe Hackathon 2025","archived":false,"fork":false,"pushed_at":"2025-07-28T10:31:49.000Z","size":7,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-07-28T12:27:43.837Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ap-dev-github.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-07-28T09:07:44.000Z","updated_at":"2025-07-28T10:31:53.000Z","dependencies_parsed_at":"2025-07-28T12:27:47.329Z","dependency_job_id":"2fa67434-51d9-4e55-8950-fe16ee9d66a9","html_url":"https://github.com/ap-dev-github/adobe-hackathon-round1a-Winners","commit_stats":null,"previous_names":["ap-dev-github/adobe-hackathon-round1a-winners"],"tags_count":null,"template":false,"template_full_name":null,"purl":"pkg:github/ap-dev-github/adobe-hackathon-round1a-Winners","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ap-dev-github%2Fadobe-hackathon-round1a-Winners","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ap-dev-github%2Fadobe-hackathon-round1a-Winners/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/reposito
ries/ap-dev-github%2Fadobe-hackathon-round1a-Winners/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ap-dev-github%2Fadobe-hackathon-round1a-Winners/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ap-dev-github","download_url":"https://codeload.github.com/ap-dev-github/adobe-hackathon-round1a-Winners/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ap-dev-github%2Fadobe-hackathon-round1a-Winners/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":273523645,"owners_count":25120864,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-09-03T02:00:09.631Z","response_time":76,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-09-03T22:45:29.453Z","updated_at":"2025-09-03T22:45:30.607Z","avatar_url":"https://github.com/ap-dev-github.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"## Team Members\n### Ayush Pandey \n### Ayush Banerjee\n\n# PDF Outline Extractor \nA Dockerized Python tool that extracts structured outlines (H1 and H2 headings) from PDF files and saves them as JSON. 
This project is optimized for a minimal footprint using a multi-stage Docker build with Alpine Linux, resulting in a lean image of approximately **165 MB**.\n\n## How It Works\n\nThe Python script `pdf_outline_extractor.py` uses the `pdfplumber` library to perform the following steps:\n\n1. **Character Grouping**: It reads a PDF page and groups individual characters into lines based on their vertical alignment (`y` coordinate).\n2. **Font Size Analysis**: It analyzes the font sizes used throughout the document to identify the most common sizes.\n3. **Heading Identification**: It establishes a hierarchy of headings based on font size. The largest font size is designated as **H1**, and the next largest is treated as the threshold for **H2** headings.\n4. **JSON Output**: It processes all PDFs in an `input` directory and generates a corresponding JSON file for each in the `output` directory. The JSON file contains the document's title and a structured list of all identified H1 and H2 headings with their text, level, and page number.\n\n##  Docker Image Optimization: From Slim to Ultralight\n\nThe primary goal of the Docker configuration was to create the smallest possible image for efficient distribution and deployment. This was achieved by moving from a `python:3.9-slim` base to a `python:3.9-alpine` base with a multi-stage build.\n\n### The Challenge: Larger Image Size\n\nThe initial `Dockerfile` used `python:3.9-slim`. While smaller than the full Debian-based Python image, `slim` still includes many system libraries and tools not required for the script to simply *run*.\n\n### The Solution: Alpine and Multi-Stage Builds\n\nThe new `Dockerfile` leverages two key strategies for a massive size reduction:\n\n1. **`python:3.9-alpine` Base Image**: Alpine Linux is a minimal Linux distribution built around `musl libc` and `BusyBox`. Its base image is incredibly small (around 5-6 MB) compared to Debian-based images (`slim` is often ~50 MB+).\n\n2. 
**Multi-Stage Build**: This is the critical optimization.\n   - **Stage 1 (`builder`)**: This stage is a temporary environment used only to install dependencies. It installs build tools such as `gcc` and `musl-dev`, which are required to compile some Python packages (like those used by `pdfplumber`).\n   - **Stage 2 (Final Image)**: This is the final, clean image. Instead of keeping the build tools, we **only copy the installed Python packages** and our script (`pdf_outline_extractor.py`) from the `builder` stage.\n\nThe result is a final image that contains only the minimal Alpine OS, the Python runtime, and our installed packages. The build tools, temporary files, and package manager cache are all discarded, leading to the **~165 MB** final image size.\n\n## Usage\n\nFollow these steps to build the Docker image and run the extractor on your PDF files.\n\n### Prerequisites\n\n* [Docker](https://www.docker.com/get-started) must be installed and running.\n\n\n### 1. Project Setup\n\nClone the repository and set up the input/output directories.\n\n```bash\ngit clone git@github.com:ap-dev-github/adobe-hackathon-round1a-Winners.git\n```\n\nCreate directories for input and output:\n\n```bash\nmkdir input\nmkdir output\n```\nPlace all the PDF files you want to process inside the `input` directory.\n\n### 2. Build the Docker Image\n\nRun the following command from the root of the project directory to build the ultralight image.\n\n```bash\ndocker build -t pdf-extractor:ultralight .\n```\n\n### 3. Run the Container\n\nExecute the script by running the Docker container. 
This command mounts your local `input` and `output` folders into the container, runs the script, and then removes the container once it finishes. The example uses PowerShell syntax, where a trailing backtick continues the line.\n\n```powershell\ndocker run --rm `\n  -v \"${PWD}/input:/app/input\" `\n  -v \"${PWD}/output:/app/output\" `\n  --network none `\n  pdf-extractor:ultralight\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fap-dev-github%2Fadobe-hackathon-round1a-winners","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fap-dev-github%2Fadobe-hackathon-round1a-winners","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fap-dev-github%2Fadobe-hackathon-round1a-winners/lists"}