{"id":43671846,"url":"https://github.com/gsmatheus/pdf-image-extractor","last_synced_at":"2026-02-05T00:12:12.504Z","repository":{"id":196112674,"uuid":"694372608","full_name":"gsmatheus/pdf-image-extractor","owner":"gsmatheus","description":"The PDF Image Extractor is a Python script designed to process PDF files, specifically extracting and saving images embedded within the pages of the document. Besides the image extraction, it also prints out the textual content of the pages.","archived":false,"fork":false,"pushed_at":"2024-08-20T19:15:53.000Z","size":7,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-08-20T21:26:43.892Z","etag":null,"topics":["extractor","image-processing","pdf","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/gsmatheus.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-09-20T21:45:42.000Z","updated_at":"2024-08-20T19:15:57.000Z","dependencies_parsed_at":"2023-09-21T11:43:01.260Z","dependency_job_id":null,"html_url":"https://github.com/gsmatheus/pdf-image-extractor","commit_stats":null,"previous_names":["gsmatheus/pdf-image-extractor"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/gsmatheus/pdf-image-extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsmatheus%2Fpdf-image-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsmatheus%2Fpdf-image-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsmatheus%2Fp
df-image-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsmatheus%2Fpdf-image-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/gsmatheus","download_url":"https://codeload.github.com/gsmatheus/pdf-image-extractor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/gsmatheus%2Fpdf-image-extractor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29102166,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-04T22:44:52.815Z","status":"ssl_error","status_checked_at":"2026-02-04T22:44:16.428Z","response_time":62,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extractor","image-processing","pdf","python"],"created_at":"2026-02-05T00:12:11.798Z","updated_at":"2026-02-05T00:12:12.499Z","avatar_url":"https://github.com/gsmatheus.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDF Image Extractor\n\n## Overview\n\nThe PDF Image Extractor is a Python script designed to process PDF files, specifically extracting and saving images embedded within the pages of the document. Besides the image extraction, it also prints out the textual content of the pages. 
This tool can be particularly useful when handling digital catalogs or any PDFs with important embedded images.\n\n## Features\n\n- **Image Extraction**: Efficiently extracts images from any page within a provided PDF.\n- **Image Resizing**: Automatically resizes the extracted images to 60% of their original size, ensuring consistent output and potentially reducing file size.\n- **Text Extraction**: For each processed page, the script also extracts and prints the textual content.\n- **Flexibility**: Designed with modularity in mind, making it easy to integrate, expand, or modify for various use cases.\n\n## Requirements\n\nTo run this script, you will need:\n\n- Python 3.x\n- pdfplumber\n- fitz (PyMuPDF)\n- PIL (Pillow)\n\nThese can be installed using `pip`:\n\n```\npip install pdfplumber pymupdf pillow\n```\n\nor, if using [uv](https://docs.astral.sh/uv/):\n\n```\nuv sync\nuv run main.py\n```\n\n## Usage\n\n1. Clone the repository or download the script.\n2. Ensure you have a folder named `images` (or another name of your choice, but remember to update the `OUTPUT_DIR` constant in the script accordingly) in the same directory as the script. This is where the extracted images will be saved.\n3. Update the `PDF_PATH` constant in the script to point to your target PDF file.\n4. Run the script:\n```\npython main.py input_file output_dir img_format img_quality\n```\n\nAfter execution, check the `images` folder for the extracted images.\n\nFor example, you can extract to PNG in the current folder via:\n\n```\npython main.py ./file.pdf ./ png 100\n```\n\n\n## Customization\n\n- **Changing Output Image Format**: By default, images are saved in the `.webp` format due to its efficiency. However, you can modify the `save_page_images` function to save in a different format, such as PNG or JPEG.\n- **Adjusting Resizing Ratio**: The `resize_image` function currently reduces the image size to 60% of the original. 
Adjust the resizing ratio as per your requirements by modifying the multiplier value.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgsmatheus%2Fpdf-image-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fgsmatheus%2Fpdf-image-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fgsmatheus%2Fpdf-image-extractor/lists"}