{"id":50393906,"url":"https://github.com/leomurillodev/document-image-extractor","last_synced_at":"2026-05-30T20:00:47.708Z","repository":{"id":336379897,"uuid":"1141387080","full_name":"LeoMurilloDev/document-image-extractor","owner":"LeoMurilloDev","description":"Herramienta CLI para la extracción de imágenes embebidas de archivos DOCX, PDF, PPTX y XLSX, eliminar duplicados, aplicar filtros configurables , exportando resultados en ZIP","archived":false,"fork":false,"pushed_at":"2026-05-30T18:24:50.000Z","size":3864,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-05-30T19:21:42.783Z","etag":null,"topics":["automation","cli","document-processing","docx","extractor","office","pdf","pymupdf","python","python-docx"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LeoMurilloDev.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-01-24T19:05:20.000Z","updated_at":"2026-05-30T18:00:53.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/LeoMurilloDev/document-image-extractor","commit_stats":null,"previous_names":["leomurillodev/document-image-extractor"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/LeoMurilloDev/document-image-extractor","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LeoMuri
lloDev%2Fdocument-image-extractor","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LeoMurilloDev%2Fdocument-image-extractor/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LeoMurilloDev%2Fdocument-image-extractor/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LeoMurilloDev%2Fdocument-image-extractor/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LeoMurilloDev","download_url":"https://codeload.github.com/LeoMurilloDev/document-image-extractor/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LeoMurilloDev%2Fdocument-image-extractor/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33707328,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-05-30T02:00:06.278Z","response_time":92,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["automation","cli","document-processing","docx","extractor","office","pdf","pymupdf","python","python-docx"],"created_at":"2026-05-30T20:00:25.440Z","updated_at":"2026-05-30T20:00:47.700Z","avatar_url":"https://github.com/LeoMurilloDev.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Document-Image-Extractor\n\nCLI tool to extract embedded images from **DOCX**, **PDF**, **PPTX** and 
**XLSX** files, with deduplication, size filtering, and batch export to ZIPs.\n\n---\n\n## Features\nExtract images from:\n- DOCX (Word documents)\n- PDF (documents)\n- PPTX (PowerPoint documents)\n- XLSX (Excel documents)\n\nOutputs:\n- Creates a **ZIP per input file** with extracted images\n\nBuilt-in helpers:\n- **Deduplication** (skips repeated images within the same document)\n- **Size filter** (`min_kb` defaults to 5 KB)\n- Handles “no images” and **corrupt files** gracefully\n\n---\n\n## Project status\nThis repository is being improved **phase by phase**.\n\n---\n\n## Requirements\n- Python 3.12+ (recommended)\n\nDependencies (install from `requirements.txt`):\n- python-docx\n- PyMuPDF\n- Pillow\n\n## Installation\n\n### 1. Clone the repository\n```bash\ngit clone https://github.com/LeoMurilloDev/document-image-extractor.git\ncd document-image-extractor\n```\n\n### 2. Create and activate a virtual environment\n#### Windows\n```bash\npython -m venv .venv\n.\\.venv\\Scripts\\activate\n```\n#### macOS / Linux\n```bash\npython3 -m venv .venv\nsource .venv/bin/activate\n```\n\n### 3. 
Install dependencies\n```bash\npip install -r requirements.txt\n```\n\n## Usage\n\n### Folder structure expected by the script\nThe script creates these folders automatically if they don't exist:\n- **Entradas_archivos/** -\u003e place your input files (**.docx**, **.pdf**, **.pptx**, **.xlsx**) here\n- **Salidas_archivos/** -\u003e output ZIPs are generated here\n- **temp/** -\u003e temporary extraction folder (auto-cleaned)\n\n### Configuration\nYou can customize the filters without editing the code by using `config.json` (repo root).\nExample:\n```json\n{\n  \"filters\": {\n    \"min_kb\": 5,\n    \"min_width\": 0,\n    \"min_height\": 0\n  }\n}\n```\n- `min_kb`: minimum file size in KB (default: 5)\n- `min_width` / `min_height`: optional dimension filters (0 disables them)\n\n\n## Run\n```bash\npython main.py\n```\n\n## CLI usage\nThe tool can be run with the default folders and config, or with command-line options:\n\n```bash\npython main.py\n\npython main.py --input Entradas_archivos --output Salidas_archivos\n\npython main.py --input example.pptx --output Salidas_archivos\n\npython main.py --input Entradas_archivos --recursive\n\npython main.py --input Entradas_archivos --min-kb 1 --min-width 100 --min-height 100\n\npython main.py --input Entradas_archivos --no-dedup\n\npython main.py --input Entradas_archivos --format folder\n\npython main.py --input Entradas_archivos --log-level DEBUG --log-file logs/debug.log\n```\n\n## Output\n- For each input file, a ZIP is created in **Salidas_archivos/**\n- Example:\n    - Input: **Entradas_archivos/report.pdf**\n    - Output: **Salidas_archivos/report.zip**\n\n## What to expect\nWhen you run the script, it prints a summary per file:\n- `guardadas` -\u003e images saved successfully\n- `duplicadas` -\u003e images skipped due to hash duplication\n- `pequeñas` -\u003e images filtered out by size\n- `encontradas` -\u003e images found inside the document\n\n### Important notes\n- In `DOCX`, images are saved using the real extension (.jpg, .png, .gif, etc.)\n- `temp/` is cleaned even when a file 
fails\n\n## Test suites\nWe use small test suites to validate the tool.\n### Documents to try\nThe test set includes:\n- Mixed formats (JPG/PNG/GIF)\n- Duplicates\n- Small icon filtered out by size\n- Corrupt files (error handling)\n\nManual validation steps:\n1. Copy test files into `Entradas_archivos/`\n2. Run `python main.py`\n3. Verify\n    - Output ZIPs exist in `Salidas_archivos/`\n    - Extensions are correct in DOCX results (.jpg, .png, .gif)\n    - Duplicates are removed\n    - `temp/` is empty at the end\n\n## Contributing\nIf you want to propose changes:\n1. Fork the repo\n2. Create a branch\n3. Open a PR with a clear description","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleomurillodev%2Fdocument-image-extractor","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleomurillodev%2Fdocument-image-extractor","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleomurillodev%2Fdocument-image-extractor/lists"}