{"id":50681346,"url":"https://github.com/arturovaine/smart-catalog-reader","last_synced_at":"2026-06-08T19:04:38.050Z","repository":{"id":342450118,"uuid":"1173888238","full_name":"arturovaine/smart-catalog-reader","owner":"arturovaine","description":"An intelligent PDF catalog extraction system that uses Google Gemini Vision AI to automatically extract structured product data from cosmetics catalogs.","archived":false,"fork":false,"pushed_at":"2026-03-26T23:15:00.000Z","size":7218,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2026-03-27T11:10:55.788Z","etag":null,"topics":["extract-data","gemini-api","vision-language-models"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"other","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/arturovaine.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2026-03-05T21:20:10.000Z","updated_at":"2026-03-26T23:15:04.000Z","dependencies_parsed_at":null,"dependency_job_id":null,"html_url":"https://github.com/arturovaine/smart-catalog-reader","commit_stats":null,"previous_names":["arturovaine/smart-catalog-reader"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/arturovaine/smart-catalog-reader","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arturovaine%2Fsmart-catalog-reader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arturovaine%2Fsmart-c
atalog-reader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arturovaine%2Fsmart-catalog-reader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arturovaine%2Fsmart-catalog-reader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/arturovaine","download_url":"https://codeload.github.com/arturovaine/smart-catalog-reader/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/arturovaine%2Fsmart-catalog-reader/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34076032,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-08T02:00:07.615Z","response_time":111,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["extract-data","gemini-api","vision-language-models"],"created_at":"2026-06-08T19:04:36.953Z","updated_at":"2026-06-08T19:04:38.040Z","avatar_url":"https://github.com/arturovaine.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Smart Catalog Reader\n\nAn intelligent PDF catalog extraction system that uses **Google Gemini Vision AI** to automatically extract structured product data from cosmetics catalogs.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"docs/images/project-cover.png\" alt=\"Smart Catalog Reader - AI-powered product 
extraction\" width=\"50%\"\u003e\n\u003c/p\u003e\n\n## Technical Flow\n\n```\n       SMART CATALOG READER | TECHNICAL FLOW\n       ====================================\n\n       [ INPUT ]            [ AI ENGINE ]             [ OUTPUT ]\n      ┌────────────┐        ┌────────────────────┐     ┌─────────────┐ \n      |  PDF File  |───────▶| GEMINI VISION AI   |────▶|  JSON DATA  |\n      └────────────┘        └────────┬───────────┘     └─────────────┘\n                                     │\n                                     ▼\n                      ┌───────────────────────────────┐\n                      │      EXTRACTION PIPELINE      │\n                      └──────────────┬────────────────┘\n                                     │\n      1. IMAGE PREP   ▶ PDF to high-res images (200 DPI)\n                                     │\n      2. BATCHING     ▶ Convert 5 pages/batch, parallel API calls\n                                     │\n      3. AI VISION    ▶ Extract: Code, Name, Price, Promos\n                                     │\n      4. REFINEMENT   ▶ Fuzzy Match: Normalize Categories\n                                     │\n      5. VALIDATION   ▶ Auto-Correction \u0026 Data Checks\n                                     │\n      6. 
CHECKPOINT   ▶ Save progress every 10 pages\n                                     │\n                                     ▼\n                      ┌───────────────────────────────┐\n                      |      DATA SAMPLE (JSON)       |\n                      |-------------------------------|\n                      | {                             |\n                      |   \"codigo\": \"85675\",          |\n                      |   \"nome\": \"Creme Relaxante\",  |\n                      |   \"preco_regular\": 69.90,     |\n                      |   \"preco_promocional\": 55.90, |\n                      |   \"categoria_normalizada\":    |\n                      |       \"Corpo e Banho\"         |\n                      | }                             |\n                      └───────────────────────────────┘\n```\n\n## Overview\n\nThis tool processes PDF catalogs from brands like **O Boticário** and **Natura**, converting each page into images and using AI vision to extract:\n\n- **Product details**: code, name, product line, category, volume/weight\n- **Pricing**: regular price, promotional price, savings, discount percentage\n- **Promotions**: simple discounts, progressive discounts, combos, buy-x-pay-y\n- **Features**: vegan, dermatologist tested, sustainability badges, etc.\n- **Validation**: automatic data validation with alerts and auto-corrections\n\nThe extracted data is normalized, validated, and exported to structured JSON format, ready for integration with e-commerce platforms, analytics systems, or inventory management.\n\n## Sample Catalogs\n\n| O Boticário - Ciclo 03 | Natura - Ciclo 03 |\n|:----------------------:|:-----------------:|\n| ![O Boticário Cover](docs/images/boticario-cover.png) | ![Natura Cover](docs/images/natura-cover.png) |\n\n### Extraction Results\n\n| Metric | O Boticário C03 | Natura C03 | Total |\n|--------|---------------:|------------:|------:|\n| **Pages Processed** | 197 | 164 | 361 |\n| **Products Extracted** | 1,638 | 
1,206 | 2,844 |\n| **Maquiagem** | 601 | 393 | 994 |\n| **Corpo e Banho** | 328 | 309 | 637 |\n| **Perfumaria** | 277 | 191 | 468 |\n| **Cabelos** | 174 | 187 | 361 |\n| **Outros** | 258 | 126 | 384 |\n\n**Catalog Sources:**\n- [O Boticário Catalogs](https://brcatalogos.com.br/oboticario/)\n- [Natura Catalogs](https://brcatalogos.com.br/revista-natura/)\n\n### Sample Output\n\n\u003ctable\u003e\n\u003ctr\u003e\n\u003cth\u003eO Boticário\u003c/th\u003e\n\u003cth\u003eNatura\u003c/th\u003e\n\u003c/tr\u003e\n\u003ctr\u003e\n\u003ctd\u003e\n\n```json\n{\n  \"code\": \"85675\",\n  \"name\": \"Creme Relaxante Hidratante Corporal\",\n  \"product_line\": \"Cuide-se Bem\",\n  \"category\": \"Hidratação\",\n  \"normalized_category\": \"Corpo e Banho\",\n  \"volume_weight\": \"200 g\",\n  \"regular_price\": 69.9,\n  \"promotional_price\": 55.9,\n  \"savings\": 14.0,\n  \"discount_percentage\": 20.03,\n  \"promotion_active\": true,\n  \"promotional_rule\": {\n    \"type\": \"simple_discount\",\n    \"description\": \"ECONOMIZE R$ 14,00\"\n  },\n  \"features\": [\"Vegano\", \"Sensação relaxante\"],\n  \"page\": 2\n}\n```\n\n\u003c/td\u003e\n\u003ctd\u003e\n\n```json\n{\n  \"code\": \"204459\",\n  \"name\": \"Desodorante Hidratante Perfumado Luna\",\n  \"product_line\": \"Natura Luna Nuit\",\n  \"category\": \"Corpo e Banho\",\n  \"normalized_category\": \"Corpo e Banho\",\n  \"volume_weight\": \"300 ml\",\n  \"regular_price\": 95.9,\n  \"promotional_price\": 66.9,\n  \"savings\": 29.0,\n  \"discount_percentage\": 30.24,\n  \"promotion_active\": true,\n  \"promotional_rule\": {\n    \"type\": \"simple_discount\",\n    \"description\": \"Economize R$ 29,00\"\n  },\n  \"features\": [\"08 pontos\"],\n  \"page\": 9\n}\n```\n\n\u003c/td\u003e\n\u003c/tr\u003e\n\u003c/table\u003e\n\n### Full Output Structure\n\n\u003cdetails\u003e\n\u003csummary\u003eClick to expand complete JSON output example\u003c/summary\u003e\n\n```json\n{\n  \"metadata\": {\n    \"name\": \"boticario-c03\",\n    
\"brand\": \"O Boticário\",\n    \"cycle\": null,\n    \"validity_start\": null,\n    \"validity_end\": null,\n    \"total_pages\": 197,\n    \"source_file\": \"data/catalogs/boticario-c03.pdf\",\n    \"pages_processed\": 197,\n    \"pages_with_errors\": [],\n    \"exported_at\": \"2026-03-05T21:42:52.364339\"\n  },\n  \"global_promotional_rules\": [],\n  \"products\": [\n    {\n      \"code\": \"85675\",\n      \"name\": \"Creme Relaxante Hidratante Corporal Cuide-se Bem Cereja de Fases\",\n      \"product_line\": \"Cuide-se Bem\",\n      \"category\": \"Hidratação\",\n      \"normalized_category\": \"Corpo e Banho\",\n      \"volume_weight\": \"200 g\",\n      \"quantity\": null,\n      \"regular_price\": 69.9,\n      \"promotional_price\": 55.9,\n      \"savings\": 14.0,\n      \"discount_percentage\": 20.03,\n      \"promotion_active\": true,\n      \"promotional_rule\": {\n        \"type\": \"simple_discount\",\n        \"description\": \"ECONOMIZE R$ 14,00\",\n        \"conditions\": {},\n        \"discount_tiers\": {},\n        \"combo_codes\": [],\n        \"related_pages\": []\n      },\n      \"features\": [\n        \"Vegano\",\n        \"Sensação relaxante\",\n        \"Ajuda a aliviar o cansaço\"\n      ],\n      \"page\": 2,\n      \"quadrant\": \"bottom-left\",\n      \"alerts\": []\n    },\n    {\n      \"code\": \"85679\",\n      \"name\": \"Sabonete em Barra Cuide-se Bem Cereja de Fases\",\n      \"product_line\": \"Cuide-se Bem\",\n      \"category\": \"Limpeza\",\n      \"normalized_category\": \"Corpo e Banho\",\n      \"volume_weight\": \"4 unidades de 80 g cada\",\n      \"quantity\": null,\n      \"regular_price\": 36.9,\n      \"promotional_price\": 28.9,\n      \"savings\": 8.0,\n      \"discount_percentage\": 21.68,\n      \"promotion_active\": true,\n      \"promotional_rule\": {\n        \"type\": \"simple_discount\",\n        \"description\": \"ECONOMIZE R$ 8,00\",\n        \"conditions\": {},\n        \"discount_tiers\": {},\n        
\"combo_codes\": [],\n        \"related_pages\": []\n      },\n      \"features\": [\n        \"Vegano\",\n        \"Limpa sem ressecar\"\n      ],\n      \"page\": 2,\n      \"quadrant\": \"bottom-left\",\n      \"alerts\": []\n    }\n  ],\n  \"statistics\": {\n    \"total_products\": 1638,\n    \"products_with_promotion\": 539,\n    \"categories\": [\n      \"Outros\",\n      \"Unhas\",\n      \"Maquiagem\",\n      \"Perfumaria\",\n      \"Cabelos\",\n      \"Corpo e Banho\"\n    ],\n    \"product_lines\": [\n      \"Beijinho\",\n      \"Intense\",\n      \"Egeo\",\n      \"Escudo de Força e Brilho\",\n      \"BOTI.SUN kids\",\n      \"Skin.q\",\n      \"Boti.Sun\"\n    ]\n  }\n}\n```\n\n\u003c/details\u003e\n\n## Extraction Pipeline (Detailed Flow)\n\n```\n┌─────────────────────────────────────────────────────────────────────────────────┐\n│                              INPUT: PDF Catalog                                 │\n│                          (e.g., boticario-c03.pdf)                              │\n└─────────────────────────────────────┬───────────────────────────────────────────┘\n                                      │\n                                      ▼\n┌─────────────────────────────────────────────────────────────────────────────────┐\n│                                                                                 │\n│                            1. 
PDF PROCESSING                                    │\n│                            ──────────────────                                   │\n│      ┌─────────────┐      ┌─────────────┐      ┌─────────────┐                  │\n│      │   Page 1    │      │   Page 2    │      │   Page N    │                  │\n│      │    PDF      │ ───► │    PDF      │ ───► │    PDF      │                  │\n│      └──────┬──────┘      └──────┬──────┘      └──────┬──────┘                  │\n│             │                    │                    │                         │\n│             ▼                    ▼                    ▼                         │\n│      ┌─────────────┐      ┌─────────────┐      ┌─────────────┐                  │\n│      │   Image 1   │      │   Image 2   │      │   Image N   │                  │\n│      │  (PIL/PNG)  │      │  (PIL/PNG)  │      │  (PIL/PNG)  │                  │\n│      │  DPI: 200   │      │  DPI: 200   │      │  DPI: 200   │                  │\n│      └─────────────┘      └─────────────┘      └─────────────┘                  │\n│                                                                                 │\n│      Library: pdf2image (poppler)                                               │\n│                                                                                 │\n└─────────────────────────────────────┬───────────────────────────────────────────┘\n                                      │\n                                      ▼\n┌─────────────────────────────────────────────────────────────────────────────────┐\n│                                                                                 │\n│                        2. 
BATCH PROCESSING (5 pages/batch)                      │\n│                        ───────────────────────────────────                      │\n│                                                                                 │\n│      ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐        │\n│      │ Batch 1 │   │ Batch 2 │   │ Batch 3 │   │   ...   │   │ Batch N │        │\n│      │ Pages   │   │ Pages   │   │ Pages   │   │         │   │ Pages   │        │\n│      │  1-5    │   │  6-10   │   │ 11-15   │   │         │   │196-197  │        │\n│      └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘   └────┬────┘        │\n│           │             │             │             │             │             │\n│           └─────────────┴─────────────┴─────────────┴─────────────┘             │\n│                                       │                                         │\n│                       ┌───────────────┴───────────────┐                         │\n│                       │      ThreadPoolExecutor       │                         │\n│                       │    (max_workers: 2 or 10)     │                         │\n│                       └───────────────────────────────┘                         │\n│                                                                                 │\n└─────────────────────────────────────┬───────────────────────────────────────────┘\n                                      │\n                                      ▼\n┌─────────────────────────────────────────────────────────────────────────────────┐\n│                                                                                 │\n│                         3. 
LLM EXTRACTION (Gemini Vision)                       │\n│                         ─────────────────────────────────                       │\n│                                                                                 │\n│      ┌──────────────────────────────────────────────────────────────────┐       │\n│      │                     Gemini 2.5 Flash API                         │       │\n│      │  ┌─────────────────────────────────────────────────────────────┐ │       │\n│      │  │  PROMPT:                                                    │ │       │\n│      │  │  - Extract product code, name, line, category               │ │       │\n│      │  │  - Extract regular_price, promotional_price                 │ │       │\n│      │  │  - Identify promotion type (simple, progressive, combo)     │ │       │\n│      │  │  - Extract features (vegan, dermatologist tested, etc.)     │ │       │\n│      │  │  - Detect page type (products, advertising, index)          │ │       │\n│      │  └─────────────────────────────────────────────────────────────┘ │       │\n│      │                              │                                   │       │\n│      │                              ▼                                   │       │\n│      │  ┌─────────────────────────────────────────────────────────────┐ │       │\n│      │  │  OUTPUT: JSON with products[]                               │ │       │\n│      │  │  {                                                          │ │       │\n│      │  │    \"produtos\": [...],                                       │ │       │\n│      │  │    \"regras_promocionais_globais\": [...],                    │ │       │\n│      │  │    \"tipo_pagina\": \"products\"                                │ │       │\n│      │  │  }                                                          │ │       │\n│      │  └─────────────────────────────────────────────────────────────┘ │       │\n│      
└──────────────────────────────────────────────────────────────────┘       │\n│                                                                                 │\n│      Retry Logic: Exponential backoff (tenacity) for rate limits                │\n│                                                                                 │\n└─────────────────────────────────────┬───────────────────────────────────────────┘\n                                      │\n                                      ▼\n┌─────────────────────────────────────────────────────────────────────────────────┐\n│                                                                                 │\n│                         4. CATEGORY NORMALIZATION                               │\n│                         ─────────────────────────                               │\n│                                                                                 │\n│      Raw Category              Fuzzy Match (thefuzz)         Normalized         │\n│      ─────────────             ────────────────────          ──────────         │\n│      \"Hidratação\"        ───►  score: 85  ───────────►  \"Corpo e Banho\"         │\n│      \"Perfume Feminino\"  ───►  score: 90  ───────────►  \"Perfumaria\"            │\n│      \"Batom\"             ───►  score: 88  ───────────►  \"Maquiagem\"             │\n│      \"Shampoo\"           ───►  score: 92  ───────────►  \"Cabelos\"               │\n│      \"Desconhecido\"      ───►  score: 45  ───────────►  \"Outros\"                │\n│                                                                                 │\n│      Master Categories: Perfumaria, Maquiagem, Corpo e Banho, Cabelos,          │\n│                         Skincare, Infantil, Masculino, Unhas, Acessórios,       │\n│                         Kits e Presentes, Proteção Solar, Desodorantes          │\n│                                                                                 │\n│      Threshold: 75 
(configurable)                                               │\n│                                                                                 │\n└─────────────────────────────────────┬───────────────────────────────────────────┘\n                                      │\n                                      ▼\n┌─────────────────────────────────────────────────────────────────────────────────┐\n│                                                                                 │\n│                       5. VALIDATION \u0026 AUTO-CORRECTION                           │\n│                       ───────────────────────────────                           │\n│                                                                                 │\n│      ┌─────────────────────────────────────────────────────────────────┐        │\n│      │  VALIDATION RULES:                                              │        │\n│      │  ├── Code: 4-6 digits, alphanumeric                             │        │\n│      │  ├── Name: min 3 chars, no suspicious characters                │        │\n│      │  ├── Price: range 0.01 - 10,000 BRL                             │        │\n│      │  ├── Discount: max 80%                                          │        │\n│      │  ├── Promotional consistency: flags match prices                │        │\n│      │  └── Duplicate detection: unique product codes                  │        │\n│      └─────────────────────────────────────────────────────────────────┘        │\n│                                                                                 │\n│      ┌─────────────────────────────────────────────────────────────────┐        │\n│      │  AUTO-CORRECTIONS:                                              │        │\n│      │  ├── Swap prices if promotional \u003e regular                       │        │\n│      │  ├── Calculate savings if missing                               │        │\n│      │  ├── Set promotion_active flag if promo price 
exists            │        │\n│      │  └── Normalize category if missing                              │        │\n│      └─────────────────────────────────────────────────────────────────┘        │\n│                                                                                 │\n│      Alerts: ERROR (critical) │ WARNING (review) │ INFO (informational)         │\n│                                                                                 │\n└─────────────────────────────────────┬───────────────────────────────────────────┘\n                                      │\n                                      ▼\n┌─────────────────────────────────────────────────────────────────────────────────┐\n│                                                                                 │\n│                          6. CHECKPOINTING (every 10 pages)                      │\n│                          ─────────────────────────────────                      │\n│                                                                                 │\n│      .checkpoints/                                                              │\n│      └── boticario-c03.checkpoint.json                                          │\n│          {                                                                      │\n│            \"catalog\": { ... 
},            ◄── Full catalog state                │\n│            \"last_processed_page\": 110,    ◄── Resume point                      │\n│            \"checkpoint_time\": \"...\"       ◄── Timestamp                         │\n│          }                                                                      │\n│                                                                                 │\n│      On interrupt (Ctrl+C): Progress saved, resume with --resume                │\n│      On completion: Checkpoint deleted automatically                            │\n│                                                                                 │\n└─────────────────────────────────────┬───────────────────────────────────────────┘\n                                      │\n                                      ▼\n┌─────────────────────────────────────────────────────────────────────────────────┐\n│                                                                                 │\n│                            7. 
JSON PERSISTENCE                                  │\n│                            ───────────────────                                  │\n│                                                                                 │\n│      data/output/boticario-c03_20260305_214252.json                             │\n│      ┌─────────────────────────────────────────────────────────────────┐        │\n│      │  {                                                              │        │\n│      │    \"metadata\": {                                                │        │\n│      │      \"name\": \"boticario-c03\",                                   │        │\n│      │      \"brand\": \"O Boticário\",                                    │        │\n│      │      \"total_pages\": 197,                                        │        │\n│      │      \"pages_processed\": 197                                     │        │\n│      │    },                                                           │        │\n│      │    \"global_promotional_rules\": [...],                           │        │\n│      │    \"products\": [ 1638 products ],                               │        │\n│      │    \"statistics\": {                                              │        │\n│      │      \"total_products\": 1638,                                    │        │\n│      │      \"products_with_promotion\": 539,                            │        │\n│      │      \"categories\": [\"Maquiagem\", \"Corpo e Banho\", ...]          
│        │\n│      │    }                                                            │        │\n│      │  }                                                              │        │\n│      └─────────────────────────────────────────────────────────────────┘        │\n│                                                                                 │\n└─────────────────────────────────────────────────────────────────────────────────┘\n```\n\n## Clean Architecture\n\n```\n┌─────────────────────────────────────────────────────────────────────────────────┐\n│                                                                                 │\n│   ┌──────────────────────────────────────────────────────────────────────────┐  │\n│   │                           PRESENTATION LAYER                             │  │\n│   │                                                                          │  │\n│   │   ┌─────────────────────────────────────────────────────────────────┐    │  │\n│   │   │                      CLI (Typer + Rich)                         │    │  │\n│   │   │                       main.py                                   │    │  │\n│   │   │  ┌──────────┐  ┌──────────┐  ┌───────────┐  ┌──────────┐        │    │  │\n│   │   │  │ extract  │  │ validate │  │   list    │  │   info   │        │    │  │\n│   │   │  │ command  │  │ command  │  │  command  │  │ command  │        │    │  │\n│   │   │  └──────────┘  └──────────┘  └───────────┘  └──────────┘        │    │  │\n│   │   │                      │                                          │    │  │\n│   │   │                      ▼                                          │    │  │\n│   │   │            Progress Bar │ Tables │ Console Output               │    │  │\n│   │   └─────────────────────────────────────────────────────────────────┘    │  │\n│   └──────────────────────────────────────────────────────────────────────────┘  │\n│                                       │                                        
 │\n│                                       ▼                                         │\n│   ┌──────────────────────────────────────────────────────────────────────────┐  │\n│   │                      DEPENDENCY INJECTION LAYER                          │  │\n│   │                                                                          │  │\n│   │   ┌─────────────────────────────────────────────────────────────────┐    │  │\n│   │   │            DI Container (dependency-injector)                   │    │  │\n│   │   │                      container.py                               │    │  │\n│   │   │                                                                 │    │  │\n│   │   │   Settings ─────┬──────────────────────────────────────────┐    │    │  │\n│   │   │                 │                                          │    │    │  │\n│   │   │   Providers:    │                                          │    │    │  │\n│   │   │   ├── pdf_processor ──► Singleton                          │    │    │  │\n│   │   │   ├── llm_client ─────► Singleton                          │    │    │  │\n│   │   │   ├── normalizer ─────► Singleton                          │    │    │  │\n│   │   │   ├── validator ──────► Singleton                          │    │    │  │\n│   │   │   ├── storage ────────► Singleton                          │    │    │  │\n│   │   │   └── extraction_service ► Factory                         │    │    │  │\n│   │   └─────────────────────────────────────────────────────────────────┘    │  │\n│   └──────────────────────────────────────────────────────────────────────────┘  │\n│                                       │                                         │\n│                                       ▼                                         │\n│   ┌──────────────────────────────────────────────────────────────────────────┐  │\n│   │                         APPLICATION LAYER                                │  │\n│   │                        (Use Cases 
/ Services)                            │  │\n│   │                                                                          │  │\n│   │   ┌─────────────────────────────────┐  ┌────────────────────────────┐    │  │\n│   │   │   CatalogExtractionService      │  │  ProductValidatorService   │    │  │\n│   │   │   extraction_service.py         │  │  validator.py              │    │  │\n│   │   │                                 │  │                            │    │  │\n│   │   │   Methods:                      │  │  Methods:                  │    │  │\n│   │   │   ├── extract_catalog()         │  │  ├── validate()            │    │  │\n│   │   │   ├── extract_pages()           │  │  ├── validate_batch()      │    │  │\n│   │   │   ├── retry_failed_pages()      │  │  ├── auto_correct()        │    │  │\n│   │   │   ├── save_results()            │  │  └── get_validation_summary│    │  │\n│   │   │   └── get_validation_report()   │  │                            │    │  │\n│   │   └────────────────┬────────────────┘  └────────────────────────────┘    │  │\n│   │                    │                                                     │  │\n│   │      Orchestrates: │                                                     │  │\n│   │      PDF ──► LLM ──► Normalize ──► Validate ──► Store                    │  │\n│   └──────────────────────────────────────────────────────────────────────────┘  │\n│                                       │                                         │\n│                                       ▼                                         │\n│   ┌──────────────────────────────────────────────────────────────────────────┐  │\n│   │                           DOMAIN LAYER                                   │  │\n│   │                     (Entities / Interfaces / Ports)                      │  │\n│   │                                                                          │  │\n│   │   ┌─────────────────────────────────┐  ┌────────────────────────────┐    │  
│\n│   │   │        models.py                │  │      interfaces.py         │    │  │\n│   │   │                                 │  │      (Ports/Contracts)     │    │  │\n│   │   │   Entities:                     │  │                            │    │  │\n│   │   │   ├── Product                   │  │   Abstract Classes:        │    │  │\n│   │   │   ├── Catalog                   │  │   ├── PDFProcessor         │    │  │\n│   │   │   ├── PromotionalRule           │  │   ├── LLMClient            │    │  │\n│   │   │   ├── ExtractionResult          │  │   ├── CategoryNormalizer   │    │  │\n│   │   │   ├── PageImage                 │  │   ├── ProductValidator     │    │  │\n│   │   │   └── ValidationAlert           │  │   └── StorageRepository    │    │  │\n│   │   │                                 │  │                            │    │  │\n│   │   │   Enums:                        │  │   These define contracts   │    │  │\n│   │   │   ├── PromotionType             │  │   that infrastructure      │    │  │\n│   │   │   └── ValidationAlertLevel      │  │   must implement           │    │  │\n│   │   └─────────────────────────────────┘  └────────────────────────────┘    │  │\n│   └──────────────────────────────────────────────────────────────────────────┘  │\n│                                       │                                         │\n│                                       ▼                                         │\n│   ┌──────────────────────────────────────────────────────────────────────────┐  │\n│   │                       INFRASTRUCTURE LAYER                               │  │\n│   │                    (Concrete Implementations / Adapters)                 │  │\n│   │                                                                          │  │\n│   │   ┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐       │  │\n│   │   │PDF2ImageProcessor│  │   GeminiClient   │  │FuzzyCategoryNorm.│       │  │\n│   │   │ pdf_processor.py │  │   
llm_client.py  │  │  normalizer.py   │       │  │\n│   │   │                  │  │                  │  │                  │       │  │\n│   │   │ Implements:      │  │ Implements:      │  │ Implements:      │       │  │\n│   │   │ PDFProcessor     │  │ LLMClient        │  │ CategoryNormalizer       │  │\n│   │   │                  │  │                  │  │                  │       │  │\n│   │   │ Uses:            │  │ Uses:            │  │ Uses:            │       │  │\n│   │   │ - pdf2image      │  │ - google-genai   │  │ - thefuzz        │       │  │\n│   │   │ - Pillow         │  │ - tenacity       │  │ - rapidfuzz      │       │  │\n│   │   └──────────────────┘  └──────────────────┘  └──────────────────┘       │  │\n│   │                                                                          │  │\n│   │   ┌──────────────────────────────────────────────────────────────┐       │  │\n│   │   │               JSONStorageRepository                          │       │  │\n│   │   │                    storage.py                                │       │  │\n│   │   │                                                              │       │  │\n│   │   │   Implements: StorageRepository                              │       │  │\n│   │   │                                                              │       │  │\n│   │   │   Methods:                                                   │       │  │\n│   │   │   ├── save_catalog()        - Save to JSON                   │       │  │\n│   │   │   ├── load_catalog()        - Load from JSON                 │       │  │\n│   │   │   ├── save_checkpoint()     - Save progress                  │       │  │\n│   │   │   ├── load_checkpoint()     - Resume progress                │       │  │\n│   │   │   ├── delete_checkpoint()   - Cleanup                        │       │  │\n│   │   │   └── list_catalogs()       - List PDFs                      │       │  │\n│   │   └──────────────────────────────────────────────────────────────┘       
│  │\n│   └──────────────────────────────────────────────────────────────────────────┘  │\n│                                                                                 │\n│   ┌──────────────────────────────────────────────────────────────────────────┐  │\n│   │                        EXTERNAL DEPENDENCIES                             │  │\n│   │                                                                          │  │\n│   │     ┌─────────┐   ┌─────────┐   ┌─────────┐   ┌─────────┐                │  │\n│   │     │ Gemini  │   │ Poppler │   │  JSON   │   │  File   │                │  │\n│   │     │   API   │   │  (PDF)  │   │  Files  │   │ System  │                │  │\n│   │     └─────────┘   └─────────┘   └─────────┘   └─────────┘                │  │\n│   └──────────────────────────────────────────────────────────────────────────┘  │\n│                                                                                 │\n└─────────────────────────────────────────────────────────────────────────────────┘\n```\n\n## Data Flow Diagram\n\n```\n                          ┌─────────────────────────────────────┐\n                          │           User Request              │\n                          │   catalog-extractor extract foo.pdf │\n                          └──────────────────┬──────────────────┘\n                                             │\n                                             ▼\n┌────────────────────────────────────────────────────────────────────────────────┐\n│                                                                                │\n│  ┌─────────────┐      ┌─────────────┐      ┌─────────────┐      ┌───────────┐  │\n│  │             │      │             │      │             │      │           │  │\n│  │  PDF File   │─────►│ PDF2Image   │─────►│   Gemini    │─────►│ Normalizer│  │\n│  │  (Input)    │      │  Processor  │      │   Client    │      │           │  │\n│  │             │      │             │      │             │      │   
        │  │\n│  └─────────────┘      └─────────────┘      └─────────────┘      └─────┬─────┘  │\n│        │                    │                    │                    │        │\n│        │                    │                    │                    │        │\n│        │              ┌─────┴─────┐        ┌─────┴─────┐        ┌─────┴─────┐  │\n│        │              │  PIL      │        │   JSON    │        │  Fuzzy    │  │\n│        │              │  Images   │        │  Response │        │  Matched  │  │\n│        │              └───────────┘        └───────────┘        └───────────┘  │\n│        │                                                              │        │\n│        │                                                              ▼        │\n│        │                                                      ┌─────────────┐  │\n│        │                                                      │             │  │\n│        │                                                      │  Validator  │  │\n│        │                                                      │             │  │\n│        │                                                      └─────┬───────┘  │\n│        │                                                            │          │\n│        │                                                      ┌─────┴─────┐    │\n│        │                                                      │ Validated │    │\n│        │                                                      │ Products  │    │\n│        │                                                      │ + Alerts  │    │\n│        │                                                      └─────┬─────┘    │\n│        │                                                            │          │\n│        │              ┌───────────────────────────────────┐         │          │\n│        │              │         Checkpoint                │◄────────┤          │\n│        │              │    (every 10 pages)         
      │         │          │\n│        │              └───────────────────────────────────┘         │          │\n│        │                                                            │          │\n│        │                                                            ▼          │\n│        │                                                    ┌─────────────┐    │\n│        └───────────────────────────────────────────────────►│   Storage   │    │\n│                         source_file reference               │ Repository  │    │\n│                                                             └──────┬──────┘    │\n│                                                                    │           │\n└────────────────────────────────────────────────────────────────────┼───────────┘\n                                                                     │\n                                                                     ▼\n                                              ┌────────────────────────────────────┐\n                                              │         Output JSON                │\n                                              │  data/output/catalog_YYYYMMDD.json │\n                                              └────────────────────────────────────┘\n```\n\n## Features\n\n- **PDF to Image Conversion**: Convert catalog pages with configurable DPI\n- **AI Vision Extraction**: Use Gemini to extract product data from images\n- **Category Normalization**: Fuzzy matching to normalize categories\n- **Data Validation**: Validate extracted data with auto-correction\n- **Checkpointing**: Resume interrupted extractions\n- **Rate Limiting**: Exponential backoff for API rate limits\n- **Progress Tracking**: Real-time progress with Rich\n\n## Installation\n\n```bash\n# Clone the repository\ngit clone https://github.com/arturovaine/smart-catalog-reader.git\ncd smart-catalog-reader\n\n# Create virtual environment\npython -m venv .venv\nsource .venv/bin/activate  # Linux/Mac\n# .venv\\Scripts\\activate   # Windows\n\n# Install dependencies\npip 
install -e .\n\n# For development\npip install -e \".[dev]\"\n```\n\n## Configuration\n\nCreate a `.env` file (see `.env.example`):\n\n```bash\n# Required\nGEMINI_API_KEY=your_api_key_here\n\n# Optional\nIS_PAID_TIER=false\nGEMINI_MODEL=gemini-2.5-flash\nDPI_DEFAULT=200\nFUZZY_MATCH_THRESHOLD=75\n```\n\n## Usage\n\n### Extract a catalog\n\n```bash\n# Basic extraction\ncatalog-extractor extract data/catalogs/boticario-c03.pdf\n\n# With options\ncatalog-extractor extract data/catalogs/natura-c03.pdf \\\n  --name \"Natura Ciclo 03\" \\\n  --brand \"Natura\" \\\n  --output data/output/natura-c03.json\n\n# Extract specific pages\ncatalog-extractor extract data/catalogs/boticario-c03.pdf --pages \"1-10,50,100-110\"\n```\n\n### Validate extracted data\n\n```bash\ncatalog-extractor validate data/output/natura-c03.json\n```\n\n### List available catalogs\n\n```bash\ncatalog-extractor list-catalogs data/catalogs/\n```\n\n### Show configuration\n\n```bash\ncatalog-extractor info\n```\n\n## Project Structure\n\n```\nsmart-catalog-reader/\n├── src/catalog_extractor/\n│   ├── domain/                 # Domain entities and interfaces\n│   │   ├── models.py          # Product, Catalog, etc.\n│   │   └── interfaces.py      # Abstract interfaces (ports)\n│   ├── infrastructure/         # Concrete implementations\n│   │   ├── pdf_processor.py   # PDF to image conversion\n│   │   ├── llm_client.py      # Gemini API integration\n│   │   ├── normalizer.py      # Category normalization\n│   │   └── storage.py         # JSON persistence\n│   ├── application/            # Use cases and services\n│   │   ├── extraction_service.py\n│   │   └── validator.py\n│   ├── config/                 # Configuration\n│   │   └── settings.py\n│   ├── container.py            # Dependency injection\n│   └── main.py                 # CLI entry point\n├── data/\n│   ├── catalogs/               # Input PDFs (gitignored)\n│   └── output/                 # Extracted JSONs (gitignored)\n├── tests/\n├── 
pyproject.toml\n└── README.md\n```\n\n## Output Format\n\n```json\n{\n  \"metadata\": {\n    \"nome\": \"O Boticário - Ciclo 03\",\n    \"marca\": \"O Boticário\",\n    \"total_paginas\": 197,\n    \"exported_at\": \"2025-03-05T18:30:00\"\n  },\n  \"produtos\": [\n    {\n      \"codigo\": \"85675\",\n      \"nome\": \"Creme Relaxante Hidratante Corporal Cuide-se Bem\",\n      \"linha\": \"Cuide-se Bem\",\n      \"categoria\": \"Hidratação\",\n      \"categoria_normalizada\": \"Corpo e Banho\",\n      \"volume_peso\": \"200g\",\n      \"preco_regular\": 69.90,\n      \"preco_promocional\": 55.90,\n      \"desconto_percentual\": 20.03,\n      \"promocao_ativa\": true,\n      \"regra_promocional\": {\n        \"tipo\": \"simple_discount\",\n        \"descricao\": \"Desconto direto\"\n      },\n      \"caracteristicas\": [\"Vegano\", \"Sensação relaxante\"],\n      \"pagina\": 2,\n      \"alertas\": []\n    }\n  ],\n  \"estatisticas\": {\n    \"total_produtos\": 1250,\n    \"produtos_com_promocao\": 890,\n    \"categorias\": [\"Corpo e Banho\", \"Perfumaria\", \"Maquiagem\", ...],\n    \"linhas\": [\"Cuide-se Bem\", \"Botik\", \"Nativa SPA\", ...]\n  }\n}\n```\n\n## Development\n\n```bash\n# Run tests\npytest\n\n# Run tests with coverage\npytest --cov=catalog_extractor\n\n# Type checking\nmypy src/\n\n# Linting\nruff check src/\n```\n\n## SOLID Principles Applied\n\n- **Single Responsibility**: Each class has one reason to change\n- **Open/Closed**: Extend via new implementations, not modifications\n- **Liskov Substitution**: Implementations honor their interface contracts, so they are interchangeable\n- **Interface Segregation**: Small, focused interfaces\n- **Dependency Inversion**: Depend on abstractions, not concretions\n\n## License\n\nThis project is licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License (CC BY-NC 4.0)](LICENSE).\n\nYou are free to share and adapt this work for non-commercial purposes with 
attribution.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farturovaine%2Fsmart-catalog-reader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Farturovaine%2Fsmart-catalog-reader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Farturovaine%2Fsmart-catalog-reader/lists"}