{"id":24936936,"url":"https://github.com/neetigyab/pdfreader","last_synced_at":"2026-05-08T19:32:25.803Z","repository":{"id":271829745,"uuid":"914698559","full_name":"neetigyab/PDFReader","owner":"neetigyab","description":"Ready to use Python application/file for parsing a specific format of pdf form, and storing relevant user data in a tabular format in excel sheet","archived":false,"fork":false,"pushed_at":"2025-02-11T10:34:37.000Z","size":77899,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-07T02:50:21.766Z","etag":null,"topics":["excel","forms","matplotlib","numpy","ocr","opencv-python","pandas","pdf","pdf-converter","pdfplumber","pytesseract","python"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/neetigyab.png","metadata":{"files":{"readme":"ReadME.txt","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null}},"created_at":"2025-01-10T05:58:02.000Z","updated_at":"2025-02-01T19:03:35.000Z","dependencies_parsed_at":null,"dependency_job_id":"039d6a2d-497a-473a-b4c4-743a9ac9543b","html_url":"https://github.com/neetigyab/PDFReader","commit_stats":null,"previous_names":["neetigyab/pdfreader"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/neetigyab/PDFReader","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neetigyab%2FPDFReader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neetigyab%
2FPDFReader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neetigyab%2FPDFReader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neetigyab%2FPDFReader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/neetigyab","download_url":"https://codeload.github.com/neetigyab/PDFReader/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/neetigyab%2FPDFReader/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32794620,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-08T08:22:46.396Z","status":"ssl_error","status_checked_at":"2026-05-08T08:22:45.650Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["excel","forms","matplotlib","numpy","ocr","opencv-python","pandas","pdf","pdf-converter","pdfplumber","pytesseract","python"],"created_at":"2025-02-02T16:57:45.611Z","updated_at":"2026-05-08T19:32:25.778Z","avatar_url":"https://github.com/neetigyab.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# PDF Form Processing Tool\n\nThis project is a modular tool designed for extracting, processing, and saving data from form-like PDF documents. 
The tool detects and processes textual and checkbox elements in PDF files, maps fields to corresponding content, and exports the results to an Excel file.\n\n## Features\n- Extract text and checkbox data from PDFs.\n- Map form fields to their corresponding values.\n- Visualize detected checkboxes on PDF pages.\n- Save mapped data to an Excel file.\n\n## Project Structure\n\n### **Packages and Modules**\n\n#### 1. **`pdf_extractor`**\n- **`extractor.py`**\n  - `extract_pdf_text(pdf_path)`: Extracts all text lines from a PDF document.\n\n#### 2. **`checkbox_detector`**\n- **`checkbox_detector.py`**\n  - `detect_checkboxes(image, ignored_area=None)`: Detects checkboxes in a given image and determines whether they are checked or unchecked.\n  - `visualize_checkboxes(image, checkbox_positions, checkbox_states, page_number)`: Visualizes detected checkboxes and their states on the image.\n\n#### 3. **`checkbox_parser`**\n- **`checkbox_parser.py`**\n  - `parse_checkbox(lines, index, aliases, pdf_path)`: Extracts checkbox states from nearby lines in the PDF.\n\n#### 4. **`field_mapper`**\n- **`field_mapper.py`**\n  - `map_fields_to_content(lines, pdf_path)`: Maps fields to their respective content based on the extracted text and detected checkboxes.\n\n#### 5. **`output_saver`**\n- **`output_generator.py`**\n  - `save_to_excel(mapped_data, filename)`: Saves the mapped data to an Excel file.\n\n#### 6. **`config`**\n- **`config.py`**\n  - `fields`: A list of fields expected in the PDF.\n  - `checkbox_fields`: A subset of fields specifically for checkboxes.\n  - `field_mappings`: A dictionary mapping field names to possible aliases in the PDF.\n\n---\n\n## Libraries Utilized\n1. **`pdfplumber`**: For extracting text and images from PDF documents.\n2. **`OpenCV`**: For detecting and visualizing checkboxes.\n3. **`numpy`**: For handling image arrays and numerical operations.\n4. **`matplotlib`**: For visualizing checkboxes.\n5. 
**`pytesseract`**: For Optical Character Recognition (OCR) when necessary.\n6. **`pandas`**: For saving data to an Excel file.\n\n---\n\n## Manual Inputs and Configuration\n\n### 1. **Fields of Expected Form Elements**\n- **`fields`**: Define all possible fields that the PDF form may contain. This list must be updated to include any new fields introduced in the form layout.\n- Example:\n  ```python\n  fields = [\"Name\", \"Address\", \"Date of Birth\", \"Phone Number\"]\n  ```\n\n### 2. **Field Mappings**\n- **`field_mappings`**: Map each expected field name to the aliases under which it may actually appear in the form, so that differently worded labels resolve to the same field.\n- Example:\n  ```python\n  field_mappings = {\n      \"Name\": [\"Full Name\", \"Name of Applicant\"],\n      \"Address\": [\"Residential Address\", \"Home Address\"],\n  }\n  ```\n\n### 3. **Checkbox Fields**\n- **`checkbox_fields`**: Specify fields that require checkbox processing.\n- Example:\n  ```python\n  checkbox_fields = [\"Terms and Conditions\", \"Subscription Opt-in\"]\n  ```\n\n---\n\n## Updating for New Form Layouts\n1. **Add New Fields**: Update the `fields` list in `config/config.py` to include any new fields present in the updated form layout.\n2. **Update Field Mappings**: Extend `field_mappings` with aliases for new or renamed fields.\n3. **Update Checkbox Fields**: Add any new checkbox-related fields to the `checkbox_fields` list.\n\n---\n\n## Usage\n\n### 1. Extract Text from PDF\nThe `extract_pdf_text` function extracts lines of text from the PDF.\n```python\nlines = extract_pdf_text(\"path/to/pdf\")\n```\n\n### 2. Map Fields to Content\nMap the extracted lines to the predefined fields using `map_fields_to_content`.\n```python\nmapped_data = map_fields_to_content(lines, \"path/to/pdf\")\n```\n\n### 3. Save Data to Excel\nSave the mapped data to an Excel file using `save_to_excel`.\n```python\nsave_to_excel(mapped_data, \"output.xlsx\")\n```\n\n---\n\n## Example Workflow\n\n1. 
Place the PDF file in the project directory.\n2. Run the main script:\n   ```bash\n   python main.py\n   ```\n3. View the extracted and processed data in the output Excel file.\n\n---\n\n## Notes\n- Ensure that any updates to the form layout are reflected in the `fields`, `field_mappings`, and `checkbox_fields` in `config/config.py`.\n- Make sure all required libraries are installed before running the script:\n  ```bash\n  pip install pdfplumber opencv-python numpy matplotlib pytesseract pandas\n  ```\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneetigyab%2Fpdfreader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fneetigyab%2Fpdfreader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fneetigyab%2Fpdfreader/lists"}