{"id":24325763,"url":"https://github.com/toatoes/py_photo-excel_data_convert","last_synced_at":"2026-03-14T11:36:53.411Z","repository":{"id":272089491,"uuid":"915497225","full_name":"ToaToes/py_Photo-Excel_data_convert","owner":"ToaToes","description":"python to identify photo of Excel-like paper and get data, input the data to Spread Sheets(xlsx)  within the format","archived":false,"fork":false,"pushed_at":"2025-01-12T02:03:28.000Z","size":4,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-12T03:20:29.092Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/ToaToes.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-12T01:59:31.000Z","updated_at":"2025-01-12T02:03:31.000Z","dependencies_parsed_at":"2025-01-12T03:30:33.526Z","dependency_job_id":null,"html_url":"https://github.com/ToaToes/py_Photo-Excel_data_convert","commit_stats":null,"previous_names":["toatoes/py_photo-excel_data_convert"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ToaToes%2Fpy_Photo-Excel_data_convert","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ToaToes%2Fpy_Photo-Excel_data_convert/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ToaToes%2Fpy_Photo-Excel_data_convert/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/ToaTo
es%2Fpy_Photo-Excel_data_convert/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/ToaToes","download_url":"https://codeload.github.com/ToaToes/py_Photo-Excel_data_convert/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":242973983,"owners_count":20215251,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-01-17T20:20:16.086Z","updated_at":"2025-12-24T11:04:26.348Z","avatar_url":"https://github.com/ToaToes.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# py_Photo2Excel_data_convert\nPython code that reads a photo of an Excel-like paper report, extracts the data, and enters it into a spreadsheet (xlsx) in the same layout.\n\nTo automate the process of extracting numerical data from a photo of an Excel-like report and writing it into a Numbers spreadsheet on macOS, follow these four steps:\n\n1. Image Preprocessing: Prepare the image for text extraction (for example, by converting it to grayscale and applying thresholding).\n2. Optical Character Recognition (OCR): Extract the numbers (and possibly headers) from the image.\n3. Data Parsing: Structure the extracted text into a tabular format.\n4. Write Data to Numbers Spreadsheet: Use AppleScript (through Python) to write the parsed data into a Numbers file.\n\nHere’s how you can implement this workflow using Python.\n\n## Step 1: Install Required Libraries\nYou'll need several libraries to perform image processing, OCR, and automation tasks. 
Install the following Python libraries:\n```\npip install pytesseract opencv-python numpy pillow\n```\n\npytesseract is the Python wrapper for Tesseract OCR, which will be used to extract text from the image.\nopencv-python will help in preprocessing the image (grayscale, thresholding).\npillow is for image manipulation (if needed).\nAdditionally, you'll need Tesseract OCR installed on your machine. You can install it via Homebrew on macOS:\n```\nbrew install tesseract\n```\n\n## Step 2: Preprocess the Image and Extract Text Using OCR\nLet's start by preprocessing the image to improve the OCR results and then extract text (numbers) from the image.\n\n```\nimport cv2\nimport pytesseract\nimport numpy as np\nfrom PIL import Image\n\n# Set the path for Tesseract (if the tesseract binary is not on your PATH)\npytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'  # Adjust path as needed (Homebrew on Apple Silicon typically uses /opt/homebrew/bin/tesseract)\n\ndef extract_numbers_from_image(image_path):\n    # Load the image; cv2.imread returns None instead of raising if the file cannot be read\n    img = cv2.imread(image_path)\n    if img is None:\n        raise FileNotFoundError(f\"Could not read image: {image_path}\")\n\n    # Convert to grayscale\n    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)\n\n    # Apply inverted binary thresholding to binarize the image (improves OCR accuracy)\n    _, thresh_img = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV)\n\n    # Use Tesseract to extract text (numbers)\n    custom_oem_psm_config = r'--oem 3 --psm 6'  # OEM = 3 (default), PSM = 6 (Assume a single uniform block of text)\n    text = pytesseract.image_to_string(thresh_img, config=custom_oem_psm_config)\n\n    # For debugging: Print extracted text\n    print(\"Extracted Text: \\n\", text)\n\n    return text\n\n# Example usage\nimage_path = \"path_to_your_excel_like_report_image.jpg\"  # Replace with the path to your image\nextracted_text = extract_numbers_from_image(image_path)\n\n```\nExplanation:\n\nGrayscale Conversion: The image is converted to grayscale to simplify the image and improve OCR accuracy.\nThresholding: An inverted binary threshold with a cutoff of 150 is applied to make the numbers stand out from the 
background.\nTesseract OCR: This is used to extract the text (numbers) from the image.\n\n\n## Step 3: Parse Extracted Data\nOnce we extract the text, we need to clean it up and structure it into a tabular format (assuming the report follows a grid-like structure).\n\n```\nimport re\n\ndef parse_extracted_text_to_table(extracted_text):\n    # Split the OCR output into lines; each line becomes one table row\n    lines = extracted_text.split(\"\\n\")\n    data = []\n\n    for line in lines:\n        # Remove everything except digits, commas, dots, and whitespace\n        # (whitespace must be kept because it separates the cells)\n        cleaned_line = re.sub(r'[^0-9,\\.\\s]', '', line)\n        \n        # If the cleaned line contains any numbers, split by whitespace\n        if cleaned_line.strip():\n            row = cleaned_line.split()\n            data.append(row)\n\n    return data\n\n# Example usage\nparsed_data = parse_extracted_text_to_table(extracted_text)\nprint(\"Parsed Data: \\n\", parsed_data)\n\n```\nExplanation:\n\nRegex Cleaning: We use a regular expression to clean the extracted text, keeping only digits, dots, commas, and the whitespace that separates the values.\nData Splitting: The data is split into rows, and each row is split into individual cells by whitespace.\n\n\n## Step 4: Write Data to Numbers Spreadsheet\nNow that we have the data in a structured format (a list of rows and columns), we can write it to a Numbers spreadsheet. 
To do this, we’ll use AppleScript called from Python to interact with Numbers.\n\n```\nimport subprocess\n\ndef write_data_to_numbers(parsed_data):\n    # AppleScript to write data to a Numbers file\n    applescript = \"\"\"\n    tell application \"Numbers\"\n        activate\n        set doc to open \"Macintosh HD:Users:YourUsername:Documents:example.numbers\"  -- Adjust path as needed\n        tell table 1 of sheet 1 of doc\n    \"\"\"\n    \n    # Loop through the rows of parsed data and write them to Numbers\n    # (assumes the table already has enough rows and columns)\n    for i, row in enumerate(parsed_data):\n        for j, cell_value in enumerate(row):\n            applescript += f\"\"\"\n                set value of cell {j + 1} of row {i + 1} to \"{cell_value}\"\n            \"\"\"\n    \n    # Close the table block, then save and close the document\n    applescript += \"\"\"\n        end tell\n        save doc\n        close doc\n    end tell\n    \"\"\"\n\n    # Run the AppleScript using osascript\n    process = subprocess.Popen(['osascript', '-e', applescript], stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n    stdout, stderr = process.communicate()\n\n    # A non-zero exit code means osascript failed\n    if process.returncode != 0:\n        print(f\"Error: {stderr.decode()}\")\n    else:\n        print(\"Successfully wrote data to Numbers sheet.\")\n\n# Example usage\nwrite_data_to_numbers(parsed_data)\n\n```\n\nExplanation:\n\nAppleScript: The script opens an existing Numbers document and, for each cell in the parsed data, sets the value of the corresponding cell in the first table of the first sheet.\nSubprocess: The Python subprocess module is used to execute the AppleScript, which interacts with the Numbers app.\n\n## Putting Everything Together\nHere’s how to tie everything together in a single script:\n\n```\nimport cv2\nimport pytesseract\nimport numpy as np\nimport re\nimport subprocess\n\n# Set Tesseract path if the tesseract binary is not on your PATH\npytesseract.pytesseract.tesseract_cmd = '/usr/local/bin/tesseract'  # Adjust path as needed\n\ndef extract_numbers_from_image(image_path):\n    
img = cv2.imread(image_path)\n    if img is None:  # cv2.imread returns None when the file cannot be read\n        raise FileNotFoundError(f\"Could not read image: {image_path}\")\n    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)\n    _, thresh_img = cv2.threshold(gray, 150, 255, cv2.THRESH_BINARY_INV)\n    \n    custom_oem_psm_config = r'--oem 3 --psm 6'\n    text = pytesseract.image_to_string(thresh_img, config=custom_oem_psm_config)\n    \n    return text\n\ndef parse_extracted_text_to_table(extracted_text):\n    lines = extracted_text.split(\"\\n\")\n    data = []\n\n    for line in lines:\n        # Keep digits, commas, dots, and the whitespace that separates cells\n        cleaned_line = re.sub(r'[^0-9,\\.\\s]', '', line)\n        if cleaned_line.strip():\n            row = cleaned_line.split()\n            data.append(row)\n\n    return data\n\ndef write_data_to_numbers(parsed_data):\n    applescript = \"\"\"\n    tell application \"Numbers\"\n        activate\n        set doc to open \"Macintosh HD:Users:YourUsername:Documents:example.numbers\"\n        tell table 1 of sheet 1 of doc\n    \"\"\"\n    \n    # Assumes the table already has enough rows and columns\n    for i, row in enumerate(parsed_data):\n        for j, cell_value in enumerate(row):\n            applescript += f\"\"\"\n                set value of cell {j + 1} of row {i + 1} to \"{cell_value}\"\n            \"\"\"\n    \n    applescript += \"\"\"\n        end tell\n        save doc\n        close doc\n    end tell\n    \"\"\"\n    \n    process = subprocess.Popen(['osascript', '-e', applescript], stdout=subprocess.PIPE, stderr=subprocess.PIPE)\n    stdout, stderr = process.communicate()\n\n    # A non-zero exit code means osascript failed\n    if process.returncode != 0:\n        print(f\"Error: {stderr.decode()}\")\n    else:\n        print(\"Successfully wrote data to Numbers sheet.\")\n\n# Main execution\nimage_path = \"path_to_your_excel_like_report_image.jpg\"\nextracted_text = extract_numbers_from_image(image_path)\nparsed_data = parse_extracted_text_to_table(extracted_text)\nwrite_data_to_numbers(parsed_data)\n\n```\nExplanation:\n1. extract_numbers_from_image: Preprocesses the image and extracts text using OCR.\n2. parse_extracted_text_to_table: Cleans the extracted text and structures it into a table.\n3. 
write_data_to_numbers: Uses AppleScript to write the parsed data into a Numbers spreadsheet.\n\n## Conclusion\nThis solution uses OCR to extract data from an image, structures the data into a tabular format, and writes the data into an existing Numbers spreadsheet using AppleScript. Adjust the file paths and tweak the preprocessing steps (e.g., thresholding) for your specific case.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftoatoes%2Fpy_photo-excel_data_convert","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftoatoes%2Fpy_photo-excel_data_convert","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftoatoes%2Fpy_photo-excel_data_convert/lists"}