{"id":19262859,"url":"https://github.com/mycielski/textract_study","last_synced_at":"2026-05-06T00:06:15.233Z","repository":{"id":208533543,"uuid":"721713843","full_name":"mycielski/textract_study","owner":"mycielski","description":"Analysing expense reports/invoices with AWS Textract and boto3.","archived":false,"fork":false,"pushed_at":"2023-11-27T01:35:37.000Z","size":26,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-23T18:47:13.398Z","etag":null,"topics":["aws","aws-cli","boto3","document-understanding","expenses","invoices","script","shell","textract"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mycielski.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-11-21T16:13:30.000Z","updated_at":"2024-05-11T07:16:03.000Z","dependencies_parsed_at":"2023-11-27T02:29:44.594Z","dependency_job_id":"957bc771-40b8-4923-8f68-208d3a84d306","html_url":"https://github.com/mycielski/textract_study","commit_stats":null,"previous_names":["mycielski/textract_study"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/mycielski/textract_study","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mycielski%2Ftextract_study","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mycielski%2Ftextract_study/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mycielski%2Ftextract_study/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mycielski%2Ftextract_stud
y/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mycielski","download_url":"https://codeload.github.com/mycielski/textract_study/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mycielski%2Ftextract_study/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32672688,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-05T11:29:49.557Z","status":"ssl_error","status_checked_at":"2026-05-05T11:29:48.587Z","response_time":54,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","aws-cli","boto3","document-understanding","expenses","invoices","script","shell","textract"],"created_at":"2024-11-09T19:33:41.761Z","updated_at":"2026-05-06T00:06:15.219Z","avatar_url":"https://github.com/mycielski.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AWS Textract study\n\n[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)\n\nSome code I wrote while learning what Textract is and how to use it.\n\nThe project also contains a shell script which uses the AWS CLI v2 to perform the same task.\n\n## How to use it?\n\n1. 
Put your invoices in the `demo_data` directory.\n   Here's an example of the directory structure:\n    ```sh\n    .\n    ├── demo_data\n    │   ├── invoice.pdf\n    │   └── invoices\n    │       ├── other_invoice.jpg\n    │       └── and_one_more_invoice.png\n    ├── readme.md\n    └── src\n        └── main.py\n    ```\n\n2. Provide your AWS credentials as environment variables:\n    ```sh\n    $ export AWS_ACCESS_KEY_ID=your_access_key_id          # for example \"AKIAIOSFODNN7EXAMPLE\"\n    $ export AWS_SECRET_ACCESS_KEY=your_secret_access_key  # for example \"wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY\"\n    $ export AWS_REGION=region                             # for example \"us-east-1\"\n    $ export AWS_BUCKET=bucket_name                        # for example \"my-textract-study-bucket\"\n    ```\n\n3. Run the script:\n    ```sh\n    $ python src/main.py\n    ```\n\n4. The reports will be written to `output/\u003cuuid\u003e/` as `report.csv`, `report.json`, and `report.xlsx`:\n    ```sh\n    .\n    ├── demo_data\n    │   ├── invoice.pdf\n    │   └── invoices\n    │       ├── other_invoice.jpg\n    │       └── and_one_more_invoice.png\n    ├── output\n    │   └── 456af71d-f7b2-4bf8-87c7-bade21d843d4\n    │       ├── report.csv\n    │       ├── report.json\n    │       └── report.xlsx\n    ├── readme.md\n    └── src\n        └── main.py\n    ```\n\n## Notes\n\n- This script uses busy waiting for Textract job results (in the `retrieve_analyses` function). It is not optimal. In fact, it is pretty terrible for performance. Use [notifications](https://docs.aws.amazon.com/textract/latest/dg/api-async.html#:~:text=The%20completion%20status%20of%20the%20request%20is%20published%20to%20an%20Amazon%20Simple%20Notification%20Service%20(Amazon%20SNS)%20topic.) instead.\n- The whole thing is just one file. 
Terrible for legibility but eh, it works.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmycielski%2Ftextract_study","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmycielski%2Ftextract_study","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmycielski%2Ftextract_study/lists"}