https://github.com/mycielski/textract_study
Analysing expense reports/invoices with AWS Textract and boto3.
aws aws-cli boto3 document-understanding expenses invoices script shell textract
- Host: GitHub
- URL: https://github.com/mycielski/textract_study
- Owner: mycielski
- Created: 2023-11-21T16:13:30.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2023-11-27T01:35:37.000Z (over 2 years ago)
- Last Synced: 2025-02-23T18:47:13.398Z (over 1 year ago)
- Topics: aws, aws-cli, boto3, document-understanding, expenses, invoices, script, shell, textract
- Language: Python
- Size: 25.4 KB
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: readme.md
README
# AWS Textract study
Code style: [black](https://github.com/psf/black)
Some code I wrote while learning what Textract is and how to use it.
The project also contains a shell script which uses the AWS CLI v2 to perform the same task.
## How to use it?
1. Put your invoices in the `demo_data` directory.
Here's an example of the directory structure:
```sh
.
├── demo_data
│   ├── invoice.pdf
│   └── invoices
│       ├── other_invoice.jpg
│       └── and_one_more_invoice.png
├── readme.md
└── src
    └── main.py
```
2. Provide your AWS credentials as environment variables:
```sh
$ export AWS_ACCESS_KEY_ID=your_access_key_id # for example "AKIAIOSFODNN7EXAMPLE"
$ export AWS_SECRET_ACCESS_KEY=your_secret_access_key # for example "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY"
$ export AWS_REGION=region # for example "us-east-1"
$ export AWS_BUCKET=bucket_name # for example "my-textract-study-bucket"
```
3. Run the script:
```sh
$ python src/main.py
```
4. The report is written to a uniquely named subdirectory of `output`, as `report.csv`, `report.json`, and `report.xlsx`:
```sh
.
├── demo_data
│   ├── invoice.pdf
│   └── invoices
│       ├── other_invoice.jpg
│       └── and_one_more_invoice.png
├── output
│   └── 456af71d-f7b2-4bf8-87c7-bade21d843d4
│       ├── report.csv
│       ├── report.json
│       └── report.xlsx
├── readme.md
└── src
    └── main.py
```
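The steps above rely on two inputs: the files under `demo_data` and the four environment variables. Here is a minimal sketch of how a script could gather them. The helper names `find_invoices` and `load_settings` and the list of accepted extensions are assumptions made for illustration, not necessarily what `src/main.py` does.

```python
import os
from pathlib import Path

# Assumption: extensions taken from the example tree above, plus ".jpeg".
INVOICE_SUFFIXES = {".pdf", ".jpg", ".jpeg", ".png"}
REQUIRED_VARS = (
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_REGION",
    "AWS_BUCKET",
)


def find_invoices(root):
    """Return every invoice-like file under root, including subdirectories."""
    return sorted(
        path
        for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in INVOICE_SUFFIXES
    )


def load_settings(env=None):
    """Read the variables from step 2, failing early if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}
```

With the example tree, `find_invoices("demo_data")` would pick up `invoice.pdf` and both files under `invoices`, while `load_settings()` would raise before any AWS call if a variable was not exported.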
## Notes
- This script uses busy waiting for Textract job results (in the `retrieve_analyses` function). It is not optimal. In fact, it is pretty terrible for performance. Use [notifications](https://docs.aws.amazon.com/textract/latest/dg/api-async.html#:~:text=The%20completion%20status%20of%20the%20request%20is%20published%20to%20an%20Amazon%20Simple%20Notification%20Service%20(Amazon%20SNS)%20topic.) instead.
- The whole thing is just one file. Terrible for legibility but eh, it works.
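
As a rough sketch of the notification-based alternative to busy waiting: boto3's `start_expense_analysis` accepts a `NotificationChannel`, and Textract then publishes a JSON completion message to that SNS topic. The helper names, the ARNs passed in, and the assumption that the script uses the expense-analysis API are illustrative and not taken from `src/main.py`. The `textract` argument is expected to be a `boto3.client("textract")`.

```python
import json


def start_job_with_notification(textract, bucket, key, topic_arn, role_arn):
    """Start an asynchronous expense analysis that reports completion via SNS."""
    response = textract.start_expense_analysis(
        DocumentLocation={"S3Object": {"Bucket": bucket, "Name": key}},
        # role_arn must allow Textract to publish to the topic.
        NotificationChannel={"SNSTopicArn": topic_arn, "RoleArn": role_arn},
    )
    return response["JobId"]


def parse_completion_message(sns_message_body):
    """Return (job_id, succeeded) from the JSON body Textract publishes."""
    message = json.loads(sns_message_body)
    return message["JobId"], message["Status"] == "SUCCEEDED"
```

A consumer of the topic (for example an SQS queue or a Lambda function) would pass the inner message string to `parse_completion_message` and call `get_expense_analysis(JobId=...)` only once the job has succeeded, rather than polling in a loop.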