https://github.com/paulrobello/pdf_ingest

AWS solution to perform AI vision OCR on PDFs uploaded to S3

Host: GitHub
URL: https://github.com/paulrobello/pdf_ingest
Owner: paulrobello
Created: 2024-10-20T02:58:37.000Z (3 months ago)
Default Branch: main
Last Pushed: 2024-11-23T00:17:22.000Z (about 2 months ago)
Last Synced: 2024-11-23T01:20:07.935Z (about 2 months ago)
Language: Python
Size: 402 KB
Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

README

# AI Vision Contract OCR

## Overview
![Arch Diagram](./arch-diagram.png)

This project can use Amazon Bedrock, Anthropic, or OpenAI vision models to perform OCR on PDF documents uploaded to an S3 bucket.
Results are stored as markdown files in an S3 bucket for further processing and storage in a database.

## Pricing and Metrics for AI portion of the project using Amazon Bedrock
* Claude 3.5 Sonnet pricing: input $0.003 per 1K tokens, output $0.015 per 1K tokens
* It takes roughly 1.5 minutes to OCR a 12-page document. Currently, the project is set up to process 1 document at a time.
* Vision OCR averages per page: 1,633 input tokens, 1,095 output tokens, 33,278 ms latency, cost $0.021324 (x12 pages = $0.255888)
* Extracting terms and conditions from the resulting text averages: 12,555 input tokens, 520 output tokens, 16,808 ms latency, cost $0.045465
* Total cost per document: $0.301353
* Total cost per month, assuming 4k 12-page documents per month: $1205.41
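
The figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic, not code from the project: the function names are made up for illustration, and the extraction call is assumed to stay at its measured 12-page average regardless of document length.

```python
# Prices and token averages are taken from the list above.
INPUT_PRICE_PER_1K = 0.003   # USD per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.015  # USD per 1K output tokens


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one model call at the prices above."""
    return (input_tokens * INPUT_PRICE_PER_1K + output_tokens * OUTPUT_PRICE_PER_1K) / 1000


def document_cost(pages: int) -> float:
    """OCR cost for every page plus one terms-and-conditions extraction call."""
    ocr_per_page = call_cost(1633, 1095)   # average vision OCR call
    extraction = call_cost(12555, 520)     # average extraction call (assumed fixed)
    return pages * ocr_per_page + extraction


if __name__ == "__main__":
    per_doc = document_cost(pages=12)
    print(f"per page:     ${call_cost(1633, 1095):.6f}")
    print(f"per document: ${per_doc:.6f}")
    print(f"per month:    ${per_doc * 4000:.2f}")
```

Changing `pages` or the monthly document count gives a rough estimate for other workloads, with the caveat that token counts vary with page content.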

## Pricing for infrastructure
* RDS is a shared resource and should not increase the cost of the project
* S3 storage and data transfer: less than $5/mo
* Lambda, assuming 4k 12-page documents per month: $10/mo
* VPC Bedrock endpoints: $43/mo
* Total cost per month, assuming 4k 12-page documents per month: $68/mo

## Total cost per month, assuming 4k 12-page documents per month: $68 + $1205.41 = $1273.41

## Prerequisites for Bedrock
* Bedrock Anthropic models must be enabled in the account
* The following VPC endpoints are required when running in a private subnet, which is the default for this project:
  * bedrock - Amazon Bedrock Control Plane API actions
  * bedrock-runtime - Amazon Bedrock Runtime API actions
* Not all Bedrock models are available in all regions; us-east-1 is recommended
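
To check the endpoint prerequisite, something like the following could be used. It is a minimal sketch, not part of the project: the helper names are hypothetical, and it assumes the standard `com.amazonaws.<region>.<service>` naming for interface endpoints. The last function needs boto3 and AWS credentials and reads only the first page of results.

```python
REQUIRED_SERVICES = ("bedrock", "bedrock-runtime")


def required_endpoint_names(region: str) -> list[str]:
    """Build the interface-endpoint service names for the given region."""
    return [f"com.amazonaws.{region}.{service}" for service in REQUIRED_SERVICES]


def missing_endpoints(region: str, existing_service_names: list[str]) -> list[str]:
    """Return the required service names that are absent from the VPC."""
    existing = set(existing_service_names)
    return [name for name in required_endpoint_names(region) if name not in existing]


def list_vpc_endpoint_services(vpc_id: str, region: str) -> list[str]:
    """Fetch the service names of endpoints in a VPC (needs boto3 and credentials)."""
    import boto3  # imported lazily so the helpers above work without it

    ec2 = boto3.client("ec2", region_name=region)
    response = ec2.describe_vpc_endpoints(
        Filters=[{"Name": "vpc-id", "Values": [vpc_id]}]
    )
    return [endpoint["ServiceName"] for endpoint in response["VpcEndpoints"]]
```

For example, `missing_endpoints("us-east-1", list_vpc_endpoint_services(vpc_id, "us-east-1"))` returns an empty list when both endpoints exist, where `vpc_id` is the ID of your VPC.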

## Prerequisites for LocalStack deployment

- Generic Development Container (GDC) repo cloned and configured with a LocalStack Pro key. [GDC](https://github.com/devxpod/GDC)

## Prerequisites for OpenAI and other providers (Required if deploying locally)
* Set the environment variable for your chosen provider to its API key, for example OPENAI_API_KEY for OpenAI.
* For local deployment or testing, you can create a .env file in the repo root with the needed API keys.
* Example .env file:
```bash
OPENAI_API_KEY=your_api_key
ANTHROPIC_API_KEY=your_api_key
# Tracing (optional)
LANGCHAIN_TRACING_V2=false
LANGCHAIN_ENDPOINT=https://api.smith.langchain.com
LANGCHAIN_API_KEY=your_api_key
LANGCHAIN_PROJECT=pdf_ingestion
```
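
If you want to read such a file from your own Python script, a small loader is enough. The sketch below is an illustration only and is not taken from the project, which may load the file differently; `load_env_file` and `apply_env` are hypothetical names. It handles plain `KEY=VALUE` lines and `#` comments, nothing more.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict[str, str]:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    env_path = Path(path)
    if not env_path.exists():
        return values
    for raw_line in env_path.read_text().splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values such as URLs stay intact.
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def apply_env(values: dict[str, str]) -> None:
    """Copy parsed values into os.environ without overriding existing variables."""
    for key, value in values.items():
        os.environ.setdefault(key, value)
```

Calling `apply_env(load_env_file())` from the repo root makes the keys available through `os.environ` while leaving any variables already set in the shell untouched.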

## Deploying the project
* From the repo root run:
```bash
run-dev-container.sh
```
* In a separate terminal run the following to open a shell into the GDC:
```bash
docker exec -it pdf_igst-dev-1 bash -l
```
* All future commands will be run in the GDC shell.
* Run the following commands to deploy the project to LocalStack:
```bash
export AWS_PROFILE=localstack
make local-aws-init
make local-deploy
```

## Testing
* Run the following commands to test the project locally:
```bash
make local-upload-pdf
make local-list-outbox
```
The S3 listing should show a folder named after the request ID.
If you then list that folder, you should see the pages of the PDF in image and markdown format.
There will also be a file with the suffix "-final.md"; this is the final output for the PDF.
Example command to copy the final output to the local GDC (replace 2d85d46b with your request ID):
```bash
aws s3 cp s3://pdf-ingestion-lcl-us-east-1/outbox/2d85d46b/pdf-text-normal-final.md .
```
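
When scripting this check, the final file can be picked out of a key listing by its prefix and suffix. The sketch below assumes the `outbox/<request id>/` layout shown in the command above; the function name and the per-page file names in the usage example are hypothetical.

```python
def final_output_keys(keys: list[str], request_id: str, prefix: str = "outbox") -> list[str]:
    """Return the keys under the request's folder that end in "-final.md"."""
    folder = f"{prefix}/{request_id}/"
    return [key for key in keys if key.startswith(folder) and key.endswith("-final.md")]


if __name__ == "__main__":
    # Hypothetical listing; only the "-final.md" key comes from the example above.
    listing = [
        "outbox/2d85d46b/page-1.png",
        "outbox/2d85d46b/page-1.md",
        "outbox/2d85d46b/pdf-text-normal-final.md",
    ]
    print(final_output_keys(listing, "2d85d46b"))
```

The key list could come from `aws s3 ls --recursive` output or any S3 client; how it is obtained does not matter to the filter.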

## Configuration
You can select the AI provider by setting AI_PROVIDER in the envs.xxx.makefile for the target environment.
Available providers are Bedrock, Anthropic, and OpenAI.
You can also select the model by setting AI_MODEL in the same envs.xxx.makefile. If you do not select one, a default vision model is used.