https://github.com/aeksco/aws-pdf-textract-pipeline

:mag: Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS textract. Built with AWS CDK + TypeScript
aws aws-cdk aws-textract cdk cloudformation data-pipeline dynamodb jest lambda pdf puppeteer s3 serverless sns textract typescript webscraping

Host: GitHub
URL: https://github.com/aeksco/aws-pdf-textract-pipeline
Owner: aeksco
License: mit
Created: 2020-02-24T04:08:57.000Z (over 5 years ago)
Default Branch: main
Last Pushed: 2024-06-05T13:57:47.000Z (about 1 year ago)
Last Synced: 2024-10-14T06:45:10.979Z (9 months ago)
Topics: aws, aws-cdk, aws-textract, cdk, cloudformation, data-pipeline, dynamodb, jest, lambda, pdf, puppeteer, s3, serverless, sns, textract, typescript, webscraping
Language: TypeScript
Homepage:
Size: 1.66 MB
Stars: 163
Watchers: 3
Forks: 18
Open Issues: 5
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE

Awesome Lists containing this project

awesome-cdk - aws-pdf-textract-pipeline - ETL pipeline for crawling PDFs from the Web using Puppeteer and transforming their contents into structured data using AWS Textract and storing the results in DynamoDB. (Construct Libraries / Workflows)

README

# aws-pdf-textract-pipeline [![Mentioned in Awesome CDK](https://awesome.re/mentioned-badge.svg)](https://github.com/kolomied/awesome-cdk)

:mag: Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using [AWS Textract](https://aws.amazon.com/textract/). Built with AWS CDK + TypeScript.

This is an example data pipeline that illustrates one possible approach for large-scale serverless PDF processing - it should serve as a good foundation to modify for your own purposes.

![Example Extension Popup](https://i.imgur.com/3F89JQK.png "Example Extension Popup")

**Getting Started**

Run the following commands to install dependencies, build the CDK stack, and deploy the CDK Stack to AWS.

```
yarn install
yarn build
cdk bootstrap
cdk deploy
```

### Overview

The following is an overview of each process performed by this CDK stack.

1. **Scrape PDF download URLs from a website**

   This example scrapes PDF download URLs from the [COGCC](https://cogcc.state.co.us/) website using Puppeteer.

2. **Store PDF download URL in DynamoDB**

   ![Example Extension Popup](https://i.imgur.com/bmFJGDW.png "Example Extension Popup")

3. **Download the PDF to S3**

   A lambda fires off when a new PDF download URL has been created in DynamoDB.

4. **Process the PDF with AWS Textract**

   Another lambda fires off when a PDF has been downloaded to the S3 bucket.

5. **Process the AWS Textract results**

   When an SNS event is detected from AWS Textract, a lambda is fired off to process the result.

6. **Save the processed Textract result to DynamoDB.**

   After the full result is pruned down to the desired data structure, we save the data in DynamoDB.

   ![Example Extension Popup](https://i.imgur.com/HkTtLmi.png "Example Extension Popup")
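
Two of the hand-offs above (steps 3 and 5-6) can be sketched as small pure functions. This is an illustrative sketch, not the repository's actual code: the local types mirror only a few fields of the DynamoDB Streams record and Textract block shapes, and the attribute name `url`, the function names, and the "lines per page" output structure are all assumptions made for the example.

```typescript
// Minimal local types covering only the fields used below; the real AWS
// event and response shapes carry many more fields.
interface StreamRecord {
  eventName?: "INSERT" | "MODIFY" | "REMOVE";
  dynamodb?: { NewImage?: Record<string, { S?: string }> };
}

interface TextractBlock {
  BlockType?: string;
  Text?: string;
  Page?: number;
}

// Step 3: keep only newly inserted items and read their URL attribute.
// The attribute name "url" is hypothetical.
function newPdfUrls(records: StreamRecord[]): string[] {
  const urls: string[] = [];
  for (const record of records) {
    if (record.eventName !== "INSERT") continue;
    const url = record.dynamodb?.NewImage?.url?.S;
    if (url) urls.push(url);
  }
  return urls;
}

// Steps 5-6: prune the full block list down to the text lines on each page.
// A missing Page is treated as page 1 (an assumption of this sketch).
function linesByPage(blocks: TextractBlock[]): Record<number, string[]> {
  const pages: Record<number, string[]> = {};
  for (const block of blocks) {
    if (block.BlockType !== "LINE" || block.Text === undefined) continue;
    const page = block.Page ?? 1;
    const lines = pages[page] ?? [];
    lines.push(block.Text);
    pages[page] = lines;
  }
  return pages;
}
```

Keeping this logic free of AWS SDK calls means it can be unit-tested with Jest without deploying the stack.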

### Scripts

- `yarn install` - installs dependencies

- `yarn build` - builds the production-ready CDK Stack

- `yarn test` - runs Jest

- `cdk bootstrap` - bootstraps AWS CloudFormation for your CDK deploy

- `cdk deploy` - deploys the CDK stack to AWS

**Notes**

- **Warning** - the `AnalyzeDocument` process from AWS Textract costs \$50 per 1,000 PDF pages. Be careful when deploying this CDK stack as you could unintentionally rack up an expensive AWS bill quickly if you're not paying attention.

- If a PDF download URL has already been added to the `pdfUrlsTable` DynamoDB table, the pipeline will not re-execute for the PDF.

- Includes tests with Jest.

- Recommended to use `Visual Studio Code` with the `Format on Save` setting turned on.
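
To make the cost warning concrete, here is a rough estimator. It is only a sketch: the rate is the \$50 per 1,000 pages figure quoted above, which may not match current pricing, and the function name and the sample workload are hypothetical.

```typescript
// Rate taken from the warning above; check the Textract pricing page for
// the current figure before relying on it.
const USD_PER_1000_PAGES = 50;

// Estimates the AnalyzeDocument bill for a batch of PDFs.
function estimateAnalyzeDocumentCostUsd(
  pdfCount: number,
  avgPagesPerPdf: number,
  usdPer1000Pages: number = USD_PER_1000_PAGES
): number {
  if (pdfCount < 0 || avgPagesPerPdf < 0) {
    throw new RangeError("pdfCount and avgPagesPerPdf must be non-negative");
  }
  const totalPages = pdfCount * avgPagesPerPdf;
  return (totalPages / 1000) * usdPer1000Pages;
}

// 2,000 PDFs averaging 12 pages each is 24,000 pages.
console.log(estimateAnalyzeDocumentCostUsd(2000, 12)); // 1200
```

Running a calculation like this against the expected number of scraped PDFs before `cdk deploy` gives a quick upper bound on the bill.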

**Built with**

- [TypeScript](https://www.typescriptlang.org/)

- [Jest](https://jestjs.io)

- [Puppeteer](https://github.com/puppeteer/puppeteer)

- [AWS CDK](https://aws.amazon.com/cdk/)

- [AWS Lambda](https://aws.amazon.com/lambda/)

- [AWS SNS](https://aws.amazon.com/sns/)

- [AWS DynamoDB](https://aws.amazon.com/dynamodb/)

- [AWS S3](https://aws.amazon.com/s3/)

**Additional Resources**

- [CDK API Reference](https://docs.aws.amazon.com/cdk/api/latest/docs/aws-construct-library.html)

- [Puppeteer](https://github.com/puppeteer/puppeteer)

- [Puppeteer Lambda](https://github.com/alixaxel/chrome-aws-lambda)

- [CDK TypeScript Reference](https://docs.aws.amazon.com/cdk/api/latest/typescript/api/index.html)

- [CDK Assertion Package](https://github.com/aws/aws-cdk/tree/master/packages/%40aws-cdk/assert)

- [Textract Pricing Chart](https://aws.amazon.com/textract/pricing/)

- [awesome-cdk repo](https://github.com/kolomied/awesome-cdk)

**License**

Open source under the MIT License.

Built with :heart: by [aeksco](https://twitter.com/aeksco)
