https://github.com/aeksco/aws-pdf-textract-pipeline
:mag: Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS textract. Built with AWS CDK + TypeScript
aws aws-cdk aws-textract cdk cloudformation data-pipeline dynamodb jest lambda pdf puppeteer s3 serverless sns textract typescript webscraping
- Host: GitHub
- URL: https://github.com/aeksco/aws-pdf-textract-pipeline
- Owner: aeksco
- License: mit
- Created: 2020-02-24T04:08:57.000Z (over 4 years ago)
- Default Branch: main
- Last Pushed: 2024-06-05T13:57:47.000Z (5 months ago)
- Last Synced: 2024-10-14T06:45:10.979Z (25 days ago)
- Topics: aws, aws-cdk, aws-textract, cdk, cloudformation, data-pipeline, dynamodb, jest, lambda, pdf, puppeteer, s3, serverless, sns, textract, typescript, webscraping
- Language: TypeScript
- Homepage:
- Size: 1.66 MB
- Stars: 163
- Watchers: 3
- Forks: 18
- Open Issues: 5
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
- awesome-cdk - aws-pdf-textract-pipeline - ETL pipeline for crawling PDFs from the Web using Puppeteer and transforming their contents into structured data using AWS Textract and storing the results in DynamoDB. (Construct Libraries / Workflows)
README
# aws-pdf-textract-pipeline [![Mentioned in Awesome CDK](https://awesome.re/mentioned-badge.svg)](https://github.com/kolomied/awesome-cdk)
:mag: Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using [AWS Textract](https://aws.amazon.com/textract/). Built with AWS CDK + TypeScript.
This is an example data pipeline that illustrates one possible approach for large-scale serverless PDF processing - it should serve as a good foundation to modify for your own purposes.
![Example Extension Popup](https://i.imgur.com/3F89JQK.png "Example Extension Popup")
**Getting Started**
Run the following commands to install dependencies, build the CDK stack, and deploy it to AWS.
```
yarn install
yarn build
cdk bootstrap
cdk deploy
```

### Overview
The following is an overview of each process performed by this CDK stack.
1. **Scrape PDF download URLs from a website**
This example scrapes data from the [COGCC](https://cogcc.state.co.us/) website.
2. **Store PDF download URL in DynamoDB**
![Example Extension Popup](https://i.imgur.com/bmFJGDW.png "Example Extension Popup")
3. **Download the PDF to S3**
A lambda fires off when a new PDF download URL has been created in DynamoDB.
4. **Process the PDF with AWS Textract**
Another lambda fires off when a PDF has been downloaded to the S3 bucket.
5. **Process the AWS Textract results**
When an SNS event is detected from AWS Textract, a lambda is fired off to process the result.
6. **Save the processed Textract result to DynamoDB.**
After the full result is pruned down to the desired data structure, we save the data in DynamoDB.
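As a rough illustration of the pruning in steps 5 and 6, the sketch below reduces a list of Textract blocks to plain key/value pairs. It is not the repository's actual code. It assumes the result was produced with the `FORMS` feature type, so keys and values arrive as `KEY_VALUE_SET` blocks linked by `VALUE` and `CHILD` relationships. The `Block` interface is a trimmed-down subset of the Textract block shape, and `pruneFormFields` is a hypothetical helper name.

```typescript
// Minimal subset of the Textract block shape needed for this sketch.
interface Block {
  Id: string;
  BlockType: string; // e.g. "KEY_VALUE_SET" or "WORD"
  EntityTypes?: string[]; // ["KEY"] or ["VALUE"] on KEY_VALUE_SET blocks
  Text?: string;
  Relationships?: { Type: string; Ids: string[] }[];
}

// Collect the ids referenced by one relationship type on a block.
function relatedIds(block: Block, type: string): string[] {
  const ids: string[] = [];
  for (const rel of block.Relationships ?? []) {
    if (rel.Type === type) ids.push(...rel.Ids);
  }
  return ids;
}

// Join the text of a block's WORD children into a single string.
function childText(block: Block, byId: Map<string, Block>): string {
  const words: string[] = [];
  for (const id of relatedIds(block, "CHILD")) {
    const child = byId.get(id);
    if (child && child.BlockType === "WORD" && child.Text) words.push(child.Text);
  }
  return words.join(" ");
}

// Hypothetical helper: prune a full block list down to { key: value } pairs.
function pruneFormFields(blocks: Block[]): Record<string, string> {
  const byId = new Map<string, Block>();
  for (const block of blocks) byId.set(block.Id, block);

  const fields: Record<string, string> = {};
  for (const block of blocks) {
    const isKey =
      block.BlockType === "KEY_VALUE_SET" &&
      (block.EntityTypes ?? []).some((t) => t === "KEY");
    if (!isKey) continue;

    const key = childText(block, byId);
    const valueParts: string[] = [];
    for (const id of relatedIds(block, "VALUE")) {
      const valueBlock = byId.get(id);
      if (valueBlock) valueParts.push(childText(valueBlock, byId));
    }
    // Skip keys with no readable text; keep empty values as "".
    if (key) fields[key] = valueParts.join(" ");
  }
  return fields;
}
```

The returned object is small enough to store as a single DynamoDB item, which is the general idea behind pruning before saving. Table cells and selection elements (checkboxes) are ignored here and would need their own handling.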
![Example Extension Popup](https://i.imgur.com/HkTtLmi.png "Example Extension Popup")

### Scripts
- `yarn install` - installs dependencies
- `yarn build` - builds the production-ready CDK Stack
- `yarn test` - runs Jest
- `cdk bootstrap` - bootstraps AWS CloudFormation for your CDK deploy
- `cdk deploy` - deploys the CDK stack to AWS

**Notes**
- **Warning** - the `AnalyzeDocument` process from AWS Textract costs \$50 per 1,000 PDF pages. Be careful when deploying this CDK stack as you could unintentionally rack up an expensive AWS bill quickly if you're not paying attention.
- If a PDF download URL has already been added to the `pdfUrlsTable` DynamoDB table, the pipeline will not re-execute for the PDF.
- Includes tests with Jest.
- Recommended to use `Visual Studio Code` with the `Format on Save` setting turned on.
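
To make the cost warning and the de-duplication note above more concrete, here is a minimal sketch of both ideas. It is an illustration under stated assumptions, not the stack's actual implementation. The price constant is copied from the warning above and should be checked against the Textract pricing page. The `url` key attribute and both function names are hypothetical; only the `pdfUrlsTable` name comes from this README.

```typescript
// Price taken from the warning above: $50 per 1,000 pages for AnalyzeDocument.
// Verify against the current Textract pricing page before relying on it.
const USD_PER_1000_PAGES = 50;

// Estimated AnalyzeDocument cost in USD for a given number of PDF pages.
function estimateAnalyzeCostUsd(pageCount: number): number {
  if (!Number.isFinite(pageCount) || pageCount < 0) {
    throw new RangeError("pageCount must be a non-negative number");
  }
  return (pageCount / 1000) * USD_PER_1000_PAGES;
}

// Assumed key attribute name; the real schema of pdfUrlsTable may differ.
const URL_ATTRIBUTE = "url";

// Build DocumentClient-style put parameters that only succeed when no item
// with the same key exists yet, so a repeated URL does not create a new item.
function buildPutIfAbsentParams(tableName: string, pdfUrl: string) {
  return {
    TableName: tableName,
    Item: { [URL_ATTRIBUTE]: pdfUrl },
    // "#u" is a placeholder because URL is a DynamoDB reserved word.
    ConditionExpression: "attribute_not_exists(#u)",
    ExpressionAttributeNames: { "#u": URL_ATTRIBUTE },
  };
}

// Example: 20,000 pages at the rate above.
console.log(estimateAnalyzeCostUsd(20000)); // 1000
```

When a put with this condition hits an existing key, DynamoDB rejects it with a `ConditionalCheckFailedException`, which a caller can treat as "already queued" rather than as a failure. A conditional write is only one way to get the behavior described in the note.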
**Built with**
- [TypeScript](https://www.typescriptlang.org/)
- [Jest](https://jestjs.io)
- [Puppeteer](https://github.com/puppeteer/puppeteer)
- [AWS CDK](https://aws.amazon.com/cdk/)
- [AWS Lambda](https://aws.amazon.com/lambda/)
- [AWS SNS](https://aws.amazon.com/sns/)
- [AWS DynamoDB](https://aws.amazon.com/dynamodb/)
- [AWS S3](https://aws.amazon.com/s3/)

**Additional Resources**
- [CDK API Reference](https://docs.aws.amazon.com/cdk/api/latest/docs/aws-construct-library.html)
- [Puppeteer](https://github.com/puppeteer/puppeteer)
- [Puppeteer Lambda](https://github.com/alixaxel/chrome-aws-lambda)
- [CDK TypeScript Reference](https://docs.aws.amazon.com/cdk/api/latest/typescript/api/index.html)
- [CDK Assertion Package](https://github.com/aws/aws-cdk/tree/master/packages/%40aws-cdk/assert)
- [Textract Pricing Chart](https://aws.amazon.com/textract/pricing/)
- [awesome-cdk repo](https://github.com/eladb/awesome-cdk)

**License**
Open source under the MIT License.
Built with :heart: by [aeksco](https://twitter.com/aeksco)