{"id":13406347,"url":"https://github.com/aeksco/aws-pdf-textract-pipeline","last_synced_at":"2025-10-06T21:52:25.918Z","repository":{"id":43803538,"uuid":"242643811","full_name":"aeksco/aws-pdf-textract-pipeline","owner":"aeksco","description":":mag: Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using AWS textract. Built with AWS CDK + TypeScript ","archived":false,"fork":false,"pushed_at":"2024-06-05T13:57:47.000Z","size":1738,"stargazers_count":166,"open_issues_count":5,"forks_count":18,"subscribers_count":2,"default_branch":"main","last_synced_at":"2025-07-10T23:12:57.320Z","etag":null,"topics":["aws","aws-cdk","aws-textract","cdk","cloudformation","data-pipeline","dynamodb","jest","lambda","pdf","puppeteer","s3","serverless","sns","textract","typescript","webscraping"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aeksco.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null},"funding":{"github":"aeksco"}},"created_at":"2020-02-24T04:08:57.000Z","updated_at":"2025-04-07T23:16:18.000Z","dependencies_parsed_at":"2024-01-08T19:33:39.331Z","dependency_job_id":"6245c002-89ad-41b7-9874-a1ce0dd69036","html_url":"https://github.com/aeksco/aws-pdf-textract-pipeline","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/aeksco/aws-pdf-textract-pipeline","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aeksco%2Faws-pdf-textract-pipeline","t
ags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aeksco%2Faws-pdf-textract-pipeline/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aeksco%2Faws-pdf-textract-pipeline/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aeksco%2Faws-pdf-textract-pipeline/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aeksco","download_url":"https://codeload.github.com/aeksco/aws-pdf-textract-pipeline/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aeksco%2Faws-pdf-textract-pipeline/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":278686638,"owners_count":26028325,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-06T02:00:05.630Z","response_time":65,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","aws-cdk","aws-textract","cdk","cloudformation","data-pipeline","dynamodb","jest","lambda","pdf","puppeteer","s3","serverless","sns","textract","typescript","webscraping"],"created_at":"2024-07-30T19:02:27.807Z","updated_at":"2025-10-06T21:52:25.901Z","avatar_url":"https://github.com/aeksco.png","language":"TypeScript","funding_links":["https://github.com/sponsors/aeksco"],"categories":["TypeScript","HarmonyOS","Construct Libraries"],"sub_categories":["Windows Manager","Workflows"],"readme":"# 
aws-pdf-textract-pipeline [![Mentioned in Awesome CDK](https://awesome.re/mentioned-badge.svg)](https://github.com/kolomied/awesome-cdk)\n\n:mag: Data pipeline for crawling PDFs from the Web and transforming their contents into structured data using [AWS Textract](https://aws.amazon.com/textract/). Built with AWS CDK + TypeScript.\n\nThis is an example data pipeline that illustrates one possible approach for large-scale serverless PDF processing - it should serve as a good foundation to modify for your own purposes.\n\n![Example Extension Popup](https://i.imgur.com/3F89JQK.png \"Example Extension Popup\")\n\n\u003c!-- https://cloudcraft.co/view/e135397e-a673-411e-9ee7-05a5618052b2?key=R-OLiwplnkA9dtQxtkVqOw\u0026interactive=true\u0026embed=true --\u003e\n\n**Getting Started**\n\nRun the following commands to install dependencies, build the CDK stack, and deploy the CDK Stack to AWS.\n\n```\nyarn install\nyarn build\ncdk bootstrap\ncdk deploy\n```\n\n### Overview\n\nThe following is an overview of each process performed by this CDK stack.\n\n1. **Scrape PDF download URLs from a website**\n\n   Scraping data from the [COGCC](https://cogcc.state.co.us/) website.\n\n2. **Store PDF download URL in DynamoDB**\n\n   ![Example Extension Popup](https://i.imgur.com/bmFJGDW.png \"Example Extension Popup\")\n\n3. **Download the PDF to S3**\n\n   A lambda fires off when a new PDF download URL has been created in DynamoDB.\n\n4. **Process the PDF with AWS Textract**\n\n   Another lambda fires off when a PDF has been downloaded to the S3 bucket.\n\n5. **Process the AWS Textract results**\n\n   When an SNS event is detected from AWS Textract, a lambda is fired off to process the result.\n\n6. 
**Save the processed Textract result to DynamoDB**\n\n   After the full result is pruned down to the desired data structure, we save the data in DynamoDB.\n   ![Example Extension Popup](https://i.imgur.com/HkTtLmi.png \"Example Extension Popup\")\n\n### Scripts\n\n- `yarn install` - installs dependencies\n- `yarn build` - builds the production-ready CDK Stack\n- `yarn test` - runs Jest\n- `cdk bootstrap` - bootstraps AWS CloudFormation for your CDK deploy\n- `cdk deploy` - deploys the CDK stack to AWS\n\n**Notes**\n\n- **Warning** - the `AnalyzeDocument` process from AWS Textract costs \\$50 per 1,000 PDF pages. Be careful when deploying this CDK stack as you could unintentionally rack up an expensive AWS bill quickly if you're not paying attention.\n\n- If a PDF download URL has already been added to the `pdfUrlsTable` DynamoDB table, the pipeline will not re-execute for the PDF.\n\n- Includes tests with Jest.\n\n- Recommended to use `Visual Studio Code` with the `Format on Save` setting turned on.\n\n**Built with**\n\n- [TypeScript](https://www.typescriptlang.org/)\n- [Jest](https://jestjs.io)\n- [Puppeteer](https://github.com/puppeteer/puppeteer)\n- [AWS CDK](https://aws.amazon.com/cdk/)\n- [AWS Lambda](https://aws.amazon.com/lambda/)\n- [AWS SNS](https://aws.amazon.com/sns/)\n- [AWS DynamoDB](https://aws.amazon.com/dynamodb/)\n- [AWS S3](https://aws.amazon.com/s3/)\n\n**Additional Resources**\n\n- [CDK API Reference](https://docs.aws.amazon.com/cdk/api/latest/docs/aws-construct-library.html)\n- [Puppeteer](https://github.com/puppeteer/puppeteer)\n- [Puppeteer Lambda](https://github.com/alixaxel/chrome-aws-lambda)\n- [CDK TypeScript Reference](https://docs.aws.amazon.com/cdk/api/latest/typescript/api/index.html)\n- [CDK Assertion Package](https://github.com/aws/aws-cdk/tree/master/packages/%40aws-cdk/assert)\n- [Textract Pricing Chart](https://aws.amazon.com/textract/pricing/)\n- [awesome-cdk repo](https://github.com/eladb/awesome-cdk)\n\n**License**\n\nOpen source under the MIT 
License.\n\nBuilt with :heart: by [aeksco](https://twitter.com/aeksco)\n\n\u003c!-- Reddit Threads --\u003e\n\u003c!-- https://www.reddit.com/r/aws/comments/fbwtr2/example_serverless_data_pipeline_for_crawling/ --\u003e\n\u003c!-- https://www.reddit.com/r/serverless/comments/fbwsak/serverless_data_pipeline_for_crawling_pdfs_from/ --\u003e\n\u003c!-- https://www.reddit.com/r/typescript/comments/fcy30x/example_serverless_data_pipeline_for_crawling/ --\u003e\n\u003c!-- https://www.reddit.com/r/webdev/comments/fd65r2/example_serverless_data_pipeline_for_crawling/ --\u003e\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faeksco%2Faws-pdf-textract-pipeline","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faeksco%2Faws-pdf-textract-pipeline","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faeksco%2Faws-pdf-textract-pipeline/lists"}