{"id":23120532,"url":"https://github.com/tupizz/data-processing-pipeline-aws","last_synced_at":"2026-05-13T07:02:08.445Z","repository":{"id":267616333,"uuid":"898429567","full_name":"tupizz/data-processing-pipeline-aws","owner":"tupizz","description":"This project is a serverless application built with the Serverless Framework, TypeScript, and AWS services. It provides an enrichment service that processes contact information and enriches it with additional data.","archived":false,"fork":false,"pushed_at":"2024-12-11T16:40:12.000Z","size":6624,"stargazers_count":1,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-04T03:15:38.562Z","etag":null,"topics":["aws","data","pipeline","serverless","typescript"],"latest_commit_sha":null,"homepage":"","language":"TypeScript","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/tupizz.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-04T11:31:51.000Z","updated_at":"2024-12-11T16:40:16.000Z","dependencies_parsed_at":"2024-12-11T12:30:43.004Z","dependency_job_id":null,"html_url":"https://github.com/tupizz/data-processing-pipeline-aws","commit_stats":null,"previous_names":["tupizz/data-processing-pipeline-aws"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/tupizz/data-processing-pipeline-aws","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tupizz%2Fdata-processing-pipeline-aws","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tupizz%2Fdata-processing-pipeline-aws
/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tupizz%2Fdata-processing-pipeline-aws/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tupizz%2Fdata-processing-pipeline-aws/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/tupizz","download_url":"https://codeload.github.com/tupizz/data-processing-pipeline-aws/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/tupizz%2Fdata-processing-pipeline-aws/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32971672,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-13T06:31:55.726Z","status":"ssl_error","status_checked_at":"2026-05-13T06:31:51.336Z","response_time":115,"last_error":"SSL_read: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","data","pipeline","serverless","typescript"],"created_at":"2024-12-17T06:11:30.535Z","updated_at":"2026-05-13T07:02:08.431Z","avatar_url":"https://github.com/tupizz.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Project Documentation\n\n## Overview\n\nThis project is a serverless application built with the Serverless Framework, TypeScript, and AWS services. 
It provides an enrichment service that processes contact information and enriches it with additional data.\n\n## Showcase\n\nClick on the image below to watch the showcase video or here [Link](https://cln.sh/bhWByGSF)\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://cln.sh/bhWByGSF\"\u003e\n    \u003cimg src=\"./docs/showcase.gif\" alt=\"Watch the video\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n## Deploying \u0026 Running\n\n```bash\nnpx sls deploy --aws-profile personal\nnpx ts-node src/scripts/createJson.ts 50 # create 50k contacts\nnpx ts-node src/scripts/createEnrichement.ts contacts_50k.json # create enrichment request\nnpx ts-node src/scripts/createEnrichement.ts contacts_100k.json true # create enrichment request and push to S3, for large datasets\n```\n\n#### Example\n\ncommand:\n\n```bash\nnpx ts-node src/scripts/createEnrichement.ts contacts_100k.json true\n```\n\noutput:\n\n```json\n{\n  \"message\": \"Request accepted\",\n  \"requestId\": \"4b92560e-5c60-4d90-9bdd-195f39f8a91d\",\n  \"downloadUrl\": \"https://storage-primer.s3.amazonaws.com/4b92560e-5c60-4d90-9bdd-195f39f8a91d/output.json\"\n}\n```\n\ncommand:\n\n```bash\n./bin/getEnrichment.sh 4b92560e-5c60-4d90-9bdd-195f39f8a91d\n```\n\noutput:\n\n```json\n{\n  \"requestId\": \"4b92560e-5c60-4d90-9bdd-195f39f8a91d\",\n  \"status\": \"processing\",\n  \"createdAt\": \"2024-12-11T16:36:13.453Z\",\n  \"totalBatches\": 1001,\n  \"processedBatches\": 162,\n  \"outputFileKey\": \"https://storage-primer.s3.amazonaws.com/4b92560e-5c60-4d90-9bdd-195f39f8a91d/output.json\"\n}\n```\n\n## Datadog\n\n- Dashboard: [Link](https://p.datadoghq.com/sb/836b9d5c-b1bf-11ef-a55b-0ee733f937a2-ff8b75cf46559dca2d25a0e8de156a49?refresh_mode=sliding\u0026from_ts=1733306292479\u0026to_ts=1733320692479\u0026live=true)\n- Logs: [Link](https://app.datadoghq.com/logs?saved-view-id=3174992)\n- APM: 
[Link](https://app.datadoghq.com/apm/entity/service%3Aprimer-integration-pipeline?dependencyMap=qson%3A%28data%3A%28telemetrySelection%3Aall_sources%29%2Cversion%3A%210%29\u0026deployments=qson%3A%28data%3A%28hits%3A%28selected%3Aversion_count%29%2Cerrors%3A%28selected%3Aversion_count%29%2Clatency%3A%28selected%3Ap95%29%2CtopN%3A%215%29%2Cversion%3A%210%29\u0026env=dev\u0026errors=qson%3A%28data%3A%28issueSort%3AFIRST_SEEN%29%2Cversion%3A%210%29\u0026fromUser=false\u0026groupMapByOperation=null\u0026infrastructure=qson%3A%28data%3A%28viewType%3Apods%29%2Cversion%3A%210%29\u0026isInferred=false\u0026logs=qson%3A%28data%3A%28indexes%3A%5B%5D%29%2Cversion%3A%210%29\u0026operationName=aws.lambda\u0026panels=qson%3A%28data%3A%28%29%2Cversion%3A%210%29\u0026resources=qson%3A%28data%3A%28visible%3A%21t%2Chits%3A%28selected%3Atotal%29%2Cerrors%3A%28selected%3Atotal%29%2Clatency%3A%28selected%3Ap95%29%2CtopN%3A%215%29%2Cversion%3A%211%29\u0026summary=qson%3A%28data%3A%28visible%3A%21t%2Cchanges%3A%28%29%2Cerrors%3A%28selected%3Acount%29%2Chits%3A%28selected%3Acount%29%2Clatency%3A%28selected%3Alatency%2Cslot%3A%28agg%3A95%29%2Cdistribution%3A%28isLogScale%3A%21f%29%2CshowTraceOutliers%3A%21t%29%2Csublayer%3A%28slot%3A%28layers%3Aservice%29%2Cselected%3Apercentage%29%2ClagMetrics%3A%28selectedMetric%3A%21s%2CselectedGroupBy%3A%21s%29%29%2Cversion%3A%211%29\u0026traces=qson%3A%28data%3A%28%29%2Cversion%3A%210%29\u0026start=1733314516685\u0026end=1733318116685\u0026paused=false#resources)\n\n## Architecture\n\n### E2E Architecture\n\n- [Link](https://link.excalidraw.com/readonly/L0PvBWorR4GE36O1TEoF)\n\n\u003cp align=\"center\"\u003e\n  \u003ca href=\"https://link.excalidraw.com/readonly/L0PvBWorR4GE36O1TEoF\"\u003e\n    \u003cimg src=\"./docs/e2e_diagram.png\" alt=\"E2E Architecture\"\u003e\n  \u003c/a\u003e\n\u003c/p\u003e\n\n### Lambda Architecture\n\nThe application is designed using a microservices architecture, leveraging AWS Lambda, S3, SQS, and DynamoDB. 
Below is a high-level architecture diagram:\n\n```mermaid\ngraph TD;\nA[API Gateway] --\u003e B[Lambda Function: Create Enrichment]\nB --\u003e C[S3 Bucket]\nB --\u003e D[SQS Queue]\nD --\u003e E[Lambda Function: Process Enrichment]\nE --\u003e F[DynamoDB: Status Table]\nF --\u003e G[Lambda Function: Get Enrichment Status]\nG --\u003e A\n```\n\n### 1. Create Enrichment\n\n```mermaid\nsequenceDiagram\n    participant Client\n    participant API Gateway\n    participant CreateHandler\n    participant S3\n    participant SQS\n    participant DynamoDB\n\n    Client-\u003e\u003eAPI Gateway: POST /enrichment\n    API Gateway-\u003e\u003eCreateHandler: Trigger\n    CreateHandler-\u003e\u003eCreateHandler: Validate Input\n    CreateHandler-\u003e\u003eS3: Upload input.json\n    CreateHandler-\u003e\u003eS3: Create empty output.json\n    CreateHandler-\u003e\u003eDynamoDB: Save request status\n    CreateHandler-\u003e\u003eSQS: Send batched messages\n    CreateHandler-\u003e\u003eAPI Gateway: Return requestId\n    API Gateway-\u003e\u003eClient: 200 OK + requestId\n```\n\n### 2. Process Enrichment\n\n```mermaid\nsequenceDiagram\n    participant SQS\n    participant Process Lambda\n    participant Mock API\n    participant S3\n    participant DynamoDB\n\n    SQS-\u003e\u003eProcess Lambda: Trigger (batch=1)\n    Process Lambda-\u003e\u003eMock API: Enrich contacts\n    Mock API--\u003e\u003eProcess Lambda: Return enriched data\n    Process Lambda-\u003e\u003eS3: Get current output.json\n    Process Lambda-\u003e\u003eS3: Update output.json\n    Process Lambda-\u003e\u003eDynamoDB: Increment processed count\n    Process Lambda-\u003e\u003eDynamoDB: Update status if complete\n```\n\n### 3. 
Get Enrichment Status\n\n```mermaid\nsequenceDiagram\n    participant Client\n    participant API Gateway\n    participant Get Lambda\n    participant DynamoDB\n\n    Client-\u003e\u003eAPI Gateway: GET /enrichment/{id}\n    API Gateway-\u003e\u003eGet Lambda: Trigger\n    Get Lambda-\u003e\u003eDynamoDB: Query status\n    DynamoDB--\u003e\u003eGet Lambda: Return status\n    Get Lambda--\u003e\u003eAPI Gateway: Return status\n    API Gateway--\u003e\u003eClient: 200 OK + status\n```\n\n## Key Components\n\n### Handlers\n\n- **Create Handler**: Handles incoming requests to create enrichment tasks. It validates input, stores data in S3, and sends messages to SQS for processing. Fan-out pattern.\n\n- **Process Handler**: Processes messages from SQS, enriches contact data using a mock API, and updates the status in DynamoDB. Worker pattern.\n\n- **Get Handler**: Retrieves the status of enrichment requests from DynamoDB.\n\n### Infrastructure\n\n- **S3 Adapter**: Manages interactions with AWS S3 for storing and retrieving objects.\n\n- **SQS Adapter**: Handles sending messages to AWS SQS.\n\n- **Status Repository**: Interacts with DynamoDB to manage the status of enrichment requests.\n\n## Scripts\n\n- **createEnrichment.sh**: Shell script to send a POST request to the enrichment API.\n- **getEnrichment.sh**: Shell script to retrieve the status of an enrichment request.\n\n## Configuration\n\n- **tsconfig.json**: TypeScript configuration file.\n- **.nvmrc**: Node version manager configuration file.\n\n## Improvements\n\n1. **S3 Direct Upload**\n\n   - Instead of receiving huge JSON payloads, we could accept files directly in S3\n   - Benefits: Improved latency and reduced costs\n\n2. 
**High Volume Processing**\n\n   - For large datasets (\u003e100k contacts), implement more efficient processing\n   - Solution: Use SQS for initial request and fan-out pattern for processing\n   - Architecture: Post request to SQS → Process in separate lambda → Fan-out results to processing lambda\n\n3. **Request Deduplication**\n   - Implement SHA-256 signatures for request deduplication\n   - Purpose: Prevent processing duplicate requests\n   - Implementation:\n     ```typescript\n     import { createHash } from \"crypto\";\n\n     // Sort keys at every nesting level so equal payloads serialize identically.\n     const sortKeys = (value: unknown): unknown =\u003e {\n       if (Array.isArray(value)) return value.map(sortKeys);\n       if (value === null || typeof value !== \"object\") return value;\n       const entries = Object.entries(value).sort(([a], [b]) =\u003e (a \u003c b ? -1 : 1));\n       return Object.fromEntries(entries.map(([k, v]) =\u003e [k, sortKeys(v)]));\n     };\n\n     export const getJsonHash = (obj: object): string =\u003e {\n       const canonicalJSON = JSON.stringify(sortKeys(obj));\n       return createHash(\"sha256\").update(canonicalJSON, \"utf8\").digest(\"hex\");\n     };\n     ```\n   - Process:\n     - Create signature from input using SHA-256\n     - Store signature in request metadata\n     - Verify signature before processing\n     - Return cached result if duplicate detected\n   - Why SHA-256? Secure hash function with strong collision resistance, widely used for digital signatures\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftupizz%2Fdata-processing-pipeline-aws","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ftupizz%2Fdata-processing-pipeline-aws","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ftupizz%2Fdata-processing-pipeline-aws/lists"}