{"id":22428507,"url":"https://github.com/kwame-mintah/aws-lambda-data-preprocessing","last_synced_at":"2026-05-05T02:36:27.245Z","repository":{"id":210557260,"uuid":"726258301","full_name":"kwame-mintah/aws-lambda-data-preprocessing","owner":"kwame-mintah","description":"A lambda function to perform data preprocessing on new data placed into an AWS S3 Bucket.","archived":false,"fork":false,"pushed_at":"2025-04-27T17:57:32.000Z","size":77,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-05-29T12:09:39.414Z","etag":null,"topics":["aws","aws-lambda","data-preprocessing","python","python312"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kwame-mintah.png","metadata":{"files":{"readme":"README.md","changelog":"CHANGELOG.md","contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2023-12-01T22:05:43.000Z","updated_at":"2025-04-27T17:57:35.000Z","dependencies_parsed_at":"2024-04-02T02:27:15.863Z","dependency_job_id":"39e1570e-874e-4a3f-9201-cd389512938b","html_url":"https://github.com/kwame-mintah/aws-lambda-data-preprocessing","commit_stats":null,"previous_names":["kwame-mintah/aws-lambda-data-preprocessing"],"tags_count":17,"template":false,"template_full_name":null,"purl":"pkg:github/kwame-mintah/aws-lambda-data-preprocessing","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kwame-mintah%2Faws-lambda-data-preprocessing","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kwame-mintah%2Faws-lambda-data-preprocessing
/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kwame-mintah%2Faws-lambda-data-preprocessing/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kwame-mintah%2Faws-lambda-data-preprocessing/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kwame-mintah","download_url":"https://codeload.github.com/kwame-mintah/aws-lambda-data-preprocessing/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kwame-mintah%2Faws-lambda-data-preprocessing/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32633434,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-04T10:08:07.713Z","status":"online","status_checked_at":"2026-05-05T02:00:06.033Z","response_time":54,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","aws-lambda","data-preprocessing","python","python312"],"created_at":"2024-12-05T20:15:02.444Z","updated_at":"2026-05-05T02:36:27.232Z","avatar_url":"https://github.com/kwame-mintah.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AWS Lambda Data Preprocessing\n\n[![Python 3.12](https://img.shields.io/badge/python-3.12-blue.svg)](https://www.python.org/downloads/release/python-3121/)\n[![🚧 Bump 
version](https://github.com/kwame-mintah/aws-lambda-data-preprocessing/actions/workflows/bump-repository-version.yml/badge.svg)](https://github.com/kwame-mintah/aws-lambda-data-preprocessing/actions/workflows/bump-repository-version.yml)\n[![🚀 Push Docker image to AWS ECR](https://github.com/kwame-mintah/aws-lambda-data-preprocessing/actions/workflows/push-docker-image-to-aws-ecr.yml/badge.svg)](https://github.com/kwame-mintah/aws-lambda-data-preprocessing/actions/workflows/push-docker-image-to-aws-ecr.yml)\n[![🧹 Run linter](https://github.com/kwame-mintah/aws-lambda-data-preprocessing/actions/workflows/run-linter.yml/badge.svg)](https://github.com/kwame-mintah/aws-lambda-data-preprocessing/actions/workflows/run-linter.yml)\n\u003ca href=\"https://github.com/psf/black\"\u003e\u003cimg alt=\"Code style: black\" src=\"https://img.shields.io/badge/code%20style-black-000000.svg\"\u003e\u003c/a\u003e\n\nA lambda function to perform data pre-processing on new data put into an S3 bucket. It is assumed that any new data\nuploaded will be of the same format, e.g. the same features, data schema etc. Actions performed include, but are not limited to, removing missing\ndata and imputing numerical and/or categorical values.\n\nThis repository does not create the S3 Bucket; that is created via the Terraform found in [terraform-aws-machine-learning-pipeline](https://github.com/kwame-mintah/terraform-aws-machine-learning-pipeline).\nThe data uploaded into these buckets can be found in [ml-data-copy-to-aws-s3](https://github.com/kwame-mintah/ml-data-copy-to-aws-s3). 
Additionally, the data preparation is\ntailored to a specific dataset found within that GitHub repository.\n\n# Flowchart\n\nThe [diagram below](https://mermaid.js.org/syntax/flowchart.html#flowcharts-basic-syntax) demonstrates what happens when the lambda is triggered after a new `.csv` object has been uploaded to the S3 Bucket.\n\n```mermaid\ngraph LR\n  S0(Start)\n  T1(Pull dataset from S3 Bucket)\n  T2(Dataset transformed using Pandas)\n  T3(Upload transformed data to output bucket)\n  T4(Tag original dataset as processed)\n  E0(End)\n\n  S0--\u003eT1\n  T1--\u003eT2\n  T2--\u003eT3\n  T3--\u003eT4\n  T4--\u003eE0\n```\n\n# Notice\n\nAs mentioned in the project description, the code provided here is intended for a specific dataset, used to predict\nwhether potential customers will engage with offers made to them, using various information such as demographics, past\ninteractions, and environmental factors, so that only a specific set of customers is targeted with an offer.\n\nData pre-processing, if done incorrectly, is one of the biggest risks in any machine learning project. Please ensure that\nenough time is taken to properly understand the data received before attempting to automate the process.\n\n## Development\n\n### Dependencies\n\n- [Python](https://www.python.org/downloads/release/python-3120/)\n- [Docker Desktop](https://www.docker.com/products/docker-desktop/)\n- [Amazon Web Services](https://aws.amazon.com/?nc2=h_lg)\n\n## Usage\n\n1. Build the docker image locally:\n\n   ```shell\n   docker build --no-cache -t data-preprocessing:local .\n   ```\n\n2. Run the docker image that was built:\n\n   ```shell\n   docker run --platform linux/amd64 -p 9000:8080 data-preprocessing:local\n   ```\n\n3. 
Send an event to the lambda via curl:\n   ```shell\n   curl \"http://localhost:9000/2015-03-31/functions/function/invocations\" -d '{\u003cEXPAND_BELOW_AND_REPLACE_WITH_JSON_BELOW\u003e}'\n   ```\n   \u003cdetails\u003e\n   \u003csummary\u003eExample AWS S3 event received\u003c/summary\u003e\n   ```json\n   {\n     \"Records\": [\n       {\n         \"eventVersion\": \"2.1\",\n         \"eventSource\": \"aws:s3\",\n         \"awsRegion\": \"eu-west-2\",\n         \"eventTime\": \"2023-12-01T21:48:58.339Z\",\n         \"eventName\": \"ObjectCreated:Put\",\n         \"userIdentity\": { \"principalId\": \"AWS:ABCDEFGHIJKLMNOPKQRST\" },\n         \"requestParameters\": { \"sourceIPAddress\": \"127.0.0.1\" },\n         \"responseElements\": {\n           \"x-amz-request-id\": \"BY65CG6WZD6HBVX2\",\n           \"x-amz-id-2\": \"c2La85nMEE2WBGPHBXDc5a8fd28kEpGt/QsP8n/xmbLv0ZAJeqsK/XmNcCCS+phWuVz8KP3/gn3Ql3/z7RPyC3n176rqpzvZ\"\n         },\n         \"s3\": {\n           \"s3SchemaVersion\": \"1.0\",\n           \"configurationId\": \"huh\",\n           \"bucket\": {\n             \"name\": \"example-bucket-name\",\n             \"ownerIdentity\": { \"principalId\": \"ABCDEFGHIJKLMN\" },\n             \"arn\": \"arn:aws:s3:::example-bucket-name\"\n           },\n           \"object\": {\n             \"key\": \"data/bank-additional-full.csv\",\n             \"size\": 515246,\n             \"eTag\": \"0e29c0d99c654bbe83c42097c97743ed\",\n             \"sequencer\": \"00656A54CA3D69362D\"\n           }\n         }\n       }\n     ]\n   }\n   ```\n   \u003c/details\u003e\n\n## GitHub Action (CI/CD)\n\nThe GitHub Action \"🚀 Push Docker image to AWS ECR\" will check out the repository and push a docker image to the chosen AWS ECR using\n[configure-aws-credentials](https://github.com/aws-actions/configure-aws-credentials/tree/v4.0.1/) action. 
The following repository secrets need to be set:\n\n| Secret             | Description                  |\n|--------------------|------------------------------|\n| AWS_REGION         | The AWS Region.              |\n| AWS_ACCOUNT_ID     | The AWS account ID.          |\n| AWS_ECR_REPOSITORY | The AWS ECR repository name. |\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkwame-mintah%2Faws-lambda-data-preprocessing","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkwame-mintah%2Faws-lambda-data-preprocessing","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkwame-mintah%2Faws-lambda-data-preprocessing/lists"}