{"id":35269377,"url":"https://github.com/factset/aws-s3-to-redshift-loader","last_synced_at":"2026-04-02T02:10:08.067Z","repository":{"id":185934709,"uuid":"670664307","full_name":"factset/aws-s3-to-redshift-loader","owner":"factset","description":null,"archived":false,"fork":false,"pushed_at":"2023-11-28T00:25:59.000Z","size":71419,"stargazers_count":0,"open_issues_count":3,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2024-03-20T09:59:10.130Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"HCL","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/factset.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2023-07-25T14:52:26.000Z","updated_at":"2023-08-03T17:59:03.000Z","dependencies_parsed_at":null,"dependency_job_id":"ebf0690c-186a-4ac2-95bb-ea2961a27995","html_url":"https://github.com/factset/aws-s3-to-redshift-loader","commit_stats":null,"previous_names":["factset/aws-s3-to-redshift-loader"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/factset/aws-s3-to-redshift-loader","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/factset%2Faws-s3-to-redshift-loader","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/factset%2Faws-s3-to-redshift-loader/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/factset%2Faws-s3-to-redshift-loader/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/factset%2Faws-s3-to-redshift-loader/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/factset","
download_url":"https://codeload.github.com/factset/aws-s3-to-redshift-loader/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/factset%2Faws-s3-to-redshift-loader/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31294416,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-02T01:43:37.129Z","status":"online","status_checked_at":"2026-04-02T02:00:08.535Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2025-12-30T11:53:44.665Z","updated_at":"2026-04-02T02:10:08.023Z","avatar_url":"https://github.com/factset.png","language":"HCL","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Overview\n![s3 to redshift loader](docs/images/s3-to-redshift-loader.png)\n\n\nS3-Redshift-Loader loads data from an AWS S3 bucket owned by the data provider into your AWS Redshift instance. To do so, it creates IAM roles, S3 buckets, Simple Queue Service (SQS) queues, Lambda functions, a Redshift cluster, a Simple Notification Service (SNS) topic, and CloudWatch alarms, automating the workflow end to end with a loosely coupled architecture.\n\nSetting up the full pipeline takes four steps. For the initial deployment, we recommend working through steps 1 to 4 in order. 
After that, changes you make to the Lambda functions affect only step 3, so you only need to redeploy step 3 and can leave the other steps untouched. Scroll down for the details of each step. \n\n# Requirements\n## [Python 3.9](https://www.python.org/downloads/)\nThe runtime for Lambda functions is set to Python 3.9. Using a different Python version when you test locally or in a container may cause issues.\n\n## [Terraform \u003e= 0.13](https://www.terraform.io/downloads.html)\n\n## AWS Credentials\nStore your AWS credentials in the *credentials* file, usually in the *.aws* directory, or set them as [environment variables](#set-aws-credentials).\n\n## VPC and Subnets\nThis project uses a Virtual Private Cloud (VPC) and subnets. Create a VPC beforehand and **set up a VPC endpoint for SQS**. If you do not need a VPC, remove all the VPC-related logic.\n\n## `var.tfvars` files\nEach step needs a `var.tfvars` file in its directory to provide the required inputs. The required variables are listed under the **Inputs** section of each step's readme. For more information on how to write a `.tfvars` file, see the [Terraform official documentation](https://registry.terraform.io/providers/terraform-redhat/rhcs/latest/docs/guides/terraform-vars).\n\n# Deployment\n## Step 1. [IAM Role Creator Module](step_1-iam_role_creator/README.md) (Optional)\nThis is a one-time deployment step that creates an IAM role with an access policy that allows subscribing to the data provider's SNS topic and copying data from the provider's S3 bucket. On successful execution, the Terraform script outputs the IAM role ARN. Add the IAM role ARN to *var.tfvars* in Steps 2, 3, and 4. If you already have a role with the proper IAM policy, you can skip this module and continue with the next instruction.\n\n\u003e Before executing the rest of the process, share the IAM role ARN with your data provider so that they can grant your account access to their resources. 
Once the provider grants access, they should share the S3 access point alias and the SNS topic ARN. Set them as *data_source_access_point_alias* and *data_source_sns_arn* in your tfvars file, and switch the *run_only_iam_creator* value to false to deploy the entire pipeline.\n\n## Step 2. [Redshift Cluster Creator Module](step_2-rs_cluster_creator/README.md) (Optional)\nThis is a one-time deployment step that creates a Redshift cluster, database, username, and password. On successful execution, the Terraform script outputs the Redshift DNS name and the Redshift security group IDs. Add the output values to *var.tfvars* in Step 3. If you already have a Redshift cluster set up with a security group, you can skip this module.\n\n## Step 3. [Pipeline Builder](step_3-pipeline_builder/README.md)\n### [Data Copier Module](step_3-pipeline_builder/modules/data_copier/README.md)\nThe data copier module listens to the data provider's SNS topic, copies relevant data into your bucket, and triggers the *data_transformer* SQS queue.\n\n#### Notes\n- The module publishes a message to the *data_transformer* SQS queue only when a [data file (txt file)](#data-transformer-module) is delivered.\n- The module assumes that there are specific directories to watch within the provider's S3 bucket. For instance, your provider might create a directory called *relevant_data_to_abc* to store data that is relevant to you. 
Add *relevant_data_to_abc* to the global variable *TARGET_DIRS* in the *data_copier* Lambda script to copy the data stored in that directory.\n\n### [Data Transformer Module](modules/data_transformer/README.md)\n- Parses data and schema (**both are required**).\n- Converts data column headers to Redshift-friendly names.\n- Sets SQL data types per column based on the schema.\n- Maps each data row to a Redshift table, its final destination in Redshift.\n- If the Redshift tables already exist, removes existing rows with a matching lookup column value to handle re-statement.\n- Saves data per Redshift table as a gzip file in the staging S3 bucket.\n- Publishes a message to the *rs_loader* SQS queue.\n\n#### Notes\n- Two files are expected per data set, with the following supported file types and structures:\n    - Data: *txt* file (.txt) with *pipe* (|) as the delimiter\n    - Schema: *json* file (.json) with the following structure\n        ```json\n        {\n        \"fields\": [\n            {\n            \"name\": \"column_header\",\n            \"type\": \"column_data_type\"\n            }\n        ]\n        }\n        ```\n- If a schema file is not delivered, the data transformer module retries retrieving the schema file a few times and then raises an exception.\n- The data transformer module uses a configuration file, *cfg.csv*, for data-table mapping.\n- The module uses a column, *Report Type*, for the data-table mapping in the cfg file.\n- Redshift only allows a letter, @, \\_, or \\# as the first character of a column name. If this requirement is not met, `_get_rs_column_name()` prepends \"_\" to the name.\n- A Redshift column name can be at most 127 bytes. If the name is longer, Redshift truncates it to 127 bytes.\n- Required data file columns must be listed in `REQUIRED_COLS` in the Lambda function.\n- The module processes data in chunks to support large data sets efficiently. 
The chunk size can be set with `CHUNK_SIZE` in the Lambda function.\n- The module uses a column, *Output ID*, to handle re-statement.\n- The module only supports the *varchar(max)* and *float* SQL data types, to reduce the complexity of handling various data types.\n- Lambda has a maximum timeout of 15 minutes and a memory limit, both of which can cause problems when processing large data sets.\n\n### [Redshift Loader Module](modules/redshift_loader/README.md)\nThe Redshift loader module prepares a Redshift table for the incoming data (creating the table if it does not exist and adding any new columns), copies the data, and deletes the used staging file.\n\n## Step 4. [SNS Subscriber](step_4-sns_subscriber/README.md)\nThis is a one-time deployment step that subscribes to the data source's SNS topic. Run this step only after the data source has granted your IAM role permission to access their resources.\n\n# Development\n## Set AWS credentials\n```shell\nexport AWS_ACCESS_KEY_ID=[your_aws_access_key_id]\nexport AWS_SECRET_ACCESS_KEY=[your_aws_secret_access_key]\nexport AWS_SESSION_TOKEN=[your_aws_session_token]\n```\n\n## Initialize Terraform\nGo to the directory of the step you want to deploy. 
Then run the following command.\n```shell\nterraform init\n```\n\n## Preview Terraform's plan\n```shell\nterraform plan -var-file=\"var.tfvars\" # you must create the var file yourself\n```\n\n## Apply Terraform's plan\n```shell\nterraform apply -auto-approve -var-file=\"var.tfvars\"\n```\n\n## Destroy Terraform application\n```shell\nterraform destroy -auto-approve -var-file=\"var.tfvars\"\n```\n\n# Copyright\n\nCopyright 2023 FactSet Research Systems Inc\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n    http://www.apache.org/licenses/LICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffactset%2Faws-s3-to-redshift-loader","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Ffactset%2Faws-s3-to-redshift-loader","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Ffactset%2Faws-s3-to-redshift-loader/lists"}