{"id":20271743,"url":"https://github.com/rodneyshag/aws_redshift","last_synced_at":"2025-03-04T00:05:32.278Z","repository":{"id":123421515,"uuid":"212214285","full_name":"RodneyShag/AWS_Redshift","owner":"RodneyShag","description":"AWS Redshift tutorial","archived":false,"fork":false,"pushed_at":"2019-12-30T04:59:51.000Z","size":138,"stargazers_count":0,"open_issues_count":0,"forks_count":7,"subscribers_count":3,"default_branch":"master","last_synced_at":"2025-01-14T05:49:47.287Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/RodneyShag.png","metadata":{"files":{"readme":"readme.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-10-01T22:53:39.000Z","updated_at":"2019-12-30T04:59:53.000Z","dependencies_parsed_at":null,"dependency_job_id":"55be71e5-ea99-4cb2-adec-1960d3a38d70","html_url":"https://github.com/RodneyShag/AWS_Redshift","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RodneyShag%2FAWS_Redshift","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RodneyShag%2FAWS_Redshift/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RodneyShag%2FAWS_Redshift/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/RodneyShag%2FAWS_Redshift/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/RodneyShag","download_url":"https://codeload.github.com/RodneyShag/AWS_Redshift/tar.g
z/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":241758964,"owners_count":20015251,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T12:39:15.464Z","updated_at":"2025-03-04T00:05:32.255Z","avatar_url":"https://github.com/RodneyShag.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cp align=\"center\"\u003e\n    \u003cimg src=\"images/redshift_logo.png\"\u003e\n\u003c/p\u003e\n\n\nThis repo is a concise summary and _replacement_ of the [Amazon Redshift by Edureka](https://www.youtube.com/watch?v=fc5WPKnbam8) tutorial. Using the hyperlinks below is optional.\n\n\n# [Tutorial by Edureka](https://www.youtube.com/watch?v=fc5WPKnbam8)\n\n[What is a data warehouse?](https://youtu.be/fc5WPKnbam8?t=75) - a repository where data generated from your organization's operational systems is collected, transformed, and stored.\n\n[What is Redshift?](https://youtu.be/fc5WPKnbam8?t=349) - a parallel, column-oriented database for analyzing data in your data warehouse.\n\n[What are clusters and nodes?](https://youtu.be/fc5WPKnbam8?t=381) - _nodes_ are a collection of compute resources. A group of nodes is called a _cluster_. Each cluster runs a Redshift engine, and it contains 1+ databases.\n\n![Clusters and Nodes](./images/clustersAndNodes.png)\n\n[What is a leader node?](https://youtu.be/fc5WPKnbam8?t=397) - receives queries from client applications. It coordinates the parallel execution of a query using 1+ compute nodes. 
The leader node also aggregates the results from the compute nodes and sends them back to the client application.\n\n[What is a compute node?](https://youtu.be/fc5WPKnbam8?t=422) - compute resources that process the queries the _leader node_ sends them. Compute nodes can transfer data amongst themselves to solve queries.\n\n[What are node slices?](https://youtu.be/fc5WPKnbam8?t=435) - _compute nodes_ are divided into _node slices_. Each slice receives its own share of memory and disk space. Slices work in parallel to perform operations.\n\n[How do clients communicate with Redshift?](https://youtu.be/fc5WPKnbam8?t=466) - client applications communicate with the _Leader Node_ using either:\n\n1. JDBC (Java Database Connectivity) driver - a database API for Java, or\n1. ODBC (Open Database Connectivity) driver - uses SQL to interact with the _Leader Node_\n\n[What are some settings you select during Redshift cluster creation?](https://youtu.be/fc5WPKnbam8?t=598) - type of node you want, number of nodes, the VPC where you want to create your data warehouse, etc.\n\n[How does Redshift autoscale up and down?](https://youtu.be/fc5WPKnbam8?t=657) - it increases or decreases the number of compute nodes.\n\n[Is Redshift row or column storage?](https://youtu.be/fc5WPKnbam8?t=730) - column storage.\n\n[What are some benefits of column storage?](https://youtu.be/fc5WPKnbam8?t=817) - Since the values of a column are stored next to each other, we get:\n1. Improved data compression\n1. Faster queries and updates on columns\n\n[What is a Data Lake?](https://youtu.be/fc5WPKnbam8?t=962) - a storage repository that holds a vast amount of raw data in its native format (CSV, Parquet, TSV, RCFile, etc.) until it's needed.\n\n[What is ETL?](https://youtu.be/fc5WPKnbam8?t=977) - ETL stands for _Extract, Transform, Load_. A sample ETL job extracts data from a data lake, transforms it, and loads it into Redshift. 
This is time-consuming, compute-intensive, and costly (because you need to grow your Redshift clusters).\n\n[What is _Amazon Redshift Spectrum_?](https://youtu.be/fc5WPKnbam8?t=1012) - Instead of doing ETL jobs, you can use _Amazon Redshift Spectrum_ to directly query data in S3 or a data lake, without unnecessary data movement.\n\n[How is data backed up in Redshift?](https://youtu.be/fc5WPKnbam8?t=1034) - Redshift offers backup and recovery. As soon as data is stored in Redshift, a copy of that data is sent to S3 through a secure connection. If you lose your data, you can restore it easily from S3.\n\n[Does Redshift provide encryption?](https://youtu.be/fc5WPKnbam8?t=1059) - yes.\n\n[What software do you need to install to use Redshift?](https://youtu.be/fc5WPKnbam8?t=1097)\n\n1. SQL Workbench - this is where you run queries\n1. JDBC Driver - enables the client application to communicate with Redshift\n1. Also, a Java Runtime must be installed on your OS\n\n[How do you create a Redshift cluster in the AWS Console?](https://youtu.be/fc5WPKnbam8?t=1230) - in the AWS Console, go to _Amazon Redshift_ and click _Quick launch cluster_ (fastest way) or _Launch cluster_ (more versatile way).\n\n[When creating a Redshift cluster, what is the default port?](https://youtu.be/fc5WPKnbam8?t=1348) - it is set to `5439` for us. We use this port number later.\n\n[How is a database created in Redshift?](https://youtu.be/fc5WPKnbam8?t=1459) - when we use the \"Quick launch cluster\" option, a default database called `dev` is created for us.\n\n[What URL does SQL Workbench need to connect to the cluster?](https://youtu.be/fc5WPKnbam8?t=1571) - either a JDBC URL or ODBC URL. 
Both can be found in the AWS Console's Redshift dashboard, under _Clusters_.\n\n[How do you create tables in Redshift?](https://youtu.be/fc5WPKnbam8?t=1669) - you can use a sample database from the AWS documentation (a set of `create table` commands to copy/paste).\n\n[How do you copy data for a database from S3 into Redshift?](https://youtu.be/fc5WPKnbam8?t=1742)\n\nYou can do something like:\n\n```sql\ncopy users from 's3://awssamplebuswest2/tickit/allusers_pipe.txt'\ncredentials 'aws_iam_role=\u003ciam-role-arn\u003e'\ndelimiter '|' region 'us-west-2';\n```\n- This copies data into the table `users` from the given S3 path.\n- You also have to provide an IAM role as credentials.\n- The `delimiter` is the character that separates the fields in each row of the file.\n- `region` is the AWS region where your S3 bucket is located.\n\n[How do you run queries on this new data?](https://youtu.be/fc5WPKnbam8?t=1878) - use SQL Workbench.\n\n[How do you see previous queries you've run?](https://youtu.be/fc5WPKnbam8?t=1981) - In the AWS Console, go to your Redshift cluster and click the _Queries_ tab.\n\n\n# More Notes\n\n__How much data should you have for Redshift to be a good idea?__ - 100 GB or more. Otherwise, use MySQL.\n\n__Is Redshift good for BLOB data?__ - No. Use S3 instead.\n\n__How is backup/restore achieved?__ - The _compute nodes_ asynchronously back up to S3 (currently this happens every 5 GB of changed data, or every 8 hours).\n\n__What is the primary goal when distributing data across nodes?__ - to distribute the data evenly. This maximizes parallel processing.\n\n__What are the distribution styles: DISTKEY, ALL, EVEN?__\n\n1. DISTKEY - hashes the values of a key column to decide where each row is stored.\n1. ALL - puts a copy of the table on every node.\n1. EVEN - splits a table's rows evenly across the nodes.\n\n__What's the best distribution style (DISTKEY, ALL, EVEN) for small tables?__ - ALL. 
For scenarios where a big table is being joined with a small table, it helps to have a copy of the small table on every node, so no data transfer happens between nodes.\n\n__If JOINs are being performed often on a column (that has distinct values), how should you distribute data?__ - Use DISTKEY on that column. Because the column has distinct values, hashing it distributes the rows evenly across nodes, and since JOINs are often performed on that column, the join work is spread evenly across nodes as well.\n\n__What if you have a huge table, and Column1 and Column2 are each accessed 50% of the time? How can you do distribution?__ - Create 2 copies of this table, one with a DISTKEY on Column1, and one with a DISTKEY on Column2.\n\n__When do you use the EVEN distribution style?__ - when you have no insight into the data or into which columns are accessed frequently, you can just distribute the data evenly across the nodes.\n\n__Does Redshift enforce primary keys or foreign keys?__ - No, but defining them can speed up queries.\n\n__What's a good way to copy data into Redshift, using files?__ - Instead of using 1 big file, split the data into multiple files and use 1 COPY command to copy data from all of them. Each slice in a node will process 1 file, so if you have 960 slices, use 960 files.\n\n__What is _vacuuming_?__ - When you perform a delete, rows are marked for deletion, but not removed. Redshift automatically runs a [VACUUM DELETE](https://docs.aws.amazon.com/en_pv/redshift/latest/dg/t_Reclaiming_storage_space202.html) operation (during periods of reduced load) to actually delete these rows.\n\n__Instead of deleting old data, what's a better idea?__ - Deleting data causes _vacuuming_, which is slow (could take hours). Instead, split the data into tables by time (such as one table per month). You can then delete old data by simply using `DROP TABLE` on the oldest table.\n\n__What are subqueries? When are they okay to use?__ - A subquery is a query within a query. 
Use subqueries in cases where one table in the query is used only for predicate conditions. A subquery in a query results in nested loops, so only use it when the subquery returns a small number of rows (for example, fewer than 200).\n\n__What are blocks?__ - Column data is persisted to 1 MB immutable blocks. When factoring in compression, there can be 1 million values in a single block.\n\n__What is a Redshift Sort Key?__ - [Redshift Sort Key determines the order in which rows in a table are stored. Query performance is improved when Sort keys are properly used as it enables query optimizer to read fewer chunks of data filtering out the majority of it.](https://hevodata.com/blog/redshift-sort-keys-choosing-best-sort-style/)\n\n# References\n\n### References - Used in this repo\n\n- YouTube: [Amazon Redshift Tutorial | AWS Tutorial for Beginners | AWS Certification Training | Edureka](https://www.youtube.com/watch?v=fc5WPKnbam8) - good beginner tutorial.\n\n### References - Deprecated\n\n- YouTube: [Deep Dive on Amazon Redshift - AWS Online Tech Talks](https://www.youtube.com/watch?v=Hur-p3kGDTA) - advanced, high-level tutorial for users who already know Redshift. Mediocre.\n- A Cloud Guru: [Hands on with AWS Redshift: Table Design](https://learn.acloud.guru/course/aws-redshift-table-design/dashboard) - assumes AWS knowledge (Security Groups, VPC, etc.). Everything starting from \"Load data and run sql queries\" was not well explained. Very confusing tutorial.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frodneyshag%2Faws_redshift","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Frodneyshag%2Faws_redshift","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Frodneyshag%2Faws_redshift/lists"}