{"id":20909723,"url":"https://github.com/lovenui/dataengineering-capstone-project","last_synced_at":"2025-04-11T07:48:34.390Z","repository":{"id":181168323,"uuid":"644723966","full_name":"LoveNui/DataEngineering-Capstone-Project","owner":"LoveNui","description":null,"archived":false,"fork":false,"pushed_at":"2023-07-15T15:51:50.000Z","size":12242,"stargazers_count":17,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-03-25T05:25:40.022Z","etag":null,"topics":["airflow","aws-redshift","aws-s3","data-engineering","python","spark","sql"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/LoveNui.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-05-24T06:03:05.000Z","updated_at":"2024-07-31T18:15:10.000Z","dependencies_parsed_at":"2024-11-18T14:48:21.188Z","dependency_job_id":null,"html_url":"https://github.com/LoveNui/DataEngineering-Capstone-Project","commit_stats":null,"previous_names":["lovenui/balancer-core","lovenui/dataengineering-capstone-project"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LoveNui%2FDataEngineering-Capstone-Project","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LoveNui%2FDataEngineering-Capstone-Project/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LoveNui%2FDataEngineering-Capstone-Project/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/LoveNui%2FDataEngineering-Capstone-Project/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/LoveNui","download_url":"https://codeload.github.com/LoveNui/DataEngineering-Capstone-Project/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248359609,"owners_count":21090557,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","aws-redshift","aws-s3","data-engineering","python","spark","sql"],"created_at":"2024-11-18T14:12:30.292Z","updated_at":"2025-04-11T07:48:34.352Z","avatar_url":"https://github.com/LoveNui.png","language":"Jupyter Notebook","readme":"\u003cimg align=\"right\" src=\"https://eclectic-thoughts.com/wp-content/uploads/2018/04/Udacity_logo-421x500.png\" width=108\u003e\n\n## Data Engineering Capstone Project for Udacity\n\n### Objective  \n\n---\nIn this project we are going to work with US immigraton data from the \nyear 1994. We have facts such as visa types, transport modes, landing \nports, us state codes, country codes. 
### Data Model

---
![alt text](img/schema.PNG)

### Data Pipeline

---
![alt text](img/marker.png)
![alt text](img/pipeline.png)
![alt_text](img/pipeline-tree.png)

### Installing and starting

---

#### Installing Python Dependencies
You need to install these Python dependencies.
In Terminal/Command Prompt:

Without Anaconda:
```
$ python3 -m venv virtual-env-name
$ source virtual-env-name/bin/activate
$ pip install -r requirements.txt
```
With Anaconda (on Windows):
```
$ conda env create -f env.yml
$ source activate <conda-env-name>
```
or (on other platforms):
```
conda create -y -n <conda-env-name> python==3.6
conda install -f -y -q -n <conda-env-name> -c conda-forge --file requirements.txt
[source activate/ conda activate] <conda-env-name>
```
#### Fixing/Configuring Airflow
```
$ pip install --upgrade Flask
$ pip install zappa
$ mkdir airflow_home
$ export AIRFLOW_HOME=./airflow_home
$ cd airflow_home
$ airflow initdb
$ airflow webserver
$ airflow scheduler
```

#### More Airflow commands
To list the existing DAGs registered with Airflow:
```
$ airflow list_dags
```

#### Secure/Encrypt your connections and hooks
**Run**
```bash
$ python cryptosetup.py
```
Copy this key into *airflow.cfg*, pasting it after
fernet_key = ************

#### Setting up connections and variables in Airflow UI for AWS
We are going to create the following connections and variables.

**S3**
1. Open your browser to localhost:8080 and open Admin->Variables
2. Click "Create"
3. Set "Key" equal to "s3_bucket" and set "Val" equal to "udacity-dend"
4. Set "Key" equal to "s3_prefix" and set "Val" equal to "data-pipelines"
5. Click save

**AWS**
1. Open Admin->Connections
2. Click "Create"
3. Set "Conn Id" to "aws_credentials", "Conn Type" to "Amazon Web Services"
4. Set "Login" to your aws_access_key_id and "Password" to your aws_secret_key
5. Click save
6. If it doesn't work, then put the following in the "Extra" field:
{"region_name": "your_aws_region", "aws_access_key_id": "your_aws_access_key_id", "aws_secret_access_key": "your_aws_secret_access_key", "aws_iam_user": "your_created_iam_user"}
7. These are all the fields you can put there:
- aws_account_id: AWS account ID for the connection
- aws_iam_role: AWS IAM role for the connection
- external_id: AWS external ID for the connection
- host: Endpoint URL for the connection
- region_name: AWS region for the connection
- role_arn: AWS role ARN for the connection

**Redshift**
1. Open Admin->Connections
2. Click "Create"
3. Set "Conn Id" to "redshift", "Conn Type" to "postgres"
4. Set "Login" to the master_username for your cluster and "Password"
to the master_password for your cluster
5. Click save
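
Once these are saved, tasks can pull the credentials, connection and variables at run
time through Airflow's hook and `Variable` APIs. A minimal sketch using Airflow 1.10-style
imports that match the commands above (the connection and variable ids are the ones
configured in the UI; the function itself is just for illustration):

```python
from airflow.contrib.hooks.aws_hook import AwsHook
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import Variable


def check_airflow_setup():
    # AWS credentials stored under the "aws_credentials" connection
    credentials = AwsHook(aws_conn_id="aws_credentials").get_credentials()

    # Redshift cluster reachable through the "redshift" (postgres) connection
    redshift = PostgresHook(postgres_conn_id="redshift")

    # Variables created under Admin->Variables
    s3_bucket = Variable.get("s3_bucket")

    print("Access key starts with:", credentials.access_key[:4])
    print("Target bucket:", s3_bucket)
    print("Redshift reachable:", redshift.get_first("SELECT 1"))
```
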
#### Optional
If you haven't set up your AWS Redshift cluster yet
(or don't want to create one manually), then use the files
inside the 'aws' folder
- To create the cluster and IAM role: run the code below in a terminal from the 'aws' folder to create your Redshift database and an
    iam_role in AWS that has read access to Amazon S3 and is
    attached to the created cluster
    ```bash
    $ python aws_operate.py --action start
    ```
    Copy the DWH_ENDPOINT for <cluster_endpoint_address> and DWH_ROLE_ARN
    for <iam_role> from the print statements
- To create tables: run the code below in a terminal from the project dir to create tables in your Redshift database
    in AWS
    ```bash
    $ python create_table.py --host <cluster_endpoint_address>
    ```
- To stop: run the code below in a terminal from the 'aws' directory to destroy your Redshift database and
    detach the iam_role from the cluster
    ```bash
    $ python aws_operate.py --action stop
    ```

### About the data

---
#### I94 Immigration Data:
This data comes from the US National Tourism and Trade Office;
[this](https://travel.trade.gov/research/reports/i94/historical/2016.html)
is where the data comes from. There is a sample file so you can take a look
at the data in csv format before reading it all in. The report contains
international visitor arrival statistics by world regions and selected
countries (including top 20), type of visa, mode of transportation,
age groups, states visited (first intended address only), and the top
ports of entry (for select countries).

#### World Temperature Data:
This dataset came from Kaggle. You can read more about it [here](https://www.kaggle.com/berkeleyearth/climate-change-earth-surface-temperature-data).

#### U.S. City Demographic Data:
This data comes from OpenSoft. You can read more about it [here](https://public.opendatasoft.com/explore/dataset/us-cities-demographics/export/).

#### Airport Code Table:
This is a simple table of airport codes and corresponding cities. It comes from [here](https://datahub.io/core/airport-codes#data).

### Run the project

---
1. Follow all the setup mentioned above
2. Create a bucket in region 'us-west-2' in Amazon S3
3. Set up all the connections and variables in the Airflow
admin:
    i. Set up the aws connection with user credentials (access_key and
    secret_key as login and password). Make sure the region is 'us-west-2'
    ii. Set up the Redshift connection with user, password, host, port,
    schema, db
    iii. Set up the iam_role for your AWS account
    iv. Set up variables for 'temp_input', 'temp_output', 'spark_path' (the Spark
    manipulation path for parquet files) and 'sas_file' (the SAS descriptor
    file)
    v. Place all the csv inputs inside the temp_output directory
    vi. Create a folder called 'spark_path' inside \airflow\dags\
    vii. Create a variable called 's3_bucket' (make sure the bucket in
    AWS is in region 'us-west-2')

    Example:

    | variable     | example value |
    |:-------------|-------------:|
    | iam_role | #### |
    | s3_bucket | #### |
    | sas_file | /home/workspace/airflow/dags/temp_input/I94_SAS_Labels_Descriptions.SAS |
    | spark_path | /home/workspace/airflow/dags/spark_path |
    | temp_input | /home/workspace/airflow/dags/temp_input/ |
    | temp_output | /home/workspace/airflow/dags/temp_output/ |

4. Data location for input files (see the sketch after this list for a quick way to inspect them):
    i. Put all your sas7bdat formatted files in the temp_input directory
    (whenever you want to process/insert them into the db; when you are
    done, remove the .sas7bdat file(s) and drop in new files)
    ii. Put the SAS descriptor file in the temp_input directory
    iii. Put the airport-codes_csv.csv file in the temp_output directory

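
Before wiring new files into the pipeline it can help to sanity-check a sas7bdat input
locally. A minimal sketch using pandas (the file name is a made-up example; any
.sas7bdat file dropped into temp_input works the same way, and the column names are
the I94 fields used in the queries below):

```python
import pandas as pd

# Peek at one immigration extract before dropping it into temp_input/.
# The file name below is illustrative, not part of the project.
df = pd.read_sas(
    "temp_input/i94_sample.sas7bdat",
    format="sas7bdat",
    encoding="ISO-8859-1",
)

print(df.shape)
print(df[["cicid", "i94port", "i94res"]].head())
```
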
### Test it Yourself!

---

Here are some example queries we run to check the results loaded into
the Redshift schema.

**Example Queries**
#### City from where immigrants arrived
```
SELECT TOP 10 b.port_city, b.port_state_or_country, COUNT(cicid) AS count
FROM project.immigration a INNER JOIN project.i94ports b ON a.i94port=b.port_code
GROUP BY b.port_city, b.port_state_or_country
ORDER BY COUNT(cicid) DESC
```

#### Different kinds of airports
```
SELECT top 10 distinct type, count(*) AS count_type
FROM project.airport_codes
WHERE iso_country = 'US'
GROUP BY type
ORDER BY count_type DESC
```

#### Immigrants from different countries
```
SELECT top 10 SUBSTRING(b.country_name, 0, 15) as country_name, COUNT(cicid) as count
FROM project.immigration a INNER JOIN project.i94res b ON a.i94res=b.country_code
GROUP BY b.country_name
ORDER BY COUNT(cicid) DESC
```

#### Small airports from different states
```
SELECT a.state_name AS State, airports.count AS Count_of_Airports
FROM
    (SELECT top 10 distinct substring(iso_region, 4, length(iso_region)) AS state, count(*)
     FROM project.airport_codes
     WHERE iso_country = 'US' AND type='small_airport'
     GROUP BY iso_region) airports INNER JOIN project.i94addr a ON airports.state=a.state_code
ORDER BY airports.count DESC
```

#### Small airport locations
```
SELECT a.longitude_deg, a.latitude_deg
FROM project.airport_codes a
WHERE a.iso_country = 'US' AND a.type = 'small_airport'
```

### Stats and Graphs

---
#### City from where immigrants arrived
![alt text](img/city_intake.png)

#### Different kinds of airports
![alt_text](img/diff_airports.png)

#### Immigrants from different countries
![alt text](img/no_of_immigrants.png)

#### Small airports from different states
![alt_text](img/state_airports.png)

#### Small airport locations in different states
![alt_text](img/graph.png)

Scoping the Project
---

The purpose is to produce interesting stats from the US immigration
data, airports around the world, and different dimensions such as visa
type, transport mode, nationality, etc.

### Steps Taken:
The steps were taken in the following order:
    **Gather the data**:
        This took a while, as different kinds of formats had to be chosen and I
        needed to settle on which data I would actually use for my analysis and
        queries. I settled on the .sas7bdat formatted immigration data, which
        fulfils the minimum-number-of-rows requirement, the cleaned airport data
        for the dimensions, and the SAS descriptor file, which covers the
        different kinds of formats required for the project.
    **Study the data**:
        This took a while, as I needed to understand what kind of
        pre-processing I would use to clean the individual datasets
        mentioned above: dropping rows on a condition, filtering rows
        according to other dimensions and facts, etc.
    **Choice of infrastructure**:
        After studying the data I decided on the tools and technologies
        I am comfortable with, making use of as many of the skills I
        learnt throughout the process as possible.
    **Implementation and Testing**:
        Once my pipeline started running, I ran all kinds of quality
        checks to ensure that the data is processed correctly, and provided a
        Jupyter notebook to test the project.

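
The quality checks mentioned above essentially boil down to assertions run against the
loaded Redshift tables. A minimal sketch of one such check, assuming the 'redshift'
connection configured earlier (the table name is just an example, and this is not the
project's exact operator):

```python
from airflow.hooks.postgres_hook import PostgresHook


def check_table_has_rows(table="project.immigration"):
    """Fail the task if the target table ended up empty after loading."""
    redshift = PostgresHook(postgres_conn_id="redshift")
    records = redshift.get_records("SELECT COUNT(*) FROM {}".format(table))
    if not records or records[0][0] < 1:
        raise ValueError("Data quality check failed: {} is empty".format(table))
    print("Data quality check passed: {} has {} rows".format(table, records[0][0]))
```
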
### Purpose of Final Data Model:
Gather interesting insights, such as demographic population along certain
dimensions under some filter conditions,
e.g.
 - Compare immigration of different nationalities
 - Compare the number of airports by state
 - Different kinds of airport statistics
 - Aggregate flow of immigrants through different cities

So I am using the airport codes, the US immigration data of '94, and
dimensions such as visa type, mode of transport, nationality codes and US
state code information.


Addressing other scenarios
---

### Data Increased by 100x:
 - I am using Redshift's columnar format, so querying will not get slower
 - Incremental updates are supported, so the full volume does not have to be
 inserted every time; whoever wants to insert immigration data into the database
 can just drop their sas7bdat files into the temp_input folder
 - Spark is used where heavy data is read and parsed, so distributed
 processing is also involved
 - Spark memory and processors are configurable to handle more pressure
 - S3 storage is used, which is scalable and easily accessible from other
 AWS infrastructure


### The pipelines would be run on a daily basis by 7 am every day:
- The pipeline's DAG is scheduled to match this requirement (see the sketch at the end of this section)

### The database needed to be accessed by 100+ people:
- People are granted usage on the schema, so not everyone, but only the people who
have been given access to the data, can use it as necessary. Below are the
commands one can run in the Redshift query editor, which is why running
them as a task in the pipeline is purely optional:

We can create a group of users, called _webappusers_, who can use the
functionality of the schema but cannot take admin decisions, and
we can add individual users with their name and an initial password.

```bash
create group webappusers;
create user webappuser1 password 'webAppuser1pass' in group webappusers;
grant usage on schema project to group webappusers;
```

We can create a group of users called _webdevusers_, who will have
admin privileges on the schema, and add those individual users with
their name and an initial password.
```
create group webdevusers;
create user webappdevuser1 password 'webAppdev1pass' in group webdevusers;
grant all on schema webapp to group webdevusers;
```

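
Coming back to the daily 7 am scenario above: the schedule can be expressed directly on
the DAG with a cron-style `schedule_interval`. A minimal sketch (the dag id and start
date are placeholders, not the project's actual values):

```python
from datetime import datetime

from airflow import DAG

# Sketch only: run the pipeline every day at 07:00.
dag = DAG(
    dag_id="capstone_pipeline",        # placeholder id
    start_date=datetime(2019, 1, 1),   # placeholder start date
    schedule_interval="0 7 * * *",     # daily at 7 am
    catchup=False,
)
```
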
Defending Decisions
---

### The choice of tools, technologies:
- Airflow to view, monitor and log the flow of information: an extremely useful tool to control end-to-end ETL processing
- S3 storage to store data on a large scale: never complain about storage, most importantly when it stores big data
- Redshift to take advantage of the columnar format and faster querying strategies: query from anywhere and anytime
- Spark for distributed processing of heavy data: best for fast in-memory processing
- Pandas for cleaning data frames: absolutely necessary

### Links for Airflow

---
**Context Variables**
https://airflow.apache.org/macros.html

**Hacks for Airflow**
https://medium.com/datareply/airflow-lesser-known-tips-tricks-and-best-practises-cf4d4a90f8f
https://medium.com/handy-tech/airflow-tips-tricks-and-pitfalls-9ba53fba14eb
https://www.astronomer.io/guides/dag-best-practices/

### Technologies Used
<img align="left" src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/93/Amazon_Web_Services_Logo.svg/512px-Amazon_Web_Services_Logo.svg.png" width=108>
<img align="left" src="https://upload.wikimedia.org/wikipedia/en/2/29/Apache_Spark_Logo.svg" width=108>
<img align="left" src="https://ncrocfer.github.io/images/airflow-logo.png" width=108>
<img align="left" src="https://upload.wikimedia.org/wikipedia/en/c/cd/Anaconda_Logo.png" width=108>
<img align="left" src="https://cdn.sisense.com/wp-content/uploads/aws-redshift-connector.png" width=108>
<img align="left" src="https://braze-marketing-assets.s3.amazonaws.com/images/partner_logos/amazon-s3.png" width=140 height=45>