{"id":19337184,"url":"https://github.com/martachesnova/data-pipelines--apacheairflow","last_synced_at":"2025-02-24T08:14:10.653Z","repository":{"id":107618937,"uuid":"479227399","full_name":"martachesnova/Data-Pipelines--ApacheAirflow","owner":"martachesnova","description":"Created and automated a set of data pipelines with Apache Airflow including monitoring and debugging production pipelines.","archived":false,"fork":false,"pushed_at":"2022-04-09T03:56:59.000Z","size":92,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-01-06T10:13:38.757Z","etag":null,"topics":["airflow","apache-airflow","python3","sql"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/martachesnova.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-04-08T03:13:30.000Z","updated_at":"2022-04-08T22:10:44.000Z","dependencies_parsed_at":null,"dependency_job_id":"8011c7ec-3af2-4033-8cda-2efdbf890ebe","html_url":"https://github.com/martachesnova/Data-Pipelines--ApacheAirflow","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/martachesnova%2FData-Pipelines--ApacheAirflow","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/martachesnova%2FData-Pipelines--ApacheAirflow/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/martachesnova%2FData-Pipelines--ApacheAirflow/rele
ases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/martachesnova%2FData-Pipelines--ApacheAirflow/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/martachesnova","download_url":"https://codeload.github.com/martachesnova/Data-Pipelines--ApacheAirflow/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":240441955,"owners_count":19801793,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","apache-airflow","python3","sql"],"created_at":"2024-11-10T03:13:38.926Z","updated_at":"2025-02-24T08:14:10.616Z","avatar_url":"https://github.com/martachesnova.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Pipelines with Apache Airflow\n\nA music streaming company has decided that it is time to introduce more automation and monitoring to their data warehouse ETL pipelines.\nThey have also noted that data quality plays a big part when analyses are executed on top of the data warehouse, and they want to run tests against their datasets after the ETL steps have been executed to catch any discrepancies in the datasets.\n\nMy task was to create high-grade data pipelines that are dynamic and built from reusable tasks. They are monitored and allow easy backfills. 
Data quality checks are also automated and run over the data warehouse to catch any discrepancies in the datasets.\n\n# Project Overview\n\nI've created custom operators to perform tasks such as staging the data, filling the data warehouse, and running checks on the data as the final step.\n\n### Configuring the DAG\nI've added `default parameters` according to these guidelines:\n\n* The DAG does not have dependencies on past runs\n* On failure, tasks are retried 3 times\n* Retries happen every 5 minutes\n* Catchup is turned off\n* Do not email on retry\n\u003cbr\u003e\nIn addition, I configured the task dependencies so that the graph view follows the flow shown in the image below.\n\n![image-dag](images/dag.png)\n\n## Building the operators\nI built four different operators that stage the data, transform the data, and run checks on data quality.\u003cbr\u003e\nAll of the operators and task instances run SQL statements against the Redshift database. \n\n### Stage Operator\nThe stage operator loads JSON-formatted files from S3 into Amazon Redshift. The operator creates and runs a SQL COPY statement based on the parameters provided. The operator's parameters specify where in S3 the file is loaded from and which table is the target.\n\n### Fact and Dimension Operators\nWith the dimension and fact operators, I use the SQL helper class to run data transformations. Most of the logic is within the SQL transformations, and the operator is expected to take as input a SQL statement and the target database on which to run the query.\n\n### Data Quality Operator\nThe final operator is used to run checks on the data itself. The operator's main functionality is to receive one or more SQL-based test cases along with the expected results and execute the tests. \n\u003cbr\u003e\n\u003chr\u003e\n\n# Datasets\nThe source data resides in S3 and needs to be processed in Sparkify's data warehouse in Amazon Redshift. 
The source datasets consist of JSON logs that tell about user activity in the application and JSON metadata about the songs the users listen to.\n\n### Song dataset:\nIt's a subset of real data from the Million Song Dataset. Each file is in JSON format and contains metadata about a song and the artist of that song. \n\n```\n{\n    \"num_songs\":1,\n    \"artist_id\":\"ARD7TVE1187B99BFB1\",\n    \"artist_latitude\":null,\n    \"artist_longitude\":null,\n    \"artist_location\":\"California - LA\",\n    \"artist_name\":\"Casual\",\n    \"song_id\":\"SOMZWCG12A8C13C480\",\n    \"title\":\"I Didn't Mean To\",\n    \"duration\":218.93179,\n    \"year\":0\n }\n ```\n\n### Log dataset:\n\nIt consists of log files in JSON format generated by this event simulator based on the songs in the dataset above. These simulate activity logs from a music streaming app based on specified configurations.\n\n```\n{\n   \"artist\":null,\n   \"auth\":\"Logged In\",\n   \"firstName\":\"Walter\",\n   \"gender\":\"M\",\n   \"itemInSession\":0,\n   \"lastName\":\"Frye\",\n   \"length\":null,\n   \"level\":\"free\",\n   \"location\":\"San Francisco-Oakland-Hayward, CA\",\n   \"method\":\"GET\",\n   \"page\":\"Home\",\n   \"registration\":1540919166796.0,\n   \"sessionId\":38,\n   \"song\":null,\n   \"status\":200,\n   \"ts\":1541105830796,\n   \"userAgent\":\"\\\"Mozilla\\/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit\\/537.36 (KHTML, like Gecko) Chrome\\/36.0.1985.143 Safari\\/537.36\\\"\",\n   \"userId\":\"39\"\n}\n```\n\n### Fact Table\n* **songplays** - records in log data associated with song plays i.e. 
records with page NextSong \u003cbr\u003e\n  *table: songplay_id, start_time, user_id, level, song_id, artist_id, session_id, location, user_agent*\n### Dimension Tables\n* **users** - users in the app\u003cbr\u003e\n  *table: user_id, first_name, last_name, gender, level*\n\n* **songs** - songs in music database\u003cbr\u003e\n  *table: song_id, title, artist_id, year, duration*\n\n* **artists** - artists in music database\u003cbr\u003e\n  *table: artist_id, name, location, latitude, longitude*\n\n* **time** - timestamps of records in songplays broken down into specific units\u003cbr\u003e\n  *table: start_time, hour, day, week, month, year, weekday*\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmartachesnova%2Fdata-pipelines--apacheairflow","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmartachesnova%2Fdata-pipelines--apacheairflow","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmartachesnova%2Fdata-pipelines--apacheairflow/lists"}