{"id":22357284,"url":"https://github.com/mdh266/airflowetl","last_synced_at":"2025-07-30T10:33:03.208Z","repository":{"id":93022639,"uuid":"101519760","full_name":"mdh266/AirflowETL","owner":"mdh266","description":"Blog post on ETL pipelines with Airflow","archived":false,"fork":false,"pushed_at":"2020-06-07T18:11:17.000Z","size":801,"stargazers_count":21,"open_issues_count":0,"forks_count":8,"subscribers_count":2,"default_branch":"master","last_synced_at":"2024-06-11T17:53:55.028Z","etag":null,"topics":["airflow","data-engineering","data-pipeline","database","etl","etl-pipeline","postgresql","python","schedule","sql"],"latest_commit_sha":null,"homepage":"http://michael-harmon.com/blog/AirflowETL.html","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mdh266.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null}},"created_at":"2017-08-26T23:54:55.000Z","updated_at":"2023-09-24T21:27:31.000Z","dependencies_parsed_at":"2023-03-12T08:30:43.790Z","dependency_job_id":null,"html_url":"https://github.com/mdh266/AirflowETL","commit_stats":{"total_commits":8,"total_committers":1,"mean_commits":8.0,"dds":0.0,"last_synced_commit":"597a3db630dbd007d9027b275a06201017b6727f"},"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdh266%2FAirflowETL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdh266%2FAirflowETL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mdh266%2FAirflowETL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repo
sitories/mdh266%2FAirflowETL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mdh266","download_url":"https://codeload.github.com/mdh266/AirflowETL/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":228124574,"owners_count":17873170,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["airflow","data-engineering","data-pipeline","database","etl","etl-pipeline","postgresql","python","schedule","sql"],"created_at":"2024-12-04T14:13:46.370Z","updated_at":"2024-12-04T14:13:47.758Z","avatar_url":"https://github.com/mdh266.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# An Example ETL Pipeline With Airflow\n\nIn this blog post I want to go over the data engineering operations known as Extract, Transform, Load (ETL) and show how they can be automated and scheduled using \u003ca href=\"https://airflow.incubator.apache.org/\"\u003eApache Airflow\u003c/a\u003e. You can see the source code for this project \u003ca href=\"https://github.com/mdh266/AirflowDataPipeline\"\u003ehere\u003c/a\u003e.\n\n\n*Extracting* data can be done in a multitude of ways, but one of the most common is to query a \u003ca href=\"https://en.wikipedia.org/wiki/Web_API\"\u003eweb API\u003c/a\u003e.  If the query is successful, we receive data back from the API's server. Often the data we get back is in the form of \u003ca href=\"https://en.wikipedia.org/wiki/JSON\"\u003eJSON\u003c/a\u003e.  
JSON can pretty much be thought of as semi-structured data, or as a dictionary where the dictionary keys and values are strings.  Since the data is a dictionary of strings, we must *transform* it before storing or *loading* it into a database. Airflow is a platform to schedule and monitor workflows, and in this post I will show you how to use it to extract the daily weather in New York from the \u003ca href=\"https://openweathermap.org/api\"\u003eOpenWeatherMap\u003c/a\u003e API, convert the temperature to Celsius, and load the data into a simple \u003ca href=\"https://www.postgresql.org/\"\u003ePostgreSQL\u003c/a\u003e database.\n\n\n## Requirements\n\n\u003ca href=\"https://airflow.incubator.apache.org/\"\u003eAirflow\u003c/a\u003e\n\n\u003ca href=\"https://www.python.org/\"\u003ePython 2.7\u003c/a\u003e\n\n\u003ca href=\"https://www.postgresql.org/\"\u003ePostgreSQL\u003c/a\u003e\n\n\u003ca href=\"http://initd.org/psycopg/\"\u003epsycopg2\u003c/a\u003e\n\n\u003ca href=\"https://www.sqlalchemy.org/\"\u003eSQLAlchemy\u003c/a\u003e\n\n\u003ca href=\"https://sqlalchemy-utils.readthedocs.io/en/latest/\"\u003eSQLAlchemy-Utils\u003c/a\u003e\n\nTo install the requirements (except for Python and PostgreSQL), type:\n\n\tpip install -r requirements.t\n\nYou can see the actual blog post \u003ca href=\"http://michael-harmon.com/blog/AirflowETL.html\"\u003ehere\u003c/a\u003e.","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdh266%2Fairflowetl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmdh266%2Fairflowetl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmdh266%2Fairflowetl/lists"}