{"id":17311905,"url":"https://github.com/artempyanykh/data-zero-to-cloud","last_synced_at":"2025-03-27T01:14:31.706Z","repository":{"id":73010349,"uuid":"137591995","full_name":"artempyanykh/data-zero-to-cloud","owner":"artempyanykh","description":"Slides and code for my talk 'Data pipelines. From zero to cloud scale'","archived":false,"fork":false,"pushed_at":"2018-06-28T19:55:15.000Z","size":36407,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":2,"default_branch":"master","last_synced_at":"2025-02-01T06:27:37.182Z","etag":null,"topics":["data-engineering","data-processing","datawarehousing","etl","google-cloud","google-dataflow"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/artempyanykh.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2018-06-16T15:50:19.000Z","updated_at":"2018-09-22T20:43:58.000Z","dependencies_parsed_at":null,"dependency_job_id":"6e164fe7-3bff-4dc8-9503-cb0dec0cd5aa","html_url":"https://github.com/artempyanykh/data-zero-to-cloud","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/artempyanykh%2Fdata-zero-to-cloud","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/artempyanykh%2Fdata-zero-to-cloud/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/artempyanykh%2Fdata-zero-to-cloud/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts
/GitHub/repositories/artempyanykh%2Fdata-zero-to-cloud/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/artempyanykh","download_url":"https://codeload.github.com/artempyanykh/data-zero-to-cloud/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":245761298,"owners_count":20667895,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-engineering","data-processing","datawarehousing","etl","google-cloud","google-dataflow"],"created_at":"2024-10-15T12:41:49.865Z","updated_at":"2025-03-27T01:14:31.688Z","avatar_url":"https://github.com/artempyanykh.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data pipelines. From zero to cloud\n\n## General overview\n\nHave you ever wondered how to level up your data processing game?\nIf you're transitioning from ad-hoc analytics and researching options, this might be a good starting point.\n\nThis project has two main modules:\n\n* `local`, which shows how to set up a simple data processing pipeline using Luigi, Python, Pandas and Postgres in no time.\n  Though simple, this approach can get you pretty far.\n* `cloud`, which illustrates how you can easily swap out:\n    1. local storage in favour of durable and distributed **Google Cloud Storage**,\n    2. local processing power in favour of scalable **Google Dataflow**,\n    3. 
local PostgreSQL database that you need to manage in favour of **BigQuery**, which has a familiar SQL interface but can process TBs of data without breaking a sweat and integrates nicely with GSuite accounts.\n\nThere is another module, `sampledata`, which is used to generate sample data.\nTo make it a bit more interesting, imagine that the data is from a car rental company called **DailyCar**.\nSpecifically, we have the following information (under `sampledata/generated/`):\n\n1. `users.csv` has information about registered clients of **DailyCar**.\n2. `cars.csv` has information about its fleet of cars.\n3. `rents.csv` contains a list of rentals: who rented which car, and when.\n4. `fines.csv` is pulled from a police database and helps us see all the fines (such as speeding fines) related to the company's cars.\n\nThe business would like to enrich the information about fines so that it can tell who was driving a specific car at a particular point in time.\nMore formally, we need to generate a table with the following fields (transposed):\n\n| column | data |\n| --- | --- |\n| fine_id | 1 |\n| fine_amount | 15 |\n| fine_registered_at | 2017-10-01 21:36:00 |\n| rent_id | 1 |\n| rented_on | 2017-10-01 |\n| car_id  | 3 |\n| car_reg_number | ks2888 |\n| car_make | bmw |\n| car_model | series_2 |\n| user_id | 3 |\n| user_name | Dumitru Matei |\n| user_passport_no | 482850738 |\n| user_birth_date | 1966-06-22 |\n| user_driving_permit_since | 1991-10-18 |\n\nWe'll demonstrate how to build an ETL pipeline around this problem in the `local` and `cloud` modules.\nAlso, feel free to tune parameters in `sampledata/generate.py` to get more or less data to work with.\n\n## Setup\n\nFirst, make sure you have `python 2.7`.\nThen, inside the project's root folder, execute the following commands to install the required packages:\n\n```bash\n$ pip install pipenv\n$ pipenv install --skip-lock\n```\n\n\nFor the `local` part you need to install **PostgreSQL** and create a database and a user, like 
this:\n\n```bash\n\u003e psql postgres\n=# create role dwh login password 'dwh';\n=# create database data_zero_to_cloud owner dwh;\n```\n\nFor the `cloud` part you need to obtain Google Cloud Service credentials and put them under `config/credentials.json`.\nDon't forget to update `config/config.ini` accordingly.\n\n## Run ETL\n\nTo run an ETL task use the following command:\n\n```bash\n$ ./run-luigi.py --local-scheduler --module=MODULE_NAME TASK_NAME --on=DATE\n```\n\nReplace `TASK_NAME` with the name of a defined task, like `ProcessFines`.\nThe `DATE` parameter can take any value (for our purposes the exact value doesn't matter much), for instance `2017-11-16`.\n`MODULE_NAME` can be either `local` or `cloud`.\n\nFor example:\n\n```bash\n$ ./run-luigi.py --local-scheduler --module=cloud ProcessFines --on=2017-11-16\n```\n\nIf you want to go really wild, change the `runner` parameter in `config.ini` to `DataflowRunner` and unleash the full power of the cloud, as it will run Apache Beam tasks using **Google Dataflow**.\n\n## Explore contents in Google Cloud Storage\n\nAfter you run a `cloud` ETL, you may want to see the result.\n\nIf you have a Google Cloud account and your own credentials, feel free to go to the web console.\nOtherwise, obtain the workshop host's credentials and use the `./shell.py` script to load an IPython session with some predefined functions, such as `gls` and `gcat`.\nAn example usage is below:\n\n```python\nIn [5]: gls('2017-11-15')\nOut[5]:\n[\u003cBlob: warehouse-in-gcs-store, 2017-11-15/cars.csv\u003e,\n \u003cBlob: warehouse-in-gcs-store, 2017-11-15/fines.csv\u003e,\n \u003cBlob: warehouse-in-gcs-store, 2017-11-15/rents.csv\u003e,\n \u003cBlob: warehouse-in-gcs-store, 2017-11-15/rich_fines/_SUCCESS\u003e,\n \u003cBlob: warehouse-in-gcs-store, 2017-11-15/rich_fines/data.csv-00000-of-00001\u003e,\n \u003cBlob: warehouse-in-gcs-store, 2017-11-15/users.csv\u003e]\n\nIn [6]: 
gcat('2017-11-15/cars.csv')\nid,make,model,reg_number\n1,nissan,murano,ko2116\n2,hyundai,solaris,ct8988\n3,bmw,series_2,ks2888\n\n\nIn [7]: gcat('2017-11-15/rich_fines/data.csv-00000-of-00001')\nfine_id,fine_amount,fine_registered_at,rent_id,rented_on,car_id,car_reg_number,car_make,car_model,user_id,user_name,user_passport_no,user_birth_date,user_driving_permit_since\n8,1,2017-10-03 09:09:00,7,2017-10-03,1,ko2116,nissan,murano,1,Cristina Ciobanu,547345952,1988-02-17,1991-02-27\n...\n```\n\n## Exercises\nPractice makes perfect, so if you'd like to go a little bit deeper, here are some ideas to try:\n\n1. Task `local.LoadRichFines` will not replace the contents of the table, which may not be desirable, especially if you run your ETL several times a day.\n   Try to implement a task that inherits from `luigi.contrib.postgres.CopyToTable` and disregards whether it was run before or not.\n2. Similarly, `cloud.LoadRichFines` won't replace a table in BigQuery. Try to fix this.\n3. There's a bit of boilerplate in `cloud.ProcessFines` with `Map`s and `CoGroupBy`s.\n   Try to implement a custom `Join` transform that does a SQL-style join on two `PCollection`s.\n   Example usage is:\n\n   ```python\n   ((rich_rents, fines)\n            | Join(\n                left_on=lambda x: (x['car_reg_number'], x['rented_on']),\n                right_on=lambda x: (x['car_reg_number'], x['registered_on'])))\n   ```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fartempyanykh%2Fdata-zero-to-cloud","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fartempyanykh%2Fdata-zero-to-cloud","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fartempyanykh%2Fdata-zero-to-cloud/lists"}