{"id":19989196,"url":"https://github.com/clamytoe/de_capstone","last_synced_at":"2025-10-09T15:12:24.044Z","repository":{"id":233382474,"uuid":"621040267","full_name":"clamytoe/de_capstone","owner":"clamytoe","description":"Data Engineer Zoomcamp: Capstone Project","archived":false,"fork":false,"pushed_at":"2023-04-19T01:12:10.000Z","size":1554,"stargazers_count":2,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-01T21:48:16.784Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/clamytoe.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null}},"created_at":"2023-03-29T21:50:00.000Z","updated_at":"2023-04-21T10:51:13.000Z","dependencies_parsed_at":"2024-04-16T02:03:32.275Z","dependency_job_id":null,"html_url":"https://github.com/clamytoe/de_capstone","commit_stats":null,"previous_names":["clamytoe/de_capstone"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/clamytoe/de_capstone","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clamytoe%2Fde_capstone","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clamytoe%2Fde_capstone/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clamytoe%2Fde_capstone/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clamytoe%2Fde_capstone/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/clamytoe","download_url":"https://codeload.github.com/clamytoe/
de_capstone/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/clamytoe%2Fde_capstone/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":272847013,"owners_count":25003114,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-08-30T02:00:09.474Z","response_time":77,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-13T04:45:47.370Z","updated_at":"2025-10-09T15:12:19.015Z","avatar_url":"https://github.com/clamytoe.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Engineer Zoomcamp: Capstone Project (*de_capstone*)\r\n\r\n\u003e *This is an ETL pipeline created for my capstone project as part of the DE Zoomcamp course.*\r\n\r\n![Python version][python-version]\r\n![Latest version][latest-version]\r\n[![GitHub issues][issues-image]][issues-url]\r\n[![GitHub forks][fork-image]][fork-url]\r\n[![GitHub Stars][stars-image]][stars-url]\r\n[![License][license-image]][license-url]\r\n\r\nNOTE: This project template was generated with [Cookiecutter](https://github.com/audreyr/cookiecutter) using my [toepack](https://github.com/clamytoe/toepack) project template.\r\n\r\nFor this project I decided to go with a streaming datasource. 
The goal was to capture near real-time cryptocurrency data and analyze how it fluctuates throughout the day. Here are some details about the data:\r\n\r\n* source: [coincap.io](https://coincap.io/)\r\n* coins: **Top 100** ranking coins\r\n* interval: Captured **every minute**\r\n* storage: Google **BigQuery**\r\n* partition: By day using **timestamp** field\r\n* cluster: By **id** field\r\n\r\nWhile trying to figure out how to approach this problem, I discovered that I could skip uploading the dataset files into Google Cloud Storage and add the rows directly into my database table on BigQuery with the use of `bigquery_insert_stream` from the `bigquery` module found in the `prefect_gcp` package.\r\n\r\nAlthough it took me a while to get it to work right, the approach is relatively straightforward... well, once you know how to do it.\r\n\r\n\u003e **NOTE:** I'd like to thank [coincap](https://coincap.io/) for providing the service for free! They are pretty generous with their free tier, which I greatly appreciate.\r\n\r\n## Initial setup\r\n\r\n```zsh\r\ncd Projects\r\ngit clone https://github.com/clamytoe/de_capstone.git\r\ncd de_capstone\r\n```\r\n\r\n### Anaconda setup\r\n\r\nIf you are an Anaconda user, this command will get you up to speed with the base installation.\r\n\r\n```zsh\r\nconda env create\r\nconda activate de_capstone\r\n```\r\n\r\n### Regular Python setup\r\n\r\nIf you are just using normal Python, this will get you ready, but I highly recommend that you do this in a virtual environment.\r\nThere are many ways to do this, the simplest being *venv*.\r\n\r\n```zsh\r\npython3 -m venv venv\r\nsource venv/bin/activate\r\npip install -r requirements.txt\r\n```\r\n\r\n## Start prefect server\r\n\r\nStart the Orion server first.\r\n\r\n*terminal 1*:\r\n\r\n```bash\r\nprefect server start\r\n\r\n ___ ___ ___ ___ ___ ___ _____\r\n| _ \\ _ \\ __| __| __/ __|_   _|\r\n|  _/   / _|| _|| _| (__  | |\r\n|_| |_|_\\___|_| |___\\___| 
|_|\r\n\r\nConfigure Prefect to communicate with the server with:\r\n\r\n    prefect config set PREFECT_API_URL=http://127.0.0.1:4200/api\r\n\r\nView the API reference documentation at http://127.0.0.1:4200/docs\r\n\r\nCheck out the dashboard at http://127.0.0.1:4200\r\n```\r\n\r\nOpen up your browser to [http://127.0.0.1:4200](http://127.0.0.1:4200) to get started.\r\n\r\n## Creating the required Blocks\r\n\r\nFor this simple deployment only the GCP Credentials block is needed. When you generate your API key on Google, simply paste its contents into the *Service Account Info* text field.\r\n\r\n![gcp-creds](images/gcp-creds.png)\r\n\r\n## Notifications\r\n\r\nThis step is optional, but I set up a Slack notification for any failed runs.\r\n\r\n![slack](images/slack.png)\r\n\r\n## Deployment\r\n\r\nTo create the deployment, run the following command:\r\n\r\n```bash\r\nprefect deployment build flows/bq_flow.py:etl_api_to_bq -n \"Decap BQ ETL\"\r\n```\r\n\r\nIt will generate the `etl_api_to_bq-deployment.yaml` configuration file. You will need to fill in the *parameters* field.\r\n\r\nFor this deployment, only the `url` is required.\r\n\r\n```yaml\r\nparameters: { \"url\": \"https://api.coincap.io/v2/assets\" }\r\n```\r\n\r\nOnce the parameters field is set, you can apply the deployment:\r\n\r\n```bash\r\nprefect deployment apply etl_api_to_bq-deployment.yaml\r\n```\r\n\r\nYou can confirm its creation in Orion:\r\n\r\n![deployment](images/deployment.png)\r\n\r\n## Schedule\r\n\r\nCoincap's free tier allows up to 200 calls per hour, so I decided to stay well below that and keep it at once per minute. 
For this project I decided to go with the Cron scheduler:\r\n\r\n![cron](images/cron.png)\r\n\r\nOnce set, your deployment details should look something like this:\r\n\r\n![details](images/deployment-details.png)\r\n\r\n## Start the data ETL pipeline\r\n\r\nAs soon as the schedule is set, Prefect will start scheduling flow runs for the deployment. To kick them off, all you have to do is start the default worker agent.\r\n\r\n*terminal 2*:\r\n\r\n```bash\r\nprefect agent start -q 'default'\r\nStarting v2.8.6 agent connected to http://127.0.0.1:4200/api...\r\n\r\n  ___ ___ ___ ___ ___ ___ _____     _   ___ ___ _  _ _____\r\n | _ \\ _ \\ __| __| __/ __|_   _|   /_\\ / __| __| \\| |_   _|\r\n |  _/   / _|| _|| _| (__  | |    / _ \\ (_ | _|| .` | | |\r\n |_| |_|_\\___|_| |___\\___| |_|   /_/ \\_\\___|___|_|\\_| |_|\r\n\r\n\r\nAgent started! Looking for work from queue(s): default...\r\n06:19:55.567 | INFO    | prefect.agent - Submitting flow run '0b89d4f7-f289-4ff0-bb69-2e5a1b486e33'\r\n```\r\n\r\nAs soon as the agent is started, it will start kicking off your flows.\r\n\r\n## Flows\r\n\r\nThe script runs two flows. The first is the ETL pipeline process; the second, which inserts the data into BigQuery, is its own independent flow.\r\n\r\n*etl:*\r\n\r\n![flow-etl](images/flow-etl.png)\r\n\r\n*bq-insert:*\r\n\r\n![flow-insert](images/flow-insert-to-bq.png)\r\n\r\nAfter running the pipeline over the weekend, it looked like this:\r\n\r\n![flow-runs](images/flow-runs.png)\r\n\r\n## Google BigQuery\r\n\r\nHead on over to BigQuery to verify that your data is being uploaded.\r\n\r\n![bq](images/bq.png)\r\n\r\n## Partition and Cluster table\r\n\r\nI wanted to partition my table by day using the `timestamp` field and cluster it by the `id` field. 
In order to do that, I first stopped Prefect so that it would not keep pushing any more data.\r\n\r\nI then used the following SQL commands to prepare the table:\r\n\r\n```sql\r\n-- Create a new partitioned table\r\nCREATE TABLE dtc-de-course-374214.crypto_decap.crypto_coins\r\nPARTITION BY DATE(timestamp)\r\nCLUSTER BY id\r\nAS SELECT *\r\nFROM dtc-de-course-374214.crypto_decap.coins\r\nWHERE 1 = 0; -- this will create an empty table with the same schema as coins\r\n\r\n-- Copy data from the original table to the new partitioned table\r\nINSERT INTO dtc-de-course-374214.crypto_decap.crypto_coins\r\nSELECT *\r\nFROM dtc-de-course-374214.crypto_decap.coins;\r\n\r\n-- Verify that the data has been successfully copied to the new partitioned table\r\nSELECT COUNT(*) FROM dtc-de-course-374214.crypto_decap.crypto_coins;\r\n\r\n-- Delete the original table\r\nDROP TABLE dtc-de-course-374214.crypto_decap.coins;\r\n\r\n-- Rename the new table to the old name\r\nALTER TABLE dtc-de-course-374214.crypto_decap.crypto_coins RENAME TO coins;\r\n```\r\n\r\n\u003e **NOTE:** The project name is specific to my account; yours will be different.\r\n\r\nOnce you have created the new table, you can verify that it is partitioned and clustered by looking at its details:\r\n\r\n![table](images/table.png)\r\n\r\n## dbt\r\n\r\nNow that the data has been collected for a while, it's time to start creating some tables from it. For this I used [dbt](https://www.getdbt.com/), and my repo for that portion can be found here: [dbt_crypto](https://github.com/clamytoe/dbt_crypto)\r\n\r\n### Local dbt setup\r\n\r\nI created a local deployment of dbt as a Docker container. 
If you would like to see how that is done, head on over to my repo [dbt_crypto_local](https://github.com/clamytoe/dbt_crypto_local)\r\n\r\n### dbt cloud\r\n\r\nThe following was created on the [dbt cloud](https://cloud.getdbt.com/) platform.\r\n\r\nThe lineage graph for this portion of the project looks like this:\r\n\r\n![dbt-graph](images/dbt-line-graph.png)\r\n\r\n![dbt-forecast](images/dbt-forecast.png)\r\n\r\nI've created a deployment for the dbt project and scheduled it to run once every hour.\r\n\r\n![dbt-schedule](images/dbt-schedule.png)\r\n\r\n### Results from dbt\r\n\r\nHere are the final view and tables after running dbt on BigQuery:\r\n\r\n![dbt-results](images/dbt-results.png)\r\n\r\n### Documentation\r\n\r\nI enabled documentation and autogenerated the docs.\r\n\r\n![dbt-docs](images/dbt-docs.png)\r\n\r\n## Dashboard\r\n\r\nTo generate the dashboard I used [Metabase](https://www.metabase.com/). Their interface is slick and easy to use. You can run a local containerized version with the following command:\r\n\r\n```bash\r\ndocker run -d -p 3000:3000 metabase/metabase\r\n6e79400f5d3b8836f120e774f40a0e7206dae2c394785fcaf9d5b8fdd08150dd\r\n```\r\n\r\nThe interface will now be available at: [localhost:3000](http://localhost:3000)\r\n\r\n![metabase](images/metabase.png)\r\n\r\nYou will have to configure it to connect to your account on whichever cloud platform you are using.\r\n\r\n![metabase-setup](images/metabase-setup.png)\r\n\r\nOnce you have successfully connected, you can start playing around creating dashboards.\r\n\r\n![dashboard](images/dashboard.png)\r\n\r\nOnce more data has been collected, more meaningful charts can be created.\r\n\r\n![tracker](images/tracker.png)\r\n\r\nBitcoin Distribution\r\n\r\n![btc-dist](images/btc-dist.png)\r\n\r\n## Google Looker Studio\r\n\r\nI also went ahead and created a quick dashboard on Looker Studio.\r\n\r\n![trends](images/trends.png)\r\n\r\n## License\r\n\r\nDistributed under the terms of the 
[MIT](https://opensource.org/licenses/MIT) license, \"de_capstone\" is free and open source software.\r\n\r\n## Issues\r\n\r\nIf you encounter any problems, please [file an issue](https://github.com/clamytoe/de_capstone/issues) along with a detailed description.\r\n\r\n[python-version]:https://img.shields.io/badge/python-3.10.9-brightgreen.svg\r\n[latest-version]:https://img.shields.io/badge/version-0.1.0-blue.svg\r\n[issues-image]:https://img.shields.io/github/issues/clamytoe/de_capstone.svg\r\n[issues-url]:https://github.com/clamytoe/de_capstone/issues\r\n[fork-image]:https://img.shields.io/github/forks/clamytoe/de_capstone.svg\r\n[fork-url]:https://github.com/clamytoe/de_capstone/network\r\n[stars-image]:https://img.shields.io/github/stars/clamytoe/de_capstone.svg\r\n[stars-url]:https://github.com/clamytoe/de_capstone/stargazers\r\n[license-image]:https://img.shields.io/github/license/clamytoe/de_capstone.svg\r\n[license-url]:https://github.com/clamytoe/de_capstone/blob/main/LICENSE\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclamytoe%2Fde_capstone","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fclamytoe%2Fde_capstone","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fclamytoe%2Fde_capstone/lists"}