{"id":26371368,"url":"https://github.com/alessine/data-engineering-zoomcamp","last_synced_at":"2026-04-16T12:04:58.465Z","repository":{"id":273923380,"uuid":"913962759","full_name":"Alessine/data-engineering-zoomcamp","owner":"Alessine","description":"Materials from the Data Engineering Zoomcamp 2025","archived":false,"fork":false,"pushed_at":"2025-04-28T15:06:03.000Z","size":2525,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-28T16:23:52.114Z","etag":null,"topics":["bigquery","data-engineering","dbt","docker","kestra","spark"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Alessine.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-08T17:25:15.000Z","updated_at":"2025-04-28T15:06:06.000Z","dependencies_parsed_at":"2025-02-09T17:27:10.845Z","dependency_job_id":"16651761-d53e-424a-bde3-63ed972d5907","html_url":"https://github.com/Alessine/data-engineering-zoomcamp","commit_stats":null,"previous_names":["alessine/data-engineering-zoomcamp"],"tags_count":1,"template":false,"template_full_name":null,"purl":"pkg:github/Alessine/data-engineering-zoomcamp","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alessine%2Fdata-engineering-zoomcamp","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alessine%2Fdata-engineering-zoomcamp/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alessine%2Fdat
a-engineering-zoomcamp/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alessine%2Fdata-engineering-zoomcamp/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Alessine","download_url":"https://codeload.github.com/Alessine/data-engineering-zoomcamp/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Alessine%2Fdata-engineering-zoomcamp/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":31884931,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-16T11:36:10.202Z","status":"ssl_error","status_checked_at":"2026-04-16T11:36:09.652Z","response_time":69,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bigquery","data-engineering","dbt","docker","kestra","spark"],"created_at":"2025-03-17T00:26:00.091Z","updated_at":"2026-04-16T12:04:58.457Z","avatar_url":"https://github.com/Alessine.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Engineering Zoomcamp Cohort 2025\r\n\r\nThis repo contains all my materials, notes and homework for the [Data Engineering Zoomcamp](https://github.com/DataTalksClub/data-engineering-zoomcamp). 
\r\n\r\n## Final Project\r\n\r\nFor the Final Project, I built an End-to-End Data Pipeline to extract, load, transform and visualize air quality data from the Open Data Catalog by the City of Zurich. \r\nYou can find all the related materials [here](https://github.com/Alessine/zurich_air_quality).\r\n\r\n## Learning in Public\r\n\r\nI published an article on Medium after completing each module to reflect on the contents. Here's an overview:\r\n\r\n- Module 1: [Docker, SQL, Terraform](https://medium.com/@angelaniederberger/learning-in-public-docker-gcp-and-terraform-e5282f6f9d1b)\r\n- Module 2: [Orchestration with Kestra](https://medium.com/@angelaniederberger/learning-in-public-orchestration-with-kestra-0ec485da063e)\r\n- Module 3: [Data Warehouse](https://medium.com/@angelaniederberger/learning-in-public-data-warehouse-and-bigquery-58ceb162edd4)\r\n- Module 4: [Analytics Engineering](https://medium.com/@angelaniederberger/learning-in-public-analytics-engineering-and-dbt-6e72358783ed)\r\n- Module 5: [Batch Processing and Spark](https://medium.com/@angelaniederberger/learning-in-public-batch-processing-and-spark-ec7e48addf8a)\r\n- Module 6: [Streaming with Kafka and Flink](https://medium.com/@angelaniederberger/learning-in-public-data-streaming-with-kafka-and-flink-5f84629525ee)\r\n\r\n## Homework\r\n\r\n### Module 1: Docker, SQL, Terraform\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 1. Understanding docker first run\u003c/b\u003e\u003c/summary\u003e\r\n\r\nRun docker with the python:3.12.8 image in an interactive mode, use the entrypoint bash. What's the version of pip in the image?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nIn bash: `docker run -it --entrypoint bash python:3.12.8` \r\nThe image will run locally. To check the version of pip: `pip --version`. It is version `24.3.1`.\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 2. 
Understanding Docker networking and docker-compose\u003c/b\u003e\u003c/summary\u003e\r\n\r\nGiven the following docker-compose.yaml, what is the hostname and port that pgadmin should use to connect to the postgres database?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nThe container name with the postgres database is `postgres`, located at port `5432`, so the answer is `postgres:5432`.\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 3. Trip Segmentation Count\u003c/b\u003e\u003c/summary\u003e\r\n\r\nDuring the period of October 1st 2019 (inclusive) and November 1st 2019 (exclusive), how many trips, respectively, happened:\r\n\r\n- Up to 1 mile\r\n- In between 1 (exclusive) and 3 miles (inclusive),\r\n- In between 3 (exclusive) and 7 miles (inclusive),\r\n- In between 7 (exclusive) and 10 miles (inclusive),\r\n- Over 10 miles\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n\tCASE\r\n\t\tWHEN TRIP_DISTANCE \u003c= 1 THEN '1: \u003c1'\r\n\t\tWHEN TRIP_DISTANCE \u003e 1\r\n\t\tAND TRIP_DISTANCE \u003c= 3 THEN '2: 1-3'\r\n\t\tWHEN TRIP_DISTANCE \u003e 3\r\n\t\tAND TRIP_DISTANCE \u003c= 7 THEN '3: 3-7'\r\n\t\tWHEN TRIP_DISTANCE \u003e 7\r\n\t\tAND TRIP_DISTANCE \u003c= 10 THEN '4: 7-10'\r\n\t\tWHEN TRIP_DISTANCE \u003e 10 THEN '5: 10+'\r\n\t\tELSE 'unknown'\r\n\tEND AS TRIP_DISTANCE_GROUP,\r\n\tCOUNT(*) AS TRIP_COUNT\r\nFROM\r\n\tGREEN_TAXI_TRIPS\r\nWHERE\r\n\tDATE_TRUNC('day', LPEP_DROPOFF_DATETIME) BETWEEN '2019-10-01' AND '2019-10-31'\r\nGROUP BY\r\n\tTRIP_DISTANCE_GROUP;\r\n```\r\nResult:\r\n\r\n![query result for question 3](./module_1/homework/hw1_q3.png)\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 4. Longest trip for each day\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWhich was the pick up day with the longest trip distance? 
Use the pick up time for your calculations.\r\n\r\nTip: For every day, we only care about one single trip with the longest distance.\r\n\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n\tDATE_TRUNC('day', LPEP_PICKUP_DATETIME) AS DATE,\r\n\tMAX(TRIP_DISTANCE) AS MAX_DISTANCE\r\nFROM\r\n\tGREEN_TAXI_TRIPS\r\nWHERE\r\n\tDATE_TRUNC('day', LPEP_PICKUP_DATETIME) BETWEEN '2019-10-01' AND '2019-10-31'\r\nGROUP BY\r\n\tDATE_TRUNC('day', LPEP_PICKUP_DATETIME)\r\nORDER BY\r\n\tMAX_DISTANCE DESC\r\nLIMIT\r\n\t1;\r\n```\r\n\r\nResult:\r\n\r\n![query result for question 4](./module_1/homework/hw1_q4.png)\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 5. Three biggest pickup zones\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWhich were the top pickup locations with over 13,000 in total_amount (across all trips) for 2019-10-18?\r\n\r\nConsider only lpep_pickup_datetime when filtering by date.\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n\tZONES.\"Zone\",\r\n\tROUND(CAST(TOTAL_AMOUNT_PER_ZONE AS NUMERIC), 2) AS TOTAL_AMOUNT_PER_ZONE\r\nFROM\r\n\t(\r\n\t\tSELECT\r\n\t\t\t\"PULocationID\",\r\n\t\t\tSUM(TOTAL_AMOUNT) AS TOTAL_AMOUNT_PER_ZONE\r\n\t\tFROM\r\n\t\t\tGREEN_TAXI_TRIPS\r\n\t\tWHERE\r\n\t\t\tDATE_TRUNC('day', LPEP_PICKUP_DATETIME) = '2019-10-18'\r\n\t\tGROUP BY\r\n\t\t\t\"PULocationID\"\r\n\t) AS TOTAL_AMOUNT_AGG\r\n\tJOIN ZONES ON \"PULocationID\" = \"LocationID\"\r\nWHERE\r\n\tTOTAL_AMOUNT_PER_ZONE \u003e 13000;\r\n```\r\n\r\nResult:\r\n\r\n![query result for question 5](./module_1/homework/hw1_q5.png)\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 6. 
Largest tip\u003c/b\u003e\u003c/summary\u003e\r\n\r\nFor the passengers picked up in October 2019 in the zone name \"East Harlem North\" which was the drop off zone that had the largest tip?\r\n\r\nNote: it's tip, not trip\r\n\r\nWe need the name of the zone, not the ID.\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n\tDZ.\"Zone\" AS \"DOZone\",\r\n\tMAX(TRIPS.TIP_AMOUNT) AS MAX_TIP\r\nFROM\r\n\tGREEN_TAXI_TRIPS AS TRIPS\r\n\tLEFT JOIN ZONES AS PZ ON TRIPS.\"PULocationID\" = PZ.\"LocationID\"\r\n\tLEFT JOIN ZONES AS DZ ON TRIPS.\"DOLocationID\" = DZ.\"LocationID\"\r\nWHERE\r\n\tDATE_TRUNC('day', LPEP_PICKUP_DATETIME) BETWEEN '2019-10-01' AND '2019-10-31'\r\n\tAND PZ.\"Zone\" = 'East Harlem North'\r\nGROUP BY\r\n\tDZ.\"Zone\"\r\nORDER BY\r\n\tMAX_TIP DESC\r\nLIMIT\r\n\t1;\r\n```\r\n\r\nResult:\r\n\r\n![query result for question 6](./module_1/homework/hw1_q6.png)\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 7. Terraform Workflow\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWhich of the following sequences, respectively, describes the workflow for:\r\n\r\n- Downloading the provider plugins and setting up backend,\r\n- Generating proposed changes and auto-executing the plan\r\n- Removing all resources managed by Terraform\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nThe required file is [here](./module_1/homework/main.tf).\r\n\r\nThe bash commands for the described workflow are the following:\r\n\r\n```bash\r\n$ terraform init\r\n$ terraform apply -auto-approve\r\n$ terraform destroy\r\n```\r\n\r\n\u003c/details\u003e\r\n\r\n\r\n### Module 2: Orchestration with Kestra\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 1. File Size\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWithin the execution for `Yellow` Taxi data for the year `2020` and month `12`: what is the uncompressed file size (i.e. 
the output file `yellow_tripdata_2020-12.csv` of the extract task)?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nIn the GCS Bucket I can see that the uncompressed file size for the specified file is 128.3 MB.\r\n\r\n![file sizes in the GCS Bucket](./module_2/homework/hw2_q1.png)\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 2. Rendered Value\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWhat is the rendered value of the variable `file` when the inputs `taxi` is set to `green`, `year` is set to `2020`, and `month` is set to `04` during execution?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nThe variable `file` is defined as follows: `\"{{inputs.taxi}}_tripdata_{{inputs.year}}-{{inputs.month}}.csv\"`. When rendered with the specified inputs, this generates the value `green_tripdata_2020-04.csv`. This is also visible in the GCS Bucket: \r\n\r\n![file names in the GCS Bucket](./module_2/homework/hw2_q2.png)\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 3. Number of rows (yellow, 2020)\u003c/b\u003e\u003c/summary\u003e\r\n\r\nHow many rows are there for the `Yellow` Taxi data for all CSV files in the year 2020?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n  COUNT(*) AS row_count\r\nFROM\r\n  `dez-2025.taxi_data.yellow_tripdata`\r\nWHERE\r\n  filename LIKE \"%2020%\";\r\n```\r\nResult:\r\n\r\n![query result for question 3](./module_2/homework/hw2_q3.png)\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 4. 
Number of rows (green, 2020)\u003c/b\u003e\u003c/summary\u003e\r\n\r\nHow many rows are there for the `Green` Taxi data for all CSV files in the year 2020?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n  COUNT(*) AS row_count\r\nFROM\r\n  `dez-2025.taxi_data.green_tripdata`\r\nWHERE\r\n  filename LIKE \"%2020%\";\r\n```\r\n\r\nResult:\r\n\r\n![query result for question 4](./module_2/homework/hw2_q4.png)\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 5. Number of rows (yellow, March 2021)\u003c/b\u003e\u003c/summary\u003e\r\n\r\nHow many rows are there for the `Yellow` Taxi data for the March 2021 CSV file?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n  COUNT(*) AS row_count\r\nFROM\r\n  `dez-2025.taxi_data.yellow_tripdata`\r\nWHERE\r\n  filename = \"yellow_tripdata_2021-03.csv\";\r\n```\r\n\r\nResult:\r\n\r\n![query result for question 5](./module_2/homework/hw2_q5.png)\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 6. 
Timezone for trigger\u003c/b\u003e\u003c/summary\u003e\r\n\r\nHow would you configure the timezone to New York in a Schedule trigger?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nIn the [Kestra Documentation on Schedule Triggers](https://kestra.io/docs/workflow-components/triggers/schedule-trigger) we can find the following information:\r\n\r\n![Screenshot from the kestra documentation](./module_2/homework/hw2_q6.png)\r\n\u003c/details\u003e\r\n\r\n\r\n### Module 3: Data Warehouse\r\n\r\nQueries used for setting up the tables:\r\n\r\nExternal Table:\r\n```SQL\r\nCREATE OR REPLACE EXTERNAL TABLE\r\n  `dez-2025.taxi_data.yellow_tripdata_2024_external` \r\nOPTIONS ( \r\n  format = 'PARQUET',\r\n  uris = ['gs://taxi-data-files/yellow_tripdata_2024-*.parquet'] \r\n  );\r\n```\r\n\r\nMaterialized Table:\r\n```SQL\r\nCREATE OR REPLACE TABLE\r\n  `dez-2025.taxi_data.yellow_tripdata_2024` AS (\r\n  SELECT\r\n    *\r\n  FROM\r\n    `dez-2025.taxi_data.yellow_tripdata_2024_external`);\r\n```\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 1. Count of records for the 2024 Yellow Taxi Data\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWhat is the count of records for the 2024 Yellow Taxi Data?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n  COUNT(*) AS row_count\r\nFROM\r\n  `dez-2025.taxi_data.yellow_tripdata_2024`;\r\n```\r\n\r\nResult:\r\n\r\n![number of rows in the dataset](./module_3/homework/hw3_q1.png)\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 2. 
Estimated amount of data\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWrite a query to count the distinct number of PULocationIDs for the entire dataset on both the tables.\r\nWhat is the estimated amount of data that will be read when this query is executed on the External Table and the Table?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n  COUNT(DISTINCT PULocationID) AS pu_location_count\r\nFROM\r\n  `dez-2025.taxi_data.yellow_tripdata_2024`;\r\n```\r\n\r\nResult:\r\n\r\nAs shown in the screenshots below, BigQuery gives an accurate estimate for the materialized table (155 MB), but cannot generate an estimate for the data processed when querying the external table, because the data is not stored in BigQuery.\r\n\r\nMaterialized table:\r\n\r\n![estimated bytes processed - materialized table](./module_3/homework/hw3_q2_1.png)\r\n\r\nExternal table:\r\n\r\n![estimated bytes processed - external table](./module_3/homework/hw3_q2_2.png)\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 3. Why are the estimated number of Bytes different?\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWrite a query to retrieve the PULocationID from the table (not the external table) in BigQuery. Now write a query to retrieve the PULocationID and DOLocationID on the same table. Why are the estimated number of Bytes different?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nData storage in BigQuery is columnar (rather than row-oriented). This means that querying additional columns adds to the data volume. BigQuery only needs to retrieve the rows from those columns that are explicitly selected.\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 4. 
How many records have a fare_amount of 0?\u003c/b\u003e\u003c/summary\u003e\r\n\r\nHow many records have a fare_amount of 0?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n  COUNTIF(fare_amount = 0) AS trips_without_fare,\r\n  COUNT(*) AS all_trips,\r\n  COUNTIF(fare_amount = 0) / COUNT(*) * 100 AS perc_without_fare\r\nFROM\r\n  `dez-2025.taxi_data.yellow_tripdata_2024`;\r\n```\r\n\r\nResult:\r\n\r\n![amount of trips without a fare](./module_3/homework/hw3_q4.png)\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 5. The best strategy to make an optimized table in Big Query\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWhat is the best strategy to make an optimized table in Big Query if your query will always filter based on tpep_dropoff_datetime and order the results by VendorID (Create a new table with this strategy)\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nPartitioning can reduce the bytes processed when filtering on the partitioned column. Clustering orders the records by the selected column. Therefore, for the type of queries described, it would be most appropriate to partition by `tpep_dropoff_datetime` and cluster on the `VendorID`.\r\n\r\nQuery:\r\n```SQL\r\nCREATE OR REPLACE TABLE\r\n  `dez-2025.taxi_data.yellow_tripdata_2024_partitioned_clustered`\r\nPARTITION BY\r\n  TIMESTAMP_TRUNC(tpep_dropoff_datetime, DAY)\r\nCLUSTER BY\r\n  VendorID AS (\r\n  SELECT\r\n    *\r\n  FROM\r\n    `dez-2025.taxi_data.yellow_tripdata_2024`);\r\n```\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 6. Estimated processed bytes\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWrite a query to retrieve the distinct VendorIDs between tpep_dropoff_datetime 2024-03-01 and 2024-03-15 (inclusive)\r\n\r\nUse the materialized table you created earlier in your from clause and note the estimated bytes. 
Now change the table in the from clause to the partitioned table you created for question 5 and note the estimated bytes processed. What are these values? \r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n  DISTINCT VendorID\r\nFROM\r\n  `dez-2025.taxi_data.yellow_tripdata_2024_partitioned_clustered`\r\nWHERE\r\n  tpep_dropoff_datetime BETWEEN '2024-03-01'\r\n  AND '2024-03-15'; \r\n```\r\n\r\nResult:\r\n\r\nWithout partitioning:\r\n\r\n![estimated bytes without partitioning](./module_3/homework/hw3_q6_1.png)\r\n\r\nWith partitioning:\r\n\r\n![estimated bytes with partitioning](./module_3/homework/hw3_q6_2.png)\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 7. Where is the data for external tables stored?\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWhere is the data stored in the External Table you created?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nThe data is stored in the parquet files in the GCS Bucket. For external tables, BigQuery only provides the interface to explore the data.\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 8. Always clustering\u003c/b\u003e\u003c/summary\u003e\r\n\r\nIt is best practice in Big Query to always cluster your data.\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nFalse.\r\n\r\nClustering can help improve especially filter and aggregate queries. Clusters are particularly helpful for columns with high cardinality (many distinct values). However, they also need to be maintained (via automatic re-clustering) and if the amount of data is small (\u003c 1 GB) it is not advisable to cluster the table.\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 9. Bytes read in SELECT COUNT(*)\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWrite a SELECT count(*) query FROM the materialized table you created. How many bytes does it estimate will be read? 
Why?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nThe estimate is 0, because the result of this query was cached when I previously ran it for question 1.\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n  COUNT(*)\r\nFROM\r\n  `dez-2025.taxi_data.yellow_tripdata_2024` ;\r\n```\r\n\r\nResult:\r\n\r\n![estimated bytes queried](./module_3/homework/hw3_q9.png)\r\n\r\n\u003c/details\u003e\r\n\r\n### Module 4: Analytics Engineering\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 1. Understanding dbt model resolution\u003c/b\u003e\u003c/summary\u003e\r\n\r\nProvided you've got the following sources.yaml\r\n\r\n```\r\nversion: 2\r\n\r\nsources:\r\n  - name: raw_nyc_tripdata\r\n    database: \"{{ env_var('DBT_BIGQUERY_PROJECT', 'dtc_zoomcamp_2025') }}\"\r\n    schema:   \"{{ env_var('DBT_BIGQUERY_SOURCE_DATASET', 'raw_nyc_tripdata') }}\"\r\n    tables:\r\n      - name: ext_green_taxi\r\n      - name: ext_yellow_taxi\r\n```\r\n\r\nwith the following env variables setup where `dbt` runs:\r\n\r\n```\r\nexport DBT_BIGQUERY_PROJECT=myproject\r\nexport DBT_BIGQUERY_DATASET=my_nyc_tripdata\r\n```\r\n\r\nWhat does this .sql model compile to?\r\n\r\n```SQL\r\nselect * \r\nfrom {{ source('raw_nyc_tripdata', 'ext_green_taxi' ) }}\r\n```\r\n\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nSince the environment variables take precedence over the default value, the model would compile to:\r\n\r\n```SQL\r\nselect *\r\nfrom myproject.raw_nyc_tripdata.ext_green_taxi\r\n```\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 2. 
dbt Variables \u0026 Dynamic Models\u003c/b\u003e\u003c/summary\u003e\r\n\r\nSay you have to modify the following dbt_model (`fct_recent_taxi_trips.sql`) to enable Analytics Engineers to dynamically control the date range.\r\n\r\n\u003cul\u003e\r\n    \u003cli\u003eIn development, you want to process only the last 7 days of trips\u003c/li\u003e\r\n    \u003cli\u003eIn production, you need to process the last 30 days for analytics\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\n```SQL\r\nselect *\r\nfrom {{ ref('fact_taxi_trips') }}\r\nwhere pickup_datetime \u003e= CURRENT_DATE - INTERVAL '30' DAY\r\n```\r\n\r\nWhat would you change to accomplish that in a such way that command line arguments takes precedence over ENV_VARs, which takes precedence over DEFAULT value?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\nThe variables would need to be nested inside the Jinja macro in the following way: `{{ CLI var(\"var\", ENV var(\"VAR\", default)) }}`. The correct answer is therefore:\r\n\r\nUpdate the WHERE clause to `pickup_datetime \u003e= CURRENT_DATE - INTERVAL '{{ var(\"days_back\", env_var(\"DAYS_BACK\", \"30\")) }}' DAY`.\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 3. dbt Data Lineage and Execution\u003c/b\u003e\u003c/summary\u003e\r\n\r\nConsidering the data lineage below and that `taxi_zone_lookup` is the only materialization build (from a .csv seed file):\r\n\r\n![data lineage diagram](./module_4/homework/homework_q2.png)\r\n\r\nSelect the option that does NOT apply for materializing `fct_taxi_monthly_zone_revenue`.\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\nOut of the given options, only the one that specifies the staging folder would not apply for materializing `fct_taxi_monthly_zone_revenue`, because the table `dim_zone_lookup` would not get built, since it is not downstream from the models in the staging folder. 
The correct answer is therefore:\r\n\r\n`dbt run --select models/staging/+`\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 4. dbt Macros and Jinja\u003c/b\u003e\u003c/summary\u003e\r\n\r\nConsider you're dealing with sensitive data (e.g.: PII), that is \u003cb\u003eonly available to your team and very selected few individuals\u003c/b\u003e, in the `raw layer` of your DWH (e.g: a specific BigQuery dataset or PostgreSQL schema),\r\n\r\n\u003cul\u003e\r\n    \u003cli\u003eAmong other things, you decide to obfuscate/masquerade that data through your staging models, and make it available in a different schema (a \u003ccode\u003estaging layer\u003c/code\u003e) for other Data/Analytics Engineers to explore\u003c/li\u003e\r\n    \u003cli\u003eAnd \u003cb\u003eoptionally\u003c/b\u003e, yet another layer (\u003ccode\u003eservice layer\u003c/code\u003e), where you'll build your dimension (\u003ccode\u003edim_\u003c/code\u003e) and fact (\u003ccode\u003efct_\u003c/code\u003e) tables (assuming the Star Schema dimensional modeling) for Dashboarding and for Tech Product Owners/Managers\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\nYou decide to make a macro to wrap a logic around it:\r\n\r\n```SQL\r\n{% macro resolve_schema_for(model_type) -%}\r\n\r\n    {%- set target_env_var = 'DBT_BIGQUERY_TARGET_DATASET'  -%}\r\n    {%- set stging_env_var = 'DBT_BIGQUERY_STAGING_DATASET' -%}\r\n\r\n    {%- if model_type == 'core' -%} {{- env_var(target_env_var) -}}\r\n    {%- else -%}                    {{- env_var(stging_env_var, env_var(target_env_var)) -}}\r\n    {%- endif -%}\r\n\r\n{%- endmacro %}\r\n```\r\n\r\nAnd use on your staging, dim_ and fact_ models as:\r\n\r\n```\r\n{{ config(\r\n    schema=resolve_schema_for('core'), \r\n) }}\r\n```\r\n\r\nThat all being said, regarding macro above, \u003cb\u003eselect all statements that are true to the models using it\u003c/b\u003e.\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nSince there is no 
default set for the environment variable `target_env_var`, it needs to be defined in the environment, otherwise the macro won't work. If this variable is set, then it will be used for any model that is defined as `core` (in this case `staging`, `dim_` and `fact_` models). All other models will use the value from `stging_env_var` and if undefined, will fall back to `target_env_var`.\r\n\r\nThe following statements are therefore true:\r\n\u003cul\u003e\r\n\t\u003cli\u003eSetting a value for \u003ccode\u003eDBT_BIGQUERY_TARGET_DATASET\u003c/code\u003e env var is mandatory, or it'll fail to compile\u003c/li\u003e\r\n\t\u003cli\u003eWhen using \u003ccode\u003ecore\u003c/code\u003e, it materializes in the dataset defined in \u003ccode\u003eDBT_BIGQUERY_TARGET_DATASET\u003c/code\u003e\u003c/li\u003e\r\n\t\u003cli\u003eWhen using \u003ccode\u003estg\u003c/code\u003e, it materializes in the dataset defined in \u003ccode\u003eDBT_BIGQUERY_STAGING_DATASET\u003c/code\u003e, or defaults to \u003ccode\u003eDBT_BIGQUERY_TARGET_DATASET\u003c/code\u003e\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 5. 
Taxi Quarterly Revenue Growth\u003c/b\u003e\u003c/summary\u003e\r\n\r\n\u003col\u003e\r\n    \u003cli\u003eCreate a new model \u003ccode\u003efct_taxi_trips_quarterly_revenue.sql\u003c/code\u003e\u003c/li\u003e\r\n    \u003cli\u003eCompute the Quarterly Revenues for each year for based on total_amount\u003c/li\u003e\r\n    \u003cli\u003eCompute the Quarterly YoY (Year-over-Year) revenue growth\u003c/li\u003e\r\n\u003c/ol\u003e\r\n\r\n\u003cul\u003e\r\n    \u003cli\u003ee.g.: In 2020/Q1, Green Taxi had -12.34% revenue growth compared to 2019/Q1\u003c/li\u003e\r\n    \u003cli\u003ee.g.: In 2020/Q4, Yellow Taxi had +34.56% revenue growth compared to 2019/Q4\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\nConsidering the YoY Growth in 2020, which were the yearly quarters with the best (or less worse) and worst results for green, and yellow\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nThe file with the new model for YoY Growth is [here](module_4/homework/models/core/fct_taxi_trips_quarterly_revenue.sql).\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n  *,\r\n  (SAFE_DIVIDE(quarterly_revenue, prev_year_revenue)-1)*100 AS yoy_growth\r\nFROM (\r\n  SELECT\r\n    year_quarter,\r\n    service_type,\r\n    SUM(total_amount) AS quarterly_revenue,\r\n    LAG(SUM(total_amount), 4) OVER(PARTITION BY service_type ORDER BY year_quarter) AS prev_year_revenue,\r\n  FROM\r\n    `dez-2025.taxi_data_prod.fact_trips`\r\n  WHERE\r\n    year IN (2019,\r\n      2020)\r\n  GROUP BY\r\n    service_type,\r\n    year_quarter)\r\n```\r\n\r\nThese are the best and worst results for green and yellow cabs:\r\n\u003cul\u003e\r\n\u003cli\u003egreen: {best: 2020/Q1, worst: 2020/Q2}, yellow: {best: 2020/Q1, worst: 2020/Q2}\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 6. 
P97/P95/P90 Taxi Monthly Fare\u003c/b\u003e\u003c/summary\u003e\r\n\r\n\u003col\u003e\r\n    \u003cli\u003eCreate a new model \u003ccode\u003efct_taxi_trips_monthly_fare_p95.sql\u003c/code\u003e\u003c/li\u003e\r\n    \u003cli\u003eFilter out invalid entries (fare_amount \u003e 0, trip_distance \u003e 0, and payment_type_description in ('Cash', 'Credit Card'))\u003c/li\u003e\r\n    \u003cli\u003eCompute the \u003cb\u003econtinuous percentile\u003c/b\u003e of \u003ccode\u003efare_amount\u003c/code\u003e partitioning by service_type, year and month\u003c/li\u003e\r\n\u003c/ol\u003e\r\n\r\nNow, what are the values of p97, p95, p90 for Green Taxi and Yellow Taxi, in April 2020?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nThe file with the new model for the `fare_amount` percentiles is [here](module_4/homework/models/core/fct_taxi_trips_monthly_fare_p95.sql).\r\n\r\nQuery:\r\n```SQL\r\nWITH\r\n  prep AS (\r\n  SELECT\r\n    service_type,\r\n    year,\r\n    month,\r\n    PERCENTILE_CONT(fare_amount, 0.97) OVER(PARTITION BY service_type, year, month) AS p97,\r\n    PERCENTILE_CONT(fare_amount, 0.95) OVER(PARTITION BY service_type, year, month) AS p95,\r\n    PERCENTILE_CONT(fare_amount, 0.90) OVER(PARTITION BY service_type, year, month) AS p90\r\n  FROM\r\n    `dez-2025.taxi_data_prod.fact_trips`\r\n  WHERE\r\n    fare_amount \u003e 0\r\n    AND trip_distance \u003e 0\r\n    AND payment_type_description IN ('Cash',\r\n      'Credit card'))\r\nSELECT\r\n  service_type,\r\n  year,\r\n  month,\r\n  MAX(p97) AS P97,\r\n  MAX(p95) AS P95,\r\n  MAX(p90) AS P90,\r\nFROM\r\n  prep\r\nWHERE\r\n  year = 2020\r\n  AND month = 4\r\nGROUP BY\r\n  service_type,\r\n  year,\r\n  month;\r\n```\r\n\r\nI'm getting the following output:\r\n![percentile query output](./module_4/homework/hw4_q6.png)\r\n\r\nThis is closest to this option from the homework assignment:\r\n\u003cul\u003e\r\n\u003cli\u003egreen: {p97: 55.0, p95: 45.0, p90: 26.5}, yellow: {p97: 31.5, p95: 25.5, p90: 
19.0}\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 7. Top #Nth longest P90 travel time Location for FHV\u003c/b\u003e\u003c/summary\u003e\r\n\r\nPrerequisites:\r\n\r\n\u003cul\u003e\r\n    \u003cli\u003eCreate a staging model for FHV Data (2019), and \u003cb\u003eDO NOT\u003c/b\u003e add a deduplication step, just filter out the entries \u003ccode\u003ewhere dispatching_base_num is not null\u003c/code\u003e\u003c/li\u003e\r\n    \u003cli\u003eCreate a core model for FHV Data (\u003ccode\u003edim_fhv_trips.sql\u003c/code\u003e) joining with \u003ccode\u003edim_zones\u003c/code\u003e.\u003c/li\u003e\r\n    \u003cli\u003eAdd some new dimensions \u003ccode\u003eyear\u003c/code\u003e (e.g.: 2019) and \u003ccode\u003emonth\u003c/code\u003e (e.g.: 1, 2, ..., 12), based on \u003ccode\u003epickup_datetime\u003c/code\u003e, to the core model to facilitate filtering for your queries\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\nNow...\r\n\r\n\u003col\u003e\r\n    \u003cli\u003eCreate a new model \u003ccode\u003efct_fhv_monthly_zone_traveltime_p90.sql\u003c/code\u003e\u003c/li\u003e\r\n\t\u003cli\u003eFor each record in \u003ccode\u003edim_fhv_trips.sql\u003c/code\u003e, compute the timestamp_diff in seconds between dropoff_datetime and pickup_datetime - we'll call it \u003ccode\u003etrip_duration\u003c/code\u003e for this exercise\u003c/li\u003e\r\n\t\u003cli\u003eCompute the \u003cb\u003econtinuous\u003c/b\u003e \u003ccode\u003ep90\u003c/code\u003e of \u003ccode\u003etrip_duration\u003c/code\u003e partitioning by year, month, pickup_location_id, and dropoff_location_id\u003c/li\u003e\r\n\u003c/ol\u003e\r\n\r\nFor the Trips that \u003cb\u003erespectively\u003c/b\u003e started from \u003ccode\u003eNewark Airport\u003c/code\u003e, \u003ccode\u003eSoHo\u003c/code\u003e, and \u003ccode\u003eYorkville East\u003c/code\u003e, in November 2019, what are \u003cb\u003edropoff_zones\u003c/b\u003e with the 2nd 
longest p90 trip_duration?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nThe file with the new model for the `P90` continuous percentiles is [here](module_4/homework/models/core/fct_fhv_monthly_zone_traveltime_p90.sql).\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n  *\r\nFROM (\r\n  SELECT\r\n    *,\r\n    ROW_NUMBER() OVER(PARTITION BY month, year, pickup_zone ORDER BY P90 DESC) AS row_number\r\n  FROM\r\n    `dez-2025.taxi_data_prod.fct_fhv_monthly_zone_traveltime_p90`\r\n  WHERE\r\n    year = 2019\r\n    AND month = 11\r\n    AND pickup_zone IN (\"Newark Airport\",\r\n      \"SoHo\",\r\n      \"Yorkville East\"))\r\nWHERE\r\n  row_number \u003c 3;\r\n```\r\n\r\nI'm getting the following output:\r\n![percentile query output](./module_4/homework/hw4_q7.png)\r\n\r\nTherefore, the correct answer is:\r\n\u003cul\u003e\r\n\u003cli\u003eLaGuardia Airport, Chinatown, Garment District\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\u003c/details\u003e\r\n\r\n### Module 5: Batch Processing and Spark\r\n\r\nThe code related to all these questions is in this [notebook](./module_5/homework/250309_homework.ipynb).\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 1. Install Spark and PySpark\u003c/b\u003e\u003c/summary\u003e\r\n\r\n- Install Spark\r\n- Run PySpark\r\n- Create a local Spark session\r\n- Execute spark.version\r\n\r\nWhat's the output?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nThe output is: `3.3.2`.\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 2. Yellow October 2024\u003c/b\u003e\u003c/summary\u003e\r\n\r\nRead the October 2024 Yellow Taxi Data into a Spark Dataframe. 
Repartition the Dataframe into 4 partitions and save it to parquet.\r\n\r\nWhat is the average size of the Parquet (ending with .parquet extension) Files that were created (in MB)?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nAll four files have a size of about 25.4 MB.\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 3. Count records\u003c/b\u003e\u003c/summary\u003e\r\n\r\nHow many taxi trips were there on the 15th of October? Consider only trips that started on the 15th of October.\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT \r\n    MIN(tpep_pickup_datetime) AS first_trip,\r\n    MAX(tpep_pickup_datetime) AS last_trip,\r\n    COUNT(*) trip_count\r\nFROM \r\n    yellow_taxis_oct_24\r\nWHERE\r\n    date(tpep_pickup_datetime) == '2024-10-15'\r\n```\r\n\r\nOutput:\r\n\r\n![trip count query output](./module_5/homework/hw5_q3.png)\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 4. Longest trip\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWhat is the length of the longest trip in the dataset in hours?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n    MAX(trip_duration) as max_trip_duration\r\n    FROM\r\n        (SELECT\r\n            TIMESTAMPDIFF(HOUR, tpep_pickup_datetime, tpep_dropoff_datetime) AS trip_duration\r\n        FROM\r\n            yellow_taxis_oct_24);\r\n```\r\n\r\nOutput:\r\n\r\n![longest trip query output](./module_5/homework/hw5_q4.png)\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 5. User Interface\u003c/b\u003e\u003c/summary\u003e\r\n\r\nSpark’s User Interface which shows the application's dashboard runs on which local port?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nIt runs on `localhost:4040`.\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 6. 
Least frequent pickup location zone\u003c/b\u003e\u003c/summary\u003e\r\n\r\nLoad the zone lookup data into a temp view in Spark:\r\n\r\n```bash\r\nwget https://d37ci6vzurychx.cloudfront.net/misc/taxi_zone_lookup.csv\r\n```\r\n\r\nUsing the zone lookup data and the Yellow October 2024 data, what is the name of the LEAST frequent pickup location Zone?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nQuery:\r\n```SQL\r\nSELECT\r\n    Zone,\r\n    COUNT(*) AS trip_count\r\n    FROM\r\n        yellow_taxis_zones_joined\r\n    GROUP BY\r\n        Zone\r\n    ORDER BY\r\n        trip_count\r\n    LIMIT 5;\r\n```\r\n\r\nOutput:\r\n\r\n![least frequent pickup zone query output](./module_5/homework/hw5_q6.png)\r\n\r\n\u003c/details\u003e\r\n\r\n\r\n### Module 6: Streaming with Kafka and Flink\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 1. Redpanda version\u003c/b\u003e\u003c/summary\u003e\r\n\r\nLet's find out the version of Redpanda. For that, check the output of the command `rpk help` inside the container. The name of the container is `redpanda-1`. Find out what you need to execute based on the `help` output.\r\n\r\nWhat's the version, based on the output of the command you executed? (copy the entire version)\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nWhen running `rpk --version` inside the Redpanda Docker container, I get the following output: `rpk version v24.2.18 (rev f9a22d4430)`.\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 2. Creating a topic\u003c/b\u003e\u003c/summary\u003e\r\n\r\nBefore we can send data to the Redpanda server, we need to create a topic. We also do this with the `rpk` command we used previously for figuring out the version of Redpanda. Read the output of `help` and based on it, create a topic with name `green-trips`.\r\n\r\nWhat's the output of the command for creating a topic? 
Include the entire output in your answer.\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nWhen I run `rpk topic create green-trips` I get the following output:\r\n\r\n\r\n|TOPIC |STATUS|\r\n|---|---|\r\n|green-trips |OK|\r\n\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 3. Connecting to the Kafka server\u003c/b\u003e\u003c/summary\u003e\r\n\r\nWe need to make sure we can connect to the server, so later we can send some data to its topics\r\n\r\nFirst, let's install the kafka connector (up to you if you want to have a separate virtual environment for that)\r\n\r\n```bash\r\npip install kafka-python\r\n```\r\n\r\nYou can start a jupyter notebook in your solution folder or create a script\r\n\r\nLet's try to connect to our server:\r\n\r\n```python\r\nimport json\r\n\r\nfrom kafka import KafkaProducer\r\n\r\ndef json_serializer(data):\r\n    return json.dumps(data).encode('utf-8')\r\n\r\nserver = 'localhost:9092'\r\n\r\nproducer = KafkaProducer(\r\n    bootstrap_servers=[server],\r\n    value_serializer=json_serializer\r\n)\r\n\r\nproducer.bootstrap_connected()\r\n```\r\n\r\nProvided that you can connect to the server, what's the output of the last command?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nWhen I run this code in a Jupyter Notebook, I get the output `True`.\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 4. Sending the Trip Data\u003c/b\u003e\u003c/summary\u003e\r\n\r\nNow we need to send the data to the green-trips topic. 
Read the data, and keep only these columns:\r\n\u003cul\u003e\r\n    \u003cli\u003e\u003ccode\u003e'lpep_pickup_datetime'\u003c/code\u003e,\u003c/li\u003e\r\n    \u003cli\u003e\u003ccode\u003e'lpep_dropoff_datetime'\u003c/code\u003e,\u003c/li\u003e\r\n    \u003cli\u003e\u003ccode\u003e'PULocationID'\u003c/code\u003e,\u003c/li\u003e\r\n    \u003cli\u003e\u003ccode\u003e'DOLocationID'\u003c/code\u003e,\u003c/li\u003e\r\n    \u003cli\u003e\u003ccode\u003e'passenger_count'\u003c/code\u003e,\u003c/li\u003e\r\n    \u003cli\u003e\u003ccode\u003e'trip_distance'\u003c/code\u003e,\u003c/li\u003e\r\n    \u003cli\u003e\u003ccode\u003e'tip_amount'\u003c/code\u003e\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\nNow send all the data using this code:\r\n\r\n```python\r\nproducer.send(topic_name, value=message)\r\n```\r\n\r\nFor each row (`message`) in the dataset. In this case, `message` is a dictionary.\r\n\r\nAfter sending all the messages, flush the data:\r\n\r\n```python\r\nproducer.flush()\r\n```\r\n\r\nUse `from time import time` to see the total time\r\n\r\n```python\r\nfrom time import time\r\n\r\nt0 = time()\r\n\r\n# ... your code\r\n\r\nt1 = time()\r\ntook = t1 - t0\r\n```\r\n\r\nHow much time did it take to send the entire dataset and flush?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\nSending the data took 33.7 seconds. The code can be accessed [here](./module_6/homework/250316_connecting_to_kafka.ipynb).\r\n\r\n\u003c/details\u003e\r\n\r\n\u003cdetails\u003e\u003csummary\u003e\u003cb\u003eQuestion 5. Build a Sessionization Window\u003c/b\u003e\u003c/summary\u003e\r\n\r\nNow we have the data in the Kafka stream. 
It's time to process it.\r\n\r\n\u003cul\u003e\r\n    \u003cli\u003eCopy \u003ccode\u003eaggregation_job.py\u003c/code\u003e and rename it to \u003ccode\u003esession_job.py\u003c/code\u003e\u003c/li\u003e\r\n    \u003cli\u003eHave it read from \u003ccode\u003egreen-trips\u003c/code\u003e fixing the schema\u003c/li\u003e\r\n    \u003cli\u003eUse a session window with a gap of 5 minutes\u003c/li\u003e\r\n    \u003cli\u003eUse \u003ccode\u003elpep_dropoff_datetime\u003c/code\u003e time as your watermark with a 5 second tolerance\u003c/li\u003e\r\n\u003c/ul\u003e\r\n\r\nWhich pickup and drop off locations have the longest unbroken streak of taxi trips?\r\n\r\n\u003cb\u003eAnswer:\u003c/b\u003e\r\n\r\n\r\n\r\n\u003c/details\u003e\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falessine%2Fdata-engineering-zoomcamp","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Falessine%2Fdata-engineering-zoomcamp","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Falessine%2Fdata-engineering-zoomcamp/lists"}