{"id":15157933,"url":"https://github.com/jorbush/data_engineer_technical_test","last_synced_at":"2026-01-21T14:02:19.892Z","repository":{"id":255696718,"uuid":"853413691","full_name":"jorbush/data_engineer_technical_test","owner":"jorbush","description":"technical test for a data engineer position","archived":false,"fork":false,"pushed_at":"2024-09-08T10:07:22.000Z","size":287,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-07T14:47:42.712Z","etag":null,"topics":["dbt","postgresql","python","sql"],"latest_commit_sha":null,"homepage":"","language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/jorbush.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-06T15:55:13.000Z","updated_at":"2024-09-08T10:07:54.000Z","dependencies_parsed_at":"2024-11-03T04:02:10.386Z","dependency_job_id":"7cdea61c-2343-46e3-a172-8256ddbefa64","html_url":"https://github.com/jorbush/data_engineer_technical_test","commit_stats":{"total_commits":12,"total_committers":1,"mean_commits":12.0,"dds":0.0,"last_synced_commit":"acd0be5bba0f2cdf3158b75a85d0c2030f9cf3ee"},"previous_names":["jorbush/data_engineer_technical_test"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/jorbush/data_engineer_technical_test","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jorbush%2Fdata_engineer_technical_test","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jorbush%2Fdata_engineer_technical_test/tags","releases_url":
"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jorbush%2Fdata_engineer_technical_test/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jorbush%2Fdata_engineer_technical_test/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/jorbush","download_url":"https://codeload.github.com/jorbush/data_engineer_technical_test/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/jorbush%2Fdata_engineer_technical_test/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28634786,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-21T04:47:28.174Z","status":"ssl_error","status_checked_at":"2026-01-21T04:47:22.943Z","response_time":86,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["dbt","postgresql","python","sql"],"created_at":"2024-09-26T20:20:42.894Z","updated_at":"2026-01-21T14:02:19.876Z","avatar_url":"https://github.com/jorbush.png","language":null,"funding_links":[],"categories":[],"sub_categories":[],"readme":"# Data Engineer Technical Test\n\n## Part 1: SQL and Query Optimization\n\n### Exercise 1: Writing Queries\n\n#### Database schema for a sales system:\n- **Customers** (customer_id, name, email, phone)\n- **Orders** (order_id, customer_id, order_date, total_amount)\n- 
**Order_Details** (order_detail_id, order_id, product_id, quantity, price)\n- **Products** (product_id, name, category, price)\n\n#### Instructions:\n1. Write an SQL query to obtain the total amount spent by each customer in the last 6 months. The query should return: `customer_id`, `name`, `total_spent`.\n2. Optimize the above query to ensure it is efficient, considering indexes and other optimization techniques.\n\n#### Delivery:\n- Include the SQL script for the initial query and the optimized version.\n- Briefly explain the optimizations you made and why.\n\n---\n\n### Solution for Exercise 1\n\nThe SQL script for the initial query is as follows (`first_solution.sql`):\n```sql\nSELECT c.customer_id, c.name, SUM(o.total_amount) AS total_spent\nFROM customers c\nJOIN orders o ON c.customer_id = o.customer_id\nWHERE o.order_date \u003e= (NOW() - INTERVAL '6 months')\nGROUP BY c.customer_id\nORDER BY total_spent DESC;\n```\nIn the above query, we are joining the `customers` and `orders` tables and filtering the orders placed in the last 6 months. 
We then group the results by `customer_id` and calculate the total amount spent by each customer, sorting the results by total amount spent in descending order to see the highest spenders first (although this is not necessary).\n\nThe orders of the last 6 months (`show_order_last_six_months.sql`):\n\n![Orders of the last 6 months](./images/show_order_last_six_months.png)\n\nThe result of the initial query:\n\n![Result of the initial query](./images/first_solution_result.png)\n\nTime taken to execute the initial query: **64ms**\n\n![Execution time of the initial query](./images/execution_time_first_solution.png)\n\nTo optimize the query, we can consider the following:\n- **Indexing**: Indexes on the join and filter columns can significantly improve performance for JOIN operations and WHERE conditions.\n- **Compute the date limit once**: We can compute the cutoff date in a CTE (`params`) to avoid calculating the same value multiple times.\n- **Remove unnecessary operations**: Removing the `ORDER BY ... DESC` sort can improve performance.\n\nThe optimized version of the query is as follows (`solution_optimized.sql`):\n```sql\nCREATE INDEX idx_orders_customer_id ON Orders (customer_id);\nCREATE INDEX idx_orders_order_date ON Orders (order_date);\n\nWITH params AS (\n    SELECT NOW() - INTERVAL '6 months' AS date_limit\n)\nSELECT c.customer_id, c.name, SUM(o.total_amount) AS total_spent\nFROM customers c\nJOIN orders o ON c.customer_id = o.customer_id\nJOIN params p ON o.order_date \u003e= p.date_limit\nGROUP BY c.customer_id;\n```\n\nThe result of the optimized query:\n\n![Result of the optimized query](./images/solution_optimized_result.png)\n\nTime taken to execute the optimized query: **46ms**\n\n![Execution time of the optimized query](./images/execution_time_solution_optimized.png)\n\n### Exercise 2: SQL Query Optimization\n\nYou are given the following SQL query, which is taking too long to execute:\n\n```sql\nSELECT\n    p.category,\n    SUM(od.quantity * od.price) AS total_sales\nFROM\n    Products p\nJOIN\n    Order_Details od 
ON p.product_id = od.product_id\nJOIN\n    Orders o ON od.order_id = o.order_id\nWHERE\n    o.order_date \u003e= '2024-01-01'\nGROUP BY\n    p.category;\n```\n#### Instructions:\n- Optimize the above query.\n- Describe the modifications you made and justify why they improve performance.\n\n#### Delivery:\n- Include the optimized SQL script.\n- Explain the applied improvements.\n\n---\n\n### Solution for Exercise 2\n\nThe provided SQL query is slow, taking **93ms** to execute.\n\n![Execution time of the given query](./images/not_optimized_query_result.png)\n\nFollowing the same approach as in **Exercise 1**, we can optimize the query by creating appropriate indexes:\n\n- **`product_id`** on `Order_Details` to improve the `JOIN` with `Products`.\n- **`order_id`** on `Order_Details` to optimize the `JOIN` with `Orders`.\n- **`order_date`** on `Orders` to accelerate the date filter.\n\n```sql\nCREATE INDEX idx_order_details_product_id ON Order_Details (product_id);\nCREATE INDEX idx_order_details_order_id ON Order_Details (order_id);\nCREATE INDEX idx_orders_order_date ON Orders (order_date);\n```\n\nThese indexes reduce the need for full table scans, improving lookup times and speeding up JOIN operations.\n\nThe full optimized query is in `optimized_query.sql`.\n\nThe optimized query execution time was reduced to **46ms**, as shown below:\n\n![Optimized query result](./images/optimized_query_result.png)\n\n## Part 2: Data Modeling\n\n### Exercise 3: Data Modeling\n\nWe provide the following descriptions of the entities in an inventory management system:\n\n1. **Products**: Items that can be bought or sold, including details such as price, category, and supplier.\n2. **Suppliers**: Entities that provide products.\n3. **Sales**: Transactions where products are sold to customers.\n\n#### Instructions:\n1. Design a data model that captures these entities and their relationships.\n2. 
Explain whether you would use a 3NF, star, or snowflake modeling approach, and justify your choice.\n\n#### Delivery:\n- A diagram of the data model (it can be a simple drawing or a digital representation).\n- A brief explanation of the chosen modeling approach.\n\n---\n\n### Solution for Exercise 3\n\nThe data model for the inventory management system is designed as follows:\n\n![Data model for the inventory management system](./part_2/data_model.png)\n\nThe data model consists of three entities: `Products`, `Suppliers` and `Sales`.\n\n- The `Products` entity contains details about the items available for sale, such as `product_id`, `name`, `operation` (an `ENUM` that can be `sold` or `bought`), `category` (a `VARCHAR` because we don't know all the existing categories) and `supplier_id` (a foreign key to the `Suppliers` entity).\n- The `Suppliers` entity contains information about the entities that provide products, such as `supplier_id`, `name`, `email` and `phone`.\n- The `Sales` entity represents transactions where products are sold to customers. It contains details such as `sale_id`, `product_id` (a foreign key to the `Products` entity), `sale_date` and `price`.\n\nThe relationships between these entities are as follows:\n\n- Each `Product` can have multiple `Sales`, and each `Sale` has one `Product`.\n- Each `Product` has one `Supplier`, and one `Supplier` can supply multiple `Products`.\n\nThe modeling approach chosen is a **3NF (Third Normal Form)** model because it reduces data redundancy and ensures data integrity, making it well suited to transactional systems such as this one. The other two approaches are more suitable for data warehousing and business intelligence systems.\n\n## Part 3: DBT (Data Build Tool)\n\n### Exercise 4: Creating DBT Models\n\nAssume you have a database with the following tables: `raw_customers`, `raw_orders`, `raw_order_details`.\n\n#### Instructions:\n1. 
Create a DBT model that:\n    - Normalizes the data from `raw_customers`.\n    - Calculates total sales per customer using `raw_orders` and `raw_order_details`.\n2. Describe how you would implement tests (unit tests) to ensure the quality of the data in these models.\n\n#### Delivery:\n- Include the relevant `.sql` and `.yml` files for DBT.\n- Explain the testing process and any relevant configurations.\n\n---\n\n### Solution for Exercise 4\n\nUsing the same `sales` database from **Part 1**, I have initialized a new DBT project using Python:\n\n```bash\npython -m venv venv\nsource venv/bin/activate\npip install -r requirements.txt\ndbt init\n```\n\nThis creates a `profiles.yml` file in the `.dbt` directory with the following content:\n\n```yml\nmy_dbt_profile:\n  outputs:\n    dev:\n      dbname: sales\n      host: localhost\n      pass: postgres\n      port: 5432\n      schema: orders\n      threads: 1\n      type: postgres\n      user: postgres\n  target: dev\n```\n\nI have created the following models:\n\n- `customers.sql`: Normalizes the data from `raw_customers`.\n- `customer_sales.sql`: Calculates total sales per customer using `raw_orders` and `raw_order_details`.\n\nThe tests are implemented in the `schema.yml` file:\n\n```yml\nversion: 2\n\nsources:\n  - name: raw\n    database: sales\n    schema: public\n    tables:\n      - name: customers\n      - name: orders\n      - name: products\n      - name: order_details\n\nmodels:\n  - name: customers\n    columns:\n      - name: customer_id\n        tests:\n          - unique\n          - not_null\n      - name: email\n        tests:\n          - unique\n          - not_null\n\n  - name: customer_sales\n    columns:\n      - name: customer_id\n        tests:\n          - unique\n          - not_null\n          - relationships:\n              to: ref('customers')\n              field: customer_id\n      - name: total_sales\n        tests:\n          - not_null\n```\n\nThe tests ensure that the `customer_id` and 
`email` columns in the `customers` table are unique and not null. The `customer_id` column in the `customer_sales` table is unique, not null and has a relationship with the `customer_id` column in the `customers` table. The `total_sales` column in the `customer_sales` table is not null.\n\nTo run the tests, use the following command:\n\n```bash\ndbt clean \u0026\u0026 dbt compile \u0026\u0026 dbt run \u0026\u0026 dbt test\n```\n\n## Tools Used\n\n- **Database**: PostgreSQL\n- PgAdmin 4\n- **DBT**: Data Build Tool\n- Python\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjorbush%2Fdata_engineer_technical_test","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fjorbush%2Fdata_engineer_technical_test","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fjorbush%2Fdata_engineer_technical_test/lists"}