{"id":22814241,"url":"https://github.com/devinterview-io/sql-ml-interview-questions","last_synced_at":"2026-01-08T07:47:31.332Z","repository":{"id":216162637,"uuid":"740619146","full_name":"Devinterview-io/sql-ml-interview-questions","owner":"Devinterview-io","description":"🟣 SQL interview questions and answers to help you prepare for your next machine learning and data science interview in 2024.","archived":false,"fork":false,"pushed_at":"2024-01-08T18:02:09.000Z","size":14,"stargazers_count":7,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-02-06T02:48:58.840Z","etag":null,"topics":["ai-interview-questions","coding-interview-questions","coding-interviews","data-science","data-science-interview","data-science-interview-questions","data-scientist-interview","interview-practice","interview-preparation","machine-learning","machine-learning-and-data-science","machine-learning-interview","machine-learning-interview-questions","software-engineer-interview","sql-ml","sql-ml-interview-questions","sql-ml-questions","sql-ml-tech-interview","technical-interview-questions"],"latest_commit_sha":null,"homepage":null,"language":null,"has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Devinterview-io.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null}},"created_at":"2024-01-08T18:01:32.000Z","updated_at":"2024-11-12T09:17:52.000Z","dependencies_parsed_at":"2024-01-08T19:31:30.612Z","dependency_job_id":"97e65294-f239-4625-b107-f5e0a40eaea0","html_url":"https://github.com/Devinterview-io/sql-ml-interview-questions","commit_stats":null,"previous_names":["devinterv
iew-io/sql-ml-interview-questions"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Devinterview-io%2Fsql-ml-interview-questions","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Devinterview-io%2Fsql-ml-interview-questions/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Devinterview-io%2Fsql-ml-interview-questions/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Devinterview-io%2Fsql-ml-interview-questions/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Devinterview-io","download_url":"https://codeload.github.com/Devinterview-io/sql-ml-interview-questions/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246385414,"owners_count":20768672,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["ai-interview-questions","coding-interview-questions","coding-interviews","data-science","data-science-interview","data-science-interview-questions","data-scientist-interview","interview-practice","interview-preparation","machine-learning","machine-learning-and-data-science","machine-learning-interview","machine-learning-interview-questions","software-engineer-interview","sql-ml","sql-ml-interview-questions","sql-ml-questions","sql-ml-tech-interview","technical-interview-questions"],"created_at":"2024-12-12T13:07:50.856Z","updated_at":"2026-01-08T07:47:31.325Z","avatar_url":"https://github.com/Devinterview-io.png","language":nul
l,"readme":"# 55 Fundamental SQL in Machine Learning Interview Questions in 2025.\n\n\u003cdiv\u003e\n\u003cp align=\"center\"\u003e\n\u003ca href=\"https://devinterview.io/questions/machine-learning-and-data-science/\"\u003e\n\u003cimg src=\"https://firebasestorage.googleapis.com/v0/b/dev-stack-app.appspot.com/o/github-blog-img%2Fmachine-learning-and-data-science-github-img.jpg?alt=media\u0026token=c511359d-cb91-4157-9465-a8e75a0242fe\" alt=\"machine-learning-and-data-science\" width=\"100%\"\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n\n#### You can also find all 55 answers here 👉 [Devinterview.io - SQL](https://devinterview.io/questions/machine-learning-and-data-science/sql-ml-interview-questions)\n\n\u003cbr\u003e\n\n## 1. What are the different types of _JOIN_ operations in SQL?\n\n**INNER JOIN**, **LEFT JOIN**, **RIGHT JOIN**, and **FULL JOIN** are different SQL join types, each with its distinct characteristics.\n\n### Join Types at a Glance\n\n- **INNER JOIN**: Returns matching records from both tables.\n- **LEFT JOIN**: Retrieves all records from the left table and matching ones from the right.\n- **RIGHT JOIN**: Gets all records from the right table and matching ones from the left.\n- **FULL JOIN**: Includes all records when there is a match in either of the tables.\n\n### Visual Representation\n\n![SQL Joins](https://firebasestorage.googleapis.com/v0/b/dev-stack-app.appspot.com/o/sql%2Fsql-joins-min.png?alt=media\u0026token=143a48c2-6ebe-4b27-9012-adf8a7ba8948)\n\n### Code Example: SQL Joins\n\nHere is the SQL code:\n\n```sql\n-- CREATE TABLES\nCREATE TABLE employees (\n    id INT PRIMARY KEY,\n    name VARCHAR(100),\n    department_id INT\n);\n\nCREATE TABLE departments (\n    id INT PRIMARY KEY,\n    name VARCHAR(100)\n);\n\n-- INSERT SOME DATA\nINSERT INTO employees (id, name, department_id) VALUES\n  (1, 'John', 1), \n  (2, 'Alex', 2), \n  (3, 'Lisa', 1), \n  (4, 'Mia', 1);\n\nINSERT INTO departments (id, name) VALUES \n  (1, 'HR'), \n  (2, 'Finance'), \n 
 (3, 'IT');\n\n-- INNER JOIN\nSELECT employees.name, departments.name as department\nFROM employees\nINNER JOIN departments ON employees.department_id = departments.id;\n\n-- LEFT JOIN\nSELECT employees.name, departments.name as department\nFROM employees\nLEFT JOIN departments ON employees.department_id = departments.id;\n\n-- RIGHT JOIN\nSELECT employees.name, departments.name as department\nFROM employees\nRIGHT JOIN departments ON employees.department_id = departments.id;\n\n-- FULL JOIN\nSELECT employees.name, departments.name as department\nFROM employees\nFULL JOIN departments ON employees.department_id = departments.id;\n```\n\u003cbr\u003e\n\n## 2. Explain the difference between _WHERE_ and _HAVING_ clauses.\n\nThe **WHERE** and **HAVING** clauses are both used in SQL, but they serve distinct purposes.\n\n### Key Distinctions\n\n- **WHERE**: Filters records based on conditions for individual rows.\n- **HAVING**: Filters the results of aggregate functions, such as `COUNT`, `SUM`, `AVG`, and others, for groups of rows defined by the GROUP BY clause.\n\n### Code Example: Basic Usage\n\nHere is the SQL code:\n\n```sql\n-- WHERE: Simple data filtering\nSELECT product_type, AVG(product_price) AS avg_price\nFROM products\nWHERE product_price \u003e 100\nGROUP BY product_type;\n\n-- HAVING: Filtered results post-aggregation\nSELECT customer_id, SUM(order_total) AS total_amount\nFROM orders\nGROUP BY customer_id\nHAVING SUM(order_total) \u003e 1000;\n```\n\u003cbr\u003e\n\n## 3. 
How would you write a SQL query to select _distinct values_ from a column?\n\nWhen you have duplicates in a column, you can use the `DISTINCT` clause to **retrieve unique values**.\n\nFor instance:\n\n```sql\nSELECT DISTINCT UNIQUE_COLUMN\nFROM YOUR_TABLE\nORDER BY UNIQUE_COLUMN;\n```\n\nIn this query, replace `UNIQUE_COLUMN` with the column name from which you want distinct values, and `YOUR_TABLE` with your specific table name.\n\n### Practical Example: Using `DISTINCT`\n\nLet's say you have a table `students_info` with a `grade` column indicating the grade level of students. You want to find all unique grade levels.\n\nHere's the corresponding SQL query:\n\n```sql\nSELECT DISTINCT grade\nFROM students_info\nORDER BY grade;\n```\n\nExecuting this query would return a list of unique grade levels in ascending order.\n\n### When to Use `DISTINCT`\n\n- **Unique Records**: When you only want to see and count unique values within a specific column or set of columns.\n  \n   ```sql\n   SELECT COUNT(DISTINCT column_name) FROM table_name;\n   ```\n   \n- **Criteria Comparison**: Using `IN` and `NOT IN` can involve multiple selections; `DISTINCT` ensures the return of unique results.\n\n- **Insight into Overlapping Data**: Useful for data analysis tasks where you want to identify shared information between rows.\n\n- **Subset Selection**: When you are working with large tables and want to zero in on unique records within a specific range, such as for pagination.\n\u003cbr\u003e\n\n## 4. 
What does _GROUP BY_ do in a SQL query?\n\n**GROUP BY** is a powerful clause in Structured Query Language (SQL) that allows for data summarization and grouping.\n\n### Key Functions\n\n- **Aggregation**: Performs tasks like sum, count, average, among others within subsets (groups).\n- **Grouping**: Identifies data subsets based on predetermined commonalities.\n- **Filtering**: Enables filtering both pre- and post-aggregation.\n\n### When to Use GROUP BY\n\n- **Summarizing Data**: For instance, calculating a 'Total Sales' from individual transactions.\n- **Categorization**: Such as counting the number of 'Customers' or 'Products' within specific groups (like regions or categories).\n- **Data Integrity Checks**: To identify potential duplicates or check for data consistency.\n- **Combining with Aggregate Functions**: Pairing with functions such as `COUNT`, `SUM`, `AVG`, `MAX`, and `MIN` for more sophisticated calculations.\n\n### The Mechanism Behind GROUP BY\n\n- **Division into Groups**: The system sorts the result set by the specified columns in the `GROUP BY` clause and groups rows that have the same group column values. 
This step creates a distinct group for each unique combination of 'group by' columns.\n- **Aggregation within Groups**: The system then applies the aggregation function (or functions) to each group independently, summarizing the data within each group.\n- **Result Generation**: After the groups are processed, the final result set is produced.\n\n### Code Example: GROUP BY in Action\n\nHere is the SQL code:\n\n```sql\nSELECT Region, SUM(Revenue) AS TotalRevenue\nFROM Sales\nGROUP BY Region;\n```\n\nIn this example, the `Sales` table is grouped by `Region`, and the sum of `Revenue` is calculated for each group.\n\n### Potential Challenges with GROUP BY\n\n- **Non-Aggregated Columns**: Every column in the `SELECT` list that is not wrapped in an aggregate function must appear in the `GROUP BY` clause, which can make queries with many output columns verbose.\n- **Data Types Consideration**: When grouping by certain data types, such as dates or floating points, results may not be as expected due to inherent characteristics of those types.\n\n### Advanced Techniques with GROUP BY\n\n- **Rollup and Cube**: Extensions providing multi-level summaries.\n\t- ROLLUP: Computes higher-level subtotals, moving from right to left in the grouping columns.\n\t- CUBE: Computes all possible subtotals.\n\n- **Grouping Sets**: Defines multiple groups in one query, e.g., grouping by year, month, and day in a date column.\n\u003cbr\u003e\n\n## 5. How can you _aggregate data_ in SQL (e.g., _COUNT_, _AVG_, _SUM_, _MAX_, _MIN_)?\n\nAggregating data in SQL is **essential** for making sense of large data sets. 
Common aggregate functions include `COUNT`, `SUM`, `AVG` (mean), `MIN`, and `MAX`.\n\n### Syntax\n\nHere is an example of the SQL code:\n\n```sql\nSELECT AGG_FUNCTION(column_name)\nFROM table_name\nGROUP BY column_name;\n```\n\n- `AGG_FUNCTION`: Replace with any of the aggregate operations.\n- `column_name`: The specific column to which the function will be applied.\n\nIf you don't use a `GROUP BY` clause, the query will apply the aggregate function to the **entire result set**.\n\n### Examples\n\n#### Without `GROUP BY`\n\n```sql\nSELECT COUNT(id) AS num_orders\nFROM orders;\n```\n\n#### With `GROUP BY`\n\n```sql\nSELECT customer_id, COUNT(id) AS num_orders\nFROM orders\nGROUP BY customer_id;\n```\n\nIn this example, the `COUNT` aggregates the number of orders for each unique customer ID.\n\n### Considerations\n\n- **Null Values**: Most aggregates ignore nulls, but you can use `COUNT(*)` to include them.\n- **Multiple Functions**: It's possible to include multiple aggregate functions in one query.\n- **Data Type Compatibility**: Ensure that the chosen aggregate function is compatible with the data type of the selected column. 
For instance, you can't calculate the mean of a text field.\n\n### Code Example: Aggregating Data in SQL\n\nHere is the SQL code:\n\n```sql\nCREATE TABLE orders (id INT, customer_id INT, total_amount DECIMAL(10, 2));\n\nINSERT INTO orders (id, customer_id, total_amount)\nVALUES \n\t(1, 101, 25.00),\n\t(2, 102, 35.50),\n\t(3, 101, 42.25),\n\t(4, 103, 20.75),\n\t(5, 102, 60.00);\n\n-- Total Number of Orders\nSELECT COUNT(id) AS num_orders\nFROM orders;\n\n-- Number of Orders per Customer\nSELECT customer_id, COUNT(id) AS num_orders\nFROM orders\nGROUP BY customer_id;\n\n-- Total Sales\nSELECT SUM(total_amount) AS total_sales\nFROM orders;\n\n-- Average Order Value\nSELECT AVG(total_amount) AS avg_order_value\nFROM orders;\n\n-- Highest Ordered Value\nSELECT MAX(total_amount) AS max_order_value\nFROM orders;\n\n-- Lowest Ordered Value\nSELECT MIN(total_amount) AS min_order_value\nFROM orders;\n```\n\u003cbr\u003e\n\n## 6. Describe a _subquery_ and its typical use case.\n\nA **subquery** consists of a complete SQL statement nested within another query. It's often used for complex filtering, calculations, and data retrieval.\n\nSubqueries are broadly classified into two types:\n\n- **Correlated**: They depend on the outer query's results. Each time the outer query iterates, the subquery is re-evaluated with the updated outer result. It can be less efficient as it often involves repeated subquery evaluation.\n- **Uncorrelated**: These are self-contained and don't rely on the outer query. They are typically executed only once and their result is used throughout the outer query.\n\n### Common Use Cases\n\n- **Filtering with Aggregates**: Subqueries can be used in combination with aggregate functions to filter group-level results based on specific criteria. 
For instance, you can retrieve departments with an average salary above a certain threshold.\n\n- **Multi-Criteria Filtering**: Subqueries are often handy when traditional `WHERE`, `IN`, or `EXISTS` clauses can't accommodate complex, multi-criteria filters.\n\n- **Data Integrity Checks**: Subqueries can help identify inconsistent data by comparing values to related tables.\n\n- **Hierarchical Data Queries**: With the advent of Common Table Expressions (CTEs) and recursive queries in modern SQL standards, a direct use of subqueries for hierarchical data searches is now uncommon - CTEs are the preferred means of such queries.\n\n- **Data Retention**: Subqueries can be used to identify specific records to be deleted or retained based on certain conditions.\n\n### Worked Examples\n\n#### Multi-Criteria Filtering\n   - **Task**: Return all customers from a specific city who have placed orders within the last month.\n   - **Code**:\n    ```sql\n    SELECT * FROM Customers\n    WHERE City = 'London'\n    AND CustomerID IN (SELECT CustomerID FROM Orders WHERE OrderDate \u003e DATEADD(month, -1, GETDATE()));\n    ```\n\n#### Data Integrity Checks\n   - **Task**: Retrieve customers with inconsistent states in the Customers and Orders tables.\n   - **Code**:\n    ```sql\n    SELECT * FROM Customers\n    WHERE State NOT IN (SELECT DISTINCT State FROM Orders);\n    ```\n\n#### Data Retention\n   - **Task**: Delete the line items of orders older than three years (using a hypothetical `OrderDetails` child table).\n   - **Code**:\n    ```sql\n    DELETE FROM OrderDetails\n    WHERE OrderID IN (SELECT OrderID FROM Orders WHERE OrderDate \u003c DATEADD(year, -3, GETDATE()));\n    ```\n\u003cbr\u003e\n\n## 7. Can you explain the use of _indexes_ in databases and how they relate to Machine Learning?\n\nDatabase **indexes** enable systems to retrieve data more efficiently by offering a faster look-up mechanism. 
This optimization technique is directly pertinent to **Machine Learning**.\n\n### Indexes in Databases\n\nDatabases traditionally use **B-Tree** indexes, but most are equipped with several index types, catering to varying data environments and query patterns.\n\n- **B-Tree (Balanced Tree)**: Offers balanced search capabilities, ensuring nodes are at least half-full.\n- **Hash**: Maps keys into fixed-size buckets, making it fast for exact-match (point) queries.\n- **Bitmap**: Particularly suitable for low-cardinality columns, where each distinct value can be represented as a bitmap.\n- **Text Search**: Facilitates efficient text matching.\n\n### Key Concepts of B-Trees\n\n- **Node Structure**: Contains keys and pointers. Leaf nodes harbor actual data, enabling direct access.\n- **Data Positioning**: Organizes data in a sorted, multi-level structure to expedite lookups.\n- **Range Queries**: Suited for both singular and **range-based** queries.\n\n### Machine Learning Query Scenarios\n\n- **Similarity Look-Up**: A dataset with user preferences can be indexed to expedite locating individuals with matching profiles, advantageous in applications such as recommendation systems.\n- **Range-Based Searches**: For datasets containing time-specific information, like a sales record, B-Trees excel in furnishing time-ordered data within designated intervals.\n\n### Code Example: Implementing B-Trees for Range Queries\n\nHere is the Python code:\n\n  ```python\n  class Node:\n      def __init__(self, keys=None, children=None):\n          # Avoid mutable default arguments\n          self.keys = keys if keys is not None else []\n          self.children = children if children is not None else []\n\n  # Collect every key in [start, end] from the tree rooted at 'node'\n  def range_query(node, start, end):\n      # Base case: leaf nodes hold the data directly\n      if not node.children:\n          return [key for key in node.keys if start \u003c= key \u003c= end]\n      results = []\n      # An interior node with k keys has k + 1 children; child i holds\n      # keys below keys[i], so a range may span several children.\n      for i, child in enumerate(node.children):\n          if i \u003e 0 and node.keys[i - 1] \u003e end:\n              break  # all remaining subtrees lie beyond the range\n          results.extend(range_query(child, start, end))\n          if i \u003c len(node.keys) and start \u003c= node.keys[i] \u003c= end:\n              results.append(node.keys[i])\n      return results\n  
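  # Example usage with small, hypothetical data: a root routing to two leaves\n  left, right = Node(keys=[2, 5]), Node(keys=[11, 14])\n  root = Node(keys=[10], children=[left, right])\n  print(range_query(root, 4, 12))\n  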
```\n\u003cbr\u003e\n\n## 8. How would you _optimize_ a SQL query that seems to be running slowly?\n\nWhen a SQL query is sluggish, various optimization techniques can be employed to enhance its speed. Let's focus on the **logical and physical** design aspects of the **database structure and the query itself**.\n\n### Key Optimization Techniques\n\n#### 1. Query Optimization\n\n- **Simplify Complex Queries**: Break the query into smaller parts for better readability and performance. Use common table expressions or derived tables to modularize SQL logic. Alternatively, you can use temporary tables. \n- **Limit Result Set**: Use `TOP`, `LIMIT`, or `ROWNUM`/`ROWID` to restrict the number of records returned.\n- **Reduce JOIN Complexity**: Replace multiple JOINs with fewer, multi-table JOINs and **explicit JOIN** notation.\n\n\n#### 2. Indexing\n\n- **Proper Indexing**: Select suitable columns for indexing to speed up data retrieval. Use composite indexes for frequent combinations of columns in WHERE or JOIN conditions.\n- **Avoid Over-Indexing**: Numerous indexes can slow down write operations and data modifications. Strike a balance.\n\n#### 3. Schema and Data Design\n\n- **Normalization**: Ensure the database is in an optimal normal form, which can reduce redundancy, maintain data integrity, and minimize disk space.\n- **Data Types**: Use appropriate data types for columns to conserve space and support efficient data operations.\n\n#### 4. Under The Hood: The Query Plan\n\n- **Analyze Query Execution Plan**: Look at the query execution plan, generated by the SQL query optimizer, to identify bottlenecks and improve them. Many RDBMS provide commands and tools to access the query execution plan.\n\n#### 5. More Ideas from SQL Performance Tuning\n\n- **Test Under Load**:  Simulate the production environment and monitor query response times to identify performance issues.\n- **Limit Data Reallocations in tempdb**: Data reallocation operations such as INSERT INTO.. 
SELECT FROM can be resource-intensive on tempdb.\n- **Partition Data**: Split large tables into smaller, more manageable segments to speed up query performance.\n\n#### Tools and Techniques for Query Analysis\n\n- **Profiling Tools**: Use graphical query builders and visual execution plan tools provided by many RDBMSs to examine data flow and performance.\n- **Query Plan Viewer**: Databases such as SQL Server have a graphical representation of query execution plans.\n- **Index Analysis**: Some databases, like MySQL and SQL Server, provide advisor tools to check the efficiency of indexes and suggest index changes.\n\n### Practical Steps for Query Optimization\n\n1. **Determine the Performance Problem**: Understand what specific aspect of the query is underperforming.\n2. **Profile Your Query**: Use **EXPLAIN** (or its equivalent on other databases) to see the query plan and identify potential bottlenecks.\n3. **Analyze Query Execution Time**: Use database tools to analyze real execution time and get insights into I/O, CPU, and memory usage.\n4. **Identify the Bottleneck**: Focus on the slowest or most resource-intensive part of the query, for example, I/O or CPU.\n5. **Tune That Portion**: Make changes to the query or the table structure, or consider using indexed (materialized) views. Take time to understand why it is slow and focus your efforts on correcting that.\n\u003cbr\u003e\n\n## 9. How do you handle _missing values_ in a SQL dataset?\n\nHandling **missing values** is crucial for accurate analysis in SQL. 
Let's look at the various techniques for managing them.\n\n### Removing Records\n\nOne of the simplest ways to deal with missing values is to discard rows with NULLs.\n\n#### Examples\n\nHere's a SQL query that deletes rows containing NULL in the column `age`:\n\n```sql\nDELETE FROM students\nWHERE age IS NULL;\n```\n\n### Direct Replacement\n\nReplace missing values with specific defaults using `COALESCE` or `CASE` statements.\n\n#### Examples\n\nIf `grade` can have NULL values and you want to treat them as \"ungraded\":\n\n```sql\nSELECT student_name, COALESCE(grade, 'Ungraded') AS actual_grade\nFROM student_grades;\n```\n\nAn example using `CASE`:\n\n```sql\nSELECT book_title,\n       CASE WHEN publication_year IS NULL THEN 'Unknown'\n            ELSE publication_year\n       END AS year\nFROM books;\n```\n\n### Using Aggregates\n\nApply SQL aggregate functions to compute statistics without explicitly removing NULLs. For example, `COUNT` ignores NULLs on a column.\n\n```sql\nSELECT department, COUNT(*) AS total_students\nFROM students\nGROUP BY department;\n```\n\n### Flexible Joins\n\nDepending on your specific situation, you might want to include or exclude missing values when joining tables.\n\n#### Examples\n\nUsing `LEFT JOIN`:\n\n```sql\nSELECT s.student_id, s.name, e.enrollment_date\nFROM students s\nLEFT JOIN enrollments e ON s.student_id = e.student_id;\n```\n\nUsing `INNER JOIN`:\n\n```sql\nSELECT s.student_id, s.name, e.enrollment_date\nFROM students s\nINNER JOIN enrollments e ON s.student_id = e.student_id\nWHERE e.enrollment_date IS NOT NULL;\n```\n\n### Handle Missing Date Fields\n\nIf **Date** fields are missing, the appropriate strategy would depend on the context.\n\n1. **Replace with Defaults**: For missing dates, you can use a default, such as the current date, or another specific date.\n\n2. 
**Remove or Flag**: Another option, based on context, is to either delete the record with the missing date or flag it for later review.\n\n3. **Impute from Adjacent Data**: In time series data, it's often useful to fill in missing dates with the nearest available data point to maintain a continuous date sequence. This can be done using window functions.\n\n#### Examples\n\nReplacing a missing date with the current date:\n\n```sql\nSELECT action_id, COALESCE(action_date, CURRENT_DATE) AS actual_date\nFROM actions;\n```\n\nUsing `LAG()` to fill missing dates with the previous row's date:\n\n```sql\nSELECT action_id,\n       COALESCE(action_date, LAG(action_date) OVER (ORDER BY action_id)) AS imputed_date\nFROM actions;\n```\n\n### Advanced Techniques\n\n1. **Using Temp Tables**: You can create a temporary table, excluding rows with NULLs, and then work with this cleaner dataset.\n\nExample:\n\n```sql\nCREATE TEMPORARY TABLE clean_students AS\nSELECT *\nFROM students\nWHERE age IS NOT NULL;\n\n-- Perform further tasks using \"clean_students\" table\n```\n\n2. **Machine Learning Methods**: Advanced SQL engines supporting ML functionalities might offer methods like imputation based on models.\n\n3. **Dynamic Imputation**: For scenarios involving complex rules or sequences, you might consider using stored procedures to dynamically impute missing values.\n\u003cbr\u003e\n\n## 10. 
Write a SQL query that _joins_ two tables and retrieves only the rows with matching keys.\n\n### Problem Statement\n\nThe task is to perform a **SQL join** operation between two tables and retrieve the rows where the keys match.\n\n### Solution\n\nTo accomplish this task, use the following SQL query.\n\n#### MySQL\n\n```sql\nSELECT * \nFROM table1\nINNER JOIN table2 ON table1.key = table2.key;\n```\n\n#### PostgreSQL\n\n```sql\nSELECT * \nFROM table1\nINNER JOIN table2 USING (key);\n```\n\n#### Oracle\n\n```sql\nSELECT *\nFROM table1\nJOIN table2 ON table1.key = table2.key;\n```\n\n#### SQL Server\n\n```sql\nSELECT *\nFROM table1\nJOIN table2 ON table1.key = table2.key;\n```\n\n### Key Points\n\n- **`INNER JOIN`**: Retrieves the matching rows from both tables based on the specified condition.\n- **`ON`, `USING`**: Specifies the column(s) used for joining.\n- **`SELECT`**: You can specify individual columns instead of `*` based on requirement.\n- **Table Aliases**: When dealing with long table names, aliases (e.g., `t1`, `t2`) provide a more concise syntax.\n\u003cbr\u003e\n\n## 11. How would you _merge_ multiple result sets in SQL without duplicates?\n\nWhen you need to **combine** the result sets of multiple SELECT queries without **duplicates**, use the **UNION** set operator. If you want to include duplicates, you can use **UNION ALL**. \n\nHere is a visual representation of how these set operations work:\n\n![Union vs Union All](https://firebasestorage.googleapis.com/v0/b/dev-stack-app.appspot.com/o/sql%2Funion-and-union-all-in-sql%20(1).jpg?alt=media\u0026token=661a27b1-acab-49f0-8456-274315349d98)\n\n### Code Example: Using UNION\n\nHere is some  SQL code:\n\n```sql\nSELECT employee_id \nFROM full_time_employees \n\nUNION \n\nSELECT intern_id \nFROM interns;\n```\n\nThis code retrieves a combined list of unique employee IDs from both `full_time_employees` and `interns` tables.\n\u003cbr\u003e\n\n## 12. 
Create a SQL query to _pivot_ a table transforming rows into columns.\n\n### Problem Statement\n\n\"Pivoting\" a table in SQL is the process of **reorganizing** and **transforming** row data into columnar data, commonly used for reporting or data analysis.\n\n### Solution\n\nTwo methods for pivoting data in SQL are:\n\n1. **Static Pivot**: When the distinct values of the pivoted column are known in advance.\n2. **Dynamic Pivot**: When the distinct values are not known in advance and need to be determined at runtime.\n\n#### Key Considerations\n\n- **Pivot Column Values**: Aware vs. Unaware of distinct values.\n- **Performance Impact**: Dynamic pivoting often involves complex operations at runtime.\n- **SQL Compatibility**: Dynamic pivoting can be limited in certain SQL dialects.\n\nHere is an example table named `salesdata`:\n\n| Date      | Product   | Quantity | Amount |\n|-----------|-----------|----------|--------|\n| 1/1/2020  | Apples    | 10       | 50     |\n| 1/1/2020  | Oranges   | 8        | 40     |\n| 2/1/2020  | Apples    | 12       | 60     |\n| 2/1/2020  | Oranges   | 15       | 75     |\n\n#### Static Pivot\n\nThe `PIVOT` keyword is used in SQL Server, and `crosstab()` is used in PostgreSQL.\n\n#### Implementing Static Pivot\n\n**PostgreSQL**:\n\n```sql\nSELECT *\nFROM crosstab(\n  'SELECT date, amount, product FROM salesdata ORDER BY 1,3',\n  'SELECT DISTINCT product FROM salesdata ORDER BY 1'\n) AS ct (\"Date\" date, \"Apples\" int, \"Oranges\" int);\n```\n\n**SQL Server**:\n\n```sql\nSELECT *\nFROM (SELECT Date, Product, Amount\n      FROM salesdata) AS SourceTable\nPIVOT (SUM(Amount) FOR Product IN ([Apples], [Oranges])) AS PivotTable;\n```\n\n#### Dynamic Pivot\n\nFor **SQL Server**, a stored procedure is necessary, as it dynamically constructs the query based on the distinct values.\n\n#### Implementing Dynamic Pivot\n\n**SQL Server**:\n\n- Create a stored procedure:\n\n```sql\nCREATE PROCEDURE dynamicPivot\nAS\nBEGIN\n  DECLARE @cols AS 
NVARCHAR(MAX), @query AS NVARCHAR(MAX);\n  SELECT @cols = STUFF((SELECT DISTINCT ',' + QUOTENAME(Product) FROM salesdata FOR XML PATH('')), 1, 1, '');\n  SET @query = 'SELECT Date, ' + @cols + ' FROM (SELECT Date, Product, Amount FROM salesdata) AS SourceTable PIVOT (SUM(Amount) FOR Product IN (' + @cols + ' )) AS PivotTable;';\n  EXEC sp_executesql @query;\nEND;\n```\n\n- Execute the stored procedure:\n\n```sql\nEXEC dynamicPivot;\n```\n\u003cbr\u003e\n\n## 13. Explain the importance of _data normalization_ in SQL and how it affects Machine Learning models.\n\n**Data normalization** is a crucial foundational step in preparing datasets for efficient storage and improved analysis. It is related to the **First Normal Form (1NF)** in relational databases and is essential for maintaining data integrity.\n\n### Why is Data Normalization Important?\n\n- **Data Consistency**: It avoids redundancy and the potential for update anomalies. With normalized data, updates are made in a single place, ensuring consistency throughout the database.\n- **Data Integrity**: Foreign key constraints can be applied effectively only when data is normalized.\n- **Query Performance**: Normalized tables are often smaller, leading to better performance.\n\n### Implications for Machine Learning\n\n- **Feature Engineering**: Normalized data ensures that feature scaling is consistent, which is often a prerequisite for machine learning algorithms like $k$-means clustering and algorithms that require gradient descent. If features are not normalized, certain features might have undue importance during model training.\n- **Ease of Integration**: Normalized data is easier to incorporate into machine learning pipelines. Many machine learning libraries assume, and, in some cases, require normalized data.\n- **Reduction of Overfitting**: Normalized data can help with overfitting issues in certain algorithms. 
If different features span different ranges, the model may give undue importance to the one with the larger scale.\n- **Enhanced Model Interpretability**: Normalized data can give more intuitive interpretations of coefficients, especially in linear models.\n\n### Code Example: Normalizing Data in SQL\n\nHere is the SQL code:\n\n```sql\n-- Create tables in First Normal Form (1NF)\nCREATE TABLE Driver (\n    DriverID int PRIMARY KEY,\n    Name varchar(255), \n    Age int\n);\n\nCREATE TABLE Car (\n    CarID int PRIMARY KEY,\n    Model varchar(255),\n    Make varchar(255),\n    Year int,\n    DriverID int,\n    FOREIGN KEY (DriverID) REFERENCES Driver(DriverID)\n);\n\n-- Further normalization to Third Normal Form (3NF): move the make/model\n-- attributes, which depend on the car model rather than the individual car,\n-- into their own table (replacing the 1NF Car definition above)\nCREATE TABLE CarModel (\n    ModelID int PRIMARY KEY,\n    Model varchar(255),\n    Make varchar(255)\n);\n\nCREATE TABLE Car (\n    CarID int PRIMARY KEY,\n    ModelID int,\n    Year int,\n    DriverID int,\n    FOREIGN KEY (ModelID) REFERENCES CarModel(ModelID),\n    FOREIGN KEY (DriverID) REFERENCES Driver(DriverID)\n);\n```\n\u003cbr\u003e\n\n## 14. How can you extract _time-based features_ from a SQL _datetime_ field for use in a Machine Learning model?\n\nExtracting **time-based features** from a SQL `datetime` field is essential for time series analysis. These features can be used to predict future events, study patterns, and make data-driven decisions.\n\n### Time-Based Features:\n\n1. **Year**: Extract the year using the SQL function `EXTRACT`.\n2. **Month**: Use `EXTRACT` to retrieve the month.\n3. **Day**: Similar to month and year, employ `EXTRACT` for the day.\n4. **Day of Week**: Utilize `EXTRACT` with the `DOW` option (PostgreSQL) or a dialect-specific function such as MySQL's `DAYOFWEEK()`.\n5. 
**Weekend**: A binary feature indicating whether the day falls on a weekend.\n\n#### Example: SQL Queries for Time-Based Features\n\nAssuming a `sales` table with a `transaction_date` column, here are the queries (PostgreSQL syntax; `EXTRACT(DOW ...)` returns 0 for Sunday through 6 for Saturday):\n\n```sql\n-- Year\nSELECT EXTRACT(YEAR FROM transaction_date) AS transaction_year FROM sales;\n\n-- Month\nSELECT EXTRACT(MONTH FROM transaction_date) AS transaction_month FROM sales;\n\n-- Day\nSELECT EXTRACT(DAY FROM transaction_date) AS transaction_day FROM sales;\n\n-- Day of Week (0 = Sunday, 6 = Saturday)\nSELECT EXTRACT(DOW FROM transaction_date) AS transaction_dayofweek FROM sales;\n\n-- Weekend flag\nSELECT CASE WHEN EXTRACT(DOW FROM transaction_date) IN (0, 6) THEN 1 ELSE 0 END AS is_weekend FROM sales;\n```\n\n### Time Period Features:\n\n1. **Time of Day**: Use `EXTRACT` with `HOUR` to split the day into segments.\n2. **Time of Day (Cyclical)**: Map the time onto a 24-hour cycle with `SIN` and `COS`, so that 23:59 and 00:00 end up close together; this better captures daily patterns.\n\n#### Example: Creating a Cyclical Time Feature\n\n```sql\nWITH t AS (\n  SELECT\n    EXTRACT(HOUR FROM transaction_date) AS hour,\n    EXTRACT(MINUTE FROM transaction_date) AS minute\n  FROM sales\n)\nSELECT\n  SIN((hour + minute / 60) * 2 * PI() / 24) AS time_of_day_sin,\n  COS((hour + minute / 60) * 2 * PI() / 24) AS time_of_day_cos\nFROM t;\n```\n\n### Additional Features:\n\n1. **Time Since Last Event**: Use the `LAG` window function to calculate the time difference between the current event and the previous one.\n2. 
**Time Until Next Event**: Employ `LEAD` in the same way to determine the time remaining until the subsequent event.\n\n#### Example: Calculating Time Since the Previous Event\n\n```sql\n-- LAG looks back one row in transaction_date order;\n-- the difference is NULL for the first event.\nSELECT\n  transaction_date,\n  transaction_date - LAG(transaction_date) OVER (ORDER BY transaction_date) AS time_since_prev_event\nFROM sales;\n```\n\nThese time-based and time period features can enhance the predictive power of your machine learning models.\n\u003cbr\u003e\n\n## 15. What are SQL _Window Functions_ and how can they be used for Machine Learning _feature engineering_?\n\n**Window functions** in SQL perform computations across **specific windows of rows** rather than the entire dataset. This makes them highly useful for ML feature engineering, providing advanced capabilities for data aggregation and ordering.\n\n### Benefits for Machine Learning \n\nWindow functions are optimized for efficient handling of large datasets. Their scope can be fine-tuned with **PARTITION BY** and **ORDER BY**, making them well suited for time series calculations, customer cohorts, and data denoising.\n\n1. **Calculation of Lag/Lead Values**\n\n   `LAG` and `LEAD` are useful for constructing **time-series features** such as deltas and moving averages.\n\n2. **Data Ranking**\n\n   Ranking functions assist in creating features such as **quantiles** and within-group percentile ranks.\n\n3. **Data Accumulation and Running Sums**\n\n   This is often used in **time series** feature engineering, for example, a rolling sum over the past 7 days or an **exponential moving average**.\n\n4. **Identification of Data Groups**\n\n   This helps in creating features that are sensitive to **group-level** distinctiveness (e.g., buying habits of certain customers).\n\n5. 
**Advanced Data Imputation**\n\n   Missing data is a common challenge in datasets; window-based approaches such as **forward-filling** or **back-filling** (carrying the last or next observed value) can help here.\n\n6. **Smoothing (Rolling Averages)**\n\n   Aggregates computed **OVER (ORDER BY ...)** with a frame clause can produce rolling averages over a **smaller window**, yielding a less noisy series, which is especially beneficial if your goal is to predict a trend amidst other fluctuations.\n\n7. **Efficient Sampling**\n\n   This is useful for balancing datasets in classification. By partitioning a dataset and then sampling with functions such as `NTILE`, or `ROW_NUMBER` combined with modular arithmetic, you can draw uniform samples from each partition.\n\n\n### Practical Example\n\nConsider the following query that uses **ROW_NUMBER** with **PARTITION BY** to assign each record a sequence number, and a relative position, within its group:\n\n```sql\nSELECT\n    id,\n    attribute,\n    seq,          -- sequence within the group\n    order_ratio   -- relative position within the group, in (0, 1]\nFROM\n(\n    SELECT\n        id,\n        attribute,\n        ROW_NUMBER() OVER (PARTITION BY attribute ORDER BY id) AS seq,\n        (ROW_NUMBER() OVER (PARTITION BY attribute ORDER BY id))::float /\n        COUNT(*) OVER (PARTITION BY attribute) AS order_ratio\n    FROM table1\n) AS ranked;\n```\n\u003cbr\u003e\n\n\n\n#### Explore all 55 answers here 👉 [Devinterview.io - SQL](https://devinterview.io/questions/machine-learning-and-data-science/sql-ml-interview-questions)\n\n\u003cbr\u003e\n\n\u003ca href=\"https://devinterview.io/questions/machine-learning-and-data-science/\"\u003e\n\u003cimg src=\"https://firebasestorage.googleapis.com/v0/b/dev-stack-app.appspot.com/o/github-blog-img%2Fmachine-learning-and-data-science-github-img.jpg?alt=media\u0026token=c511359d-cb91-4157-9465-a8e75a0242fe\" alt=\"machine-learning-and-data-science\" 
width=\"100%\"\u003e\n\u003c/a\u003e\n\u003c/p\u003e\n