https://github.com/sipemu/duckdb_sql_recipes
- Host: GitHub
- URL: https://github.com/sipemu/duckdb_sql_recipes
- Owner: sipemu
- Created: 2024-12-16T09:05:55.000Z (5 months ago)
- Default Branch: main
- Last Pushed: 2024-12-16T16:06:46.000Z (5 months ago)
- Last Synced: 2025-02-14T22:18:07.368Z (3 months ago)
- Size: 13.7 KB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
# DuckDB SQL Recipes
A collection of useful SQL recipes and macros for DuckDB, focusing on common data analysis patterns and utilities.
## Overview
This repository contains ready-to-use SQL recipes for DuckDB, making it easier to perform common data analysis tasks. The recipes are organized by domain and use case.
## Table of Contents
- [Overview](#overview)
- [Recipes](#recipes)
  - [Time Series Analysis](#time-series-analysis)
    - [Data Quality Statistics](#data-quality-statistics-timeseriestimeseries_quality_metricssql)
    - [Data Preparation](#data-preparation-timeseriesdata_preparationsql)
- [Contributing](#contributing)
- [License](#license)
## Recipes
### Time Series Analysis
#### Data Quality Statistics ([`timeseries/timeseries_quality_metrics.sql`](timeseries/timeseries_quality_metrics.sql))
A collection of macros for analyzing and filtering time series data quality. The main macro computes comprehensive statistics, and additional utility macros help identify and handle problematic series.
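To give a feel for the kind of statistics involved, here is a minimal plain-SQL sketch that computes a handful of them (row count, date range, zero count, expected length, and gaps) for a daily series. The table `demo_sales` and its columns are hypothetical, and the macro in this repository may define or compute these values differently.

```sql
-- Hypothetical demo data: one daily series with a missing day (2024-01-03).
CREATE OR REPLACE TABLE demo_sales AS
SELECT * FROM (VALUES
    ('p1', DATE '2024-01-01', 0.0),
    ('p1', DATE '2024-01-02', 5.0),
    ('p1', DATE '2024-01-04', 0.0),
    ('p1', DATE '2024-01-05', 7.0)
) AS t(product_id, sale_date, sales_value);

-- Per-series statistics, assuming a daily frequency.
SELECT
    product_id,
    count(*)                                AS length,
    min(sale_date)                          AS start_date,
    max(sale_date)                          AS end_date,
    count(*) FILTER (WHERE sales_value = 0) AS n_zeros,
    -- Days spanned by the series, counting both endpoints.
    date_diff('day', min(sale_date), max(sale_date)) + 1            AS expected_length,
    -- Expected points minus observed points.
    date_diff('day', min(sale_date), max(sale_date)) + 1 - count(*) AS n_gaps
FROM demo_sales
GROUP BY product_id;
```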
**Usage:**
```sql
SELECT * FROM compute_timeseries_quality_metrics(timeseries_tbl, {'product_id': product_id, 'store_id': store_id}, date_column, sales_value, 'day');
```
**Parameters:**
- **tbl_name**: Name of the table or subquery to analyze (string)
- **hierarchy_cols**: Struct of column names and values that define the time series grouping
- **time_col**: Date/timestamp column for the time series
- **target_col**: The metric column to analyze
- **freq**: Frequency of the time series (string). Valid values include 'day', 'week', 'month', 'quarter', and 'year' (see [DuckDB Date Functions](https://duckdb.org/docs/sql/functions/datepart.html)).
**Output Metrics:**
- **sum**: Total sum of the target variable
- **avg**: Average value
- **std**: Standard deviation
- **unique_values**: Number of distinct values in the target column
- **length**: Number of actual data points
- **start_date**: First date in the series
- **end_date**: Last date in the series
- **n_zeros**: Count of zero values
- **perc_zeros**: Percentage of zero values
- **n_nan**: Count of NaN values
- **perc_nan**: Percentage of NaN values
- **n_null**: Count of NULL values
- **perc_null**: Percentage of NULL values
- **expected_length**: Expected number of data points based on date range
- **n_gaps**: Number of missing data points in the series
- **n_gaps_to_max_date**: Number of days between series end date and the maximum end date across all series
- **n_leading_zeros**: Number of consecutive zeros at the start of the series
- **n_ending_zeros**: Number of consecutive zeros at the end of the series
##### Utility Macros
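The utility macros take a summary table as input. One plausible way to obtain it is to materialize the output of `compute_timeseries_quality_metrics`. This is an assumption rather than something the README states, and it presumes the macro definitions from `timeseries/timeseries_quality_metrics.sql` are already loaded. The table and column names follow the usage example above.

```sql
-- Assumption: the utility macros expect the output of
-- compute_timeseries_quality_metrics as their summary table.
CREATE OR REPLACE TABLE timeseries_summary_tbl AS
SELECT *
FROM compute_timeseries_quality_metrics(
    timeseries_tbl,
    {'product_id': product_id, 'store_id': store_id},
    date_column,
    sales_value,
    'day'
);
```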
###### Count Short Series
Identifies series with fewer than `min_length` values.
```sql
SELECT * FROM count_short_series(timeseries_summary_tbl, 30);
```
**Parameters:**
- **summary_table**: Summary table of the time series data
- **min_length**: Minimum length of the series
**Output:**
- **n_short_series**: Number of series with length < `min_length`
- **perc_short_series**: Percentage of series with length < `min_length`
###### Count Constant Series
Identifies series with only one unique value.
```sql
SELECT * FROM count_constant_series(timeseries_summary_tbl);
```
**Parameters:**
- **summary_table**: Summary table of the time series data
**Output:**
- **n_constant_series**: Number of constant series
- **perc_constant_series**: Percentage of constant series
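For intuition, both counting macros amount to conditional counts over the summary table. A minimal plain-SQL sketch, assuming a hypothetical `demo_summary` table with `length` and `unique_values` columns and percentages on a 0 to 100 scale (the repository's macros may be defined differently):

```sql
-- Hypothetical summary table: one row per series.
CREATE OR REPLACE TABLE demo_summary AS
SELECT * FROM (VALUES
    ('a', 10, 3),
    ('b', 45, 1),
    ('c', 60, 12)
) AS t(series_id, length, unique_values);

SELECT
    -- Series shorter than the chosen minimum length (30 here).
    count(*) FILTER (WHERE length < 30)                    AS n_short_series,
    100.0 * count(*) FILTER (WHERE length < 30) / count(*) AS perc_short_series,
    -- Series whose target column takes a single distinct value.
    count(*) FILTER (WHERE unique_values = 1)                    AS n_constant_series,
    100.0 * count(*) FILTER (WHERE unique_values = 1) / count(*) AS perc_constant_series
FROM demo_summary;
```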
#### Data Preparation ([`timeseries/data_preparation.sql`](timeseries/data_preparation.sql))
A macro that fills gaps in daily time series data by generating missing timestamps and filling target values with NULL.
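The underlying idea can be sketched in plain SQL: build a calendar of every day between each series' first and last date, then left-join the observed rows so that missing days appear with a NULL target. The table `demo_sales` and its columns are hypothetical, and this is not necessarily how the macro is implemented.

```sql
-- Hypothetical demo data: 2024-01-03 is missing for product p1.
CREATE OR REPLACE TABLE demo_sales AS
SELECT * FROM (VALUES
    ('p1', DATE '2024-01-01', 3.0),
    ('p1', DATE '2024-01-02', 5.0),
    ('p1', DATE '2024-01-04', 7.0)
) AS t(product_id, sale_date, sales_value);

WITH bounds AS (
    -- First and last timestamp of each series.
    SELECT
        product_id,
        CAST(min(sale_date) AS TIMESTAMP) AS start_ts,
        CAST(max(sale_date) AS TIMESTAMP) AS end_ts
    FROM demo_sales
    GROUP BY product_id
),
calendar AS (
    -- One row per series and day between its first and last date.
    SELECT
        product_id,
        unnest(generate_series(start_ts, end_ts, INTERVAL 1 DAY)) AS ts
    FROM bounds
)
SELECT
    c.product_id,
    CAST(c.ts AS DATE) AS sale_date,
    d.sales_value      -- NULL where the day was missing
FROM calendar AS c
LEFT JOIN demo_sales AS d
    ON d.product_id = c.product_id
   AND d.sale_date = CAST(c.ts AS DATE)
ORDER BY c.product_id, sale_date;
```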
**Usage:**
```sql
SELECT * FROM fill_time_gaps(timeseries_tbl, {'product_id': product_id, 'store_id': store_id}, date_column, sales_value);
```
**Parameters:**
- **tbl_name**: Name of the table or subquery to process (string)
- **hierarchy_cols**: Struct of column names and values that define the time series grouping
- **time_col**: Date/timestamp column for the time series
- **target_col**: The metric column to fill with NULL for missing dates
**Output:**
- Returns the original data with additional rows for missing dates
- Missing values are filled with NULL
- Results are ordered by hierarchy columns and date
##### Utility Macros
###### Drop Short Series
Removes series with fewer than `min_length` values from the dataset.
```sql
SELECT * FROM drop_short_series(timeseries_summary_tbl, timeseries_tbl, 30);
```
**Parameters:**
- **summary_table**: Summary table of the time series data
- **min_length**: Minimum length of the series
###### Drop Constant Series
Removes constant series from the dataset.
```sql
SELECT * FROM drop_constant_series(timeseries_summary_tbl, timeseries_tbl);
```
**Parameters:**
- **summary_table**: Summary table of the time series data
###### Remove Leading Zeros
Removes sequences of zeros at the start of each time series.
```sql
SELECT * FROM remove_leading_zeros(timeseries_tbl, {'product_id': product_id, 'store_id': store_id}, date_column, sales_value);
```
**Parameters:**
- **tbl_name**: Name of the table or subquery to process
- **hierarchy_cols**: Struct of column names and values that define the time series grouping
- **time_col**: Date/timestamp column for the time series
- **target_col**: The metric column to check for zeros
**Output:**
- Returns the original data with leading zeros removed
- Keeps all data points after the first non-zero value in each series
- Maintains original column structure
## Contributing
Feel free to contribute additional SQL recipes by submitting a pull request. Please ensure your recipes are well-documented and include example usage.
## License
This project is open source and available under the MIT License.