An open API service indexing awesome lists of open source software.

https://github.com/chanupadeshan/basic-data-analytics

basic data analysis project using sql
https://github.com/chanupadeshan/basic-data-analytics

data-analysis eda exploratory-data-analysis sql window-function

Last synced: 1 day ago
JSON representation

basic data analysis project using sql

Awesome Lists containing this project

README

          

# basic-data-analytics

This repository contains simple example SQL scripts and CSV datasets to explore a small data warehouse-style project using SQL Server (mssql).

Contents

- `datasets/` - sample CSV files (gold.dim_customers.csv, gold.dim_products.csv, gold.fact_sales.csv)

- `scripts/` - useful SQL scripts:
- `initialize_database.sql` — creates the `DataWarehouseAnalytics` database, `gold` schema, tables, and bulk-loads data from CSV files (this will DROP the database if it exists)
- `database_exploration.sql` — lists tables and columns
- `date_exploration.sql` — date range and age queries
- `dimentsions_exploration.sql` — (typo: "dimentions") explore dimension tables
- `magnitute_analysis.sql` — aggregation and revenue queries
- `measures_exploration.sql` — measures and summary report
- `ranking_analysis.sql` — top/bottom ranking queries

## EDA primer — Dimensions, Measures and common explorations

🧩 **Dimensions vs. Measures**

EDA starts by distinguishing between two types of data fields:

- **Dimensions** – Qualitative or categorical fields that describe data.
Examples: Category, Product, Region, Gender, Customer_ID

- **Measures** – Quantitative or numerical fields that can be aggregated.
Examples: Sales, Quantity, Profit, Age

➡️ Rule of thumb: If a column is numeric and makes sense to sum or average, it’s a measure. Otherwise, it’s a dimension.

📏 **Dimensions Exploration**

Dimension exploration helps identify unique values and distribution across categorical fields.

Common SQL operations:
- `SELECT DISTINCT column_name`

- `COUNT(*)` / `COUNT(DISTINCT column_name)`

- `GROUP BY column_name` with `COUNT()`, `AVG()`, `SUM()`
- Percentiles / frequency distributions (e.g., COUNT() / total)

📆 **Date Exploration**

Dates are crucial for understanding time trends and activity periods.

Common SQL operations:

- MIN(order_date), MAX(order_date) — find dataset span
- DATEDIFF / DATEPART — compute durations or extract year/month/week
- GROUP BY YEAR(order_date), MONTH(order_date) — time series aggregations
- Rolling/window functions for moving averages (e.g., OVER(ORDER BY order_date ROWS BETWEEN ...))

📈 **Measures Exploration**

Measure exploration summarizes numerical columns using aggregations.

Common SQL operations:

- SUM(sales_amount), AVG(price), MIN(), MAX()
- COUNT(DISTINCT order_number) — orders vs items
- Distribution checks (histograms, percentiles) using NTILE or PERCENTILE_CONT

📊 **Magnitude Analysis**

Magnitude analysis connects measures with dimensions to show how metrics vary across categories.

Common SQL operations:

- GROUP BY category -> SUM(sales_amount) to see category revenue
- ORDER BY SUM(sales_amount) DESC to find largest contributors
- JOIN fact -> dimension tables to attribute measures to descriptive fields

🏆 **Ranking Analysis**

Ranking is used to find top or bottom performers in a dataset.

Common SQL operations:

- ROW_NUMBER(), RANK(), DENSE_RANK() with ORDER BY SUM(...) DESC/ASC
- Use a CTE/window function then filter by rank (e.g., WHERE rank <= 10)

## Part 2: Advanced Data Analytics

For more advanced SQL warehouse exploration, see [advanced-data-analytics](https://github.com/chanupadeshan/advanced-data-analytics).