{"id":22732928,"url":"https://github.com/vishrut-b/end-to-end-data-analytics-with-python-and-sql","last_synced_at":"2026-02-07T02:02:35.231Z","repository":{"id":267445420,"uuid":"901245269","full_name":"vishrut-b/End-to-End-Data-Analytics-with-Python-and-SQL","owner":"vishrut-b","description":"This project involves the data cleaning and SQL-based analytics of a retail orders dataset using Python and SQL. It focuses on preprocessing data, followed by detailed analytics to extract insights on sales trends and product performance.","archived":false,"fork":false,"pushed_at":"2024-12-10T11:02:31.000Z","size":275,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-04-23T17:13:59.920Z","etag":null,"topics":["data-analysis","python","retail","sql","sql-server","sqlalchemy"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/vishrut-b.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-12-10T09:59:22.000Z","updated_at":"2024-12-15T15:22:05.000Z","dependencies_parsed_at":"2024-12-10T12:18:55.900Z","dependency_job_id":"af6e8a47-f8b1-453e-af33-13e632856a9c","html_url":"https://github.com/vishrut-b/End-to-End-Data-Analytics-with-Python-and-SQL","commit_stats":null,"previous_names":["vishrut-b/end-to-end-data-analytics-with-python-and-sql"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vishrut-b%2FEnd-to-End-Data-Analytics-
with-Python-and-SQL","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vishrut-b%2FEnd-to-End-Data-Analytics-with-Python-and-SQL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vishrut-b%2FEnd-to-End-Data-Analytics-with-Python-and-SQL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/vishrut-b%2FEnd-to-End-Data-Analytics-with-Python-and-SQL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/vishrut-b","download_url":"https://codeload.github.com/vishrut-b/End-to-End-Data-Analytics-with-Python-and-SQL/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":250477810,"owners_count":21437049,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["data-analysis","python","retail","sql","sql-server","sqlalchemy"],"created_at":"2024-12-10T20:11:55.419Z","updated_at":"2026-02-07T02:02:35.205Z","avatar_url":"https://github.com/vishrut-b.png","language":"Jupyter Notebook","readme":"# Retail Orders Data Cleaning and SQL Data Analytics\n\nThis project involves data cleaning, transformation, and SQL-based analytics using Python (Pandas, SQLAlchemy) and SQL. It wrangles a retail orders dataset with Pandas, then runs SQL queries to extract meaningful insights from the data.
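As a minimal, self-contained sketch of the kind of transformation the project performs (the column names are assumed from the Kaggle retail-orders dataset, and the values are toy data, not from the real file):

```python
import pandas as pd

# Toy frame with the raw column names the dataset is assumed to use
df = pd.DataFrame({
    'Order Date': ['2023-01-05'],
    'List Price': [100.0],
    'Cost Price': [60.0],
    'Discount Percent': [10.0],
})

# Standardize column names: lowercase, spaces to underscores
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Derived columns: discount amount, effective sold price, and profit
df['discount_amount'] = df['list_price'] * df['discount_percent'] * 0.01
df['sold_price'] = df['list_price'] - df['discount_amount']
df['profit'] = df['sold_price'] - df['cost_price']
df['order_date'] = pd.to_datetime(df['order_date'], format='%Y-%m-%d')

print(df[['discount_amount', 'sold_price', 'profit']].iloc[0].tolist())
# -> [10.0, 90.0, 30.0]
```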
\n\n## Table of Contents\n- [Project Overview](#project-overview)\n- [Data Cleaning](#data-cleaning)\n- [SQL Data Analytics](#sql-data-analytics)\n- [Technologies Used](#technologies-used)\n- [Setup and Installation](#setup-and-installation)\n\n## Project Overview\n\nIn this project, we work with a dataset of retail orders that includes product and order information such as product IDs, prices, discount percentages, regions, categories, and sales data. We perform the following steps:\n\n1. **Data Cleaning**: Preprocess the raw data to standardize column names, handle missing values, and create new calculated columns (e.g., `discount_amount`, `sold_price`, `profit`).\n2. **SQL Data Analytics**: Use SQL queries to analyze the cleaned data, uncover key insights, and generate reports on sales trends, revenue generation, and product performance.\n\n## Data Cleaning\n\nThe data cleaning process involves the following steps:\n\n1. **Download the Dataset**: Use the Kaggle API to download the dataset and extract the CSV file.\n2. **Read the CSV File**: Load the data into a Pandas DataFrame.\n3. **Column Name Standardization**: Clean and standardize column names by converting them to lowercase and replacing spaces with underscores.\n4. **Feature Engineering**:\n   - Calculate the `discount_amount` as an actual value rather than a percentage.\n   - Compute the `sold_price` by subtracting the `discount_amount` from the `list_price`.\n   - Calculate the `profit` by subtracting `cost_price` from `sold_price`.\n5. **Data Type Conversion**: Convert the `order_date` column to a proper datetime format.\n6. 
**Drop Unnecessary Columns**: Drop columns that are no longer needed after feature engineering.\n\n**Code Snippet** (for Data Cleaning):\n\n```python\nimport pandas as pd\nimport zipfile\nimport kaggle\n\n# Download the dataset via the Kaggle API (requires a configured API token);\n# the file arrives zipped as orders.csv.zip\nkaggle.api.authenticate()\nkaggle.api.dataset_download_file('ankitbansal06/retail-orders', file_name='orders.csv', path='.')\nwith zipfile.ZipFile('orders.csv.zip') as zip_ref:\n    zip_ref.extractall()\n\n# Read the CSV into a DataFrame, treating placeholder strings as missing values\ndf = pd.read_csv('orders.csv', na_values=['Not Available', 'unknown'])\n\n# Standardize column names: lowercase, spaces to underscores\ndf.columns = df.columns.str.lower().str.replace(' ', '_')\n\n# Feature engineering\ndf['discount_amount'] = df['list_price'] * df['discount_percent'] * 0.01\ndf['sold_price'] = df['list_price'] - df['discount_amount']\ndf['profit'] = df['sold_price'] - df['cost_price']\n\n# Convert order_date to datetime\ndf['order_date'] = pd.to_datetime(df['order_date'], format=\"%Y-%m-%d\")\n\n# Drop columns made redundant by the derived ones\ndf = df.drop(columns=['list_price', 'cost_price', 'discount_percent'])\n\n# Connect to SQL Server via SQLAlchemy ('Pirouette' is the author's server name)\nimport sqlalchemy as sal\nengine = sal.create_engine('mssql+pyodbc://Pirouette/master?driver=ODBC+DRIVER+17+FOR+SQL+SERVER')\nconn = engine.connect()\n\n# Write the cleaned data to SQL Server as the df_orders table queried below\ndf.to_sql('df_orders', con=conn, index=False, if_exists='append')\n```\n\n## SQL Data Analytics\n\nAfter cleaning and transforming the data, we use SQL queries to perform the following analyses:\n\n- Top 10 Highest Revenue Generating Products: Identify products that generate the highest total sales.\n- Top 5 Highest Selling Products in Each Region: Identify the top-selling products by region.\n- Month-over-Month Sales Growth (2022 vs 2023): Compare sales growth month-over-month between 2022 and 2023.\n- Highest Sales Month by Category: Identify which month had the highest sales for each product category.\n- Subcategory with Highest Growth by Profit: Identify which subcategory had the highest profit growth from 2022 to 2023.\n\n```sql\n-- Find top 10 highest revenue generating 
products\nSELECT TOP 10 product_id, SUM(sold_price) AS total_sales\nFROM df_orders\nGROUP BY product_id\nORDER BY total_sales DESC;\n\n-- Find top 5 highest selling products in each region\nWITH cte AS (\n    SELECT region, product_id, SUM(sold_price) AS total_sales\n    FROM df_orders\n    GROUP BY region, product_id\n)\nSELECT * \nFROM (\n    SELECT *, ROW_NUMBER() OVER (PARTITION BY region ORDER BY total_sales DESC) AS rn\n    FROM cte\n) AS ranked\nWHERE rn \u003c= 5;\n\n-- Month-over-month growth comparison for '22 and '23 sales\nWITH cte AS (\n    SELECT \n        YEAR(order_date) AS order_year, \n        MONTH(order_date) AS order_month, \n        SUM(sold_price) AS total_sales\n    FROM df_orders\n    GROUP BY YEAR(order_date), MONTH(order_date)\n)\nSELECT \n    order_month,\n    SUM(CASE WHEN order_year = 2022 THEN total_sales ELSE 0 END) AS sales_2022,\n    SUM(CASE WHEN order_year = 2023 THEN total_sales ELSE 0 END) AS sales_2023\nFROM cte\nGROUP BY order_month\nORDER BY order_month;\n\n-- Highest sales month by category\nWITH cte AS (\n    SELECT category, FORMAT(order_date, 'yyyy-MM') AS year_month, SUM(sold_price) AS total_sales\n    FROM df_orders\n    GROUP BY category, FORMAT(order_date, 'yyyy-MM')\n)\nSELECT category, year_month, total_sales\nFROM (\n    SELECT *, ROW_NUMBER() OVER (PARTITION BY category ORDER BY total_sales DESC) AS rn\n    FROM cte\n) AS ranked\nWHERE rn = 1;\n\n-- Subcategory with highest growth by profit from 2022 to 2023\nWITH cte AS (\n    SELECT sub_category, SUM(profit) AS total_profit, YEAR(order_date) AS order_year\n    FROM df_orders\n    GROUP BY sub_category, YEAR(order_date)\n),\ncte2 AS (\n    SELECT \n        sub_category,\n        SUM(CASE WHEN order_year = 2022 THEN total_profit ELSE 0 END) AS profit_2022,\n        SUM(CASE WHEN order_year = 2023 THEN total_profit ELSE 0 END) AS profit_2023\n    FROM cte\n    GROUP BY sub_category\n)\nSELECT TOP 1 *, \n       ((profit_2023 - profit_2022) * 100.0) / profit_2022 AS 
growth_percentage\nFROM cte2\nORDER BY growth_percentage DESC;\n```\n\n## Technologies Used\n- Python: For data cleaning and transformation with Pandas, NumPy, and Matplotlib.\n- SQL: For data analytics queries against a Microsoft SQL Server database.\n- SQLAlchemy: For establishing the connection between Python and SQL Server.\n- Kaggle API: For downloading the dataset directly into the project directory.\n\n## Setup and Installation\n\n**Prerequisites**:\n- Python 3.x\n- Anaconda or a virtual environment (optional)\n- Required libraries:\n  - pandas\n  - numpy\n  - matplotlib\n  - kaggle\n  - sqlalchemy\n  - pyodbc\n\nFeel free to reach out at vishrutbezbarua@gmail.com in case you need any help!\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvishrut-b%2Fend-to-end-data-analytics-with-python-and-sql","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvishrut-b%2Fend-to-end-data-analytics-with-python-and-sql","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvishrut-b%2Fend-to-end-data-analytics-with-python-and-sql/lists"}