https://github.com/avc-prog/data_science_and_analysis_with_python_and_sql

In this repository, I present a collection of projects focused on data analysis and science, featuring real-world datasets and one fictitious dataset for the sake of practice. The projects showcase various data analysis and data science techniques and serve as practical examples, using Excel, Tableau, Power BI, Python, and SQL.

data-analysis-python data-analytics data-science data-visualization databases

This repository showcases my work in data analysis and data science. Each project involves working with messy datasets, applying SQL and Python for data cleaning, and building predictive models to generate insights and solve problems.

---

This repository is a work in progress: some of the tasks presented are not yet complete, and new work will continue to be added over time.

I believe that the best way to improve is through trial and error, and as such, you may encounter mistakes or less-than-perfect solutions within the code. Rather than hiding them, I’ve intentionally left them in place to hold myself accountable.

The code also includes alternative approaches, along with comments explaining what was done and which assumptions were made.

## Skills & Techniques Used:

### **Data Preprocessing & Cleaning**
- Handling missing values (imputation, removal, interpolation)
- Correcting inconsistent string values & data types
- Feature engineering (creating new columns, transforming variables)
- Handling outliers & scaling numerical features
- Creating and working with datetime features for time series analysis
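
As a rough illustration of the cleaning steps listed above (not code taken from the projects), a minimal pandas sketch might look like the following. The DataFrame, its column names, and its values are all invented for the example, and the choices made (median imputation, capping at the 1.5 × IQR fences) are just one reasonable option among several.

```python
import pandas as pd

# Hypothetical messy data; every column name and value is made up.
raw = pd.DataFrame({
    "city": [" lisbon", "Lisbon ", "PORTO", "porto", None],
    "price": ["10.5", "12", None, "11", "400"],
    "sold_on": ["2024-01-05", "2024-01-06", "2024-02-10", None, "2024-03-01"],
})


def clean(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Inconsistent strings: trim whitespace, unify capitalisation, label gaps.
    out["city"] = out["city"].str.strip().str.title().fillna("Unknown")
    # Data types: text to numbers and datetimes (unparseable values become NaN/NaT).
    out["price"] = pd.to_numeric(out["price"], errors="coerce")
    out["sold_on"] = pd.to_datetime(out["sold_on"], errors="coerce")
    # Imputation: fill missing prices with the median.
    out["price"] = out["price"].fillna(out["price"].median())
    # Outliers: cap values at the 1.5 * IQR fences.
    q1, q3 = out["price"].quantile([0.25, 0.75])
    iqr = q3 - q1
    out["price"] = out["price"].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    # Removal: drop rows whose date is missing, then build datetime features.
    out = out.dropna(subset=["sold_on"]).reset_index(drop=True)
    out["month"] = out["sold_on"].dt.month
    out["weekday"] = out["sold_on"].dt.day_name()
    return out


cleaned = clean(raw)
print(cleaned)
```

Whether to impute, cap, or drop depends on the dataset, so the order and thresholds here should be treated as assumptions rather than a recipe.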

### **SQL & Database Management**
- Writing queries (joins, window functions, aggregations, common table expressions, stored procedures, transactions, and string manipulations)
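
To give a flavour of the query types listed above, here is a small sketch that runs SQL from Python. It uses an in-memory SQLite database as a stand-in (the projects themselves target MySQL), the tables and rows are invented, and it assumes an SQLite build recent enough to support window functions (3.25 or later). Stored procedures are left out because SQLite does not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")

# Transaction: both batches of inserts are committed together, or not at all.
with conn:
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "  ana "), (2, "Bruno"), (3, "carla")],
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 1, 50.0), (2, 1, 70.0), (3, 2, 20.0), (4, 3, 150.0)],
    )

query = """
WITH totals AS (                               -- common table expression
    SELECT UPPER(TRIM(c.name)) AS name,        -- string manipulation
           SUM(o.amount)       AS total,       -- aggregation
           COUNT(*)            AS n_orders
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id   -- join
    GROUP BY c.id, c.name
)
SELECT name, total, n_orders,
       RANK() OVER (ORDER BY total DESC) AS spend_rank   -- window function
FROM totals
ORDER BY spend_rank;
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```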

### **Exploratory Data Analysis (EDA)**
- Visualizing distributions, correlations, and trends
- Generating insights through graphs and statistical summaries
- Detecting patterns & anomalies in data
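
The EDA steps above can be sketched on synthetic data as follows. Everything here is an assumption made for illustration: the two columns, the injected anomaly, and the z-score threshold of 3. Plotting is left out so the example stays dependency-light; the histogram counts are the numbers a distribution plot would draw.

```python
import numpy as np
import pandas as pd

# Synthetic data: weight depends on height plus noise (invented relationship).
rng = np.random.default_rng(0)
n = 200
height = rng.normal(170, 10, n)
weight = 0.9 * height - 80 + rng.normal(0, 5, n)
df = pd.DataFrame({"height_cm": height, "weight_kg": weight})
df.loc[0, "weight_kg"] = 250.0  # inject one obvious anomaly

# Statistical summary and pairwise correlations.
summary = df.describe()
corr = df.corr()

# Distribution: bin counts and bin edges behind a histogram of heights.
counts, edges = np.histogram(df["height_cm"], bins=10)

# Anomalies: flag rows more than 3 standard deviations from the mean.
z = (df["weight_kg"] - df["weight_kg"].mean()) / df["weight_kg"].std()
anomalies = df[z.abs() > 3]

print(summary)
print(corr)
print(anomalies)
```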

### **Machine Learning Models**
- **Supervised Learning**: Linear Regression, Logistic Regression, Random Forest, Decision Tree, XGBoost, Gradient Boosting, KNN, and Neural Networks
- **Unsupervised Learning**: K-Means Clustering and PCA for dimensionality reduction
- **Time Series Forecasting**: ARIMA, SARIMA, and Exponential Smoothing
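
As a hedged example of how several of the supervised and unsupervised models above can be compared, the sketch below uses scikit-learn on a synthetic classification dataset. The dataset, the hyperparameters, and the choice of 5-fold cross-validation are assumptions for illustration and do not reflect the settings used in the projects.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a cleaned project dataset.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Supervised learning: compare models by mean 5-fold cross-validated accuracy.
models = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
scores = {
    name: cross_val_score(model, X, y, cv=5).mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")

# Unsupervised learning: reduce to 2 components with PCA, then cluster.
X_2d = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_2d)
```

Scaling is placed inside the pipelines so that each cross-validation fold fits the scaler on its own training portion only.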

---

## Project Overviews

### **Project 1: Pokémon**
- **Objective**: Clean and visualize the data, then use statistics and various data science models to draw conclusions from it
- **Dataset**: https://github.com/KeithGalli/pandas (the CSV file is also available in the Project 1 folder)
- **Key Tasks**:
  - Performed **data cleaning & feature engineering**
  - Conducted **exploratory data analysis (EDA)**
  - Used **machine learning models** (I tried all the models mentioned above and assessed which ones yield useful information and which do not)
  - Applied **SQL for data processing and analysis** as an alternative to Python Pandas and PySpark
- **Structure**: Three main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL
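
The last step in the structure, moving a DataFrame into a SQL database and querying it there, might look roughly like this. The sketch uses an in-memory SQLite connection as a stand-in for MySQL, and the table name, column names, and rows are invented for the example rather than taken from the project files.

```python
import sqlite3

import pandas as pd

# Hypothetical rows; in the project the data would come from the CSV file.
df = pd.DataFrame({
    "name": ["Bulbasaur", "Charmander", "Squirtle"],
    "type_1": ["Grass", "Fire", "Water"],
    "hp": [45, 39, 44],
})

# SQLite stand-in: pandas can write directly to a sqlite3 connection.
conn = sqlite3.connect(":memory:")
df.to_sql("pokemon", conn, index=False, if_exists="replace")

# Once the data is in the database, the analysis can continue in SQL.
result = pd.read_sql_query(
    """
    SELECT type_1, AVG(hp) AS avg_hp
    FROM pokemon
    GROUP BY type_1
    ORDER BY avg_hp DESC
    """,
    conn,
)
print(result)
```

For a real MySQL server, `to_sql` would typically be given a SQLAlchemy engine instead of the `sqlite3` connection; the connection details depend on the local setup and are not shown here.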

### **Project 2: Finance**
- **Objective**: Clean and visualize the data, then use statistics and various data science models to draw conclusions from it. The dataset was deliberately kept to a small number of rows to emphasize that machine learning models need a large sample to work properly, since results obtained on small samples can be deceiving.
- **Dataset**: ChatGPT-generated data (available in the Project 2 folder as a CSV file)
- **Key Tasks**:
  - Performed **data cleaning & feature engineering**
  - Conducted **exploratory data analysis (EDA)**
  - Used **machine learning models** (I tried all the models mentioned above, except the time series ones, and assessed which ones yield useful information and which do not)
  - Applied **SQL for data processing and analysis** as an alternative to Python Pandas and PySpark
- **Structure**: Three main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL
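
The point about small samples can be illustrated with a short experiment. This is a sketch on synthetic data, not the project's own dataset: it trains the same model on many random train/test splits and compares how much the measured accuracy moves around for 30 rows versus 3,000 rows. The sample sizes, model, and number of repeats are arbitrary choices.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def accuracy_spread(n_samples: int, n_repeats: int = 30) -> tuple[float, float]:
    """Return the mean and standard deviation of test accuracy over random splits."""
    X, y = make_classification(
        n_samples=n_samples,
        n_features=5,
        n_informative=3,
        n_redundant=0,
        random_state=0,
    )
    scores = []
    for seed in range(n_repeats):
        # Stratify so that both classes appear in every train and test split.
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=seed
        )
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(model.score(X_te, y_te))
    return float(np.mean(scores)), float(np.std(scores))


small_mean, small_std = accuracy_spread(30)
large_mean, large_std = accuracy_spread(3000)
print(f"n=30:   accuracy {small_mean:.2f} +/- {small_std:.2f}")
print(f"n=3000: accuracy {large_mean:.2f} +/- {large_std:.2f}")
```

With only a handful of test rows, a single split can look much better or worse than the model really is, which is why one score on a small dataset should be read with caution.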

### **Project 3: Soccer Analysis**
- **Objective**: Clean and visualize the data, then use statistics and various data science models to draw conclusions from it
- **Dataset**: CSV files available in the Project 3 folder
- **Key Tasks**:
  - Performed **data cleaning & feature engineering**
  - Conducted **exploratory data analysis (EDA)**
  - Built and fine-tuned **machine learning models** (I tried all the models mentioned above, except the time series ones, and assessed which ones yield useful information and which do not)
  - Applied **SQL for data processing and analysis** as an alternative to Python Pandas and PySpark
- **Structure**: Three main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL

### **Project 4: Car Sales Analysis**
- **Objective**: Clean and visualize the data, then use statistics and various data science models to draw conclusions from it
- **Dataset**: https://www.kaggle.com/datasets/safaeahb/car-sales-analysis-dashboard/data?select=car+sales.csv (the CSV file is also available in the Project 4 folder)
- **Key Tasks**:
  - Performed **data cleaning & feature engineering**
  - Conducted **exploratory data analysis (EDA)**
  - Built and fine-tuned **machine learning models** (I tried all the models mentioned above, except the time series ones, and assessed which ones yield useful information and which do not)
  - Applied **SQL for data processing and analysis** as an alternative to Python Pandas and PySpark
- **Structure**: Three main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL

### **Project 5: Healthcare Insurance Analysis**
- **Objective**: Clean and visualize the data, then use statistics and various data science models to draw conclusions from it
- **Dataset**: https://github.com/KeithGalli/Regression-Example (the CSV file is also available in the Project 5 folder)
- **Key Tasks**:
  - Performed **data cleaning & feature engineering**
  - Conducted **exploratory data analysis (EDA)**
  - Built and fine-tuned **machine learning models** (I tried all the models mentioned above, except the time series ones, and assessed which ones yield useful information and which do not)
  - Applied **SQL for data processing and analysis** as an alternative to Python Pandas and PySpark
- **Structure**: Three main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL

### **Project 6: Online Retail Analysis**
- **Objective**: Clean and visualize the data, then use statistics and various data science models to draw conclusions from it
- **Dataset**: https://archive.ics.uci.edu/dataset/502/online+retail+ii
- **Key Tasks**:
  - Performed **data cleaning & feature engineering**
  - Conducted **exploratory data analysis (EDA)**
  - Built and fine-tuned **machine learning models**
  - Applied **SQL for data processing and analysis** as an alternative to Python Pandas and PySpark
- **Structure**: Three main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL

### **Project 7: Telecommunications Analysis**
- **Objective**: Clean and visualize the data, then use statistics and various data science models to draw conclusions from it
- **Dataset**: https://github.com/harshbg/Telecom-Churn-Data-Analysis/blob/master/Telecom%20Churn.csv (the CSV file is also available in the Project 7 folder)
- **Key Tasks**:
  - Performed **data cleaning & feature engineering**
  - Conducted **exploratory data analysis (EDA)**
  - Used **machine learning models** (I tried all the models mentioned above and assessed which ones yield useful information and which do not)
  - Applied **SQL for data processing and analysis** as an alternative to Python Pandas and PySpark
- **Structure**: Three main headers: Data Analysis (Using Pandas), Data Science (Using Pandas), Transferring the data to MySQL

## Simple Models Folder
This folder contains implementations of commonly used machine learning models (which correspond to the headers), including:
- Linear Regression
- Logistic Regression
- Random Forest, Decision Tree
- Gradient Boosting
- Neural Networks (Basic MLP)
- K-Means
- Principal Component Analysis (PCA)
- Time Series Forecasting (ARIMA, SARIMA, Exponential Smoothing)
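
To show the idea behind the simplest of the time series methods listed above, here is a hand-written sketch of simple exponential smoothing in NumPy. It is not the implementation in the folder and does not use a forecasting library; the function name, the `alpha` value, and the sales figures are invented for illustration.

```python
import numpy as np


def simple_exponential_smoothing(series, alpha: float) -> np.ndarray:
    """Smooth a series with s[t] = alpha * x[t] + (1 - alpha) * s[t - 1]."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in the interval (0, 1]")
    x = np.asarray(series, dtype=float)
    if x.size == 0:
        raise ValueError("series must not be empty")
    smoothed = np.empty_like(x)
    smoothed[0] = x[0]  # initialise the level with the first observation
    for t in range(1, len(x)):
        smoothed[t] = alpha * x[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed


sales = [10, 12, 11, 15, 14, 18]  # hypothetical monthly sales
level = simple_exponential_smoothing(sales, alpha=0.5)
forecast = level[-1]  # the one-step-ahead forecast is the last smoothed level
print(level)     # [10.   11.   11.   13.   13.5  15.75]
print(forecast)  # 15.75
```

A larger `alpha` makes the forecast react faster to recent values, while a smaller one smooths more heavily. ARIMA and SARIMA add autoregressive, differencing, and seasonal terms on top of this kind of recursive update.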