https://github.com/kaushik-puttaswamy/exploratory-data-analysis-using-python
The primary goal of this exploratory data analysis is to create a reliable and effective Python program. Based on the training data, the program should select the four best-fit ideal functions from a set of 50 functions and then map the test data to these chosen functions while taking a deviation criterion into account.
- Host: GitHub
- URL: https://github.com/kaushik-puttaswamy/exploratory-data-analysis-using-python
- Owner: Kaushik-Puttaswamy
- Created: 2023-10-15T21:20:12.000Z (over 2 years ago)
- Default Branch: main
- Last Pushed: 2025-03-18T22:22:40.000Z (11 months ago)
- Last Synced: 2025-03-18T23:22:11.064Z (11 months ago)
- Topics: bokeh, expolatory-data-analysis, matplotlib, numpy, pandas, python, seaborn, sqlalchemy
- Language: Python
- Homepage:
- Size: 8.08 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
# 📘 Exploratory-Data-Analysis-using-Python
## 🔍 Introduction
This project focuses on analyzing the performance of electronic components using Python. The tasks include data storage and retrieval, exploratory data analysis (EDA), model selection, data collection, data cleansing, data visualization, data mapping, deviation calculation, and unit testing. Three datasets are provided:
• 📂 Train dataset (used to select ideal functions)
• 📂 Ideal dataset (contains 50 ideal functions)
• 📂 Test dataset (used for validation and mapping)
This analysis helps predict equipment failures by comparing real-world component performance against optimal readings.
## 🎯 Problem Definition
The main task is to develop a Python program that:
1. 🏆 Selects the four best-fitting functions from 50 available ideal functions using training data.
2. 🔗 Maps each x-y pair in the test data to one of the four selected ideal functions, subject to a deviation criterion.
3. 📊 Stores the mapping results along with deviation calculations.
## 🎯 Aim and Objectives
The project aims to create a reliable and efficient Python program that:
• 🏆 Selects four ideal functions based on least squares error.
• 🔗 Maps test data to these ideal functions while considering deviation constraints.
• 📈 Evaluates performance using R-squared values and other error metrics.
## ❓ Research Questions
• 📌 How can we obtain the four best-fit ideal functions using the least squares method?
• 📌 What are the best alternative evaluation metrics for selection?
• 📌 Do alternative metrics yield the same ideal function choices as the least squares method?
• 📌 What are the R-squared values for the selected functions with test data?
• 📌 How does deviation change after mapping test data to ideal functions?
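
One way to approach the question about alternative metrics is to rank the candidates under several error measures and check whether they agree. The sketch below is illustrative only: the synthetic data, the function names `error_metrics` and `best_by_metric`, and the choice of metrics (sum of squared errors, mean absolute error, maximum absolute deviation) are assumptions, not taken from the project code.

```python
import numpy as np

def error_metrics(y_train, y_ideal):
    """Return several error metrics between one training column and one ideal column."""
    residuals = np.asarray(y_train, dtype=float) - np.asarray(y_ideal, dtype=float)
    return {
        "sse": float(np.sum(residuals ** 2)),         # sum of squared errors (least squares)
        "mae": float(np.mean(np.abs(residuals))),     # mean absolute error
        "max_abs": float(np.max(np.abs(residuals))),  # worst-case deviation
    }

def best_by_metric(y_train, ideal_columns):
    """Map each metric name to the ideal-function name that minimises it."""
    scores = {name: error_metrics(y_train, col) for name, col in ideal_columns.items()}
    metric_names = ["sse", "mae", "max_abs"]
    return {m: min(scores, key=lambda name: scores[name][m]) for m in metric_names}

# Synthetic stand-in data (hypothetical): a noisy line and three candidate functions.
x = np.linspace(-5, 5, 101)
rng = np.random.default_rng(0)
y_train = 2 * x + rng.normal(0, 0.2, x.size)
ideal_columns = {"y1": 2 * x, "y2": x ** 2, "y3": np.sin(x)}

choices = best_by_metric(y_train, ideal_columns)
print(choices)
```

If every metric returns the same function name, the least squares choice is robust for that training column; a disagreement would point to outliers or heavy-tailed noise.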
## 🏗 Structure of the Study
This research is structured into three main sections:
• 📖 Introduction: Overview, problem definition, objectives, research questions.
• 🛠 Investigation Method: EDA, database storage, function selection, and mapping.
• 📌 Conclusion: Summary of results, future scope, and recommendations.
## 📊 Exploratory Data Analysis (EDA)
EDA techniques were applied to analyze dataset properties and relationships. Various visualizations were used, including box plots, scatter plots, and correlation matrices.
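
The numbers behind those plots can be reproduced with a few pandas calls. The following is a minimal sketch on a small made-up table; the column names `x` and `y` and the values are assumptions used only for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the test dataset, with one duplicated row.
test_df = pd.DataFrame({
    "x": [0.0, 1.0, 1.0, 2.0, 3.0],
    "y": [0.1, 1.9, 1.9, 4.2, 5.8],
})

# Five-number summary: the values a box plot draws for each column.
summary = test_df.describe().loc[["min", "25%", "50%", "75%", "max"]]

# Drop exact duplicate rows before further analysis.
deduplicated = test_df.drop_duplicates().reset_index(drop=True)

# Pearson correlation matrix between the columns.
correlation = deduplicated.corr()

print(summary)
print(f"rows before: {len(test_df)}, after: {len(deduplicated)}")
print(correlation)
```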
### 📂 Train Dataset Analysis
• 📊 Boxplot of Train Dataset

• 📈 Scatter plot with Regression Line

### 📂 Ideal Dataset Analysis
• 📊 Boxplot of Ideal Dataset

### 📂 Test Dataset Analysis
• 📊 Boxplot After Removing Duplicates

### 🗄 Data Storage and Retrieval
SQLite was used for storing datasets, accessed via SQLAlchemy ORM in Python.
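
A minimal sketch of this storage pattern is shown below. The table name, column layout, and sample rows are hypothetical, and the example uses an in-memory database so it leaves no file behind; the project's actual schema may differ.

```python
from sqlalchemy import Column, Float, Integer, create_engine, select
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()

class TrainingRow(Base):
    """Hypothetical table layout: one x value and four training y columns."""
    __tablename__ = "training_data"
    id = Column(Integer, primary_key=True)
    x = Column(Float, nullable=False)
    y1 = Column(Float)
    y2 = Column(Float)
    y3 = Column(Float)
    y4 = Column(Float)

# In-memory SQLite database; a file URL such as "sqlite:///data.db"
# (hypothetical file name) would persist the data instead.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)

with Session(engine) as session:
    # Store two sample rows.
    session.add_all([
        TrainingRow(x=0.0, y1=0.1, y2=1.0, y3=-0.2, y4=3.0),
        TrainingRow(x=1.0, y1=1.1, y2=2.1, y3=0.9, y4=2.5),
    ])
    session.commit()

    # Retrieve them again, ordered by x.
    rows = session.execute(select(TrainingRow).order_by(TrainingRow.x)).scalars().all()
    loaded = [(row.x, row.y1) for row in rows]

print(loaded)
```

The sketch relies on the `select()` and `Session` context-manager style available in SQLAlchemy 1.4 and later.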
#### 📌 Database Schema
🏷 Training Data Table

🏷 Ideal Functions Table

🏷 Test Data Mapping Table

## 🔎 Finding Ideal Functions
### 📐 Least Squares Analysis
One least-squares bar chart was generated for each training column:
• 📊 Least Squares Bar Chart (Y1 Train Data)
• 📊 Least Squares Bar Chart (Y2 Train Data)
• 📊 Least Squares Bar Chart (Y3 Train Data)
• 📊 Least Squares Bar Chart (Y4 Train Data)
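
The selection step can be sketched as follows: for each training column, compute the sum of squared deviations against every ideal column and keep the minimum. This is an illustration under assumptions, not the repository's code. It assumes both tables share the same `x` grid in the same row order, the column names are made up, and only three ideal columns are used instead of 50.

```python
import numpy as np
import pandas as pd

def select_ideal_functions(train, ideal):
    """For each training column, return the ideal column with the smallest
    sum of squared deviations, together with that sum."""
    result = {}
    for train_col in train.columns.drop("x"):
        # Subtract the training column from every ideal column, row by row.
        squared = ideal.drop(columns="x").sub(train[train_col], axis=0) ** 2
        sse = squared.sum()  # one sum of squared errors per ideal column
        result[train_col] = (sse.idxmin(), float(sse.min()))
    return result

# Hypothetical miniature datasets on a shared x grid.
x = np.linspace(0, 4, 9)
train = pd.DataFrame({"x": x, "y1": x + 0.05, "y2": x ** 2 - 0.05})
ideal = pd.DataFrame({"x": x, "y1": x ** 2, "y2": x, "y3": -x})

selection = select_ideal_functions(train, ideal)
print(selection)
```

Here the training column `y1` is matched to ideal column `y2`, and training column `y2` to ideal column `y1`, because those pairs differ only by a constant offset of 0.05.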
## 📈 Regression Line Comparisons
• Scatter plot of Training vs. Ideal Functions

## 📉 Mean Squared Error Method
• 📊 R-Squared Values for Test Data Mapping
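
For reference, both metrics can be computed directly with NumPy. The sketch below uses the standard definitions (MSE as the mean of squared residuals, R-squared as one minus the ratio of residual to total sum of squares); the sample values are invented.

```python
import numpy as np

def mean_squared_error(y_observed, y_predicted):
    """Mean of the squared residuals."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    return float(np.mean((y_observed - y_predicted) ** 2))

def r_squared(y_observed, y_predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot.
    Undefined when all observed values are identical (SS_tot == 0)."""
    y_observed = np.asarray(y_observed, dtype=float)
    y_predicted = np.asarray(y_predicted, dtype=float)
    ss_res = np.sum((y_observed - y_predicted) ** 2)
    ss_tot = np.sum((y_observed - y_observed.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical test values and the matching ideal-function values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

mse = mean_squared_error(observed, predicted)
r2 = r_squared(observed, predicted)
print(f"MSE = {mse:.3f}, R^2 = {r2:.3f}")
```

An R-squared close to 1 means the ideal function explains almost all of the variance in the mapped test points.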

## 🔗 Mapping Test Dataset with Ideal Functions
• 📊 Absolute Maximum Deviation Bar Chart


• 📈 Scatter Plot of Mapped Test Data
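
The mapping rule itself is not spelled out in this README, so the sketch below rests on an assumed criterion: a test point is accepted for a chosen function only if its absolute deviation does not exceed that function's largest training deviation multiplied by a tolerance factor. The default factor of `sqrt(2)`, the function name `map_test_point`, and all sample numbers are assumptions.

```python
import math

def map_test_point(y, ideal_values, max_train_dev, factor=math.sqrt(2)):
    """Return (function_name, deviation) for the closest admissible ideal
    function, or None if no function satisfies the deviation criterion.

    ideal_values:  name -> ideal y value at the test point's x
    max_train_dev: name -> largest absolute training deviation for that function
    factor:        assumed tolerance multiplier (not taken from the source)
    """
    best = None
    for name, ideal_y in ideal_values.items():
        deviation = abs(y - ideal_y)
        # Accept only deviations within the assumed tolerance band.
        if deviation <= factor * max_train_dev[name]:
            if best is None or deviation < best[1]:
                best = (name, deviation)
    return best

# Hypothetical values of two chosen functions at a single x.
ideal_values = {"y2": 1.0, "y7": 4.0}
max_train_dev = {"y2": 0.5, "y7": 0.5}

mapped = map_test_point(1.3, ideal_values, max_train_dev)    # close to y2
unmapped = map_test_point(2.5, ideal_values, max_train_dev)  # too far from both
print(mapped, unmapped)
```

Points that return `None` stay unmapped, which keeps outliers from inflating the stored deviations.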

## 🏁 Conclusion
The project successfully developed a Python program that:
• ✅ Selected the four best-fitting functions using least squares error.
• 🔗 Mapped test data points to these ideal functions.
• 📊 Evaluated the deviation between actual and ideal data.
## 🚀 Future Scope
• 🔎 Investigate alternative evaluation metrics.
• 📈 Improve accuracy using advanced machine learning techniques.
• 🤖 Automate parameter tuning for better function selection.