https://github.com/dawidolko/datafusion-app-python
Project as part of the Data Warehousing subject.
https://github.com/dawidolko/datafusion-app-python
academic-project data dataprocessing extraction gui loading project pysimplegui python transformation
Last synced: 4 months ago
JSON representation
Project as part of the Data Warehousing subject.
- Host: GitHub
- URL: https://github.com/dawidolko/datafusion-app-python
- Owner: dawidolko
- License: mit
- Created: 2025-03-12T19:25:08.000Z (over 1 year ago)
- Default Branch: main
- Last Pushed: 2026-01-17T01:07:02.000Z (5 months ago)
- Last Synced: 2026-01-17T13:52:03.806Z (5 months ago)
- Topics: academic-project, data, dataprocessing, extraction, gui, loading, project, pysimplegui, python, transformation
- Language: Python
- Homepage: http://datafusion.dawidolko.pl/
- Size: 14.7 MB
- Stars: 0
- Watchers: 1
- Forks: 1
- Open Issues: 0
-
Metadata Files:
- Readme: README.md
- Funding: .github/FUNDING.yml
- License: LICENSE
Awesome Lists containing this project
README
# DataFusion-App-Python
> ๐ **Powerful Data Analysis and Machine Learning GUI Application** - Build comprehensive data science platforms with Python, PySimpleGUI, and advanced analytics capabilities
## ๐ Description
Welcome to the **DataFusion App** repository! This user-friendly Python GUI application provides a comprehensive environment for real-world data analysis and machine learning. The application processes two distinct datasets: the UCI Adult Income dataset and the UCI Chronic Kidney Disease dataset, offering users powerful tools for data exploration, cleaning, transformation, statistical analysis, and predictive modeling.
Built with PySimpleGUI for an intuitive interface and leveraging industry-standard libraries like Pandas, Scikit-learn, Matplotlib, and Seaborn, this project demonstrates best practices in data science workflows, GUI development, and modular application architecture. Perfect for learning data analysis, machine learning algorithms, and building interactive data science applications.
## ๐ Repository Structure
```
DataFusion-App-Python/
โโโ ๐ database/ # Raw datasets
โ โโโ ๐ adult.csv # UCI Adult Income Dataset
โ โโโ ๐ chronic.csv # UCI Chronic Kidney Disease Dataset
โ โโโ ๐ README.md # Dataset documentation
โโโ ๐ docs/ # Project documentation
โ โโโ ๐ description.docx # Detailed project description
โ โโโ ๐ user-guide.pdf # User manual
โ โโโ ๐ฌ analysis-report.pdf # Analysis results
โโโ ๐ src/ # Application source code
โ โโโ ๐ฏ main.py # GUI entry point and main application
โ โโโ ๐ฆ data_handler.py # Data loading and processing
โ โโโ ๐ visualization.py # Plotting and visualization
โ โโโ ๐ค ml_models.py # Machine learning algorithms
โ โโโ ๐ statistics.py # Statistical analysis functions
โ โโโ ๐งน preprocessing.py # Data cleaning and transformation
โ โโโ ๐ผ๏ธ assets/ # Application assets
โ โ โโโ screen-app.png # Application screenshot
โ โโโ ๐ requirements.txt # Python dependencies
โโโ ๐ LICENSE # MIT License
โโโ ๐ README.md # Project documentation
```
## ๐ Getting Started
### 1. Clone the Repository
```bash
git clone https://github.com/dawidolko/DataFusion-App-Python.git
cd DataFusion-App-Python
```
### 2. Create Virtual Environment
```bash
# Create virtual environment
python -m venv venv
# Activate virtual environment
# On Linux/macOS:
source venv/bin/activate
# On Windows:
venv\Scripts\activate
```
### 3. Install Dependencies
```bash
# Install required packages
pip install -r src/requirements.txt
```
### 4. Start the Application
```bash
# Run the main application
python src/main.py
```
- The GUI application will launch automatically
## โ๏ธ System Requirements
### **Essential Tools:**
- **Python** (version 3.8 or higher)
- **pip** package manager
- **Virtual environment** (venv or virtualenv)
- **Git** for version control
### **Development Environment:**
- **Code Editor** (VS Code, PyCharm, Sublime Text)
- **Python Debugger** for development
- **Jupyter Notebook** (optional, for data exploration)
### **Required Python Libraries:**
- **PySimpleGUI** - GUI framework
- **Pandas** - Data manipulation and analysis
- **NumPy** - Numerical computing
- **Scikit-learn** - Machine learning algorithms
- **Matplotlib** - Data visualization
- **Seaborn** - Statistical data visualization
- **Scipy** - Scientific computing
### **Recommended Tools:**
- **Git** for version control
- **Python Linter** (pylint, flake8)
- **Black** for code formatting
- **pytest** for testing
## โจ Key Features
### **๐ฅ๏ธ Interactive GUI Interface**
- Simple and intuitive PySimpleGUI-based interface
- Perform complex data operations without coding
- User-friendly menu navigation
- Real-time operation feedback
- Progress indicators for long-running tasks
### **๐ Data Extraction and Transformation**
- Load multiple dataset formats (CSV, Excel, JSON)
- Handle missing data with multiple strategies
- Data normalization and standardization
- Encode categorical variables (one-hot, label encoding)
- Feature engineering and creation
- Data type conversion and validation
### **๐ Statistical Analysis**
- Calculate descriptive statistics (mean, median, mode, standard deviation)
- Quartiles and percentiles analysis
- Correlation matrix generation
- Distribution analysis and testing
- Hypothesis testing capabilities
- Outlier detection and handling
### **๐ค Machine Learning Algorithms**
#### **Classification Models:**
- **Decision Trees** - Rule-based classification
- **k-Nearest Neighbors (k-NN)** - Instance-based learning
- **Logistic Regression** - Probabilistic classification
- Model evaluation with accuracy, precision, recall, F1-score
- Confusion matrix visualization
#### **Clustering:**
- **K-Means Clustering** - Unsupervised grouping
- Elbow method for optimal cluster selection
- Cluster visualization and analysis
- Silhouette score evaluation
#### **Association Rules:**
- **Apriori Algorithm** - Pattern discovery
- Frequent itemset mining
- Rule generation with confidence and support
- Market basket analysis
### **๐ Data Visualization**
- **Histograms** - Distribution visualization
- **Scatter Plots** - Relationship exploration
- **Box Plots** - Statistical summary visualization
- **Heatmaps** - Correlation matrices
- **Bar Charts** - Categorical data comparison
- **Line Graphs** - Trend analysis
- Interactive plot customization
- Export visualizations to image files
### **๐ง Modular Architecture**
- Clean separation of concerns
- Easy to maintain and extend
- Independent module testing
- Reusable components
- Well-documented code
### **๐ Educational Focus**
- Ideal for learning data science workflows
- Real-world dataset examples
- Complete analysis pipelines
- Documented best practices
- Step-by-step guided processes
## ๐ ๏ธ Technologies Used
- **Python 3.8+** - Core programming language
- **PySimpleGUI** - GUI framework for desktop applications
- **Pandas** - Data manipulation and analysis library
- **NumPy** - Fundamental package for numerical computing
- **Scikit-learn** - Machine learning library
- **Matplotlib** - Comprehensive plotting library
- **Seaborn** - Statistical data visualization
- **Scipy** - Scientific computing tools
## ๐ Datasets
### **UCI Adult Income Dataset**
Demographic and employment data for income classification tasks:
- **Purpose:** Predict whether income exceeds $50K/year
- **Features:** Age, workclass, education, occupation, hours per week, etc.
- **Target:** Binary classification (>50K, <=50K)
- **Records:** ~48,000 entries
### **UCI Chronic Kidney Disease Dataset**
Medical parameters for diagnosing chronic kidney disease:
- **Purpose:** Binary classification of kidney disease presence
- **Features:** Blood pressure, specific gravity, albumin, blood glucose, etc.
- **Target:** CKD or not CKD
- **Records:** 400 medical cases
Both datasets are included in the `database/` directory with complete documentation.
## ๐ Usage Guide
### **1. Loading Data**
Launch the application and select "Load Dataset" from the menu. Choose between:
- Adult Income Dataset
- Chronic Kidney Disease Dataset
- Custom CSV file
### **2. Data Exploration**
Use the data exploration tools to:
- View dataset summary and statistics
- Check for missing values
- Explore data distributions
- Analyze feature correlations
### **3. Data Preprocessing**
Apply preprocessing operations:
- Handle missing values (drop, fill, interpolate)
- Normalize or standardize features
- Encode categorical variables
- Create new features
### **4. Statistical Analysis**
Generate statistical insights:
- Calculate descriptive statistics
- Create correlation matrices
- Perform distribution tests
- Identify outliers
### **5. Machine Learning**
Train and evaluate models:
- Select algorithm (Classification/Clustering/Association Rules)
- Configure model parameters
- Train on dataset
- Evaluate performance metrics
- Visualize results
### **6. Visualization**
Create insightful visualizations:
- Generate various plot types
- Customize appearance
- Export to image files
- Compare multiple features
## ๐ผ๏ธ Application Screenshot
[
](src/assets/screen-app.png)
## ๐ค Contributing
Contributions are highly welcomed! Here's how you can help:
- ๐ **Report bugs** - Found an issue? Let us know!
- ๐ก **Suggest improvements** - Have ideas for better features?
- ๐ง **Submit pull requests** - Share your enhancements and solutions
- ๐ **Improve documentation** - Help make the project clearer
Feel free to open issues or reach out through GitHub for any questions or suggestions.
## ๐จโ๐ป Author
Created by **[Dawid Olko](https://github.com/dawidolko)** - Part of the data science and machine learning series.
## ๐ License
This project is open source and available under the [MIT License](https://opensource.org/licenses/MIT).
---