{"id":28490726,"url":"https://github.com/dawidolko/datafusion-app-python","last_synced_at":"2026-02-15T13:04:01.384Z","repository":{"id":282101935,"uuid":"947495559","full_name":"dawidolko/DataFusion-App-Python","owner":"dawidolko","description":"Project as part of the Data Warehousing subject.","archived":false,"fork":false,"pushed_at":"2026-01-17T01:07:02.000Z","size":15408,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":1,"default_branch":"main","last_synced_at":"2026-01-17T13:52:03.806Z","etag":null,"topics":["academic-project","data","dataprocessing","extraction","gui","loading","project","pysimplegui","python","transformation"],"latest_commit_sha":null,"homepage":"http://datafusion.dawidolko.pl/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/dawidolko.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":".github/FUNDING.yml","license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null,"notice":null,"maintainers":null,"copyright":null,"agents":null,"dco":null,"cla":null},"funding":{"github":["dawidolko"],"patreon":null,"open_collective":null,"ko_fi":null,"tidelift":null,"community_bridge":null,"liberapay":null,"issuehunt":null,"lfx_crowdfunding":null,"polar":null,"buy_me_a_coffee":null,"thanks_dev":null,"custom":null}},"created_at":"2025-03-12T19:25:08.000Z","updated_at":"2026-01-17T01:07:06.000Z","dependencies_parsed_at":"2025-03-29T22:22:54.162Z","dependency_job_id":"efa4983a-9956-4cb4-8ff9-ae44be9f2833","html_url":"https://github.com/dawidolko/DataFusion-App-Python","commit_stats":null,"previous_names":["dawidolko/datatransmuter-app-python","d
awidolko/adultetl-app-python","dawidolko/datafusion-app-python"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/dawidolko/DataFusion-App-Python","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dawidolko%2FDataFusion-App-Python","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dawidolko%2FDataFusion-App-Python/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dawidolko%2FDataFusion-App-Python/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dawidolko%2FDataFusion-App-Python/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/dawidolko","download_url":"https://codeload.github.com/dawidolko/DataFusion-App-Python/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/dawidolko%2FDataFusion-App-Python/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":29478938,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-02-15T11:35:25.641Z","status":"ssl_error","status_checked_at":"2026-02-15T11:34:57.128Z","response_time":118,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while 
reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["academic-project","data","dataprocessing","extraction","gui","loading","project","pysimplegui","python","transformation"],"created_at":"2025-06-08T07:10:05.142Z","updated_at":"2026-02-15T13:04:01.378Z","avatar_url":"https://github.com/dawidolko.png","language":"Python","funding_links":["https://github.com/sponsors/dawidolko"],"categories":[],"sub_categories":[],"readme":"# DataFusion-App-Python  \n\n\u003e 🚀 **Powerful Data Analysis and Machine Learning GUI Application** - Build comprehensive data science platforms with Python, PySimpleGUI, and advanced analytics capabilities\n\n## 📋 Description\n\nWelcome to the **DataFusion App** repository! This user-friendly Python GUI application provides a comprehensive environment for real-world data analysis and machine learning. The application processes two distinct datasets: the UCI Adult Income dataset and the UCI Chronic Kidney Disease dataset, offering users powerful tools for data exploration, cleaning, transformation, statistical analysis, and predictive modeling.\n\nBuilt with PySimpleGUI for an intuitive interface and leveraging industry-standard libraries like Pandas, Scikit-learn, Matplotlib, and Seaborn, this project demonstrates best practices in data science workflows, GUI development, and modular application architecture. 
Perfect for learning data analysis, machine learning algorithms, and building interactive data science applications.\n\n## 📁 Repository Structure\n\n```\n\nDataFusion-App-Python/\n├── 📁 database/ # Raw datasets\n│ ├── 📊 adult.csv # UCI Adult Income Dataset\n│ ├── 📊 chronic.csv # UCI Chronic Kidney Disease Dataset\n│ └── 📖 README.md # Dataset documentation\n├── 📁 docs/ # Project documentation\n│ ├── 📝 description.docx # Detailed project description\n│ ├── 📚 user-guide.pdf # User manual\n│ └── 🔬 analysis-report.pdf # Analysis results\n├── 📁 src/ # Application source code\n│ ├── 🎯 main.py # GUI entry point and main application\n│ ├── 📦 data_handler.py # Data loading and processing\n│ ├── 📊 visualization.py # Plotting and visualization\n│ ├── 🤖 ml_models.py # Machine learning algorithms\n│ ├── 📈 statistics.py # Statistical analysis functions\n│ ├── 🧹 preprocessing.py # Data cleaning and transformation\n│ ├── 🖼️ assets/ # Application assets\n│ │ └── screen-app.png # Application screenshot\n│ └── 📋 requirements.txt # Python dependencies\n├── 📄 LICENSE # MIT License\n└── 📖 README.md # Project documentation\n\n```\n\n## 🚀 Getting Started\n\n### 1. Clone the Repository\n\n```bash\ngit clone https://github.com/dawidolko/DataFusion-App-Python.git\ncd DataFusion-App-Python\n```\n\n### 2. Create Virtual Environment\n\n```bash\n# Create virtual environment\npython -m venv venv\n\n# Activate virtual environment\n# On Linux/macOS:\nsource venv/bin/activate\n\n# On Windows:\nvenv\\Scripts\\activate\n```\n\n### 3. Install Dependencies\n\n```bash\n# Install required packages\npip install -r src/requirements.txt\n```\n\n### 4. 
Start the Application\n\n```bash\n# Run the main application\npython src/main.py\n```\n\n- The GUI application will launch automatically\n\n## ⚙️ System Requirements\n\n### **Essential Tools:**\n\n- **Python** (version 3.8 or higher)\n- **pip** package manager\n- **Virtual environment** (venv or virtualenv)\n- **Git** for version control\n\n### **Development Environment:**\n\n- **Code Editor** (VS Code, PyCharm, Sublime Text)\n- **Python Debugger** for development\n- **Jupyter Notebook** (optional, for data exploration)\n\n### **Required Python Libraries:**\n\n- **PySimpleGUI** - GUI framework\n- **Pandas** - Data manipulation and analysis\n- **NumPy** - Numerical computing\n- **Scikit-learn** - Machine learning algorithms\n- **Matplotlib** - Data visualization\n- **Seaborn** - Statistical data visualization\n- **Scipy** - Scientific computing\n\n### **Recommended Tools:**\n\n- **Git** for version control\n- **Python Linter** (pylint, flake8)\n- **Black** for code formatting\n- **pytest** for testing\n\n## ✨ Key Features\n\n### **🖥️ Interactive GUI Interface**\n\n- Simple and intuitive PySimpleGUI-based interface\n- Perform complex data operations without coding\n- User-friendly menu navigation\n- Real-time operation feedback\n- Progress indicators for long-running tasks\n\n### **📊 Data Extraction and Transformation**\n\n- Load multiple dataset formats (CSV, Excel, JSON)\n- Handle missing data with multiple strategies\n- Data normalization and standardization\n- Encode categorical variables (one-hot, label encoding)\n- Feature engineering and creation\n- Data type conversion and validation\n\n### **📈 Statistical Analysis**\n\n- Calculate descriptive statistics (mean, median, mode, standard deviation)\n- Quartiles and percentiles analysis\n- Correlation matrix generation\n- Distribution analysis and testing\n- Hypothesis testing capabilities\n- Outlier detection and handling\n\n### **🤖 Machine Learning Algorithms**\n\n#### **Classification Models:**\n\n- **Decision 
Trees** - Rule-based classification\n- **k-Nearest Neighbors (k-NN)** - Instance-based learning\n- **Logistic Regression** - Probabilistic classification\n- Model evaluation with accuracy, precision, recall, F1-score\n- Confusion matrix visualization\n\n#### **Clustering:**\n\n- **K-Means Clustering** - Unsupervised grouping\n- Elbow method for optimal cluster selection\n- Cluster visualization and analysis\n- Silhouette score evaluation\n\n#### **Association Rules:**\n\n- **Apriori Algorithm** - Pattern discovery\n- Frequent itemset mining\n- Rule generation with confidence and support\n- Market basket analysis\n\n### **📊 Data Visualization**\n\n- **Histograms** - Distribution visualization\n- **Scatter Plots** - Relationship exploration\n- **Box Plots** - Statistical summary visualization\n- **Heatmaps** - Correlation matrices\n- **Bar Charts** - Categorical data comparison\n- **Line Graphs** - Trend analysis\n- Interactive plot customization\n- Export visualizations to image files\n\n### **🔧 Modular Architecture**\n\n- Clean separation of concerns\n- Easy to maintain and extend\n- Independent module testing\n- Reusable components\n- Well-documented code\n\n### **📚 Educational Focus**\n\n- Ideal for learning data science workflows\n- Real-world dataset examples\n- Complete analysis pipelines\n- Documented best practices\n- Step-by-step guided processes\n\n## 🛠️ Technologies Used\n\n- **Python 3.8+** - Core programming language\n- **PySimpleGUI** - GUI framework for desktop applications\n- **Pandas** - Data manipulation and analysis library\n- **NumPy** - Fundamental package for numerical computing\n- **Scikit-learn** - Machine learning library\n- **Matplotlib** - Comprehensive plotting library\n- **Seaborn** - Statistical data visualization\n- **Scipy** - Scientific computing tools\n\n## 📚 Datasets\n\n### **UCI Adult Income Dataset**\n\nDemographic and employment data for income classification tasks:\n\n- **Purpose:** Predict whether income exceeds $50K/year\n- 
**Features:** Age, workclass, education, occupation, hours per week, etc.\n- **Target:** Binary classification (\u003e50K, \u003c=50K)\n- **Records:** ~48,000 entries\n\n### **UCI Chronic Kidney Disease Dataset**\n\nMedical parameters for diagnosing chronic kidney disease:\n\n- **Purpose:** Binary classification of kidney disease presence\n- **Features:** Blood pressure, specific gravity, albumin, blood glucose, etc.\n- **Target:** CKD or not CKD\n- **Records:** 400 medical cases\n\nBoth datasets are included in the `database/` directory with complete documentation.\n\n## 📖 Usage Guide\n\n### **1. Loading Data**\n\nLaunch the application and select \"Load Dataset\" from the menu. Choose between:\n\n- Adult Income Dataset\n- Chronic Kidney Disease Dataset\n- Custom CSV file\n\n### **2. Data Exploration**\n\nUse the data exploration tools to:\n\n- View dataset summary and statistics\n- Check for missing values\n- Explore data distributions\n- Analyze feature correlations\n\n### **3. Data Preprocessing**\n\nApply preprocessing operations:\n\n- Handle missing values (drop, fill, interpolate)\n- Normalize or standardize features\n- Encode categorical variables\n- Create new features\n\n### **4. Statistical Analysis**\n\nGenerate statistical insights:\n\n- Calculate descriptive statistics\n- Create correlation matrices\n- Perform distribution tests\n- Identify outliers\n\n### **5. Machine Learning**\n\nTrain and evaluate models:\n\n- Select algorithm (Classification/Clustering/Association Rules)\n- Configure model parameters\n- Train on dataset\n- Evaluate performance metrics\n- Visualize results\n\n### **6. 
Visualization**\n\nCreate insightful visualizations:\n\n- Generate various plot types\n- Customize appearance\n- Export to image files\n- Compare multiple features\n\n## 🖼️ Application Screenshot\n\n[\u003cimg src=\"src/assets/screen-app.png\" width=\"80%\" alt=\"DataFusion App Interface\"/\u003e](src/assets/screen-app.png)\n\n## 🤝 Contributing\n\nContributions are welcome! Here's how you can help:\n\n- 🐛 **Report bugs** - Found an issue? Let us know!\n- 💡 **Suggest improvements** - Have ideas for better features?\n- 🔧 **Submit pull requests** - Share your enhancements and solutions\n- 📖 **Improve documentation** - Help make the project clearer\n\nFeel free to open issues or reach out through GitHub for any questions or suggestions.\n\n## 👨‍💻 Author\n\nCreated by **[Dawid Olko](https://github.com/dawidolko)** - Part of the data science and machine learning series.\n\n## 📄 License\n\nThis project is open source and available under the [MIT License](https://opensource.org/licenses/MIT).\n\n---\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdawidolko%2Fdatafusion-app-python","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdawidolko%2Fdatafusion-app-python","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdawidolko%2Fdatafusion-app-python/lists"}