https://github.com/allenleizhao/model_validation_strategy_comparison
A comparison of cross-validation techniques and classification models using synthetic data. Includes evaluation of Logistic Regression, Random Forest, and SVM with K-Fold and Repeated K-Fold methods.
- Host: GitHub
- URL: https://github.com/allenleizhao/model_validation_strategy_comparison
- Owner: AllenLeiZhao
- License: mit
- Created: 2025-06-06T00:20:07.000Z (about 1 year ago)
- Default Branch: main
- Last Pushed: 2025-06-06T00:30:13.000Z (about 1 year ago)
- Last Synced: 2025-06-22T10:03:26.944Z (about 1 year ago)
- Topics: classification, cross-validation, data-science, machine-learning, model-evaluation, python, sklearn
- Language: Jupyter Notebook
- Homepage:
- Size: 58.6 KB
- Stars: 1
- Watchers: 0
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Repeated K-Fold Evaluation of Classifiers with Synthetic Data
This project investigates **Repeated K-Fold Cross-Validation** on a synthetic classification dataset. By applying several machine learning models and systematically increasing the number of repeats, we analyze how the stability of the accuracy estimate improves with more validation cycles.
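
The repeat sweep described above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the dataset parameters, the repeat counts `(1, 5, 10, 15)`, the random seed, and the helper name `evaluate` are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Assumed dataset settings; the notebook's real parameters are not stated in this README.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10, random_state=1
)

model = LogisticRegression(max_iter=1000)


def evaluate(n_repeats, n_splits=10, seed=1):
    """Hypothetical helper: mean accuracy and SEM over every fold of every repeat."""
    cv = RepeatedKFold(n_splits=n_splits, n_repeats=n_repeats, random_state=seed)
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
    # SEM = sample standard deviation / sqrt(number of scores).
    # Fold scores share training data, so treat this as a rough indicator only.
    sem = scores.std(ddof=1) / np.sqrt(len(scores))
    return scores.mean(), sem


results = {r: evaluate(r) for r in (1, 5, 10, 15)}
for r, (mean, sem) in results.items():
    print(f"repeats={r:2d}  mean accuracy={mean:.3f}  SEM={sem:.4f}")
```

With `n_splits=10`, each repeat adds ten more fold scores, so the SEM is computed over 10, 50, 100, and 150 scores respectively.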
---
## Project Highlights
- Generated synthetic data using `make_classification` from `sklearn.datasets`
- Compared model performance under standard K-Fold and Repeated K-Fold settings
- Evaluated multiple classifiers: Logistic Regression, Random Forest, and SVM (RBF Kernel)
- Visualized accuracy trends across repeat counts using boxplots
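
A boxplot of fold accuracies per repeat count, as mentioned in the last highlight, might be produced roughly like this. The dataset settings, repeat counts, and axis labels are assumptions chosen for illustration, and the plots in this repository may look different.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display

import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Assumed dataset settings, not taken from the notebook.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10, random_state=1
)
model = LogisticRegression(max_iter=1000)

repeat_counts = [1, 5, 10, 15]  # assumed sweep
all_scores = []
for r in repeat_counts:
    cv = RepeatedKFold(n_splits=10, n_repeats=r, random_state=1)
    all_scores.append(cross_val_score(model, X, y, scoring="accuracy", cv=cv))

fig, ax = plt.subplots(figsize=(6, 4))
ax.boxplot(all_scores, showmeans=True)  # one box per repeat count
ax.set_xticklabels([str(r) for r in repeat_counts])
ax.set_xlabel("Number of repeats")
ax.set_ylabel("Fold accuracy")
ax.set_title("Accuracy distribution by repeat count")
```

In a notebook, drop the `matplotlib.use("Agg")` line and call `plt.show()`; in a script, `fig.savefig(...)` writes the figure to a file of your choice.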
---
## Visualizations
Key visual output: boxplots of accuracy across repeat counts (stored in `/assets/`).
---
## Techniques Used
- Synthetic dataset creation (`sklearn.datasets.make_classification`)
- Cross-validation strategies: `KFold`, `RepeatedKFold`, and `cross_val_score`
- Classification Models: Logistic Regression, Random Forest, SVM (RBF)
- Feature scaling with `StandardScaler`
- Model evaluation: Accuracy, Standard Error of the Mean (SEM)
- Visualization: `matplotlib.pyplot`
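
The techniques listed above can be combined as in the sketch below. It assumes the scaler is wrapped in a pipeline with the scale-sensitive models (so it is fit on the training folds only) and uses made-up dataset settings and hyperparameters. The notebook may wire these pieces together differently.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumed dataset settings, not taken from the notebook.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=10, random_state=1
)

# Hyperparameters here are illustrative defaults, not the author's choices.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=1),
    "SVM (RBF)": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Standard (single) 10-fold split with shuffling.
cv = KFold(n_splits=10, shuffle=True, random_state=1)

mean_accuracy = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, scoring="accuracy", cv=cv)
    mean_accuracy[name] = scores.mean()
    print(f"{name:20s} mean accuracy = {scores.mean():.3f}")
```

Swapping `KFold` for `RepeatedKFold(n_splits=10, n_repeats=...)` reuses the same loop for the repeated setting. The accuracies printed here depend on the assumed dataset and will not match the figures reported in Key Findings.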
---
## Files
- `/code/`: Jupyter notebook (`.ipynb`) containing all experiments
- `/assets/`: Plots
- `README.md`: You are here
---
## Key Findings
- Logistic Regression accuracy with a single 10-fold run: **~86.8%**
- Repeating the K-Fold procedure up to 15 times noticeably smooths fluctuations in the accuracy estimate
- Random Forest and SVM (RBF kernel) outperform the linear model (Logistic Regression):
  - Random Forest accuracy: **~92.1%**
  - SVM (RBF) accuracy: **~96.5%**
- The SVM result suggests non-linear relationships in the data
---
## About Me
I'm currently pursuing a Master's in Analytics with hands-on experience in machine learning and data visualization. My projects combine technical depth with practical interpretation using tools like **Python**, **R**, **Tableau**, and **Looker Studio**.
---
## Contact
Feel free to connect via [LinkedIn](https://www.linkedin.com/in/allen-lei-zhao/) or reach out via email: `allen.lei.zhao@gmail.com`.