Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dmarks84/ind_project_data-science-london-scikit-learn--kaggle
Independent Project - Kaggle Competition -- I worked on the Data Science London data set for the Data Science London + Scikit-learn competition.
- Host: GitHub
- URL: https://github.com/dmarks84/ind_project_data-science-london-scikit-learn--kaggle
- Owner: dmarks84
- License: bsd-3-clause
- Created: 2024-01-26T21:02:46.000Z (9 months ago)
- Default Branch: main
- Last Pushed: 2024-02-27T22:43:14.000Z (8 months ago)
- Last Synced: 2024-02-27T23:37:10.962Z (8 months ago)
- Topics: classification, cross-validation, data-modeling, data-reporting, data-visualization, dataframes, eda, grid-search, matplotlib, numpy, pandas, python, sklearn, statistics, supervised-ml
- Language: Jupyter Notebook
- Homepage: https://www.kaggle.com/code/danpmarks/dm84-data-science-london-sklearn
- Size: 3.84 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## Ind_Project_Data-Science-London-Scikit-learn--Kaggle
## Screenshot
![screenshot](https://github.com/dmarks84/Ind_Project_Data-Science-London-Scikit-learn--Kaggle/blob/main/london_screenshot.png?raw=true)

## Summary
I worked on the Data Science London data set for the Data Science London + Scikit-learn competition. The dataset provides 40 features used to predict a binary class label. I performed basic exploratory data analysis before building a set of pipelines. Each pipeline combined one of three scalers (StandardScaler, MinMaxScaler, or MaxAbsScaler) with Principal Component Analysis followed by a classifier model/estimator. I used GridSearchCV to find the best hyperparameters for each model type, then compared the tuned models on a validation set held out from the provided training data.

## Results
### Model Selected
There was some variation from run to run, as several models produced very similar results. For the submitted run, the best model was a Multilayer Perceptron Classifier with default parameters, preceded by MinMaxScaler and PCA with 40 components. On other runs, a Support Vector Classifier with C equal to 10 and gamma equal to 1 came out on top, likewise using MinMaxScaler and 40-component PCA.

### Scores
The best score on the training set was as high as 1.0; most models scored around 0.8, with a few above 0.9. On the validation set, the best scores fell between 0.89 and 0.90. Not fantastic, but not bad in either case.

## Skills (Developed & Applied)
Programming, Python, Statistics, Numpy, Pandas, Matplotlib, Scikit-learn, Dataframes, Data Modeling, EDA, Data Visualization, Data Reporting, Classification, Supervised ML, Cross Validation, Grid Search
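The scaler + PCA + classifier pipelines with GridSearchCV described in the Summary can be sketched as follows. This is a minimal illustration, not the notebook's exact code: the Kaggle data files are not included here, so `make_classification` stands in for the London dataset (40 features, binary target), and the hyperparameter grids are small placeholder examples.

```python
# Sketch of the pipeline + grid-search approach: for each scaler and
# classifier, tune hyperparameters with GridSearchCV, then compare the
# tuned models on a held-out validation split.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Stand-in for the Data Science London data: 40 features, binary label.
X, y = make_classification(n_samples=1000, n_features=40, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

results = {}
for scaler in (StandardScaler(), MinMaxScaler(), MaxAbsScaler()):
    for clf_name, clf, grid in [
        # Placeholder grid; the notebook's actual search space may differ.
        ("svc", SVC(), {"clf__C": [1, 10], "clf__gamma": [0.1, 1]}),
        # Empty grid = default parameters, matching the selected MLP.
        ("mlp", MLPClassifier(max_iter=500, random_state=0), {}),
    ]:
        pipe = Pipeline([
            ("scale", scaler),
            ("pca", PCA(n_components=40)),
            ("clf", clf),
        ])
        search = GridSearchCV(pipe, grid, cv=5)
        search.fit(X_train, y_train)
        key = (type(scaler).__name__, clf_name)
        results[key] = (search.best_score_, search.score(X_val, y_val))

# Pick the combination with the best validation accuracy.
best = max(results, key=lambda k: results[k][1])
print(best, results[best])
```

The key design point is that the scaler and PCA live inside the `Pipeline`, so GridSearchCV refits them on each cross-validation fold and the validation split never leaks into preprocessing.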