Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/dmarks84/ind_project_data-science-london-scikit-learn--kaggle
Independent Project - Kaggle Competition -- I worked on the Data Science London data set for the Data Science London + Scikit-learn competition.
classification cross-validation data-modeling data-reporting data-visualization dataframes eda grid-search matplotlib numpy pandas python sklearn statistics supervised-ml
- Host: GitHub
- URL: https://github.com/dmarks84/ind_project_data-science-london-scikit-learn--kaggle
- Owner: dmarks84
- License: bsd-3-clause
- Created: 2024-01-26T21:02:46.000Z (11 months ago)
- Default Branch: main
- Last Pushed: 2024-02-27T22:43:14.000Z (10 months ago)
- Last Synced: 2024-11-05T19:22:26.737Z (about 2 months ago)
- Topics: classification, cross-validation, data-modeling, data-reporting, data-visualization, dataframes, eda, grid-search, matplotlib, numpy, pandas, python, sklearn, statistics, supervised-ml
- Language: Jupyter Notebook
- Homepage: https://www.kaggle.com/code/danpmarks/dm84-data-science-london-sklearn
- Size: 3.84 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
README
## Ind_Project_Data-Science-London-Scikit-learn--Kaggle
## Screenshot
![screenshot](https://github.com/dmarks84/Ind_Project_Data-Science-London-Scikit-learn--Kaggle/blob/main/london_screenshot.png?raw=true)
## Summary
I worked on the Data Science London data set for the Data Science London + Scikit-learn competition. The dataset provided 40 features, which I used to predict a binary class label. I performed basic exploratory data analysis before building a set of pipelines. Each pipeline combined one of three scalers (StandardScaler, MinMaxScaler, or MaxAbsScaler) with Principal Component Analysis and then a classifier model/estimator. I used GridSearchCV to determine the best parameters for each model type, then compared the tuned models on validation data (a subset of the training data provided) to pick the best one.
## Results
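As a rough illustration of the scaler, PCA, and classifier pipeline with grid search described in the Summary, here is a minimal sketch. It is not the notebook's actual code: the data is a synthetic stand-in generated with `make_classification`, and the grid values, the 25% validation split, and the choice of `SVC` as the only classifier are assumptions made to keep the example short.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MaxAbsScaler, MinMaxScaler, StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the competition data (assumption):
# 40 features and a binary target, as in the summary.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=10, random_state=0
)

# Hold out a subset of the training data for validation.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaler -> PCA -> classifier. The scaler step is a placeholder
# that the grid below swaps out.
pipe = Pipeline([
    ("scaler", StandardScaler()),
    ("pca", PCA()),
    ("clf", SVC()),
])

# Illustrative grid (assumed values, not the author's exact grid).
param_grid = {
    "scaler": [StandardScaler(), MinMaxScaler(), MaxAbsScaler()],
    "pca__n_components": [20, 40],
    "clf__C": [1, 10],
    "clf__gamma": [0.1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X_train, y_train)

best_params = search.best_params_
val_accuracy = search.score(X_val, y_val)  # accuracy of the refit best pipeline
print(best_params)
print(f"validation accuracy: {val_accuracy:.3f}")
```

Treating the scaler itself as a grid parameter lets a single `GridSearchCV` cover all three scalers. Repeating the search with a different estimator in the `clf` step would cover the other model types.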
### Model Selected
There was some variation between runs, as several models produced very similar results. For the submission, the best version was a Multilayer Perceptron Classifier with default parameters, preceded by MinMaxScaler and PCA with 40 components. On other runs, a Support Vector Classifier with C equal to 10 and gamma equal to 1 came out on top. That version also used MinMaxScaler and PCA with 40 components.
### Scores
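A minimal sketch of how the two configurations named above could be rebuilt and compared on training and validation accuracy. The data here is synthetic (an assumption, generated with `make_classification`), so the printed numbers will not match the competition scores. The `random_state` on the MLP is also an addition for reproducibility; the README only says default parameters were used.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 40-feature binary dataset (assumption).
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=10, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Both candidates: MinMaxScaler, then PCA keeping all 40 components,
# then the classifier.
candidates = {
    "mlp": make_pipeline(
        MinMaxScaler(), PCA(n_components=40), MLPClassifier(random_state=0)
    ),
    "svc": make_pipeline(
        MinMaxScaler(), PCA(n_components=40), SVC(C=10, gamma=1)
    ),
}

# Record (training accuracy, validation accuracy) for each candidate.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = (model.score(X_train, y_train), model.score(X_val, y_val))
    print(f"{name}: train={scores[name][0]:.3f}, val={scores[name][1]:.3f}")
```

The MLP may emit a convergence warning with its default iteration limit; that is a warning rather than an error, and the model is still fitted and scored.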
On the training set, the best score reached 1.0; most models scored about 0.8, with a few above 0.9. On the validation set, the best scores were between 0.89 and 0.90. Not fantastic, but not bad in either case.
## Skills (Developed & Applied)
Programming, Python, Statistics, NumPy, Pandas, Matplotlib, Scikit-learn, Dataframes, Data Modeling, EDA, Data Visualization, Data Reporting, Classification, Supervised ML, Cross Validation, Grid Search