https://github.com/justmarkham/scikit-learn-tips
:robot::zap: 50 scikit-learn tips
- Host: GitHub
- URL: https://github.com/justmarkham/scikit-learn-tips
- Owner: justmarkham
- Created: 2020-03-26T13:36:57.000Z (about 5 years ago)
- Default Branch: master
- Last Pushed: 2022-09-05T14:51:34.000Z (almost 3 years ago)
- Last Synced: 2025-04-01T05:37:26.903Z (2 months ago)
- Topics: data-school, data-science, machine-learning, python, scikit-learn
- Language: Jupyter Notebook
- Homepage: https://scikit-learn.tips
- Size: 282 KB
- Stars: 1,729
- Watchers: 117
- Forks: 435
- Open Issues: 1
Metadata Files:
- Readme: README.md
# 🤖⚡ scikit-learn tips
New tips are posted on [LinkedIn](https://www.linkedin.com/in/justmarkham/), [Twitter](https://twitter.com/justmarkham), and [Facebook](https://www.facebook.com/DataScienceSchool/).
👉 [Sign up to receive 2 video tips by email every week!](https://scikit-learn.tips) 👈
## List of all tips
Each tip can be discussed on **LinkedIn**, and most tips also have a **Jupyter notebook** and a video on **YouTube**. The links for each tip are in the [GitHub repository](https://github.com/justmarkham/scikit-learn-tips).
\# | Description
--- | ---
1 | Use `ColumnTransformer` to apply different preprocessing to different columns
2 | Seven ways to select columns using `ColumnTransformer`
3 | What is the difference between "fit" and "transform"?
4 | Use "fit_transform" on training data, but "transform" (only) on testing/new data
5 | Four reasons to use scikit-learn (not pandas) for ML preprocessing
6 | Encode categorical features using `OneHotEncoder` or `OrdinalEncoder`
7 | Handle unknown categories with `OneHotEncoder` by encoding them as zeros
8 | Use `Pipeline` to chain together multiple steps
9 | Add a missing indicator to encode "missingness" as a feature
10 | Set a "random_state" to make your code reproducible
11 | Impute missing values using `KNNImputer` or `IterativeImputer`
12 | What is the difference between `Pipeline` and `make_pipeline`?
13 | Examine the intermediate steps in a `Pipeline`
14 | `HistGradientBoostingClassifier` natively supports missing values
15 | Three reasons not to use drop='first' with `OneHotEncoder`
16 | Use `cross_val_score` and `GridSearchCV` on a `Pipeline`
17 | Try `RandomizedSearchCV` if `GridSearchCV` is taking too long
18 | Display `GridSearchCV` or `RandomizedSearchCV` results in a DataFrame
19 | Important tuning parameters for `LogisticRegression`
20 | Plot a confusion matrix
21 | Compare multiple ROC curves in a single plot
22 | Use the correct methods for each type of `Pipeline`
23 | Display the intercept and coefficients for a linear model
24 | Visualize a decision tree two different ways
25 | Prune a decision tree to avoid overfitting
26 | Use stratified sampling with `train_test_split`
27 | Two ways to impute missing values for a categorical feature
28 | Save a model or `Pipeline` using joblib
29 | Vectorize two text columns in a `ColumnTransformer`
30 | Four ways to examine the steps of a `Pipeline`
31 | Shuffle your dataset when using `cross_val_score`
32 | Use AUC to evaluate multiclass problems
33 | Use `FunctionTransformer` to convert functions into transformers
34 | Add feature selection to a `Pipeline`
35 | Don't use `.values` when passing a pandas object to scikit-learn
36 | Most parameters should be passed as keyword arguments
37 | Create an interactive diagram of a `Pipeline` in Jupyter
38 | Get the feature names output by a `ColumnTransformer`
39 | Load a toy dataset into a DataFrame
40 | Estimators only print parameters that have been changed
41 | Drop the first category from binary features (only) with `OneHotEncoder`
42 | Passthrough some columns and drop others in a `ColumnTransformer`
43 | Use `OrdinalEncoder` instead of `OneHotEncoder` with tree-based models
44 | Speed up `GridSearchCV` using parallel processing
45 | Create feature interactions using `PolynomialFeatures`
46 | Ensemble multiple models using `VotingClassifier` or `VotingRegressor`
47 | Tune the parameters of a `VotingClassifier` or `VotingRegressor`
48 | Access part of a `Pipeline` using slicing
49 | Tune multiple models simultaneously with `GridSearchCV`
50 | Adapt this pattern to solve many Machine Learning problems

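
As a rough illustration of how several of the tips above fit together (tips 1, 7, 8, 16, and 18), here is a minimal sketch. It is not taken from the tip notebooks: the toy DataFrame, its column names, and the step names `preprocess` and `model` are all invented for this example.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, invented for this illustration only
df = pd.DataFrame({
    "age": [22.0, 38.0, None, 35.0, 54.0, 2.0, 27.0, 14.0, 58.0, 20.0],
    "embarked": ["S", "C", "S", "S", "Q", "S", "C", "C", "S", "Q"],
    "survived": [0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
})
X = df[["age", "embarked"]]
y = df["survived"]

# Tip 1: different preprocessing for different columns.
# Tip 7: handle_unknown="ignore" encodes unseen categories as all zeros.
preprocessor = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["embarked"]),
])

# Tip 8: chain preprocessing and the model into a single Pipeline
pipe = Pipeline([
    ("preprocess", preprocessor),
    ("model", LogisticRegression(max_iter=1000, random_state=0)),
])

# Tip 16: cross-validate the whole Pipeline, so preprocessing is
# re-fit on the training portion of every fold
scores = cross_val_score(pipe, X, y, cv=2, scoring="accuracy")

# Tip 16: tune a step's parameter using the "<step name>__<parameter>" syntax
grid = GridSearchCV(pipe, {"model__C": [0.1, 1.0, 10.0]}, cv=2, scoring="accuracy")
grid.fit(X, y)

# Tip 18: view the search results as a DataFrame
results = pd.DataFrame(grid.cv_results_)[
    ["param_model__C", "mean_test_score", "rank_test_score"]
]

# New data with a category ("Z") that was never seen during fitting
pipe.fit(X, y)
X_new = pd.DataFrame({"age": [30.0], "embarked": ["Z"]})
encoded_new = pipe.named_steps["preprocess"].transform(X_new)  # one-hot columns are all zeros
pred = pipe.predict(X_new)
```

Because the fitted `Pipeline` carries its own preprocessing, new data only needs `predict`; the imputer and encoder are applied with the values learned from the training data rather than being re-fit.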
You can interact with all of these notebooks online using **Binder**.
**Note:** Some of the tips do not include any code, and can only be viewed on LinkedIn.
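
For readers who want to try a few of the listed ideas without opening a notebook, the following is a small sketch of tips 10, 12, 26, 28, and 39 combined. It is an illustration written for this page, not code from the tips themselves; the choice of the iris dataset, the `test_size`, and the file name `pipe.joblib` are assumptions made for the example.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Tip 39: load a toy dataset as pandas objects
X, y = load_iris(return_X_y=True, as_frame=True)

# Tip 26: stratify=y keeps the class proportions equal in both splits.
# Tip 10: random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

# Tip 12: make_pipeline names the steps automatically
# ("standardscaler" and "logisticregression")
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X_train, y_train)

# Tip 28: save the fitted Pipeline with joblib, then load it back.
# A temporary directory is used here so the example leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipe.joblib")  # hypothetical file name
    joblib.dump(pipe, path)
    loaded = joblib.load(path)

# The loaded Pipeline should reproduce the original predictions
same_predictions = bool((loaded.predict(X_test) == pipe.predict(X_test)).all())
```

A saved model is generally only reliable when loaded with the same scikit-learn version that created it, so it is worth recording the version alongside the file.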
## Who creates these tips?
Hi! I'm Kevin Markham, the founder of [Data School](https://www.dataschool.io). I've been teaching data science in Python since 2014. I create these tips because I love using scikit-learn and I want to help others use it more effectively.
## How can I get better at scikit-learn?
I teach three courses:
- **Course 1:** [Introduction to Machine Learning in Python with scikit-learn](https://courses.dataschool.io/introduction-to-machine-learning-with-scikit-learn) (4 hours, free)
- **Course 2:** [Building an Effective Machine Learning Workflow with scikit-learn](https://courses.dataschool.io/building-an-effective-machine-learning-workflow-with-scikit-learn) (8 hours, paid)
- **Course 3:** [Machine Learning with Text in Python](https://www.dataschool.io/learn/) (14 hours, paid)

👉 [Find out which course is right for you!](https://www.dataschool.io/ml-courses/) 👈
## Do you have any other tips?
Yes! In 2019, I posted [100 pandas tricks](https://www.dataschool.io/python-pandas-tips-and-tricks/). I also created a video featuring my [top 25 pandas tricks](https://www.dataschool.io/python-pandas-tricks/).
*© 2020-2021 [Data School](https://www.dataschool.io). All rights reserved.*