https://github.com/justmarkham/scikit-learn-tips
:robot::zap: 50 scikit-learn tips
data-school data-science machine-learning python scikit-learn
- Host: GitHub
- URL: https://github.com/justmarkham/scikit-learn-tips
- Owner: justmarkham
- Created: 2020-03-26T13:36:57.000Z (almost 5 years ago)
- Default Branch: master
- Last Pushed: 2022-09-05T14:51:34.000Z (over 2 years ago)
- Last Synced: 2025-01-11T16:05:56.423Z (11 days ago)
- Topics: data-school, data-science, machine-learning, python, scikit-learn
- Language: Jupyter Notebook
- Homepage: https://scikit-learn.tips
- Size: 282 KB
- Stars: 1,724
- Watchers: 118
- Forks: 434
- Open Issues: 1
Metadata Files:
- Readme: README.md
# 🤖⚡ scikit-learn tips
New tips are posted on [LinkedIn](https://www.linkedin.com/in/justmarkham/), [Twitter](https://twitter.com/justmarkham), and [Facebook](https://www.facebook.com/DataScienceSchool/).
👉 [Sign up to receive 2 video tips by email every week!](https://scikit-learn.tips) 👈
## List of all tips
Click to discuss the tip on **LinkedIn**, click to view the **Jupyter notebook** for a tip, or click to watch the tip video on **YouTube:**
\# | Description | Links
--- | --- | ---
1 | Use `ColumnTransformer` to apply different preprocessing to different columns |
2 | Seven ways to select columns using `ColumnTransformer` |
3 | What is the difference between "fit" and "transform"? |
4 | Use "fit_transform" on training data, but "transform" (only) on testing/new data |
5 | Four reasons to use scikit-learn (not pandas) for ML preprocessing |
6 | Encode categorical features using `OneHotEncoder` or `OrdinalEncoder` |
7 | Handle unknown categories with `OneHotEncoder` by encoding them as zeros |
8 | Use `Pipeline` to chain together multiple steps |
9 | Add a missing indicator to encode "missingness" as a feature |
10 | Set a "random_state" to make your code reproducible |
11 | Impute missing values using `KNNImputer` or `IterativeImputer` |
12 | What is the difference between `Pipeline` and `make_pipeline`? |
13 | Examine the intermediate steps in a `Pipeline` |
14 | `HistGradientBoostingClassifier` natively supports missing values |
15 | Three reasons not to use drop='first' with `OneHotEncoder` |
16 | Use `cross_val_score` and `GridSearchCV` on a `Pipeline` |
17 | Try `RandomizedSearchCV` if `GridSearchCV` is taking too long |
18 | Display `GridSearchCV` or `RandomizedSearchCV` results in a DataFrame |
19 | Important tuning parameters for `LogisticRegression` |
20 | Plot a confusion matrix |
21 | Compare multiple ROC curves in a single plot |
22 | Use the correct methods for each type of `Pipeline` |
23 | Display the intercept and coefficients for a linear model |
24 | Visualize a decision tree two different ways |
25 | Prune a decision tree to avoid overfitting |
26 | Use stratified sampling with `train_test_split` |
27 | Two ways to impute missing values for a categorical feature |
28 | Save a model or `Pipeline` using joblib |
29 | Vectorize two text columns in a `ColumnTransformer` |
30 | Four ways to examine the steps of a `Pipeline` |
31 | Shuffle your dataset when using `cross_val_score` |
32 | Use AUC to evaluate multiclass problems |
33 | Use `FunctionTransformer` to convert functions into transformers |
34 | Add feature selection to a `Pipeline` |
35 | Don't use `.values` when passing a pandas object to scikit-learn |
36 | Most parameters should be passed as keyword arguments |
37 | Create an interactive diagram of a `Pipeline` in Jupyter |
38 | Get the feature names output by a `ColumnTransformer` |
39 | Load a toy dataset into a DataFrame |
40 | Estimators only print parameters that have been changed |
41 | Drop the first category from binary features (only) with `OneHotEncoder` |
42 | Passthrough some columns and drop others in a `ColumnTransformer` |
43 | Use `OrdinalEncoder` instead of `OneHotEncoder` with tree-based models |
44 | Speed up `GridSearchCV` using parallel processing |
45 | Create feature interactions using `PolynomialFeatures` |
46 | Ensemble multiple models using `VotingClassifier` or `VotingRegressor` |
47 | Tune the parameters of a `VotingClassifier` or `VotingRegressor` |
48 | Access part of a `Pipeline` using slicing |
49 | Tune multiple models simultaneously with `GridSearchCV` |
50 | Adapt this pattern to solve many Machine Learning problems |

You can interact with all of these notebooks online using **Binder**.

**Note:** Some of the tips do not include any code, and can only be viewed on LinkedIn.
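
Several of the tips above (1, 7, 8, 10, 12, and 16) combine naturally into one workflow. The following is a minimal sketch of that combination, not code taken from the tip notebooks. The toy DataFrame, its column names, and the choice of `LogisticRegression` are assumptions made purely for illustration.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (made up for this sketch, not from the tips)
df = pd.DataFrame({
    "embarked": ["S", "C", "S", "Q", "S", "C", "S", "Q", "C", "S"],
    "age": [22.0, 38.0, None, 35.0, 54.0, None, 27.0, 14.0, 58.0, 20.0],
    "fare": [7.25, 71.28, 7.92, 53.10, 51.86, 8.46, 11.13, 30.07, 26.55, 8.05],
    "survived": [0, 1, 1, 1, 0, 0, 1, 1, 0, 0],
})
X = df[["embarked", "age", "fare"]]
y = df["survived"]

# Tip 1: apply different preprocessing to different columns.
# Tip 7: handle_unknown="ignore" encodes unseen categories as all zeros.
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["embarked"]),
    (SimpleImputer(strategy="median"), ["age"]),
    remainder="passthrough",  # "fare" is passed through unchanged
)

# Tips 8 and 12: chain preprocessing and the model; make_pipeline names the steps.
# Tip 10: set random_state for reproducibility.
pipe = make_pipeline(
    preprocessor,
    LogisticRegression(random_state=0, max_iter=1000),
)

# Tip 16: cross-validate the whole Pipeline so preprocessing is
# re-fit on the training portion of each fold.
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
print(scores.mean())

# Fit on all of the data, then predict for a new row that has an
# unseen category and a missing age.
pipe.fit(X, y)
X_new = pd.DataFrame({"embarked": ["Z"], "age": [float("nan")], "fare": [10.0]})
pred = pipe.predict(X_new)
print(pred)
```

Because the preprocessing lives inside the `Pipeline`, calling `predict` on new data only transforms it with what was learned during `fit`, in line with tip 4.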
## Who creates these tips?
Hi! I'm Kevin Markham, the founder of [Data School](https://www.dataschool.io). I've been teaching data science in Python since 2014. I create these tips because I love using scikit-learn and I want to help others use it more effectively.
## How can I get better at scikit-learn?
I teach three courses:
- **Course 1:** [Introduction to Machine Learning in Python with scikit-learn](https://courses.dataschool.io/introduction-to-machine-learning-with-scikit-learn) (4 hours, free)
- **Course 2:** [Building an Effective Machine Learning Workflow with scikit-learn](https://courses.dataschool.io/building-an-effective-machine-learning-workflow-with-scikit-learn) (8 hours, paid)
- **Course 3:** [Machine Learning with Text in Python](https://www.dataschool.io/learn/) (14 hours, paid)

👉 [Find out which course is right for you!](https://www.dataschool.io/ml-courses/) 👈
## Do you have any other tips?
Yes! In 2019, I posted [100 pandas tricks](https://www.dataschool.io/python-pandas-tips-and-tricks/). I also created a video featuring my [top 25 pandas tricks](https://www.dataschool.io/python-pandas-tricks/).
*© 2020-2021 [Data School](https://www.dataschool.io). All rights reserved.*