https://github.com/equinor/ml-pitfalls

Material for a short course on pitfalls in machine learning

Host: GitHub
URL: https://github.com/equinor/ml-pitfalls
Owner: equinor
License: cc-by-4.0
Created: 2023-08-10T09:28:06.000Z (almost 3 years ago)
Default Branch: main
Last Pushed: 2024-10-21T11:49:18.000Z (over 1 year ago)
Last Synced: 2024-10-21T17:05:05.767Z (over 1 year ago)
Topics: course-materials, data-science, machine-learning, pitfalls, short-course
Language: Jupyter Notebook
Size: 901 KB
Stars: 2
Watchers: 0
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE

# ml-pitfalls

Notes and notebooks about **Pitfalls in machine learning**, including why they happen, how to recognize them, and how to avoid them.

**Pitfalls** is probably not the best thing to call them, because it makes it sound as if they are scattered about, few and far between, and you'd have to be a bit unlucky to fall into one. But this is not the case.

It also makes it sound as if you'll know when you fall into one. Everything will go dark and you'll twist your ankle or fall on your backside when you land. But this is not true either.

The reality is that machine learning pitfalls are everywhere, and quite large. You will almost certainly fall into them on a regular basis, and you have probably already done so several times. Unfortunately, you won't usually be able to tell whether you're in one, even though everything is completely broken and possibly even on fire.

Basically, I need a better metaphor...

## The big ones

This is important, so let's start with the punchline. They say 'untested code is broken code' and the same applies to machine learning projects:

> Unverified pipelines are broken pipelines.

If you have not thoroughly and critically reviewed and documented your machine learning pipeline, with the eyeballs and experience of others, then your pipeline is probably hiding one or more of the following pathologies:

- Poor project design
- Poor data
- Leakage
- Modeling mistakes (especially underfitting and overfitting)
- Improper evaluation
- Improper application
- Improper deployment
- Insufficient engineering
- Insufficient governance

All of these pathologies can lead to unreliable, unsafe, and unethical models.

If your project suffers from these problems, it is not because you are a bad practitioner of machine learning. It is because machine learning is hard, and impossible to get perfectly right every time.

The reality is that making scientific and engineering recommendations on the basis of machine learning models is new to most of us, in the same way that scientific experimentation was new to most practitioners in the 17th century. While we learn how to get good at it, we need to help each other stay out of these pitfalls.
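
To make one item from the list of pathologies above concrete, here is a minimal sketch of a common form of leakage: computing preprocessing statistics on the whole dataset before splitting it. It is not taken from the course notebooks. The toy numbers and the helpers `fit_scaler` and `standardize` are made up for illustration, and it uses only the Python standard library.

```python
import statistics


def fit_scaler(values):
    """Return the (mean, population standard deviation) of a list of numbers."""
    return statistics.fmean(values), statistics.pstdev(values)


def standardize(values, mean, std):
    """Scale values using statistics that were computed elsewhere."""
    return [(v - mean) / std for v in values]


# Toy data, invented for this sketch: the test rows contain an outlier.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 50.0]

# Leaky: the scaling statistics "see" the test rows before evaluation.
leaky_mean, leaky_std = fit_scaler(train + test)

# Safer: fit on the training rows only, then reuse those statistics for test.
clean_mean, clean_std = fit_scaler(train)
train_scaled = standardize(train, clean_mean, clean_std)
test_scaled = standardize(test, clean_mean, clean_std)

print(f"leaky mean: {leaky_mean:.2f}, train-only mean: {clean_mean:.2f}")
```

The two means differ because the leaky version has absorbed information from rows the model is not supposed to have seen. In a real project the same principle applies to every fitted step (imputation, encoding, feature selection), not just scaling.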

## Approximate plan

| When | What |
|------|-------------------------------------|
| 1000 | Welcome and introduction |
| | A series of unfortunate events |
| | Breakpoint |
| | David Wade: Hard lessons |
| | Finish the examples |
| 1300 | Lunch |
| 1345 | Case studies |
| | Breakpoint |
| | Tooling for _Safety by design_ |
| | Hackathon: Building smoke detectors |

## The notebooks

- [A simple classification](notebooks/A_simple_classification.ipynb)

More examples and playgrounds:

- [Balance_classes_with_SMOTE.ipynb](notebooks/Balance_classes_with_SMOTE.ipynb)
- [Curse_of_dimensionality.ipynb](notebooks/Curse_of_dimensionality.ipynb)
- [Dealing_with_categorical_features.ipynb](notebooks/Dealing_with_categorical_features.ipynb)
- [Effect_of_bad_labels.ipynb](notebooks/Effect_of_bad_labels.ipynb)
- [Encoding_time_features.ipynb](notebooks/Encoding_time_features.ipynb)
- [Scaling_the_target.ipynb](notebooks/Scaling_the_target.ipynb)
- [Splitting_autocorrelated_data.ipynb](notebooks/Splitting_autocorrelated_data.ipynb)
- [Splitting_imbalanced_data.ipynb](notebooks/Splitting_imbalanced_data.ipynb)

And one reproduction of a paper:

- [Reproducing_Haklidir_and_Haklidir](notebooks/Reproducing_Haklidir_and_Haklidir.ipynb) (see video, below).
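
As a taste of the kind of issue the playground notebooks above deal with, here is a small sketch related to encoding time features. It is an assumption about the general topic, not a copy of what the notebook does, and the helper `encode_hour` is hypothetical. The idea is that hour 23 and hour 0 are neighbours on the clock but far apart as raw integers, and a sine/cosine encoding preserves that closeness.

```python
import math


def encode_hour(hour):
    """Map an hour of the day (0-23) to a point on the unit circle."""
    angle = 2 * math.pi * (hour % 24) / 24
    return math.sin(angle), math.cos(angle)


# As raw integers, 23:00 and 00:00 look as far apart as possible.
raw_gap = abs(23 - 0)

# On the circle, 23:00 is as close to 00:00 as 01:00 is.
gap_23_to_0 = math.dist(encode_hour(23), encode_hour(0))
gap_0_to_1 = math.dist(encode_hour(0), encode_hour(1))
gap_0_to_12 = math.dist(encode_hour(0), encode_hour(12))

print(f"raw gap: {raw_gap}")
print(f"encoded gap 23 to 0: {gap_23_to_0:.3f}")
print(f"encoded gap 0 to 12: {gap_0_to_12:.3f}")
```

Whether this encoding helps depends on the model: distance-based and linear models tend to benefit, while tree-based models can often cope with the raw integer.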

## Case studies

- **Geothermal temperature prediction**: https://www.youtube.com/watch?v=-Y0fb23FDzI (Haklidir & Haklidir, 2021)
- **Classifying Chicxulub images**: https://www.youtube.com/watch?v=v0Yuygp-8RQ (Hall, 2020)
- **Rock lithology classification**: https://mcee.ou.edu/aaspi/publications/2019/Pires_de_Lima_et_al_2019-Convolutional_neural_networks_as_an_aid_in_core_lithofacies_classification.pdf (Pires de Lima et al., 2019)

## See also

We maintain a few other repos containing learning material related to machine learning and artificial intelligence.

- [`promptly`](https://github.com/equinor/promptly) for more on prompting and pitfalls in generative AI.
- [`llm-engineering-101`](https://github.com/equinor/llm-engineering-101) for a short workshop aimed at getting developers up to speed.
- [`ai-upskill-events`](https://github.com/equinor/ai-upskill-events) for a repo describing Equinor's company upskill events and materials.
