{"id":19835075,"url":"https://github.com/equinor/ml-pitfalls","last_synced_at":"2025-05-01T17:32:41.610Z","repository":{"id":187438885,"uuid":"676905516","full_name":"equinor/ml-pitfalls","owner":"equinor","description":"Material for a short course on pitfalls in machine learning","archived":false,"fork":false,"pushed_at":"2024-10-21T11:49:18.000Z","size":923,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":0,"default_branch":"main","last_synced_at":"2024-10-21T17:05:05.767Z","etag":null,"topics":["course-materials","data-science","machine-learning","pitfalls","short-course"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"cc-by-4.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/equinor.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-08-10T09:28:06.000Z","updated_at":"2024-10-21T11:49:22.000Z","dependencies_parsed_at":"2024-10-25T17:18:07.626Z","dependency_job_id":"af5c682c-22b0-43fc-9d8a-658c8f82188d","html_url":"https://github.com/equinor/ml-pitfalls","commit_stats":null,"previous_names":["equinor/ml-pitfalls"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/equinor%2Fml-pitfalls","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/equinor%2Fml-pitfalls/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/equinor%2Fml-pitfalls/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/equinor%2Fml-pitfalls/manifests","o
wner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/equinor","download_url":"https://codeload.github.com/equinor/ml-pitfalls/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":224270290,"owners_count":17283649,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["course-materials","data-science","machine-learning","pitfalls","short-course"],"created_at":"2024-11-12T12:06:30.764Z","updated_at":"2025-05-01T17:32:41.603Z","avatar_url":"https://github.com/equinor.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ml-pitfalls\n\nNotes and notebooks about **Pitfalls in machine learning**, including why they happen, how to recognize them, and how to avoid them.\n\n**Pitfalls** is probably not the best thing to call them, because it makes it sound as if they are scattered about, few and far between, and you'd have to be a bit unlucky to fall into one. But this is not the case.\n\nIt also makes it sound as if you'll know when you fall into one. Everything will go dark and you'll twist your ankle or fall on your backside when you land. But this is not true either.\n\nThe reality is that machine learning pitfalls are everywhere, and quite large. And you will almost certainly fall into them on a regular basis. For sure you have already fallen into them, probably several times. 
And unfortunately, you usually won't be able to tell if you're in one or not \u0026mdash; even though everything is completely broken and possibly even on fire.\n\nBasically, I need a better metaphor...\n\n\n## The big ones\n\nThis is important, so let's start with the punchline. They say 'untested code is broken code' and the same applies to machine learning projects:\n\n\u003e Unverified pipelines are broken pipelines.\n\nIf you have not thoroughly and critically reviewed and documented your machine learning pipeline, with the eyeballs and experience of others, then your pipeline is probably hiding one or more of the following pathologies:\n\n- Poor project design\n- Poor data\n- Leakage\n- Modeling mistakes (especially underfitting, overfitting)\n- Improper evaluation\n- Improper application\n- Improper deployment\n- Insufficient engineering\n- Insufficient governance\n\nAll of these pathologies can lead to unreliable, unsafe, and unethical models.\n\nYour project suffers from these problems not because you are a bad practitioner of machine learning, but because machine learning is hard and impossible to get perfectly right every time.\n\nThe reality is that making scientific and engineering recommendations on the basis of machine learning models is new to most of us, in the same way that scientific experimentation was new to most practitioners in the 17th century. 
While we learn how to get good at it, we need to help each other stay out of these pitfalls.\n\n\n## Approximate plan\n\n| When | What                                |\n|------|-------------------------------------|\n| 1000 | Welcome and introduction            |\n|      | A series of unfortunate events      |\n|      | Breakpoint                          |\n|      | David Wade: Hard lessons            |\n|      | Finish the examples                 |\n| 1300 | Lunch                               |\n| 1345 | Case studies                        |\n|      | Breakpoint                          |\n|      | Tooling for _Safety by design_      |\n|      | Hackathon: Building smoke detectors |  \n\n\n## The notebooks\n\n- [A simple classification](notebooks/A_simple_classification.ipynb)\n\nMore examples and playgrounds:\n\n- [Balance_classes_with_SMOTE.ipynb](notebooks/Balance_classes_with_SMOTE.ipynb)\n- [Curse_of_dimensionality.ipynb](notebooks/Curse_of_dimensionality.ipynb)\n- [Dealing_with_categorical_features.ipynb](notebooks/Dealing_with_categorical_features.ipynb)\n- [Effect_of_bad_labels.ipynb](notebooks/Effect_of_bad_labels.ipynb)\n- [Encoding_time_features.ipynb](notebooks/Encoding_time_features.ipynb)\n- [Scaling_the_target.ipynb](notebooks/Scaling_the_target.ipynb)\n- [Splitting_autocorrelated_data.ipynb](notebooks/Splitting_autocorrelated_data.ipynb)\n- [Splitting_imbalanced_data.ipynb](notebooks/Splitting_imbalanced_data.ipynb)\n\nAnd one reproduction of a paper:\n\n- [Reproducing_Haklidir_and_Haklidir](notebooks/Reproducing_Haklidir_and_Haklidir.ipynb) (see video, below).\n\n## Case studies\n\n- **Geothermal temperature prediction** \u0026mdash; https://www.youtube.com/watch?v=-Y0fb23FDzI (Haklidir \u0026 Haklidir 2021)\n- **Classifying Chicxulub images** \u0026mdash; https://www.youtube.com/watch?v=v0Yuygp-8RQ (Hall, 2020)\n- **Rock lithology classification** \u0026mdash; 
https://mcee.ou.edu/aaspi/publications/2019/Pires_de_Lima_et_al_2019-Convolutional_neural_networks_as_an_aid_in_core_lithofacies_classification.pdf (de Lima et al, 2019)\n\n## See also\n\nWe maintain a few other repos containing learning material related to machine learning and artificial intelligence.\n\n- [`promptly`](https://github.com/equinor/promptly) for more on prompting and pitfalls in generative AI.\n- [`llm-engineering-101`](https://github.com/equinor/llm-engineering-101) for a short workshop aimed at getting developers up to speed.\n- [`ai-upskill-events`](https://github.com/equinor/ai-upskill-events) for a repo describing Equinor's company upskill events and materials.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fequinor%2Fml-pitfalls","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fequinor%2Fml-pitfalls","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fequinor%2Fml-pitfalls/lists"}