https://github.com/glemaitre/talks

Host: GitHub
URL: https://github.com/glemaitre/talks
Owner: glemaitre
Created: 2017-06-11T10:03:51.000Z (over 7 years ago)
Default Branch: master
Last Pushed: 2024-12-05T11:22:09.000Z (about 1 month ago)
Last Synced: 2024-12-05T12:25:21.630Z (about 1 month ago)
Language: Jupyter Notebook
Size: 40.5 MB
Stars: 1
Watchers: 3
Forks: 1
Open Issues: 1
Metadata Files:
- Readme: README.md

# TRACES winter school 2024

## Introduction to machine learning in Python

*Tutorial*: This tutorial introduces how to use scikit-learn to craft predictive models
using machine learning. It covers the basics of machine learning, in particular model
evaluation, gives insights into linear models and tree-based models, discusses
hyperparameter tuning, and finally touches on confidence interval prediction.

  Tutorials repository

  Static course

# Sample space podcast 2024

## Imbalanced-learn: regrets and onwards

*Abstract*: Imbalanced-learn is one of the most popular `scikit-learn` projects

out there. It has support for resampling techniques which historically have

always been used for imbalanced classification use-cases. However, now that we

are a few years down the line, it may be time to start rethinking the library.

As it turns out, other techniques may be preferable.

  Videos

# Practical AI podcast 2024

## scikit-learn & data science you own

*Abstract*: We are at GenAI saturation, so let’s talk about `scikit-learn`, a long time

favorite for data scientists building classifiers, time series analyzers, dimensionality

reducers, and more! `Scikit-learn` is deployed across industry and driving a significant

portion of the “AI” that is actually in production. :probabl is a new kind of company

that is stewarding this project along with a variety of other open source projects. Yann

Lechelle and Guillaume Lemaitre share some of the vision behind the company and talk

about the future of `scikit-learn`!

  Podcast

# ENGIE 2024

## scikit-learn: community insights & latest features

*Abstract*: Insights regarding the `scikit-learn` community and some new features

available in the latest versions.

  Slides

# PyData Paris 2024

## An update on the latest scikit-learn features

*Abstract*: In this talk, we provide an update on the latest `scikit-learn`

features that have been implemented in versions 1.4 and 1.5. We will

particularly discuss the following features:

- the metadata routing API, which allows metadata to be passed around estimators;
- the `TunedThresholdClassifierCV`, which allows the operational decision threshold to be
  tuned using a custom metric;
- better support for categorical features and missing values;
- interoperability of arrays and dataframes.
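
To make the threshold-tuning item concrete, here is a minimal pure-Python sketch of what tuning a decision threshold against a custom metric means. It is not the scikit-learn implementation: the labels, the scores, and the helper names `balanced_accuracy` and `tune_threshold` are made up for illustration, and `TunedThresholdClassifierCV` additionally relies on cross-validation to obtain the scores it tunes on.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on the positive class and on the negative class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def tune_threshold(y_true, scores, metric, n_thresholds=101):
    """Return the threshold on a regular grid in [0, 1] that maximizes `metric`."""
    best_threshold, best_score = 0.5, float("-inf")
    for i in range(n_thresholds):
        threshold = i / (n_thresholds - 1)
        y_pred = [1 if s >= threshold else 0 for s in scores]
        score = metric(y_true, y_pred)
        if score > best_score:  # keep the first threshold reaching the maximum
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical held-out labels and predicted probabilities of the positive class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.12, 0.20, 0.22, 0.30, 0.35, 0.45, 0.40, 0.48]

default_pred = [1 if s >= 0.5 else 0 for s in scores]
default_score = balanced_accuracy(y_true, default_pred)
best_threshold, best_score = tune_threshold(y_true, scores, balanced_accuracy)
print(f"cut-off 0.50: balanced accuracy {default_score:.4f}")
print(f"cut-off {best_threshold:.2f}: balanced accuracy {best_score:.4f}")
```

On this toy data the default cut-off never predicts the positive class, while a lower cut-off recovers both positive samples at the price of one false positive.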

  Slides

  Videos

  Tutorials repository

# EuroSciPy 2024

## Probabilistic classification and cost-sensitive learning with scikit-learn

*Tutorial*: Data scientists are repeatedly told that it is absolutely critical to align
their model training methodology with a specific business objective. While this is
rather good advice, it usually falls short on details on how to achieve this in
practice.

This hands-on tutorial aims to introduce helpful theoretical concepts and concrete
software tools to help them bridge this gap. The approach will be illustrated on a worked
practical use case: optimizing the operations of a fraud detection system for a payment
processing platform.

More specifically, we will introduce the concepts of calibrated probabilistic
classifiers, how to evaluate them and fix common causes of mis-calibration. In the second
part, we will explore how to turn probabilistic classifiers into optimal business
decision makers.
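
As an illustration of what evaluating a probabilistic classifier can look like, the sketch below computes the Brier score and a small reliability table in pure Python. The data and the helper names `brier_score` and `reliability_table` are assumptions made for this example; they are not taken from the tutorial material.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - t) ** 2 for t, p in zip(y_true, y_prob)) / len(y_true)


def reliability_table(y_true, y_prob, n_bins=2):
    """Per non-empty bin: (mean predicted probability, observed positive rate, count)."""
    bins = [[] for _ in range(n_bins)]
    for t, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[index].append((t, p))
    table = []
    for members in bins:
        if members:
            mean_prob = sum(p for _, p in members) / len(members)
            positive_rate = sum(t for t, _ in members) / len(members)
            table.append((mean_prob, positive_rate, len(members)))
    return table


# Hypothetical outcomes with two sets of predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
overconfident = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
calibrated = [0.75, 0.75, 0.75, 0.75, 0.25, 0.25, 0.25, 0.25]

brier_overconfident = brier_score(y_true, overconfident)
brier_calibrated = brier_score(y_true, calibrated)
table = reliability_table(y_true, overconfident)
print(brier_overconfident, brier_calibrated)
print(table)
```

In each bin of the overconfident model, the mean predicted probability is further from 0.5 than the observed positive rate, which is the kind of gap a calibration curve makes visible. The calibrated probabilities obtain the lower (better) Brier score.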

  Slides

  Tutorials repository

# DeepLabCut AI Residency 2024

## scikit-learn: An OSS community-driven development

*Abstract*: This talk provides insights regarding the `scikit-learn` community

and the development of the library.

  Slides

# Sacl-AI 2024

## Introduction to machine learning in Python

*Tutorial*: This tutorial provides an introduction to machine learning in Python,

notably using `scikit-learn`.

  Tutorials repository

# DataTalksClub podcast 2024

## Insights regarding the scikit-learn project

*Abstract*: This podcast discusses some insights regarding the `scikit-learn` project

and `imbalanced-learn` project.

  Videos

# PyCon Italia 2024

## A Retrieval Augmented Generation system to query the scikit-learn documentation

*Abstract*: Rubber ducks have been used for many years to help Pythonistas in their
everyday quest. At scikit-learn, we’ve elevated ducky to another level: come and meet the
scikit-learn Ragger Duck, a RAG system designed to answer all your scikit-learn
questions, at least as effectively as a duck can.

  Slides

  Blog post

  Tutorials repository

# PyConDE & PyData Berlin 2024

## A Retrieval Augmented Generation system to query the scikit-learn documentation

*Abstract*: The scikit-learn website currently employs an "exact" search engine based on

the Sphinx Python package, but it has limitations: it cannot handle spelling mistakes

or queries based on natural language. To address these constraints, we experimented

with using large language models (LLMs) and opted for a retrieval augmented generation

(RAG) system due to resource constraints.

This talk introduces our experimental RAG system for querying scikit-learn

documentation. We focus on an open-source software stack and open-weight models. The

talk presents the different stages of the RAG pipeline. We provide documentation

scraping strategies that we designed based on numpydoc and sphinx-gallery, which are

used to build vector indices for the lexical and semantic searches. We compare our RAG

approach with an LLM-only approach to demonstrate the advantage of providing context.

The source code for this experiment is available on GitHub:

https://github.com/glemaitre/sklearn-ragger-duck.

Finally, we discuss the gains and challenges of integrating such a system into an

open-source project, including hosting and cost considerations, comparing it with

alternative approaches.
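
The retrieval and prompt-assembly stages of such a pipeline can be illustrated with a toy sketch. This is not the Ragger Duck code: it only covers a lexical search with a simple TF-IDF overlap score, leaves out the semantic search and the call to a language model, and uses hand-written chunks and hypothetical helper names (`tokenize`, `rank_chunks`, `build_prompt`).

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def rank_chunks(query, chunks):
    """Order chunks by a simple TF-IDF overlap score, dropping non-matching ones."""
    tokenized = [Counter(tokenize(chunk)) for chunk in chunks]
    n_chunks = len(chunks)

    def idf(token):
        # Smoothed inverse document frequency.
        n_containing = sum(1 for counts in tokenized if token in counts)
        return math.log((1 + n_chunks) / (1 + n_containing)) + 1.0

    query_tokens = set(tokenize(query))
    scored = []
    for chunk, counts in zip(chunks, tokenized):
        score = sum(counts[token] * idf(token) for token in query_tokens)
        scored.append((score, chunk))
    ranked = sorted(scored, key=lambda item: -item[0])
    return [chunk for score, chunk in ranked if score > 0]


def build_prompt(query, context_chunks):
    """Assemble the text that would be sent to a language model."""
    context = "\n".join(f"- {chunk}" for chunk in context_chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"


# Hand-written chunks for illustration, not scraped from the real documentation.
chunks = [
    "StandardScaler standardizes features by removing the mean and scaling to unit variance.",
    "GridSearchCV performs an exhaustive search over specified parameter values.",
    "Pipeline chains transformers with a final estimator.",
]
query = "How do I search over parameter values?"
top_chunks = rank_chunks(query, chunks)[:1]
prompt = build_prompt(query, top_chunks)
print(prompt)
```

A purely lexical scorer like this one still fails on misspelled queries, which is one reason to pair it with a semantic search over embeddings.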

  Slides

  Blog post

  Videos

  Tutorials repository

# CDiscount 2024

## Get the best from your scikit-learn classifier

*Abstract*: When operating a classifier in a production setting (i.e. predictive phase),
practitioners are potentially interested in two different outputs: a "hard" decision
used to drive a business decision and/or a "soft" decision providing a confidence score
linked to each potential decision (usually related to class probabilities).

Scikit-learn does not provide any flexibility to go from "soft" to "hard" predictions:
it uses a cut-off point at a confidence score of 0.5 (or 0 when using `decision_function`)
to get class labels. However, optimizing a classifier to get a confidence score close to
the true probabilities (i.e. a calibrated classifier) does not guarantee accurate "hard"
predictions using this heuristic. Conversely, training a classifier for an optimum "hard"
prediction accuracy (with the cut-off constraint at 0.5) does not guarantee obtaining a
calibrated classifier.

In this talk, we will present a new scikit-learn meta-estimator allowing us to get the
best of the two worlds: a calibrated classifier providing optimum "hard" predictions.
This meta-estimator will land in a future version of scikit-learn:
https://github.com/scikit-learn/scikit-learn/pull/26120.

We will provide some insights regarding the way to obtain accurate probabilities and
predictions and also illustrate how to use this model in practice on different use
cases: cost-sensitive problems and imbalanced classification problems.
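
For the cost-sensitive case, a small pure-Python sketch shows why a fixed 0.5 cut-off can be a poor choice. It assumes calibrated probabilities, a fixed cost per false positive and per false negative, and no cost for correct decisions; under those assumptions, predicting the positive class is cheaper whenever `p * cost_fn >= (1 - p) * cost_fp`, i.e. when `p >= cost_fp / (cost_fp + cost_fn)`. The data, the costs, and the helper `total_cost` are made up for illustration and are not the meta-estimator discussed in the talk.

```python
def total_cost(y_true, y_prob, threshold, cost_fp, cost_fn):
    """Total cost of the hard decisions obtained with `threshold`."""
    cost = 0.0
    for t, p in zip(y_true, y_prob):
        decision = 1 if p >= threshold else 0
        if decision == 1 and t == 0:
            cost += cost_fp  # false positive
        elif decision == 0 and t == 1:
            cost += cost_fn  # false negative
    return cost


# Hypothetical outcomes and predicted probabilities of the positive class.
y_true = [0, 0, 0, 0, 0, 1, 1, 1]
y_prob = [0.02, 0.05, 0.08, 0.15, 0.30, 0.60, 0.20, 0.40]

# Assumed costs: missing a positive is ten times worse than a false alarm.
cost_fp, cost_fn = 1.0, 10.0
cost_threshold = cost_fp / (cost_fp + cost_fn)

cost_default = total_cost(y_true, y_prob, 0.5, cost_fp, cost_fn)
cost_tuned = total_cost(y_true, y_prob, cost_threshold, cost_fp, cost_fn)
print(f"cut-off 0.50: total cost {cost_default}")
print(f"cut-off {cost_threshold:.3f}: total cost {cost_tuned}")
```

With these numbers, the default cut-off misses two positives, whereas the cost-derived cut-off trades them for two cheaper false alarms.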

  Slides

# PyData Paris Meetup 2024

## Get the best from your scikit-learn classifier

*Abstract*: When operating a classifier in a production setting (i.e. predictive phase),
practitioners are potentially interested in two different outputs: a "hard" decision
used to drive a business decision and/or a "soft" decision providing a confidence score
linked to each potential decision (usually related to class probabilities).

Scikit-learn does not provide any flexibility to go from "soft" to "hard" predictions:
it uses a cut-off point at a confidence score of 0.5 (or 0 when using `decision_function`)
to get class labels. However, optimizing a classifier to get a confidence score close to
the true probabilities (i.e. a calibrated classifier) does not guarantee accurate "hard"
predictions using this heuristic. Conversely, training a classifier for an optimum "hard"
prediction accuracy (with the cut-off constraint at 0.5) does not guarantee obtaining a
calibrated classifier.

In this talk, we will present a new scikit-learn meta-estimator allowing us to get the
best of the two worlds: a calibrated classifier providing optimum "hard" predictions.
This meta-estimator will land in a future version of scikit-learn:
https://github.com/scikit-learn/scikit-learn/pull/26120.

We will provide some insights regarding the way to obtain accurate probabilities and
predictions and also illustrate how to use this model in practice on different use
cases: cost-sensitive problems and imbalanced classification problems.

  Slides

# PyData Global 2023

## Get the best from your scikit-learn classifier

*Abstract*: When operating a classifier in a production setting (i.e. predictive phase),
practitioners are potentially interested in two different outputs: a "hard" decision
used to drive a business decision and/or a "soft" decision providing a confidence score
linked to each potential decision (usually related to class probabilities).

Scikit-learn does not provide any flexibility to go from "soft" to "hard" predictions:
it uses a cut-off point at a confidence score of 0.5 (or 0 when using `decision_function`)
to get class labels. However, optimizing a classifier to get a confidence score close to
the true probabilities (i.e. a calibrated classifier) does not guarantee accurate "hard"
predictions using this heuristic. Conversely, training a classifier for an optimum "hard"
prediction accuracy (with the cut-off constraint at 0.5) does not guarantee obtaining a
calibrated classifier.

In this talk, we will present a new scikit-learn meta-estimator allowing us to get the
best of the two worlds: a calibrated classifier providing optimum "hard" predictions.
This meta-estimator will land in a future version of scikit-learn:
https://github.com/scikit-learn/scikit-learn/pull/26120.

We will provide some insights regarding the way to obtain accurate probabilities and
predictions and also illustrate how to use this model in practice on different use
cases: cost-sensitive problems and imbalanced classification problems.

  Slides

  Videos

# EuroSciPy 2023

## Get the best from your scikit-learn classifier

*Abstract*: When operating a classifier in a production setting (i.e. predictive phase),
practitioners are potentially interested in two different outputs: a "hard" decision
used to drive a business decision and/or a "soft" decision providing a confidence score
linked to each potential decision (usually related to class probabilities).

Scikit-learn does not provide any flexibility to go from "soft" to "hard" predictions:
it uses a cut-off point at a confidence score of 0.5 (or 0 when using `decision_function`)
to get class labels. However, optimizing a classifier to get a confidence score close to
the true probabilities (i.e. a calibrated classifier) does not guarantee accurate "hard"
predictions using this heuristic. Conversely, training a classifier for an optimum "hard"
prediction accuracy (with the cut-off constraint at 0.5) does not guarantee obtaining a
calibrated classifier.

In this talk, we will present a new scikit-learn meta-estimator allowing us to get the
best of the two worlds: a calibrated classifier providing optimum "hard" predictions.
This meta-estimator will land in a future version of scikit-learn:
https://github.com/scikit-learn/scikit-learn/pull/26120.

We will provide some insights regarding the way to obtain accurate probabilities and
predictions and also illustrate how to use this model in practice on different use
cases: cost-sensitive problems and imbalanced classification problems.

  Slides

  Videos

# PyConDE & PyData Berlin 2022

## Inspect and try to interpret your `scikit-learn` machine-learning models

*Abstract*: This tutorial is subdivided into three parts. First, we focus on

the family of linear models and present the common pitfalls to be aware of when

interpreting the coefficients of such models. Then, we look at a larger range

of models (e.g. gradient-boosting) and put into practice available inspection

techniques developed in `scikit-learn` to inspect such models. Finally, we

present other tools to interpret models (e.g. `shap`), not currently available

in `scikit-learn`, but widely used in practice.

  Slides

  Videos

  Tutorials repository

# PyLadies Paris 2022

## Inspecting your predictive model in Python

*Abstract*: This presentation introduces the tools available for inspecting
your predictive model in Python. We will first quickly present what we mean by
a predictive model and what it implies when one wants to explain the decisions
of such a model. We will provide a quick taxonomy of the current methods
intended to explain predictive models. Finally, we will give an overview of the
available tools in `scikit-learn` and `shap`.

  Slides

# Euler Hermes 2019

## Learning from imbalanced datasets: state of the art

*Abstract*: This presentation gives an overview of the state of the art of

predictive modelling with imbalanced datasets.

  Slides

# EuroSciPy 2019

## Rapid Analytics & Model Prototyping (RAMP)

*Abstract*: We will give an overview of the RAMP framework, which provides a
platform to organize reproducible and transparent data challenges. RAMP
workflow is a Python package used to define and formalize the data science
problem to be solved. It can be used as a standalone package and allows a user
to prototype different solutions. In addition to RAMP workflow, a set of
packages has been developed to share and collaborate around the solutions
developed by users. RAMP database provides a database structure to
store the solutions of different users and the performance of these solutions.
RAMP engine is the package to run the user solutions (possibly on the cloud)
and populate the database. Finally, RAMP frontend is the web frontend where
users can upload their solutions and which shows the leaderboard of the
challenge. The project is open-source and can be deployed on any local server.
The framework has been used at the Paris-Saclay Center for Data Science for
setting up and solving about twenty scientific problems, for organizing
collaborative data challenges, for organizing scientific sub-communities around
these events, and for training novice data scientists.

  Slides

  RAMP board

  RAMP workflow

## Introduction to `scikit-learn`: from model fitting to model interpretation

*Abstract*: Our introduction to scikit-learn will be subdivided into 2 parts.
First, we will give a general introduction to scikit-learn presenting basic
concepts around cross-validation, pipeline estimators, and hyperparameter
search. Then, we will focus on model interpretation, presenting the challenges
and the available tools to understand a trained machine-learning model: partial
dependence plots, feature importance, LIME, Shapley values, etc.
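
As an illustration of one of these inspection tools, partial dependence can be computed by hand: force one feature to each value of a grid for every sample and average the predictions. The sketch below is a pure-Python illustration on a hypothetical toy model, with made-up data and an assumed helper name `partial_dependence`; it is not the scikit-learn implementation.

```python
def partial_dependence(predict, rows, feature_index, grid):
    """Average prediction when one feature is forced to each grid value."""
    averages = []
    for value in grid:
        total = 0.0
        for row in rows:
            modified = list(row)  # copy so the original data is untouched
            modified[feature_index] = value
            total += predict(modified)
        averages.append(total / len(rows))
    return averages


def toy_model(row):
    """Hypothetical fitted model: additive and linear in both features."""
    return 2.0 * row[0] + 0.5 * row[1]


rows = [[1.0, 0.0], [2.0, 2.0], [3.0, 4.0]]
pd_values = partial_dependence(toy_model, rows, feature_index=0, grid=[0.0, 1.0, 2.0])
print(pd_values)
```

Because the toy model is additive, the resulting curve is a straight line whose slope equals the coefficient of the inspected feature. With interactions between features, the average can hide effects that differ from one sample to another.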

  Slides

  Tutorials repository

# EuroSciPy 2018

## Imbalanced-learn: A scikit-learn-contrib package to tackle learning from imbalanced data sets

*Abstract*: The curse of imbalanced data sets refers to data sets in which the
number of samples in one class is lower than in the others. This issue is often
encountered in real-world data sets such as medical imaging applications
(e.g. cancer detection), fraud detection, etc. Under such conditions,
machine learning algorithms learn sub-optimal models which will generally favor
the class having the largest number of samples. In this talk, we review the
different strategies available to learn a statistical model under those
specific conditions. Then, we will present the `imbalanced-learn` package and
the new features which will be released in the new version 0.4.
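
One of the simplest resampling strategies is random over-sampling: minority-class samples are duplicated at random until every class has as many samples as the largest one. The pure-Python sketch below only illustrates the idea on made-up data, with an assumed helper name `random_over_sample`. It is not the implementation of the `RandomOverSampler` class provided by `imbalanced-learn`.

```python
import random
from collections import Counter


def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen minority samples until classes are balanced."""
    rng = random.Random(seed)
    counts = Counter(y)
    n_target = max(counts.values())
    X_res, y_res = list(X), list(y)
    for label, count in counts.items():
        indices = [i for i, target in enumerate(y) if target == label]
        for _ in range(n_target - count):
            i = rng.choice(indices)  # sample with replacement
            X_res.append(X[i])
            y_res.append(label)
    return X_res, y_res


# Hypothetical data set with eight majority and two minority samples.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6], [0.7], [0.8], [2.5], [2.7]]
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]

X_res, y_res = random_over_sample(X, y)
print(Counter(y), Counter(y_res))
```

Resampling should only be applied to the training data, so that the evaluation still reflects the original class distribution.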

  Slides

  Package

# CDS Pitching Day 2017

## RAMP on predicting autism from resting-state functional MRI and anatomical MRI

*Abstract*: This talk will present the ongoing preparation of a RAMP aiming at

distinguishing subjects with Autism Spectrum Disorder (ASD) from typical

control subjects. This analysis will use the Autism Brain Imaging Data Exchange

(ABIDE I & II) database and data from Robert Debre Hospital based on R-fMRI and

anatomical MRI. We will particularly focus on presenting the problem, the

typical pipeline answering this problem, and the current status of this RAMP.

This work is in collaboration with the Pasteur Institute (Neuroanatomy group of

the Unit of Human Genetics and Cognitive Functions).

  Slides

# EuroSciPy 2017

## Leverage knowledge from under-represented classes in machine learning: imbalanced-learn release 0.3.0

*Abstract*: The curse of imbalanced data sets refers to data sets in which the
number of samples in one class is lower than in the others. This issue is often
encountered in real-world data sets such as medical imaging applications
(e.g. cancer detection), fraud detection, etc. Under such conditions,
machine learning algorithms learn sub-optimal models which will generally favor
the class having the largest number of samples. In this talk, we present the
new features which are available in release 0.3.0.

  Slides

  Package

# PyParis 2017

## Leverage knowledge from under-represented classes in machine learning: an introduction to imbalanced-learn

*Abstract*: The curse of imbalanced data sets refers to data sets in which the
number of samples in one class is lower than in the others. This issue is often
encountered in real-world data sets such as medical imaging applications
(e.g. cancer detection), fraud detection, etc. Under such conditions,
machine learning algorithms learn sub-optimal models which will generally favor
the class having the largest number of samples. In this talk, we will present
the imbalanced-learn package, which implements some of the state-of-the-art
algorithms tackling the class imbalance problem.

  Slides

  Package