awesome-python-data-science
Probably the best curated list of data science software in Python.
https://github.com/krzjoa/awesome-python-data-science
-
Machine Learning
-
General Purpose Machine Learning
- pystruct - Simple structured learning framework for Python. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- sklearn-expertsys - Highly interpretable classifiers for scikit learn. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- RuleFit - Implementation of the RuleFit algorithm. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- pyGAM - Generalized Additive Models in Python.
- causalml - Uplift modeling and causal inference with machine learning algorithms. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- SciPy - Fundamental algorithms for scientific computing in Python
- metric-learn - Metric learning algorithms in Python. <img height="20" src="img/sklearn_big.png" alt="sklearn">
-
Gradient Boosting
- XGBoost - Scalable, Portable, and Distributed Gradient Boosting. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/gpu_big.png" alt="GPU accelerated">
- LightGBM - A fast, distributed, high-performance gradient boosting framework. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/gpu_big.png" alt="GPU accelerated">
- CatBoost - An open-source gradient boosting on decision trees library. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/gpu_big.png" alt="GPU accelerated">
- ThunderGBM - Fast GBDTs and Random Forests on GPUs. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/gpu_big.png" alt="GPU accelerated">
- NGBoost - Natural Gradient Boosting for Probabilistic Prediction.
- TensorFlow Decision Forests - A collection of state-of-the-art algorithms for the training, serving and interpretation of Decision Forest models in Keras. <img height="20" src="img/keras_big.png" alt="keras"> <img height="20" src="img/tf_big2.png" alt="TensorFlow">
-
Imbalanced Datasets
- imbalanced-learn - Module to perform under-sampling and over-sampling with various techniques. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- imbalanced-algorithms - Python-based implementations of algorithms for learning on imbalanced data. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/tf_big2.png" alt="TensorFlow">
-
Kernel Methods
- pyFM - Factorization machines in python. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- fastFM - A library for Factorization Machines. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- tffm - TensorFlow implementation of an arbitrary order Factorization Machine. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- liquidSVM - An implementation of SVMs.
- scikit-rvm - Relevance Vector Machine implementation using the scikit-learn API. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- ThunderSVM - A fast SVM Library on GPUs and CPUs. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/gpu_big.png" alt="GPU accelerated">
-
Random Forests
- rpforest - A forest of random projection trees. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- sklearn-random-bits-forest - Wrapper of the Random Bits Forest program written by Wang et al. (2016). <img height="20" src="img/sklearn_big.png" alt="sklearn">
- rgf_python - Python Wrapper of Regularized Greedy Forest. <img height="20" src="img/sklearn_big.png" alt="sklearn">
-
-
Model Explanation
-
Others
- dalex - moDel Agnostic Language for Exploration and explanation. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/R_big.png" alt="R inspired/ported lib">
- Shapley - A data-driven framework to quantify the value of classifiers in a machine learning ensemble.
- Alibi - Algorithms for monitoring and explaining machine learning models.
- anchor - Code for "High-Precision Model-Agnostic Explanations" paper.
- aequitas - Bias and Fairness Audit Toolkit.
- Contrastive Explanation - Contrastive Explanation (Foil Trees). <img height="20" src="img/sklearn_big.png" alt="sklearn">
- yellowbrick - Visual analysis and diagnostic tools to facilitate machine learning model selection. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- scikit-plot - An intuitive library to add plotting functionality to scikit-learn objects. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- ELI5 - A library for debugging/inspecting machine learning classifiers and explaining their predictions.
- Lime - Explaining the predictions of any machine learning classifier. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- FairML - A Python toolbox for auditing machine learning models for bias. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- L2X - Code for replicating the experiments in the paper *Learning to Explain: An Information-Theoretic Perspective on Model Interpretation*.
- PDPbox - Partial dependence plot toolbox.
- PyCEbox - Python Individual Conditional Expectation Plot Toolbox.
- model-analysis - Model analysis tools for TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- themis-ml - A library that implements fairness-aware machine learning algorithms. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- treeinterpreter - Interpreting scikit-learn's decision tree and random forest predictions. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- Auralisation - Auralisation of learned features in CNN (for audio).
- CapsNet-Visualization - A visualization of the CapsNet layers to better understand how it works.
- lucid - A collection of infrastructure and tools for research in neural network interpretability.
- Netron - Visualizer for deep learning and machine learning models (no Python code, but visualizes models from most Python Deep Learning frameworks).
- Skater - Python Library for Model Interpretation.
- AI Explainability 360 - Interpretability and explainability of data and machine learning models.
- tensorboard-pytorch - Tensorboard for PyTorch (and chainer, mxnet, numpy, ...).
- shap - A unified approach to explain the output of any machine learning model. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- InterpretML - InterpretML implements the Explainable Boosting Machine (EBM), a modern, fully interpretable machine learning model based on Generalized Additive Models (GAMs). This open-source package also provides visualization tools for EBMs, other glass-box models, and black-box explanations. <img height="20" src="img/sklearn_big.png" alt="sklearn">
-
-
Natural Language Processing
-
Others
- spaCy - Industrial-Strength Natural Language Processing.
- gensim - Topic Modelling for Humans.
- torchtext - Data loaders and abstractions for text and NLP. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- NLTK - Modules, data sets, and tutorials supporting research and development in Natural Language Processing.
- CLTK - The Classical Language Toolkit.
- pyMorfologik - Python binding for [Morfologik](https://github.com/morfologik/morfologik-stemming).
- skift - Scikit-learn wrappers for Python fastText. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- Phonemizer - Simple text-to-phonemes converter for multiple languages.
- flair - Very simple framework for state-of-the-art NLP.
-
-
Optimization
-
Others
- OR-Tools - An open-source software suite for optimization by Google; provides a unified programming interface to a half dozen solvers: SCIP, GLPK, GLOP, CP-SAT, CPLEX, and Gurobi.
- Optuna - A hyperparameter optimization framework.
- pymoo - Multi-objective Optimization in Python.
- pycma - Python implementation of CMA-ES.
- Spearmint - Bayesian optimization.
- scikit-opt - Heuristic Algorithms for optimization.
- sklearn-genetic-opt - Hyperparameters tuning and feature selection using evolutionary algorithms. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- SMAC3 - Sequential Model-based Algorithm Configuration.
- Optunity - A library containing various optimizers for hyperparameter tuning.
- hyperopt - Distributed Asynchronous Hyperparameter Optimization in Python.
- hyperopt-sklearn - Hyper-parameter optimization for sklearn. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- sklearn-deap - Use evolutionary algorithms instead of gridsearch in scikit-learn. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- Bayesian Optimization - A Python implementation of global optimization with gaussian processes.
- SafeOpt - Safe Bayesian Optimization.
- scikit-optimize - Sequential model-based optimization with a `scipy.optimize` interface.
- Solid - A comprehensive gradient-free optimization framework written in Python.
- PySwarms - A research toolkit for particle swarm optimization in Python.
- Platypus - A Free and Open Source Python Library for Multiobjective Optimization.
- GPflowOpt - Bayesian Optimization using GPflow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- Talos - Hyperparameter Optimization for Keras Models.
- nlopt - Library for nonlinear optimization (global and local, constrained or unconstrained).
- BoTorch - Bayesian optimization in PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
-
-
Probabilistic Graphical Models
-
Others
- pyAgrum - A GRaphical Universal Modeler.
- pomegranate - Probabilistic and graphical models for Python. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- pgmpy - A python library for working with Probabilistic Graphical Models.
-
-
Probabilistic Methods
-
Others
- ZhuSuan - Bayesian Deep Learning. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- PyMC - Bayesian Stochastic Modelling in Python.
- InferPy - Deep Probabilistic Modelling Made Easy. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- PyStan - Bayesian inference using the No-U-Turn sampler (Python interface).
- sklearn-bayes - Python package for Bayesian Machine Learning with scikit-learn API. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- PyVarInf - Bayesian Deep Learning methods with Variational Inference for PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- emcee - The Python ensemble sampling toolkit for affine-invariant MCMC.
- hsmmlearn - A library for hidden semi-Markov models with explicit durations.
- pyhsmm - Bayesian inference in HSMMs and HMMs.
- GPyTorch - A highly efficient and modular implementation of Gaussian Processes in PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- sklearn-crfsuite - A scikit-learn-inspired API for CRFsuite. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- pyro - A flexible, scalable deep probabilistic programming library built on PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- skpro - Supervised domain-agnostic prediction framework for probabilistic modelling by [The Alan Turing Institute](https://www.turing.ac.uk/). <img height="20" src="img/sklearn_big.png" alt="sklearn">
-
-
Quantum Computing
-
Others
- qiskit - Qiskit is an open-source SDK for working with quantum computers at the level of circuits, algorithms, and application modules.
- cirq - A python framework for creating, editing, and invoking Noisy Intermediate Scale Quantum (NISQ) circuits.
- QML - A Python Toolkit for Quantum Machine Learning.
-
-
Reinforcement Learning
-
Others
- Gymnasium - An API standard for single-agent reinforcement learning environments, with popular reference environments and related utilities (formerly [Gym](https://github.com/openai/gym)).
- PettingZoo - An API standard for multi-agent reinforcement learning environments, with popular reference environments and related utilities.
- MAgent2 - An engine for high performance multi-agent environments with very large numbers of agents, along with a set of reference environments.
- Stable Baselines3 - A set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.
- Shimmy - An API conversion tool for popular external reinforcement learning environments.
- EnvPool - C++-based high-performance parallel environment execution engine (vectorized env) for general RL environments.
- Acme - A library of reinforcement learning components and agents.
- Catalyst-RL - PyTorch framework for RL research. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- d3rlpy - An offline deep reinforcement learning library.
- DI-engine - OpenDILab Decision AI Engine. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- TF-Agents - A library for Reinforcement Learning in TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- TensorForce - A TensorFlow library for applied reinforcement learning. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- TRFL - TensorFlow Reinforcement Learning. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- Dopamine - A research framework for fast prototyping of reinforcement learning algorithms.
- keras-rl - Deep Reinforcement Learning for Keras. <img height="20" src="img/keras_big.png" alt="Keras compatible">
- garage - A toolkit for reproducible reinforcement learning research.
- rlpyt - Reinforcement Learning in PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- cleanrl - High-quality single-file implementations of Deep Reinforcement Learning algorithms with research-friendly features (PPO, DQN, C51, DDPG, TD3, SAC, PPG).
- Machin - A reinforcement learning library designed for PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- SKRL - Modular reinforcement learning library (on PyTorch and JAX) with support for NVIDIA Isaac Gym, Isaac Orbit and Omniverse Isaac Gym. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- Imitation - Clean PyTorch implementations of imitation and reward learning algorithms. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- Tianshou - An elegant PyTorch deep reinforcement learning library. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- Horizon - A platform for Applied Reinforcement Learning.
- RLlib - Scalable Reinforcement Learning.
-
-
Spatial Analysis
-
Statistics
-
Others
- statsmodels - Statistical modeling and econometrics in Python.
- stockstats - Provides a `StockDataFrame` wrapper around `pandas.DataFrame` with inline stock statistics/indicators support.
- weightedcalcs - A pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.
- scikit-posthocs - Pairwise Multiple Comparisons Post-hoc Tests.
- Alphalens - Performance analysis of predictive (alpha) stock factors.
- Pandas Profiling - Create HTML profiling reports from pandas DataFrame objects. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
-
-
Time Series
-
Others
- dateutil - Powerful extensions to the standard datetime module
- skforecast - Time series forecasting with machine learning models
- darts - A python library for easy manipulation and forecasting of time series.
- statsforecast - Lightning fast forecasting with statistical and econometric models.
- mlforecast - Scalable machine learning-based time series forecasting.
- neuralforecast - Scalable neural network-based time series forecasting.
- tslearn - Machine learning toolkit dedicated to time-series data. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- tick - Module for statistical learning, with a particular emphasis on time-dependent modeling. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- greykite - A flexible, intuitive, and fast forecasting library.
- Prophet - Automatic Forecasting Procedure.
- PyFlux - Open source time series library for Python.
- bayesloop - Probabilistic programming framework that facilitates objective model selection for time-varying parameter models.
- luminol - Anomaly Detection and Correlation library.
- maya - Makes it easy to parse datetime strings and change timezones.
- Chaos Genius - ML powered analytics engine for outlier/anomaly detection and root cause analysis
- sktime - A unified framework for machine learning with time series. <img height="20" src="img/sklearn_big.png" alt="sklearn">
-
-
Uncategorized
-
Uncategorized
- TabGAN - Synthetic tabular data generation using GANs, Diffusion Models, and LLMs. <img height="16" width="16" src="https://github.com/krzjoa/awesome-python-data-science/raw/master/img/sklearn_big.png" alt="sklearn">
-
-
Visualization
-
Automatic Plotting
-
General Purposes
- Matplotlib - Plotting with Python.
- seaborn - Statistical data visualization using matplotlib.
- prettyplotlib - Painlessly create beautiful matplotlib plots.
- python-ternary - Ternary plotting library for Python with matplotlib.
- missingno - Missing data visualization module for Python.
- physt - Improved histograms.
-
Interactive plots
- plotly - A Python library that makes interactive and publication-quality graphs.
- Altair - Declarative statistical visualization library for Python. Many data transformations can be done directly in the code that creates the graph.
- animatplot - A python package for animating plots built on matplotlib.
- Bokeh - Interactive Web Plotting for Python.
- bqplot - Plotting library for IPython/Jupyter notebooks
- pyecharts - A Python interactive charting library built on [Echarts](https://github.com/apache/echarts), a charting and visualization library. <img height="20" src="img/pyecharts.png" alt="pyecharts"> <img height="20" src="img/echarts.png" alt="echarts">
-
Map
-
NLP
-
-
Web Scraping
-
Others
- BeautifulSoup
- Selenium
- Pattern - Scraping for well-established websites such as Google, Twitter, and Wikipedia. Also includes NLP, machine learning algorithms, and visualization.
- twitterscraper
-