awesome-python-data-science
Probably the best curated list of data science software in Python.
https://github.com/krzjoa/awesome-python-data-science
-
Automated Machine Learning
-
Others
- auto-sklearn - An AutoML toolkit and a drop-in replacement for a scikit-learn estimator. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- AutoKeras - AutoML library for deep learning. <img height="20" src="img/keras_big.png" alt="Keras compatible">
- AutoGluon - AutoML for Image, Text, Tabular, Time-Series, and MultiModal Data.
- MLBox - A powerful Automated Machine Learning Python library.
- TPOT - AutoML tool that optimizes machine learning pipelines using genetic programming. <img height="20" src="img/sklearn_big.png" alt="sklearn">
-
-
Computations
-
Synthetic Data
- numpy - The fundamental package needed for scientific computing with Python.
- Dask - Parallel computing with task scheduling. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
- CuPy - NumPy-like API accelerated with CUDA.
- scikit-tensor - Python library for multilinear algebra and tensor factorizations.
- numdifftools - Solve automatic numerical differentiation problems in one or more variables.
- quaternion - Add built-in support for quaternions to numpy.
- adaptive - Tools for adaptive and parallel sampling of mathematical functions.
- NumExpr - A fast numerical expression evaluator for NumPy that comes with an integrated computing virtual machine to speed calculations up by avoiding memory allocation for intermediate results.
-
-
Computer Audition
-
Others
- torchaudio - An audio library for PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- librosa - Python library for audio and music analysis.
- Yaafe - Audio features extraction.
- aubio - A library for audio and music analysis.
- Essentia - Library for audio and music analysis, description, and synthesis.
- LibXtract - A simple, portable, lightweight library of audio feature extraction functions.
- Marsyas - Music Analysis, Retrieval, and Synthesis for Audio Signals.
- muda - A library for augmenting annotated audio data.
- madmom - Python audio and music signal processing library.
-
-
Computer Vision
-
Others
- torchvision - Datasets, Transforms, and Models specific to Computer Vision. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- PyTorch3D - PyTorch3D is FAIR's library of reusable components for deep learning with 3D data. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- KerasCV - Industry-strength Computer Vision workflows with Keras. <img height="20" src="img/keras_big.png" alt="Keras compatible">
- OpenCV - Open Source Computer Vision Library.
- Decord - An efficient video loader for deep learning with smart shuffling that's super easy to digest.
- MMEngine - OpenMMLab Foundational Library for Training Deep Learning Models. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- scikit-image - Image Processing SciKit (Toolbox for SciPy).
- imgaug - Image augmentation for machine learning experiments.
- Augmentor - Image augmentation library in Python for machine learning.
- LAVIS - A One-stop Library for Language-Vision Intelligence.
- imgaug_extension - Additional augmentations for imgaug.
- albumentations - Fast image augmentation library and easy-to-use wrapper around other libraries.
-
-
Conversion
-
Synthetic Data
- sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript, and others.
- ONNX - Open Neural Network Exchange.
- MMdnn - A set of tools to help users inter-operate among different deep learning frameworks.
- treelite - Universal model exchange and serialization format for decision tree forests.
-
-
Data Manipulation
-
Data-centric AI
-
Data Frames
- pandas - Powerful Python data analysis toolkit.
- polars - A fast multi-threaded, hybrid-out-of-core DataFrame library.
- Arctic - High-performance datastore for time series and tick data.
- datatable - Data.table for Python. <img height="20" src="img/R_big.png" alt="R inspired/ported lib">
- cuDF - GPU DataFrame Library. <img height="20" src="img/pandas_big.png" alt="pandas compatible"> <img height="20" src="img/gpu_big.png" alt="GPU accelerated">
- blaze - NumPy and pandas interface to Big Data. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
- pandasql - Allows you to query pandas DataFrames using SQL syntax. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
- xpandas - Universal 1d/2d data containers with Transformers functionality for data analysis by [The Alan Turing Institute](https://www.turing.ac.uk/).
- pysparkling - A pure Python implementation of Apache Spark's RDD and DStream interfaces. <img height="20" src="img/spark_big.png" alt="Apache Spark based">
- modin - Speed up your pandas workflows by changing a single line of code. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
- swifter - A package that efficiently applies any function to a pandas dataframe or series in the fastest available manner.
- pandas-log - A package that allows providing feedback about basic pandas operations and finds both business logic and performance issues.
- vaex - Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second.
- xarray - Xarray combines the best features of NumPy and pandas for multidimensional data selection by supplementing numerical axis labels with named dimensions for more intuitive, concise, and less error-prone indexing routines.
- pandas-gbq - pandas Google Big Query. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
-
Pipelines
- SSPipe - Python pipe (|) operator with support for DataFrames, NumPy, and PyTorch.
- pandas-ply - Functional data manipulation for pandas. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
- Dplython - Dplyr for Python. <img height="20" src="img/R_big.png" alt="R inspired/ported lib">
- sklearn-pandas - pandas integration with sklearn. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/pandas_big.png" alt="pandas compatible">
- meza - A Python toolkit for processing tabular data.
- Prodmodel - Build system for data science pipelines.
- dopanda - Hints and tips for using pandas in an analysis environment. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
- Hamilton - A microframework for dataframe generation that applies Directed Acyclic Graphs specified by a flow of lazily evaluated Python functions.
- pdpipe - Easy pipelines for pandas DataFrames.
- Dataset - Helps you conveniently work with random or sequential batches of your data and define data processing.
-
Synthetic Data
- ydata-synthetic - A package to generate synthetic tabular and time-series data leveraging the state-of-the-art generative models. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
-
-
Data Validation
-
Synthetic Data
- great_expectations - Always know what to expect from your data.
- pandera - A lightweight, flexible, and expressive statistical data testing library.
- deepchecks - Validation & testing of ML models and data during model development, deployment, and production. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- evidently - Evaluate and monitor ML models from validation to production.
- TensorFlow Data Validation - Library for exploring and validating machine learning data.
- DataComPy - A library to compare Pandas, Polars, and Spark data frames. It provides stats and lets users adjust for match accuracy.
-
-
Deep Learning
-
JAX
-
Others
- transformers - State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible"> <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- Tangent - Source-to-Source Debuggable Derivatives in Pure Python.
- autograd - Efficiently computes derivatives of numpy code.
- Caffe - A fast open framework for deep learning.
- nnabla - Neural Network Libraries by Sony.
-
PyTorch
- PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- ignite - High-level library to help with training neural networks in PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- Catalyst - High-level utils for PyTorch DL & RL research. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- ChemicalX - A PyTorch-based deep learning library for drug pair scoring. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
-
TensorFlow
- Keras - A high-level neural networks API running on top of TensorFlow. <img height="20" src="img/keras_big.png" alt="Keras compatible">
- TensorFlow - Computation using data flow graphs for scalable machine learning by Google. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- TFLearn - Deep learning library featuring a higher-level API for TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- Sonnet - TensorFlow-based neural network library. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- tensorpack - A Neural Net Training Interface on TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- Polyaxon - A platform that helps you build, manage and monitor deep learning models. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- tfdeploy - Deploy TensorFlow graphs for fast evaluation and export to TensorFlow-less environments running numpy. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- tensorflow-upstream - TensorFlow ROCm port. <img height="20" src="img/tf_big2.png" alt="TensorFlow"> <img height="20" src="img/amd_big.png" alt="Possible to run on AMD GPU">
- TensorFlow Fold - Deep learning with dynamic computation graphs in TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- Mesh TensorFlow - Model Parallelism Made Easier. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- keras-contrib - Keras community contributions. <img height="20" src="img/keras_big.png" alt="Keras compatible">
- Hyperas - Keras + Hyperopt: A straightforward wrapper for convenient hyperparameter optimization. <img height="20" src="img/keras_big.png" alt="Keras compatible">
- Elephas - Distributed Deep learning with Keras & Spark. <img height="20" src="img/keras_big.png" alt="Keras compatible">
- qkeras - A quantization deep learning library. <img height="20" src="img/keras_big.png" alt="Keras compatible">
- TensorLight - A high-level framework for TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- Ludwig - A toolbox that allows one to train and test deep learning models without the need to write code. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
-
-
Deployment
-
NLP
- fastapi - A modern, fast (high-performance) web framework for building APIs with Python.
- streamlit - Makes it easy to deploy machine learning models.
- datapane - A collection of APIs to turn scripts and notebooks into interactive reports.
- binder - Enables sharing and executing Jupyter Notebooks.
- gradio - Create UIs for your machine learning model in Python in 3 minutes.
- Vizro - A toolkit for creating modular data visualization applications.
- streamsync - No-code in the front, Python in the back. An open-source framework for creating data apps.
- Deepnote - Deepnote is a drop-in replacement for Jupyter with an AI-first design, sleek UI, new blocks, and native data integrations. Use Python, R, and SQL locally in your favorite IDE, then scale to Deepnote cloud for real-time collaboration, Deepnote agent, and deployable data apps.
-
-
Distributed Computing
-
Synthetic Data
- Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- Veles - Distributed machine learning platform.
- Jubatus - Framework and Library for Distributed Online Machine Learning.
- DMTK - Microsoft Distributed Machine Learning Toolkit.
- PaddlePaddle - PArallel Distributed Deep LEarning.
- dask-ml - Distributed and parallel machine learning. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- Distributed - Distributed computation in Python.
-
-
Evaluation
-
Synthetic Data
- recmetrics - Library of useful metrics and plots for evaluating recommender systems.
- Metrics - Machine learning evaluation metrics.
- sklearn-evaluation - Model evaluation made easy: plots, tables, and markdown reports. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- AI Fairness 360 - Fairness metrics for datasets and ML models, explanations, and algorithms to mitigate bias in datasets and models.
- alibi-detect - Algorithms for outlier, adversarial and drift detection. <img height="20" src="img/alibi-detect.png" alt="sklearn">
-
-
Experimentation
-
Synthetic Data
- Neptune - A lightweight ML experiment tracking, results visualization, and management tool.
- mlflow - Open source platform for the machine learning lifecycle.
- envd - 🏕️ machine learning development environment for data science and AI/ML engineering teams.
- Sacred - A tool to help you configure, organize, log, and reproduce experiments.
- Ax - Adaptive Experimentation Platform. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- dvc - Data Version Control | Git for Data & Models | ML Experiments Management.
-
-
Feature Engineering
-
Feature Selection
- scikit-feature - Feature selection repository in Python.
- boruta_py - Implementations of the Boruta all-relevant feature selection method. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- BoostARoota - A fast xgboost feature selection algorithm. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- scikit-rebate - A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- zoofs - A feature selection library based on evolutionary algorithms.
-
General
- Featuretools - Automated feature engineering.
- Feature Engine - Feature engineering package with sklearn-like functionality. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- OpenFE - Automated feature generation with expert-level performance.
- Feature Forge - A set of tools for creating and testing machine learning features. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- few - A feature engineering wrapper for sklearn. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- scikit-mdr - A sklearn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- tsfresh - Automatic extraction of relevant features from time series. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- dirty_cat - Machine learning on dirty tabular data (especially: string-based variables for classification and regression). <img height="20" src="img/sklearn_big.png" alt="sklearn">
- NitroFE - Moving window features. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- sk-transformer - A collection of various pandas & scikit-learn compatible transformers for all kinds of preprocessing and feature engineering steps. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
- tubular - Collection of scikit-learn compatible transformers written in [narwhals](https://github.com/narwhals-dev/narwhals), which accept either polars or pandas inputs and use the chosen library under the hood. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/pandas_big.png" alt="pandas compatible">
-
-
Genetic Programming
-
Others
- gplearn - Genetic Programming in Python. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- PyGAD - Genetic Algorithm in Python. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible"> <img height="20" src="img/keras_big.png" alt="keras">
- DEAP - Distributed Evolutionary Algorithms in Python.
- karoo_gp - A Genetic Programming platform for Python with GPU support. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- monkeys - A strongly-typed genetic programming framework for Python.
- sklearn-genetic - Genetic feature selection module for scikit-learn. <img height="20" src="img/sklearn_big.png" alt="sklearn">
-
-
Graph Machine Learning
-
Others
- pytorch_geometric_temporal - Temporal Extension Library for PyTorch Geometric. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- PyTorch Geometric Signed Directed - A signed/directed graph neural network extension library for PyTorch Geometric. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- dgl - Python package built to ease deep learning on graph, on top of existing DL frameworks. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible"> <img height="20" src="img/tf_big2.png" alt="TensorFlow"> <img height="20" src="img/mxnet_big.png" alt="MXNet based">
- Spektral - Deep learning on graphs. <img height="20" src="img/keras_big.png" alt="Keras compatible">
- StellarGraph - Machine Learning on Graphs. <img height="20" src="img/tf_big2.png" alt="TensorFlow"> <img height="20" src="img/keras_big.png" alt="Keras compatible">
- Graph Nets - Build Graph Nets in TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- TensorFlow GNN - A library to build Graph Neural Networks on the TensorFlow platform. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
- Auto Graph Learning - An autoML framework & toolkit for machine learning on graphs.
- PyTorch-BigGraph - Generate embeddings from large-scale graph-structured data. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- Karate Club - An unsupervised machine learning library for graph-structured data.
- Little Ball of Fur - A library for sampling graph structured data.
- GreatX - A graph reliability toolbox based on PyTorch and PyTorch Geometric (PyG). <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- Jraph - A Graph Neural Network Library in JAX.
- pytorch_geometric - Geometric Deep Learning Extension Library for PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- GRAPE - GRAPE is a Rust/Python Graph Representation Learning library for Predictions and Evaluations.
- TRL - Train transformer language models with reinforcement learning.
- Cleora - The Graph Embedding Engine.
-
-
Graph Manipulation
-
Others
- NetworkX - Network Analysis in Python.
- Rustworkx - A high performance Python graph library implemented in Rust.
- graph-tool - An efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks).
- igraph - Python interface for igraph.
-
-
Learning-to-Rank & Recommender Systems
-
Others
- LightFM - A Python implementation of LightFM, a hybrid recommendation algorithm.
- Spotlight - Deep recommender models using PyTorch.
- Surprise - A Python scikit for building and analyzing recommender systems.
- RecBole - A unified, comprehensive and efficient recommendation library. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- allRank - allRank is a framework for training learning-to-rank neural models based on PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- TensorFlow Recommenders - A library for building recommender system models using TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow"> <img height="20" src="img/keras_big.png" alt="Keras compatible">
- TensorFlow Ranking - Learning to Rank in TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
-
-
Machine Learning
-
Ensemble Methods
- ML-Ensemble - High performance ensemble learning. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- Stacking - Simple and useful stacking library written in Python. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- stacked_generalization - Library for machine learning stacking generalization. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- vecstack - Python package for stacking (machine learning technique). <img height="20" src="img/sklearn_big.png" alt="sklearn">
-
General Purpose Machine Learning
- scikit-learn - Machine learning in Python. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- PyCaret - An open-source, low-code machine learning library in Python. <img height="20" src="img/R_big.png" alt="R inspired lib">
- Shogun - Machine learning toolbox.
- xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package.
- cuML - RAPIDS Machine Learning Library. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/gpu_big.png" alt="GPU accelerated">
- Sparkit-learn - PySpark + scikit-learn = Sparkit-learn. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/spark_big.png" alt="Apache Spark based">
- mlpack - A scalable C++ machine learning library (Python bindings).
- dlib - Toolkit for making real-world machine learning and data analysis applications in C++ (Python bindings).
- MLxtend - Extension and helper modules for Python's data analysis and machine learning libraries. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- hyperlearn - Rewritten Sklearn and Statsmodels: 50%+ faster, 50%+ less RAM usage, with GPU support. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
- Reproducible Experiment Platform (REP) - Machine Learning toolbox for Humans. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- scikit-multilearn - Multi-label classification for Python. <img height="20" src="img/sklearn_big.png" alt="sklearn">
- seqlearn - Sequence classification toolkit for Python. <img height="20" src="img/sklearn_big.png" alt="sklearn">
-