An open API service indexing awesome lists of open source software.

awesome-python-data-science

Probably the best curated list of data science software in Python.
https://github.com/krzjoa/awesome-python-data-science

  • Automated Machine Learning

    • Others

      • auto-sklearn - An AutoML toolkit and a drop-in replacement for a scikit-learn estimator. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • Auto-PyTorch - Automatic architecture search and hyperparameter optimization for PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • AutoKeras - AutoML library for deep learning. <img height="20" src="img/keras_big.png" alt="Keras compatible">
      • AutoGluon - AutoML for Image, Text, Tabular, Time-Series, and MultiModal Data.
      • MLBox - A powerful Automated Machine Learning python library.
      • TPOT - AutoML tool that optimizes machine learning pipelines using genetic programming. <img height="20" src="img/sklearn_big.png" alt="sklearn">
  • Computations

    • Synthetic Data

      • numpy - The fundamental package needed for scientific computing with Python.
      • Dask - Parallel computing with task scheduling. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
      • CuPy - NumPy-like API accelerated with CUDA.
      • scikit-tensor - Python library for multilinear algebra and tensor factorizations.
      • numdifftools - Solve automatic numerical differentiation problems in one or more variables.
      • quaternion - Add built-in support for quaternions to numpy.
      • adaptive - Tools for adaptive and parallel sampling of mathematical functions.
      • NumExpr - A fast numerical expression evaluator for NumPy that comes with an integrated computing virtual machine to speed calculations up by avoiding memory allocation for intermediate results.
  • Computer Audition

    • Others

      • torchaudio - An audio library for PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • librosa - Python library for audio and music analysis.
      • Yaafe - Audio features extraction.
      • aubio - A library for audio and music analysis.
      • Essentia - Library for audio and music analysis, description, and synthesis.
      • LibXtract - A simple, portable, lightweight library of audio feature extraction functions.
      • Marsyas - Music Analysis, Retrieval, and Synthesis for Audio Signals.
      • muda - A library for augmenting annotated audio data.
      • madmom - Python audio and music signal processing library.
  • Computer Vision

    • Others

      • torchvision - Datasets, Transforms, and Models specific to Computer Vision. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • PyTorch3D - PyTorch3D is FAIR's library of reusable components for deep learning with 3D data. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • KerasCV - Industry-strength Computer Vision workflows with Keras. <img height="20" src="img/keras_big.png" alt="Keras compatible">
      • OpenCV - Open Source Computer Vision Library.
      • Decord - An efficient video loader for deep learning with smart shuffling that's super easy to digest.
      • MMEngine - OpenMMLab Foundational Library for Training Deep Learning Models. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • scikit-image - Image Processing SciKit (Toolbox for SciPy).
      • imgaug - Image augmentation for machine learning experiments.
      • Augmentor - Image augmentation library in Python for machine learning.
      • LAVIS - A One-stop Library for Language-Vision Intelligence.
      • imgaug_extension - Additional augmentations for imgaug.
      • albumentations - Fast image augmentation library and easy-to-use wrapper around other libraries.
  • Conversion

    • Synthetic Data

      • sklearn-porter - Transpile trained scikit-learn estimators to C, Java, JavaScript, and others.
      • ONNX - Open Neural Network Exchange.
      • MMdnn - A set of tools to help users inter-operate among different deep learning frameworks.
      • treelite - Universal model exchange and serialization format for decision tree forests.
  • Data Manipulation

    • Data-centric AI

      • cleanlab - The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
      • snorkel - A system for quickly generating training data with weak supervision.
      • dataprep - Collect, clean, and visualize your data in Python with a few lines of code.
    • Data Frames

      • pandas - Powerful Python data analysis toolkit.
      • polars - A fast multi-threaded, hybrid-out-of-core DataFrame library.
      • Arctic - High-performance datastore for time series and tick data.
      • datatable - Data.table for Python. <img height="20" src="img/R_big.png" alt="R inspired/ported lib">
      • cuDF - GPU DataFrame Library. <img height="20" src="img/pandas_big.png" alt="pandas compatible"> <img height="20" src="img/gpu_big.png" alt="GPU accelerated">
      • blaze - NumPy and pandas interface to Big Data. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
      • pandasql - Allows you to query pandas DataFrames using SQL syntax. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
      • xpandas - Universal 1d/2d data containers with Transformers functionality for data analysis by [The Alan Turing Institute](https://www.turing.ac.uk/).
      • pysparkling - A pure Python implementation of Apache Spark's RDD and DStream interfaces. <img height="20" src="img/spark_big.png" alt="Apache Spark based">
      • modin - Speed up your pandas workflows by changing a single line of code. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
      • swifter - A package that efficiently applies any function to a pandas dataframe or series in the fastest available manner.
      • pandas-log - A package that allows providing feedback about basic pandas operations and finds both business logic and performance issues.
      • vaex - Out-of-Core DataFrames for Python, ML, visualize and explore big tabular data at a billion rows per second.
      • xarray - Xarray combines the best features of NumPy and pandas for multidimensional data selection by supplementing numerical axis labels with named dimensions for more intuitive, concise, and less error-prone indexing routines.
      • pandas-gbq - pandas Google Big Query. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
    • Pipelines

      • SSPipe - Python pipe (|) operator with support for DataFrames, NumPy, and PyTorch.
      • pandas-ply - Functional data manipulation for pandas. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
      • Dplython - Dplyr for Python. <img height="20" src="img/R_big.png" alt="R inspired/ported lib">
      • sklearn-pandas - pandas integration with sklearn. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/pandas_big.png" alt="pandas compatible">
      • meza - A Python toolkit for processing tabular data.
      • Prodmodel - Build system for data science pipelines.
      • dopanda - Hints and tips for using pandas in an analysis environment. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
      • Hamilton - A microframework for dataframe generation that applies Directed Acyclic Graphs specified by a flow of lazily evaluated Python functions.
      • pdpipe - Easy pipelines for pandas DataFrames.
      • Dataset - Helps you conveniently work with random or sequential batches of your data and define data processing.
    • Synthetic Data

      • ydata-synthetic - A package to generate synthetic tabular and time-series data leveraging the state-of-the-art generative models. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
  • Data Validation

    • Synthetic Data

      • great_expectations - Always know what to expect from your data.
      • pandera - A lightweight, flexible, and expressive statistical data testing library.
      • deepchecks - Validation & testing of ML models and data during model development, deployment, and production. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • evidently - Evaluate and monitor ML models from validation to production.
      • TensorFlow Data Validation - Library for exploring and validating machine learning data.
      • DataComPy - A library to compare Pandas, Polars, and Spark data frames. It provides stats and lets users adjust for match accuracy.
  • Deep Learning

    • JAX

      • FLAX - A neural network library for JAX that is designed for flexibility.
      • Optax - A gradient processing and optimization library for JAX.
      • JAX - Composable transformations of Python+NumPy programs: differentiate, vectorize, JIT to GPU/TPU, and more.
    • Others

      • transformers - State-of-the-art Machine Learning for PyTorch, TensorFlow, and JAX. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible"> <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • Tangent - Source-to-Source Debuggable Derivatives in Pure Python.
      • autograd - Efficiently computes derivatives of numpy code.
      • Caffe - A fast open framework for deep learning.
      • nnabla - Neural Network Libraries by Sony.
    • PyTorch

      • PyTorch - Tensors and Dynamic neural networks in Python with strong GPU acceleration. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • ignite - High-level library to help with training neural networks in PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • Catalyst - High-level utils for PyTorch DL & RL research. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • ChemicalX - A PyTorch-based deep learning library for drug pair scoring. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
    • TensorFlow

      • Keras - A high-level neural networks API running on top of TensorFlow. <img height="20" src="img/keras_big.png" alt="Keras compatible">
      • TensorFlow - Computation using data flow graphs for scalable machine learning by Google. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • TFLearn - Deep learning library featuring a higher-level API for TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • Sonnet - TensorFlow-based neural network library. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • tensorpack - A Neural Net Training Interface on TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • Polyaxon - A platform that helps you build, manage and monitor deep learning models. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • tfdeploy - Deploy TensorFlow graphs for fast evaluation and export to TensorFlow-less environments running numpy. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • tensorflow-upstream - TensorFlow ROCm port. <img height="20" src="img/tf_big2.png" alt="TensorFlow"> <img height="20" src="img/amd_big.png" alt="Possible to run on AMD GPU">
      • TensorFlow Fold - Deep learning with dynamic computation graphs in TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • Mesh TensorFlow - Model Parallelism Made Easier. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • keras-contrib - Keras community contributions. <img height="20" src="img/keras_big.png" alt="Keras compatible">
      • Hyperas - Keras + Hyperopt: A straightforward wrapper for convenient hyperparameter optimization. <img height="20" src="img/keras_big.png" alt="Keras compatible">
      • Elephas - Distributed Deep learning with Keras & Spark. <img height="20" src="img/keras_big.png" alt="Keras compatible">
      • qkeras - A quantization deep learning library. <img height="20" src="img/keras_big.png" alt="Keras compatible">
      • TensorLight - A high-level framework for TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • Ludwig - A toolbox that allows one to train and test deep learning models without the need to write code. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
  • Deployment

    • NLP

      • fastapi - A modern, fast (high-performance) web framework for building APIs with Python.
      • streamlit - Makes it easy to deploy machine learning models.
      • datapane - A collection of APIs to turn scripts and notebooks into interactive reports.
      • binder - Enables sharing and executing Jupyter Notebooks.
      • gradio - Create UIs for your machine learning model in Python in 3 minutes.
      • Vizro - A toolkit for creating modular data visualization applications.
      • streamsync - No-code in the front, Python in the back. An open-source framework for creating data apps.
      • Deepnote - Deepnote is a drop-in replacement for Jupyter with an AI-first design, sleek UI, new blocks, and native data integrations. Use Python, R, and SQL locally in your favorite IDE, then scale to Deepnote cloud for real-time collaboration, Deepnote agent, and deployable data apps.
  • Distributed Computing

    • Synthetic Data

      • Horovod - Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • Veles - Distributed machine learning platform.
      • Jubatus - Framework and Library for Distributed Online Machine Learning.
      • DMTK - Microsoft Distributed Machine Learning Toolkit.
      • PaddlePaddle - PArallel Distributed Deep LEarning.
      • dask-ml - Distributed and parallel machine learning. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • Distributed - Distributed computation in Python.
  • Evaluation

    • Synthetic Data

      • recmetrics - Library of useful metrics and plots for evaluating recommender systems.
      • Metrics - Machine learning evaluation metrics.
      • sklearn-evaluation - Model evaluation made easy: plots, tables, and markdown reports. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • AI Fairness 360 - Fairness metrics for datasets and ML models, explanations, and algorithms to mitigate bias in datasets and models.
      • alibi-detect - Algorithms for outlier, adversarial and drift detection. <img height="20" src="img/alibi-detect.png" alt="sklearn">
  • Experimentation

    • Synthetic Data

      • Neptune - A lightweight ML experiment tracking, results visualization, and management tool.
      • mlflow - Open source platform for the machine learning lifecycle.
      • envd - 🏕️ machine learning development environment for data science and AI/ML engineering teams.
      • Sacred - A tool to help you configure, organize, log, and reproduce experiments.
      • Ax - Adaptive Experimentation Platform. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • dvc - Data Version Control | Git for Data & Models | ML Experiments Management.
  • Feature Engineering

    • Feature Selection

      • scikit-feature - Feature selection repository in Python.
      • boruta_py - Implementations of the Boruta all-relevant feature selection method. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • BoostARoota - A fast xgboost feature selection algorithm. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • scikit-rebate - A scikit-learn-compatible Python implementation of ReBATE, a suite of Relief-based feature selection algorithms for Machine Learning. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • zoofs - A feature selection library based on evolutionary algorithms.
    • General

      • Featuretools - Automated feature engineering.
      • Feature Engine - Feature engineering package with sklearn-like functionality. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • OpenFE - Automated feature generation with expert-level performance.
      • Feature Forge - A set of tools for creating and testing machine learning features. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • few - A feature engineering wrapper for sklearn. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • scikit-mdr - A sklearn-compatible Python implementation of Multifactor Dimensionality Reduction (MDR) for feature construction. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • tsfresh - Automatic extraction of relevant features from time series. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • dirty_cat - Machine learning on dirty tabular data (especially: string-based variables for classification and regression). <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • NitroFE - Moving window features. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • sk-transformer - A collection of various pandas & scikit-learn compatible transformers for all kinds of preprocessing and feature engineering steps. <img height="20" src="img/pandas_big.png" alt="pandas compatible">
      • tubular - Collection of scikit-learn compatible transformers written in [narwhals](https://github.com/narwhals-dev/narwhals), which accept either polars or pandas inputs and utilise the chosen library under the hood. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/pandas_big.png" alt="pandas compatible">
  • Genetic Programming

    • Others

      • gplearn - Genetic Programming in Python. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • PyGAD - Genetic Algorithm in Python. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible"> <img height="20" src="img/keras_big.png" alt="keras">
      • DEAP - Distributed Evolutionary Algorithms in Python.
      • karoo_gp - A Genetic Programming platform for Python with GPU support. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • monkeys - A strongly-typed genetic programming framework for Python.
      • sklearn-genetic - Genetic feature selection module for scikit-learn. <img height="20" src="img/sklearn_big.png" alt="sklearn">
  • Graph Machine Learning

    • Others

      • pytorch_geometric_temporal - Temporal Extension Library for PyTorch Geometric. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • PyTorch Geometric Signed Directed - A signed/directed graph neural network extension library for PyTorch Geometric. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • dgl - Python package built to ease deep learning on graph, on top of existing DL frameworks. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible"> <img height="20" src="img/tf_big2.png" alt="TensorFlow"> <img height="20" src="img/mxnet_big.png" alt="MXNet based">
      • Spektral - Deep learning on graphs. <img height="20" src="img/keras_big.png" alt="Keras compatible">
      • StellarGraph - Machine Learning on Graphs. <img height="20" src="img/tf_big2.png" alt="TensorFlow"> <img height="20" src="img/keras_big.png" alt="Keras compatible">
      • Graph Nets - Build Graph Nets in Tensorflow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • TensorFlow GNN - A library to build Graph Neural Networks on the TensorFlow platform. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
      • Auto Graph Learning - An autoML framework & toolkit for machine learning on graphs.
      • PyTorch-BigGraph - Generate embeddings from large-scale graph-structured data. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • Karate Club - An unsupervised machine learning library for graph-structured data.
      • Little Ball of Fur - A library for sampling graph structured data.
      • GreatX - A graph reliability toolbox based on PyTorch and PyTorch Geometric (PyG). <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • Jraph - A Graph Neural Network Library in Jax.
      • pytorch_geometric - Geometric Deep Learning Extension Library for PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • GRAPE - A Rust/Python Graph Representation Learning library for Predictions and Evaluations.
      • TRL - Train transformer language models with reinforcement learning.
      • Cleora - The Graph Embedding Engine.
  • Graph Manipulation

    • Others

      • Networkx - Network Analysis in Python.
      • Rustworkx - A high performance Python graph library implemented in Rust.
      • graph-tool - An efficient Python module for manipulation and statistical analysis of graphs (a.k.a. networks).
      • igraph - Python interface for igraph.
  • Learning-to-Rank & Recommender Systems

    • Others

      • LightFM - A Python implementation of LightFM, a hybrid recommendation algorithm.
      • Spotlight - Deep recommender models using PyTorch.
      • Surprise - A Python scikit for building and analyzing recommender systems.
      • RecBole - A unified, comprehensive and efficient recommendation library. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • allRank - allRank is a framework for training learning-to-rank neural models based on PyTorch. <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • TensorFlow Recommenders - A library for building recommender system models using TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow"> <img height="20" src="img/keras_big.png" alt="Keras compatible">
      • TensorFlow Ranking - Learning to Rank in TensorFlow. <img height="20" src="img/tf_big2.png" alt="TensorFlow">
  • Machine Learning

    • Ensemble Methods

      • ML-Ensemble - High performance ensemble learning. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • Stacking - Simple and useful stacking library written in Python. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • stacked_generalization - Library for machine learning stacking generalization. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • vecstack - Python package for stacking (machine learning technique). <img height="20" src="img/sklearn_big.png" alt="sklearn">
    • General Purpose Machine Learning

      • scikit-learn - Machine learning in Python. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • PyCaret - An open-source, low-code machine learning library in Python. <img height="20" src="img/R_big.png" alt="R inspired lib">
      • Shogun - Machine learning toolbox.
      • xLearn - High Performance, Easy-to-use, and Scalable Machine Learning Package.
      • cuML - RAPIDS Machine Learning Library. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/gpu_big.png" alt="GPU accelerated">
      • Sparkit-learn - PySpark + scikit-learn = Sparkit-learn. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/spark_big.png" alt="Apache Spark based">
      • mlpack - A scalable C++ machine learning library (Python bindings).
      • dlib - Toolkit for making real-world machine learning and data analysis applications in C++ (Python bindings).
      • MLxtend - Extension and helper modules for Python's data analysis and machine learning libraries. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • hyperlearn - A rewrite of sklearn and statsmodels that is 50%+ faster, uses 50%+ less RAM, and supports GPUs. <img height="20" src="img/sklearn_big.png" alt="sklearn"> <img height="20" src="img/pytorch_big2.png" alt="PyTorch based/compatible">
      • Reproducible Experiment Platform (REP) - Machine Learning toolbox for Humans. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • scikit-multilearn - Multi-label classification for python. <img height="20" src="img/sklearn_big.png" alt="sklearn">
      • seqlearn - Sequence classification toolkit for Python. <img height="20" src="img/sklearn_big.png" alt="sklearn">