awesome-python-data-science

A curated list of Python libraries used for data science.
https://github.com/thomasjpfan/awesome-python-data-science

  • Machine Learning Frameworks

    • XGBoost - Scalable, Portable and Distributed Gradient Boosting.
    • scikit-learn - Machine learning.
    • CatBoost - Gradient boosting library with categorical features support.
    • LightGBM - Fast, distributed, high performance gradient boosting.
    • PyMC - Probabilistic Programming.
    • statsmodels - Statistical modeling and econometrics.
    • SymPy - A computer algebra system.
    • dask-ml - Distributed and parallel machine learning.
    • imbalanced-learn - Perform under sampling and over sampling.
    • lightning - Large-scale linear models.
    • scikit-optimize - Sequential model-based optimization with a `scipy.optimize` interface.
    • BayesianOptimization - Global optimization with gaussian processes.
    • gplearn - Genetic Programming.
    • python-glmnet - glmnet package for fitting generalized linear models.
    • hmmlearn - Hidden Markov Models.
    • vecstack - stacking (machine learning technique).
    • deap - Evolutionary computation framework.
    • pyro - Deep universal probabilistic programming with PyTorch.
    • civisml-extensions - scikit-learn-compatible estimators from Civis Analytics.
    • hyperopt-sklearn - Hyper-parameter optimization for sklearn.
    • scikit-survival - Survival analysis built on top of scikit-learn.
    • dstoolbox - Tools that make working with scikit-learn and pandas easier.
    • modin - Unify the way you interact with your data.
    • pyomo - Python Optimization MOdels.
    • BAMBI - BAyesian Model-Building Interface.
    • combo - A Python Toolbox for Machine Learning Model Combination.
    • fastai - The fast.ai deep learning library, lessons, and tutorials.
    • pycaret - Low-code machine learning library in Python.
    • river - River is a Python library for online machine learning.
  • Scientific

    • Pandas - A library providing high-performance, easy-to-use data structures and data analysis tools.
    • Numba - NumPy aware dynamic Python compiler using LLVM.
    • or-tools - Google's Operations Research tools. Classical CS algorithms.
    • cvxpy - Python-embedded modeling language for convex optimization problems.
    • dask - Parallel computing with task scheduling.
    • blaze - NumPy and Pandas for databases.
    • Biopython - Tools for computational biology and bioinformatics.
    • PyDy - Multibody Dynamics.
    • nilearn - NeuroImaging.
    • patsy - Describing statistical models using symbolic formulas.
    • numexpr - Fast numerical array expression evaluator.
  • Deep Learning Tools

    • lightly - Lightly is a computer vision framework for self-supervised learning.
    • TorchDrift - TorchDrift is a data and concept drift library for PyTorch.
    • Edward - Probabilistic programming language in TensorFlow.
    • pomegranate - Probabilistic modelling.
    • skorch - scikit-learn compatible neural network library that wraps PyTorch.
    • DLTK - Deep Learning Toolkit for Medical Image Analysis.
    • sonnet - TensorFlow-based neural network library.
    • rasa_core - Dialogue engine.
    • luminoth - Computer Vision.
    • allennlp - NLP Research library.
    • spotlight - Pytorch Recommender framework.
    • tensorforce - TensorFlow library for applied reinforcement learning.
    • keras-vis - Neural network visualization toolkit for keras.
    • hyperas - Keras + Hyperopt.
    • tensorboard_logger - Log TensorBoard events without touching TensorFlow.
    • foolbox - Python toolbox to create adversarial examples that fool neural networks.
    • pytorch/vision - Datasets, Transforms and Models specific to Computer Vision.
    • gluon-nlp - NLP made easy.
    • pytorch/ignite - High-level library to help with training neural networks in PyTorch.
    • Netron - Visualizer for deep learning and machine learning models.
    • gpytorch - A highly efficient and modular implementation of Gaussian Processes in PyTorch.
    • tensorly - Tensor Learning in Python.
    • einops - Deep learning operations reinvented.
    • hiddenlayer - Neural network graphs and training metrics for PyTorch, Tensorflow, and Keras.
    • segmentation_models.pytorch - Segmentation models with pretrained backbones.
    • pytorch-lightning - The lightweight PyTorch wrapper.
  • Visualization

    • PyGWalker - Turns pandas and polars dataframes into a Tableau-like user interface for visual exploration.
    • matplotlib-venn - Area-weighted venn-diagrams.
    • pyLDAvis - Interactive topic model visualization.
    • cufflinks - Productivity Tools for Plotly + Pandas.
    • scatterText - Visualizations of how language differs among document types.
    • plotnine - ggplot for python.
    • mizani - scales package.
    • PtitPrince - Raincloud plots.
    • joypy - Ridgeline plots.
    • dtreeviz - Decision tree visualization and model interpretation.
    • ipyvolume - 3d plotting for Python in the Jupyter notebook based on IPython widgets using WebGL.
    • Great Tables - Absolutely Delightful Table-making in Python.
    • diagrams - Diagrams lets you draw the cloud system architecture in Python code.
    • bokeh - Interactive web plotting.
    • dash - Interactive Web plotting.
    • altair - Declarative statistical visualization.
    • folium - Leaflet.js Maps.
    • geoplot - High-level geospatial data visualization.
    • mplleaflet - Matplotlib plots from Python into interactive Leaflet web maps.
  • Exploration

    • fitter - Simple class to identify the distribution from which a data sample was generated.
    • Dora - Exploratory data analysis.
    • mlxtend - A library of extension and helper modules for Python's data analysis and machine learning libraries.
    • yellowbrick - Visual analysis and diagnostic tools.
    • pandas-profiling - Profiling reports for pandas DataFrame objects.
    • sklearn-evaluation - scikit-learn model evaluation.
    • missingno - Missing data visualization.
    • hypertools - Gaining geometric insights into high-dimensional data.
    • scikit-plot - Plotting functionality to scikit-learn objects.
    • elih - Explain Machine Learning.
    • kmeans_smote - Oversampling for imbalanced learning based on k-means and SMOTE.
    • pyUpSet - UpSet suite of visualisation methods.
    • lime - Explaining the predictions of any machine learning classifier.
    • SauceCat/PDPbox - Partial dependence plot toolbox.
    • shap - A unified approach to explain the output of any machine learning model.
    • eli5 - Debug machine learning classifiers and explain their predictions.
    • rfpimp - Permutation and drop-column importance for scikit-learn random forests.
    • pypeln - Concurrent data pipelines made easy.
    • pycm - Multi-class confusion matrix library in Python.
    • great_expectations - Always know what to expect from your data.
    • alibi - Algorithms for monitoring and explaining machine learning models.
    • InterpretML - Fit interpretable models. Explain blackbox machine learning.
    • cleanlab - Finding label errors in datasets and learning with noisy labels.
    • dtale - Flask/React client for visualizing pandas data structures.
    • dabl - Data Analysis Baseline Library.
    • XAI - An eXplainability toolbox for machine learning.
    • explainerdashboard - This package makes it convenient to quickly deploy a dashboard web app that explains the workings of a (scikit-learn compatible) machine learning model.
    • alibi-detect - Open source Python library focused on outlier, adversarial and drift detection. The package aims to cover both online and offline detectors for tabular data, text, images and time series.
  • Feature Extraction

    • General Feature Extraction

      • dirty_cat - Encoding methods for dirty categorical variables.
      • sklearn-pandas - Pandas integration with sklearn.
      • pdpipe - Easy pipelines for pandas DataFrames.
      • datacleaner - Tool that automatically cleans data sets and readies them for analysis.
      • categorical-encoding - sklearn compatible categorical variable encoders.
      • fancyimpute - Multivariate imputation and matrix completion algorithms.
      • raccoon - DataFrame with fast insert and appends.
      • kmodes - k-modes and k-prototypes clustering algorithm.
      • annoy - Approximate Nearest Neighbors.
      • scikit-feature - Filter methods for feature selection.
      • mifs - Parallelized Mutual Information based Feature Selection module.
      • skggm - Scikit-learn compatible estimation of general graphical models.
      • Impyute - Data imputations library to preprocess datasets with missing data.
      • eif - Extended Isolation Forest for Anomaly Detection.
      • featexp - Feature exploration for supervised learning.
      • feature_engine - Feature engineering package with sklearn like functionality.
      • stumpy - STUMPY is a powerful and scalable Python library that can be used for a variety of time series data mining tasks.
      • n2 - Lightweight approximate Nearest Neighbor library which runs faster even with large datasets.
      • compressio - Compressio provides lossless in-memory compression of pandas DataFrames and Series.
    • Images and Video

      • SimpleCV - Wrapper around OpenCV.
      • pillow - PIL fork.
      • hmap - Image histogram remapping.
      • pyocr - A wrapper for Tesseract and Cuneiform (Optical Character Recognition).
      • scikit-video - Video processing.
      • OpenCV - Open Source Computer Vision Library.
      • label-maker - Data Preparation for Satellite Machine Learning.
      • face_recognition - Facial recognition.
      • imgaug - Image augmentation.
      • pyvips - Fast image processing.
      • ImageHash - Image hashing.
      • Augmentor - Image augmentation library.
      • PyAV - Bindings for FFmpeg.
      • imutils - Convenience functions to make basic image processing operations.
    • Text/NLP

      • preprocessing - Simple interface for the CMU Pronouncing Dictionary.
      • unidecode - ASCII transliterations of Unicode text.
      • pytorch/text - Data loaders and abstractions for text and NLP.
      • sent2vec - General purpose unsupervised sentence representations.
      • pyhunspell - Python bindings for the Hunspell spellchecker engine.
      • facebook/fastText - Library for fast text representation and classification.
      • textblob - Simple, Pythonic, text processing--Sentiment analysis, part-of-speech tagging, noun phrase extraction, translation, and more.
      • facebook/InferSent - Sentence embeddings (InferSent) and training code for NLI.
      • nmslib - Non-Metric Space Library.
      • ftfy - Fixes mojibake and other glitches in Unicode text, after the fact.
      • fletcher - Pandas ExtensionDType/Array backed by Apache Arrow.
      • textacy - NLP, before and after spaCy.
      • hmtl - Hierarchical Multi-Task Learning - A State-of-the-Art neural network model for several NLP tasks based on PyTorch and AllenNLP.
      • pytext - A natural language modeling framework based on PyTorch.
      • flair - A very simple framework for state-of-the-art Natural Language Processing.
      • LASER - Language-Agnostic SEntence Representations.
      • transformer-xl - Attentive Language Models Beyond a Fixed-Length Context.
      • Fuzzy - Soundex, NYSIIS, Double Metaphone.
      • BlingFire - A lightning fast Finite State machine and REgular expression manipulation library.
      • wordfreq - Library for looking up the frequencies of words in many languages, based on many sources of data.
      • BERT-pytorch - Google AI 2018 BERT pytorch implementation.
      • gensim - Topic Modeling.
      • pattern - Web mining module.
      • probablepeople - Parsing unstructured western names into name components.
      • Expynent - Regular expression patterns.
      • mimesis - Generate synthetic data.
      • pyenchant - Spell checking.
      • parserator - Domain-specific probabilistic parsers.
      • scrubadub - Clean personally identifiable information from dirty dirty text.
      • usaddress - Parsing unstructured address strings into address components.
      • python-phonenumbers - Python port of Google's libphonenumber.
      • jellyfish - Approximate and phonetic matching of strings.
      • langid - Stand-alone language identification system.
      • fuzzywuzzy - Fuzzy String Matching.
      • snowball - Snowball compiler and stemming algorithms.
      • leven - Levenshtein edit distance.
      • flashtext - Extract Keywords from sentence or Replace keywords in sentences.
      • polyglot - Multilingual text NLP processing toolkit.
      • sentencepiece - Unsupervised text tokenizer for Neural Network-based text generation.
      • pyfasttext - Binding for fastText.
      • python-wordsegment - English word segmentation.
      • pyahocorasick - Exact or approximate multi-pattern string search.
      • Wordbatch - Parallel text feature extraction for machine learning.
      • langdetect - Port of Google's language-detection library.
      • translation - Uses web services for text translation.
      • textstat - Calculate readability statistics of a text object - paragraphs, sentences, articles.
      • nlpaug - Augmenting nlp for your machine learning projects.
      • sumy - Automatic summarization of text documents and HTML pages.
      • textract - Extract text from any document.
      • newspaper - News extraction, article extraction and content curation.
    • Time Series

      • luminaire - ML driven solutions for monitoring time series data.
      • Greykite - A flexible, intuitive and fast forecasting library.
      • Causality - Causal analysis.
      • traces - Unevenly-spaced time series analysis.
      • PyFlux - Time series library for Python.
      • prophet - Tool for producing high quality forecasts.
      • tsfresh - Automatic extraction of relevant features from time series.
      • tslearn - Machine learning toolkit dedicated to time-series data.
      • pyts - A Python package for time series transformation and classification.
      • sktime - A scikit-learn compatible Python toolbox for learning with time series data.
      • Merlion - A Machine Learning Library for Time Series.
      • Darts - darts is a Python library for easy manipulation and forecasting of time series.
      • NeuralProphet - A Neural Network based Time-Series model, inspired by Facebook Prophet and AR-Net, built on PyTorch.
    • Audio

      • python_speech_features - Speech features.
      • speechpy - A Library for Speech Processing and Recognition.
      • magenta - Music and Art Generation with Machine Intelligence.
      • librosa - Audio and music analysis.
      • pydub - Manipulate audio with a simple and easy high level interface.
      • pytorch/audio - simple audio I/O for pytorch.
    • Geolocation

    • Ranking/Recommender

      • recommenders - Examples and best practices for building recommendation systems.
      • Surprise - Analyzing recommender systems.
      • trueskill - TrueSkill rating system.
      • LightFM - Hybrid recommendation algorithm.
      • implicit - Collaborative Filtering for Implicit Datasets.
  • Profiling

      • memory_profiler - monitoring memory usage of a python program.
      • mem_usage_ui - Measuring and graphing memory usage of local processes.
      • viztracer - VizTracer is a low-overhead logging/debugging/profiling tool that can trace and visualize your python code execution.
      • py-spy - Sampling profiler for Python programs.
      • line_profiler - Line-by-line profiling.
      • filprofiler - Fil, a memory profiler designed for data processing applications.
      • scalene - High-performance CPU and memory profiler for Python.
      • python-flamegraph - Statistical profiler which outputs in format suitable for FlameGraph.
  • Python Tools

      • devpi - PyPI server and packaging/testing/release tool.
      • sacred - Reproduce computational experiments.
      • Typer - Build CLIs with type hints.
      • hydra - Framework for elegantly configuring complex applications.
      • neurtu - A Python package for parametric benchmarks.
      • pyprojroot - Finding project directories in Python.
      • datasette - An open source multi-tool for exploring and publishing data.
      • delorean - Time Travel Made Easy.
      • pip-tools - Keeps dependencies up to date.
      • click - CLI package.
      • sacredboard - Dashboard for sacred.
      • magic-wormhole - get things from one computer to another, safely.
  • Outlier Detection

    • PyOD - Versatile Python library for detecting anomalies in multivariate data.
    • DeepOD - Deep learning-based outlier/anomaly detection.
  • Deep Learning Frameworks

    • Tensorflow - DL Framework.
    • PyTorch - DL Framework.
    • mxnet - Apache MXNet: A flexible and efficient library for deep learning.
    • tensorlayer - A Deep Learning and Reinforcement Learning Library for Researchers and Engineers.
  • AutoML

    • Nevergrad - Gradient-free optimization.
    • featuretools - Automated feature engineering.
    • auto-sklearn - Automated machine learning.
    • tpot - Automated machine learning.
    • auto_ml - Automated machine learning.
    • MLBox - Automated Machine Learning python library.
    • devol - Automated deep neural network design via genetic programming.
    • skll - SciKit-Learn Laboratory (SKLL) makes it easy to run machine learning experiments.
    • autokeras - Automated machine learning in Keras.
    • SMAC3 - Sequential Model-based Algorithm Configuration.
  • Deep Learning Projects

    • fairseq - Sequence-to-Sequence Toolkit.
    • tensorflow-wavenet - DeepMind's WaveNet.
    • DeepRecommender - Recommender systems.
    • DrQA - Reading Wikipedia to Answer Open-Domain Questions.
    • vqa.pytorch - Visual Question Answering in Pytorch.
    • Half-Life Regression - Model for spaced repetition practice.
    • learning-to-learn - Learning to Learn in Tensorflow.
    • capsule-networks - A PyTorch implementation of the NIPS 2017 paper "Dynamic Routing Between Capsules".
    • Mask_RCNN - Mask R-CNN for object detection and instance segmentation on Keras and TensorFlow.
    • lightnet - Bringing pjreddie's DarkNet out of the shadows.
    • pytorch-openai-transformer-lm - OpenAI's finetuned transformer language model with a script to import the weights pre-trained by OpenAI.
    • maskrcnn-benchmark - Fast, modular reference implementation of Semantic Segmentation and Object Detection algorithm in PyTorch.
    • LovaszSoftmax - Lovász-Softmax loss.
    • ludwig - Ludwig is a toolbox built on top of TensorFlow for training and testing deep learning models without writing code.
  • Trading

      • Clairvoyant - Identify and monitor social/historical cues.
      • zipline - Algorithmic Trading Library.
      • qstrader - Advanced Trading Infrastructure.
  • Misc

      • mmh3 - MurmurHash3, a set of fast and robust hash functions.
      • fbpca - Fast Randomized PCA/SVD.
      • pipeline - Standard Runtime For Every Real-Time Machine Learning.
      • crayon - A language-agnostic interface to TensorBoard.
      • faiss - A library for efficient similarity search and clustering of dense vectors.
  • Deployment

      • evidently - Evidently helps evaluate machine learning models during validation and monitor them in production.
      • onnx - Open Neural Network Exchange.
      • lore - Lore makes machine learning approachable for Software Engineers and maintainable for Machine Learning Researchers.
      • kubeflow - Machine Learning Toolkit for Kubernetes.
      • airflow - ETL.
      • mlflow - Open source platform for the complete machine learning lifecycle.
      • sklearn-porter - Transpile trained scikit-learn estimators.
      • sklearn-compiledtrees - Compiled Decision Trees for scikit-learn.
  • Data Gathering

      • gain - Web crawling framework based on asyncio.
      • MechanicalSoup - A Python library for automating interaction with websites.
      • Pandarallel - Parallel pandas.
      • parse - Parse strings using a specification based on the Python format() syntax.
      • CleverCSV - CleverCSV is a Python package for handling messy CSV files.