An open API service indexing awesome lists of open source software.

https://github.com/awesomelistsio/awesome-ai-infrastructure

A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment.
https://github.com/awesomelistsio/awesome-ai-infrastructure

List: awesome-ai-infrastructure

ai ai-infra ai-infrastructure awesome awesome-list awesome-lists

Last synced: 4 months ago
JSON representation

A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment.

Awesome Lists containing this project

README

        

# Awesome AI Infrastructure [![Awesome Lists](https://srv-cdn.himpfen.io/badges/awesome-lists/awesomelists-flat.svg)](https://github.com/awesomelistsio/awesome)

[![Buy Me A Coffee](https://srv-cdn.himpfen.io/badges/buymeacoffee/buymeacoffee-flat.svg)](https://tinyurl.com/2h9aktmd)   [![Ko-Fi](https://srv-cdn.himpfen.io/badges/kofi/kofi-flat.svg)](https://tinyurl.com/d4xnrptz)   [![PayPal](https://srv-cdn.himpfen.io/badges/paypal/paypal-flat.svg)](https://tinyurl.com/mr22naua)   [![Stripe](https://srv-cdn.himpfen.io/badges/stripe/stripe-flat.svg)](https://tinyurl.com/e8ymxdw3)

> A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment.

## Contents

- [Distributed Training](#distributed-training)
- [Model Serving and Deployment](#model-serving-and-deployment)
- [MLOps and Automation](#mlops-and-automation)
- [Data Management](#data-management)
- [Optimization Tools](#optimization-tools)
- [Infrastructure as Code](#infrastructure-as-code)
- [Cloud Platforms](#cloud-platforms)
- [Learning Resources](#learning-resources)
- [Books](#books)
- [Community](#community)
- [Contribute](#contribute)
- [License](#license)

## Distributed Training

- [Horovod](https://horovod.ai/) - A distributed deep learning training framework for TensorFlow, Keras, and PyTorch.
- [Ray](https://www.ray.io/) - A framework for building scalable distributed applications, including distributed AI and reinforcement learning.
- [PyTorch Distributed](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) - Tools and libraries for distributed training in PyTorch.
- [DeepSpeed](https://www.deepspeed.ai/) - A deep learning optimization library that makes distributed training easy and efficient.
- [MPI for Machine Learning](https://www.mpich.org/) - Using the Message Passing Interface (MPI) standard for distributed machine learning.

## Model Serving and Deployment

- [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving) - A flexible, high-performance serving system for machine learning models.
- [TorchServe](https://pytorch.org/serve/) - A model serving framework for PyTorch, providing fast and efficient model deployment.
- [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) - A scalable model serving platform supporting multiple frameworks.
- [ONNX Runtime](https://onnxruntime.ai/) - A cross-platform, high-performance scoring engine for serving ONNX models.
- [Seldon Core](https://www.seldon.io/tech/products/core/) - An open-source platform for deploying and monitoring machine learning models on Kubernetes.
- [KFServing (KServe)](https://kserve.github.io/website/) - A Kubernetes-based model serving solution as part of the Kubeflow project.

## MLOps and Automation

- [MLflow](https://mlflow.org/) - An open-source platform for managing the end-to-end machine learning lifecycle.
- [Kubeflow](https://www.kubeflow.org/) - A platform for orchestrating machine learning workflows on Kubernetes.
- [DVC (Data Version Control)](https://dvc.org/) - A tool for version control and reproducibility in machine learning projects.
- [ZenML](https://zenml.io/) - An extensible MLOps framework for creating portable, production-ready machine learning pipelines.
- [Airflow](https://airflow.apache.org/) - A platform for orchestrating complex workflows, commonly used in machine learning pipelines.
- [Metaflow](https://metaflow.org/) - A human-centric framework for building and managing real-life data science projects, developed by Netflix.

## Data Management

- [Delta Lake](https://delta.io/) - An open-source storage layer that brings reliability to data lakes.
- [Apache Hudi](https://hudi.apache.org/) - A data management framework that simplifies incremental data processing and streaming analytics.
- [Feast](https://feast.dev/) - An open-source feature store for managing and serving machine learning features.
- [Great Expectations](https://greatexpectations.io/) - A tool for data validation and testing in machine learning workflows.
- [LakeFS](https://lakefs.io/) - An open-source data versioning platform for managing data lakes.

## Optimization Tools

- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) - A high-performance deep learning inference optimizer and runtime.
- [Apache TVM](https://tvm.apache.org/) - A deep learning compiler stack for optimizing models on various hardware backends.
- [Intel OpenVINO](https://docs.openvino.ai/latest/index.html) - A toolkit for optimizing and deploying AI inference on Intel hardware.
- [OctoML](https://octoml.ai/) - An AI model optimization platform for efficient deployment on edge and cloud.
- [Quantization Aware Training (QAT)](https://www.tensorflow.org/model_optimization) - Tools for optimizing model performance through quantization.

## Infrastructure as Code

- [Terraform](https://www.terraform.io/) - A tool for building, changing, and versioning infrastructure safely and efficiently.
- [Pulumi](https://www.pulumi.com/) - Infrastructure as code for deploying and managing cloud infrastructure using programming languages.
- [Ansible](https://www.ansible.com/) - An open-source automation tool for provisioning and managing infrastructure.
- [AWS CloudFormation](https://aws.amazon.com/cloudformation/) - A service for automating AWS resource deployment and management.
- [Google Deployment Manager](https://cloud.google.com/deployment-manager) - An infrastructure management tool for Google Cloud Platform.

## Cloud Platforms

- [AWS SageMaker](https://aws.amazon.com/sagemaker/) - A comprehensive platform for building, training, and deploying machine learning models on AWS.
- [Google AI Platform](https://cloud.google.com/ai-platform) - Google Cloud’s integrated environment for AI development and deployment.
- [Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning/) - A cloud-based platform for training, deploying, and managing machine learning models.
- [IBM Watson Studio](https://www.ibm.com/cloud/watson-studio) - A suite of tools for data science, machine learning, and AI model development.
- [Paperspace Gradient](https://gradient.paperspace.com/) - A cloud platform for developing, training, and deploying machine learning models.

## Learning Resources

- [Coursera: MLOps Fundamentals](https://www.coursera.org/specializations/mlops-fundamentals) - A course on MLOps best practices for machine learning projects.
- [Google Cloud: ML Operations](https://cloud.google.com/learn/training/ai-ml-operations) - Training resources on MLOps and model deployment.
- [AWS SageMaker Workshops](https://sagemaker-examples.readthedocs.io/en/latest/) - Example projects and tutorials for using AWS SageMaker.
- [Kubeflow Documentation](https://www.kubeflow.org/docs/) - Official documentation and guides for using Kubeflow.
- [PyTorch Distributed Training Guide](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) - A tutorial on distributed training with PyTorch.

## Books

- *Machine Learning Engineering* by Andriy Burkov - A book on building scalable machine learning infrastructure.
- *Building Machine Learning Powered Applications* by Emmanuel Ameisen - A guide to building robust ML applications in production.
- *Designing Data-Intensive Applications* by Martin Kleppmann - A comprehensive guide to building scalable and reliable data systems.
- *MLOps: Data Science in Production* by Mark Treveil and The Dotscience Team - A book on best practices for MLOps and model deployment.
- *Reliable Machine Learning* by Cathy Chen - A book on creating resilient machine learning infrastructure.

## Community

- [MLOps Community](https://mlops.community/) - A global community focused on MLOps and AI infrastructure.
- [Reddit: r/MachineLearning](https://www.reddit.com/r/MachineLearning/) - A subreddit for discussions on machine learning infrastructure and tools.
- [Kubeflow Slack](https://kubeflow.slack.com/) - A Slack community for discussing Kubeflow and machine learning pipelines.
- [Paperspace Forums](https://forums.paperspace.com/) - A community forum for discussing machine learning infrastructure and tools.
- [GitHub: MLOps Repositories](https://github.com/topics/mlops) - A collection of open-source MLOps projects on GitHub.

## Contribute

Contributions are welcome!

## License

[![CC0](https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/by-sa.svg)](http://creativecommons.org/licenses/by-sa/4.0/)