https://github.com/awesomelistsio/awesome-ai-infrastructure

A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment.
https://github.com/awesomelistsio/awesome-ai-infrastructure

List: awesome-ai-infrastructure

ai ai-infra ai-infrastructure awesome awesome-list awesome-lists

Last synced: 5 months ago
JSON representation

A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment.

Host: GitHub
URL: https://github.com/awesomelistsio/awesome-ai-infrastructure
Owner: awesomelistsio
Created: 2024-11-17T23:33:12.000Z (8 months ago)
Default Branch: main
Last Pushed: 2024-11-17T23:33:56.000Z (8 months ago)
Last Synced: 2025-01-26T22:01:21.734Z (6 months ago)
Topics: ai, ai-infra, ai-infrastructure, awesome, awesome-list, awesome-lists
Language: Python
Homepage: https://www.awesomelists.xyz/
Size: 4.88 KB
Stars: 3
Watchers: 1
Forks: 0
Open Issues: 0
Metadata Files:
- Readme: README.md

Awesome Lists containing this project

awesome-mlops - Awesome AI Infrastructure
ultimate-awesome - awesome-ai-infrastructure - A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment. (Other Lists / TeX Lists)

README

        # Awesome AI Infrastructure [![Awesome Lists](https://srv-cdn.himpfen.io/badges/awesome-lists/awesomelists-flat.svg)](https://github.com/awesomelistsio/awesome)

[![Buy Me A Coffee](https://srv-cdn.himpfen.io/badges/buymeacoffee/buymeacoffee-flat.svg)](https://tinyurl.com/2h9aktmd)   [![Ko-Fi](https://srv-cdn.himpfen.io/badges/kofi/kofi-flat.svg)](https://tinyurl.com/d4xnrptz)   [![PayPal](https://srv-cdn.himpfen.io/badges/paypal/paypal-flat.svg)](https://tinyurl.com/mr22naua)   [![Stripe](https://srv-cdn.himpfen.io/badges/stripe/stripe-flat.svg)](https://tinyurl.com/e8ymxdw3)

> A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment.

## Contents

- [Distributed Training](#distributed-training)

- [Model Serving and Deployment](#model-serving-and-deployment)

- [MLOps and Automation](#mlops-and-automation)

- [Data Management](#data-management)

- [Optimization Tools](#optimization-tools)

- [Infrastructure as Code](#infrastructure-as-code)

- [Cloud Platforms](#cloud-platforms)

- [Learning Resources](#learning-resources)

- [Books](#books)

- [Community](#community)

- [Contribute](#contribute)

- [License](#license)

## Distributed Training

- [Horovod](https://horovod.ai/) - A distributed deep learning training framework for TensorFlow, Keras, and PyTorch.

- [Ray](https://www.ray.io/) - A framework for building scalable distributed applications, including distributed AI and reinforcement learning.

- [PyTorch Distributed](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) - Tools and libraries for distributed training in PyTorch.

- [DeepSpeed](https://www.deepspeed.ai/) - A deep learning optimization library that makes distributed training easy and efficient.

- [MPI for Machine Learning](https://www.mpich.org/) - Using the Message Passing Interface (MPI) standard for distributed machine learning.

## Model Serving and Deployment

- [TensorFlow Serving](https://www.tensorflow.org/tfx/guide/serving) - A flexible, high-performance serving system for machine learning models.

- [TorchServe](https://pytorch.org/serve/) - A model serving framework for PyTorch, providing fast and efficient model deployment.

- [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server) - A scalable model serving platform supporting multiple frameworks.

- [ONNX Runtime](https://onnxruntime.ai/) - A cross-platform, high-performance scoring engine for serving ONNX models.

- [Seldon Core](https://www.seldon.io/tech/products/core/) - An open-source platform for deploying and monitoring machine learning models on Kubernetes.

- [KFServing (KServe)](https://kserve.github.io/website/) - A Kubernetes-based model serving solution as part of the Kubeflow project.

## MLOps and Automation

- [MLflow](https://mlflow.org/) - An open-source platform for managing the end-to-end machine learning lifecycle.

- [Kubeflow](https://www.kubeflow.org/) - A platform for orchestrating machine learning workflows on Kubernetes.

- [DVC (Data Version Control)](https://dvc.org/) - A tool for version control and reproducibility in machine learning projects.

- [ZenML](https://zenml.io/) - An extensible MLOps framework for creating portable, production-ready machine learning pipelines.

- [Airflow](https://airflow.apache.org/) - A platform for orchestrating complex workflows, commonly used in machine learning pipelines.

- [Metaflow](https://metaflow.org/) - A human-centric framework for building and managing real-life data science projects, developed by Netflix.

## Data Management

- [Delta Lake](https://delta.io/) - An open-source storage layer that brings reliability to data lakes.

- [Apache Hudi](https://hudi.apache.org/) - A data management framework that simplifies incremental data processing and streaming analytics.

- [Feast](https://feast.dev/) - An open-source feature store for managing and serving machine learning features.

- [Great Expectations](https://greatexpectations.io/) - A tool for data validation and testing in machine learning workflows.

- [LakeFS](https://lakefs.io/) - An open-source data versioning platform for managing data lakes.

## Optimization Tools

- [NVIDIA TensorRT](https://developer.nvidia.com/tensorrt) - A high-performance deep learning inference optimizer and runtime.

- [Apache TVM](https://tvm.apache.org/) - A deep learning compiler stack for optimizing models on various hardware backends.

- [Intel OpenVINO](https://docs.openvino.ai/latest/index.html) - A toolkit for optimizing and deploying AI inference on Intel hardware.

- [OctoML](https://octoml.ai/) - An AI model optimization platform for efficient deployment on edge and cloud.

- [Quantization Aware Training (QAT)](https://www.tensorflow.org/model_optimization) - Tools for optimizing model performance through quantization.

## Infrastructure as Code

- [Terraform](https://www.terraform.io/) - A tool for building, changing, and versioning infrastructure safely and efficiently.

- [Pulumi](https://www.pulumi.com/) - Infrastructure as code for deploying and managing cloud infrastructure using programming languages.

- [Ansible](https://www.ansible.com/) - An open-source automation tool for provisioning and managing infrastructure.

- [AWS CloudFormation](https://aws.amazon.com/cloudformation/) - A service for automating AWS resource deployment and management.

- [Google Deployment Manager](https://cloud.google.com/deployment-manager) - An infrastructure management tool for Google Cloud Platform.

## Cloud Platforms

- [AWS SageMaker](https://aws.amazon.com/sagemaker/) - A comprehensive platform for building, training, and deploying machine learning models on AWS.

- [Google AI Platform](https://cloud.google.com/ai-platform) - Google Cloud’s integrated environment for AI development and deployment.

- [Azure Machine Learning](https://azure.microsoft.com/en-us/services/machine-learning/) - A cloud-based platform for training, deploying, and managing machine learning models.

- [IBM Watson Studio](https://www.ibm.com/cloud/watson-studio) - A suite of tools for data science, machine learning, and AI model development.

- [Paperspace Gradient](https://gradient.paperspace.com/) - A cloud platform for developing, training, and deploying machine learning models.

## Learning Resources

- [Coursera: MLOps Fundamentals](https://www.coursera.org/specializations/mlops-fundamentals) - A course on MLOps best practices for machine learning projects.

- [Google Cloud: ML Operations](https://cloud.google.com/learn/training/ai-ml-operations) - Training resources on MLOps and model deployment.

- [AWS SageMaker Workshops](https://sagemaker-examples.readthedocs.io/en/latest/) - Example projects and tutorials for using AWS SageMaker.

- [Kubeflow Documentation](https://www.kubeflow.org/docs/) - Official documentation and guides for using Kubeflow.

- [PyTorch Distributed Training Guide](https://pytorch.org/tutorials/intermediate/ddp_tutorial.html) - A tutorial on distributed training with PyTorch.

## Books

- *Machine Learning Engineering* by Andriy Burkov - A book on building scalable machine learning infrastructure.

- *Building Machine Learning Powered Applications* by Emmanuel Ameisen - A guide to building robust ML applications in production.

- *Designing Data-Intensive Applications* by Martin Kleppmann - A comprehensive guide to building scalable and reliable data systems.

- *MLOps: Data Science in Production* by Mark Treveil and The Dotscience Team - A book on best practices for MLOps and model deployment.

- *Reliable Machine Learning* by Cathy Chen - A book on creating resilient machine learning infrastructure.

## Community

- [MLOps Community](https://mlops.community/) - A global community focused on MLOps and AI infrastructure.

- [Reddit: r/MachineLearning](https://www.reddit.com/r/MachineLearning/) - A subreddit for discussions on machine learning infrastructure and tools.

- [Kubeflow Slack](https://kubeflow.slack.com/) - A Slack community for discussing Kubeflow and machine learning pipelines.

- [Paperspace Forums](https://forums.paperspace.com/) - A community forum for discussing machine learning infrastructure and tools.

- [GitHub: MLOps Repositories](https://github.com/topics/mlops) - A collection of open-source MLOps projects on GitHub.

## Contribute

Contributions are welcome!

## License

[![CC0](https://mirrors.creativecommons.org/presskit/buttons/88x31/svg/by-sa.svg)](http://creativecommons.org/licenses/by-sa/4.0/)

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/awesomelistsio/awesome-ai-infrastructure

Awesome Lists containing this project

README