Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
awesome-ai-infrastructure
A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment.
https://github.com/awesomelistsio/awesome-ai-infrastructure
-
Distributed Training
- Horovod - A distributed deep learning training framework for TensorFlow, Keras, and PyTorch.
- MPI for Machine Learning - Using the Message Passing Interface (MPI) standard for distributed machine learning.
- Ray - A framework for building scalable distributed applications, including distributed AI and reinforcement learning.
- DeepSpeed - A deep learning optimization library that makes distributed training easy and efficient.
-
Model Serving and Deployment
- TorchServe - A model serving framework for PyTorch, providing fast and efficient model deployment.
- ONNX Runtime - A cross-platform, high-performance scoring engine for serving ONNX models.
- Seldon Core - An open-source platform for deploying and monitoring machine learning models on Kubernetes.
- KServe (formerly KFServing) - A Kubernetes-based model serving platform that originated in the Kubeflow project.
- TensorFlow Serving - A flexible, high-performance serving system for machine learning models.
-
MLOps and Automation
- Kubeflow - A platform for orchestrating machine learning workflows on Kubernetes.
- DVC (Data Version Control) - A tool for version control and reproducibility in machine learning projects.
- Airflow - A platform for orchestrating complex workflows, commonly used in machine learning pipelines.
- Metaflow - A human-centric framework for building and managing real-life data science projects, developed by Netflix.
- MLflow - An open-source platform for managing the end-to-end machine learning lifecycle.
-
Data Management
- Delta Lake - An open-source storage layer that brings reliability to data lakes.
- Apache Hudi - A data management framework that simplifies incremental data processing and streaming analytics.
- Feast - An open-source feature store for managing and serving machine learning features.
- LakeFS - An open-source data versioning platform for managing data lakes.
- Great Expectations - A tool for data validation and testing in machine learning workflows.
-
Optimization Tools
- NVIDIA TensorRT - A high-performance deep learning inference optimizer and runtime.
- Apache TVM - A deep learning compiler stack for optimizing models on various hardware backends.
- Intel OpenVINO - A toolkit for optimizing and deploying AI inference on Intel hardware.
- OctoML - An AI model optimization platform for efficient deployment on edge and cloud.
- Quantization-Aware Training (QAT) - Techniques and tooling that simulate quantization during training so models keep their accuracy at reduced precision.
-
Infrastructure as Code
- Terraform - A tool for building, changing, and versioning infrastructure safely and efficiently.
- Pulumi - Infrastructure as code for deploying and managing cloud infrastructure using programming languages.
- Ansible - An open-source automation tool for provisioning and managing infrastructure.
- AWS CloudFormation - A service for automating AWS resource deployment and management.
- Google Deployment Manager - An infrastructure management tool for Google Cloud Platform.
-
Learning Resources
- Google Cloud: ML Operations - Training resources on MLOps and model deployment.
- Coursera: MLOps Fundamentals - A course on MLOps best practices for machine learning projects.
- AWS SageMaker Workshops - Example projects and tutorials for using AWS SageMaker.
- Kubeflow Documentation - Official documentation and guides for using Kubeflow.
- PyTorch Distributed Training Guide - A tutorial on distributed training with PyTorch.
-
Cloud Platforms
- Paperspace Gradient - A cloud platform for developing, training, and deploying machine learning models.
- AWS SageMaker - A comprehensive platform for building, training, and deploying machine learning models on AWS.
-
Community
- MLOps Community - A global community focused on MLOps and AI infrastructure.
- Reddit: r/MachineLearning - A subreddit for discussions on machine learning infrastructure and tools.
- Kubeflow Slack - A Slack community for discussing Kubeflow and machine learning pipelines.
- Paperspace Forums - A community forum for discussing machine learning infrastructure and tools.
- GitHub: MLOps Repositories - A collection of open-source MLOps projects on GitHub.