Ecosyste.ms: Awesome

An open API service indexing awesome lists of open source software.

awesome-ai-infrastructure

A curated list of awesome tools, frameworks, platforms, and resources for building scalable and efficient AI infrastructure, including distributed training, model serving, MLOps, and deployment.
https://github.com/awesomelistsio/awesome-ai-infrastructure


  • Distributed Training

    • Horovod - A distributed deep learning training framework for TensorFlow, Keras, and PyTorch.
    • MPI for Machine Learning - Using the Message Passing Interface (MPI) standard for distributed machine learning.
    • Ray - A framework for building scalable distributed applications, including distributed AI and reinforcement learning.
    • DeepSpeed - A deep learning optimization library that makes distributed training easy and efficient.
  • Model Serving and Deployment

    • TorchServe - A model serving framework for PyTorch, providing fast and efficient model deployment.
    • ONNX Runtime - A cross-platform, high-performance scoring engine for serving ONNX models.
    • Seldon Core - An open-source platform for deploying and monitoring machine learning models on Kubernetes.
    • KServe (formerly KFServing) - A Kubernetes-based model serving platform that originated in the Kubeflow project.
    • TensorFlow Serving - A flexible, high-performance serving system for machine learning models.
  • MLOps and Automation

    • Kubeflow - A platform for orchestrating machine learning workflows on Kubernetes.
    • DVC (Data Version Control) - A tool for version control and reproducibility in machine learning projects.
    • Airflow - A platform for orchestrating complex workflows, commonly used in machine learning pipelines.
    • Metaflow - A human-centric framework for building and managing real-life data science projects, developed by Netflix.
    • MLflow - An open-source platform for managing the end-to-end machine learning lifecycle.
  • Data Management

    • Delta Lake - An open-source storage layer that brings reliability to data lakes.
    • Apache Hudi - A data management framework that simplifies incremental data processing and streaming analytics.
    • Feast - An open-source feature store for managing and serving machine learning features.
    • lakeFS - An open-source data versioning platform for managing data lakes.
    • Great Expectations - A tool for data validation and testing in machine learning workflows.
  • Optimization Tools

    • NVIDIA TensorRT - A high-performance deep learning inference optimizer and runtime.
    • Apache TVM - A deep learning compiler stack for optimizing models on various hardware backends.
    • Intel OpenVINO - A toolkit for optimizing and deploying AI inference on Intel hardware.
    • OctoML - An AI model optimization platform for efficient deployment on edge and cloud.
    • Quantization-Aware Training (QAT) - Techniques and tools that simulate low-precision arithmetic during training so that models keep their accuracy after quantization.
  • Infrastructure as Code

    • Terraform - A tool for building, changing, and versioning infrastructure safely and efficiently.
    • Pulumi - Infrastructure as code for deploying and managing cloud infrastructure using programming languages.
    • Ansible - An open-source automation tool for provisioning and managing infrastructure.
    • AWS CloudFormation - A service for automating AWS resource deployment and management.
    • Google Deployment Manager - An infrastructure management tool for Google Cloud Platform.
  • Learning Resources

  • Cloud Platforms

    • Paperspace Gradient - A cloud platform for developing, training, and deploying machine learning models.
    • AWS SageMaker - A comprehensive platform for building, training, and deploying machine learning models on AWS.
  • Community