https://github.com/skypilot-org/skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.

Host: GitHub
URL: https://github.com/skypilot-org/skypilot
Owner: skypilot-org
License: apache-2.0
Created: 2021-08-11T23:32:15.000Z (about 4 years ago)
Default Branch: master
Last Pushed: 2025-05-12T02:36:47.000Z (5 months ago)
Last Synced: 2025-05-12T02:43:20.669Z (5 months ago)
Topics: cloud-computing, cloud-management, cost-management, cost-optimization, data-science, deep-learning, distributed-training, finops, gpu, hyperparameter-tuning, job-queue, job-scheduler, llm-serving, llm-training, machine-learning, ml-infrastructure, ml-platform, multicloud, spot-instances, tpu
Language: Python
Homepage: https://docs.skypilot.co/
Size: 149 MB
Stars: 8,069
Watchers: 72
Forks: 645
Open Issues: 470
Metadata Files:
- Readme: README.md
- Contributing: CONTRIBUTING.md
- License: LICENSE

Awesome Lists containing this project

awesome - skypilot-org/skypilot - Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 17+ clouds, or on-prem). (Python)
Awesome_Multimodel_LLM - SkyPilot - Run LLMs and batch jobs on any cloud. Get maximum cost savings, highest GPU availability, and managed execution -- all with a simple interface. (Tools for deploying LLM)
awesome-production-machine-learning - SkyPilot - SkyPilot is a framework for running LLMs, AI, and batch jobs on any cloud, offering maximum cost savings, highest GPU availability, and managed execution. (Deployment and Serving)
awesome-ray - SkyPilot
Awesome-LLM - SkyPilot - Run LLMs and batch jobs on any cloud. Get maximum cost savings, highest GPU availability, and managed execution -- all with a simple interface. (LLM Deployment)
awesome-repositories - skypilot-org/skypilot - Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 17+ clouds, or on-prem). (Python)
awesome-starred - skypilot-org/skypilot - Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 17+ clouds, or on-prem). (Python)
awesome-starred - skypilot-org/skypilot - SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface. (Python)
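StarryDivineSky - skypilot-org/skypilot - 6x cost, with preemption auto-recovery; Optimizer: saves 2x in cost by automatically selecting the cheapest and most available infrastructure. SkyPilot supports your existing GPU, TPU, and CPU workloads with no code changes. (Other_Machine Learning and Deep Learning)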
awesome - skypilot-org/skypilot - SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface. (Python)
stars - skypilot-org/skypilot - SkyPilot: Run LLMs, AI, and Batch jobs on any cloud. Get maximum savings, highest GPU availability, and managed execution—all with a simple interface. (Python)
awesome-LLM-resources - SkyPilot
awesome-hacking-lists - skypilot-org/skypilot - SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 15+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface. (Python)
awesome - skypilot-org/skypilot - Run, manage, and scale AI workloads on any AI infrastructure. Use one system to access & manage all AI compute (Kubernetes, 17+ clouds, or on-prem). (Python)
AiTreasureBox - skypilot-org/skypilot - SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 16+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface. (Repos)

README

    Run AI on Any Infra — Unified, Faster, Cheaper



----

:fire: *News* :fire:

- [Apr 2025] Spin up **Qwen3** on your cluster/cloud: [**example**](./llm/qwen/)

- [Mar 2025] Run and serve **Google Gemma 3** using SkyPilot [**example**](./llm/gemma3/)

- [Feb 2025] Prepare and serve **Retrieval Augmented Generation (RAG) with DeepSeek-R1**: [**blog post**](https://blog.skypilot.co/deepseek-rag), [**example**](./llm/rag/)

- [Feb 2025] Run and serve **DeepSeek-R1 671B** using SkyPilot and SGLang with high throughput: [**example**](./llm/deepseek-r1/)

- [Feb 2025] Prepare and serve large-scale image search with **vector databases**: [**blog post**](https://blog.skypilot.co/large-scale-vector-database/), [**example**](./examples/vector_database/)

- [Jan 2025] Launch and serve distilled models from **[DeepSeek-R1](https://github.com/deepseek-ai/DeepSeek-R1)** and **[Janus](https://github.com/deepseek-ai/DeepSeek-Janus)** on Kubernetes or any cloud: [**R1 example**](./llm/deepseek-r1-distilled/) and [**Janus example**](./llm/deepseek-janus/)

- [Oct 2024] :tada: **SkyPilot crossed 1M+ downloads** :tada:: Thank you to our community! [**Twitter/X**](https://x.com/skypilot_org/status/1844770841718067638)

- [Sep 2024] Point, launch and serve **Llama 3.2** on Kubernetes or any cloud: [**example**](./llm/llama-3_2/)

**LLM Finetuning Cookbooks**: Finetuning Llama 2 / Llama 3.1 in your own cloud environment, privately: Llama 2 [**example**](./llm/vicuna-llama-2/) and [**blog**](https://blog.skypilot.co/finetuning-llama2-operational-guide/); Llama 3.1 [**example**](./llm/llama-3_1-finetuning/) and [**blog**](https://blog.skypilot.co/finetune-llama-3_1-on-your-infra/)

----

SkyPilot is an open-source framework for running AI and batch workloads on any infra.

SkyPilot **is easy to use for AI users**:

- Quickly spin up compute on your own infra

- Environment and job as code — simple and portable

- Easy job management: queue, run, and auto-recover many jobs

SkyPilot **unifies multiple clusters, clouds, and hardware**:

- One interface to use reserved GPUs, Kubernetes clusters, or 16+ clouds

- [Flexible provisioning](https://docs.skypilot.co/en/latest/examples/auto-failover.html) of GPUs, TPUs, CPUs, with auto-retry

- [Team deployment](https://docs.skypilot.co/en/latest/reference/api-server/api-server.html) and resource sharing

SkyPilot **cuts your cloud costs & maximizes GPU availability**:

* Autostop: automatic cleanup of idle resources

* [Spot instance support](https://docs.skypilot.co/en/latest/examples/managed-jobs.html#running-on-spot-instances): 3-6x cost savings, with preemption auto-recovery

* Intelligent scheduling: automatically run on the cheapest & most available infra
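To make the "3-6x" spot figure concrete, here is a minimal arithmetic sketch. The prices are placeholders rather than real quotes, and `savings_factor` and `effective_factor` are hypothetical helpers, not part of SkyPilot. The second function assumes a simple model in which preemptions add a fixed fraction of re-run compute on top of the job.

```python
def savings_factor(on_demand_hourly: float, spot_hourly: float) -> float:
    """Return how many times cheaper spot is than on-demand."""
    if spot_hourly <= 0:
        raise ValueError("spot price must be positive")
    return on_demand_hourly / spot_hourly


def effective_factor(on_demand_hourly: float, spot_hourly: float,
                     rework_fraction: float) -> float:
    """Savings factor after paying for work redone following preemptions.

    rework_fraction is the extra compute (as a fraction of the job) spent
    re-running lost work, e.g. 0.25 means 25% extra spot hours.
    """
    return on_demand_hourly / (spot_hourly * (1.0 + rework_fraction))


# Placeholder prices, not real quotes from any cloud.
ON_DEMAND = 32.0  # $/hour, hypothetical
SPOT = 8.0        # $/hour, hypothetical

print(savings_factor(ON_DEMAND, SPOT))          # 4.0
print(effective_factor(ON_DEMAND, SPOT, 0.25))  # 3.2
```

Under these assumed numbers, a 4x price gap shrinks to about 3.2x once 25% of the work has to be repeated, which is why auto-recovery from preemption matters for the realized savings.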

SkyPilot supports your existing GPU, TPU, and CPU workloads, with no code changes.

Install with pip:

```bash
# Choose your clouds:
pip install -U "skypilot[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,nebius]"
```

To get the latest features and fixes, use the nightly build or [install from source](https://docs.skypilot.co/en/latest/getting-started/installation.html):

```bash
# Choose your clouds:
pip install "skypilot-nightly[kubernetes,aws,gcp,azure,oci,lambda,runpod,fluidstack,paperspace,cudo,ibm,scp,nebius]"
```

Currently supported infra: Kubernetes, AWS, GCP, Azure, OCI, Lambda Cloud, Fluidstack, RunPod, Cudo, Digital Ocean, Paperspace, Cloudflare, Samsung, IBM, Vast.ai, VMware vSphere, Nebius.

## Getting started

You can find our documentation [here](https://docs.skypilot.co/).

- [Installation](https://docs.skypilot.co/en/latest/getting-started/installation.html)

- [Quickstart](https://docs.skypilot.co/en/latest/getting-started/quickstart.html)

- [CLI reference](https://docs.skypilot.co/en/latest/reference/cli.html)

## SkyPilot in 1 minute

A SkyPilot task specifies: resource requirements, data to be synced, setup commands, and the task commands.

Once written in this [**unified interface**](https://docs.skypilot.co/en/latest/reference/yaml-spec.html) (YAML or Python API), the task can be launched on any available cloud. This avoids vendor lock-in and makes it easy to move jobs to a different provider.

Paste the following into a file `my_task.yaml`:

```yaml
resources:
  accelerators: A100:8  # 8x NVIDIA A100 GPU

num_nodes: 1  # Number of VMs to launch

# Working directory (optional) containing the project codebase.
# Its contents are synced to ~/sky_workdir/ on the cluster.
workdir: ~/torch_examples

# Commands to be run before executing the job.
# Typical use: pip install -r requirements.txt, git clone, etc.
setup: |
  pip install "torch<2.2" torchvision --index-url https://download.pytorch.org/whl/cu121

# Commands to run as a job.
# Typical use: launch the main program.
run: |
  cd mnist
  python main.py --epochs 1
```
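As a rough illustration of how the sections of `my_task.yaml` fit together, the sketch below mirrors the file as a plain Python dictionary and computes the total number of GPUs requested. It does not use the SkyPilot Python API. `parse_accelerators` and `summarize` are hypothetical helpers, and treating a missing count as 1 is an assumption made for this example.

```python
# The same fields as my_task.yaml, written as a plain dictionary.
task = {
    "resources": {"accelerators": "A100:8"},
    "num_nodes": 1,
    "workdir": "~/torch_examples",
    "setup": 'pip install "torch<2.2" torchvision '
             "--index-url https://download.pytorch.org/whl/cu121",
    "run": "cd mnist\npython main.py --epochs 1",
}


def parse_accelerators(spec: str):
    """Split a 'NAME:COUNT' string such as 'A100:8' into (name, count)."""
    name, _, count = spec.partition(":")
    # Assumption for this sketch: a bare name means a single accelerator.
    return name, int(count) if count else 1


def summarize(task: dict) -> str:
    """Describe the total accelerator request across all nodes."""
    name, per_node = parse_accelerators(task["resources"]["accelerators"])
    nodes = task.get("num_nodes", 1)
    return f"{per_node * nodes}x {name} across {nodes} node(s)"


print(summarize(task))  # 8x A100 across 1 node(s)
```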

Prepare the workdir by cloning:

```bash
git clone https://github.com/pytorch/examples.git ~/torch_examples
```

Launch with `sky launch` (note: [access to GPU instances](https://docs.skypilot.co/en/latest/cloud-setup/quota.html) is needed for this example):

```bash
sky launch my_task.yaml
```

SkyPilot then does the heavy lifting for you, including:

1. Find the lowest-priced VM instance type across different clouds

2. Provision the VM, with auto-failover if the cloud returns capacity errors

3. Sync the local `workdir` to the VM

4. Run the task's `setup` commands to prepare the VM for running the task

5. Run the task's `run` commands
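The first two steps above can be pictured with a toy sketch: sort candidate offers by price, then try each in turn until one provisions. This is not SkyPilot's optimizer or provisioner. The `Offer` type, the cloud and instance names, the prices, and `fake_provision` are all made up for illustration.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised by the stand-in provisioner when a cloud has no capacity."""


@dataclass(frozen=True)
class Offer:
    cloud: str
    instance_type: str
    hourly_price: float  # placeholder number, not a real quote


def launch_with_failover(offers, provision):
    """Try offers from cheapest to most expensive until one provisions."""
    failures = []
    for offer in sorted(offers, key=lambda o: o.hourly_price):
        try:
            provision(offer)
            return offer, failures
        except CapacityError as err:
            failures.append((offer.cloud, str(err)))
    raise RuntimeError(f"no capacity on any candidate: {failures}")


# Hypothetical candidates for an 8x A100 request.
OFFERS = [
    Offer("cloud-a", "a100x8-large", 30.0),
    Offer("cloud-b", "gpu-8a100", 25.0),
    Offer("cloud-c", "a100.8x", 28.0),
]


def fake_provision(offer):
    # Pretend the cheapest cloud is out of capacity right now.
    if offer.cloud == "cloud-b":
        raise CapacityError("out of capacity")


chosen, skipped = launch_with_failover(OFFERS, fake_provision)
print(chosen.cloud, skipped)  # cloud-c [('cloud-b', 'out of capacity')]
```

In this sketch the cheapest offer fails with a capacity error, so the next cheapest one is used and the failure is recorded rather than aborting the launch.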

See [Quickstart](https://docs.skypilot.co/en/latest/getting-started/quickstart.html) to get started with SkyPilot.

## Runnable examples

See [**SkyPilot examples**](https://docs.skypilot.co/en/docs-examples/examples/index.html) that cover: development, training, serving, LLMs, AI apps, and common frameworks.

Latest featured examples:

| Task | Examples |
|----------|----------|
| Training | [PyTorch](https://docs.skypilot.co/en/latest/getting-started/tutorial.html), [DeepSpeed](https://docs.skypilot.co/en/latest/examples/training/deepspeed.html), [Finetune Llama 3](https://docs.skypilot.co/en/latest/examples/training/llama-3_1-finetuning.html), [NeMo](https://docs.skypilot.co/en/latest/examples/training/nemo.html), [Ray](https://docs.skypilot.co/en/latest/examples/training/ray.html), [Unsloth](https://docs.skypilot.co/en/latest/examples/training/unsloth.html), [Jax/TPU](https://docs.skypilot.co/en/latest/examples/training/tpu.html) |
| Serving | [vLLM](https://docs.skypilot.co/en/latest/examples/serving/vllm.html), [SGLang](https://docs.skypilot.co/en/latest/examples/serving/sglang.html), [Ollama](https://docs.skypilot.co/en/latest/examples/serving/ollama.html) |
| Models | [DeepSeek-R1](https://docs.skypilot.co/en/latest/examples/models/deepseek-r1.html), [Llama 3](https://docs.skypilot.co/en/latest/examples/models/llama-3.html), [CodeLlama](https://docs.skypilot.co/en/latest/examples/models/codellama.html), [Qwen](https://docs.skypilot.co/en/latest/examples/models/qwen.html), [Mixtral](https://docs.skypilot.co/en/latest/examples/models/mixtral.html) |
| AI apps | [RAG](https://docs.skypilot.co/en/latest/examples/applications/rag.html), [vector databases](https://docs.skypilot.co/en/latest/examples/applications/vector_database.html) (ChromaDB, CLIP) |
| Common frameworks | [Airflow](https://docs.skypilot.co/en/latest/examples/frameworks/airflow.html), [Jupyter](https://docs.skypilot.co/en/latest/examples/frameworks/jupyter.html) |

Source files and more examples can be found in [`llm/`](https://github.com/skypilot-org/skypilot/tree/master/llm) and [`examples/`](https://github.com/skypilot-org/skypilot/tree/master/examples).

## More information

To learn more, see [SkyPilot Overview](https://docs.skypilot.co/en/latest/overview.html), [SkyPilot docs](https://docs.skypilot.co/en/latest/), and [SkyPilot blog](https://blog.skypilot.co/).

Case studies and integrations: [Community Spotlights](https://blog.skypilot.co/community/)

Follow updates:

- [Slack](http://slack.skypilot.co)

- [X / Twitter](https://twitter.com/skypilot_org)

- [LinkedIn](https://www.linkedin.com/company/skypilot-oss/)

- [SkyPilot Blog](https://blog.skypilot.co/) ([Introductory blog post](https://blog.skypilot.co/introducing-skypilot/))

Read the research:

- [SkyPilot paper](https://www.usenix.org/system/files/nsdi23-yang-zongheng.pdf) and [talk](https://www.usenix.org/conference/nsdi23/presentation/yang-zongheng) (NSDI 2023)

- [Sky Computing whitepaper](https://arxiv.org/abs/2205.07147)

- [Sky Computing vision paper](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s02-stoica.pdf) (HotOS 2021)

- [SkyServe: AI serving across regions and clouds](https://arxiv.org/pdf/2411.01438) (EuroSys 2025)

- [Managed jobs spot instance policy](https://www.usenix.org/conference/nsdi24/presentation/wu-zhanghao) (NSDI 2024)

SkyPilot was started at the [Sky Computing Lab](https://sky.cs.berkeley.edu) at UC Berkeley and has since gained many industry contributors. To read about the project's origin and vision, see [Concept: Sky Computing](https://docs.skypilot.co/en/latest/sky-computing.html).

## Questions and feedback

We are excited to hear your feedback:

* For issues and feature requests, please [open a GitHub issue](https://github.com/skypilot-org/skypilot/issues/new).

* For questions, please use [GitHub Discussions](https://github.com/skypilot-org/skypilot/discussions).

For general discussions, join us on the [SkyPilot Slack](http://slack.skypilot.co).

## Contributing

We welcome all contributions to the project! See [CONTRIBUTING](CONTRIBUTING.md) for how to get involved.
