{"id":13505761,"url":"https://github.com/iterative/terraform-provider-iterative","last_synced_at":"2025-06-18T02:38:55.058Z","repository":{"id":36999803,"uuid":"292388086","full_name":"iterative/terraform-provider-iterative","owner":"iterative","description":"☁️ Terraform plugin for machine learning workloads: spot instance recovery \u0026 auto-termination | AWS, GCP, Azure, Kubernetes","archived":false,"fork":false,"pushed_at":"2024-12-11T23:56:40.000Z","size":21259,"stargazers_count":295,"open_issues_count":71,"forks_count":28,"subscribers_count":12,"default_branch":"main","last_synced_at":"2025-06-13T23:46:05.977Z","etag":null,"topics":["aws","azure","cloud","cloud-computing","cloud-infrastructure","cloud-orchestration","cloud-storage","cml","data-science","developer-tools","gcp","gpu","hacktoberfest","k8s","machine-learning","mlops","terraform","terraform-provider","terraform-provider-iterative","tpi"],"latest_commit_sha":null,"homepage":"https://registry.terraform.io/providers/iterative/iterative/latest/docs","language":"Go","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iterative.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2020-09-02T20:30:28.000Z","updated_at":"2025-05-25T13:42:27.000Z","dependencies_parsed_at":"2024-11-01T03:31:12.709Z","dependency_job_id":"e5834c76-65e9-4f53-a66a-4404be95ea32","html_url":"https://github.com/iterative/terraform-provider-iterative","commit_stats":{"total_commits":321,"total_committers":25,"mean_commits":12.84,"dds":"0.47663551401869164","last_synced_commit":"b689064d0f9e00e626ac5
a36c44e8f1c3c1fb431"},"previous_names":[],"tags_count":116,"template":false,"template_full_name":null,"purl":"pkg:github/iterative/terraform-provider-iterative","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iterative%2Fterraform-provider-iterative","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iterative%2Fterraform-provider-iterative/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iterative%2Fterraform-provider-iterative/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iterative%2Fterraform-provider-iterative/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iterative","download_url":"https://codeload.github.com/iterative/terraform-provider-iterative/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iterative%2Fterraform-provider-iterative/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":260476176,"owners_count":23014978,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["aws","azure","cloud","cloud-computing","cloud-infrastructure","cloud-orchestration","cloud-storage","cml","data-science","developer-tools","gcp","gpu","hacktoberfest","k8s","machine-learning","mlops","terraform","terraform-provider","terraform-provider-iterative","tpi"],"created_at":"2024-08-01T00:01:13.107Z","updated_at":"2025-06-18T02:38:50.044Z","avatar_url":"https://github.com/iterative.png","language":"Go","funding_links":[],"categories":["Go","Providers"],"sub_categ
ories":["Vendor supported providers"],"readme":"![TPI](https://static.iterative.ai/img/tpi/banner.svg)\n\n# Terraform Provider Iterative (TPI)\n\n[![docs](https://img.shields.io/badge/-docs-5c4ee5?logo=terraform)](https://registry.terraform.io/providers/iterative/iterative/latest/docs)\n[![tests](https://img.shields.io/github/actions/workflow/status/iterative/terraform-provider-iterative/test.yml?branch=main\u0026label=tests\u0026logo=GitHub)](https://github.com/iterative/terraform-provider-iterative/actions/workflows/test.yml)\n[![Apache-2.0][licence-badge]][licence-file]\n\nTPI is a [Terraform](https://terraform.io) plugin built with machine learning in mind. This CLI tool offers full lifecycle management of computing resources (including GPUs and respawning spot instances) from several cloud vendors (AWS, Azure, GCP, K8s)... without needing to be a cloud expert.\n\n- **Lower cost with spot recovery**: transparent data checkpoint/restore \u0026 auto-respawning of low-cost spot/preemptible instances\n- **No cloud vendor lock-in**: switch between clouds with just one line thanks to unified abstraction\n- **No waste**: auto-cleanup unused resources (terminate compute instances upon task completion/failure \u0026 remove storage upon download of results), pay only for what you use\n- **Developer-first experience**: one-command data sync \u0026 code execution with no external server, making the cloud feel like a laptop\n\nSupported cloud vendors [include][auth]:\n\n| [![Amazon Web Services (AWS)][aws-badge]][aws] | [![Microsoft Azure][azure-badge]][azure] | [![Google Cloud Platform (GCP)][gcp-badge]][gcp] | [![Kubernetes (K8s)][k8s-badge]][k8s] |\n| ---------------------------------------------- | ---------------------------------------- | ------------------------------------------------ | ------------------------------------- |\n\n[aws-badge]: https://img.shields.io/badge/AWS-Amazon_Web_Services-black?colorA=white\u0026logoColor=232F3E\u0026logo=amazonaws\n[aws]: 
https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#amazon-web-services\n[azure-badge]: https://img.shields.io/badge/Azure-Microsoft_Azure-black?colorA=white\u0026logoColor=0078D4\u0026logo=microsoftazure\n[azure]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#microsoft-azure\n[gcp-badge]: https://img.shields.io/badge/GCP-Google_Cloud_Platform-black?colorA=white\u0026logoColor=4285F4\u0026logo=googlecloud\n[gcp]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#google-cloud-platform\n[k8s-badge]: https://img.shields.io/badge/K8s-Kubernetes-black?colorA=white\u0026logoColor=326CE5\u0026logo=kubernetes\n[k8s]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication#kubernetes\n[auth]: https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/authentication\n\n\u003cpicture\u003e\n  \u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://github.com/iterative/static/raw/main/img/tpi/high-level-dark.png\"\u003e\n  \u003cimg src=\"https://github.com/iterative/static/raw/main/img/tpi/high-level-light.png\"\u003e\n\u003c/picture\u003e\n\n## Why TPI?\n\nThere are several reasons to use TPI instead of other related solutions (custom scripts and/or cloud orchestrators):\n\n1. **Reduced management overhead and infrastructure cost**:\n   TPI is a CLI tool, not a running service. It requires no additional orchestrating machine (control plane/head nodes) to schedule/recover/terminate instances. Instead, TPI runs (spot) instances via cloud-native scaling groups[^scalers], taking care of recovery and termination automatically on the cloud provider's side. This design reduces management overhead \u0026 infrastructure costs. You can close your laptop while cloud tasks are running — auto-recovery happens even if you are offline.\n2. 
**Unified tool for data science and software development teams**:\n   TPI provides consistent tooling for both data scientists and DevOps engineers, improving cross-team collaboration. This simplifies compute management to a single config file, and reduces time to deliver ML models into production.\n3. **Reproducible, codified environments**:\n   Store hardware requirements in a single configuration file alongside the rest of your ML pipeline code.\n\n[^scalers]: [AWS Auto Scaling Groups](https://docs.aws.amazon.com/autoscaling/ec2/userguide/what-is-amazon-ec2-auto-scaling.html), [Azure VM Scale Sets](https://azure.microsoft.com/en-us/services/virtual-machine-scale-sets), [GCP managed instance groups](https://cloud.google.com/compute/docs/instance-groups#managed_instance_groups), and [Kubernetes Jobs](https://kubernetes.io/docs/concepts/workloads/controllers/job).\n\n\u003cimg width=24px src=\"https://static.iterative.ai/logo/cml.svg\"/\u003e TPI is used to power [CML](https://cml.dev), bringing cloud providers to existing GitHub, GitLab \u0026 Bitbucket CI/CD workflows ([repository](https://github.com/iterative/cml)).\n\n## Usage\n\n### Requirements\n\n- [Install Terraform 1.0+](https://learn.hashicorp.com/tutorials/terraform/install-cli#install-terraform), e.g.:\n  - Brew (Homebrew/Mac OS): `brew tap hashicorp/tap \u0026\u0026 brew install hashicorp/tap/terraform`\n  - Choco (Chocolatey/Windows): `choco install terraform`\n  - Conda (Anaconda): `conda install -c conda-forge terraform`\n  - Debian (Ubuntu/Linux):\n    ```\n    sudo apt-get update \u0026\u0026 sudo apt-get install -y gnupg software-properties-common curl\n    curl -fsSL https://apt.releases.hashicorp.com/gpg | sudo apt-key add -\n    sudo apt-add-repository \"deb [arch=amd64] https://apt.releases.hashicorp.com $(lsb_release -cs) main\"\n    sudo apt-get update \u0026\u0026 sudo apt-get install terraform\n    ```\n- Create an account with any supported cloud vendor and expose its [authentication 
credentials via environment variables][auth]\n\n### Define a Task\n\nIn a project root directory, create a file named `main.tf` with the following contents:\n\n```hcl\nterraform {\n  required_providers { iterative = { source = \"iterative/iterative\" } }\n}\nprovider \"iterative\" {}\n\nresource \"iterative_task\" \"example\" {\n  cloud      = \"aws\" # or any of: gcp, az, k8s\n  machine    = \"m\"   # medium. Or any of: l, xl, m+k80, xl+v100, ...\n  spot       = 0     # auto-price. Default -1 to disable, or \u003e0 for hourly USD limit\n  disk_size  = -1    # GB. Default -1 for automatic\n\n  storage {\n    workdir = \".\"       # default blank (don't upload)\n    output  = \"results\" # default blank (don't download). Relative to workdir\n  }\n  script = \u003c\u003c-END\n    #!/bin/bash\n\n    # create output directory if needed\n    mkdir -p results\n    # read last result (in case of spot/preemptible instance recovery)\n    if test -f results/epoch.txt; then EPOCH=\"$(cat results/epoch.txt)\"; fi\n    EPOCH=$${EPOCH:-1}  # start from 1 if last result not found\n\n    echo \"(re)starting training loop from $EPOCH up to 1337 epochs\"\n    for epoch in $(seq $EPOCH 1337); do\n      sleep 1\n      echo \"$epoch\" | tee results/epoch.txt\n    done\n  END\n}\n```\n\nSee [the reference](https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#argument-reference) for the full list of options for `main.tf` -- including more information on [`machine` types](https://registry.terraform.io/providers/iterative/iterative/latest/docs/resources/task#machine-type) with and without GPUs.\n\n[![console](https://github.com/iterative/static/raw/main/img/tpi/console.gif)](https://youtu.be/2fEgO8SazSE)\n\nRun this once (in the directory containing `main.tf`) to download the `required_providers`:\n\n```\nterraform init\nexport TF_LOG_PROVIDER=INFO\n```\n\n### Run Task\n\n```\nterraform apply\n```\n\nThis launches a `machine` in the `cloud`, uploads 
`workdir`, and runs the `script`. Upon completion (or error), the `machine` is terminated.\n\nWith spot/preemptible instances (`spot \u003e= 0`), auto-recovery logic and persistent (`disk_size`) storage will be used to relaunch interrupted tasks.\n\n### Query Status\n\nResults and logs are periodically synced to persistent cloud storage. To query this status and view logs:\n\n```\nterraform refresh\nterraform show\n```\n\n### End Task\n\n```\nterraform destroy\n```\n\nThis terminates the `machine` (if still running), downloads `output`, and removes the persistent `disk_size` storage.\n\n## Example Projects\n\n- [Run Jupyter \u0026 TensorBoard in the cloud with one command](https://github.com/iterative/blog-tpi-jupyter)\n- [Move local ML experiments to the cloud](https://github.com/iterative/blog-tpi-bees)\n\n## How it Works\n\nThis diagram may help to see what TPI does under-the-hood:\n\n```mermaid\nflowchart LR\nsubgraph tpi [what TPI manages]\ndirection LR\n    subgraph you [what you manage]\n        direction LR\n        A([Personal Computer])\n    end\n    B[(\"Cloud Storage (low cost)\")]\n    C{{\"Cloud instance scaler (zero cost)\"}}\n    D[[\"Cloud (spot) Instance\"]]\n    A ---\u003e |2. create cloud storage| B\n    A --\u003e |1. create cloud instance scaler| C\n    A ==\u003e |3. upload script \u0026 workdir| B\n    A -.-\u003e |\"4. offline (lunch break)\"| A\n    C -.-\u003e |\"5. (re)provision instance\"| D\n    D ==\u003e |7. run script| D\n    B \u003c-.-\u003e |6. persistent workdir cache| D\n    D ==\u003e |8. script end,\\nshutdown instance| B\n    D -.-\u003e |outage| C\n    B ==\u003e |9. 
download output| A\nend\nstyle you fill:#FFFFFF00,stroke:#13ADC7\nstyle tpi fill:#FFFFFF00,stroke:#FFFFFF00,stroke-width:0px\nstyle A fill:#13ADC7,stroke:#333333,color:#000000\nstyle B fill:#945DD5,stroke:#333333,color:#000000\nstyle D fill:#F46737,stroke:#333333,color:#000000\nstyle C fill:#7B61FF,stroke:#333333,color:#000000\n```\n\n## Future Plans\n\nTPI is a CLI tool bringing the power of bare-metal cloud to a bare-metal local laptop. We're working on more featureful and visual interfaces. We'd also like to have more native support for distributed (multi-instance) training, more data sync optimisations \u0026 options, and tighter ecosystem integration with tools such as [DVC](https://dvc.org). We also plan to add more examples for data scientists and machine learning engineers, from Jupyter, VSCode, and Codespaces to an improved live logging/monitoring/reporting experience.\n\n## Help\n\nThe [getting started guide](https://registry.terraform.io/providers/iterative/iterative/latest/docs/guides/getting-started) has some more information. In case of errors, extra debugging information is available using `TF_LOG_PROVIDER=DEBUG` instead of `INFO`.\n\nFeature requests and bugs can be [reported via GitHub issues](https://github.com/iterative/terraform-provider-iterative/issues), while general questions and feedback are very welcome on our active [Discord server](https://discord.gg/bzA6uY7).\n\n## Contributing\n\nTo work on the provider itself, build and use a local copy of the repository instead of the latest stable release.\n\n1. [Install Go 1.17+](https://golang.org/doc/install)\n2. Clone the repository \u0026 build the provider\n   ```\n   git clone https://github.com/iterative/terraform-provider-iterative\n   cd terraform-provider-iterative\n   make install\n   ```\n3. 
Use `source = \"github.com/iterative/iterative\"` in your `main.tf` to use the local repository (`source = \"iterative/iterative\"` will download the latest release instead), and run `terraform init --upgrade`\n\n## Copyright\n\nThis project and all contributions to it are distributed under [![Apache-2.0][licence-badge]][licence-file]\n\n[licence-badge]: https://img.shields.io/badge/licence-Apache%202.0-blue\n[licence-file]: https://github.com/iterative/terraform-provider-iterative/blob/main/LICENSE\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiterative%2Fterraform-provider-iterative","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiterative%2Fterraform-provider-iterative","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiterative%2Fterraform-provider-iterative/lists"}