
https://github.com/outerbounds/terraform-aws-metaflow

Deploy production-grade Metaflow cloud infrastructure on AWS

[Slack](http://slack.outerbounds.co/)

# Metaflow Terraform module

Terraform module that provisions AWS resources to run [Metaflow](https://metaflow.org/) in production.

This module consists of submodules that can be used separately as well:

- AWS Batch cluster to run Metaflow steps ([`metaflow-computation`](./modules/computation))
- blob storage and metadata database ([`metaflow-datastore`](./modules/datastore))
- a service providing an API to record and query past executions ([`metaflow-metadata-service`](./modules/metadata-service))
- resources to deploy Metaflow flows on AWS Step Functions ([`metaflow-step-functions`](./modules/step-functions))
- Metaflow UI ([`metaflow-ui`](./modules/ui))

![modules diagram](./docs/terraform_modules.png)

You can use either this high-level module or the submodules individually. See each submodule's corresponding `README.md` for more details.

Here's a minimal end-to-end example of using this module with a VPC:

```terraform
# Random suffix for this deployment
resource "random_string" "suffix" {
  length  = 8
  special = false
  upper   = false
}

locals {
  resource_prefix = "metaflow"
  resource_suffix = random_string.suffix.result
}

data "aws_availability_zones" "available" {
}

# VPC infra using https://github.com/terraform-aws-modules/terraform-aws-vpc
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "3.13.0"

  name = "${local.resource_prefix}-${local.resource_suffix}"
  cidr = "10.10.0.0/16"

  azs             = data.aws_availability_zones.available.names
  private_subnets = ["10.10.8.0/21", "10.10.16.0/21", "10.10.24.0/21"]
  public_subnets  = ["10.10.128.0/21", "10.10.136.0/21", "10.10.144.0/21"]

  enable_nat_gateway   = true
  single_nat_gateway   = true
  enable_dns_hostnames = true
}

module "metaflow" {
  source  = "outerbounds/metaflow/aws"
  version = "0.3.0"

  resource_prefix = local.resource_prefix
  resource_suffix = local.resource_suffix

  enable_step_functions = false
  subnet1_id            = module.vpc.public_subnets[0]
  subnet2_id            = module.vpc.public_subnets[1]
  vpc_cidr_blocks       = module.vpc.vpc_cidr_blocks
  vpc_id                = module.vpc.vpc_id
  with_public_ip        = true

  tags = {
    "managedBy" = "terraform"
  }
}

# Export all outputs from the metaflow module
output "metaflow" {
  value = module.metaflow
}

# The module generates a Metaflow config in JSON format; write it to a file
resource "local_file" "metaflow_config" {
  content  = module.metaflow.metaflow_profile_json
  filename = "./metaflow_profile.json"
}
```

**Note:** A more complete example, which uses this module and also sets up SageMaker notebooks and other non-Metaflow-specific parts of the infrastructure, is available [in this repo](https://github.com/outerbounds/metaflow-tools/tree/master/aws/terraform).

## Modules

| Name | Source | Version |
|------|--------|---------|
| [metaflow-common](#module\_metaflow-common) | ./modules/common | n/a |
| [metaflow-computation](#module\_metaflow-computation) | ./modules/computation | n/a |
| [metaflow-datastore](#module\_metaflow-datastore) | ./modules/datastore | n/a |
| [metaflow-metadata-service](#module\_metaflow-metadata-service) | ./modules/metadata-service | n/a |
| [metaflow-step-functions](#module\_metaflow-step-functions) | ./modules/step-functions | n/a |
| [metaflow-ui](#module\_metaflow-ui) | ./modules/ui | n/a |

## Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
| [access\_list\_cidr\_blocks](#input\_access\_list\_cidr\_blocks) | List of CIDRs we want to grant access to our Metaflow Metadata Service. Usually this is our VPN's CIDR blocks. | `list(string)` | `[]` | no |
| [batch\_type](#input\_batch\_type) | AWS Batch Compute Type ('ec2', 'fargate') | `string` | `"ec2"` | no |
| [compute\_environment\_desired\_vcpus](#input\_compute\_environment\_desired\_vcpus) | Desired Starting VCPUs for Batch Compute Environment [0-16] for EC2 Batch Compute Environment (ignored for Fargate) | `number` | `8` | no |
| [compute\_environment\_egress\_cidr\_blocks](#input\_compute\_environment\_egress\_cidr\_blocks) | CIDR blocks to which egress is allowed from the Batch Compute environment's security group | `list(string)` | `["0.0.0.0/0"]` | no |
| [compute\_environment\_instance\_types](#input\_compute\_environment\_instance\_types) | The instance types for the compute environment | `list(string)` | `["c4.large", "c4.xlarge", "c4.2xlarge", "c4.4xlarge", "c4.8xlarge"]` | no |
| [compute\_environment\_max\_vcpus](#input\_compute\_environment\_max\_vcpus) | Maximum VCPUs for Batch Compute Environment [16-96] | `number` | `64` | no |
| [compute\_environment\_min\_vcpus](#input\_compute\_environment\_min\_vcpus) | Minimum VCPUs for Batch Compute Environment [0-16] for EC2 Batch Compute Environment (ignored for Fargate) | `number` | `8` | no |
| [db\_engine\_version](#input\_db\_engine\_version) | n/a | `string` | `"11"` | no |
| [db\_instance\_type](#input\_db\_instance\_type) | RDS instance type to launch for PostgreSQL database. | `string` | `"db.t2.small"` | no |
| [db\_migrate\_lambda\_zip\_file](#input\_db\_migrate\_lambda\_zip\_file) | Output path for the zip file containing the DB migrate lambda | `string` | `null` | no |
| [enable\_custom\_batch\_container\_registry](#input\_enable\_custom\_batch\_container\_registry) | Provisions infrastructure for custom Amazon ECR container registry if enabled | `bool` | `false` | no |
| [enable\_key\_rotation](#input\_enable\_key\_rotation) | Enable key rotation for KMS keys | `bool` | `false` | no |
| [enable\_step\_functions](#input\_enable\_step\_functions) | Provisions infrastructure for step functions if enabled | `bool` | n/a | yes |
| [extra\_ui\_backend\_env\_vars](#input\_extra\_ui\_backend\_env\_vars) | Additional environment variables for UI backend container | `map(string)` | `{}` | no |
| [extra\_ui\_static\_env\_vars](#input\_extra\_ui\_static\_env\_vars) | Additional environment variables for UI static app | `map(string)` | `{}` | no |
| [force\_destroy\_s3\_bucket](#input\_force\_destroy\_s3\_bucket) | Empty S3 bucket before destroying via terraform destroy | `bool` | `false` | no |
| [iam\_partition](#input\_iam\_partition) | IAM Partition (Select aws-us-gov for AWS GovCloud, otherwise leave as is) | `string` | `"aws"` | no |
| [launch\_template\_http\_endpoint](#input\_launch\_template\_http\_endpoint) | Whether the instance metadata service is available. Can be 'enabled' or 'disabled' | `string` | `"enabled"` | no |
| [launch\_template\_http\_put\_response\_hop\_limit](#input\_launch\_template\_http\_put\_response\_hop\_limit) | The desired HTTP PUT response hop limit for instance metadata requests. Can be an integer from 1 to 64 | `number` | `2` | no |
| [launch\_template\_http\_tokens](#input\_launch\_template\_http\_tokens) | Whether or not the metadata service requires session tokens, also referred to as Instance Metadata Service Version 2 (IMDSv2). Can be 'optional' or 'required' | `string` | `"optional"` | no |
| [metadata\_service\_container\_image](#input\_metadata\_service\_container\_image) | Container image for metadata service | `string` | `""` | no |
| [metadata\_service\_enable\_api\_basic\_auth](#input\_metadata\_service\_enable\_api\_basic\_auth) | Enable basic auth for API Gateway? (requires key export) | `bool` | `true` | no |
| [metadata\_service\_enable\_api\_gateway](#input\_metadata\_service\_enable\_api\_gateway) | Enable API Gateway for public metadata service endpoint | `bool` | `true` | no |
| [resource\_prefix](#input\_resource\_prefix) | string prefix for all resources | `string` | `"metaflow"` | no |
| [resource\_suffix](#input\_resource\_suffix) | string suffix for all resources | `string` | `""` | no |
| [subnet1\_id](#input\_subnet1\_id) | First subnet used for availability zone redundancy | `string` | n/a | yes |
| [subnet2\_id](#input\_subnet2\_id) | Second subnet used for availability zone redundancy | `string` | n/a | yes |
| [tags](#input\_tags) | AWS tags | `map(string)` | n/a | yes |
| [ui\_alb\_internal](#input\_ui\_alb\_internal) | Defines whether the ALB for the UI is internal | `bool` | `false` | no |
| [ui\_allow\_list](#input\_ui\_allow\_list) | List of CIDRs we want to grant access to our Metaflow UI Service. Usually this is our VPN's CIDR blocks. | `list(string)` | `[]` | no |
| [ui\_certificate\_arn](#input\_ui\_certificate\_arn) | SSL certificate for UI. If set to empty string, UI is disabled. | `string` | `""` | no |
| [ui\_static\_container\_image](#input\_ui\_static\_container\_image) | Container image for the UI frontend app | `string` | `""` | no |
| [vpc\_cidr\_blocks](#input\_vpc\_cidr\_blocks) | The VPC CIDR blocks that we'll add to the access list of our Metadata Service API to allow all internal communications | `list(string)` | n/a | yes |
| [vpc\_id](#input\_vpc\_id) | The id of the single VPC we stood up for all Metaflow resources to exist in. | `string` | n/a | yes |
| [with\_public\_ip](#input\_with\_public\_ip) | Enable public IP assignment for the Metadata Service. If the subnets specified for subnet1\_id and subnet2\_id are public subnets, you will NEED to set this to true to allow pulling container images from public registries. Otherwise this should be set to false. | `bool` | n/a | yes |
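
As an illustration of how the optional inputs combine with the required ones, here is a minimal sketch that deploys into existing private subnets and overrides a few defaults. The variables (`vpc_id`, `vpc_cidr_blocks`, `private_subnet_ids`, `vpn_cidr_blocks`) and the chosen values are hypothetical; the sketch also assumes that the module version you pin exposes every input listed in the table above and that the private subnets have outbound access (for example through a NAT gateway) for pulling container images.

```terraform
# Hypothetical variables describing an existing network; names are illustrative.
variable "vpc_id" {
  type = string
}

variable "vpc_cidr_blocks" {
  type = list(string)
}

variable "private_subnet_ids" {
  type = list(string)
}

variable "vpn_cidr_blocks" {
  type    = list(string)
  default = []
}

module "metaflow" {
  source  = "outerbounds/metaflow/aws"
  version = "0.3.0" # assumption: pin whichever version exposes the inputs used below

  # Required inputs
  enable_step_functions = true
  subnet1_id            = var.private_subnet_ids[0]
  subnet2_id            = var.private_subnet_ids[1]
  vpc_cidr_blocks       = var.vpc_cidr_blocks
  vpc_id                = var.vpc_id
  with_public_ip        = false # private subnets, per the with_public_ip description

  tags = {
    "managedBy" = "terraform"
  }

  # Optional overrides (illustrative values)
  batch_type                         = "ec2"
  compute_environment_min_vcpus      = 0
  compute_environment_desired_vcpus  = 0
  compute_environment_max_vcpus      = 32
  compute_environment_instance_types = ["c4.large", "c4.xlarge"]
  access_list_cidr_blocks            = var.vpn_cidr_blocks
  launch_template_http_tokens        = "required" # enforce IMDSv2 on Batch instances
  enable_key_rotation                = true
}
```

Setting the minimum and desired vCPUs to `0` lets the EC2 compute environment scale down when idle, at the cost of slower start-up for the first job.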

## Outputs

| Name | Description |
|------|-------------|
| [METAFLOW\_BATCH\_JOB\_QUEUE](#output\_METAFLOW\_BATCH\_JOB\_QUEUE) | AWS Batch Job Queue ARN for Metaflow |
| [METAFLOW\_DATASTORE\_SYSROOT\_S3](#output\_METAFLOW\_DATASTORE\_SYSROOT\_S3) | Amazon S3 URL for Metaflow DataStore |
| [METAFLOW\_DATATOOLS\_S3ROOT](#output\_METAFLOW\_DATATOOLS\_S3ROOT) | Amazon S3 URL for Metaflow DataTools |
| [METAFLOW\_ECS\_S3\_ACCESS\_IAM\_ROLE](#output\_METAFLOW\_ECS\_S3\_ACCESS\_IAM\_ROLE) | Role for AWS Batch to Access Amazon S3 |
| [METAFLOW\_EVENTS\_SFN\_ACCESS\_IAM\_ROLE](#output\_METAFLOW\_EVENTS\_SFN\_ACCESS\_IAM\_ROLE) | IAM role for Amazon EventBridge to access AWS Step Functions. |
| [METAFLOW\_SERVICE\_INTERNAL\_URL](#output\_METAFLOW\_SERVICE\_INTERNAL\_URL) | URL for Metadata Service (Accessible in VPC) |
| [METAFLOW\_SERVICE\_URL](#output\_METAFLOW\_SERVICE\_URL) | URL for Metadata Service (Accessible in VPC) |
| [METAFLOW\_SFN\_DYNAMO\_DB\_TABLE](#output\_METAFLOW\_SFN\_DYNAMO\_DB\_TABLE) | AWS DynamoDB table name for tracking AWS Step Functions execution metadata. |
| [METAFLOW\_SFN\_IAM\_ROLE](#output\_METAFLOW\_SFN\_IAM\_ROLE) | IAM role for AWS Step Functions to access AWS resources (AWS Batch, AWS DynamoDB). |
| [api\_gateway\_rest\_api\_id\_key\_id](#output\_api\_gateway\_rest\_api\_id\_key\_id) | API Gateway Key ID for Metadata Service. Fetch Key from AWS Console [METAFLOW\_SERVICE\_AUTH\_KEY] |
| [batch\_compute\_environment\_security\_group\_id](#output\_batch\_compute\_environment\_security\_group\_id) | The ID of the security group attached to the Batch Compute environment. |
| [datastore\_s3\_bucket\_kms\_key\_arn](#output\_datastore\_s3\_bucket\_kms\_key\_arn) | The ARN of the KMS key used to encrypt the Metaflow datastore S3 bucket |
| [metadata\_svc\_ecs\_task\_role\_arn](#output\_metadata\_svc\_ecs\_task\_role\_arn) | n/a |
| [metaflow\_api\_gateway\_rest\_api\_id](#output\_metaflow\_api\_gateway\_rest\_api\_id) | The ID of the API Gateway REST API we'll use to accept MetaData service requests to forward to the Fargate API instance |
| [metaflow\_batch\_container\_image](#output\_metaflow\_batch\_container\_image) | The ECR repo containing the metaflow batch image |
| [metaflow\_profile\_json](#output\_metaflow\_profile\_json) | Metaflow profile JSON object that can be used to communicate with this Metaflow Stack. Store this in `~/.metaflow/config_[stack-name]` and select with `$ export METAFLOW_PROFILE=[stack-name]`. |
| [metaflow\_s3\_bucket\_arn](#output\_metaflow\_s3\_bucket\_arn) | The ARN of the bucket we'll be using as blob storage |
| [metaflow\_s3\_bucket\_name](#output\_metaflow\_s3\_bucket\_name) | The name of the bucket we'll be using as blob storage |
| [migration\_function\_arn](#output\_migration\_function\_arn) | ARN of DB Migration Function |
| [ui\_alb\_arn](#output\_ui\_alb\_arn) | UI ALB ARN |
| [ui\_alb\_dns\_name](#output\_ui\_alb\_dns\_name) | UI ALB DNS name |
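
Instead of exporting the whole module object, a root module can re-export only the values it needs. A minimal sketch, assuming a `module "metaflow"` block like the one in the example above is declared in the same configuration; the root-level output names (`metadata_service_url`, `datastore_bucket_name`, `batch_job_queue_arn`) are hypothetical.

```terraform
# Re-export selected outputs of the metaflow module under root-level names.
output "metadata_service_url" {
  description = "URL for the Metaflow Metadata Service"
  value       = module.metaflow.METAFLOW_SERVICE_URL
}

output "datastore_bucket_name" {
  description = "Name of the S3 bucket used as blob storage"
  value       = module.metaflow.metaflow_s3_bucket_name
}

output "batch_job_queue_arn" {
  description = "AWS Batch Job Queue ARN for Metaflow"
  value       = module.metaflow.METAFLOW_BATCH_JOB_QUEUE
}
```

After `terraform apply`, a single value can be read with `terraform output`, for example `terraform output metadata_service_url`.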