https://github.com/buildkite/buildkite-agent-scaler

📈A lambda for scaling an AutoScalingGroup based on Buildkite metrics
https://github.com/buildkite/buildkite-agent-scaler

aws buildkite lambda

Last synced: 6 months ago
JSON representation

📈A lambda for scaling an AutoScalingGroup based on Buildkite metrics

Host: GitHub
URL: https://github.com/buildkite/buildkite-agent-scaler
Owner: buildkite
License: mit
Created: 2017-12-06T05:55:16.000Z (almost 8 years ago)
Default Branch: master
Last Pushed: 2025-03-27T13:30:44.000Z (7 months ago)
Last Synced: 2025-03-30T09:05:18.087Z (6 months ago)
Topics: aws, buildkite, lambda
Language: Go
Homepage:
Size: 5.17 MB
Stars: 61
Watchers: 19
Forks: 31
Open Issues: 12
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
- Codeowners: .github/CODEOWNERS

Awesome Lists containing this project

README

# Buildkite Agent Scaler

An AWS lambda function that handles the scaling of an
[Amazon Autoscaling Group](https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html)
(ASG) based on metrics provided by the Buildkite Agent Metrics API.

In practice, we've seen 300% faster initial scale-ups with this lambda vs native AutoScaling rules.
🚀

## Why?

The [Elastic CI Stack][] depends on being able to scale up quickly from zero instances in response
to scheduled Buildkite jobs. Amazon's AutoScaling primatives have a number of limitations that we
wanted more granular control over:

* The median time for a scaling event to be triggered was 2 minutes, due to needing two samples with
a minimum period of 60 seconds between.
* Scaling can either be by a fixed rate, a fixed step size or tracking, but tracking doesn't work
well with custom metrics like we use.

## How does it work?

The lambda (or cli version) polls the Buildkite Metrics API every 10 seconds, and based on the
results sets the `DesiredCount` to exactly what is needed. This allows much faster scale up.

## Gracefully scaling in

Whilst the lambda does support scaling in via setting `DesiredCount`, Amazon ASGs appear to not send
[Lifecycle Hooks][] before terminating instances, so jobs in progress are interrupted.

Instead, in the [Elastic CI Stack][] we run the scaler with scale-in disabled (`DISABLE_SCALE_IN`)
and rely on the
[recent addition in buildkite-agent v3.10.0](https://github.com/buildkite/agent/releases/tag/v3.10.0)
of `--disconnect-after-idle-timeout` in the Agent combined with a
[systemd PostStop script](https://github.com/buildkite/elastic-ci-stack-for-aws/blob/00c45ab47160b1d1d44c0b3bea8456456444c60e/packer/linux/conf/bin/bk-install-elastic-stack.sh#L136-L143)
to terminate the instance and atomically decrease the `DesiredCount` after the agent has been idle
for a time period. We've found it to work really well, and is less complicated than relying on
[lifecycled] and [Lifecycle Hooks][].

See the [forum post](https://forum.buildkite.community/t/experimental-lambda-based-scaler/425) for more details.

## Publishing Cloudwatch Metrics

The scaler collects it's own metrics and doesn't require [buildkite-agent-metrics][]. It supports
optionally publishing the metrics it collects back to Cloudwatch, although it only supports a subset
of the metrics that the [buildkite-agent-metrics][] binary collects:

* Buildkite > (Org, Queue) > `ScheduledJobsCount`
* Buildkite > (Org, Queue) > `RunningJobCount`

## Running as an AWS Lambda

An AWS Lambda bundle is created and published as part of the build process. The lambda will require
the following IAM permissions:

- `cloudwatch:PutMetricData`
- `autoscaling:DescribeAutoScalingGroups`
- `autoscaling:DescribeScalingActivities`
- `autoscaling:SetDesiredCapacity`

Its handler is `bootstrap`, it uses a `provided.al2` runtime and requires the following env vars:

- `BUILDKITE_AGENT_TOKEN` or `BUILDKITE_AGENT_TOKEN_SSM_KEY`
- `BUILDKITE_QUEUE`
- `AGENTS_PER_INSTANCE`
- `ASG_NAME`

If `BUILDKITE_AGENT_TOKEN_SSM_KEY` is set, the token will be read from
[AWS Systems Manager Parameter Store GetParameter](https://docs.aws.amazon.com/systems-manager/latest/APIReference/API_GetParameter.html)
which [can also read from AWS Secrets Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/integration-ps-secretsmanager.html).

```bash
aws lambda create-function \
--function-name buildkite-agent-scaler \
--memory 128 \
--role arn:aws:iam::account-id:role/execution_role \
--runtime provided.al2 \
--zip-file fileb://handler.zip \
--handler bootstrap
```

## Running locally for development

```
$ aws-vault exec my-profile -- go run . \
--asg-name elastic-runners-AgentAutoScaleGroup-XXXXX
--agent-token "$BUILDKITE_AGENT_TOKEN"
```

## Using Clusters

The `BUILDKITE_AGENT_TOKEN` is scoped to a specific cluster. It's best to create a unique token for
the cluster being targeted by the scaler.

The scaler is set up automatically by the [Elastic CI Stack][]'s CloudFormation templates, which
reference the agent token and a queue name. A Lambda function running the scaler is then generated
using these references (e.g., `BUILDKITE_AGENT_TOKEN_SSM_KEY` and `BUILDKITE_QUEUE`).

## Copyright

[Elastic CI Stack]: https://github.com/buildkite/elastic-ci-stack-for-aws
[buildkite-agent-metrics]: https://github.com/buildkite/buildkite-agent-metrics
[Lifecycle Hooks]: https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html
[lifecycled]: https://github.com/buildkite/lifecycled

ecosyste.ms

Data

Tools

Indexes

Applications

Experiments

Awesome

https://github.com/buildkite/buildkite-agent-scaler

Awesome Lists containing this project

README