Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/buildkite/buildkite-agent-scaler
📈A lambda for scaling an AutoScalingGroup based on Buildkite metrics
https://github.com/buildkite/buildkite-agent-scaler
aws buildkite lambda
Last synced: 3 days ago
JSON representation
📈A lambda for scaling an AutoScalingGroup based on Buildkite metrics
- Host: GitHub
- URL: https://github.com/buildkite/buildkite-agent-scaler
- Owner: buildkite
- License: mit
- Created: 2017-12-06T05:55:16.000Z (about 7 years ago)
- Default Branch: master
- Last Pushed: 2024-12-16T00:37:54.000Z (26 days ago)
- Last Synced: 2025-01-01T11:07:07.740Z (10 days ago)
- Topics: aws, buildkite, lambda
- Language: Go
- Homepage:
- Size: 5.16 MB
- Stars: 61
- Watchers: 19
- Forks: 27
- Open Issues: 9
-
Metadata Files:
- Readme: README.md
- Changelog: CHANGELOG.md
- License: LICENSE.txt
Awesome Lists containing this project
README
# Buildkite Agent Scaler
An AWS lambda function that handles the scaling of an
[Amazon Autoscaling Group](https://docs.aws.amazon.com/autoscaling/ec2/userguide/AutoScalingGroup.html)
(ASG) based on metrics provided by the Buildkite Agent Metrics API.In practice, we've seen 300% faster initial scale-ups with this lambda vs native AutoScaling rules.
🚀## Why?
The [Elastic CI Stack][] depends on being able to scale up quickly from zero instances in response
to scheduled Buildkite jobs. Amazon's AutoScaling primatives have a number of limitations that we
wanted more granular control over:* The median time for a scaling event to be triggered was 2 minutes, due to needing two samples with
a minimum period of 60 seconds between.
* Scaling can either be by a fixed rate, a fixed step size or tracking, but tracking doesn't work
well with custom metrics like we use.## How does it work?
The lambda (or cli version) polls the Buildkite Metrics API every 10 seconds, and based on the
results sets the `DesiredCount` to exactly what is needed. This allows much faster scale up.## Gracefully scaling in
Whilst the lambda does support scaling in via setting `DesiredCount`, Amazon ASGs appear to not send
[Lifecycle Hooks][] before terminating instances, so jobs in progress are interrupted.Instead, in the [Elastic CI Stack][] we run the scaler with scale-in disabled (`DISABLE_SCALE_IN`)
and rely on the
[recent addition in buildkite-agent v3.10.0](https://github.com/buildkite/agent/releases/tag/v3.10.0)
of `--disconnect-after-idle-timeout` in the Agent combined with a
[systemd PostStop script](https://github.com/buildkite/elastic-ci-stack-for-aws/blob/00c45ab47160b1d1d44c0b3bea8456456444c60e/packer/linux/conf/bin/bk-install-elastic-stack.sh#L136-L143)
to terminate the instance and atomically decrease the `DesiredCount` after the agent has been idle
for a time period. We've found it to work really well, and is less complicated than relying on
[lifecycled] and [Lifecycle Hooks][].See the [forum post](https://forum.buildkite.community/t/experimental-lambda-based-scaler/425) for more details.
## Publishing Cloudwatch Metrics
The scaler collects it's own metrics and doesn't require [buildkite-agent-metrics][]. It supports
optionally publishing the metrics it collects back to Cloudwatch, although it only supports a subset
of the metrics that the [buildkite-agent-metrics][] binary collects:* Buildkite > (Org, Queue) > `ScheduledJobsCount`
* Buildkite > (Org, Queue) > `RunningJobCount`## Running as an AWS Lambda
An AWS Lambda bundle is created and published as part of the build process. The lambda will require
the following IAM permissions:- `cloudwatch:PutMetricData`
- `autoscaling:DescribeAutoScalingGroups`
- `autoscaling:DescribeScalingActivities`
- `autoscaling:SetDesiredCapacity`Its handler is `bootstrap`, it uses a `provided.al2` runtime and requires the following env vars:
- `BUILDKITE_AGENT_TOKEN` or `BUILDKITE_AGENT_TOKEN_SSM_KEY`
- `BUILDKITE_QUEUE`
- `AGENTS_PER_INSTANCE`
- `ASG_NAME`If `BUILDKITE_AGENT_TOKEN_SSM_KEY` is set, the token will be read from
[AWS Systems Manager Parameter Store GetParameter](https://docs.aws.amazon.com/systems-manager/latest/APIReference/API_GetParameter.html)
which [can also read from AWS Secrets Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/integration-ps-secretsmanager.html).```bash
aws lambda create-function \
--function-name buildkite-agent-scaler \
--memory 128 \
--role arn:aws:iam::account-id:role/execution_role \
--runtime provided.al2 \
--zip-file fileb://handler.zip \
--handler bootstrap
```## Running locally for development
```
$ aws-vault exec my-profile -- go run . \
--asg-name elastic-runners-AgentAutoScaleGroup-XXXXX
--agent-token "$BUILDKITE_AGENT_TOKEN"
```## Using Clusters
The `BUILDKITE_AGENT_TOKEN` is scoped to a specific cluster. It's best to create a unique token for
the cluster being targeted by the scaler.The scaler is set up automatically by the [Elastic CI Stack][]'s CloudFormation templates, which
reference the agent token and a queue name. A Lambda function running the scaler is then generated
using these references (e.g., `BUILDKITE_AGENT_TOKEN_SSM_KEY` and `BUILDKITE_QUEUE`).## Copyright
Copyright (c) 2014-2019 Buildkite Pty Ltd. See [LICENSE](./LICENSE.txt) for details.
[Elastic CI Stack]: https://github.com/buildkite/elastic-ci-stack-for-aws
[buildkite-agent-metrics]: https://github.com/buildkite/buildkite-agent-metrics
[Lifecycle Hooks]: https://docs.aws.amazon.com/autoscaling/ec2/userguide/lifecycle-hooks.html
[lifecycled]: https://github.com/buildkite/lifecycled