Self-Hosted GitHub Runners on Spot Instances

GitHub-hosted runners cost $0.008 per minute (Linux, 2 vCPU / 7 GB). If your team runs 5,000 minutes of CI per day, that’s $1,200/month. For a startup with a moderately active engineering team, CI is often one of the largest single line items on the cloud bill.

The standard alternative — persistent self-hosted runners — solves the cost problem but introduces uptime management: servers that are always on, patching to handle, capacity to predict. On bursty workloads (PR-heavy development, release days), they’re either under-provisioned (queued builds) or over-provisioned (idle machines you’re paying for).

jit-runners is our answer: GitHub Actions runners that provision on demand when a workflow starts, use EC2 Spot for 60–80% cost reduction versus on-demand, and terminate when the job finishes.

The architecture

GitHub Actions workflow queued
        │
        ▼
GitHub webhook (workflow_job: queued)
        │
        ▼
API Gateway → Lambda (Go)
        │
        ├── EC2 RunInstances (Spot, user data: GitHub runner install + register)
        │
        ▼
EC2 instance comes online
        │
        ├── Registers with GitHub as ephemeral runner
        │
        ▼
GitHub assigns job to runner
        │
        ├── Job executes
        │
        ▼
workflow_job: completed webhook
        │
        ├── Lambda terminates instance (or instance self-terminates on idle)

Three components:

Lambda function (Go) — receives webhook events, provisions runners, handles termination
EC2 Spot instances — ephemeral runners with JIT registration tokens
GitHub App — handles webhook delivery and runner registration tokens

Why Lambda + Go

Lambda is the right compute for this use case:

Event-driven — webhook fires, Lambda runs, done. No polling, no scheduler.
Cost — Lambda invocations for webhook handling are essentially free (well within the free tier)
Scale — 1,000 concurrent PRs? Lambda scales horizontally without configuration

Go is the right language for this Lambda:

Cold start — compiled Go binaries have 10–50ms Lambda cold starts. Python or Node.js are fine too, but Go is genuinely fast.
Single binary — no runtime dependencies, trivial deployment (GOARCH=amd64 GOOS=linux go build)
AWS SDK v2 — the official Go SDK is well-maintained and performant

The Lambda function is small: parse webhook event, call EC2 RunInstances with a user data script, handle errors. The whole thing fits comfortably in a single file.
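The parsing step can be sketched with the standard library alone. This is a minimal illustration, not the jit-runners source: the `decision` type, the `decide` function, and the rule of only reacting to jobs that request the `spot` label are assumptions made for the example. The actual RunInstances and TerminateInstances calls are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// workflowJobEvent holds the subset of the workflow_job webhook payload
// that this sketch needs.
type workflowJobEvent struct {
	Action      string `json:"action"`
	WorkflowJob struct {
		ID         int64    `json:"id"`
		Labels     []string `json:"labels"`
		RunnerName string   `json:"runner_name"`
	} `json:"workflow_job"`
}

// decision is a hypothetical type describing what the handler does next.
type decision string

const (
	provision decision = "provision"
	terminate decision = "terminate"
	ignore    decision = "ignore"
)

// decide parses a webhook body and maps it to an action. Jobs that do not
// request the "spot" label are ignored (an assumption of this sketch).
func decide(body []byte) (decision, workflowJobEvent, error) {
	var ev workflowJobEvent
	if err := json.Unmarshal(body, &ev); err != nil {
		return ignore, ev, fmt.Errorf("parse webhook: %w", err)
	}
	if !hasLabel(ev.WorkflowJob.Labels, "spot") {
		return ignore, ev, nil
	}
	switch ev.Action {
	case "queued":
		return provision, ev, nil
	case "completed":
		return terminate, ev, nil
	default: // for example "in_progress"
		return ignore, ev, nil
	}
}

// hasLabel reports whether want appears in labels.
func hasLabel(labels []string, want string) bool {
	for _, l := range labels {
		if l == want {
			return true
		}
	}
	return false
}

func main() {
	body := []byte(`{"action":"queued","workflow_job":{"id":42,"labels":["self-hosted","linux","spot"]}}`)
	d, ev, err := decide(body)
	if err != nil {
		panic(err)
	}
	fmt.Printf("job %d: %s\n", ev.WorkflowJob.ID, d)
}
```

Keeping the decision separate from the AWS calls makes the handler easy to test without an AWS account.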

EC2 Spot for runners

Spot instances offer unused EC2 capacity at a 60–80% discount versus on-demand. The risk: AWS can reclaim the capacity with two minutes’ notice.

For CI runners, this risk is manageable:

Spot interruption = job failure = job re-run — GitHub Actions does not retry a failed job on its own, but the job can be re-run manually or through automation (for example, the REST API’s “re-run failed jobs” endpoint)
Short job duration — most CI jobs finish in 5–15 minutes. Spot interruption probability over that window is low.
Diversified instance types — using a Spot Fleet or instance type diversification reduces interruption frequency

Our default configuration: c5.2xlarge (8 vCPU / 16 GB) as primary, c5a.2xlarge as fallback. Average Spot price: ~$0.10/hour (about a 70% discount) versus ~$0.34/hour on-demand.

The user data script

When the EC2 instance starts, user data handles runner setup:

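#!/bin/bash
set -euo pipefail

# User data runs as root; config.sh refuses to run as root unless this is set
export RUNNER_ALLOW_RUNASROOT=1

# Install runner. The download URL needs an explicit version, so resolve the
# latest release tag first (consider pinning a version in production).
mkdir -p /actions-runner && cd /actions-runner
RUNNER_VERSION=$(curl -fsSL https://api.github.com/repos/actions/runner/releases/latest \
  | sed -n 's/.*"tag_name": *"v\([^"]*\)".*/\1/p')
curl -fsSL "https://github.com/actions/runner/releases/download/v${RUNNER_VERSION}/actions-runner-linux-x64-${RUNNER_VERSION}.tar.gz" | tar -xz

# Look up the instance ID via IMDSv2 (works on Ubuntu and Amazon Linux)
IMDS_TOKEN=$(curl -fsS -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
INSTANCE_ID=$(curl -fsS -H "X-aws-ec2-metadata-token: ${IMDS_TOKEN}" \
  http://169.254.169.254/latest/meta-data/instance-id)

# Register as ephemeral runner
./config.sh \
  --url "https://github.com/ORG_NAME" \
  --token "REGISTRATION_TOKEN" \
  --name "spot-${INSTANCE_ID}" \
  --labels "self-hosted,linux,spot" \
  --ephemeral \
  --unattended

# Run (exits after one job when --ephemeral)
./run.sh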

The --ephemeral flag is key: the runner deregisters automatically after completing one job. No cleanup needed, no state to manage between runs.

The registration token (REGISTRATION_TOKEN) is fetched by the Lambda function via the GitHub API just before calling RunInstances, then injected into user data. Tokens expire after one hour, which leaves ample time for a newly provisioned instance to boot and register.

Handling Spot interruptions gracefully

GitHub Actions doesn’t automatically retry jobs on runner failure. To handle Spot interruptions:

Termination notice poller — user data starts a background process that polls the EC2 metadata endpoint for termination notices. When a notice arrives, it sends SIGTERM to the runner process.
Runner cancels job — the runner marks the job as canceled on SIGTERM, which is retriable.
Lambda retriggers — when the canceled job is re-run, a new workflow_job: queued event fires for it, and the Lambda provisions a fresh instance.

This adds ~2 minutes of latency on interruption (termination notice → cancel → new instance provision → runner online), on top of re-running the job from the start. For most CI workloads this is acceptable; for time-critical jobs, use on-demand instances instead.

Cost comparison

For a team running 1,000 CI minutes/day:

Approach	Monthly cost
GitHub-hosted runners	$240
EC2 on-demand (c5.2xlarge)	~$180
EC2 Spot (c5.2xlarge, ~70% discount)	~$55

The savings grow linearly with usage. At 5,000 minutes/day: GitHub → $1,200/month, Spot → ~$275/month.
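The arithmetic behind these figures can be reproduced with a few lines. The sketch assumes a 30-day month, the $0.008/minute hosted rate, a $0.34/hour on-demand price, and a 70% Spot discount. It counts raw compute time only, so its EC2 results (roughly $170 and $51 at 1,000 minutes/day) land slightly below the rounded table values, which may include boot time and storage.

```go
package main

import "fmt"

// daysPerMonth is an assumption; the article's figures match a 30-day month.
const daysPerMonth = 30

// hostedMonthly returns the monthly cost of runners billed per minute.
func hostedMonthly(minutesPerDay, ratePerMinute float64) float64 {
	return minutesPerDay * ratePerMinute * daysPerMonth
}

// ec2Monthly returns the monthly cost of instances billed per hour, with
// discount expressed as a fraction (0.70 for a 70% Spot discount).
func ec2Monthly(minutesPerDay, ratePerHour, discount float64) float64 {
	hours := minutesPerDay / 60 * daysPerMonth
	return hours * ratePerHour * (1 - discount)
}

func main() {
	for _, m := range []float64{1000, 5000} {
		fmt.Printf("%.0f min/day: hosted $%.0f, on-demand $%.0f, spot $%.0f\n",
			m,
			hostedMonthly(m, 0.008),
			ec2Monthly(m, 0.34, 0),
			ec2Monthly(m, 0.34, 0.70))
	}
}
```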

The break-even on engineering time to set up jit-runners is typically under a week of CI spend.

What jit-runners handles

The open source jit-runners project provides:

Lambda function (Go) for webhook handling and instance provisioning
Terraform module for Lambda, API Gateway, IAM roles, and security groups
GitHub App configuration guide
User data templates for Ubuntu and Amazon Linux 2
Spot interruption handler sidecar

The deployment guide is in the repository README. Setup takes about 30 minutes if you have AWS credentials and GitHub App access.

jit-runners is open source under MIT. If you hit a Spot interruption rate that’s causing real problems, open an issue — there are several approaches to mitigation we haven’t implemented yet.