Right-Sizing an Idle Kubernetes Platform

How an idle Talos Kubernetes cluster was reduced from long-lived autoscale spend to a static baseline, and what that implies for capacity policy, GitOps, and future platform design.

An idle Kubernetes platform can be expensive without serving meaningful traffic. The dominant cost is not request volume. The dominant cost is the shape of the baseline: always-on nodes, attached volumes, a load balancer, and scheduler decisions that keep elastic capacity alive after it was created.

The reduction described here moved the cluster from a five-to-six-node operating shape back to the intended four-node baseline: one Talos control plane, three static Talos workers, and zero active autoscale workers. The important part was not deleting two servers. The important part was making sure the scheduler and GitOps state no longer had a reason to create them again.

Starting State

The platform is a Talos Linux Kubernetes cluster in Hetzner Cloud, deployed in Ashburn. The stable baseline is:

  • 1 x cpx31 control plane, running Talos and Kubernetes control plane components.
  • 3 x cpx21 static workers for normal platform and application workloads.
  • 1 x lb11 Hetzner load balancer for ingress.
  • 260 GB of persistent volumes across registry, database, monitoring, storage, and application state.
  • A cluster-autoscaler node pool named worker-autoscale configured with min=0 and max=5.

Target platform topology
flowchart LR
  subgraph external["External edge"]
    cloudflare["Cloudflare DNS / Access"]
    lb["Hetzner LB11"]
  end

  subgraph cluster["talos-redux Kubernetes cluster"]
    cp["Control plane\n1 x CPX31"]

    subgraph static["Static worker baseline"]
      w1["worker-1\nCPX21"]
      w2["worker-2\nCPX21"]
      w3["worker-3\nCPX21"]
    end

    subgraph elastic["Elastic worker pool"]
      autoscale["worker-autoscale\nmin 0 / max 5"]
    end

    argocd["ArgoCD"]
    registry["Private registry"]
    observability["Prometheus / Grafana / Loki"]
    apps["Application workloads"]
  end

  cloudflare --> lb
  lb --> w1
  lb --> w2
  lb --> w3
  argocd --> apps
  registry --> apps
  autoscale -. "burst jobs only" .-> apps

The static worker pool carries the idle baseline. The autoscale pool exists for burst work, not for always-on services.

The live cluster had drifted from that target. Two autoscale workers were running for long periods, and a third was created during remediation when request pressure briefly made a baseline pod unschedulable. The expensive state was therefore not just "autoscaler exists"; it was "baseline workloads are eligible to land on autoscaler capacity, and some requests are large enough that the scheduler asks for more nodes."

cost-before-after.txt
Before:
  control plane: 1 x cpx31
  static workers: 3 x cpx21
  autoscale workers: 2 x cpx21, long-lived
  load balancer: 1 x lb11
  volumes: 260 GB
  result: about $108-$110/month by hcloud API pricing

After:
  control plane: 1 x cpx31
  static workers: 3 x cpx21
  autoscale workers: 0 x cpx21
  load balancer: 1 x lb11
  volumes: 260 GB
  result: about $84-$86/month by hcloud API pricing

The hcloud API reported Ashburn monthly prices of $11.99 for cpx21, $20.99 for cpx31, and $7.49 for lb11. Removing two long-lived cpx21 autoscale nodes removed about $23.98/month of compute spend. The remaining $84-$86/month is baseline infrastructure: four servers, one load balancer, persistent volumes, and small snapshot overhead.
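The arithmetic behind those figures can be checked with a short sketch. It uses only the three prices quoted above; volume and snapshot pricing is not itemized in this write-up, so the sketch covers servers and the load balancer only, and the function name is illustrative.

```python
from decimal import Decimal

# Monthly Ashburn prices in USD, as quoted above from the hcloud API.
PRICES = {
    "cpx21": Decimal("11.99"),
    "cpx31": Decimal("20.99"),
    "lb11": Decimal("7.49"),
}


def fixed_monthly_cost(counts):
    """Sum server and load balancer cost; volumes and snapshots are excluded."""
    return sum(PRICES[item] * qty for item, qty in counts.items())


# Before: 3 static + 2 long-lived autoscale cpx21 workers.
before = fixed_monthly_cost({"cpx31": 1, "cpx21": 5, "lb11": 1})
# After: 3 static cpx21 workers only.
after = fixed_monthly_cost({"cpx31": 1, "cpx21": 3, "lb11": 1})

print(before)          # 88.43
print(after)           # 64.45
print(before - after)  # 23.98
```

Servers and the load balancer account for $64.45 of the remaining floor. Subtracting that from $84-$86 leaves roughly $19.55 to $21.55 per month for persistent volumes and snapshot overhead.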

Why The Bill Was High

The system had no meaningful traffic, but Kubernetes cost is not strictly traffic-correlated. A cluster accrues cost when infrastructure exists, not when packets flow. For this cluster, the recurring spend came from four categories:

  • Compute baseline: the control plane and three static workers are always on.
  • Elastic compute leakage: autoscale nodes were created in response to scheduling pressure and then stayed alive because baseline workloads landed on them.
  • Persistent storage: volumes remain attached and billed regardless of HTTP traffic.
  • Network edge: the load balancer is fixed monthly infrastructure.

The key operational distinction is between utilization and requests. Kubernetes scheduling is driven by declared requests, not by live usage. A pod using 40 MiB can reserve 512 MiB. A node using 2.4 GiB can still be considered full if requested memory is near allocatable memory. The scheduler has to be conservative because requests are the contract a pod makes with the cluster.
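A minimal sketch of that distinction, assuming memory is the only resource considered. The real scheduler evaluates many more dimensions, and the node sizes below are hypothetical. Live usage is deliberately absent from the function's inputs.

```python
def fits(pod_request_mib, node_allocatable_mib, node_requested_mib):
    """Return True if the pod's memory request fits in the node's unreserved memory.

    Live usage is not a parameter: placement is decided from declared
    requests, not from what pods currently consume.
    """
    return node_requested_mib + pod_request_mib <= node_allocatable_mib


# Hypothetical node: 3584 MiB allocatable with 3400 MiB already requested.
# Its live usage could be far lower and the answers would not change.
print(fits(512, 3584, 3400))  # False: 3912 MiB would exceed allocatable
print(fits(128, 3584, 3400))  # True: 3528 MiB still fits
```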

How a baseline pod can create elastic spend
flowchart TD
  pending["Pod is Pending"] --> scheduler["Kubernetes scheduler"]
  scheduler --> requests["Evaluate requests, node selectors, affinities, taints, PV topology"]
  requests --> fit{"Fits static workers?"}
  fit -->|"yes"| static["Schedule on nodepool=worker"]
  fit -->|"no"| autoscaler["cluster-autoscaler evaluates scale-up"]
  autoscaler --> allowed{"Pod can run on autoscale pool?"}
  allowed -->|"yes"| newnode["Create CPX21 autoscale worker"]
  allowed -->|"no"| remain["Remain Pending"]
  newnode --> cost["Idle server can persist if baseline workload lands there"]

The autoscaler responds to unschedulable pods. It does not know whether the pod represents real traffic, drift, or an over-reserved idle service.

That made resource requests the main cost-control surface. CPU and memory requests were not only performance declarations; they were the inputs that decided whether the scheduler could keep the platform inside the static worker baseline.
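The decision flow above can be approximated in a few lines. This is a simplified model rather than the scheduler's or the autoscaler's actual logic: it looks only at memory requests and pool eligibility, and the function and parameter names are illustrative.

```python
STATIC_POOL = "worker"
ELASTIC_POOL = "worker-autoscale"


def placement_decision(pod_request_mib, allowed_pools, static_free_mib):
    """Model the flow: static fit first, then scale-up, otherwise Pending.

    allowed_pools: nodepool label values the pod's affinity accepts.
    static_free_mib: unreserved (allocatable minus requested) memory per static worker.
    """
    if STATIC_POOL in allowed_pools and any(
        free >= pod_request_mib for free in static_free_mib
    ):
        return "schedule on static worker"
    if ELASTIC_POOL in allowed_pools:
        return "scale up autoscale pool"
    return "remain Pending"


free = [100, 200, 300]  # hypothetical headroom on the three static workers
print(placement_decision(512, {"worker", "worker-autoscale"}, free))  # scale up autoscale pool
print(placement_decision(512, {"worker"}, free))                      # remain Pending
print(placement_decision(128, {"worker"}, free))                      # schedule on static worker
```

The first call is the expensive case: an over-reserved baseline pod that is still eligible for the elastic pool turns a request mismatch into a new server.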

GitOps Drift

The first hard blocker was ArgoCD itself. The repository-level Terraform values set a global node selector for ArgoCD components: nodepool=worker. The live Helm release had a more specific controller override: controller.nodeSelector.nodepool=worker-autoscale. That component-specific value won over the global value.
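The precedence described here can be modeled as a simple fallback. This sketch mirrors the behavior observed in this incident; it is not the chart's actual template logic, and the function name is hypothetical.

```python
def effective_node_selector(global_values, component_values):
    """Return the component's nodeSelector if it is set, else the global one."""
    return component_values.get("nodeSelector") or global_values.get("nodeSelector", {})


global_values = {"nodeSelector": {"nodepool": "worker"}}
live_controller = {"nodeSelector": {"nodepool": "worker-autoscale"}}

# The component-specific override wins over the global value.
print(effective_node_selector(global_values, live_controller))  # {'nodepool': 'worker-autoscale'}
# Without an override, the global value applies.
print(effective_node_selector(global_values, {}))               # {'nodepool': 'worker'}
```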

When an autoscale node was drained, the ArgoCD application controller was evicted and could not schedule on the static worker pool because its live StatefulSet still demanded worker-autoscale. The durable fix was to make the controller-specific value explicit in Terraform rather than only patching the live StatefulSet.

infra/prod/main.tf
controller = {
  replicas = 1
  nodeSelector = {
    nodepool = "worker"
  }
  resources = {
    requests = {
      cpu    = "100m"
      memory = "256Mi"
    }
    limits = {
      memory = "2Gi"
    }
  }
}

This is the main GitOps lesson: a live patch is only a repair if the source of truth agrees with it. Otherwise the reconciler is doing its job when it reverts the patch.

Request Pressure

After ArgoCD was moved back to the static worker pool, the next problem was pure scheduler pressure. The static worker nodes were close to full by requested memory even though live memory usage was lower. Before trimming, two workers were around 96 percent requested memory and the third was above 80 percent. That left too little room for rescheduling after draining autoscale capacity.

The remediation lowered requests for idle or over-reserved services while leaving limits high enough for bursts. This changed scheduling economics without removing the ability for workloads to use more memory when the kernel and kubelet allow it.

request-reductions.txt
Workload                        Before               After
ArgoCD application controller   512Mi request        256Mi request
hosted Hermes web agent         768Mi request        384Mi request
Hermes agent gateway            512Mi request        128Mi request
BigCartBuddy OCR service        512Mi request        128Mi request
Kubernetes Dashboard            800Mi total          scaled to zero
Hermes baseline affinities      worker, autoscale    worker only

The Kubernetes Dashboard was scaled to zero because it is an idle admin surface and was reserving approximately 800 MiB across its pods. The dashboard can be restored when needed, but it does not need to participate in the idle baseline.
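As a rough tally, the reductions listed above can be summed to estimate how much requested memory was released. This sketch assumes each workload runs a single replica, which this write-up does not state; the Dashboard figure is already a total across its pods.

```python
# (workload, request before in MiB, request after in MiB), taken from the table above.
# Assumption: one replica per workload, except the Dashboard total.
REDUCTIONS = [
    ("ArgoCD application controller", 512, 256),
    ("hosted Hermes web agent", 768, 384),
    ("Hermes agent gateway", 512, 128),
    ("BigCartBuddy OCR service", 512, 128),
    ("Kubernetes Dashboard", 800, 0),
]


def freed_mib(rows):
    """Total requested memory released by the reductions, in MiB."""
    return sum(before - after for _, before, after in rows)


print(freed_mib(REDUCTIONS))  # 2208
```

Under that assumption, about 2.2 GiB of requested memory returned to the static pool without changing any limit.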

The Hermes baseline deployments were also changed to remove worker-autoscale from their required node affinity. An always-on service should not match the elastic pool. The elastic pool should serve short-lived jobs, burst runners, batch workloads, and other capacity that can disappear without keeping the platform alive.

baseline-worker-affinity.yaml
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nodepool
              operator: In
              values:
                - worker
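To make the effect of that affinity concrete, here is a minimal sketch of how a required node affinity with the In operator is evaluated against node labels. Terms are ORed and expressions within a term are ANDed. Only In is handled, and the function name is illustrative rather than a Kubernetes API.

```python
def matches_required_affinity(node_labels, node_selector_terms):
    """Return True if any term matches; a term matches when all its expressions do."""
    for term in node_selector_terms:
        expressions = term.get("matchExpressions", [])
        if not expressions:
            continue  # an empty term selects nothing
        for expr in expressions:
            if expr["operator"] != "In":
                raise ValueError(f"unsupported operator in this sketch: {expr['operator']}")
            if node_labels.get(expr["key"]) not in expr["values"]:
                break
        else:
            return True  # every expression in this term matched
    return False


terms = [{"matchExpressions": [{"key": "nodepool", "operator": "In", "values": ["worker"]}]}]

print(matches_required_affinity({"nodepool": "worker"}, terms))            # True
print(matches_required_affinity({"nodepool": "worker-autoscale"}, terms))  # False
```

With worker-autoscale removed from the values list, an autoscale node can no longer satisfy the pod, so an unschedulable baseline pod stays Pending instead of justifying a new server.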

The Worker Failure

During the reduction, one static worker stopped posting kubelet status and moved to NotReady. The registry pod was stuck in ContainerCreating, private image pulls returned temporary 503s from the registry endpoint, and kubelet log requests to that worker timed out.

The sequence mattered:

  • The node was cordoned so new work would not be scheduled onto a weak worker.
  • A provider-level reboot was attempted first.
  • The node remained unhealthy, so a provider hard reset was used.
  • Kubernetes reported the node Ready after the reset.
  • The node was uncordoned only after kubelet had re-registered.
  • The registry, ArgoCD controller, and private-image application rollouts were allowed to complete before continuing autoscale deletion.

That avoided compounding the cost reduction with an availability incident. Private registries are dependency concentrators. If the registry is down, every new pod that needs a private image is vulnerable to ImagePullBackOff. The safe order was registry recovery first, autoscale deletion second.

Drain, Delete, Verify

Autoscale nodes were drained before deletion. Draining gives controllers a chance to recreate pods on valid nodes, forces volume detach/attach flows to complete, and reveals unexpected placement constraints before the cloud server disappears.

autoscale-removal.sh
# Drain each autoscale node so controllers can recreate pods on static workers.
kubectl drain talos-redux-worker-autoscale-5c851c3b70f45cfd \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=15m

kubectl drain talos-redux-worker-autoscale-6dd552a4eaa30c52 \
  --ignore-daemonsets \
  --delete-emptydir-data \
  --timeout=10m

# Remove the drained nodes from cluster membership.
kubectl delete node \
  talos-redux-worker-autoscale-5c851c3b70f45cfd \
  talos-redux-worker-autoscale-6dd552a4eaa30c52

# Delete the cloud servers so they stop being billed.
hcloud server delete \
  talos-redux-worker-autoscale-5c851c3b70f45cfd \
  talos-redux-worker-autoscale-6dd552a4eaa30c52

Deleting the Kubernetes node object and deleting the Hetzner server are both required. Deleting the node object removes cluster membership. Deleting the cloud server removes the billable resource.

Verification was deliberately delayed. A cluster can look clean immediately after a drain and still recreate autoscale capacity a minute later if one pod remains unschedulable. The final checks were run after a short wait to confirm that the autoscaler did not create another node.

post-change-verification.sh
kubectl get nodes -o wide
kubectl get pods -A --field-selector=status.phase=Pending -o wide
kubectl get pods -A -o wide | rg 'ImagePull|ContainerCreating|Pending|CrashLoop|Unknown|Terminating|Error'
kubectl -n argocd get applications
hcloud server list
kubectl top nodes
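The delayed re-check can also be expressed as a small helper. This is a sketch under assumptions: list_node_names is a hypothetical callable (for example, a wrapper around a node listing command), the 120-second default is illustrative since the actual wait is not specified here, and autoscale nodes are identified by the worker-autoscale substring seen in the node names above.

```python
import time


def autoscale_nodes(node_names, marker="worker-autoscale"):
    """Return the node names that belong to the autoscale pool."""
    return [name for name in node_names if marker in name]


def verify_after_wait(list_node_names, wait_seconds=120, sleep=time.sleep):
    """Wait, then list nodes again and return any autoscale nodes that exist.

    An empty result means the autoscaler did not recreate capacity.
    """
    sleep(wait_seconds)
    return autoscale_nodes(list_node_names())


# Example with a stubbed node list and a no-op sleep (the first name is hypothetical).
names = ["talos-redux-worker-1", "talos-redux-worker-autoscale-5c851c3b70f45cfd"]
print(verify_after_wait(lambda: names, sleep=lambda _: None))
# ['talos-redux-worker-autoscale-5c851c3b70f45cfd']
```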

The final verification state was:

  • Four Kubernetes nodes: one control plane and three static workers.
  • Four Hetzner servers: the same one control plane and three workers.
  • Zero active autoscale servers.
  • No Pending pods.
  • No matching pod errors for ImagePull, ContainerCreating, CrashLoop, Unknown, or Terminating.
  • Every ArgoCD Application reported Synced and Healthy.

The post-reduction node utilization check showed the tighter baseline: the control plane around 66 percent memory, worker-1 around 65 percent, worker-2 around 95 percent, and worker-3 around 78 percent. That is an acceptable idle/prepared-state posture, but it is not excess capacity. Real traffic or larger jobs should use autoscale again.

What Changed In Git

The durable changes were committed in the platform repository and in the application repository that owns BigCartBuddy's generated manifests.

  • talos-redux: 3a63368 chore: keep baseline workloads on static workers
  • bigcartbuddy: 32f620a chore: lower idle OCR resource requests

The platform commit did three things:

  • Made ArgoCD application-controller placement and requests explicit in Terraform.
  • Reduced hosted Hermes web-agent requests.
  • Removed autoscale eligibility from the Hermes baseline services.

The application commit reduced BigCartBuddy OCR's idle request from a conservative boot-time reserve to a smaller steady-state reserve. The memory limit stayed high so OCR can still burst when needed.

Operating Model

The target operating model is a static baseline plus explicit elastic capacity:

  • Static worker pool: ArgoCD, ingress, registry, databases, monitoring, and low-traffic application frontends.
  • Elastic worker pool: build runners, batch jobs, security scans, temporary compute, and burst traffic that can tolerate node creation latency.
  • GitOps source of truth: placement and requests must live in Git, not only in live patches.
  • Cost alerts: autoscale nodes should be treated as temporary capacity. A long-lived autoscale node is either real demand or a placement/request bug.
  • Security posture: Trivy, RBAC/config scanning, security-posture jobs, image updater controls, and cluster-health reporting remain part of the platform baseline.
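The cost-alert rule in the list above could be prototyped as an age check on autoscale nodes. The six-hour threshold, the node names, and the function name are all assumptions for illustration, and creation timestamps would have to come from the node or server metadata.

```python
from datetime import datetime, timedelta, timezone


def long_lived_autoscale_nodes(created_at_by_name, now,
                               max_age=timedelta(hours=6),
                               marker="worker-autoscale"):
    """Return autoscale node names older than max_age, sorted by name."""
    return sorted(
        name
        for name, created in created_at_by_name.items()
        if marker in name and now - created > max_age
    )


now = datetime(2024, 1, 2, 12, 0, tzinfo=timezone.utc)  # fixed time for the example
nodes = {
    "talos-redux-worker-1": datetime(2023, 6, 1, tzinfo=timezone.utc),
    "talos-redux-worker-autoscale-a": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    "talos-redux-worker-autoscale-b": datetime(2024, 1, 2, 11, 0, tzinfo=timezone.utc),
}

print(long_lived_autoscale_nodes(nodes, now))  # ['talos-redux-worker-autoscale-a']
```

A flagged node is not automatically a bug. It is a prompt to decide whether it reflects real demand or a placement or request problem.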

The cluster autoscaler remains useful. The reduction did not remove autoscaling. It removed accidental baseline dependence on autoscaling. The next version of the platform should make that distinction impossible to miss.

What This Moves Towards

The direction is a platform that can sit idle cheaply, receive traffic safely, and scale only when there is a concrete reason to scale.

That requires a few explicit rules:

  • Baseline services must not match the autoscale node pool.
  • Resource requests should describe steady-state need, while limits should describe tolerated burst.
  • Autoscale nodes should be observable as an event, not accepted as background noise.
  • Stateful workloads need placement contracts that avoid unnecessary volume churn.
  • Administrative surfaces that are not needed continuously should be scaled to zero or gated behind an operational runbook.
  • Reconciliation state should be checked after live repairs so drift does not silently return.

The end state is not "minimum spend at all costs." The end state is controlled spend: a known idle floor, explicit burst capacity, and enough telemetry to explain any increase. In the current state, the known idle floor is about $84-$86/month. Any autoscale node above that floor should be explainable by a workload, a release, a security scan, or a capacity policy.