case study

Production Kubernetes on Talos + ArgoCD

A declarative, GitOps-driven cluster on Hetzner.

Role: Sole engineer
Stack: Talos, ArgoCD, Hetzner, Kubernetes
Outcome: Zero-touch deploys, fully auditable infra.

The Problem

Running Kubernetes on general-purpose Linux distributions introduces a class of operational risk that has nothing to do with your application: the underlying OS is mutable, SSH-accessible, and capable of drifting silently from its intended state. You can install packages, modify system files, and make configuration changes that are never recorded anywhere. Over time, the gap between “what the documentation says the node should look like” and “what is actually running” grows in ways that only manifest when something breaks.

The secondary problem was tooling sprawl. Without a strong operational model, it is easy to accumulate a mix of kubectl apply commands run from local terminals, Helm releases with values overrides that exist only in someone’s shell history, and configuration that lives nowhere except the running cluster. This is survivable until it is not — until a node needs to be replaced, or a cluster needs to be rebuilt, or a new engineer needs to understand what is actually deployed.

I needed an approach where the cluster was reproducible from source, every change was auditable, and the operational overhead of maintenance was as low as possible — without trading off control.

Architecture

Infrastructure layer: Hetzner Cloud on Talos Linux

Talos Linux was the right choice for this environment for a specific reason: it removes SSH entirely. Talos is purpose-built for running Kubernetes, with a minimal read-only filesystem and a machine configuration API (via talosctl) as the only management interface. There is no shell to log into, no package manager to use, no way to make ad-hoc changes to a node that are not captured in the machine configuration spec.

All node configuration — Kubernetes API server parameters, kubelet settings, etcd configuration, CNI setup — is expressed in Talos machine config files, stored in the infrastructure repository. Applying a change to a node means updating the config file and applying it through talosctl. If a node is destroyed and rebuilt, it comes back configured exactly as the spec describes.
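
As a rough illustration of that workflow, the sketch below builds the talosctl apply-config invocation for a single node from a config file kept in the repository. The helper name, node address, and file name are hypothetical placeholders; only the apply-config subcommand and its --nodes, --file, and --dry-run flags come from talosctl itself.

```python
import shlex


def apply_config_command(node_ip, config_path, dry_run=False):
    """Return the argv for applying one machine config file to one node."""
    argv = [
        "talosctl", "apply-config",
        "--nodes", node_ip,           # node to configure (placeholder address)
        "--file", str(config_path),   # machine config tracked in git
    ]
    if dry_run:
        # Ask talosctl to report what would change without applying it.
        argv.append("--dry-run")
    return argv


if __name__ == "__main__":
    # Hypothetical node and file name, printed rather than executed.
    print(shlex.join(apply_config_command("10.0.0.2", "worker-1.yaml", dry_run=True)))
```

Building the command as a list rather than a shell string keeps it safe to hand to subprocess.run without quoting concerns, and a dry run first makes the effect of a merged change visible before it touches the node.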

Hetzner Cloud provides the underlying compute. Node lifecycle — provisioning, rebooting, replacing — is handled via the Hetzner API, driven from the infrastructure repository rather than from the Hetzner console.
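
The provisioning tooling itself is not shown here, so the following is only a sketch of the underlying idea: compare the node list declared in the repository with the servers the Hetzner API reports, and derive the lifecycle actions from the difference. NodeSpec, plan, and the node names, server types, and location are all hypothetical; no real API call is made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeSpec:
    """One node as declared in the repository (hypothetical shape)."""
    name: str
    server_type: str
    location: str


def plan(desired, actual):
    """Compare declared nodes with existing servers and list the actions needed."""
    want = {node.name: node for node in desired}
    have = {node.name: node for node in actual}
    return {
        # Declared in git but not yet provisioned.
        "create": sorted(want.keys() - have.keys()),
        # Running but no longer declared in git.
        "delete": sorted(have.keys() - want.keys()),
        # Present in both, but the running server no longer matches the spec.
        "replace": sorted(n for n in want.keys() & have.keys() if want[n] != have[n]),
    }


if __name__ == "__main__":
    desired = [NodeSpec("cp-1", "cx22", "fsn1"), NodeSpec("worker-1", "cx32", "fsn1")]
    actual = [NodeSpec("cp-1", "cx22", "fsn1"), NodeSpec("worker-1", "cx22", "fsn1")]
    print(plan(desired, actual))
```

Because Talos nodes carry no hand-made state, treating a spec mismatch as "replace" rather than "modify in place" is a reasonable default in this kind of setup.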

Application delivery: ArgoCD

ArgoCD runs inside the cluster and watches a set of git repositories. Every application deployed to the cluster has an ArgoCD Application resource that declares: where the source manifests live, what target namespace and cluster to deploy to, and what sync policy to apply.

The sync policy in this setup is configured for automated sync with self-healing: if someone manually modifies a resource in the cluster — whether by accident or by kubectl apply from a terminal — ArgoCD detects the drift and corrects it back to the state in git. The git repository is the authoritative source of truth, not the live cluster state.
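
To make the shape of such a resource concrete, here is a minimal sketch that builds an Application manifest as a Python dictionary and prints it as JSON, which kubectl and ArgoCD accept alongside YAML. The application name, repository URL, path, and namespace are placeholders, and the argocd namespace and default project are assumptions about a stock install rather than details of this cluster.

```python
import json


def application(name, repo_url, path, namespace, revision="HEAD"):
    """Build an ArgoCD Application with automated, self-healing sync."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        # Assumes ArgoCD is installed in the conventional "argocd" namespace.
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            # Where the source manifests live.
            "source": {"repoURL": repo_url, "targetRevision": revision, "path": path},
            # Which cluster and namespace to deploy to (the in-cluster API server here).
            "destination": {"server": "https://kubernetes.default.svc", "namespace": namespace},
            # Automated sync; selfHeal reverts manual changes made in the cluster.
            "syncPolicy": {"automated": {"selfHeal": True}},
        },
    }


if __name__ == "__main__":
    manifest = application("example-app", "https://example.com/deploy.git", "apps/example-app", "example")
    print(json.dumps(manifest, indent=2))
```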

The ArgoCD Application resources themselves are managed through the App-of-Apps pattern: a root Application points to a directory containing all other Application definitions. Adding a new application to the cluster is a pull request that adds a manifest file; removing one is a pull request that removes it.
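
A root Application of this kind might look roughly like the sketch below. The repository URL and the apps directory name are placeholders, and enabling prune is my assumption: without it, deleting a manifest from git would not remove the corresponding child Application.

```python
import json


def root_application(repo_url, apps_dir="apps"):
    """Build a root Application whose source directory holds the child Applications."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": "root", "namespace": "argocd"},
        "spec": {
            "project": "default",
            # Every manifest in this directory is itself an Application resource.
            "source": {"repoURL": repo_url, "targetRevision": "HEAD", "path": apps_dir},
            # Child Applications are created next to ArgoCD itself.
            "destination": {"server": "https://kubernetes.default.svc", "namespace": "argocd"},
            # prune (assumed) removes child Applications whose manifest was deleted in git.
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(root_application("https://example.com/deploy.git"), indent=2))
```

Only this one resource ever needs to be applied by hand, at bootstrap; everything else arrives through the directory it watches.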

Registry: repo.toyintest.org

A self-hosted container registry at repo.toyintest.org stores all container images. Images built in CI are pushed here, and deployment manifests reference images by digest rather than by tag alone. A tag can be repointed at a different image after the fact, whereas a digest identifies exactly one image, so the reference in git is unambiguous: the same digest in staging and in production means the same image.
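
One way to enforce that convention is a small check in CI that rejects any image reference not pinned to a sha256 digest. This is a sketch under that assumption, not the check actually used here; the image name api is hypothetical.

```python
import re

# A pinned reference ends in "@sha256:" followed by 64 lowercase hex characters.
# An optional tag before the "@" is allowed; the digest is what gets resolved.
PINNED_REF = re.compile(r"[^\s@]+@sha256:[0-9a-f]{64}")


def is_pinned(image_ref):
    """Return True if the image reference is pinned to a sha256 digest."""
    return PINNED_REF.fullmatch(image_ref) is not None


if __name__ == "__main__":
    digest = "sha256:" + "0" * 64  # placeholder digest, not a real image
    for ref in ("repo.toyintest.org/api:1.4.2", "repo.toyintest.org/api@" + digest):
        print(ref, "pinned" if is_pinned(ref) else "NOT pinned")
```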

How Changes Reach Production

The operational model for changes is entirely pull-request-driven:

A change to infrastructure configuration (node specs, network policies, cluster-level resources) is a PR against the infrastructure repository. It is reviewed and merged. Talos machine config changes are rolled out with talosctl apply-config; cluster-level Kubernetes resources are reconciled by ArgoCD once the App-of-Apps Application points at the updated manifests — never an ad-hoc kubectl apply.
A change to an application is a PR against the application’s source repository. CI builds and pushes a new image, then automatically opens a PR against the deployment repository that updates the image digest. A reviewer merges that PR, and ArgoCD applies the change on its next reconciliation: within seconds if a git webhook triggers the refresh, otherwise at the next polling interval.
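
The digest update that CI proposes in the second flow can be sketched as a plain text rewrite of the manifest. The function name and the api image are hypothetical, and a real pipeline might use a tool such as kustomize for this step instead; the point is only that the change ends up as a one-line diff in a PR.

```python
import re

DIGEST = re.compile(r"sha256:[0-9a-f]{64}")


def bump_digest(manifest, repository, new_digest):
    """Replace the pinned digest of `repository` in a manifest's text."""
    if not DIGEST.fullmatch(new_digest):
        raise ValueError(f"not a sha256 digest: {new_digest!r}")
    # Match "<repository>@sha256:<64 hex chars>" wherever it appears.
    pattern = re.compile(re.escape(repository) + "@" + DIGEST.pattern)
    replacement = f"{repository}@{new_digest}"
    # A function is used as the replacement so nothing in it is treated as an escape.
    updated, count = pattern.subn(lambda _match: replacement, manifest)
    if count == 0:
        # Fail loudly rather than open a PR that changes nothing.
        raise ValueError(f"no pinned reference to {repository} found")
    return updated


if __name__ == "__main__":
    old = "sha256:" + "a" * 64  # placeholder digests
    new = "sha256:" + "b" * 64
    manifest = f"image: repo.toyintest.org/api@{old}\n"
    print(bump_digest(manifest, "repo.toyintest.org/api", new), end="")
```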

Nothing changes in the cluster without a corresponding git commit. The audit trail is the git log.

Outcome

The practical result of this setup is that the cluster is fully legible. Any engineer with access to the repository can understand exactly what is deployed, at what version, with what configuration. Rebuilding from scratch is a defined process, not an archaeological exercise.

“Zero-touch deploys” is an accurate description: from a merged application PR to a running updated deployment, no human runs a command against the cluster. ArgoCD handles the sync; Talos handles the node-level consistency. The operational surface that requires active attention is limited to reviewing pull requests and responding to ArgoCD sync failures when they occur.

The audit trail is a side effect of the workflow rather than a separate system to maintain. Questions like “what changed before this started failing?” and “who approved this configuration?” are answered by git, not by memory or informal documentation.

This setup prioritises correctness and auditability over operational flexibility. That is the right trade-off for a system where reliability matters more than the ability to make fast ad-hoc changes.