Configuration management at Giant Swarm: a historical overview

by Laszlo Uveges on Jun 25, 2026

This is the first of three posts on configuration management at Giant Swarm.

Before we get into how the current system works, let's dig into the rich history of how it all started and evolved over the years.

The strata are still visible. Three eras, each one shaped by the pressure of its time, and each one leaving traces in what came after.

  • The vintage system at the base, two seams running side by side: the release system, and the unique apps system alongside it.
  • Above both, we can discover a layer holding the origins that were shaped into the current system over the years.
  • And at the surface, the current system: generalized, extensible, still accumulating its own sediment.

Join us on a journey through these layers.

How it all began

It was early 2020. Configurations for management clusters were written statically and deployed manually. As we managed more and more of them, the need arose for a system that could:

  • Support multiple dimensions, like sane defaults and cluster-specific overrides
  • Ease the deployment of the configuration

Nothing like this existed at the time, and still doesn't today. In Kubernetes, configurations are generally kept in ConfigMaps and Secrets. The problem with them is that they are basically key-value stores, where the value is a primitive type, often simply a string. Even though we often store structured data in those strings — like YAML, TOML, and so on — we have no means to work with them in a programmatic way.

Take kustomize for example. An excellent tool to manage your Kubernetes manifests. It offers great ways to keep an inventory, create shared bases, apply common metadata, patches, and so on. But you cannot apply patches to structured data stored inside fields of a ConfigMap or a Secret. They are strings: you either replace one as a whole or leave it untouched. So if you only want, say, a different resources.limits or .cluster.name in a given app's large YAML configuration stored in a ConfigMap, you have no means to do that.

apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-operator-values
  namespace: giantswarm
data:
  configmap-values.yaml: |
    serviceMonitor:
      enabled: true
    resources:
      limits:
        cpu: 250m
        memory: 250Mi
      requests:
        cpu: 100m
        memory: 250Mi
    cluster:
      name: golem
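
To make the limitation concrete, here is a minimal Python sketch. It is not kustomize itself, only a model of it: the manifest is a plain dict, and a patch is merged at the deepest level a manifest patch can address, the keys under data. The merge_data helper and the cluster name "example" are assumptions made for illustration.

```python
# The manifest modeled as a dict; the embedded YAML is just one string value.
base = {
    "apiVersion": "v1",
    "kind": "ConfigMap",
    "metadata": {"name": "aws-operator-values", "namespace": "giantswarm"},
    "data": {
        "configmap-values.yaml": (
            "resources:\n"
            "  limits:\n"
            "    cpu: 250m\n"
            "    memory: 250Mi\n"
            "cluster:\n"
            "  name: golem\n"
        )
    },
}

# Hypothetical patch that only wants to change the cluster name.
patch = {"data": {"configmap-values.yaml": "cluster:\n  name: example\n"}}


def merge_data(manifest: dict, patch: dict) -> dict:
    """Merge `data` key by key; each value is treated as an opaque string."""
    merged = dict(manifest)
    merged["data"] = {**manifest["data"], **patch["data"]}
    return merged


patched = merge_data(base, patch)
print(patched["data"]["configmap-values.yaml"])
```

The patched value contains only the cluster block. The resources section is gone, because the whole string was replaced rather than merged.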

We needed something to manage these for many applications across many management clusters in a structured way.

The bottom seam: vintage releases

The deepest layer is the oldest: the vintage release system, built before Cluster API clusters existed. In this era, what ran on a management cluster was determined by a release: an immutable snapshot of apps with fixed versions, locked in a YAML file that described the release. Releases were stored in the giantswarm/releases repository, where you can still find remnants of the past deep in the git history. For example, take a look at the release.yaml of the v15.0.0 release for Kubernetes 1.20 from 2022.

When a new release was created and a management cluster was told to upgrade, release-operator started by creating a Config CR — a signal that configuration needed to be generated for the apps in that release.

config-controller watched for new Config CRs and generated a ConfigMap and a Secret from the giantswarm/config repository. This repository contained configuration templates and values for apps shared across all management clusters, as well as management cluster-specific templates and values. For example, each cluster had the same base config for a given app deployment, but the base domain was a different value for each cluster, shared across all of its apps. Likewise, clusters differed in size and needed different resource requests and limits for specific apps.

An overview of the repository structure:

├── default
│   ├── apps
│   │   ├── aws-operator  # Application name
│   │   │   ├── configmap-values.yaml.template  # Default CM data template
│   │   │   └── secret-values.yaml.template  # Default Secret data template
│   │   ├── ...
│   ├── config.yaml  # Default, shared value file (no shared secrets across MCs)
├── installations  # Installations is a historic name for management clusters
│   ├── golem  # Management cluster name
│   │   ├── apps
│   │   │   ├── aws-operator  # Application name, contains MC specific overrides
│   │   │   │   ├── configmap-values.yaml.patch  # MC specific CM template
│   │   │   │   └── secret-values.yaml.patch  # MC specific Secret template
│   │   │   ├── ...
│   │   ├── config.yaml.patch  # MC specific value file
│   │   └── secret.yaml  # MC specific secret value file (SOPS encrypted)
│   ├── ...

The rendering of this structure was relatively simple. For aws-operator as an example:

  1. Render the default ConfigMap and Secret data Go templates based on the default and management cluster-specific value files, for example default/config.yaml
  2. Render the management cluster-specific ConfigMap and Secret data Go templates, if they exist, based on the same value files
  3. Merge the rendered management cluster-specific data templates onto the rendered default templates
  4. Create a ConfigMap and a Secret manifest from the respective rendered data
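
The first three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the config-controller code: Python's string.Template stands in for Go templates, nested dicts stand in for the YAML files, and the function names, value keys, and domains are all hypothetical. Step 4, wrapping the result in manifests, is left out.

```python
import copy
from string import Template


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict with `override` merged onto `base`, recursing into dicts."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def render(template, values: dict):
    """Substitute $placeholders in every string leaf of a nested template."""
    if isinstance(template, dict):
        return {key: render(child, values) for key, child in template.items()}
    if isinstance(template, str):
        return Template(template).substitute(values)
    return template


def generate(default_values, cluster_values, default_template, cluster_template=None):
    # Both template layers see the same values: defaults plus cluster overrides.
    values = deep_merge(default_values, cluster_values)
    rendered = render(default_template, values)  # step 1
    if cluster_template is not None:  # step 2, only if the patch exists
        rendered = deep_merge(rendered, render(cluster_template, values))  # step 3
    return rendered


# Hypothetical stand-ins for the value files and templates in the tree above.
default_values = {"base_domain": "example.com", "cpu_limit": "250m", "memory_limit": "250Mi"}
cluster_values = {"base_domain": "golem.example.com", "memory_limit": "500Mi"}
default_template = {
    "baseDomain": "$base_domain",
    "resources": {"limits": {"cpu": "$cpu_limit", "memory": "250Mi"}},
}
cluster_template = {"resources": {"limits": {"memory": "$memory_limit"}}}

result = generate(default_values, cluster_values, default_template, cluster_template)
```

Here the cluster-specific template overrides only the memory limit, while the CPU limit and the rest of the default structure are kept.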

Secret data was SOPS-encrypted at rest in the configuration Git repositories.

Once the Config CR was ready, the controller updated the .status fields with references to the created ConfigMap and Secret.

apiVersion: core.giantswarm.io/v1alpha1
kind: Config
metadata:
  name: aws-operator-9.3.1
  namespace: giantswarm
spec:
  # ...
status:
  # ...
  config:
    configMapRef:
      name: aws-operator-9.3.1-c53762fbe3
      namespace: giantswarm
    secretRef:
      name: aws-operator-9.3.1-c53762fbe3
      namespace: giantswarm

After that, release-operator created the App CRs from the release.yaml file, with the App CR's spec.config set to the generated ConfigMap and Secret from the status of the Config CR.

The second seam: unique apps

Laid into the same layer as the release system — not above it in time, but alongside it — was a second mechanism for a different class of applications: unique apps, as they were called at the time.

The need arose to deploy some components to each management cluster outside the release cycle. Always the latest version, across all management clusters, no exceptions.

Unlike the release system, the mechanism here was more distributed. App versions lived in so-called collection repositories. Each repository hosted a Helm chart; there was one per provider (e.g. aws-app-collection) plus a shared one. Each chart contained a YAML file per unique app: a plain App CR manifest with a pinned version.

When an app repository cut a new release, our custom build tool called architect opened a PR to the relevant collection repos, released the updated collection chart, and emitted a GitHub deployment event for each management cluster, referenced by a hardcoded codename for each one of them. In the management cluster, a daemon called draughtsman ran continuously, polling GitHub's deployments API for events tagged with its own codename. When it found one, it ran helm upgrade --install on the collection chart — using a single ConfigMap and Secret in the draughtsman namespace as Helm values — which rendered the App CR templates and applied them to the cluster.

Unlike releases, the App CRs arrived without configuration. app-admission-controller immediately paused each incoming App CR, giving config-controller time to catch up. Rather than watching a Config CR, config-controller watched the App CR directly, generated the configuration, and wrote spec.config back directly onto the App CRs. Only then was the App CR unpaused and allowed to reconcile.
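
The ordering described above can be modeled as a small state machine. The sketch below is an assumption-laden illustration: the App class, its fields, and the generated names are hypothetical and do not mirror the real App CR or the controllers' code.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class App:
    """Hypothetical stand-in for an App CR; field names are illustrative."""
    name: str
    paused: bool = False
    config: Optional[dict] = None


def admit(app: App) -> App:
    """Admission step: pause every incoming App so nothing reconciles it yet."""
    app.paused = True
    return app


def configure(app: App, generate: Callable[[str], dict]) -> App:
    """Config step: attach generated config references, then unpause."""
    app.config = generate(app.name)
    app.paused = False
    return app


def reconcile(app: App) -> str:
    """Deployment step: only acts on Apps that are not paused."""
    if app.paused:
        return "skipped"
    return "deployed"


def fake_generate(name: str) -> dict:
    # Placeholder for real config generation; the names are made up.
    return {"configMap": f"{name}-values", "secret": f"{name}-secrets"}


app = admit(App(name="example-app"))
before = reconcile(app)  # still paused, so nothing is deployed
configure(app, fake_generate)
after = reconcile(app)  # config attached and App unpaused
```

The pause exists purely to order two independent actors: without it, the App could be reconciled before its configuration had been written.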

Where the pressure built

Both seams had rough edges, and it became a problem in itself that there were two mechanisms for roughly the same thing.

The release system had its own failure mode: if giantswarm/config failed to render, the Config CR stalled silently. Nothing moved until someone noticed and fixed the gap. Not catastrophic, but brittle.

The unique apps system had a more fundamental problem: architect maintained a hardcoded list of management cluster codenames. Every time Giant Swarm created a new management cluster, a developer had to file a PR against architect and wait for a release, then update every collection repository's CI configuration to pull the new architect version.

The values that draughtsman used to render the collection Helm charts came from another repository that contained the management cluster-specific values. When they changed, draughtsman did not automatically detect it. There was an internal CLI that had to be manually invoked for each management cluster to update the values.

Growth was expensive. The hardcoded list was a costly assumption — that the set of management clusters was small and stable. When that assumption stopped being true, the entire mechanism had to go.

What replaced it had to do two things this system couldn't: discover management clusters and changes to them dynamically. Configuration generation and deployment of management cluster App CRs had to be simplified.

That work became Giant Swarm embracing GitOps and creating konfigure.

Next seam: origins of the current system

konfigure began as a direct successor to config-controller. Much of the code moved over unchanged and was wrapped into a CLI instead of an operator framework. The goal was the same: generate configuration for apps from giantswarm/config.

konfigure's first home was ArgoCD — a significant moment in itself, marking Giant Swarm's first move toward GitOps-driven deployment.

Flux soon replaced Argo, and with it konfigure evolved into a kustomize plugin: a KRM function running under Flux's kustomize-controller. The collection Helm charts that draughtsman had been rendering were replaced by kustomizations and a new custom resource of kind Konfigure (apiVersion generators.giantswarm.io/v1), which told konfigure what to generate.

apiVersion: generators.giantswarm.io/v1
app_catalog: control-plane-catalog
app_destination_namespace: giantswarm
app_name: aws-admission-controller
app_version: 3.6.3
kind: Konfigure
metadata:
  annotations:
    config.kubernetes.io/function: |-
      exec:
        path: /plugins/konfigure
  name: aws-admission-controller
name: aws-admission-controller

And the kustomization.yaml file for kustomize looked like:

generators:
  - aws-admission-controller.yaml
  # ...

A significant change here was that this simple YAML got rendered by konfigure into not just the ConfigMap and Secret, but also the App CR that got applied to the cluster by Flux at the same time.

architect kept pushing new releases to the collection repositories, but simply updated .app_version in these files instead.

No race condition, no separate daemon, no GitHub polling, no manual tool invocations. The draughtsman problem was solved.

Where the cracks started to show

The KRM function model introduced a different kind of complexity. KRM functions are opaque by design. Rendering configurations — and App CRs — locally required understanding how kustomize, KRM functions, and Flux's kustomize-controller all interacted, and that interaction was not obvious.

We also had to maintain custom code and forks of Flux components to support the new system and inject konfigure into the kustomize-controller.

Error messages from konfigure were swallowed. When something went wrong, kustomize-controller did not surface the errors from kustomize or from the konfigure binary it invoked underneath. Errors were not visible on the Flux Kustomization status, only buried in the pod logs. Debugging meant hunting through log fragments.

Worse, a single bad configuration entry poisoned the entire kustomization. Nothing else in it would reconcile until the broken entry was fixed. One misconfigured app could silently stall everything running alongside it.

Making App CR-related changes was also challenging, since we hid everything under the generators.giantswarm.io interface. Supporting custom labels or annotations, for example, was not straightforward — we had to change konfigure itself. Technically it would have been possible to do such things on the kustomization side via patches, but how we had structured our GitOps code at the time — just barely starting out, with a lot to learn — meant it was not possible to do so at the management cluster level, where each cluster ran a different provider and thus a different collection.

At the surface: the generalized system

So the switch to GitOps and the first version of konfigure had done the hard work of replacing draughtsman — but it inherited the configuration structure of the vintage era unchanged. It could generate configuration, but only for the shapes it already knew about. The system was extensible in theory, but in practice it required reworking each time.

Eventually, the need to categorize management clusters into different stages arose, and with it the need to support different configurations for each stage. The need for different kinds of configurations also kept growing — for example, to support workload cluster configurations.

At this point, with the experience and problems that had accumulated over the years, we decided the best way forward was to make the system more general: one that allows far more flexibility and extensibility but, in one permutation of its configuration, reproduces 1-to-1 the original management cluster configuration that was hardcoded and inherited from the vintage era. That fidelity was crucial: even a slight change in the rendered configuration would break the system, rolling out wrong configuration for potentially thousands of applications across dozens of management clusters at once.

We completely rewrote the internals of konfigure to parse and render any configuration, based on a schema file that describes the structure of the configuration repository. The schema clearly defines what kind of variables the configuration is based on, the layers it is built from, the values files and templates each layer supports, and more features that the original hardcoded system lacked.

It was designed to be a simple CLI so that configuration can easily be generated locally. All the rendering logic was encapsulated as a library and exposed. That's what we built konfigure-operator on — a thin wrapper bundled with a couple of CRDs to constantly reconcile a set of requested configurations into a Kubernetes cluster.

We also decided to remove the generators.giantswarm.io interface entirely and replace it with raw App CRs that reference the generated configuration. Alongside these apps, a new CR was deployed to tell konfigure-operator what to generate. This time, the reconciliation of the App CRs relied purely on eventual consistency. By only generating a ConfigMap and a Secret, the system is no longer bound to App CRs. In the meantime, we decided to move away from App CRs entirely, converting everything to Flux HelmRelease CRs — and they, or whatever the future holds, can mount the generated config just the same.

konfigure-operator also addressed the poison pill problem: a single bad configuration entry no longer blocks the entire reconciliation. Errors surface cleanly where they should, for the specific configuration that caused them. The rest keeps moving.
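
The difference from the all-or-nothing behavior can be sketched as a loop that isolates failures per request. This is a generic illustration of the pattern, not konfigure-operator's code: reconcile_all, the render function, the request shape, and the status strings are all hypothetical.

```python
from typing import Callable, Dict


def reconcile_all(requests: Dict[str, dict], render: Callable[[dict], dict]) -> Dict[str, str]:
    """Render each request on its own and record a per-request status."""
    statuses = {}
    for name, request in requests.items():
        try:
            render(request)
            statuses[name] = "Ready"
        except Exception as err:  # keep the failure local to this one request
            statuses[name] = f"Failed: {err}"
    return statuses


def render(request: dict) -> dict:
    # Hypothetical renderer: rejects requests that do not name an app.
    if "app" not in request:
        raise ValueError("missing app name")
    return {"values": f"config for {request['app']}"}


requests = {
    "first": {"app": "example-a"},
    "broken": {},
    "last": {"app": "example-b"},
}
statuses = reconcile_all(requests, render)
```

The broken request ends up with its own failure status, while the requests before and after it are still processed.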

These are the decisions that Parts 2 and 3 cover in depth — the schema, the rendering logic, the reconciliation, and more.

What matters here is why they were made: the vintage era proved that hardcoded structure doesn't survive growth, and the first konfigure proved that opacity is a form of fragility. The current system is a direct response to both lessons.