Introducing Tarmak - the toolkit for Kubernetes cluster provisioning and management

We are proud to introduce Tarmak, an open source toolkit for Kubernetes cluster lifecycle management that focuses on best-practice cluster security, management and operation. It has been built from the ground up to be cloud provider-agnostic and provides a means for consistent and reliable cluster deployment and management, across clouds and on-premises environments.

This blog post is a follow-up to a talk Matt Bates and I gave at PuppetConf 2017. The slides can be found here and a recording of the session can be found at the end of this post (click here to watch).

Tarmak logo

Where did Tarmak come from?

Jetstack have extensive experience deploying Kubernetes into production with many different clients. We have learned what works (and, importantly, what does not work so well) and have worked through several generations of cluster deployment. In the talk, we described these challenges. To summarise:

  • Immutable infrastructure isn’t always that desirable.
  • Ability to test and debug is critical for development and operations.
  • Dependencies need to be versioned.
  • Cluster PKI management in dynamic environments is not easy.

Tarmak and its underlying components are the product of Jetstack’s work with its customers to build and deploy Kubernetes in production at scale.

In this post, we’ll explore the lessons we learned and the motivations for Tarmak, and dive into the tools and the provisioning mechanics. Firstly, the motivations that were born out of the lessons learned:

Improved developer and operator experience

A major goal for the tooling was to provide an easy-to-use and natural UX - for both developers and operators.

In previous generations of cluster deployment, one area of concern with immutable replacement of instances on every configuration change was the long and expensive feedback loop. It took significant time for a code change to be deployed into a real-world cluster, and a simple careless mistake in a JSON file could take up to 30 minutes to notice and fix. Using tests at multiple levels (unit, integration) on all the code involved helps to catch, early on, errors that would prevent a cluster from building.

Another problem, especially with the Bash scripts, was that whilst they would work fine with one specific configuration, they became really hard to maintain once they had to handle input parameters. Scripts were modified and duplicated, and this quickly became difficult to manage effectively. So our goal for the new project was to follow coding best practices: “Don’t repeat yourself” (DRY) and “Keep it simple, stupid” (KISS). This reduces the complexity of later changes and helps to achieve a modular design.

When instances are replaced on every configuration change, it is not easy to see what is about to change in an instance’s configuration. A dry-run capability would give much better insight into the changes that will be performed.
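
To illustrate what a dry run could report, here is a minimal Python sketch that compares a current and a desired configuration and lists the differences without applying anything. The function name `plan_changes`, the flat key/value configuration format and the example values are assumptions made for illustration; they are not Tarmak's implementation.

```python
from __future__ import annotations


def plan_changes(current: dict, desired: dict) -> list[str]:
    """Return a readable list of changes needed to reach `desired`.

    Nothing is modified: this only reports what an apply step would do.
    """
    changes = []
    # Walk the union of keys in a stable order so the report is reproducible.
    for key in sorted(current.keys() | desired.keys()):
        if key not in current:
            changes.append(f"+ {key} = {desired[key]!r}")
        elif key not in desired:
            changes.append(f"- {key} (was {current[key]!r})")
        elif current[key] != desired[key]:
            changes.append(f"~ {key}: {current[key]!r} -> {desired[key]!r}")
    return changes


if __name__ == "__main__":
    # Hypothetical settings, chosen only to show the three kinds of change.
    current = {"log_level": "info", "max_pods": 110, "legacy_flag": True}
    desired = {"log_level": "debug", "max_pods": 110, "audit_log": True}
    for line in plan_changes(current, desired):
        print(line)
```

Unchanged keys produce no output, so an empty list means the instance already matches the desired configuration.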

Another important observation was that using a more traditional approach to running software helps engineers to transition more smoothly into a container-centric world. Whilst Kubernetes can be used to “self-host” its own components, we recognised that there is greater familiarity (at this stage) with tried-and-tested, traditional tools in operations teams, so we adopted systemd and use the vanilla open source Kubernetes binaries.

Less disruptive cluster upgrades

In many cases with existing tooling, cluster upgrades involve replacing instances; when you want to change something, the entire instance is replaced with a new one that contains the new configuration. A number of limitations started to emerge from this strategy.

  • Replacing instances can be expensive in both time and cost, especially in large clusters.
  • There is no control over our rolled-out instances - their actual state might have diverged from the desired state.
  • Draining Kubernetes worker instances is often quite a manual process.
  • Every replacement comes with risks: for example, someone might have used latest tags, or the configuration might no longer be valid.
  • Cached content is lost throughout the whole cluster and needs to be rebuilt.
  • Stateful applications need to migrate data over to other instances (and this is often a resource intensive process for some applications).

Tarmak has been designed with these factors in mind. We support both in-place upgrades and full instance replacement. This allows operators to choose how they would like their clusters to be upgraded, to ensure that whatever cluster-level operation they are undertaking, it is performed in the least disruptive way possible.

Consistency between environments

Another goal for the new tools was to provide a consistent deployment across different cloud providers and on-premises setups. We consistently hear from customers that they do not wish to skill up operations teams on a multitude of provisioning tools and techniques, not least because of the operational risk this poses when trying to reason about cluster configuration and health at times of failure.

Introducing Tarmak

With Tarmak, we have developed the right tool to address these challenges.

We identified Infrastructure, Configuration and Application as the three core layers of set-up in a Kubernetes cluster.

  • Infrastructure: All core resources (like compute, network, storage) are created and configured to be able to work together. We use Terraform to plan and execute these changes. At the end of this stage, the infrastructure is ready to run our own bespoke ‘Tarmak instance agent’ (Wing), required for the configuration stage.

  • Configuration: The Wing agent is at the core of the configuration layer and uses Puppet manifests to configure all instances in a cluster accordingly. After Wing has run, it sends reports back to the Wing apiserver, which can be run in a highly available configuration. Once all instances in a cluster have successfully executed Wing, the Kubernetes cluster is up and running and provides its API as an interface.

  • Applications: The core cluster add-ons are deployed with the help of Puppet. Any other tool like kubectl or Helm can also be used to manage the lifecycle of these applications on the cluster.

Abstractions and chosen solutions

Infrastructure

As part of the Infrastructure provisioning stage, we use Terraform to set up instances that are later configured to fulfill one of the following roles:

  • Bastion is the only node that has a public IP address assigned. It is used as a “jump host” to connect to services on the private networks of clusters. It also runs the Wing apiserver responsible for aggregating the state information of instances.
  • Vault instances provide a dynamic CA (Certificate Authority)-as-a-service for the various cluster components that rely on TLS authentication. They also run Consul as a backend for Vault and store its data on persistent disks, encrypted and secured.
  • etcd instances store the state for the Kubernetes control plane. They have persistent disks and run three etcd clusters, each in an HA configuration (i.e. 3+ instances): one for Kubernetes, another dedicated to Kubernetes’ events and a third for the overlay network (Calico, by default).
  • Kubernetes Masters run the Kubernetes control plane components in a highly available configuration.
  • Kubernetes Workers run your organisation’s application workloads.
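
The roles above can be summarised as a small data structure. The following Python sketch is illustrative only: the `Role` class and its field names are hypothetical, and the `persistent_disk` value for the bastion, master and worker roles is an assumption, since the text only mentions persistent disks for Vault and etcd.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    name: str
    public_ip: bool        # only the bastion is publicly reachable
    persistent_disk: bool  # stated for vault and etcd; assumed False elsewhere
    purpose: str


ROLES = [
    Role("bastion", True, False, "jump host; runs the Wing apiserver"),
    Role("vault", False, True, "CA-as-a-service; Consul as the Vault backend"),
    Role("etcd", False, True, "state for Kubernetes, events and overlay network"),
    Role("master", False, False, "Kubernetes control plane components"),
    Role("worker", False, False, "application workloads"),
]


def publicly_reachable(roles):
    """Names of roles that are assigned a public IP address."""
    return [role.name for role in roles if role.public_ip]


def stateful(roles):
    """Names of roles that keep data on persistent disks."""
    return [role.name for role in roles if role.persistent_disk]


if __name__ == "__main__":
    print("public:", publicly_reachable(ROLES))
    print("stateful:", stateful(ROLES))
```

Laid out like this, it is easy to see that the bastion is the single entry point and that only the Vault and etcd roles hold data that must survive instance replacement.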

In addition to the creation of these instances, an object store is populated with Puppet manifests that are later used to spin up services on the instances. The same manifests are distributed to all nodes in the cluster.

Infrastructure layer

Configuration

The configuration phase starts when an instance is started or a re-run is requested using Tarmak. Wing fetches the latest Puppet manifests from the object store and applies them on the instance until the instance has converged. Meanwhile, Wing sends status updates to the Wing apiserver.
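
The fetch, apply and report cycle can be sketched roughly as follows. This is a simplified Python illustration, not Wing's actual code: the function `converge`, its callable parameters and the `max_runs` limit are all hypothetical, and it assumes that an instance counts as converged once an apply run makes no further changes.

```python
from typing import Callable, List


def converge(
    fetch: Callable[[], List[str]],
    apply: Callable[[List[str]], bool],
    report: Callable[[str], None],
    max_runs: int = 5,
) -> bool:
    """Apply manifests repeatedly until a run changes nothing.

    `apply` returns True if it had to change something on the instance.
    Returns True on convergence, False if `max_runs` was exhausted.
    """
    manifests = fetch()  # e.g. download from the object store
    for run in range(1, max_runs + 1):
        report(f"run {run}: applying")
        changed = apply(manifests)
        if not changed:
            report("converged")
            return True
    report("failed to converge")
    return False


if __name__ == "__main__":
    pending = [2]  # pretend two runs are needed before nothing changes

    def fake_apply(manifests: List[str]) -> bool:
        if pending[0] > 0:
            pending[0] -= 1
            return True
        return False

    ok = converge(lambda: ["site.pp"], fake_apply, print)
    print("success:", ok)
```

Passing the three steps in as callables keeps the loop easy to test in isolation, which fits the goal of testing at multiple levels described earlier.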

The Puppet manifests are designed so that Puppet is no longer required once all changes have been applied. The startup of the services is managed using standard systemd units, and timers are used for recurring tasks such as the renewal of certificates.

The Puppet modules powering these configuration steps have been implemented in cooperation with Compare the Market — this should also explain the ‘Meerkats’ in the talk title! 🙂

Configuration layer

Getting started

You can get started with Tarmak by following our AWS getting started guide.

We’d love to hear feedback and take contributions in the Tarmak project (Apache 2.0 licensed) on GitHub.

Stay tuned

We are actively working on making Tarmak more accessible to external contributors. Our next steps are:

  • Splitting out the Puppet modules into separate repositories.
  • Moving issue tracking (GitHub) and CI (Travis CI) out into the open.
  • Improving documentation.

In our next blog post we’ll explain why Tarmak excels at quick and non-disruptive Kubernetes cluster upgrades, using the power of Wing - stay tuned!

Watch the recording from PuppetConf 2017