Continuous Deployment and Automated Canary Analysis with Spinnaker and Kubernetes

Spinnaker is a cloud-native continuous delivery tool created at Netflix, where it was originally designed and built to help internal development teams release software changes with confidence. It has since been open-sourced and has gained the support of a growing number of mainstream cloud providers, including Google, Amazon, Microsoft, IBM and Oracle.

At Jetstack we receive questions almost daily from our customers about how to deploy to Kubernetes across different environments and, in some cases, to clusters spread across multiple cloud providers or on-prem. Since Spinnaker runs natively on Kubernetes and has first-class support for Kubernetes manifests, it is a strong candidate for this purpose. However, being able to see the tool in action and, more importantly, how it might integrate with other tooling is vital when making a decision. For this reason we have been working on a series of demonstrators with various best-of-breed cloud-native technologies to help inform our customers. In this post, we’ll describe the architecture of the demo and how these cloud-native technologies can be used together and with Spinnaker.

Overview

The primary aim of the demo is to show how Spinnaker could be used to automate the deployment of a new version of an application to production with confidence. The chosen application is a simple webserver called the goldengoose that we use for our advanced wargaming training course. The techniques described below could of course be applied to a more complex application, but in order to keep the focus on Spinnaker’s capabilities rather than the intricacies of managing a particular application, we chose to keep the application simple.

The demo configures two pipelines within Spinnaker: Build and Deploy. When a commit is pushed to the master branch of the goldengoose GitHub repository, a GitHub webhook triggers the Build pipeline, which builds an image on-cluster and pushes the result to Docker Hub. If the build succeeds, the Deploy pipeline is triggered and deploys the new image to production in a controlled way.

One of the main components of the demo that provides the confidence and control mentioned above is Spinnaker’s automated canary analysis (ACA) feature. This feature leverages Kayenta, a component responsible for querying relevant metrics from a configured sink and performing statistical analysis on the data to decide how to proceed (in this case, whether a canary deployment should be promoted to production or not). Deciding which metrics should drive such a decision can be challenging, but this feature gives operators a very flexible way of describing to Spinnaker what it means for a new version of their application to be ‘better’ than the previous version.
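To make the idea of canary scoring more concrete, here is a minimal Python sketch of the general shape of such a decision. It is not Kayenta's algorithm: Kayenta applies its own statistical tests, whereas this sketch substitutes a simple comparison of means with a relative tolerance. The metric names, sample values, tolerance and pass threshold are all hypothetical and chosen only for illustration.

```python
import math
from statistics import mean


def classify_metric(baseline, canary, tolerance=0.10, higher_is_worse=True):
    """Classify one metric as 'Pass' or 'Fail' by comparing sample means.

    Simplified stand-in for a real statistical test: the canary fails a
    metric if its mean moves in the 'bad' direction by more than
    `tolerance` (relative to the baseline mean).
    """
    base_avg = mean(baseline)
    canary_avg = mean(canary)
    if base_avg == 0:
        # Avoid division by zero: any change from zero is treated as unbounded.
        change = 0.0 if canary_avg == 0 else math.copysign(math.inf, canary_avg)
    else:
        change = (canary_avg - base_avg) / abs(base_avg)
    regressed = change > tolerance if higher_is_worse else change < -tolerance
    return "Fail" if regressed else "Pass"


def canary_score(classifications):
    """Score is the percentage of metrics classified as 'Pass'."""
    passes = sum(1 for c in classifications if c == "Pass")
    return 100.0 * passes / len(classifications)


def decide(score, pass_threshold=95.0):
    """Promote the canary only if the score reaches the pass threshold."""
    return "promote" if score >= pass_threshold else "abort"


# Hypothetical samples collected from the baseline and canary deployments.
baseline = {"latency_ms": [120, 118, 125], "error_rate": [0.010, 0.012, 0.011]}
canary = {"latency_ms": [122, 121, 119], "error_rate": [0.011, 0.010, 0.012]}

results = [classify_metric(baseline[name], canary[name]) for name in baseline]
score = canary_score(results)
print(results, score, decide(score))
```

The useful part of this shape is that "better" is expressed per metric (which direction is bad, and how much drift is acceptable) and then reduced to a single score that the pipeline can act on.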

Figure: the Deploy pipeline

The whole demo (except for the GitHub and Docker Hub components, load balancers and disks) runs on a single GKE cluster. This cluster does not have any special requirements, except that we have enabled autoscaling and used nodes larger than the default (we chose n1-standard-4).

More detail on how the tools used within the demo interact with each other will be described below, but the high-level steps are as follows:

  1. Make a local change to the goldengoose codebase and push to GitHub
  2. GitHub webhook triggers Spinnaker Build pipeline which applies Knative Build custom resource to the cluster
  3. Knative build controller triggers a build of the goldengoose image which is pushed to Docker Hub
  4. If the build is successful, the Spinnaker Deploy pipeline is triggered
  5. Canary deployment is deployed from the newly built image
  6. Baseline deployment is deployed using the image from the current production deployment
  7. Spinnaker performs ACA on performance metrics collected from both the canary and baseline deployments
  8. If ACA is deemed successful, the canary image is promoted to production by performing a rolling update of the production deployment
  9. Canary and baseline deployments are cleaned up
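As a rough illustration of steps 2 and 3, the Python sketch below generates a Knative Build custom resource of the kind a pipeline could apply to the cluster. This is not the manifest used in the demo. The repository URL, image name and resource name are hypothetical, and the use of a single Kaniko step to build and push the image is our assumption; the demo's actual build steps may differ. Pushing to Docker Hub would also need registry credentials (for example via a service account), which are omitted here.

```python
import json


def knative_build(name, repo_url, revision, image):
    """Return a Knative Build resource (build.knative.dev/v1alpha1) as a dict.

    Assumes a single Kaniko step that builds the Dockerfile at the root of
    the checked-out source and pushes the result to `image`.
    """
    return {
        "apiVersion": "build.knative.dev/v1alpha1",
        "kind": "Build",
        "metadata": {"name": name},
        "spec": {
            # Source to check out into /workspace before the steps run.
            "source": {"git": {"url": repo_url, "revision": revision}},
            "steps": [
                {
                    "name": "build-and-push",
                    "image": "gcr.io/kaniko-project/executor",
                    "args": [
                        "--dockerfile=/workspace/Dockerfile",
                        f"--destination={image}",
                    ],
                }
            ],
        },
    }


# Hypothetical values; a real pipeline would take the revision from the webhook payload.
build = knative_build(
    name="goldengoose-build",
    repo_url="https://github.com/example/goldengoose",
    revision="master",
    image="docker.io/example/goldengoose:latest",
)

# kubectl accepts JSON as well as YAML, so this output could be applied directly.
print(json.dumps(build, indent=2))
```

Because the build is just another Kubernetes resource, the pipeline only needs to apply a manifest and let the build controller do the rest.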

The reason for deploying a baseline using the current production image rather than just using the production deployment itself is to avoid differences in performance metrics due to how long the deployment has been running. Heap size is one such metric that could be affected by this.

These steps could of course be extended to a more complex pipeline involving more environments and more testing, perhaps with a final manual promotion to production. A single Spinnaker deployment is not limited to the cluster it runs on: by installing credentials for other Kubernetes clusters, it can deploy to those as well.

Tooling

Here we list the main tools that have been used and their purpose within the demo and how they relate to Spinnaker:

  • Knative: when a code change is pushed to our goldengoose repository, we want to trigger a build so that a new canary deployment can be rolled out. Knative’s build component worked nicely for this and allowed Spinnaker to apply a Build custom resource to the cluster whenever a commit was pushed to our master branch. This CI component of the demo is not strictly within Spinnaker’s domain as a CD tool. However, by having Knative controllers handle the logic involved in building a new image, we could still make use of Spinnaker’s first-class support for Kubernetes resources.
  • Prometheus: Spinnaker’s ACA requires access to a set of metrics from both a canary deployment and a baseline deployment. Spinnaker supports a number of metrics sinks; we chose Prometheus partly for its ubiquity in the cloud-native space and partly because it integrates with Istio out of the box. By configuring Spinnaker to talk to our in-cluster Prometheus instance we were able to automate the decision to promote canary images to production.
  • Istio: as we were only making use of Knative’s build component, we did not have a strict dependency on Istio; however, by using Istio’s traffic shifting capabilities we were able to easily route equal and weighted production traffic to both our baseline and canary deployments, producing performance metrics to be used by Spinnaker’s ACA feature. Istio’s traffic mirroring feature could also be used if you did not want responses from the canary to be seen by users. We also made use of the Prometheus adapter to describe to Istio which goldengoose metrics we wanted to make available in Prometheus. Finally, the Istio Gateway was used to allow traffic to reach our goldengoose deployments.
  • cert-manager: to secure Spinnaker’s UI and API endpoints we needed TLS certificates; what else would we use?
  • nginx-ingress: the NGINX ingress controller was used to allow traffic to reach both the Spinnaker UI and API endpoints as well as for cert-manager Let’s Encrypt ACME HTTP challenges.
  • GitHub: used as both a source code repository and as an OAuth identity provider for Spinnaker. There are other authentication options available.
  • OpenLDAP: used for authorisation within Spinnaker. There are other authorisation options available.
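To illustrate the Istio traffic shifting mentioned above, the Python sketch below generates a VirtualService that splits traffic by weight between production, baseline and canary subsets. It is an illustration under assumptions rather than the demo's actual configuration: the host, gateway and subset names and the 90/5/5 split are hypothetical, and the subsets would need to be defined in a matching DestinationRule, which is not shown.

```python
import json


def traffic_split(name, host, gateway, weights):
    """Return an Istio VirtualService (networking.istio.io/v1alpha3) as a dict.

    `weights` maps subset name -> percentage of traffic. Istio requires the
    weights of a route to sum to 100, so that is checked up front.
    """
    total = sum(weights.values())
    if total != 100:
        raise ValueError(f"weights must sum to 100, got {total}")
    routes = [
        {"destination": {"host": host, "subset": subset}, "weight": weight}
        for subset, weight in weights.items()
    ]
    return {
        "apiVersion": "networking.istio.io/v1alpha3",
        "kind": "VirtualService",
        "metadata": {"name": name},
        "spec": {
            "hosts": ["*"],
            "gateways": [gateway],
            "http": [{"route": routes}],
        },
    }


# Hypothetical split: baseline and canary receive equal, small shares of
# production traffic so that their metrics are directly comparable.
virtual_service = traffic_split(
    name="goldengoose",
    host="goldengoose",
    gateway="goldengoose-gateway",
    weights={"production": 90, "baseline": 5, "canary": 5},
)

print(json.dumps(virtual_service, indent=2))
```

Giving the baseline and canary the same weight matters for the analysis: both see a comparable volume of real traffic, so differences in their metrics are more likely to reflect the new image than the load.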

Summary

We have described how Spinnaker can be used for continuous delivery (and integration) and how it can be integrated with other cloud-native tooling to provide powerful capabilities within your organisation.

It is still relatively early days for the Spinnaker project and we can expect to see lots of future development. The documentation that exists today is clean and easy to follow, but there are a number of areas I would like to see documented, mostly around exposing the internals of the various microservices that make up a Spinnaker deployment. Some interesting guides exist today, for example on writing custom stages and adding first-class support for particular Kubernetes custom resources. Others would be nice to see, such as how to let Spinnaker know that a new CRD exists in the cluster, and the recommended way of manually adding to generated Halyard configuration (for component sizing, for example). Fortunately the Spinnaker community is strong and responsive and has clearly outlined how best to get in touch.

One potential barrier to Spinnaker adoption for some users is the amount it layers on top of Kubernetes. Authentication, authorisation and configuration validation (e.g. for Spinnaker pipelines) are all handled by various Spinnaker components or external services, even though upstream Kubernetes already has a lot of machinery for these exact problems, which Spinnaker does not make use of. For example, the ability to apply a pipeline custom resource that Spinnaker watches for would be very powerful, allowing RBAC rules to control which users are allowed to manage pipelines. Not relying on Kubernetes for these features does of course allow for more granular authorisation, and it keeps Spinnaker’s deployment options wider than just Kubernetes. However, since the only production installation instructions require Kubernetes, and since Kubernetes is becoming increasingly ubiquitous, making that coupling tighter might ease adoption. Projects such as k8s-pipeliner do try to provide some of that glue, but deeper integrations would be greatly valued by users already familiar with Kubernetes.

For more information on anything covered in this post please reach out to our team at hello@jetstack.io.
