Cert-manager reaches v0.6

We’re excited to announce v0.6 of cert-manager, the general purpose X.509 certificate management tool for Kubernetes. Cert-manager provides automated issuance, renewal and management of certificates within your organisation.

Certificate management in highly dynamic environments is no easy feat, and if approached without careful consideration it can quickly lead to outages and service interruption when certificates begin expiring. By standardising on a single tool for managing your PKI assets, you can ensure that certificates are being automatically renewed, and that the appropriate teams are notified if there are any issues or policy violations within your cluster.

Over the last year, we’ve seen the project grow rapidly. It is approaching 3,000 stars on GitHub, with a community of over 100 code contributors and thousands of people providing support, discussion and insight.

In this post, we’re going to explore some of the new features of the v0.6 release, as well as discuss our plans for the project as it works towards a “1.0” release!

Introducing ACME ‘Order’ and ‘Challenge’ CRDs

In the past, due to the way that cert-manager was initially designed, we’ve had problems controlling and managing cert-manager’s ACME client usage. This has in some cases led to excessive use of ACME APIs, which can cause problems for public ACME providers such as Let’s Encrypt.

This release significantly refactors how we process and manage ACME certificates, and as a result we’ve seen a net reduction in API usage of up to 100x in some of the worst cases. Making this change was no small job, and has taken a few months to properly mature into what it is today.

In order to achieve these improvements, we’ve created ‘Order’ and ‘Challenge’ resource types within the Kubernetes API. This allows us to cache and reason about objects that would usually only exist within the ACME server, using our own API. By doing it this way, it also allows more advanced users and integrators to understand and control the ACME Order flow, as we present structured information about the process in the form of our CRDs.

To summarise, this restructure gives us:

  • Centralised point of logging and debugging of the ACME authorization process. Instead of searching through log messages, it’s now possible to run kubectl describe to understand what the state of a certificate is.

  • Fewer API calls to ACME servers. Information about Orders and Challenges is now stored within the Kubernetes API. This means we don’t need to query the ACME API in order to make control-flow decisions.

  • Cleaner, more understandable separation of concerns. This allows you to build your own integrations and ‘hook in’ to the authorization process.

This is largely an internal change, but with far-reaching benefits. For more details, check out pull request #788.
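To make the "fewer API calls" point concrete, here is a minimal sketch of making a control-flow decision purely from locally stored state. The `Order` and `Challenge` classes and the `next_action` function are hypothetical, heavily simplified stand-ins for the real CRDs and controller logic, not cert-manager's actual code.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical, simplified stand-ins for the Order and Challenge resources.
@dataclass
class Challenge:
    domain: str
    state: str  # one of "pending", "valid" or "invalid"


@dataclass
class Order:
    name: str
    challenges: List[Challenge] = field(default_factory=list)


def next_action(order: Order) -> str:
    """Pick the next step from cached state only, with no call to the ACME server."""
    states = [c.state for c in order.challenges]
    if any(s == "invalid" for s in states):
        return "fail"
    if states and all(s == "valid" for s in states):
        return "finalize"
    return "wait"


order = Order(
    "example-com",
    [Challenge("example.com", "valid"), Challenge("www.example.com", "pending")],
)
print(next_action(order))
```

Because the state lives alongside the resource, the same information a controller uses to decide is also what a person sees when inspecting the resource.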

We are keen to hear feedback on this new design, so please create an issue that includes the text /area acme to share it.

Improved handling of ACME rate limits

Off the back of the changes discussed above, we’ve been able to implement far more intelligent handling of rate limits and quotas with ACME servers. This was previously not possible, due to the way we scheduled challenges for processing.

In large-scale deployments, we’ve seen these changes have an extremely positive effect. In one case, up to 80,000 domain names were validated without hitting quota troubles! We’ll be publishing more information on some of our largest users, and how we’ve helped them with their managed certificate offerings, over the coming weeks - stay tuned!
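This post does not describe the exact mechanism cert-manager uses, so the following is only an illustration of the general idea of pacing requests against a quota. It uses a token bucket, a common rate-limiting technique chosen here for the example; the `TokenBucket` class and its parameters are hypothetical.

```python
class TokenBucket:
    """Allow up to `capacity` requests in a burst, refilled at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float = 0.0):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = now

    def allow(self, now: float) -> bool:
        # Refill according to the time elapsed since the previous call.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=1.0, capacity=2.0)
# Three challenges arrive at t=0 and a fourth at t=1 (times in seconds).
results = [bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 1.0)]
print(results)  # [True, True, False, True]
```

A challenge that is refused would simply be retried later, rather than being sent to the ACME server and counted against the quota.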

Prometheus metrics for the ACME client

So far we’ve spoken a lot about improving how we use external APIs, but how do we know we’ve made improvements?

Well, as part of v0.6 we’ve expanded the set of Prometheus metrics we expose. This allows you to build custom dashboards and alerts to monitor your cert-manager deployment, including:

  • Certificate expiry times
  • Number of certificates
  • How the ACME client is used

In later releases we’re going to extend this further, so that you can build alerting policies and keep ahead of the curve with upcoming or newly introduced issues!
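As a rough idea of what an expiry alert could look like, the sketch below parses metrics in the Prometheus text exposition format and lists certificates that expire within a given window. The metric name, its labels and the sample values are assumptions made for illustration; check the names your cert-manager deployment actually exposes.

```python
import re
from typing import List, Tuple

# Assumed metric name and labels, for illustration only.
METRIC = "certmanager_certificate_expiration_seconds"

SAMPLE = """\
# TYPE certmanager_certificate_expiration_seconds gauge
certmanager_certificate_expiration_seconds{name="web",namespace="default"} 1550000000
certmanager_certificate_expiration_seconds{name="api",namespace="prod"} 1560000000
"""

LINE = re.compile(r"^(\w+)\{(.*)\}\s+(\S+)$")
LABEL = re.compile(r'(\w+)="([^"]*)"')


def expiring_soon(text: str, now: float, within_seconds: float) -> List[Tuple[str, str]]:
    """Return (namespace, name) pairs whose expiry timestamp falls inside the window."""
    found = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match or match.group(1) != METRIC:
            continue  # skips comments, blank lines and other metrics
        labels = dict(LABEL.findall(match.group(2)))
        expiry = float(match.group(3))
        if expiry - now <= within_seconds:
            found.append((labels.get("namespace", ""), labels.get("name", "")))
    return found


THIRTY_DAYS = 30 * 24 * 3600
print(expiring_soon(SAMPLE, now=1549000000, within_seconds=THIRTY_DAYS))
# [('default', 'web')]
```

In practice you would express the same condition as a Prometheus alerting rule rather than parsing the text yourself; the script only shows the comparison involved.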

Below is an example of a dashboard we’ve assembled, that allows you to monitor how cert-manager is interacting with Let’s Encrypt APIs. The metrics are broken down by path, status code and a number of other labels:

[Figure: A sample of the metrics exposed by the ACME client]

We’ll also be publishing some example dashboards that can be easily used with cert-manager over the coming releases.

Validating resource webhook enabled by default

In earlier releases, we introduced the ‘webhook’ component, which performs advanced validation of your resources before they can be stored in the apiserver, such as ensuring that all DNS names provided are valid.

This means that when a user creates a Certificate, Issuer or ClusterIssuer resource, it is validated and checked to ensure it is well-formed and doesn’t contain mistakes that could otherwise cause problems for the way the controller works.

As part of the v0.6 release, we now enable this webhook component by default. Doing this will ensure that all users are running with a ‘level playing field’ and hopefully prevent bugs/misconfigurations sneaking into production setups!
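To give a feel for the kind of check involved, here is a minimal sketch of DNS name validation. It is not the webhook's actual logic: the `invalid_dns_names` function and its rules (letters, digits and hyphens per label, at most 63 characters per label and 253 overall, one optional leading wildcard) are assumptions made for this example.

```python
import re
from typing import List

# One DNS label: 1-63 letters, digits or hyphens, not starting or ending with a hyphen.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def invalid_dns_names(names: List[str]) -> List[str]:
    """Return the names that fail the simplified checks described above."""
    bad = []
    for name in names:
        labels = name.split(".")
        # Permit a single leading wildcard label, as in "*.example.com".
        if labels[0] == "*":
            labels = labels[1:]
        if not labels or len(name) > 253 or not all(_LABEL.fullmatch(l) for l in labels):
            bad.append(name)
    return bad


print(invalid_dns_names([
    "example.com",
    "*.example.com",
    "bad_name.example.com",
    "-x.example.com",
    "",
]))
# ['bad_name.example.com', '-x.example.com', '']
```

Rejecting such names at admission time means the mistake surfaces when the resource is created, rather than later as a failed issuance.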

ECDSA keys supported for ACME certificates

It’s been requested for a while that we support different private key types beyond RSA. Thanks to the community, we now support ECDSA private keys in all parts of cert-manager.

In order to use the alternate key algorithm, you can simply specify certificate.spec.keyAlgorithm on your Certificate resource. As the project matures, we’ll look to add and expose new fields like this as part of the API specs.

We hope, in time, to provide meaningful abstractions over the X.509 specification, giving you full control over the shape of your PKI assets!
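As an illustration of setting the key algorithm, the sketch below builds a Certificate manifest as JSON, which kubectl accepts as well as YAML. Only `spec.keyAlgorithm` comes from this post; the `apiVersion`, the `keySize` field, the issuer kind and all names are assumptions for the example, so check them against the API reference for your cert-manager version.

```python
import json
from typing import Dict, List


def ecdsa_certificate(name: str, dns_names: List[str], issuer: str) -> Dict:
    """Build a Certificate manifest that requests an ECDSA private key."""
    return {
        "apiVersion": "certmanager.k8s.io/v1alpha1",  # assumed API group/version
        "kind": "Certificate",
        "metadata": {"name": name},
        "spec": {
            "secretName": name + "-tls",  # hypothetical Secret name
            "dnsNames": list(dns_names),
            "issuerRef": {"name": issuer, "kind": "ClusterIssuer"},
            "keyAlgorithm": "ecdsa",
            "keySize": 256,  # assumed field selecting the curve size
        },
    }


manifest = ecdsa_certificate("example-com", ["example.com"], "letsencrypt-staging")
print(json.dumps(manifest, indent=2))
```

The printed JSON could then be saved to a file and applied with kubectl apply -f.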

Scalability improvements

As part of our validation for this release, we’ve been able to test cert-manager in larger deployment configurations. This includes running with tens of thousands of certificate resources at a time, whilst also ensuring that our API client, memory and CPU usage scale linearly.

As a result of this testing, we’ve also got numerous scale-related improvements triaged for the next release, v0.7.

What’s next?

Since we’ve moved to a monthly release cadence, cert-manager v0.7 is due to be released at the end of February. This means more frequent, smaller releases.

cert-manager v0.7 therefore contains a few new features, and a slew of bugfixes. Notable features include:

  • Webhook based DNS01 solvers (ACME): since we began supporting the ACME DNS01 challenge mechanism, we’ve had requests for some way for users to integrate cert-manager with their own DNS nameservers. @zuzzas has been working on a new DNS01 challenge provider, the ‘webhook’ provider. This will allow anyone to integrate cert-manager with their own DNS setups, without having to create pull requests upstream.

  • ARM32 and ARM64 support: this has been a long time coming - from v0.7 onwards, we’ll begin publishing both ARM32 and ARM64 Docker images that can be used in your ARM-based clusters.

  • Improvements to the webhook deployment strategy: we’ve previously relied on a CronJob resource that periodically ensures PKI assets for the webhook are up-to-date. After feedback, we’ve decided to move this to be handled by a new, dedicated controller. This should mean the certificate rotation process for the webhook itself is far more robust.

  • Moving to our own Helm chart repository: this will allow us to publish new copies of the Helm chart more frequently, and also expose the chart on the Helm hub.

  • Improved challenge error handling: we’ll be including failure reasons as part of the ‘reason’ field on Challenge resources, meaning you’ll no longer need to grep through the cert-manager logs in order to work out why your ACME validations are failing.

  • Alpha level support for Venafi issued certificates: a lot of enterprise users make use of the Venafi platform to procure certificates from their own CAs, and have existing processes that utilise the Venafi management capabilities across their organisations. The v0.7 release will include support for integrating cert-manager with Venafi, allowing organisations that already have automated PKI configured to begin consuming certificates within their Kubernetes clusters.

Conclusion

The v0.6 release has been a long time coming, but it gives us a solid basis to build on and paves the way for a stable v1.0 release. We’re really looking forward to getting the next iteration of the project out there, and have goals to mature our API to beta (and finally GA) within the next 6 months.

Stay tuned, keep an eye on the project and watch the blog for more updates!