HashiCorp Vault at Numberly #1

A tale of managing thousands of TLS certificates at Numberly

Context

Numberly creates and hosts thousands of websites and web-facing APIs developed by more than a hundred developers.

Representing more than 4,000 GitLab projects, these interfaces and endpoints become mission-critical parts of our clients’ businesses once in production.

It has long been a necessity for us to protect access to those resources, starting with on-the-wire encryption and HTTPS.

Over the years, we faced many of the scaling challenges you need to solve to overcome both the friction that SSL certificate management adds to production delivery and the automation of certificates’ life cycle and maintenance. In many ways, our challenges are similar to the ones that full-scale hosters such as OVH have been facing.

There’s a big gap between managing a dozen SSL certificates and managing thousands: one can’t just optimize the time humans spend on creating/installing/monitoring/renewing them; at some point you need to make those operations transparent to your developers and infrastructure in order to scale.

We wrote this series of articles to share our 20-year journey and experience in managing secrets, starting from SSL certificate management at scale and moving on to generalizing secret management for teams and applications.

History

Numberly has been around since 2000: we went through the Internet bubble and had the chance to iterate on our hosting strategy many times.

Some things never changed though: our autonomy and technical independence.

  • 2011: Numberly gets its own Autonomous System Number and migrates email routing and web hosting to our own IPs
  • 2012: We buy a bunch of F5 load-balancers to handle traffic load for the launch of our RTB tracking service (30k RPS)
  • 2016: We start migrating some workloads to Kubernetes and slowly move away from F5 for our internal usage
  • 2017: Numberly becomes a Local Internet Registry
  • 2018: We host most of our web facing and data pipelining workloads on Kubernetes
  • 2020: We handle SSL certificates at scale, used in thousands of websites and APIs
  • 2021: Our internal and external network rework allowed us to scale our Kubernetes hosting to multiple datacenters seamlessly

This last challenge made us rethink our whole SSL strategy to cover both publicly and internally accessible services (not only websites).

Before this project, the stack of our web hosting platforms was composed of:

  • Load-balancers: a bunch of F5 blackboxes
  • SSL certificates: handled by DigiCert, with far-from-perfect automation for issuing and renewing certificates, not to mention their tremendous cost
  • Automation: an infamous Google Spreadsheet shared between project managers so they could let us know, through GitLab issues, which websites required an SSL certificate renewal

From that standpoint, any design overhaul would give us better results!


Our needs:

  • Not using F5 blackboxes anymore, because of their lack of automation, license costs and poor observability
  • Not using human-backed procedures to generate SSL certificates
  • Being able to generate SSL certificates at scale
  • Having all of these certificates monitored by default, with automated alerting
  • Having these certificates securely stored
  • Doing that without additional costs


What we decided to do:

  • Replace the F5s with commodity hardware servers running NGINX, to leverage our new BGP anycast network topology spanning both our datacenters
  • Use Let’s Encrypt for certificate generation
  • Leverage our existing Prometheus and Alertmanager stack for monitoring and alerting respectively
  • Use Vault as a secure and highly-available storage

Automating this work required us to create the pipeline described in the following sections.

Securely storing our certificates using HashiCorp Vault

Here comes HashiCorp Vault. SSL certificate storage was the entry point for this technology at Numberly; we’ll cover further use cases in later blog posts (stay tuned).

Vault solved the problem of storing and exposing sensitive data such as SSL certificates in a secure and highly available manner.

Access & Audit: only members of the infrastructure team have access to this KV mount point, and everything is logged through Vault’s audit logs.
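
For reference, enabling a file audit device is a one-liner; the log path below is only an example, not necessarily the one we use:

$ vault audit enable file file_path=/var/log/vault/audit.log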

Hosting & Networking: our Vault nodes are hosted in our two datacenters, with AWS acting as a third one.

They are all able to process requests by announcing the same anycast service IP address in our internal network.

Any client is routed to the nearest Vault server, and even if that server isn’t the active node, it can still handle the request (forwarding it to the leader if needed).


Vault storage format for SSL certificates

$ vault secrets enable -path=kv-certificates -version=2 kv


We needed our SSL certificate storage keys to always follow the same schema, which looks like this (an example write is sketched after the list):

  • cert: to store the SSL certificate in PEM format
  • chain: the Let’s Encrypt chain
  • fullchain: the concatenation of the chain and the cert keys
  • key: the SSL certificate key
  • owner: some information about the owner in case it’s a customer’s certificate
  • timestamp: timestamp for the certificate creation
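
As an illustration, writing one entry following this schema could look like the sketch below; the domain, file names and owner value are made up for the example:

$ vault kv put kv-certificates/foo.acme.com \
    cert=@foo.acme.com.crt \
    chain=@chain.pem \
    fullchain=@fullchain.pem \
    key=@foo.acme.com.key \
    owner="client-xyz" \
    timestamp="$(date +%s)"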


Automating Policy and AppRole deployment with Terraform

Because SSL certificates are very sensitive, we leverage Vault’s AppRole feature.
That way, our applications never share the same Vault token and can be made aware of their token’s expiration so they can renew it.

We’ve automated that part using Terraform:

resource "vault_auth_backend" "approle" {
     type = "approle"
     tune {
       default_lease_ttl = "60s"
     }
}

data "vault_policy_document" "loadbalancer" {
     rule {
       description = "Used by nginx load-balancers to read SSL certificates"
       path = "kv-certificates/data/*"
       capabilities = ["read"]
     }
}

resource "vault_policy" "loadbalancer" {
     name = "loadbalancer"
     policy = data.vault_policy_document.loadbalancer.hcl
}

resource "vault_approle_auth_backend_role" "loadbalancer" {
     backend = vault_auth_backend.approle.path
     role_name = "loadbalancer"
     token_policies = [vault_policy.loadbalancer.name]
     token_ttl = 600
}
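
As an illustration of how a consumer of this role authenticates, here is a sketch using the Vault CLI; the role and mount names follow the Terraform above, while the certificate path at the end is just an example:

# Fetch the role_id and generate a secret_id for the "loadbalancer" AppRole
ROLE_ID=$(vault read -field=role_id auth/approle/role/loadbalancer/role-id)
SECRET_ID=$(vault write -f -field=secret_id auth/approle/role/loadbalancer/secret-id)

# Log in with the AppRole to obtain a short-lived token (600s TTL as configured above)
VAULT_TOKEN=$(vault write -field=token auth/approle/login \
    role_id="$ROLE_ID" secret_id="$SECRET_ID")

# Use that token to read a certificate from the KV v2 mount
VAULT_TOKEN="$VAULT_TOKEN" vault kv get -field=fullchain kv-certificates/foo.acme.com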

Automation pipeline

At Numberly, we run thousands of jobs a day thanks to GitLab CI.

Our GitLab runners run in our on-premises Kubernetes clusters, and we sometimes use external runners to absorb peaks.

It was only logical to use our existing CI/CD platform for this automation job.

Let’s Encrypt implements the ACME protocol, so we had to find a hackable ACME client to handle the integrations we wanted. More than 50 ACME clients exist and are referenced on the Let’s Encrypt website.

We’re huge fans of the KISS principle, and some of us had previous experience with one client written in bash: dehydrated.

The dehydrated project implements a hook design that lets you write custom behaviors in simple bash, which came in handy for our next goals.
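
To make that hook contract concrete, here is a rough bash sketch of such a hook. This is not the actual dehydrated-vault-hook code: the deploy_cert arguments follow dehydrated’s documented hook interface, and the Vault path follows the storage format described above.

#!/usr/bin/env bash
# dehydrated calls the hook with the operation name as the first argument, e.g.:
#   hook.sh deploy_cert DOMAIN KEYFILE CERTFILE FULLCHAINFILE CHAINFILE TIMESTAMP

deploy_cert() {
    local DOMAIN="$1" KEYFILE="$2" CERTFILE="$3" FULLCHAINFILE="$4" CHAINFILE="$5" TIMESTAMP="$6"
    # Push the freshly issued certificate material to the KV v2 mount
    # (cert/chain/owner keys omitted for brevity, see the storage schema above)
    vault kv put "kv-certificates/${DOMAIN}" \
        fullchain=@"${FULLCHAINFILE}" \
        key=@"${KEYFILE}" \
        timestamp="${TIMESTAMP}"
}

# Dispatch: only react to the operations we care about, ignore the rest
HANDLER="$1"; shift || true
if [[ "${HANDLER}" == "deploy_cert" ]]; then
    "${HANDLER}" "$@"
fi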

Using dehydrated, our remaining challenges were:

  • Find a hook for handling DNS challenges with AWS Route 53: there are existing hooks for dehydrated on GitHub, such as dehydrated-route53-hook-script.
  • Find a hook for pushing our certificates to HashiCorp Vault: we forked an existing project to fix some issues with the KV v2 store and came up with dehydrated-vault-hook.

After implementing all of this, our .gitlab-ci.yml looked like this:

image: registry/docker-images/alpine:latest

stages:
     - test
     - trigger

before_script:
     - apk --update-cache add curl

lint:
     # Some linting to make sure we didn't declare wrong domains
     stage: test
     script:
       - apk add bash grep
       - ./check.sh

main:
     stage: trigger
     script:
       # Generate certificates, handling DNS challenges with the AWS route53 hook
        - dehydrated --config /etc/dehydrated/config --cron --hook /var/lib/dehydrated/dehydrated-route53-hook-script/hook.sh --keep-going
       # Push the generated certificates to Vault with our HashiCorp Vault hook
        - dehydrated --config /etc/dehydrated/config --cron --hook /var/lib/dehydrated/dehydrated-vault-hook/vault-hook.sh --keep-going
     only:
       - master

Using SSL certificates seamlessly

One of the evolutions of our infrastructure was to decommission the F5 load-balancers and use NGINX, more precisely OpenResty, an enhanced distribution of NGINX with LuaJIT support.

We use catch-all NGINX server names and the ssl_certificate_by_lua_block directive to automatically fetch the SSL certificate of a website or API.

The only known limitation is that clients must support SNI, so that we can get the server name during the TLS handshake.

Our Lua code reads the server name from the SNI extension and queries Vault through its HTTP API.

We leverage an AppRole with a low TTL (600s). The resulting token is saved in a lua_shared_dict shared-memory zone that every OpenResty worker can use to query Vault. The lookup order is the following (an equivalent curl sketch follows the list):

  • First try: fetch the server_name certificate, e.g. foo.acme.com
  • Second try: the upper wildcard, e.g. *.acme.com
  • Third try: the lower wildcard, e.g. *.foo.acme.com

The third try assumes the server_name is a SAN of the lower wildcard certificate.
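
The same three lookups, expressed with curl against Vault’s KV v2 HTTP API rather than the actual Lua code (VAULT_ADDR and the AppRole-issued VAULT_TOKEN are assumed to be set, and foo.acme.com is a placeholder):

SNI="foo.acme.com"
PARENT="${SNI#*.}"   # acme.com

# Lookup order: exact name, then upper wildcard, then lower wildcard
for candidate in "${SNI}" "*.${PARENT}" "*.${SNI}"; do
    if curl -sf -H "X-Vault-Token: ${VAULT_TOKEN}" \
            "${VAULT_ADDR}/v1/kv-certificates/data/${candidate}"; then
        break
    fi
done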

We use three different caching systems with LRU caches:

  • certs_cache: the working cache, with a cache_expire_time parameter
  • fallback_certs_cache: a cache without expiration that covers the case where a domain expires from certs_cache while the Vault cluster is down
  • unknown_certs_cache: a cache for domains that don’t have an SSL certificate in Vault (meaning the lookup reached the third try)

This third cache is really important as it prevents us from flooding our Vault cluster with queries to check if a certificate exists for every incoming request.

Kubernetes integration

To seamlessly allow our developers to use certificates from our Vault cluster, we’ve used the remarkably efficient vault-secrets-operator project.

It allows our developers to create VaultSecret custom resources that the operator uses to know which SSL certificates have to be synchronized as Kubernetes secrets.

We leverage Kubernetes RBAC to only allow specific users to use this technique as it could be abused to retrieve all SSL certificates.

Here’s what such a VaultSecret resource looks like:

---
apiVersion: ricoberger.de/v1alpha1
kind: VaultSecret
metadata:
     name: vault-star.numberly.com
     namespace: team-xxx
spec:
     keys:
     - fullchain
     - key
     path: kv-certificates/*.numberly.com
     templates:
       tls.crt: '{% .Secrets.fullchain %}'
       tls.key: '{% .Secrets.key %}'
     type: kubernetes.io/tls
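
Usage is then straightforward; as a sketch, assuming the manifest above is saved to a file (the file name is arbitrary), apply it and check that the operator synchronized a Secret in the team’s namespace:

kubectl apply -f vault-star-numberly-com.yaml
kubectl -n team-xxx get vaultsecrets,secrets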

Monitoring and Alerting on SSL certificates

Now that all our SSL certificates are stored in a secure and central place, we can automate their monitoring easily.

Using a GitLab CI job, we generate a YAML file containing the URLs of all our SSL-protected endpoints. It is made available to Prometheus through file_sd_configs, and the targets are probed by an external blackbox_exporter scraped by one of our Prometheus clusters.

- job_name: blackbox-http-static
  file_sd_configs:
    - files:
        - /etc/prometheus/blackbox/static-http-targets/*.yml
  metrics_path: /probe
  params:
    module:
      - http_2xx
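
For illustration, the CI job effectively drops a file like the following into the directory read by file_sd_configs (the file name and URLs are made up for the example):

cat <<'EOF' > /etc/prometheus/blackbox/static-http-targets/websites.yml
- targets:
    - https://foo.acme.com
    - https://www.numberly.com
EOF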

Below is an example of one of our Prometheus alerts:

- alert: SslCertExpiringShortly7days
  expr: last_over_time(probe_ssl_earliest_cert_expiry{job="blackbox-http-static"}[2h]) - time() < 86400 * 7
  labels:
    severity: critical
  annotations:
    summary: "{{ $labels.instance }} expires in {{ $value | humanizeDuration }}"
    grafana: <grafana url>
    documentation: <documentation url to know what to do>
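
As a sanity check, this is roughly what the probe_ssl_earliest_cert_expiry metric reflects for one endpoint; a manual equivalent with openssl (foo.acme.com is a placeholder target):

echo | openssl s_client -servername foo.acme.com -connect foo.acme.com:443 2>/dev/null \
    | openssl x509 -noout -enddate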

Conclusion

Over the course of two years using this method, we can outline some main wins:

  • We never missed an SSL certificate renewal!
  • No human was harmed in creating/renewing an SSL certificate by hand
  • No member of the team spent time reloading web servers to add/renew an SSL certificate
  • The time project managers and developers used to spend waiting for SSL certificates was turned into valuable focus time, making sure our customers were serviced promptly and efficiently
  • We were always alerted about certificates that were going to expire because of some issue (DNS change, Let’s Encrypt API error, etc.)
  • We did countless Vault upgrades without any downtime
  • We could not have done this piece of engineering work without several great Open Source projects, especially Let’s Encrypt, which has been making the Internet more secure since late 2015.

We want to thank all the developers for the time and dedication they put into all the Open Source projects and initiatives we’ve used ❤️. As always, our own work on forks or new projects was contributed back or Open Sourced on our GitHub account.