(octo)DNS at Numberly

The story of DNS management and how we handled DNS records without friction in a multi-tenant environment at Numberly

Did you ever dream of handling DNS records without friction, so that you could create and modify DNS entries with the confidence that you won’t destroy your production and incur the wrath of your sysadmin friends? Let alone in a multi-tenant environment? Here’s our story of how we ultimately turned that dream into reality.

Like a lot of tech companies out there, Numberly has more than a hundred developers creating and modifying software every day, and we want to give them as much freedom as possible by removing as much friction with other teams as we can. Especially when the subject is as sensitive as DNS.

But if you have a minimal sysadmin background (or just pretend to have one), you know that DNS is sometimes the cause of a lot of trouble.

So letting anybody change your DNS records requires a minimum of knowledge and, even with that requirement, a lot of sanity checks.

Here’s a backstory of Numberly’s DNS management.

Pre-2010, the Bind era

Back in the day, we managed a single Bind instance.

To administer any Bind instance without automation, you need at least:

  • SSH access
  • a user with the ability to modify the zone files
  • and most importantly, users with a minimal amount of DNS knowledge so they won’t break your zone file

For these reasons, the internal IT team was in charge of all DNS modifications. And even with good DNS knowledge, human errors kept happening, because we didn’t have the linting and checking features that other servers offer.

As the company grew and because this process was not scaling well, we had to automate it and fix our human error issues. So instead of writing an automation script to generate the Bind configuration, with the necessity to implement sanity checks and such, we looked at PowerDNS and its authoritative server, which ships with an API.

Post-2010 era, PowerDNS

Here comes PowerDNS, a fantastic tool developed by Open-Xchange, the same company behind Dovecot.

As an authoritative DNS server, it gives you the ability to handle thousands of zones, backed by a powerful API and scalable storage backends such as MySQL or PostgreSQL.

The migration was quite seamless as PowerDNS includes some tools to import Bind zone files.

To keep things as simple and dependency-free as possible at the time, the backend we chose was SQLite. It doesn’t scale and it doesn’t handle replication natively, so we stumbled upon a few problems later on.

Shielding our public DNS from DDoS

Also, at that time DNS-based DDoS attacks were really common, and running your own DNS servers could quickly become hard as they are often the target of attacks.

DDoS protections are quite expensive, and if you don’t have any protection system or enough monitoring and logging to automatically implement rate-limiting or IP bans with tools such as dnsdist or CrowdSec (yes, fail2ban is officially dead), you may quickly run into some problems…

That’s one of the reasons we decided to host these PowerDNS servers outside of our network, within a public cloud provider offering DDoS protection.

Later on, a dnsdist instance was set up in front of PowerDNS to rate-limit queries and avoid too many problems. It proved effective. A lot, believe me.

Here’s a sample dnsdist configuration that we used:

-- tuning
setMaxUDPOutstanding(65000)

controlSocket('127.0.0.1:5199')
setKey("sup3rm4g4s3cur3k3y")
-- we should create as many addLocal() entries as we have CPUs for intense workloads
addLocal('<our_public_ip>:53', {doTCP=true, reusePort=true})

-- allow all to recurse us
setACL("0.0.0.0/0")
newServer{address="<pdns_1_ip>", qps=10000, name="pdns-1", useClientSubnet=true, checkType="A", checkName="www.numberly.com.", mustResolve=true}
newServer{address="<pdns_2_ip>", qps=10000, name="pdns-2", useClientSubnet=true, checkType="A", checkName="www.numberly.com.", mustResolve=true}

setServerPolicy(roundrobin)

-- Drop ANY queries
addAction(QTypeRule(dnsdist.ANY), DropAction())

-- Apply Rate Limit for NXDomain and ServFail queries
local dbr = dynBlockRulesGroup()
dbr:setRCodeRate(dnsdist.NXDOMAIN, 5, 10, "Exceeded NXD rate", 60, DNSAction.Drop)
dbr:setRCodeRate(dnsdist.SERVFAIL, 5, 10, "Exceeded ServFail rate", 60, DNSAction.Drop)
dbr:excludeRange({"127.0.0.1/32", "10.0.0.0/8" })

function maintenance()
  dbr:apply()
end

 

A few years later, with the growing usage of Kubernetes at Numberly, we had to connect Kubernetes with PowerDNS in order to let our developers expose their web applications directly from their Kubernetes ingress configuration.
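
For context, here is roughly what that looks like from a developer’s point of view (a minimal sketch with made-up names): the external-dns add-on, which we detail later, watches Ingress objects and creates a DNS record for each declared host.

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
spec:
  rules:
  # this host is picked up by external-dns and turned into a DNS record
  - host: my-app.numberly.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: my-app
            port:
              number: 80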

But one of the cons of the PowerDNS API is that it’s not multi-tenant: PowerDNS gives you a single, global API key.

Fixing PowerDNS’ lack of multi-tenancy

As PowerDNS only ships with a single API key, we wanted to solve that problem to improve the security of our API access.

We used an nginx reverse proxy in front of our PowerDNS API, with an allow-list of API keys configured by Ansible. This only fixes the authentication issue and nothing more.

We thought about doing some ACL-based workflows, like linking DNS zones to an API key, but it was overkill in our situation.

The configuration snippet speaks for itself:

map $http_x_api_key $key {
    default 0;
    sup3rm4g4s3cur3k3y       1; # key for k8s-pa7-1-external-dns
    m3g4sup3rm4g4s3cur3k3y   1; # key for k8s-par5-1-external-dns
    ultr4sup3rm4g4s3cur3k3y3 1; # key for octodns
}

server
{
    listen      443 ssl;
    server_name pdns.acme.internal;

    ssl_certificate /etc/ssl/cert.crt;
    ssl_certificate_key /etc/ssl/cert.key;

    access_log /var/log/nginx/pdns.log main;
    error_log /var/log/nginx/pdns.log error;

    if ($key = 0) {
      return 403;
    }

    location / {
        proxy_set_header Host $http_host;
        proxy_set_header X-Forwarded-Host $host;
        proxy_set_header X-Forwarded-Server $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-API-Key <YOUR_UNIQUE_PDNS_API_KEY>;
        proxy_pass       http://127.0.0.1:8081/;
    }
}

Covid 2020 era, AWS Route53

When your needs and the scale of your DNS zones are large enough, managing your public DNS can become time consuming even with a lot of automation.

Managing our public PowerDNS instances was getting more and more time consuming, so instead of wasting our energy on trying to become a hosting provider, we wanted to simply focus on using DNS. We also thought it was time to better track changes to our DNS records.

Why not go back to simple things, like git, Ansible and a good old CI pipeline?

A cloud-based solution clearly matched these goals, thanks to the hundreds of engineers in several cloud companies who have automated this kind of problem for thousands of other companies like us. We chose AWS Route53 to achieve this.

First try using the Ansible route53 module

We started by using what we knew best and what seemed natural, but we quickly got discouraged: the Ansible route53 module makes one HTTP call to AWS per configured entry. It does not leverage the Route53 bulk API endpoints!
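
For illustration, the kind of task we were running looked roughly like this (a simplified sketch with made-up variable names, using the module’s current community.aws.route53 name); every item in the loop results in its own Route53 API call:

- name: Manage Route53 records one by one
  community.aws.route53:
    state: present
    zone: "{{ item.zone }}"
    record: "{{ item.name }}"
    type: "{{ item.type }}"
    ttl: "{{ item.ttl }}"
    value: "{{ item.values }}"
  loop: "{{ dns_records }}"  # 50k+ records means 50k+ sequential API calls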

Plus, it didn’t implement idempotency, which is a tech pillar at Numberly.

With more than 900 zones totaling more than 50k DNS records, Ansible took 23 hours to run. It was clearly not scalable.

Also, you have to take into account that AWS limits your Route53 API calls to only 5 requests per second: at one call per record, our 50k records add up to almost 3 hours of API time at best.

How could we move our whole DNS zones to Route53 with such a limitation?

Second try using octoDNS

Here comes octoDNS, a tool developed by GitHub to tackle the exact problem we were facing (thank you, Open Source)!

Think of it as “DNS as code”: octoDNS handles multiple providers and lets you use each of them as an input, an output, or both.

Below is an octoDNS sample configuration:

manager:
  max_workers: 1
providers:
  yaml:
    class: octodns.provider.yaml.YamlProvider
    directory: ./records
  route53:
    class: octodns.provider.route53.Route53Provider
  ...

zones:
  numberly.com.:
    sources:
    - yaml
    targets:
    - route53
  ...

 

Declare your zone records:

datalively:
- ttl: 86400
  type: A
  value: 195.66.82.254
- ttl: 86400
  type: MX
  values:
  - exchange: tsunami1.mm-send.com.
    preference: 5
  - exchange: tsunami2.mm-send.com.
    preference: 5
- ttl: 86400
  type: SPF
  value: v=spf1 include:mm-send.com -all
- ttl: 86400
  type: TXT
  values:
  - spf2.0/pra include:mm-send.com -all
  - v=spf1 include:mm-send.com -all

 

Run your change on each merge-request approval with a .gitlab-ci.yml job:

Update DNS:
  stage: deploy
  script: "octodns-sync --config-file=./config.yaml --doit"
  resource_group: production
  only:
    refs:
      - master
    changes:
      - config.yaml
      - records/*
  except:
    - schedules
  retry: 2
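
A natural companion job (shown here as a sketch, not our exact pipeline) is to run the same synchronization without --doit on merge requests: octodns-sync then only prints the planned changes, so reviewers can see what would be applied before approving.

Plan DNS:
  stage: test
  script: "octodns-sync --config-file=./config.yaml"
  only:
    refs:
      - merge_requests
    changes:
      - config.yaml
      - records/*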

Migrating from PowerDNS to octoDNS

Our first move was to handle the migration of our big PowerDNS SQLite database to AWS Route53.

We wrote a script to generate one .yaml file per domain name, cleaned the output and pushed it.

  • input: yaml
  • output: Route53

Our script generated this kind of octoDNS configuration output:

providers:
  route53:
    class: octodns.provider.route53.Route53Provider
    max_changes: 100
  yaml:
    class: octodns.provider.yaml.YamlProvider
    default_ttl: 3600
    directory: ./records
    enforce_order: true
    populate_should_replace: true

zones:
  1ldb.fr.:
    sources:
    - yaml
    targets:
    - route53

 

For 900+ zones it took octoDNS less than 5 minutes to synchronize without triggering any AWS Route53 rate-limit by leveraging bulk API methods!

We couldn’t believe the simplicity and efficiency of the tool 🙂

After a lot of sanity checks (we really could not believe it!), we validated that it worked well and quickly rolled out this project. We were pretty happy about it.

For months, dozens of Numberly users and developers created Merge Requests on GitLab to update our DNS. Git proved useful to track and understand when, why and by whom every DNS record modification was made.

The only problem that remained was that this structural change removed the ability for our developers to create DNS entries through the well-known Kubernetes add-on external-dns.

It was configured to write directly to the AWS Route53 zones, but because of octoDNS’ idempotence, which removes any record it doesn’t know about, it didn’t work well…

Pairing octoDNS and Kubernetes external-dns

At that point our DNS zones were not hosted on our infrastructure any more, and we had a powerful yet rate-limited API.

The only thing left to crack was how to integrate that with our other workloads, such as our Kubernetes integration with external-dns.

We had kept an internal-only PowerDNS cluster to serve internal zones. So we looked at using it to host, without any rate limit, the non-exposed DNS zones that octoDNS could then merge into AWS Route53.

The resulting DNS pipeline overview:

  • input: PowerDNS then yaml
  • output: Route53

We added the following octoDNS configuration:

providers:
  pdns:
    api_key: env/PDNS_API_KEY
    class: octodns.provider.powerdns.PowerDnsProvider
    host: pdns.internal.acme
    port: 443
    scheme: https
  route53:
    class: octodns.provider.route53.Route53Provider
    max_changes: 100
  yaml:
    class: octodns.provider.yaml.YamlProvider
    default_ttl: 3600
    directory: ./records
    enforce_order: true
    populate_should_replace: true

zones:
  numberly.com.:
    sources:
    - pdns
    - yaml
    targets:
    - route53

 

On the Kubernetes side:

 - args:
    - --domain-filter=numberly.com
    - --interval=30s
    - --log-level=debug
    - --policy=sync
    - --provider=pdns
    - --source=ingress
    - --pdns-server=https://pdns.acme.internal
    - --pdns-api-key=sup3rm4g4s3cur3k3y
    image: bitnami/external-dns:0.7.6

 

With that configuration, octoDNS merges PowerDNS and our git-based YAML records, in that order.

This DNS pipeline has the nice side effect of making sure that any dynamic DNS record coming from PowerDNS will be overridden by the git-based YAML records, effectively preventing any user error that could overwrite an existing DNS record (even if external-dns already handles that case).

NetOps + DNS = ❤️

At Numberly, we heavily use a great Open Source tool called Netbox to manage our datacenter infrastructure and our IP addresses, thanks to its powerful API.

At this point in time, our network team had just finished rolling out our brand new Arista-backed datacenter network.

Having automated the network configuration with Ansible and Netbox, they sought to automate their reverse DNS creation and updates as well.

This was easily achieved with a source module that fetches each IP’s DNS reverse attribute from the Netbox API. Once again, an Open Source plugin already existed for the job: octodns-netbox.

  • input: Netbox then yaml
  • output: Route53 then PowerDNS (for internal)

We thus configured octoDNS as follows:

providers:
  pdns:
    api_key: env/PDNS_API_KEY
    class: octodns.provider.powerdns.PowerDnsProvider
    host: pdns.internal.acme
    port: 443
    scheme: https
  netbox:
    class: octodns_netbox.NetboxSource
    url: https://netbox.internal.acme/api
    token: env/NETBOX_TOKEN
  route53:
    class: octodns.provider.route53.Route53Provider
    max_changes: 100
  yaml:
    class: octodns.provider.yaml.YamlProvider
    default_ttl: 3600
    directory: ./records
    enforce_order: true
    populate_should_replace: true

zones:
  82.66.195.in-addr.arpa.:
    sources:
    - netbox
    - yaml
    targets:
    - route53
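
For illustration, assuming an address such as 195.66.82.10 whose dns_name attribute in Netbox is set to rtr1.acme.internal (hypothetical values), the reverse zone ends up with a PTR record equivalent to this octoDNS YAML:

'10':
- ttl: 3600
  type: PTR
  value: rtr1.acme.internal.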

 

In a matter of hours, we had automated reverse DNS for all our public subnets and for the internal subnets used for router interconnections!

Open Source contributions along the way

You seldom follow this kind of path without hitting some rocks. The beauty of Open Source, which we strongly believe in, is that you can do something about it, to the benefit of the whole community!

Join our