DNS Delegation Intermittently Failing Due to CNAME-Based NS Records

“Uh, we had a slight weapons malfunction, but uh… everything’s perfectly all right now. We’re fine. We’re all fine here now, thank you…. Uh, how are you?” – Han Solo

Yesterday I was asked to help a customer with an interesting problem involving delegation for their public DNS. The customer was migrating from one set of public DNS servers to another. As is standard during these migrations, the old DNS servers are kept available through the NS re-delegation, so that any resolving DNS servers that still have the old NS records cached can keep getting answers from the original DNS servers. Once the cached NS records expire, the resolving DNS servers will locate the new DNS servers. After a couple of days, the old DNS servers are decommissioned.
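As a rough sketch of why the old servers stay up for "a couple of days": a resolver that cached the old NS records just before the change can keep using them until their TTL runs out. The change time and the two-day TTL below are hypothetical values chosen for illustration, and the sketch assumes resolvers honour the TTL.

```python
from datetime import datetime, timedelta, timezone


def latest_cache_expiry(changed_at, ns_ttl_seconds):
    """Latest moment a resolver could still hold the old NS records.

    Worst case: the resolver cached the old NS set an instant before the
    delegation changed, so its copy lives for one full TTL after the change.
    """
    return changed_at + timedelta(seconds=ns_ttl_seconds)


# Hypothetical values: when the delegation was changed, and the TTL on the
# old NS records (172800 seconds is two days).
changed_at = datetime(2024, 1, 1, 9, 0, tzinfo=timezone.utc)
old_ns_ttl = 172800

print(latest_cache_expiry(changed_at, old_ns_ttl))
```

Decommissioning the old servers before that time risks leaving some resolvers pointed at servers that no longer answer.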

Unfortunately, not long after the delegation was changed over, they started seeing intermittent failures around the world. The lovely folks over at https://www.whatsmydns.net/ have an awesome tool for checking propagation results for DNS changes, and we saw some very startling results. Some providers around the world were happy with the change, whereas most of them were not. Interestingly, Google and Sprint had no problems, and a Microsoft DNS server I was using locally also had no problems resolving the new records.

After quite a bit of investigation with NSLOOKUP, I was able to walk the delegation hierarchy: first checking the NS records held by the parent domain owner that point to the delegated sub-domain, and then pulling the NS records from the authoritative DNS server itself. And yet, whilst this worked for me, I was getting failures when resolving via Telstra and Internode in Australia (where I’m based).
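That walk boils down to asking each level of the hierarchy for the NS records of the sub-domain. Here is a minimal Python sketch that builds the equivalent nslookup commands. The domain and server names are hypothetical placeholders, not the customer's, and `run_ns_query` is only a thin wrapper that assumes nslookup is on the PATH.

```python
import subprocess


def ns_query_command(name, server):
    """Build an nslookup command asking `server` for the NS records of `name`."""
    return ["nslookup", "-type=NS", name, server]


def run_ns_query(name, server):
    """Run the query and return nslookup's raw output (needs network access)."""
    result = subprocess.run(
        ns_query_command(name, server),
        capture_output=True,
        text=True,
        timeout=15,
    )
    return result.stdout


# Hypothetical names: ask a parent-zone server first, then one of the
# servers the sub-domain has been delegated to.
STEPS = [
    ("sub.example.com", "ns1.parent.example"),
    ("sub.example.com", "ns1.sub.example.com"),
]

for name, server in STEPS:
    print(" ".join(ns_query_command(name, server)))
```

Calling `run_ns_query` for each step and comparing the two answers shows whether the parent's delegation and the authoritative servers agree.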

I then noticed that the NS records for the new DNS servers were not actually NS records pointing to A records, but NS records pointing to CNAME records (long story). I had not seen that done before, but I was initially sceptical of that being the issue, as Google, Sprint and the Microsoft DNS server were resolving it perfectly well.

Typically at this point, one might think the problem was a propagation delay caused by stale cached records, but with the original DNS servers still available, that seemed unlikely. I told the customer to re-delegate the NS records again, but this time to names with A records instead of CNAMEs. Almost immediately, we started seeing improvements, and within 20 or so minutes of that change, DNS resolution was restored.

Some after-the-fact research turned up section 10.3 of RFC 2181, which states:

The domain name used as the value of a NS resource record, or part of the value of a MX resource record must not be an alias. Not only is the specification clear on this point, but using an alias in either of these positions neither works as well as might be hoped, nor well fulfills the ambition that may have led to this approach.
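To make that rule concrete, here is a small Python sketch that scans a list of records for NS targets that are themselves aliases. The record data is entirely hypothetical (example names and a documentation IP address), and the tuple layout is just an assumption made for this illustration; it is not how any particular DNS server stores its zone.

```python
def normalise(name):
    """Lower-case a DNS name and drop the trailing dot so names compare equal."""
    return name.rstrip(".").lower()


def find_aliased_ns(records):
    """Return (zone, ns_target) pairs whose NS target owns a CNAME record.

    `records` is a list of (owner, rtype, value) tuples.
    """
    cname_owners = {
        normalise(owner)
        for owner, rtype, _ in records
        if rtype.upper() == "CNAME"
    }
    return [
        (normalise(owner), normalise(value))
        for owner, rtype, value in records
        if rtype.upper() == "NS" and normalise(value) in cname_owners
    ]


# Hypothetical data: ns1 is an alias (the problem case), ns2 has an A record.
RECORDS = [
    ("sub.example.com.", "NS", "ns1.example.net."),
    ("sub.example.com.", "NS", "ns2.example.net."),
    ("ns1.example.net.", "CNAME", "dns-a.provider.example."),
    ("ns2.example.net.", "A", "192.0.2.53"),
]

print(find_aliased_ns(RECORDS))
```

Any pair the function returns is a delegation that breaks the RFC 2181 rule quoted above and should be re-pointed at a name with its own address records.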

A couple of other RFCs also mention the issue, although this one is probably the most specific. It would be interesting to see which DNS server products Google and Sprint use, though at this point that is more for academic interest.

So, can you have an NS record point to a CNAME record? No… Well, not if you want it to work properly, anyway.

That was certainly a new one for me.

~ Mike
