Last week, after making a change to our Server 2008 DNS infrastructure, we started getting a recurring error where DNS requests for certain top-level domains were not being resolved. After investigating for a while, it turned out to be a problem with the way Server 2008 DNS handles TTL (Time To Live) values. The issue causes .co.uk domains to become unresolvable (as well as other domains that have circular dependencies).

For this post I have made a screencast that runs through the error and how to fix it, but I will also post the details below. If you have a look at the screencast I would appreciate any feedback (via the comments) so I know whether I should look at recording more videos in future.

Symptoms

Clients are unable to resolve .co.uk domains and when using nslookup the following error is returned:

*** Unknown can’t find <name>.co.uk: Server failed

The error happens every two days when using Server 2008 DNS servers. If you are using forwarding then this error does not happen.
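
To spot the symptom without waiting for users to complain, the nslookup output can be checked for the "Server failed" text. The following is a minimal sketch, not a tested monitoring tool: the function names and the example.co.uk placeholder are assumptions made for illustration, and it assumes nslookup is on the PATH.

```python
import subprocess
from typing import Optional


def is_server_failure(nslookup_output: str) -> bool:
    """Return True if the nslookup output contains the 'Server failed' error."""
    return "server failed" in nslookup_output.lower()


def check_name(name: str, server: Optional[str] = None, timeout: float = 10.0) -> bool:
    """Run nslookup for `name` (optionally against `server`) and report
    whether it came back with the 'Server failed' error."""
    cmd = ["nslookup", name]
    if server:
        cmd.append(server)
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    # The error may be written to stdout or stderr, so inspect both.
    return is_server_failure(result.stdout + result.stderr)


if __name__ == "__main__":
    # example.co.uk is a placeholder; substitute a domain you expect to resolve.
    if check_name("example.co.uk"):
        print("DNS server returned 'Server failed'")
    else:
        print("No 'Server failed' error seen")
```

Running this on a schedule against the affected DNS server would show the failure appearing roughly two days after the cache was last cleared.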

Cause

The .uk top-level domain uses nameservers that are all themselves under .uk (currently NSx.nic.uk). This means that the .uk domain has a circular dependency, so resolving any .uk name requires glue records supplied by the root-hint servers.

[Image: Correct records]

This error is caused by Server 2008 ignoring the TTL supplied by the root-hint servers and overriding it to 1 day (discovered by looking through the packet output in the DNS logs).

[Image: Server ready to fail]

This means that the glue records for the .co.uk nameservers (something.nic.uk) expire after one day. After this first expiry, the server copies the NS records from .co.uk into its cache for .nic.uk and everything carries on working.

[Image: Server in a failed state]

After another 24 hours the glue records expire again, but the NS records are left behind. Now, if a client tries to resolve a domain, the server is unable to contact the .uk nameservers and so returns the failure message.
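
The two-day cycle can be restated as a small lookup. This is only a toy model of the sequence described above, assuming the cache was cleared at hour 0 and that the 1-day cap applies to every glue record; it is not how the DNS service actually tracks its cache, and the function name is made up for illustration.

```python
CAPPED_TTL_HOURS = 24  # Server 2008 overrides the supplied TTL to 1 day


def cache_state(hours_since_flush: float) -> str:
    """Return the state described in the post for a given cache age.

    Toy model only: counts how many times the 1-day glue TTL has expired
    since the cache was last cleared.
    """
    if hours_since_flush < 0:
        raise ValueError("hours_since_flush must not be negative")
    expiries = int(hours_since_flush // CAPPED_TTL_HOURS)
    if expiries == 0:
        return "correct records"   # glue records still cached
    if expiries == 1:
        return "ready to fail"     # NS records copied in, still resolving
    return "failed"                # glue gone, stale NS records left behind


if __name__ == "__main__":
    for hours in (0, 12, 24, 36, 48, 60):
        print(f"{hours:>3}h: {cache_state(hours)}")
```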

Solution (dirty hack!)

There are a few things that can be done to get around this issue. Firstly, you can clear the server’s DNS cache, which resets everything. However, this isn’t really a fix, as the problem will recur in another two days’ time. Secondly, you can configure your DNS server to use a forwarder so that all DNS requests are forwarded to another server; again, this might not be appropriate depending on your requirements.
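
If you do need to clear the cache as a stop-gap, it can be scripted rather than done through the DNS console. The sketch below wraps the dnscmd /clearcache command; it assumes dnscmd is installed and that the script runs with administrative rights on (or against) the DNS server. The helper names and the dry_run switch are illustrative only.

```python
import subprocess
from typing import List, Optional


def clear_cache_command(server: Optional[str] = None) -> List[str]:
    """Build the dnscmd command line that clears the DNS server cache.

    If `server` is omitted, dnscmd acts on the local DNS server.
    """
    cmd = ["dnscmd"]
    if server:
        cmd.append(server)
    cmd.append("/clearcache")
    return cmd


def clear_cache(server: Optional[str] = None, dry_run: bool = True) -> List[str]:
    """Clear the cache, or just return the command when dry_run is True."""
    cmd = clear_cache_command(server)
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Dry run by default: prints the command instead of running it.
    print(" ".join(clear_cache()))
```

This only automates the workaround; the failure will still return two days after each clear.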

Lastly, you can configure the maximum TTL value for a record to be 2 days, which overrides the default behaviour of setting it to 1 day. I don't regard this as a proper fix, because if the TTL for the .uk nameserver records is ever modified then the problem will just resurface. A correct fix would be to make sure the 2008 DNS server obeys the TTLs given to it by the root-hint servers.

To apply the change open up the registry and navigate to the following key:

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters

Create a new DWORD value called “MaxCacheTTL” and set its value (in decimal) to 172800 (2 days, in seconds).
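
For anyone applying this to several servers, the value can be worked out and written into a .reg file instead of being typed into the registry editor by hand. This is a rough sketch under that assumption: the function names and the output file name are made up for illustration, and the generated file should be checked before importing it.

```python
SECONDS_PER_DAY = 24 * 60 * 60

KEY_PATH = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\DNS\Parameters"


def max_cache_ttl_seconds(days: int) -> int:
    """Convert a TTL in days to the seconds value stored in MaxCacheTTL."""
    return days * SECONDS_PER_DAY


def reg_file_text(ttl_seconds: int) -> str:
    """Build .reg file text that sets MaxCacheTTL as a DWORD.

    .reg files store DWORDs as 8 hexadecimal digits, so the decimal
    value is converted here.
    """
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{KEY_PATH}]\n"
        f'"MaxCacheTTL"=dword:{ttl_seconds:08x}\n'
    )


if __name__ == "__main__":
    ttl = max_cache_ttl_seconds(2)
    print(ttl)  # 172800
    # Hypothetical output file name; import it with regedit after reviewing it.
    with open("maxcachettl.reg", "w", newline="\r\n") as handle:
        handle.write(reg_file_text(ttl))
```

Note that 172800 in decimal is 2a300 in hexadecimal, which is why the .reg file shows a different-looking number from the one entered in the registry editor.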

Restart the DNS service and then everything should function correctly.