The Domain Name System, or DNS for short, is at the heart of the internet and provides a critical service: name resolution. Simply put, we can reach a website by typing a user-friendly name in our browser instead of a numeric IP address, which is difficult to remember (and may change over time). So, how does the whole DNS thing work? What is the infrastructure that makes it a highly available service? Why should I care about it? This is not as simple as it may sound initially. In the first part of this series on DNS, we will explore the various concepts and how they work together to build the service we call DNS. We will also take a look at the DNS infrastructure at Indix and how it has evolved over time.
Let’s start with our website, www.indix.com, and let’s break it down by the dots. The part at the end (“.com”) is called a top-level domain or TLD (there are many such TLDs like .org, .net etc.), and the middle part (“indix”) is our company’s name. These two parts together form a destination in a “.com” domain which is basically our website. But when you type “www.indix.com” in your browser, how exactly does it know which IP address it should connect to?
First, a quick definition. A “name server” is a machine which provides some sort of information around name resolution. At the very core of DNS, there are 13 special name servers called “root name servers.” Actually, there are far more than 13 physical machines: each of the 13 root server names is served from many locations via Anycast routing. The technical details of Anycast are beyond the scope of this article, but to summarize, Anycast advertises the same IP prefix from more than one location. Depending on the implementation via BGP, this mechanism can provide high availability as well as load balancing across multiple such servers.
Coming back to our main discussion, you can run “dig . ns” to find the full list of root name servers. All these root servers provide the same information: the name servers of the various TLDs like “.com,” “.org,” etc. Run “dig com. ns” to get the complete list of “.com” name servers. Once we know the name servers of, say, “.com,” we reach one such server and ask: what’s the IP address of “www.indix.com”? This gTLD name server does not know the answer itself, but it responds with the list of name servers for “indix.com” where we need to go next (run “dig indix.com ns” to find them). Then we contact one of those name servers, which returns one or more IP addresses for “www.indix.com,” and finally our browser knows which IP it should send a TCP SYN packet to. You can watch this whole process happen by running “dig +trace indix.com”.
One thing to notice here is the discovery of the final “ns” record. Say you get a response that the name server for “indix.com” is “ns1.indix.com,” and hence you need to reach “ns1.indix.com” to resolve “www.indix.com.” But how do you get the IP address of “ns1.indix.com” in the first place? If you follow the same route, you have to resolve “ns1.indix.com” by asking the name server of “indix.com,” which is “ns1.indix.com” itself, and clearly you will be stuck. So whenever the name server’s own name lies inside the domain it serves, resolution becomes a circular dependency. Hence, in such cases, the gTLD servers provide the IP address of the name server along with its name: the response says that the name server is “ns1.indix.com” and that its IP address is some “a.b.c.d.” This is called a glue record, as the IP address is glued together with the name.
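The referral chain and the role of the glue record can be sketched with a toy in-memory model. This is only an illustration: the server addresses (documentation-range IPs), the `SERVERS` table, the name “a.gtld.example.” and the `resolve` helper are all made up, and no real DNS queries are sent.

```python
# Toy model of iterative resolution. Every address and record below is
# hypothetical (documentation-range IPs); nothing here talks to real DNS.
ROOT = "198.51.100.1"

SERVERS = {
    # Root server: only knows which servers handle ".com".
    "198.51.100.1": {"referrals": {"com.": ("a.gtld.example.", "198.51.100.2")}},
    # ".com" server: delegates "indix.com." and supplies the glue address
    # for ns1.indix.com., which lives inside the zone it serves.
    "198.51.100.2": {"referrals": {"indix.com.": ("ns1.indix.com.", "198.51.100.3")}},
    # Authoritative server for indix.com.: holds the final A record.
    "198.51.100.3": {"answers": {"www.indix.com.": ["203.0.113.10"]}},
}


def resolve(name, server=ROOT, max_hops=10):
    """Follow referrals from the root until some server answers for `name`."""
    path = []
    for _ in range(max_hops):
        path.append(server)
        data = SERVERS[server]
        if name in data.get("answers", {}):
            return data["answers"][name], path
        referrals = data.get("referrals", {})
        # Pick the most specific delegated zone that contains the name.
        zones = [z for z in referrals if name == z or name.endswith("." + z)]
        if not zones:
            raise LookupError(f"{server} has no answer or referral for {name}")
        ns_name, glue_ip = referrals[max(zones, key=len)]
        # Without glue_ip we would have to resolve ns_name first, which for
        # ns1.indix.com. would lead straight back to this same referral.
        server = glue_ip
    raise LookupError("too many referrals")


addresses, path = resolve("www.indix.com.")
print(addresses, "via", " -> ".join(path))
```

Each hop in `path` corresponds to one step of the “dig +trace” output: root, then TLD, then the domain’s own name server.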
Next, let’s talk about the management aspect of DNS. The Network Information Center (NIC, later InterNIC) was the governing body for DNS management, run by Stanford Research Institute in the beginning and then by Network Solutions, until 1998, when the authority was shifted to the Internet Corporation for Assigned Names and Numbers (ICANN). Today, ICANN is the ultimate authority for DNS. There are many DNS registries, each managing the database of one or more TLDs (you can find the list of registries here), and that list is maintained by the Internet Assigned Numbers Authority (IANA). Then there are DNS registrars, who are granted access to create domain names under certain TLDs (instead of selling directly to customers, some registrars use resellers to sell domains for them). The registrars (like GoDaddy in our case) pay a fee to the registries for the domains they register.
Coming back to the client-side process, let’s say we want to know the IP address of “www.indix.com” on our machine. The name resolution process on a Linux machine starts with the /etc/nsswitch.conf file, which specifies the order in which sources are consulted for name resolution (older versions of glibc used /etc/host.conf, where the keyword `order` was relevant). Assuming the default order of /etc/hosts followed by the resolver configured in /etc/resolv.conf, and that there is no matching record in /etc/hosts, the resolver picks a name server from /etc/resolv.conf to query. Usually the entries in resolv.conf are your ISP’s DNS servers, which perform recursive DNS resolution on your behalf. Such a name server first looks in its own cache for the requested record. If it is not there, it looks for cached NS records for the domain, since having those lets it contact the domain’s name servers directly. If that fails as well, it looks for “.com” NS records in its cache to find the TLD name servers. If those too are absent, it has no option but to reach out to one of the root name servers and ask for the record. The root hints file, which lists the root name servers, can be found here, and BIND, for example, stores it as /var/named/named.root or named.ca.
This last case applies when the name server starts up for the first time and has nothing in its cache. It then sends a DNS query to a root name server, the referral process described earlier kicks in, and we get the IP address needed.
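The cache fallback order described above can be sketched as a small lookup function. This is a simplified illustration, not how any particular resolver is implemented: the cache layout (a dictionary keyed by name and record type), the `starting_point` helper and the sample records are assumptions made for the example.

```python
# Hypothetical cache layout: (name, record type) -> list of values.
ROOT_HINTS = ["a.root-servers.net."]  # one root server name, for brevity


def starting_point(name, cache):
    """Return where a recursive resolver would start an A lookup of `name`."""
    # 1. The record itself is cached: answer immediately.
    if (name, "A") in cache:
        return ("answer", name, cache[(name, "A")])
    labels = name.rstrip(".").split(".")
    # 2. Otherwise try the closest enclosing zone first
    #    (indix.com., then com.) and ask its name servers.
    for i in range(1, len(labels)):
        zone = ".".join(labels[i:]) + "."
        if (zone, "NS") in cache:
            return ("ask", zone, cache[(zone, "NS")])
    # 3. Nothing useful cached: start from the root servers.
    return ("ask", ".", ROOT_HINTS)


# Sample cache contents (made-up values for illustration).
warm_cache = {
    ("com.", "NS"): ["a.gtld.example."],
    ("indix.com.", "NS"): ["ns1.indix.com."],
}
print(starting_point("www.indix.com.", warm_cache))
print(starting_point("www.indix.com.", {}))
```

With a warm cache the resolver skips the root and TLD servers entirely; with an empty cache it falls through to the root hints, which is the bootstrap case.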
We all know that DNS uses UDP (port 53) for queries, and that is true: the protocol was probably chosen in order to avoid TCP’s 3-way handshake and get a faster response. But there are cases where DNS also uses TCP. For example, DNS has historically limited a UDP message to 512 bytes. Even though a UDP packet can be much larger, a packet larger than the path MTU will cause IP fragmentation, which is undesirable because fragments are not handled reliably by various network devices. Considering the maximum size of the IP/UDP headers, a value of 512 bytes was considered safe, and thus became the standard.
But if the response is larger than 512 bytes, the DNS server will respond with partial data and set the truncation (TC) flag in the header. If so, the client is supposed to issue the same request over TCP and continue with the resolution process. For the same reason, and for reliability, zone transfers also use TCP instead of UDP. (Note that the 512-byte limit is no longer a hard one: with EDNS, a client can advertise that it accepts larger UDP responses.)
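To make the TC flag concrete, here is a minimal sketch of how a client could inspect it. The header layout (a 16-bit ID followed by a 16-bit flags field in which TC is bit 0x0200) comes from the DNS message format; the helper names and the sample headers are made up for illustration, and no network traffic is involved.

```python
import struct

TC_MASK = 0x0200  # truncation bit within the 16-bit flags field


def is_truncated(message: bytes) -> bool:
    """Return True if the DNS message header has the TC flag set."""
    if len(message) < 12:
        raise ValueError("a DNS header is 12 bytes long")
    _query_id, flags = struct.unpack("!HH", message[:4])
    return bool(flags & TC_MASK)


def next_transport(udp_response: bytes) -> str:
    """Decide whether the same query must be repeated over TCP."""
    return "tcp" if is_truncated(udp_response) else "udp"


# Two hand-built 12-byte headers: ID, flags, then four record counts.
# 0x8380 = response + TC + RD + RA; 0x8180 is the same without TC.
truncated = struct.pack("!HHHHHH", 0x1234, 0x8380, 1, 0, 0, 0)
complete = struct.pack("!HHHHHH", 0x1234, 0x8180, 1, 1, 0, 0)

print(next_transport(truncated))
print(next_transport(complete))
```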
At Indix, we have a hybrid DNS architecture in place today. We started with one solution and gradually adopted alternatives due to the challenges or complexities posed by the previous ones. For example, we had only BIND as the DNS software at the very beginning. BIND has been the de facto standard DNS server and is the most widely used even today, so naturally we chose it when we started. We had GoDaddy as our registrar, which held all of our public records, and we delegated the private zones to these BIND servers. The BIND servers were then propagated as name servers to every instance (we are hosted entirely in AWS) using the DHCP Option Sets of the VPCs. It worked well initially, but over time some key things changed.
The initial setup looked like this:
The number of DNS queries hitting BIND increased over time as the infrastructure footprint grew. This started causing occasional spikes in the number of requests and forced us to scale out from one BIND server to several. Naturally, the maintenance headache got worse with this change. We also created a few more VPCs to bring up new services in isolated environments. We used a custom version of the DNS update in our UDF (User Data File) to set ‘A’ records during bootstrap. While auto-scaling a cluster from a few instances to a few hundred, lots of update calls hit BIND more or less concurrently, even though such bursts are not the norm. Similarly, cleaning up records during instance termination also became a headache. Put all these issues together, and BIND was becoming a bottleneck.
By then, Route53 had become a stable DNS service in AWS, and we started to think about migrating from BIND to Route53. We were aware that this was going to be pretty complex and that a smooth migration would need careful planning and significant time (it is still a work in progress at this point). But the frequency of failures in our BIND setup became an issue to address urgently, as we were continuously scaling our infrastructure. Also, we realized that the limit on Route53 API calls was too low for us (five requests/second per AWS account). There are two ways to deal with this: batch multiple changes into a single ChangeResourceRecordSets API call, or follow exponential backoff/retry. The first would mean a centralized approach to DNS updates, which we didn’t want (every instance is supposed to take care of its own DNS during bootstrap, via the UDF), and the second would mean a significant delay in provisioning, which was not acceptable either.
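To see why the backoff route adds provisioning delay, here is a generic sketch of exponential backoff with retry. It deliberately does not call any AWS API: the `Throttled` exception, the `register_dns` stand-in and the delay parameters are all hypothetical, chosen only to show the shape of the technique.

```python
import random


class Throttled(Exception):
    """Stand-in for a rate-limit error returned by a DNS update API."""


def backoff_delays(base=0.5, cap=8.0, attempts=5):
    """Delays in seconds that double each attempt, up to `cap`."""
    return [min(cap, base * 2 ** i) for i in range(attempts)]


def call_with_backoff(operation, delays, sleep, jitter=random.random):
    """Retry `operation` on Throttled, sleeping a jittered delay in between."""
    for delay in delays:
        try:
            return operation()
        except Throttled:
            # Randomising the wait keeps many instances that boot at once
            # from retrying in lockstep.
            sleep(delay * jitter())
    return operation()  # last attempt; let Throttled propagate


# Demo: a fake update that is throttled twice before it succeeds.
attempts_seen = {"count": 0}


def register_dns():
    attempts_seen["count"] += 1
    if attempts_seen["count"] <= 2:
        raise Throttled()
    return "registered"


waits = []  # record the waits instead of really sleeping
result = call_with_backoff(register_dns, backoff_delays(), waits.append,
                           jitter=lambda: 1.0)
print(result, waits, "worst case:", sum(backoff_delays()), "seconds")
```

With these example parameters, an instance that keeps getting throttled could wait up to 15.5 seconds before its last attempt, which is the kind of bootstrap delay the second option implies.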
Finally, we decided on something different. Instead of using either of the above solutions, we thought of using Amazon-provided DNS. Amazon-provided DNS is a DNS server that you get with a VPC, and it is reachable at the third IP address of your VPC CIDR (the network base address plus two, for example 10.0.0.2 for a 10.0.0.0/16 VPC). It can be used to resolve the default public host name (like ec2-public-ipv4-address.compute-1.amazonaws.com) or the default private host name (like ip-private-ipv4-address.ec2.internal), and we wanted to use the latter. That was good enough, since we wanted to migrate just our Hadoop clusters, which were our main concern, as these are the clusters that scale out on demand. Check out our in-house auto-scaling tool for Hadoop here.
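As a quick illustration, both the resolver address and the default private host name can be derived with Python’s standard ipaddress module. The helper names and the sample CIDR and IP are made up for the example, and the “ec2.internal” suffix is simply the one quoted above; other regions may use a different suffix.

```python
import ipaddress


def amazon_dns_address(vpc_cidr: str) -> str:
    """Third address of the VPC range, i.e. the network base plus two."""
    return str(ipaddress.ip_network(vpc_cidr)[2])


def default_private_hostname(private_ip: str, suffix: str = "ec2.internal") -> str:
    """Build a name of the form ip-private-ipv4-address.<suffix>."""
    # Validate the address before formatting it.
    ip = ipaddress.ip_address(private_ip)
    return "ip-" + str(ip).replace(".", "-") + "." + suffix


print(amazon_dns_address("10.0.0.0/16"))
print(default_private_hostname("10.0.1.25"))
```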
But then there is a catch again. Since we propagate our BIND servers via DHCP, every node gets the same name servers (BIND) in its resolver configuration. We can’t change this, as that would affect every other instance. To deal with this problem, we used DHCP hooks via our configuration management system (we use Chef and Ansible, mostly). The /sbin/dhclient-script is used by the DHCP client to configure network settings before or after renewing the lease, and it allows customization via “enter” and “exit” hooks. The function “make_resolv_conf” inside the script writes “/etc/resolv.conf,” and it can be overridden from the “enter” hook. We wrote a small cookbook (applicable to the Hadoop clusters only) that creates a DHCP enter hook and configures the resolver to use Amazon-provided DNS. And we are done!
Coming back to Route53, we have started planning the migration of the other EC2 DNS names from BIND to Route53. First, we have used slightly different names for the internal sub-domains and are using them for any new services being deployed. Since services in one VPC need to access services in other VPCs, we just had to associate every additional VPC with the newly created private hosted zones as and when necessary. For the actual migration, we plan to use tools like cli53. The main challenge here is to sync every record in BIND with the Route53 zones without blocking DNS updates at any point in time. We are about to start this project in the next quarter and will definitely share our experience.
This is what our architecture looks like today:
Stay tuned for more on our DNS consolidation project and best practices around DNS management.
Also published on Medium.