"DNS is one of those things that gets overlooked… You make your voice servers super-redundant, but you take it for granted that DNS will always work." — Fonality
The Domain Name System (DNS) conventionally helps devices convert domain names, like VoIPCarrier.com, to IP addresses, like 216.128.128.50. But in Voice and Unified Communication (UC) services, it serves an additional key role in fault tolerance, and enabling encryption. For large networks, it can also assist in balancing load across many data centers.
Background: In typical Voice Network implementations, the UC client or SIP phone is configured to make requests to its local “recursive caching” DNS server. (So called “recursive”, because it makes multiple steps to arrive at the answer, and “caching” because it stores the resulting answers for some time.) This server consults with the Root and Top-Level-Domain (TLD) servers that handle domains like .com. Those TLD servers direct the DNS client’s recursive caching DNS to the authoritative DNS server for the Voice/UC provider, and that authoritative server provides the IP addresses, failover addresses, load balancing settings, and encryption settings for the UC or Voice Service.
There are some common variations on this theme:
1. Voice & UC Operators that don’t use the Internet. For Cable operators, “Metropolitan Service Operators” (MSOs) and Fiber to the Premise (FTTx) Providers, DNS can be used, but it doesn’t use the Root Servers or TLDs. These operators cannot provide their service across the Internet.
2. Operators that don’t use DNS at all. Some networks are built with the IP addresses configured in the devices, so no DNS is ever used. As 2600hz argues, this strategy can mean that inevitable IP address renumbering becomes extremely expensive and can create outages for customers. “This can be dangerous if you have customers who provision their own phones and, later on, you move data centers or change IP blocks for some reason and have no way to change the IPs of those phones.”
The most common mistake is that SIP Phones and Desktop UC Clients are relying on the local DNS servers in their data networks. For some users, this is the outdated Windows server in the closet, but it’s often the underpowered DNS software provided in the local router by the Internet Service Provider (ISP).
The risk is that when this local, underpowered, or under-managed DNS server malfunctions, the Voice/UC service fails. The Voice/UC provider is responsible for important features, like the ability to make emergency phone calls, or conduct business (video calling, desktop sharing, messaging, etc.), but all of those services fail where the local DNS server fails.
Any one DNS server on the Internet can be attacked and overloaded using botnet attacks. Even most organizations that offer DNS services can be overwhelmed. History shows that attacks tend to be launched against individual DNS operators. (Note that the Global Root DNS servers are under continuous attack, and have defenses that are effective so far.)
For example, in March 2017, GoDaddy’s DNS service in Europe were shut down. This resulted in a serious outage for operators depending exclusively on GoDaddy’s DNS services.
GoDaddy isn’t alone. In October 2016, DNS provider Dyn was hit, shutting down their customers.
In September 2016, Voice / UC Fonality suffered a four-hour outage due to DNS problems. This was due to having a single, integrated DNS system. A single human error brought down all parts of their DNS servers. After this outage, they added diversity to their DNS system so that it wasn’t hosted by a single provider (themselves!).
Some SIP Phones can store a cache of DNS records locally, provided to the device by the configuration file. For example, the Polycom UC SIP devices support these configurations, and only use them if no DNS replies are returned. The Polycom implementation supports basic A records, as well as the advanced records, SRV and NAPTR, used for load balancing and encryption management.
Yealink SIP Phones has a Static DNS Cache supporting A, SRV, and NAPTR that can be configured either as the preferred source of DNS settings, or can be used as as backup.
Mitel SIP Phones have similar support for A and SRV records in thie DNS Pre-Caching functionality.
Software vendors who provide desktop and Mobile software should build in this capability, so that the configuration file delivered to the software has a DNS cache backup in it.
Every DNS records has a “Time To Live” (TTL) setting, which specifies how long caching servers should store the record.
Large Values, such as four hours (which would be configured as 14400 seconds), have the advantage of being stable. If there’s an interruption in access in the DNS system, these ensure that the old records will stay in the cache for some time. But the downside is that when changes are needed, the operator may have to wait four hours for the change to go into effect.
Small Values, such as 60 seconds, have the advantage of allowing relatively quick changes. For example, if a provider learns that their primary Session Border Controller (SBC) site is having problems, they can make a change, and expect all DNS requests within 60 seconds to get the new value and route requests to the new location. But small values have a downside: if there’s a short interruption in the client’s access to the DNS servers, then the clients will lose all old records.
This is such a common problem that Cisco’s OpenDNS SmartCache continues to store and use records after they have expired — but only if the authoritative servers are unavailable. This only helps if the Voice/UC clients are using OpenDNS as the recursive caching servers.
Large TTLs don’t provide complete protection against DNS failures. Note that even with a large TTL, you should expect that records will begin expiring immediately upon failure of your authoritative caching service. For example, if you have a TTL of four hours, and an outage occurs at 12:00pm, then by 1:00pm approximately 25% of your users’ caches will have expired and by 2:00pm, half of your users are offline.
BGP is the system used by large organizations to configure the routes for IP traffic. An IP Prefix, such as 216.128.128.0/21, specifies the routing path for 2,048 IP addresses.
But because BGP Prefixes are subject to attack and downtime too, you don’t want to have all of your Recursive Caching, or your DNS servers, in the same BGP prefix advertisement.
The takeovers of a BGP Prefix is called BGP Prefix Hijacking, and has been a known threat on the Internet for over a decade. Zach Julian has an overview of how BGP Hijacking is done, and reports on several instances of successful BGP hijacking where traffic is intercepted and rerouted.
(For the same reasons, you also don’t want to have all of your Session Border Controllers (SBCs), or all of your Load Balancers, or all of your Production Firewalls in the same BGP prefix if you can avoid it.)
Hijacking one BGP prefix is generally easier than hijacking more than one, so by having a diverse set of BGP prefixes, you make this attack more difficult and improve the robustness of your network.
A rarer attack that is still possible is hijacking of a BGP Autonomous System Number (ASN). The ASN is used in Internet Routing to identify a large company or Internet Service Provider. An ASN attack would allow an attacker to interrupt BGP traffic for an entire ISP or large enterprise.
Any one Internet link is subject to attack and overload. One of the most common forms of Internet Attack is a Distribute Denial of Service (DDoS) attack, where millions of vulnerable devices are used to blast traffic at a single destination. DDoS mitigation is an active industry, with a number of firms providing tools and services to fight DDoS attacks.
Arbor Networks is one of the leading firms providing equipment built to fight DDoS attacks. Akamai and AT&T offer services to clients to help fight DDoS attacks.
However good all these services are, while the DDoS attack is in early stages, access to the DNS servers, Session Border Controllers, and Load Balancers will be interrupted for some time.
Many operators have redundant Internet links to provide fault tolerance. In the normal implementation, all Internet links are setup to be capable of handling all of the traffic. However, this means that every DDoS attack will hit every link simultaneously.
The DNS registrar is responsible for loading domain name information into the root and Top-Level Domain (TLD) servers. For example, if you own VoIPCarrierX.com, you’ll pay a registrar, like Register.com, GoDaddy, or EasyDNS a small annual fee to keep your data loaded in the root name servers.
But every registrar is susceptible to attack or malfunction. For example, in 2016, the DNS Registrar for New York Times was attacked, leading to an outage for the news site.
http://www.nytimes.com/2013/08/28/business/media/hacking-attack-is-suspected-on-times-web-site.html
In early 2017, a Brazilian Bank’s operation was taken over after its Registrar was hacked.
As discussed above, if a DNS registrar is attacked, then the attackers can control the access to the domain name. But most of the Top Level Domains (TLDs) are managed separately from one another, and it is conceivable that an entire Top Level Domain could be attacked.
For example, the servers that handle .com are different from the servers that handle .net. Each country’s top level domain, like .us (for the United States) and .br (for Brazil) are also distinct.
It’s easy to assume that all top-level domains are well managed. But because they are run separately, we shouldn’t assume they are. For example, China’s .cn TLD was successfully attacked in 2013.
By using different Top Level Domains, you can protect your services against excessive dependence on any one TLD server. For US-based companies, .com and .net are distinct domains, but there may be value in finding more diversity outside a single nation’s DNS management.
Voice/UC services have a distinct advantage here: users are not normally aware of the domain name in use. Google cannot easily use multiple domain names: everyone expects “gmail.com” to stay put at that name. But SIP phones and UC clients can be configured to use multiple domain names.
This is an important capability at the heart of many of the options here. To use multiple domain names, the client has to be capable of failing over.
This has been a proven technique. Voice Provider 2600hz uses multiple domain names internally, (one in .com and one in .net), and configures its devices to know how to query both.
TLS (formerly known as SSL) has an important capability of validating that the server used is the legitimate and intended server. For example, if your SIP phone is configured to connect to VoIPCarrierX.net, then the SIP phone can validate that, once it is finally connected to some IP address (say, 216.128.128.255), the server provides it a certificate that is cryptographically signed by a key that the phone already trusts. The list of trusted parties is stored in the “Root Chain” on the TLS client; if the SIP phone has been configured to trust, say, Verisign, then the certificate for could be signed by Verisign identifying the server as legitimate for VoIPCarrierX.net. TLS also has a mechanism so that the client confirms mathematically that the server actually has some internal secret information — a long password key — proving it is truly the owner of the signed certificate. StackExchange has a nice introduction to this technology.
This means that after the entire DNS process has played out, using all the techniques above, the SIP phone or UC client can be built to verify that the server it finally reached was legitimate. This is a key requirement so that, if the DNS protections put in place fail, the phone will refuse to talk to an illegitimate hijacker.
DNS can be an effective and powerful way to manage Unified Communications services, but it does expose the operator to a number of attack methods. Smart Service Providers are taking the challenge seriously, implementing a variety of techniques to introduce diversity and genuine redundancy, including BGP, TLS, Top Level Domains.