DNS Load Balancing and Failover

Round-robin DNS, weighted records, health-check failover, GSLB, and the real limitations of DNS-based load balancing.

Long before application-level load balancers existed, engineers discovered they could distribute traffic using DNS. The idea is straightforward: return multiple IP addresses for a single domain name, and let clients spread themselves across your servers.

It’s simple. It works. And it has some serious limitations you need to understand before relying on it.

Round-Robin DNS: The Simplest Load Balancer

Round-robin DNS is the most basic form of DNS load balancing. You configure multiple A records for the same name, and the DNS server rotates through them:

app.example.com.    300    IN    A    198.51.100.1
app.example.com.    300    IN    A    198.51.100.2
app.example.com.    300    IN    A    198.51.100.3

When a resolver queries app.example.com, the authoritative server returns all three addresses but rotates the order. The first query gets [.1, .2, .3], the next gets [.2, .3, .1], then [.3, .1, .2], and so on.

Most clients connect to the first address in the response, so rotating the order spreads connections across servers. RFC 1794 (“DNS Support for Load Balancing”) describes this technique, though rotation is a server implementation behavior rather than a protocol requirement — not every name server rotates by default.
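The rotation described above can be sketched as a tiny simulation (illustrative only — real authoritative servers implement this internally):

```python
from collections import deque

# Hypothetical A records for app.example.com, matching the zone snippet above.
records = deque(["198.51.100.1", "198.51.100.2", "198.51.100.3"])

def answer_query(pool: deque) -> list[str]:
    """Return the full record set, then rotate so the next
    query sees a different address in first position."""
    response = list(pool)
    pool.rotate(-1)  # move the first record to the back
    return response

print(answer_query(records))  # ['198.51.100.1', '198.51.100.2', '198.51.100.3']
print(answer_query(records))  # ['198.51.100.2', '198.51.100.3', '198.51.100.1']
print(answer_query(records))  # ['198.51.100.3', '198.51.100.1', '198.51.100.2']
```

Each response contains all three addresses; only the order changes, which is exactly why a first-address-wins client sees a different server on each fresh resolution.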

Why It Barely Qualifies as Load Balancing

Round-robin DNS distributes DNS responses, not traffic. The distinction matters enormously:

  1. Caching breaks the distribution: A resolver serving 10,000 users caches the response. All 10,000 users get the same order until the TTL expires. One server might get hammered while others sit idle.

  2. No awareness of server health: If 198.51.100.2 goes down, DNS keeps handing out its address. Users who receive it will experience connection failures until the TTL expires and a new resolution happens.

  3. No awareness of server capacity: Every server receives an equal share regardless of its processing power. A 2-CPU VM gets the same traffic as a 64-CPU bare-metal server.

  4. Client behavior varies: Not all clients use the first address. Some try them in order (falling back on failure), some pick randomly, and some always prefer certain addresses.

Despite these limitations, round-robin DNS remains useful as a crude distribution mechanism, especially when combined with other techniques.

Weighted DNS Records

Weighted DNS improves on round-robin by assigning relative weights to each record. A server with weight 70 receives roughly 70% of responses, while a server with weight 30 receives 30%.

While standard DNS doesn’t natively support weights, managed DNS providers implement them as a feature:

; AWS Route 53 weighted routing (conceptual)
app.example.com    A    198.51.100.1    weight=70
app.example.com    A    198.51.100.2    weight=20
app.example.com    A    198.51.100.3    weight=10

Weighted DNS is valuable for:

  • Gradual rollouts: Route 5% of traffic to a new deployment, then increase to 25%, 50%, and finally 100%
  • Capacity-proportional distribution: Send more traffic to beefier servers
  • A/B testing: Split traffic between different backends at the DNS level
  • Draining servers: Set weight to 0 to stop new traffic while existing connections finish

The same caching caveats from round-robin apply. Weights influence the probability of which answer a resolver receives, but once cached, all users behind that resolver get the same answer until TTL expiry.
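A provider’s weighted selection can be approximated with weighted random choice — a sketch of the concept, not any vendor’s actual implementation:

```python
import random
from collections import Counter

# Weights mirroring the conceptual Route 53 example above.
records = {
    "198.51.100.1": 70,
    "198.51.100.2": 20,
    "198.51.100.3": 10,
}

def pick_answer(pool: dict[str, int]) -> str:
    """Choose one address with probability proportional to its weight.
    A weight of 0 means the server is drained: it is never chosen."""
    addresses = list(pool)
    weights = list(pool.values())
    return random.choices(addresses, weights=weights, k=1)[0]

# Over many simulated queries, responses approach the 70/20/10 split.
counts = Counter(pick_answer(records) for _ in range(10_000))
```

Note that this models the authoritative server’s choice per query; as the paragraph above says, resolver caching still means one choice can fan out to thousands of users.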

Health-Check Based Failover

The most critical limitation of basic DNS load balancing is its blindness to server health. Health-check based failover addresses this by monitoring backend servers and removing unhealthy ones from DNS responses.

Here’s how it typically works:

  1. The DNS provider’s health check system probes each server at regular intervals (every 10-30 seconds)
  2. Probes might be TCP connections, HTTP requests checking for a 200 status, HTTPS checks, or custom health endpoints
  3. If a server fails a configurable number of consecutive checks, it’s removed from DNS responses
  4. When the server recovers and passes checks again, it’s added back

Normal state:
app.example.com → [198.51.100.1, 198.51.100.2, 198.51.100.3]

Server .2 fails health checks:
app.example.com → [198.51.100.1, 198.51.100.3]

Server .2 recovers:
app.example.com → [198.51.100.1, 198.51.100.2, 198.51.100.3]
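The removal logic in steps 1–4 amounts to a small state machine per server. A minimal sketch — the threshold and class names here are assumptions for illustration, not any provider’s defaults:

```python
FAIL_THRESHOLD = 3  # consecutive failed probes before removal (assumed value)

class ServerHealth:
    """Track consecutive probe failures and decide DNS membership."""

    def __init__(self, address: str):
        self.address = address
        self.consecutive_failures = 0
        self.healthy = True

    def record_probe(self, success: bool) -> None:
        if success:
            self.consecutive_failures = 0
            self.healthy = True        # recovered: added back to responses
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= FAIL_THRESHOLD:
                self.healthy = False   # removed from DNS responses

def dns_answer(servers: list["ServerHealth"]) -> list[str]:
    """Return only the addresses currently passing health checks."""
    return [s.address for s in servers if s.healthy]
```

Replaying the diagram above: after three failed probes, .2 drops out of the answer; a single successful probe adds it back.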

The TTL Gap

Health-check failover has an inherent limitation: the TTL gap (see DNS Caching and TTL for the mechanics). Even after a server is removed from DNS responses, resolvers that already cached the old response (with the failed server) will continue directing users to it until the cache expires.

With a 300-second (5-minute) TTL, users could be directed to a dead server for up to 5 minutes after it fails. This is why DNS-based failover systems use aggressively short TTLs — often 30-60 seconds. But short TTLs increase DNS query volume and make you more dependent on your DNS infrastructure’s availability.

Some modern resolvers and browsers implement Happy Eyeballs (RFC 8305), which tries multiple addresses concurrently and uses whichever responds first. This mitigates the TTL gap for clients that support it.
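The core idea can be sketched with concurrent connection attempts. This is a deliberate simplification: real RFC 8305 implementations stagger attempts roughly 250 ms apart and interleave IPv6/IPv4 addresses rather than launching everything at once, and the function name here is made up for illustration:

```python
import concurrent.futures
import socket

def first_successful_connect(endpoints, timeout=1.0):
    """Try all (host, port) endpoints concurrently; return the
    (endpoint, socket) pair for whichever connects first."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        futures = {pool.submit(socket.create_connection, ep, timeout): ep
                   for ep in endpoints}
        winner = None
        for fut in concurrent.futures.as_completed(futures):
            try:
                sock = fut.result()
            except OSError:
                continue      # this endpoint failed; keep waiting on the rest
            winner = (futures[fut], sock)
            break
        # Best-effort cancellation of the losing attempts (in this sketch, a
        # still-running attempt that later succeeds would leak its socket).
        for fut in futures:
            fut.cancel()
        if winner is None:
            raise OSError("all connection attempts failed")
        return winner
```

Because a dead cached address simply loses the race instead of blocking the client, this is what lets Happy Eyeballs clients ride out the TTL gap.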

Global Server Load Balancing (GSLB)

GSLB is the enterprise-grade evolution of DNS load balancing. It combines health checking, geographic routing, and performance data into a comprehensive traffic management system.

A GSLB system typically includes:

  • Distributed health monitors — probing servers from multiple vantage points worldwide
  • Geographic awareness — routing users to the nearest healthy data center
  • Performance metrics — factoring in server response times and current load
  • Policy engine — complex routing rules (primary/secondary, geographic preferences, capacity limits)
  • Automatic failover — detecting regional outages and rerouting traffic

Active-Passive Failover

The simplest GSLB pattern: a primary site handles all traffic, with a secondary site on standby.

Normal: app.example.com → Primary DC (New York)
Failure: app.example.com → Secondary DC (Chicago)

DNS responses always point to the primary unless health checks detect a failure, at which point they switch to the secondary. This is straightforward but wastes the secondary’s capacity during normal operation.

Active-Active with Geographic Routing

A more sophisticated pattern: multiple active sites, each serving users in their geographic region.

User in Europe:  app.example.com → Frankfurt DC
User in Asia:    app.example.com → Singapore DC
User in Americas: app.example.com → Virginia DC

If one site fails, its traffic is redistributed to the remaining sites. This provides both performance (geographic proximity) and resilience (automatic failover).
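A toy routing policy illustrating this pattern — the data centers, addresses, and preference orders are hypothetical:

```python
# Hypothetical data centers and their (documentation-range) addresses.
DATACENTERS = {
    "frankfurt": "192.0.2.1",
    "singapore": "192.0.2.2",
    "virginia":  "192.0.2.3",
}

# Assumed preference order per region: nearest site first, then fallbacks.
PREFERENCE = {
    "europe":   ["frankfurt", "virginia", "singapore"],
    "asia":     ["singapore", "frankfurt", "virginia"],
    "americas": ["virginia", "frankfurt", "singapore"],
}

def geo_answer(user_region: str, healthy: set[str]) -> str:
    """Return the address of the nearest healthy data center, walking the
    region's preference list when its primary site fails health checks."""
    for site in PREFERENCE[user_region]:
        if site in healthy:
            return DATACENTERS[site]
    raise RuntimeError("no healthy data center available")
```

When every site is healthy, each region gets its nearest data center; when Frankfurt goes down, European users transparently fail over to the next site on their list.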

Real-World GSLB Products

GSLB is offered by both hardware vendors and cloud providers:

  • F5 BIG-IP DNS (formerly GTM) — hardware/virtual appliance for on-premises GSLB
  • AWS Route 53 — managed DNS with failover, geolocation, latency, and weighted routing policies
  • Cloudflare Load Balancing — combines DNS load balancing with their CDN edge
  • NS1 — DNS platform with real-time traffic steering based on telemetry data

Limitations of DNS-Based Load Balancing

DNS load balancing is useful, but it’s important to understand what it can’t do:

No Per-Request Granularity

DNS operates at the resolution level, not the request level. Once a client resolves a name, it may reuse that IP for thousands of subsequent requests. You can’t route individual HTTP requests via DNS — that requires an application-level load balancer (like NGINX, HAProxy, or a cloud load balancer).

Client-Side Caching Is Unpredictable

Browsers, operating systems, and stub resolvers all maintain their own DNS caches with varying behaviors. Some respect TTLs. Some substitute their own lifetimes (Chrome historically cached host entries for a fixed 60 seconds, regardless of the record’s TTL). Some ignore TTLs entirely. You can’t control how quickly clients react to DNS changes.

No Session Affinity

DNS can’t ensure that subsequent requests from the same user go to the same server. If a user’s resolver re-queries DNS mid-session, they might get a different server. For stateful applications, this is a problem unless you use server-side session stores.

TTL Creates a Recovery Floor

You can’t fail over faster than your TTL. Even with aggressive 30-second TTLs, there’s a minimum failover window. In-path load balancers, by contrast, can detect failures and reroute traffic within seconds, because no client-side cache stands between them and the decision.

DNS Amplification of Failures

If your authoritative DNS servers go down, all your services become unreachable — even healthy ones. DNS is a single point of failure that doesn’t exist with direct IP-based load balancing.

When to Use DNS Load Balancing

DNS load balancing makes sense when:

  • You need geographic distribution across multiple data centers
  • You want a simple, infrastructure-level traffic split
  • You’re combining it with application-level load balancers at each site
  • You need cross-provider redundancy (e.g., failover between AWS and GCP)
  • Your failover requirements tolerate 30-60 second windows

It’s not a replacement for proper load balancers — it’s a complement. The typical architecture uses DNS to route traffic to the right data center, then an application load balancer within each data center distributes it across individual servers.

Key Takeaways

  • Round-robin DNS distributes DNS responses, not actual traffic — caching undermines even distribution
  • Weighted DNS allows proportional traffic distribution and is useful for gradual rollouts
  • Health-check failover removes unhealthy servers from DNS but is limited by TTL caching gaps
  • GSLB combines health checks, geo-routing, and performance data for enterprise-grade traffic management
  • DNS load balancing can’t provide per-request granularity, session affinity, or sub-minute failover
  • Best practice: use DNS for data center selection, application load balancers for server selection