DNS: What Every Engineer Should Know

DNS failures cause a surprising number of outages. I've seen production incidents where everything looked fine - servers running, load balancers healthy - but users couldn't connect because DNS was broken. Understanding DNS isn't optional for infrastructure work.

How DNS Resolution Works

When you type "google.com" in a browser, here's what actually happens:

Local cache check - Your machine checks if it already knows the IP address
Resolver query - If not cached, it asks a DNS resolver (like 8.8.8.8 or 1.1.1.1)
Hierarchical lookup - The resolver queries root servers → TLD servers → authoritative servers
Response cached - The answer is cached based on TTL for future requests
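
To make the hierarchical lookup concrete, here is a toy model of a resolver following referrals from the root down. It is only a sketch: the SERVERS table, the server labels, and the address are invented for illustration, and nothing touches the network.

```python
# Toy model of iterative resolution: each "server" either answers or
# refers the resolver to a server closer to the name.
# Every server label and address here is made up for illustration.
SERVERS = {
    "root": {"referrals": {"com.": "tld-com"}},
    "tld-com": {"referrals": {"example.com.": "ns-example"}},
    "ns-example": {"answers": {"www.example.com.": "192.0.2.10"}},
}


def resolve(name, start="root"):
    """Follow referrals from the root until a server answers.

    Returns (ip, path), where path lists the servers queried in order.
    """
    server = start
    path = []
    while True:
        path.append(server)
        zone = SERVERS[server]
        # An authoritative server answers directly.
        if name in zone.get("answers", {}):
            return zone["answers"][name], path
        # Otherwise look for a delegation that covers the name.
        next_server = None
        for suffix, target in zone.get("referrals", {}).items():
            if name == suffix or name.endswith("." + suffix):
                next_server = target
                break
        if next_server is None:
            raise LookupError(f"{name} not found (stopped at {server})")
        server = next_server


ip, path = resolve("www.example.com.")
print(ip, "via", " -> ".join(path))
```

A real resolver also caches each referral along the way, so most lookups never start from the root.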

The caching is important. It's why DNS changes don't propagate instantly - old records stick around until their TTL expires.
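
A minimal sketch of that behavior, assuming a simplified cache keyed by name only. The TTLCache class and the injected clock are my own illustration, not how any particular resolver is implemented.

```python
import time


class TTLCache:
    """Minimal resolver-style cache: entries expire after their TTL."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock    # injectable so the example needs no sleeping
        self._entries = {}     # name -> (value, expiry time)

    def put(self, name, value, ttl):
        self._entries[name] = (value, self._clock() + ttl)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]   # expired: force a fresh lookup
            return None
        return value


# Simulated clock so the timeline is explicit.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])

cache.put("app.example.com", "192.0.2.10", ttl=300)  # old record gets cached
now[0] = 120   # the record is changed at the authoritative server here
print(cache.get("app.example.com"))  # still the old address
now[0] = 300   # the TTL has elapsed
print(cache.get("app.example.com"))  # None: the next query fetches the new record
```

Between seconds 120 and 300 the cache keeps handing out the old address even though the authoritative record has already changed. That window is what people mean by "propagation delay".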

Record Types That Matter

A Records - Map domain to IPv4 address. The most common record type.

AAAA Records - Same thing for IPv6.

CNAME Records - Alias one name to another. I use these for pointing "www" to the apex domain, or for CDN integrations. A name with a CNAME can't hold other record types, so a CNAME can't live at the apex itself.
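
One rule trips people up: a name that has a CNAME can't carry other record types, and the apex always carries NS and SOA records. Here is a rough check for that. The (name, type, value) tuple layout is a representation I made up for the example, and the check ignores DNSSEC record types, which are allowed alongside a CNAME.

```python
from collections import defaultdict


def find_cname_conflicts(records):
    """Return sorted names that have a CNAME alongside another record type.

    `records` is a list of (name, type, value) tuples, a hypothetical
    layout chosen for this example.
    """
    types_by_name = defaultdict(set)
    for name, rtype, _value in records:
        types_by_name[name].add(rtype)
    return sorted(
        name
        for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


zone = [
    ("example.com.", "SOA", "(soa data)"),
    ("example.com.", "NS", "ns1.example.com."),
    ("example.com.", "CNAME", "cdn.example.net."),  # invalid: apex has SOA/NS
    ("www.example.com.", "CNAME", "example.com."),  # fine: CNAME stands alone
]
print(find_cname_conflicts(zone))
```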

MX Records - Where to send email. Each record carries a preference value; senders try the lowest value first and fall back to higher values if that server is unreachable.
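
A small sketch of that ordering, using made-up hostnames. Real mail servers also spread load across hosts that share the same preference value, which this ignores.

```python
def delivery_order(mx_records):
    """Order (preference, host) pairs the way a sender would try them.

    Lower preference values come first. The sort is stable, so hosts with
    equal preference keep their input order here.
    """
    return [host for _pref, host in sorted(mx_records, key=lambda r: r[0])]


# Hypothetical MX set for illustration.
mx = [
    (20, "backup.example.com."),
    (10, "mail.example.com."),
    (30, "last-resort.example.com."),
]
print(delivery_order(mx))
```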

TXT Records - Arbitrary text. Used for SPF, DKIM, domain verification. If you're setting up email or verifying domain ownership, you'll work with these.
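
As an example of what lives in a TXT record, here is a rough parser for an SPF value. It only illustrates the layout (a version tag followed by qualified mechanisms), skips modifiers such as redirect=, and is not a validator. The sample record is hypothetical.

```python
QUALIFIERS = {"+": "pass", "-": "fail", "~": "softfail", "?": "neutral"}


def parse_spf(txt):
    """Split an SPF TXT value into (result, mechanism) pairs.

    Returns None if the string does not start with the v=spf1 tag.
    """
    terms = txt.split()
    if not terms or terms[0].lower() != "v=spf1":
        return None
    parsed = []
    for term in terms[1:]:
        if "=" in term:
            continue  # modifier (e.g. redirect=), not handled in this sketch
        if term[0] in QUALIFIERS:
            qualifier, mechanism = term[0], term[1:]
        else:
            qualifier, mechanism = "+", term  # "+" is the default qualifier
        parsed.append((QUALIFIERS[qualifier], mechanism))
    return parsed


record = "v=spf1 ip4:192.0.2.0/24 include:_spf.example.net -all"
for result, mechanism in parse_spf(record):
    print(f"{mechanism}: {result}")
```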

NS Records - Which name servers are authoritative for a domain.

Security Concerns

DNS has real security implications:

DNS Spoofing - Attackers return fake DNS responses, redirecting traffic to malicious servers. I've seen phishing attacks that relied on compromised DNS.

Cache Poisoning - Bad records injected into a resolver's cache affect everyone using that resolver until the TTL expires or the cache is flushed.

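Amplification Attacks - Open resolvers can be abused for DDoS: an attacker sends small queries with the victim's address spoofed as the source, and the resolver sends the much larger responses to the victim. Never run an open resolver.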

What I Do for Production DNS

Enable DNSSEC where possible. It cryptographically signs DNS records so that validating resolvers can reject forged or tampered responses. It does not encrypt queries.

Use reputable resolvers - Cloudflare (1.1.1.1), Google (8.8.8.8), or Quad9 (9.9.9.9). They validate DNSSEC and run on global anycast networks.

Monitor DNS resolution from multiple locations. I've caught regional DNS issues this way before users reported them.
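
The comparison step of that kind of monitoring can be sketched as below. How the answers get collected is left out; the location labels and addresses are hypothetical, and the function names are mine.

```python
from collections import defaultdict


def group_by_answer(results):
    """Group vantage points by the answer they saw.

    `results` maps a location label to the list of addresses resolved
    there, or None if resolution failed at that location.
    """
    groups = defaultdict(list)
    for location, addresses in results.items():
        key = None if addresses is None else tuple(sorted(addresses))
        groups[key].append(location)
    return {answer: sorted(locations) for answer, locations in groups.items()}


def is_consistent(results):
    """True when every location saw the same outcome."""
    return len(group_by_answer(results)) <= 1


# Hypothetical observations from three probes.
observed = {
    "us-east": ["192.0.2.10"],
    "eu-west": ["192.0.2.10"],
    "ap-south": None,  # resolution failed here
}
if not is_consistent(observed):
    for answer, locations in group_by_answer(observed).items():
        print(answer, "seen from", ", ".join(locations))
```

If you use geo-based routing, different answers per region can be legitimate, so an alert on disagreement needs to account for that.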

Set appropriate TTLs - Lower TTLs (300-600 seconds) for services that might need quick failover. Higher TTLs for stable records to reduce query load.
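
The tradeoff comes down to simple arithmetic. This sketch assumes resolvers honor the TTL exactly (some clamp it) and uses an illustrative 60-second detection time; both function names are mine.

```python
SECONDS_PER_DAY = 86400


def worst_case_failover_seconds(ttl, detection_seconds):
    """Longest a client may keep using the old address after a failure.

    Worst case: a resolver cached the record just before the change, so it
    serves the old answer for a full TTL after you detect and fix it.
    """
    return detection_seconds + ttl


def max_refreshes_per_day(ttl):
    """Upper bound on how often one caching resolver re-queries a record."""
    return SECONDS_PER_DAY // ttl


for ttl in (300, 3600, 86400):
    print(
        f"TTL {ttl}s: failover within {worst_case_failover_seconds(ttl, 60)}s, "
        f"at most {max_refreshes_per_day(ttl)} refreshes/day per resolver"
    )
```

With a 300-second TTL, failover completes within about six minutes but each resolver may re-query up to 288 times a day. With a one-day TTL that drops to a single refresh, at the cost of a day-long worst case.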

Troubleshooting Commands

# Check what a domain resolves to
dig google.com

# Query a specific DNS server
dig @8.8.8.8 google.com

# See the full resolution path
dig +trace google.com

# Check specific record types
dig google.com MX
dig google.com TXT

Key Takeaways

DNS failures can make healthy infrastructure unreachable - monitor it
TTL values determine how long records are cached; plan changes accordingly
CNAME records can't exist at the apex domain (use ALIAS or A records)
DNSSEC and encrypted DNS (DoH/DoT) are worth implementing
When troubleshooting connectivity, always check DNS first