DNS: What Every Engineer Should Know
How DNS actually works, the record types that matter, and security practices I implement in production environments.
DNS failures cause a surprising number of outages. I've seen production incidents where everything looked fine - servers running, load balancers healthy - but users couldn't connect because DNS was broken. Understanding DNS isn't optional for infrastructure work.
How DNS Resolution Works
When you type "google.com" in a browser, here's what actually happens:
- Local cache check - Your machine checks if it already knows the IP address
- Resolver query - If not cached, it asks a DNS resolver (like 8.8.8.8 or 1.1.1.1)
- Hierarchical lookup - The resolver queries root servers → TLD servers → authoritative servers
- Response cached - The answer is cached based on TTL for future requests
The caching is important. It's why DNS changes don't propagate instantly - old records stick around until their TTL expires.
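To make the TTL behavior concrete, here is a minimal sketch of a resolver-style cache in Python. The `TTLCache` class, its method names, and the injectable `clock` are illustrative assumptions, not how any particular resolver is implemented, and the host name and address are placeholders from the documentation ranges.

```python
import time


class TTLCache:
    """Toy DNS cache: each answer is served until its TTL runs out."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so expiry can be shown without sleeping.
        self._clock = clock
        self._entries = {}  # name -> (address, expires_at)

    def put(self, name, address, ttl_seconds):
        self._entries[name] = (address, self._clock() + ttl_seconds)

    def get(self, name):
        """Return the cached address, or None if it is missing or expired."""
        entry = self._entries.get(name)
        if entry is None:
            return None
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: the next lookup must re-query
            return None
        return address


# Simulated timeline: the answer is cached at t=0 with a 300 second TTL.
now = [0.0]
cache = TTLCache(clock=lambda: now[0])
cache.put("app.example.com", "192.0.2.10", ttl_seconds=300)

now[0] = 100.0
stale = cache.get("app.example.com")    # still the cached (possibly old) address

now[0] = 300.0
expired = cache.get("app.example.com")  # None, so the resolver has to ask again
```

If the record changed upstream at any point before t=300, clients behind this cache would keep getting `192.0.2.10` until the entry expired, which is the propagation delay described above.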
Record Types That Matter
A Records - Map domain to IPv4 address. The most common record type.
AAAA Records - Same thing for IPv6.
CNAME Records - Alias one domain to another. I use these for pointing "www" to the apex domain, or for CDN integrations.
MX Records - Where to send email. Priority values determine failover order: senders try the lowest value first and fall back to higher values if that server is unreachable.
TXT Records - Arbitrary text. Used for SPF, DKIM, domain verification. If you're setting up email or verifying domain ownership, you'll work with these.
NS Records - Which name servers are authoritative for a domain.
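As a small illustration of how MX priority values translate into a delivery order, the sketch below sorts records by preference. The `MXRecord` type, the `delivery_order` helper, and the host names are hypothetical and exist only for this example.

```python
from typing import NamedTuple


class MXRecord(NamedTuple):
    preference: int  # lower value = tried first
    exchange: str    # mail server host name


def delivery_order(records):
    """Return mail servers in the order a sender would normally try them."""
    return [r.exchange for r in sorted(records, key=lambda r: r.preference)]


# Hypothetical records for example.com
records = [
    MXRecord(20, "backup-mx.example.com"),
    MXRecord(10, "mx1.example.com"),
]
order = delivery_order(records)
```

Here `order` lists `mx1.example.com` before `backup-mx.example.com`. Real senders typically spread load across servers that share the same preference value, which this sketch does not model.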
Security Concerns
DNS has real security implications:
DNS Spoofing - Attackers return fake DNS responses, redirecting traffic to malicious servers. I've seen phishing attacks that relied on compromised DNS.
Cache Poisoning - Bad records injected into resolver caches affect everyone using that resolver until TTL expires.
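Amplification Attacks - Open resolvers can be abused for DDoS: an attacker sends small queries with the victim's spoofed source address, and the resolver sends much larger responses to the victim. Never run an open resolver.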
What I Do for Production DNS
Enable DNSSEC where possible. It cryptographically signs DNS records so that validating resolvers can detect spoofed or tampered responses.
Use reputable resolvers - Cloudflare (1.1.1.1), Google (8.8.8.8), or Quad9 (9.9.9.9). They validate DNSSEC, support encrypted DNS (DoH/DoT), and run on global anycast networks.
Monitor DNS resolution from multiple locations. I've caught regional DNS issues this way before users reported them.
Set appropriate TTLs - Lower TTLs (300-600 seconds) for services that might need quick failover. Higher TTLs for stable records to reduce query load.
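The multi-location monitoring practice above can be sketched roughly as follows. This is an assumption about one way to do it, not a description of my actual tooling: `query_with_dig` and `find_outliers` are hypothetical helper names, the first shells out to `dig` (so it needs `dig` installed and network access), and the sample answers are made up using documentation address ranges.

```python
import subprocess
from collections import Counter


def query_with_dig(resolver, name):
    """Ask one resolver for A records via `dig +short` (needs dig and network)."""
    result = subprocess.run(
        ["dig", f"@{resolver}", name, "A", "+short"],
        capture_output=True, text=True, timeout=10, check=True,
    )
    # +short prints one answer per line; CNAME targets may appear here too.
    return frozenset(result.stdout.split())


def find_outliers(answers_by_source):
    """Return sources whose answer set differs from the most common one."""
    if not answers_by_source:
        return []
    counts = Counter(answers_by_source.values())
    majority, _ = counts.most_common(1)[0]
    return sorted(s for s, a in answers_by_source.items() if a != majority)


# Offline demonstration with made-up answers keyed by resolver.
sample = {
    "1.1.1.1": frozenset({"192.0.2.10"}),
    "8.8.8.8": frozenset({"192.0.2.10"}),
    "9.9.9.9": frozenset({"203.0.113.7"}),
}
outliers = find_outliers(sample)
```

In a live check, the dictionary would be filled by calling `query_with_dig` for each resolver, ideally from probes in several regions, with the keys naming the probe. Services that use geo-based or round-robin DNS legitimately return different answers, so an outlier is a prompt to investigate rather than proof of a fault.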
Troubleshooting Commands
# Check what a domain resolves to
dig google.com
# Query a specific DNS server
dig @8.8.8.8 google.com
# See the full resolution path
dig +trace google.com
# Check specific record types
dig google.com MX
dig google.com TXT
Key Takeaways
- DNS failures can make healthy infrastructure unreachable - monitor it
- TTL values determine how long records are cached; plan changes accordingly
- CNAME records can't exist at the apex domain (use A/AAAA records, or a provider-specific ALIAS record)
- DNSSEC and encrypted DNS (DoH/DoT) are worth implementing
- When troubleshooting connectivity, always check DNS first
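The apex CNAME restriction in the takeaways follows from a more general rule: a CNAME cannot share a name with other record types, and the apex always carries SOA and NS records. A minimal check for that, assuming a zone represented as simple `(name, record_type)` pairs (a hypothetical format chosen for this sketch), might look like this:

```python
from collections import defaultdict


def cname_conflicts(records):
    """Return names where a CNAME sits alongside any other record type.

    `records` is an iterable of (name, record_type) pairs.
    """
    types_by_name = defaultdict(set)
    for name, rtype in records:
        types_by_name[name.lower().rstrip(".")].add(rtype.upper())
    return sorted(
        name for name, types in types_by_name.items()
        if "CNAME" in types and len(types) > 1
    )


# Hypothetical zone: the apex always carries SOA and NS records.
zone = [
    ("example.com", "SOA"),
    ("example.com", "NS"),
    ("example.com", "CNAME"),      # invalid: collides with SOA/NS
    ("www.example.com", "CNAME"),  # fine: nothing else at this name
]
conflicts = cname_conflicts(zone)
```

Only `example.com` is flagged. The sketch is deliberately simplified: DNSSEC-related records are allowed next to a CNAME, and it does not account for them.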
Written by Bar Tsveker
Senior CloudOps Engineer specializing in AWS, Terraform, and infrastructure automation.