Troubleshooting Workflow
Something's broken. Users are complaining. Don't panic. Follow a systematic approach.
The Troubleshooting Mindset
- Gather information before changing anything
- Form hypotheses based on evidence
- Test one change at a time
- Document what you find
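One low-effort way to document findings as you go is a timestamped notes file. The sketch below is only an illustration: the `note` function and the `INCIDENT_LOG` variable are hypothetical names, not part of any standard tool.

```bash
# Hypothetical helper: append a timestamped note to an incident log.
# INCIDENT_LOG is an assumed variable name; it defaults to a file in the
# current directory.
INCIDENT_LOG="${INCIDENT_LOG:-./incident-notes.log}"

note() {
    # "$*" joins all arguments into a single line of text
    printf '%s  %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$INCIDENT_LOG"
}

# Usage:
#   note "nginx failed to start, port 80 already in use"
#   note "stopped apache2, nginx starts cleanly"
```

Because every entry carries a timestamp, the notes can later be lined up against log entries from the same period.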
Don't Just Restart
"Have you tried restarting it?" might fix the symptom but hides the cause. Investigate first, understand what happened, then fix properly.
Step 1: What's the Actual Problem?
Before diving in, clarify:
- What's the expected behavior?
- What's actually happening?
- When did it start?
- What changed recently?
Terminal
$ # Check recent package changes (match the action field, not "status installed" lines)
$ sudo grep -E ' (install|remove|upgrade) ' /var/log/dpkg.log | tail -20
2025-01-14 09:30:00 install nginx:amd64 <none> 1.18.0-6ubuntu14
$
$ # Recent logins
$ last | head -10
john pts/0 192.168.1.50 Mon Jan 14 10:00 still logged in
$
$ # How long has the system been up?
$ uptime
10:30:45 up 5 days, 3:24, 1 user, load average: 0.52, 0.58, 0.59
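Package logs only cover package changes. To answer "what changed recently?" for configuration files as well, a minimal sketch along these lines may help. The `recently_modified` function name is hypothetical, and `/etc` is just an assumed default directory.

```bash
# List regular files under a directory that were modified in the last N days.
# Arguments: directory (default /etc), number of days (default 1).
recently_modified() {
    # -mtime -N matches files modified less than N*24 hours ago
    find "${1:-/etc}" -type f -mtime "-${2:-1}" 2>/dev/null
}

# Usage (sudo may be needed to read every directory under /etc):
#   recently_modified /etc 2
```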
Step 2: Check Service Status
Terminal
$ # Is the service running?
$ systemctl status nginx
● nginx.service - A high performance web server
Active: failed (Result: exit-code) since...
$
$ # What happened?
$ journalctl -u nginx -n 50
(recent logs)
$
$ # Failed services
$ systemctl --failed
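The status check and the log lookup above can be combined into one step. This is a sketch, not a standard tool: `check_service` is a hypothetical name, and the choice of 20 log lines is arbitrary.

```bash
# Report whether a systemd unit is active; if not, show its latest log lines.
# Returns 0 when the unit is active, 1 otherwise.
check_service() {
    if systemctl is-active --quiet "$1"; then
        echo "$1: active"
    else
        echo "$1: not active, last 20 log lines follow"
        journalctl -u "$1" -n 20 --no-pager
        return 1
    fi
}

# Usage:
#   check_service nginx
```

The non-zero return value makes the function usable in scripts, for example `check_service nginx || echo "needs attention"`.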
Step 3: Check Resources
CPU and Load
Terminal
$ uptime
10:30:45 up 5 days, load average: 15.52, 12.58, 8.59
$ # Sustained load above the number of CPU cores (see nproc) suggests overload
$
$ # What's using CPU?
$ top -bn1 | head -15
PID USER PR NI VIRT RES %CPU %MEM
1234 mysql 20 0 1.5g 500m 95.0 25.0
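The "load versus cores" rule of thumb can be checked mechanically. The sketch below assumes a Linux system where `/proc/loadavg` and `nproc` are available. `load_status` is a hypothetical name, and comparing against exactly one times the core count is a simplification.

```bash
# Classify a load average against a core count.
# Prints "overloaded" if the load exceeds the core count, otherwise "ok".
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        # "+ 0" forces a numeric comparison
        if (load + 0 > cores + 0) print "overloaded"; else print "ok"
    }'
}

# Usage on Linux: the first field of /proc/loadavg is the 1-minute load,
# and nproc prints the number of available CPU cores.
#   load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

With the sample output above, a 1-minute load of 15.52 on a 4-core machine would be reported as overloaded.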
Memory
Terminal
$ free -h
total used free shared buff/cache available
Mem: 7.7G 6.5G 200M 150M 1.0G 800M
$
$ # What's using memory?
$ ps aux --sort=-%mem | head -10
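The `available` column is usually the number to watch, since `free` alone excludes reclaimable cache. A minimal sketch for turning it into a percentage, assuming Linux and the `MemTotal` and `MemAvailable` fields of `/proc/meminfo`. The function name `mem_available_pct` is hypothetical.

```bash
# Print the percentage of memory still available, rounded down.
# Reads a file in /proc/meminfo format (default: /proc/meminfo itself).
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }' "${1:-/proc/meminfo}"
}

# Usage:
#   mem_available_pct
```

In the sample `free -h` output above, 800M available out of 7.7G total works out to about 10%.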
Disk Space
Terminal
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/sda1 50G 48G 2.0G 96% /
$
$ # Find the largest top-level directories
$ sudo du -sh /* 2>/dev/null | sort -rh | head -10
/var 35G
/home 8G
/usr 3G
$
$ # Large files in /var
$ sudo find /var -type f -size +100M -exec ls -lh {} \;
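On a machine with many mounts, it can help to filter `df` down to the filesystems that are nearly full. The sketch below parses the POSIX output format of `df -P`. The `full_filesystems` name and the 90% default threshold are assumptions, and mount points containing spaces are not handled.

```bash
# Read `df -P` output on stdin and print mount points at or above a usage
# threshold. Argument: threshold percentage (default 90).
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5            # capacity column, e.g. "96%"
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6 " " $5
    }'
}

# Usage:
#   df -P | full_filesystems 90
```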
Disk I/O
Terminal
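$ # Extended stats, 3 samples 1 second apart (iostat is part of the sysstat package)
$ iostat -x 1 3
Device %util await r/s w/s
sda 99.5 250.0 50.0 200.0
$ # %util near 100 together with a high await suggests the disk is the bottleneck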
Step 4: Check Logs
Terminal
$ # System messages
$ sudo tail -100 /var/log/syslog
$
$ # Authentication issues
$ sudo tail -50 /var/log/auth.log
$
$ # Application logs
$ sudo tail -100 /var/log/nginx/error.log
$
$ # Search for errors
$ sudo grep -i 'error\|fail\|crit' /var/log/syslog | tail -20
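When the same error repeats hundreds of times, a frequency count is often more useful than the last 20 lines. The sketch below assumes the traditional syslog prefix of month, day, time and hostname (four fields); logs with ISO-style timestamps would need a different field count. `top_errors` is a hypothetical name.

```bash
# Print the N most frequent error-like messages in a syslog-style file.
# Arguments: file, number of messages to show (default 5).
# Assumes the traditional "Mon DD HH:MM:SS host" prefix (four fields).
top_errors() {
    grep -iE 'error|fail|crit' "$1" \
        | awk '{ $1 = $2 = $3 = $4 = ""; sub(/^ +/, ""); print }' \
        | sed -E 's/\[[0-9]+\]//' \
        | sort | uniq -c | sort -rn | head -n "${2:-5}"
}

# Usage:
#   sudo sh -c '. ./top_errors.sh; top_errors /var/log/syslog 10'
```

Stripping the timestamp, hostname and PID lets identical messages from different processes collapse into a single counted line. The `top_errors.sh` file name in the usage comment is only a placeholder for wherever the function is saved.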
Step 5: Check Network
Terminal
$ # Is the network up?
$ ip addr
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP>...
inet 192.168.1.100/24
$
$ # Can we reach the internet?
$ ping -c 3 8.8.8.8
64 bytes from 8.8.8.8: icmp_seq=1 ttl=117 time=10.5 ms
$
$ # DNS working?
$ dig google.com +short
142.250.185.78
$
$ # Is the service listening? (sudo is needed to see processes owned by other users)
$ sudo ss -tlpn | grep :80
(should show nginx)
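One caveat with `grep :80` is that it also matches ports such as 8080 or 8000. A minimal sketch of an exact-port filter over `ss -tlpn` output follows. The `port_listeners` name is hypothetical, and the last column only contains process details when `ss` is run with sufficient privileges.

```bash
# Read `ss -tlpn` output on stdin and print listeners bound to exactly the
# given port, as "local-address:port last-column".
port_listeners() {
    # $4 is the Local Address:Port column; anchor the port at the end of it
    awk -v port="$1" '$1 == "LISTEN" && $4 ~ (":" port "$") { print $4, $NF }'
}

# Usage:
#   sudo ss -tlpn | port_listeners 80
```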
Troubleshooting Cheat Sheet
| Problem | Commands to Run |
|---|---|
| Service won't start | systemctl status svc, journalctl -u svc |
| High CPU | top, htop, ps aux --sort=-%cpu |
| High memory | free -h, ps aux --sort=-%mem |
| Disk full | df -h, du -sh /*, find / -size +100M |
| Can't connect | ping, ss -tlpn, iptables -L |
| DNS issues | dig domain, cat /etc/resolv.conf |
| Permission denied | ls -la, namei -l /path/to/file |
Example: Web Server Down
# 1. Check if running
systemctl status nginx
# Result: failed
# 2. Why did it fail?
journalctl -u nginx -n 30
# Result: "bind() to 0.0.0.0:80 failed (98: Address already in use)"
# 3. What's using port 80?
sudo ss -tlpn | grep :80
# Result: apache2 is running
# 4. Fix: stop apache
sudo systemctl stop apache2
sudo systemctl disable apache2
# 5. Start nginx
sudo systemctl start nginx
sudo systemctl status nginx
# Result: running
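Steps 2 and 3 of this example can be chained: pull the port out of the bind error, then look up what holds it. The sketch below assumes the error text has the form quoted in step 2. `bind_failed_port` is a hypothetical name.

```bash
# Extract the port number from an nginx "bind() to ADDR:PORT failed" log line
# read on stdin. Prints nothing for lines that do not match.
bind_failed_port() {
    sed -nE 's/.*bind\(\) to .*:([0-9]+) failed.*/\1/p'
}

# Usage: find the port nginx could not bind, then see what is holding it.
#   port="$(journalctl -u nginx -n 30 --no-pager | bind_failed_port | head -n 1)"
#   sudo ss -tlpn | grep ":$port "
```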
Knowledge Check
What's the first thing you should do when troubleshooting?
Key Takeaways
- Gather info before making changes
- Check service status and logs first
- Monitor resources: CPU, memory, disk, network
- Use journalctl for service logs
- Test one change at a time
- Document what you find and what you fixed
Next: setting up a server from scratch.