Troubleshooting Workflow

Something's broken. Users are complaining. Don't panic. Follow a systematic approach.

The Troubleshooting Mindset

  1. Gather information before changing anything
  2. Form hypotheses based on evidence
  3. Test one change at a time
  4. Document what you find
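Gathering information before changing anything can be as simple as saving a snapshot of the current state, which also gives you something to document later. The sketch below is one possible approach, not a standard tool: the function name `capture_state` and the file layout are hypothetical, and it assumes a systemd host.

```shell
# capture_state SERVICE [DIR]
# Hypothetical helper: saves status, recent logs, and resource usage
# for SERVICE into DIR so the evidence survives a later restart.
capture_state() {
    local svc=$1
    local dir=${2:-"/tmp/${svc}-$(date +%Y%m%d-%H%M%S)"}
    mkdir -p "$dir" || return 1

    # Every capture is best-effort: a missing tool must not abort the rest.
    systemctl status "$svc" --no-pager > "$dir/status.txt" 2>&1 || true
    journalctl -u "$svc" -n 200 --no-pager > "$dir/journal.txt" 2>&1 || true
    df -h > "$dir/df.txt" 2>&1 || true
    ps aux > "$dir/ps.txt" 2>&1 || true
    uptime > "$dir/uptime.txt" 2>&1 || true

    # Print where the snapshot was written.
    echo "$dir"
}
```

Running `capture_state nginx` prints the directory it wrote to, so the files can be attached to your notes.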

Don't Just Restart

"Have you tried restarting it?" might fix the symptom but hides the cause. Investigate first, understand what happened, then fix properly.

Step 1: What's the Actual Problem?

Before diving in, clarify:

  • What's the expected behavior?
  • What's actually happening?
  • When did it start?
  • What changed recently?

Terminal
$ # Check recent changes
$ sudo grep -i 'install\|remove\|upgrade' /var/log/dpkg.log | tail -20
2025-01-14 09:30:00 install nginx:amd64 1.18.0-6ubuntu14
$
$ # Recent logins
$ last | head -10
john pts/0 192.168.1.50 Mon Jan 14 10:00 still logged in
$
$ # System boot time
$ uptime
10:30:45 up 5 days, 3:24, 1 user, load average: 0.52, 0.58, 0.59
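The package log only covers package changes; hand-edited configuration files are another common answer to "what changed recently?". As a rough sketch, the hypothetical helper `recent_changes` below lists recently modified files. It assumes GNU `find`, because `-printf` is a GNU extension.

```shell
# recent_changes DIR [DAYS]
# Hypothetical helper: lists regular files under DIR modified within
# the last DAYS days (default 2), newest first.
recent_changes() {
    local dir=$1
    local days=${2:-2}
    # -mtime -N matches files modified less than N*24 hours ago.
    # %T@ is the modification time in epoch seconds, %p is the path.
    find "$dir" -type f -mtime "-$days" -printf '%T@\t%p\n' 2>/dev/null |
        sort -rn | cut -f2-
}
```

For example, `recent_changes /etc 2` shows what was touched under /etc in the last two days; run it from a root shell if some directories are unreadable.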

Step 2: Check Service Status

Terminal
$ # Is the service running?
$ systemctl status nginx
● nginx.service - A high performance web server
     Active: failed (Result: exit-code) since...
$
$ # What happened?
$ journalctl -u nginx -n 50
(recent logs)
$
$ # Failed services
$ systemctl --failed
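When several units have failed, the status and log checks above can be chained. This is a minimal sketch: `failed_units` and `show_failed_logs` are hypothetical names, and the parsing assumes the unit name is the first column of `systemctl --failed --no-legend --plain` output.

```shell
# failed_units: reads `systemctl --failed --no-legend --plain` output on
# stdin and prints one unit name per line (the first column).
failed_units() {
    awk 'NF { print $1 }'
}

# show_failed_logs [LINES]: hypothetical wrapper that prints the last
# LINES journal entries (default 20) for every failed unit.
show_failed_logs() {
    local lines=${1:-20}
    local unit
    systemctl --failed --no-legend --plain | failed_units |
    while read -r unit; do
        echo "=== $unit ==="
        journalctl -u "$unit" -n "$lines" --no-pager
    done
}
```

Calling `show_failed_logs 30` prints a labelled log excerpt per failed unit, which is usually enough to decide where to look next.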

Step 3: Check Resources

CPU and Load

Terminal
$ uptime
10:30:45 up 5 days, load average: 15.52, 12.58, 8.59
$ # A load average that stays above the number of CPU cores suggests the system is overloaded
$
$ # What's using CPU?
$ top -bn1 | head -15
  PID USER      PR  NI    VIRT    RES  %CPU  %MEM
 1234 mysql     20   0    1.5g   500m  95.0  25.0
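Comparing load to core count is a small piece of arithmetic that is easy to script. The helper below is a sketch with a hypothetical name, `load_status`. It uses `awk` because plain shell tests cannot compare decimals.

```shell
# load_status LOAD CORES
# Prints "overloaded" when LOAD exceeds CORES, otherwise "ok".
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        # Adding 0 forces a numeric rather than string comparison.
        if (load + 0 > cores + 0) print "overloaded"; else print "ok"
    }'
}
```

On a live system the inputs can come from `/proc/loadavg` and `nproc`, for example `load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`. The source output above does not say how many cores the machine has; on an assumed 4-core machine, the 1-minute load of 15.52 would be reported as overloaded.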

Memory

Terminal
$ free -h
              total        used        free      shared  buff/cache   available
Mem:           7.7G        6.5G        200M        150M        1.0G        800M
$
$ # What's using memory?
$ ps aux --sort=-%mem | head -10
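In `free` output, "available" is usually the figure to watch rather than "free", since it estimates how much memory new applications can use without swapping. The sketch below computes it as a percentage from `/proc/meminfo`-style input. The function name `mem_available_pct` is hypothetical, and it assumes a kernel that reports `MemAvailable`.

```shell
# mem_available_pct: reads /proc/meminfo-formatted text on stdin and
# prints MemAvailable as a whole-number percentage of MemTotal.
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}
```

Typical use is `mem_available_pct < /proc/meminfo`. What counts as too low depends on the workload, so treat any threshold as a local decision.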

Disk Space

Terminal
$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda1        50G   48G  2.0G  96% /
$
$ # Find the largest top-level directories
$ sudo du -sh /* 2>/dev/null | sort -rh | head -10
35G /var
8G /home
3G /usr
$
$ # Large files in /var
$ sudo find /var -type f -size +100M -exec ls -lh {} \;
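Scanning `df` output by eye works for one server. For a quick automated check, something like the following sketch can flag nearly full filesystems. `full_filesystems` is a hypothetical name, the 90% default is an arbitrary assumption, and the parsing assumes `df -P` (POSIX format) and mount points without spaces.

```shell
# full_filesystems [THRESHOLD]
# Reads `df -P` output on stdin and prints "mountpoint use%" for every
# filesystem at or above THRESHOLD percent (default 90).
full_filesystems() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)          # strip the trailing percent sign
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}
```

Usage: `df -P | full_filesystems 90`. With the sample output above, the root filesystem at 96% would be listed.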

Disk I/O

Terminal
$ iostat -x 1 3
Device   %util   await     r/s     w/s
sda       99.5   250.0    50.0   200.0
$ # Sustained %util near 100% together with a high await suggests a disk bottleneck

Step 4: Check Logs

Terminal
$ # System messages
$ sudo tail -100 /var/log/syslog
$
$ # Authentication issues
$ sudo tail -50 /var/log/auth.log
$
$ # Application logs
$ sudo tail -100 /var/log/nginx/error.log
$
$ # Search for errors
$ sudo grep -i 'error\|fail\|crit' /var/log/syslog | tail -20
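Before reading logs line by line, a count of matching lines per file can show which log deserves attention first. The helper below is a sketch with a hypothetical name, `count_matches`. Its default pattern mirrors the grep above.

```shell
# count_matches FILE [PATTERN]
# Prints how many lines of FILE match PATTERN (case-insensitive extended
# regex; default: error|fail|crit). Prints 0 when nothing matches.
count_matches() {
    local file=$1
    local pattern=${2:-'error|fail|crit'}
    # grep -c still prints "0" when nothing matches but exits non-zero;
    # "|| true" keeps that from being treated as a failure.
    grep -ciE -- "$pattern" "$file" || true
}
```

For example, from a root shell: `for f in /var/log/syslog /var/log/auth.log; do printf '%s: %s\n' "$f" "$(count_matches "$f")"; done`. On systemd hosts, `journalctl -p err -b` gives a comparable view of error-level messages since the last boot.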

Step 5: Check Network

Terminal
$ # Is the network up?
$ ip addr
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP>...
    inet 192.168.1.100/24
$
$ # Can we reach the internet?
$ ping -c 3 8.8.8.8
64 bytes from 8.8.8.8: icmp_seq=1 ttl=117 time=10.5 ms
$
$ # DNS working?
$ dig google.com +short
142.250.185.78
$
$ # Is the service listening? (sudo is needed to see processes owned by other users)
$ sudo ss -tlpn | grep ':80 '
(should show nginx)
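A listening socket only proves the service is bound locally; a firewall can still block clients. Testing a real TCP connection, ideally from another machine, closes that gap. The sketch below uses a hypothetical function name, `port_open`, and assumes bash (for its `/dev/tcp` pseudo-device) and the coreutils `timeout` command are available.

```shell
# port_open HOST PORT [TIMEOUT]
# Returns 0 if a TCP connection to HOST:PORT succeeds within TIMEOUT
# seconds (default 3), non-zero otherwise.
port_open() {
    local host=$1
    local port=$2
    local wait=${3:-3}
    # Inside bash -c, $0 is the host and $1 is the port; opening the
    # /dev/tcp path makes bash attempt the connection.
    timeout "$wait" bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}
```

Usage: `port_open 192.168.1.100 80 && echo reachable || echo blocked`, using the address from the `ip addr` output above.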

Troubleshooting Cheat Sheet

Problem              Commands to Run
Service won't start  systemctl status svc, journalctl -u svc
High CPU             top, htop, ps aux --sort=-%cpu
High memory          free -h, ps aux --sort=-%mem
Disk full            df -h, du -sh /*, find / -size +100M
Can't connect        ping, ss -tlpn, iptables -L
DNS issues           dig domain, cat /etc/resolv.conf
Permission denied    ls -la, namei -l /path/to/file

Example: Web Server Down

# 1. Check if running
systemctl status nginx
# Result: failed

# 2. Why did it fail?
journalctl -u nginx -n 30
# Result: "bind() to 0.0.0.0:80 failed (98: Address already in use)"

# 3. What's using port 80? (sudo is needed to see the owning process)
sudo ss -tlpn | grep ':80 '
# Result: apache2 is listening on port 80

# 4. Fix: stop apache
sudo systemctl stop apache2
sudo systemctl disable apache2

# 5. Start nginx
sudo systemctl start nginx
sudo systemctl status nginx
# Result: running
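
Step 3 of this example, finding which process holds the port, can be scripted when you need to repeat it. The sketch below uses a hypothetical function name, `port_owner`, and assumes the `users:(("name",pid=...,fd=...))` field that `ss -p` prints.

```shell
# port_owner: reads `ss -tlpn` output on stdin and prints the name of
# the first process listed on each line that has a users:((...)) field.
port_owner() {
    sed -n 's/.*users:(("\([^"]*\)".*/\1/p'
}
```

Usage: `sudo ss -tlpn 'sport = :80' | port_owner | sort -u`. The printed process name is not always the same as the systemd unit name, so confirm with `systemctl status` before stopping anything.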

Knowledge Check

What's the first thing you should do when troubleshooting?

Key Takeaways

  • Gather info before making changes
  • Check service status and logs first
  • Monitor resources: CPU, memory, disk, network
  • Use journalctl for service logs
  • Test one change at a time
  • Document what you find and what you fixed

Next: setting up a server from scratch.