Troubleshooting Workflow

Something's broken. Users are complaining. Don't panic. Follow a systematic approach.

The Troubleshooting Mindset

Gather information before changing anything
Form hypotheses based on evidence
Test one change at a time
Document what you find
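
One way to gather information before changing anything, and to have something to document afterwards, is to save a service's status and recent logs to a file. This is a minimal sketch: the function name `capture_service_state` and the default output directory `/tmp` are assumptions for illustration, not part of any standard tool.

```shell
# Save a service's current status and recent journal entries to a
# timestamped file. The function name and /tmp default are hypothetical.
# Usage: capture_service_state SERVICE [OUTPUT_DIR]
capture_service_state() (
    svc="$1"
    dir="${2:-/tmp}"
    out="$dir/${svc}-$(date +%Y%m%d-%H%M%S).txt"
    {
        echo "== systemctl status $svc =="
        # status exits non-zero for a failed unit; that is expected here
        systemctl status "$svc" --no-pager || true
        echo "== journalctl -u $svc (last 200 lines) =="
        journalctl -u "$svc" -n 200 --no-pager || true
    } > "$out" 2>&1
    # Print the path so the caller can attach it to a ticket or notes
    echo "$out"
)
```

Running `capture_service_state nginx` before any restart keeps the evidence even if the restart clears the symptom.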

Don't Just Restart

"Have you tried restarting it?" might fix the symptom but hides the cause. Investigate first, understand what happened, then fix properly.

Step 1: What's the Actual Problem?

Before diving in, clarify:

What's the expected behavior?
What's actually happening?
When did it start?
What changed recently?

Terminal

$ # Check recent changes
$ sudo grep -i 'install\|remove\|upgrade' /var/log/dpkg.log | tail -20
2025-01-14 09:30:00 install nginx:amd64 1.18.0-6ubuntu14

$ # Recent logins
$ last | head -10
john pts/0 192.168.1.50 Mon Jan 14 10:00 still logged in

$ # System boot time
$ uptime
10:30:45 up 5 days, 3:24, 1 user, load average: 0.52, 0.58, 0.59
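
The `grep` above also matches status lines such as `status installed`. If you want only the actual install, upgrade, and remove actions since a given date, a small filter can help. This is a sketch that assumes the usual dpkg.log layout (date, time, action, package); `pkg_changes_since` is a hypothetical helper name.

```shell
# Print install/upgrade/remove actions on or after a date (YYYY-MM-DD).
# Reads dpkg.log-style lines on stdin. Helper name is hypothetical.
# Usage: sudo cat /var/log/dpkg.log | pkg_changes_since 2025-01-14
pkg_changes_since() {
    awk -v since="$1" '
        # ISO dates compare correctly as plain strings
        $1 >= since && ($3 == "install" || $3 == "upgrade" || $3 == "remove") {
            print $1, $2, $3, $4
        }'
}
```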

Step 2: Check Service Status

Terminal

$ # Is the service running?
$ systemctl status nginx
● nginx.service - A high performance web server
     Active: failed (Result: exit-code) since...

$ # What happened?
$ journalctl -u nginx -n 50
(recent logs)

$ # Failed services
$ systemctl --failed
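
When a handful of services matter, checking them in one pass can be quicker than running `systemctl status` on each. The sketch below relies on `systemctl is-active --quiet`, which reports state through its exit code; the function name `report_inactive` and the service names in the usage line are assumptions.

```shell
# Print the name of every listed service that is not currently active.
# Function name is hypothetical.
# Usage: report_inactive nginx mysql ssh
report_inactive() {
    for svc in "$@"; do
        # is-active exits 0 only when the unit is active
        if ! systemctl is-active --quiet "$svc"; then
            echo "$svc"
        fi
    done
}
```

Any name it prints is a candidate for `journalctl -u NAME`.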

Step 3: Check Resources

CPU and Load

Terminal

$ uptime
10:30:45 up 5 days, load average: 15.52, 12.58, 8.59

$ # A load average above the number of CPU cores (check with nproc) suggests overload

$ # What's using CPU?
$ top -bn1 | head -15
PID  USER  PR NI VIRT RES  %CPU %MEM
1234 mysql 20 0  1.5g 500m 95.0 25.0
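
The load-versus-cores rule of thumb can be turned into a quick check. This is a rough sketch with a hypothetical function name, `is_overloaded`. On Linux the load average also counts processes waiting on disk I/O, so a high value does not always mean the CPU is the bottleneck.

```shell
# Succeed (exit 0) when LOAD is greater than CORES.
# Function name is hypothetical.
# Usage: is_overloaded "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)" && echo overloaded
is_overloaded() {
    # awk handles the floating-point comparison that plain sh cannot
    awk -v load="$1" -v cores="$2" 'BEGIN { exit !(load + 0 > cores + 0) }'
}
```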

Memory

Terminal

$ free -h
      total  used  free  shared  buff/cache  available
Mem:  7.7G   6.5G  200M  150M    1.0G        800M

$ # What's using memory?
$ ps aux --sort=-%mem | head -10
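
In the `free` output, the `available` column is usually a better indicator than `free`, because cache can be reclaimed when applications need memory. A minimal sketch for turning that into a percentage, assuming the `MemTotal` and `MemAvailable` fields of `/proc/meminfo`; the function name `mem_available_pct` is hypothetical.

```shell
# Print available memory as a whole-number percentage of total memory.
# Reads /proc/meminfo-style lines on stdin. Function name is hypothetical.
# Usage: mem_available_pct < /proc/meminfo
mem_available_pct() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%d\n", avail * 100 / total }'
}
```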

Disk Space

Terminal

$ df -h
Filesystem  Size  Used  Avail  Use%  Mounted on
/dev/sda1   50G   48G   2.0G   96%   /

$ # Find the largest top-level directories
$ sudo du -sh /* 2>/dev/null | sort -rh | head -10
35G   /var
8.0G  /home
3.0G  /usr

$ # Large files in /var
$ sudo find /var -type f -size +100M -exec ls -lh {} \;
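
To spot nearly full filesystems without reading the whole `df` table, you can filter on the usage column. This sketch assumes the POSIX output format of `df -P` (capacity in the fifth column, mount point in the sixth) and mount points without spaces; `filesystems_over` is a hypothetical name.

```shell
# Print mount points whose usage is at or above a percentage threshold.
# Reads `df -P` output on stdin. Function name is hypothetical.
# Usage: df -P | filesystems_over 90
filesystems_over() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)            # "96%" -> "96"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}
```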

Disk I/O

Terminal

$ # iostat is part of the sysstat package
$ iostat -x 1 3
Device  %util  await  r/s   w/s
sda     99.5   250.0  50.0  200.0

$ # %util near 100 together with a high await suggests a disk bottleneck

Step 4: Check Logs

Terminal

$ # System messages
$ sudo tail -n 100 /var/log/syslog

$ # Authentication issues
$ sudo tail -n 50 /var/log/auth.log

$ # Application logs
$ sudo tail -n 100 /var/log/nginx/error.log

$ # Search for errors
$ sudo grep -i 'error\|fail\|crit' /var/log/syslog | tail -20
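
When the error search returns a lot of lines, counting them per program shows where to look first. This is a sketch that assumes the traditional syslog line layout (`Mon DD HH:MM:SS host program[pid]: message`); systems configured for ISO timestamps put the program name in a different field. The function name `count_errors_by_program` is hypothetical.

```shell
# Count error-like lines per program, most frequent first.
# Reads traditional-format syslog lines on stdin. Name is hypothetical.
# Usage: sudo cat /var/log/syslog | count_errors_by_program
count_errors_by_program() {
    grep -iE 'error|fail|crit' |
        awk '{
            prog = $5
            sub(/\[[0-9]+\]/, "", prog)   # drop the [pid] suffix
            sub(/:$/, "", prog)           # drop the trailing colon
            count[prog]++
        }
        END { for (p in count) print count[p], p }' |
        sort -rn
}
```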

Step 5: Check Network

Terminal

$ # Is the network up?
$ ip addr
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP>...
    inet 192.168.1.100/24

$ # Can we reach the internet?
$ ping -c 3 8.8.8.8
64 bytes from 8.8.8.8: icmp_seq=1 ttl=117 time=10.5 ms

$ # DNS working?
$ dig google.com +short
142.250.185.78

$ # Is the service listening? (sudo is needed to see other users' process names)
$ sudo ss -tlpn | grep :80
(should show nginx)
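
Note that `grep :80` also matches ports such as 8080. A stricter filter can compare the port number exactly. This sketch assumes the default column layout of `ss -tlpn` (local address in the fourth column, process in the sixth); `listeners_on_port` is a hypothetical name.

```shell
# Print the local address and owning process for sockets listening on
# exactly the given port. Reads `ss -tlpn` output on stdin.
# Function name is hypothetical.
# Usage: sudo ss -tlpn | listeners_on_port 80
listeners_on_port() {
    awk -v port="$1" '{
        # The port is whatever follows the last colon (works for [::]:80 too)
        n = split($4, parts, ":")
        if (parts[n] == port) print $4, $6
    }'
}
```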

Troubleshooting Cheat Sheet

Problem	Commands to Run
Service won't start	`systemctl status svc`, `journalctl -u svc`
High CPU	`top`, `htop`, `ps aux --sort=-%cpu`
High memory	`free -h`, `ps aux --sort=-%mem`
Disk full	`df -h`, `du -sh /*`, `find / -size +100M`
Can't connect	`ping`, `ss -tlpn`, `iptables -L`
DNS issues	`dig domain`, `cat /etc/resolv.conf`
Permission denied	`ls -la`, `namei -l /path/to/file`

Example: Web Server Down

# 1. Check if running
systemctl status nginx
# Result: failed

# 2. Why did it fail?
journalctl -u nginx -n 30
# Result: "bind() to 0.0.0.0:80 failed (98: Address already in use)"

# 3. What's using port 80?
sudo ss -tlpn | grep :80
# Result: apache2 is running

# 4. Fix: stop apache
sudo systemctl stop apache2
sudo systemctl disable apache2

# 5. Start nginx
sudo systemctl start nginx
sudo systemctl status nginx
# Result: running

Knowledge Check

What's the first thing you should do when troubleshooting?

Key Takeaways

Gather info before making changes
Check service status and logs first
Monitor resources: CPU, memory, disk, network
Use journalctl for service logs
Test one change at a time
Document what you find and what you fixed

Next: setting up a server from scratch.
