
Monitoring, Logs & Troubleshooting

top · htop · iotop · nload  ·  journalctl · /var/log · dmesg  ·  systemd-analyze

This is where real sysadmins earn their keep. When something is slow, broken, or on fire, you need to diagnose — not guess. This guide covers the essential monitoring tools, log locations, and structured triage process used to find and fix problems on production Linux systems.

Created: 2026‑03‑17  ·  Written by Nicole M. Taylor
⚠️
Software Changes Over Time

This guide was written in March 2026 for Ubuntu 24.04 LTS. Tool names, output formats, and package availability may differ on other distributions or future releases. Always verify commands against your system's installed version.

01

top — Built-in Process Monitor

top is the quickest way to see what is consuming CPU and RAM right now. It's available on every Linux system with no install required — your first tool in any triage session.

Launch top

BASH
top

Essential top key bindings (while running)

Key       Action
q         Quit
M         Sort by memory usage
P         Sort by CPU usage (default)
k         Kill a process by PID
1         Show individual CPU cores
u         Filter by username
d         Change refresh interval
Space     Force immediate refresh

Run top non-interactively (for scripting)

BASH
# Single snapshot — 1 iteration, output to stdout
top -b -n 1

# Show roughly the top 20 processes (the header uses the first ~7 lines)
top -b -n 1 | head -30
ℹ️
Reading the top header

The first five lines show uptime and load averages, total task counts, the CPU breakdown (us=user, sy=kernel, id=idle, wa=I/O wait), and memory/swap usage. The load averages (1m / 5m / 15m) are your first sign of sustained pressure: values consistently above your CPU core count mean processes are queuing for CPU time.
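The load-versus-cores rule can be scripted for quick checks; a minimal sketch (the threshold logic and the OK/WARNING messages are our own, not output of top):

```shell
# Compare the 1-minute load average against the number of CPU cores.
cores=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
# awk does the floating-point comparison the shell cannot do natively
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
  echo "WARNING: 1m load ${load1} exceeds ${cores} cores"
else
  echo "OK: 1m load ${load1} within ${cores} cores"
fi
```

A single spike above the core count is normal; it is the 5 and 15 minute values staying high that indicate sustained pressure.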


02

htop — The Better top

htop is an interactive, color-coded process viewer that shows per-core CPU bars, memory meters, and process trees. It's the tool most sysadmins reach for first when investigating a live system.

Install htop

BASH
sudo apt install htop -y

Launch htop

BASH
htop

# Run as a specific user to filter immediately
htop -u www-data

htop key bindings

Key       Action
F2        Setup / configuration
F3        Search processes by name
F4        Filter (show only matching)
F5        Tree view (parent/child relationships)
F6        Sort by column
F9        Kill selected process
F10 / q   Quit
Space     Tag a process
u         Filter by user

What to look for

  • CPU bars pegged at 100% on one or more cores — identify the PID and process name
  • Memory bar nearly full with swap growing — potential OOM situation
  • Load average climbing well above core count — system is being overwhelmed
  • Zombie processes (Z state) — parent not reaping children, may indicate a bug
Tree view (F5) is invaluable

Pressing F5 in htop shows processes in a parent-child tree. This immediately reveals which web server worker, PHP-FPM pool, or spawned script is consuming resources — rather than a flat list that requires cross-referencing PIDs.
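Zombies can also be counted without htop by scanning /proc directly; a small sketch (note the parsing detail: in /proc/&lt;pid&gt;/stat the process name is parenthesized and may contain spaces, so the state field is read after the closing parenthesis):

```shell
# Count processes in state Z (zombie), the same state htop displays
zombies=0
for st in /proc/[0-9]*/stat; do
  # The field after the ")" closing the comm field is the process state
  state=$(sed 's/.*) //' "$st" 2>/dev/null | awk '{print $1}')
  if [ "$state" = "Z" ]; then
    zombies=$((zombies + 1))
  fi
done
echo "zombie processes: $zombies"
```

A persistently nonzero count points at a parent process that is not reaping its children.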


03

iotop — Disk I/O by Process

When your server is slow but CPU and RAM look fine, the culprit is often disk I/O. iotop shows you exactly which process is hammering storage — essential for diagnosing slow MySQL queries, runaway log writers, or backup jobs competing with live traffic.

Install iotop

BASH
sudo apt install iotop -y

Run iotop

BASH
# Interactive mode — requires root
sudo iotop

# Show only processes with active I/O (quieter output)
sudo iotop -o

# Batch mode — useful for logging
sudo iotop -b -n 3

Column reference

Column        Meaning
DISK READ     Current read bandwidth for this process
DISK WRITE    Current write bandwidth for this process
SWAPIN        % of time spent waiting on swap reads; high values indicate memory pressure
IO>           % of time this process spent waiting on I/O
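These columns are derived from per-process counters in /proc/&lt;pid&gt;/io. You can inspect your own shell's counters without root (a quick illustration of the data source, not a substitute for iotop):

```shell
# read_bytes / write_bytes are the storage-level counters iotop aggregates.
# Reading another user's /proc/<pid>/io requires root; your own shell's is fine.
grep -E '^(read_bytes|write_bytes):' /proc/$$/io
```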
⚠️
iotop requires kernel I/O accounting

On some minimal Ubuntu installs iotop may report "CONFIG_TASK_IO_ACCOUNTING not set in kernel". This is rare on standard Ubuntu 24.04 LTS but can happen on custom kernels. If so, use iostat (from sysstat) as an alternative for per-device I/O stats.

Alternative: iostat for device-level I/O

BASH
sudo apt install sysstat -y

# Show device I/O stats every 2 seconds
iostat -x 2

# Focus on a specific device
iostat -x 2 /dev/sda

04

nload — Network Bandwidth Monitor

nload shows real-time incoming and outgoing bandwidth per network interface with an ASCII graph. It's the fastest way to see if your server is saturating its network link or if traffic looks abnormal.

Install nload

BASH
sudo apt install nload -y

Run nload

BASH
# Monitor all interfaces (arrow keys to switch)
nload

# Monitor a specific interface
nload eth0
nload enp3s0

nload navigation

Key     Action
← →     Switch between network interfaces
F2      Options screen
q       Quit

Alternative: nethogs — bandwidth by process

nload shows interface-level bandwidth. If you need to know which process is responsible for that traffic, use nethogs:

BASH
sudo apt install nethogs -y
sudo nethogs eth0

Quick interface stats without extra tools

BASH
# Snapshot of bytes transferred on all interfaces
cat /proc/net/dev

# ip command interface stats
ip -s link show eth0
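The /proc/net/dev counters can be totalled with a short awk pipeline; a sketch (per the standard /proc/net/dev layout, the first field after the interface name is RX bytes and the ninth is TX bytes):

```shell
# Sum received/transmitted bytes across all interfaces.
# Split each line on ':' so f[1] is RX bytes and f[9] is TX bytes.
totals=$(awk -F: 'NR > 2 { split($2, f, " "); rx += f[1]; tx += f[9] }
                  END { printf "RX %d bytes, TX %d bytes", rx, tx }' /proc/net/dev)
echo "$totals"
```

These are cumulative counters since boot; sample twice and subtract to get a rate.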

05

journalctl — Reading System Logs

On modern Ubuntu systems, journalctl is the primary interface to the systemd journal — a structured log that captures output from every service, the kernel, and the boot process. Learning to query it efficiently is one of the most valuable sysadmin skills.

Most-used journalctl commands

BASH
# Show all logs (oldest first) — pipe to less
journalctl | less

# Show logs for a specific service
journalctl -u nginx
journalctl -u apache2
journalctl -u mysql
journalctl -u php8.3-fpm

# Follow logs in real time (like tail -f)
journalctl -u nginx -f

# Show logs since the last boot only
journalctl -b

# Show logs from a previous boot (-1 = last, -2 = two boots ago)
journalctl -b -1

# Filter by time range
journalctl --since "2026-03-17 14:00:00"
journalctl --since "1 hour ago"
journalctl --since "30 min ago" --until "now"

# Show only errors and above (emerg, alert, crit, err)
journalctl -p err

# Show kernel messages only
journalctl -k

# Show the last 50 lines
journalctl -n 50

# Show logs in reverse (newest first)
journalctl -r

Combining filters

BASH
# Errors from nginx in the last 2 hours
journalctl -u nginx -p err --since "2 hours ago"

# All errors since last boot, newest first
journalctl -b -p err -r

Disk usage and maintenance

BASH
# Check how much disk the journal is using
journalctl --disk-usage

# Keep only the last 2 weeks of logs
sudo journalctl --vacuum-time=2weeks

# Keep journal under 500 MB
sudo journalctl --vacuum-size=500M
ℹ️
Persistent journal across reboots

On Ubuntu 24.04, check whether /var/log/journal exists. If it does not, the journal lives in /run/log/journal (volatile, lost on reboot). To make it persistent across reboots: sudo mkdir -p /var/log/journal && sudo systemctl restart systemd-journald. Persistent logs are then stored in /var/log/journal.
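Alternatively, the storage mode can be set explicitly in journald's configuration; an illustrative excerpt (the 500M cap is just an example value):

```ini
# /etc/systemd/journald.conf (excerpt)
[Journal]
Storage=persistent
SystemMaxUse=500M
```

Apply with sudo systemctl restart systemd-journald.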


06

/var/log — Log File Structure

Not everything logs to the systemd journal. Many services write directly to files in /var/log. Knowing which log belongs to which service is fundamental to fast troubleshooting.

Key log locations

Log File / Directory            What It Contains
/var/log/syslog                 General system messages (kernel, daemons, services)
/var/log/auth.log               Authentication: SSH logins, sudo, PAM, failed login attempts
/var/log/kern.log               Kernel messages only
/var/log/dmesg                  Boot-time kernel ring buffer (hardware detection, driver messages)
/var/log/dpkg.log               Package installs, upgrades, and removals
/var/log/apt/history.log        High-level apt command history
/var/log/apache2/access.log     Every HTTP request served by Apache
/var/log/apache2/error.log      Apache errors, PHP errors via mod_php
/var/log/nginx/access.log       Every HTTP request served by Nginx
/var/log/nginx/error.log        Nginx errors and upstream failures
/var/log/mysql/error.log        MySQL/MariaDB startup, shutdown, and errors
/var/log/fail2ban.log           Fail2Ban bans, unbans, and jail activity
/var/log/ufw.log                UFW firewall rule matches (if logging enabled)
/var/log/php*.log               PHP error log (path set in php.ini)
Useful commands for working with log files

BASH
# Follow a log file in real time
sudo tail -f /var/log/syslog
sudo tail -f /var/log/apache2/error.log

# Show last 100 lines
sudo tail -n 100 /var/log/auth.log

# Search for a pattern in a log
sudo grep "Failed password" /var/log/auth.log
sudo grep "error" /var/log/nginx/error.log

# Count occurrences of a pattern
sudo grep -c "Failed password" /var/log/auth.log

# Search across all logs for a string
sudo grep -r "out of memory" /var/log/

# Check when a log was last modified
ls -lh /var/log/syslog
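The "Failed password" greps combine into a per-IP summary of SSH brute-force attempts; a sketch run against fabricated sample lines (point the pipeline at /var/log/auth.log on a real system):

```shell
# Fabricated auth.log sample lines for illustration
cat > /tmp/auth.sample <<'EOF'
Mar 17 14:01:02 server sshd[999]: Failed password for root from 203.0.113.5 port 52311 ssh2
Mar 17 14:01:09 server sshd[999]: Failed password for invalid user admin from 203.0.113.5 port 52388 ssh2
Mar 17 14:02:44 server sshd[999]: Failed password for root from 198.51.100.7 port 40100 ssh2
EOF

# Extract the IP after "from" and count occurrences per address
grep 'Failed password' /tmp/auth.sample \
  | grep -oE 'from [0-9]+(\.[0-9]+){3}' \
  | awk '{print $2}' \
  | sort | uniq -c | sort -rn
```

Each remote address is printed with its failure count, highest first.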

Log rotation

Ubuntu uses logrotate to compress and rotate log files automatically, preventing /var/log from filling your disk. Rotated logs have extensions like .1, .2.gz.

BASH
# Check logrotate configuration
cat /etc/logrotate.conf
ls /etc/logrotate.d/

# Manually trigger rotation (useful for testing)
sudo logrotate --force /etc/logrotate.conf

# Check disk usage of /var/log
sudo du -sh /var/log/*  | sort -h | tail -20
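A drop-in for a custom application log might look like the following (the /var/log/myapp path and the weekly/keep-4 policy are purely illustrative):

```
# /etc/logrotate.d/myapp (hypothetical)
/var/log/myapp/*.log {
    weekly
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
}
```

Dry-run it with sudo logrotate --debug /etc/logrotate.d/myapp before trusting it.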

07

dmesg — Kernel Messages

dmesg prints the kernel ring buffer — hardware detection at boot, driver messages, disk errors, OOM (out-of-memory) kills, USB events, and network interface state changes. It's essential for diagnosing hardware problems and kernel-level errors that never appear in service logs.

Basic dmesg usage

BASH
# Print full kernel ring buffer
dmesg

# Human-readable timestamps (requires root on some systems)
sudo dmesg -T

# Follow new messages in real time
sudo dmesg -w

# Show only errors and above
sudo dmesg --level=err,crit,alert,emerg

# Show only warnings and above
sudo dmesg --level=warn,err,crit,alert,emerg

Filtering dmesg output

BASH
# Look for disk/storage errors
sudo dmesg | grep -i "error\|fail\|fault" | tail -30

# Look for OOM killer activity
sudo dmesg | grep -i "oom\|out of memory\|killed"

# Look for network interface events
sudo dmesg | grep -i "eth\|enp\|link"

# Look for USB device events
sudo dmesg | grep -i "usb"

# Look for hardware errors
sudo dmesg | grep -iE "mce|hardware error|corrected"

Common dmesg findings and what they mean

Message Pattern                   Likely Cause
oom-killer: ...killed process     System ran out of RAM; a process was killed to free memory
EXT4-fs error                     Filesystem corruption; run fsck on the affected device
ata... failed command             Disk I/O error; check SMART status with smartctl
NVRM: GPU... error                NVIDIA driver error; check GPU temperature and driver version
eth0: link down / link up         Network cable event or switch port issue
segfault at...                    Application crash with memory fault; usually a software bug
MCE: ... HARDWARE ERROR           Machine Check Exception; potential RAM or CPU hardware fault

08

systemd-analyze — Boot Time Profiling

systemd-analyze profiles your system boot — showing total boot time, which services are slowest, and generating visual timelines. It's the right tool when a server is taking longer than expected to come online after a reboot.

Total boot time

BASH
# Show total time broken down: firmware + loader + kernel + userspace
systemd-analyze

# Example output:
Startup finished in 1.832s (kernel) + 8.471s (userspace) = 10.303s
graphical.target reached after 8.412s in userspace

Find the slowest services

BASH
# List services by activation time, slowest first
systemd-analyze blame

# Show only the top 10 slowest
systemd-analyze blame | head -10

# Example output:
8.431s mysql.service
4.012s networking.service
2.204s cloud-init.service
1.891s apt-daily-upgrade.service
0.844s fail2ban.service

Critical path — what's actually blocking boot

BASH
# Show the critical chain — services on the longest dependency path
systemd-analyze critical-chain

# Critical chain for a specific target
systemd-analyze critical-chain multi-user.target

Visual SVG timeline (for desktop/local servers)

BASH
# Generate an SVG boot timeline (open in a browser)
systemd-analyze plot > /tmp/boot-timeline.svg

# Then open it on your local machine:
# scp user@server:/tmp/boot-timeline.svg ~/Desktop/

Verify unit configuration

BASH
# Check a unit file for errors
systemd-analyze verify /etc/systemd/system/myapp.service

# Check the security exposure level of a service
systemd-analyze security nginx
systemd-analyze security mysql
ℹ️
systemd-analyze security

The security subcommand scores each service unit against systemd's sandboxing capabilities. A high exposure score means the service has broad access to the system. This is a useful hardening audit tool — look for services running as root with PrivateTmp=no or NoNewPrivileges=no.
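Hardening usually means a drop-in that enables the missing sandbox directives; an illustrative fragment (the unit name is hypothetical, and which directives a service tolerates depends on what it actually needs):

```ini
# /etc/systemd/system/myapp.service.d/hardening.conf (hypothetical)
[Service]
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes
```

After adding it, run sudo systemctl daemon-reload, restart the service, and re-check the score with systemd-analyze security.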


09

Basic Performance Triage

When something is slow or broken, guessing wastes time. A structured triage process gets you to the cause faster. Work through these layers in order — most problems reveal themselves within the first three steps.

Step 1 — Is the server actually under load?

BASH
# Load average and uptime
uptime
14:22:01 up 12 days, 3:41, 2 users, load average: 0.42, 0.58, 0.62

# Load averages above your CPU core count = queued processes
# Check core count:
nproc
4

# If load average >> nproc, you have sustained pressure
# Launch htop to find the culprit:
htop

Step 2 — Memory pressure

BASH
# Quick memory overview
free -h
              total   used   free   shared  buff/cache  available
Mem:           7.7G   5.1G   312M    182M        2.3G      2.1G
Swap:          2.0G   820M   1.2G

# If "available" is very low and swap is growing, you have memory pressure
# Check for OOM events:
sudo dmesg | grep -i "oom\|killed"
sudo journalctl -p err -b | grep -i "oom\|memory"
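"Available is very low" can be made concrete with a quick /proc/meminfo calculation; a sketch (the 10% threshold is an arbitrary illustration, tune it to your workload):

```shell
# Percentage of RAM still available, per the kernel's MemAvailable estimate
total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
pct=$(( avail * 100 / total ))
echo "MemAvailable: ${pct}% of MemTotal"
if [ "$pct" -lt 10 ]; then
  echo "WARNING: likely memory pressure, check for OOM kills"
fi
```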

Step 3 — Is it a disk I/O problem?

BASH
# Check disk usage — full disks cause silent failures
df -h

# Check inode usage (can fill even when disk space is available)
df -i

# Find what's consuming disk in /var/log
sudo du -sh /var/log/* | sort -h | tail -10

# Check active I/O
sudo iotop -o

Step 4 — Check service logs for errors

BASH
# All errors since last boot
journalctl -b -p err -r | head -40

# Check the specific service that seems broken
systemctl status nginx
journalctl -u nginx -n 50

# Check authentication failures (brute force / intrusion)
sudo grep "Failed password\|Invalid user" /var/log/auth.log | tail -20

Step 5 — Network saturation

BASH
# Live bandwidth by interface
nload eth0

# Check open connections
ss -s

# Show established connections count by remote IP (detect floods); -H skips the header row
ss -ntH | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -10

# Check for unusually high connection counts to a port
ss -nt state established | wc -l

Quick reference: triage flow

BASH · Triage Checklist
# 1. Load average
uptime

# 2. CPU / process
htop

# 3. Memory
free -h

# 4. Disk space
df -h && df -i

# 5. Disk I/O
sudo iotop -o

# 6. Recent errors
journalctl -b -p err -r | head -30

# 7. Service status
systemctl status <servicename>

# 8. Kernel messages
sudo dmesg -T --level=err,crit | tail -20

# 9. Network
ss -s
nload            # interactive; press q to quit
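The non-interactive parts of the checklist can be condensed into one small report; a sketch that reads only /proc, so it runs even where htop, iotop, or nload are not installed:

```shell
#!/usr/bin/env bash
# Minimal /proc-based triage snapshot (a starting point, not a full health check)
set -u

echo "== load average (1m 5m 15m) =="
cut -d' ' -f1-3 /proc/loadavg

echo "== memory (kB) =="
grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree):' /proc/meminfo

echo "== top 5 processes by resident memory (kB) =="
for d in /proc/[0-9]*; do
  # Print "RSS name" for each process that has resident memory
  awk '/^Name:/ {n=$2} /^VmRSS:/ {r=$2} END {if (r) print r, n}' "$d/status" 2>/dev/null
done | sort -rn | head -5
```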
Diagnose, don't guess

The difference between a junior admin and a senior one isn't knowing more commands — it's working through a structured process. Start wide (is the whole server struggling?), then narrow (which service, which resource, which log entry). Each step eliminates a category before you move to the next.