
Monitoring, Logs & Troubleshooting

top · htop · iotop · nload  ·  journalctl · /var/log · dmesg  ·  systemd-analyze

This is where real sysadmins earn their keep. When something is slow, broken, or on fire, you need to diagnose — not guess. This guide covers the essential monitoring tools, log locations, and structured triage process used to find and fix problems on production Linux systems.

Created: 2026‑03‑17  ·  Written by Nicole M. Taylor
⚠️
Software Changes Over Time

This guide was written in March 2026 for Ubuntu 24.04 LTS. Tool names, output formats, and package availability may differ on other distributions or future releases. Always verify commands against your system's installed version.

01

top — Built-in Process Monitor

top is the quickest way to see what is consuming CPU and RAM right now. It's available on every Linux system with no install required — your first tool in any triage session.

Launch top

BASH
top

Essential top key bindings (while running)

Key       Action
q         Quit
M         Sort by memory usage
P         Sort by CPU usage (default)
k         Kill a process by PID
1         Show individual CPU cores
u         Filter by username
d         Change refresh interval
Space     Force immediate refresh

Run top non-interactively (for scripting)

BASH
# Single snapshot — 1 iteration, output to stdout
top -b -n 1

# Show roughly the top 20 processes (the header uses the first ~7 lines)
top -b -n 1 | head -30
ℹ️
Reading the top header

The first five lines show uptime and load averages, total task counts, the CPU breakdown (us=user, sy=kernel, id=idle, wa=I/O wait), and memory/swap usage. The load averages (1m / 5m / 15m) are your first sign of sustained pressure: values consistently above your CPU core count mean processes are queuing for CPU time.
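The load-versus-cores rule can be scripted for quick checks; a minimal sketch (the threshold logic and the OK/WARNING messages are our own, not output of top):

```shell
# Compare the 1-minute load average against the number of CPU cores.
cores=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
# awk does the floating-point comparison the shell cannot do natively
if awk -v l="$load1" -v c="$cores" 'BEGIN { exit !(l > c) }'; then
  echo "WARNING: 1m load ${load1} exceeds ${cores} cores"
else
  echo "OK: 1m load ${load1} within ${cores} cores"
fi
```

A single spike above the core count is normal; it is the 5 and 15 minute values staying high that indicate sustained pressure.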


02

htop — The Better top

htop is an interactive, color-coded process viewer that shows per-core CPU bars, memory meters, and process trees. It's the tool most sysadmins reach for first when investigating a live system.

Install htop

BASH
sudo apt install htop -y

Launch htop

BASH
htop

# Run as a specific user to filter immediately
htop -u www-data

htop key bindings

Key       Action
F2        Setup / configuration
F3        Search processes by name
F4        Filter (show only matching)
F5        Tree view (parent/child relationships)
F6        Sort by column
F9        Kill selected process
F10 / q   Quit
Space     Tag a process
u         Filter by user

What to look for

  • CPU bars pegged at 100% on one or more cores — identify the PID and process name
  • Memory bar nearly full with swap growing — potential OOM situation
  • Load average climbing well above core count — system is being overwhelmed
  • Zombie processes (Z state) — parent not reaping children, may indicate a bug
Tree view (F5) is invaluable

Pressing F5 in htop shows processes in a parent-child tree. This immediately reveals which web server worker, PHP-FPM pool, or spawned script is consuming resources — rather than a flat list that requires cross-referencing PIDs.
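Zombies can also be counted without htop by scanning /proc directly; a small sketch (note the parsing detail: in /proc/&lt;pid&gt;/stat the process name is parenthesized and may contain spaces, so the state field is read after the closing parenthesis):

```shell
# Count processes in state Z (zombie), the same state htop displays
zombies=0
for st in /proc/[0-9]*/stat; do
  # The field after the ")" closing the comm field is the process state
  state=$(sed 's/.*) //' "$st" 2>/dev/null | awk '{print $1}')
  if [ "$state" = "Z" ]; then
    zombies=$((zombies + 1))
  fi
done
echo "zombie processes: $zombies"
```

A persistently nonzero count points at a parent process that is not reaping its children.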


03

iotop — Disk I/O by Process

When your server is slow but CPU and RAM look fine, the culprit is often disk I/O. iotop shows you exactly which process is hammering storage — essential for diagnosing slow MySQL queries, runaway log writers, or backup jobs competing with live traffic.

Install iotop

BASH
sudo apt install iotop -y

Run iotop

BASH
# Interactive mode — requires root
sudo iotop

# Show only processes with active I/O (quieter output)
sudo iotop -o

# Batch mode — useful for logging
sudo iotop -b -n 3

Column reference

Column        Meaning
DISK READ     Current read bandwidth for this process
DISK WRITE    Current write bandwidth for this process
SWAPIN        % of time spent waiting on swap reads; high values indicate memory pressure
IO>           % of time this process spent waiting on I/O
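These columns are derived from per-process counters in /proc/&lt;pid&gt;/io. You can inspect your own shell's counters without root (a quick illustration of the data source, not a substitute for iotop):

```shell
# read_bytes / write_bytes are the storage-level counters iotop aggregates.
# Reading another user's /proc/<pid>/io requires root; your own shell's is fine.
grep -E '^(read_bytes|write_bytes):' /proc/$$/io
```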
⚠️
iotop requires kernel I/O accounting

On some minimal Ubuntu installs iotop may report "CONFIG_TASK_IO_ACCOUNTING not set in kernel". This is rare on standard Ubuntu 24.04 LTS but can happen on custom kernels. If so, use iostat (from sysstat) as an alternative for per-device I/O stats.

Alternative: iostat for device-level I/O

BASH
sudo apt install sysstat -y

# Show device I/O stats every 2 seconds
iostat -x 2

# Focus on a specific device
iostat -x 2 /dev/sda

04

nload — Network Bandwidth Monitor

nload shows real-time incoming and outgoing bandwidth per network interface with an ASCII graph. It's the fastest way to see if your server is saturating its network link or if traffic looks abnormal.

Install nload

BASH
sudo apt install nload -y

Run nload

BASH
# Monitor all interfaces (arrow keys to switch)
nload

# Monitor a specific interface
nload eth0
nload enp3s0

nload navigation

Key     Action
← →     Switch between network interfaces
F2      Options screen
q       Quit

Alternative: nethogs — bandwidth by process

nload shows interface-level bandwidth. If you need to know which process is responsible for that traffic, use nethogs:

BASH
sudo apt install nethogs -y
sudo nethogs eth0

Quick interface stats without extra tools

BASH
# Snapshot of bytes transferred on all interfaces
cat /proc/net/dev

# ip command interface stats
ip -s link show eth0
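The /proc/net/dev counters can be totalled with a short awk pipeline; a sketch (per the standard /proc/net/dev layout, the first field after the interface name is RX bytes and the ninth is TX bytes):

```shell
# Sum received/transmitted bytes across all interfaces.
# Split each line on ':' so f[1] is RX bytes and f[9] is TX bytes.
totals=$(awk -F: 'NR > 2 { split($2, f, " "); rx += f[1]; tx += f[9] }
                  END { printf "RX %d bytes, TX %d bytes", rx, tx }' /proc/net/dev)
echo "$totals"
```

These are cumulative counters since boot; sample twice and subtract to get a rate.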

05

journalctl — Reading System Logs

On modern Ubuntu systems, journalctl is the primary interface to the systemd journal — a structured log that captures output from every service, the kernel, and the boot process. Learning to query it efficiently is one of the most valuable sysadmin skills.

Most-used journalctl commands

BASH
# Show all logs (oldest first) — pipe to less
journalctl | less

# Show logs for a specific service
journalctl -u nginx
journalctl -u apache2
journalctl -u mysql
journalctl -u php8.3-fpm

# Follow logs in real time (like tail -f)
journalctl -u nginx -f

# Show logs since the last boot only
journalctl -b

# Show logs from a previous boot (-1 = last, -2 = two boots ago)
journalctl -b -1

# Filter by time range
journalctl --since "2026-03-17 14:00:00"
journalctl --since "1 hour ago"
journalctl --since "30 min ago" --until "now"

# Show only errors and above (emerg, alert, crit, err)
journalctl -p err

# Show kernel messages only
journalctl -k

# Show the last 50 lines
journalctl -n 50

# Show logs in reverse (newest first)
journalctl -r

Combining filters

BASH
# Errors from nginx in the last 2 hours
journalctl -u nginx -p err --since "2 hours ago"

# All errors since last boot, newest first
journalctl -b -p err -r

Disk usage and maintenance

BASH
# Check how much disk the journal is using
journalctl --disk-usage

# Keep only the last 2 weeks of logs
sudo journalctl --vacuum-time=2weeks

# Keep journal under 500 MB
sudo journalctl --vacuum-size=500M
ℹ️
Persistent journal across reboots

On Ubuntu 24.04, check whether /var/log/journal exists. If it does not, the journal lives in /run/log/journal (volatile, lost on reboot). To make it persistent across reboots: sudo mkdir -p /var/log/journal && sudo systemctl restart systemd-journald. Persistent logs are then stored in /var/log/journal.
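Alternatively, the storage mode can be set explicitly in journald's configuration; an illustrative excerpt (the 500M cap is just an example value):

```ini
# /etc/systemd/journald.conf (excerpt)
[Journal]
Storage=persistent
SystemMaxUse=500M
```

Apply with sudo systemctl restart systemd-journald.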


06

/var/log — Log File Structure

Not everything logs to the systemd journal. Many services write directly to files in /var/log. Knowing which log belongs to which service is fundamental to fast troubleshooting.

Key log locations

Log File / Directory            What It Contains
/var/log/syslog                 General system messages (kernel, daemons, services)
/var/log/auth.log               Authentication: SSH logins, sudo, PAM, failed login attempts
/var/log/kern.log               Kernel messages only
/var/log/dmesg                  Boot-time kernel ring buffer (hardware detection, driver messages)
/var/log/dpkg.log               Package installs, upgrades, and removals
/var/log/apt/history.log        High-level apt command history
/var/log/apache2/access.log     Every HTTP request served by Apache
/var/log/apache2/error.log      Apache errors, PHP errors via mod_php
/var/log/nginx/access.log       Every HTTP request served by Nginx
/var/log/nginx/error.log        Nginx errors and upstream failures
/var/log/mysql/error.log        MySQL/MariaDB startup, shutdown, and errors
/var/log/fail2ban.log           Fail2Ban bans, unbans, and jail activity
/var/log/ufw.log                UFW firewall rule matches (if logging enabled)
/var/log/php*.log               PHP error log (path set in php.ini)
Useful commands for working with log files

BASH
# Follow a log file in real time
sudo tail -f /var/log/syslog
sudo tail -f /var/log/apache2/error.log

# Show last 100 lines
sudo tail -n 100 /var/log/auth.log

# Search for a pattern in a log
sudo grep "Failed password" /var/log/auth.log
sudo grep "error" /var/log/nginx/error.log

# Count occurrences of a pattern
sudo grep -c "Failed password" /var/log/auth.log

# Search across all logs for a string
sudo grep -r "out of memory" /var/log/

# Check when a log was last modified
ls -lh /var/log/syslog
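The "Failed password" greps combine into a per-IP summary of SSH brute-force attempts; a sketch run against fabricated sample lines (point the pipeline at /var/log/auth.log on a real system):

```shell
# Fabricated auth.log sample lines for illustration
cat > /tmp/auth.sample <<'EOF'
Mar 17 14:01:02 server sshd[999]: Failed password for root from 203.0.113.5 port 52311 ssh2
Mar 17 14:01:09 server sshd[999]: Failed password for invalid user admin from 203.0.113.5 port 52388 ssh2
Mar 17 14:02:44 server sshd[999]: Failed password for root from 198.51.100.7 port 40100 ssh2
EOF

# Extract the IP after "from" and count occurrences per address
grep 'Failed password' /tmp/auth.sample \
  | grep -oE 'from [0-9]+(\.[0-9]+){3}' \
  | awk '{print $2}' \
  | sort | uniq -c | sort -rn
```

Each remote address is printed with its failure count, highest first.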

Log rotation

Ubuntu uses logrotate to compress and rotate log files automatically, preventing /var/log from filling your disk. Rotated logs have extensions like .1, .2.gz.

BASH
# Check logrotate configuration
cat /etc/logrotate.conf
ls /etc/logrotate.d/

# Manually trigger rotation (useful for testing)
sudo logrotate --force /etc/logrotate.conf

# Check disk usage of /var/log
sudo du -sh /var/log/*  | sort -h | tail -20
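A drop-in for a custom application log might look like the following (the /var/log/myapp path and the weekly/keep-4 policy are purely illustrative):

```
# /etc/logrotate.d/myapp (hypothetical)
/var/log/myapp/*.log {
    weekly
    rotate 4
    compress
    delaycompress
    missingok
    notifempty
}
```

Dry-run it with sudo logrotate --debug /etc/logrotate.d/myapp before trusting it.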

07

dmesg — Kernel Messages

dmesg prints the kernel ring buffer — hardware detection at boot, driver messages, disk errors, OOM (out-of-memory) kills, USB events, and network interface state changes. It's essential for diagnosing hardware problems and kernel-level errors that never appear in service logs.

Basic dmesg usage

BASH
# Print full kernel ring buffer
dmesg

# Human-readable timestamps (requires root on some systems)
sudo dmesg -T

# Follow new messages in real time
sudo dmesg -w

# Show only errors and above
sudo dmesg --level=err,crit,alert,emerg

# Show only warnings and above
sudo dmesg --level=warn,err,crit,alert,emerg

Filtering dmesg output

BASH
# Look for disk/storage errors
sudo dmesg | grep -i "error\|fail\|fault" | tail -30

# Look for OOM killer activity
sudo dmesg | grep -i "oom\|out of memory\|killed"

# Look for network interface events
sudo dmesg | grep -i "eth\|enp\|link"

# Look for USB device events
sudo dmesg | grep -i "usb"

# Look for hardware errors
sudo dmesg | grep -iE "mce|hardware error|corrected"

Common dmesg findings and what they mean

Message Pattern                   Likely Cause
oom-killer: ...killed process     System ran out of RAM; a process was killed to free memory
EXT4-fs error                     Filesystem corruption; run fsck on the affected device
ata... failed command             Disk I/O error; check SMART status with smartctl
NVRM: GPU... error                NVIDIA driver error; check GPU temperature and driver version
eth0: link down / link up         Network cable event or switch port issue
segfault at...                    Application crash with memory fault; usually a software bug
MCE: ... HARDWARE ERROR           Machine Check Exception; potential RAM or CPU hardware fault

08

systemd-analyze — Boot Time Profiling

systemd-analyze profiles your system boot — showing total boot time, which services are slowest, and generating visual timelines. It's the right tool when a server is taking longer than expected to come online after a reboot.

Total boot time

BASH
# Show total time broken down: firmware + loader + kernel + userspace
systemd-analyze

# Example output:
Startup finished in 1.832s (kernel) + 8.471s (userspace) = 10.303s
graphical.target reached after 8.412s in userspace

Find the slowest services

BASH
# List services by activation time, slowest first
systemd-analyze blame

# Show only the top 10 slowest
systemd-analyze blame | head -10

# Example output:
8.431s mysql.service
4.012s networking.service
2.204s cloud-init.service
1.891s apt-daily-upgrade.service
0.844s fail2ban.service

Critical path — what's actually blocking boot

BASH
# Show the critical chain — services on the longest dependency path
systemd-analyze critical-chain

# Critical chain for a specific target
systemd-analyze critical-chain multi-user.target

Visual SVG timeline (for desktop/local servers)

BASH
# Generate an SVG boot timeline (open in a browser)
systemd-analyze plot > /tmp/boot-timeline.svg

# Then open it on your local machine:
# scp user@server:/tmp/boot-timeline.svg ~/Desktop/

Verify unit configuration

BASH
# Check a unit file for errors
systemd-analyze verify /etc/systemd/system/myapp.service

# Check the security exposure level of a service
systemd-analyze security nginx
systemd-analyze security mysql
ℹ️
systemd-analyze security

The security subcommand scores each service unit against systemd's sandboxing capabilities. A high exposure score means the service has broad access to the system. This is a useful hardening audit tool — look for services running as root with PrivateTmp=no or NoNewPrivileges=no.
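Hardening usually means a drop-in that enables the missing sandbox directives; an illustrative fragment (the unit name is hypothetical, and which directives a service tolerates depends on what it actually needs):

```ini
# /etc/systemd/system/myapp.service.d/hardening.conf (hypothetical)
[Service]
NoNewPrivileges=yes
PrivateTmp=yes
ProtectSystem=strict
ProtectHome=yes
```

After adding it, run sudo systemctl daemon-reload, restart the service, and re-check the score with systemd-analyze security.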


09

Basic Performance Triage

When something is slow or broken, guessing wastes time. A structured triage process gets you to the cause faster. Work through these layers in order — most problems reveal themselves within the first three steps.

Step 1 — Is the server actually under load?

BASH
# Load average and uptime
uptime
14:22:01 up 12 days, 3:41, 2 users, load average: 0.42, 0.58, 0.62

# Load averages above your CPU core count = queued processes
# Check core count:
nproc
4

# If load average >> nproc, you have sustained pressure
# Launch htop to find the culprit:
htop

Step 2 — Memory pressure

BASH
# Quick memory overview
free -h
              total   used   free   shared  buff/cache  available
Mem:           7.7G   5.1G   312M    182M        2.3G      2.1G
Swap:          2.0G   820M   1.2G

# If "available" is very low and swap is growing, you have memory pressure
# Check for OOM events:
sudo dmesg | grep -i "oom\|killed"
sudo journalctl -p err -b | grep -i "oom\|memory"
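"Available is very low" can be made concrete with a quick /proc/meminfo calculation; a sketch (the 10% threshold is an arbitrary illustration, tune it to your workload):

```shell
# Percentage of RAM still available, per the kernel's MemAvailable estimate
total=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
avail=$(awk '/^MemAvailable:/ {print $2}' /proc/meminfo)
pct=$(( avail * 100 / total ))
echo "MemAvailable: ${pct}% of MemTotal"
if [ "$pct" -lt 10 ]; then
  echo "WARNING: likely memory pressure, check for OOM kills"
fi
```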

Step 3 — Is it a disk I/O problem?

BASH
# Check disk usage — full disks cause silent failures
df -h

# Check inode usage (can fill even when disk space is available)
df -i

# Find what's consuming disk in /var/log
sudo du -sh /var/log/* | sort -h | tail -10

# Check active I/O
sudo iotop -o

Step 4 — Check service logs for errors

BASH
# All errors since last boot
journalctl -b -p err -r | head -40

# Check the specific service that seems broken
systemctl status nginx
journalctl -u nginx -n 50

# Check authentication failures (brute force / intrusion)
sudo grep "Failed password\|Invalid user" /var/log/auth.log | tail -20

Step 5 — Network saturation

BASH
# Live bandwidth by interface
nload eth0

# Check open connections
ss -s

# Show established connections count by remote IP (detect floods); -H skips the header row
ss -ntH | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -rn | head -10

# Check for unusually high connection counts to a port
ss -nt state established | wc -l

Quick reference: triage flow

BASH · Triage Checklist
# 1. Load average
uptime

# 2. CPU / process
htop

# 3. Memory
free -h

# 4. Disk space
df -h && df -i

# 5. Disk I/O
sudo iotop -o

# 6. Recent errors
journalctl -b -p err -r | head -30

# 7. Service status
systemctl status <servicename>

# 8. Kernel messages
sudo dmesg -T --level=err,crit | tail -20

# 9. Network
ss -s
nload            # interactive; press q to quit
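The non-interactive parts of the checklist can be condensed into one small report; a sketch that reads only /proc, so it runs even where htop, iotop, or nload are not installed:

```shell
#!/usr/bin/env bash
# Minimal /proc-based triage snapshot (a starting point, not a full health check)
set -u

echo "== load average (1m 5m 15m) =="
cut -d' ' -f1-3 /proc/loadavg

echo "== memory (kB) =="
grep -E '^(MemTotal|MemAvailable|SwapTotal|SwapFree):' /proc/meminfo

echo "== top 5 processes by resident memory (kB) =="
for d in /proc/[0-9]*; do
  # Print "RSS name" for each process that has resident memory
  awk '/^Name:/ {n=$2} /^VmRSS:/ {r=$2} END {if (r) print r, n}' "$d/status" 2>/dev/null
done | sort -rn | head -5
```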
Diagnose, don't guess

The difference between a junior admin and a senior one isn't knowing more commands — it's working through a structured process. Start wide (is the whole server struggling?), then narrow (which service, which resource, which log entry). Each step eliminates a category before you move to the next.