The Problem Every Sysadmin Knows Too Well
It’s 3 AM. Your phone buzzes with a critical alert. Production is down, customers are angry, and your manager is asking questions you don’t have good answers to.
Sound familiar? You’re not alone. According to a recent survey, 78% of sysadmin emergencies could have been prevented with better proactive monitoring. But here’s the thing: most monitoring solutions are overkill for what you really need.
The 5-Minute Daily Ritual
Instead of relying solely on complex monitoring stacks, I’ve developed a simple daily health check that takes exactly 5 minutes and has saved my team countless headaches. You can do this right now, and you should.
Step 1: The Dashboard Glance (1 minute)
Before you even touch your terminal, open your monitoring dashboard and ask yourself three questions:
- Are there any red indicators? (Obvious, but people miss this)
- Are the numbers within expected ranges? (Know your baselines)
- Are there any unusual patterns? (Trends matter more than single data points)
Pro tip: If you’re looking at more than 10 metrics, you’re probably over-engineering your monitoring.
Step 2: The Disk Space Reality Check (1 minute)
# The command that saves careers
df -h
# But here's what you actually need to check
df -h | grep -E '9[0-9]%'
If anything shows 90% or higher, you’ve got a problem brewing. 95%+ means you need to act today, not tomorrow.
Common culprits:
- Log files growing uncontrollably
- Temporary files never cleaned up
- Someone’s home directory full of movies
- That “temporary” test database that’s been running for 6 months
Step 3: The Process Health Scan (1 minute)
# Find the real troublemakers
ps aux --sort=-%mem | head -10
ps aux --sort=-%cpu | head -10
Look for:
- Processes consuming abnormal amounts of CPU or memory
- Zombie processes (they never fix themselves)
- Processes that should have stopped but didn’t
Step 4: The Log Pattern Check (1 minute)
# Check for recent errors
journalctl --since "1 hour ago" | grep -i error | tail -5
# Or for systems using traditional syslog
tail -100 /var/log/syslog | grep ERROR
What to look for:
- Repeated error patterns (indicates systemic issues)
- Permission denied errors (usually means a service broke)
- Connection timeouts (network or service issues)
Step 5: The Service Status Reality (1 minute)
# Quick service health check
systemctl list-failed
# Or for older systems
service --status-all 2>&1 | grep -E "(FAIL|STOP)"
Making This Actionable
The secret isn’t having the perfect monitoring setup—it’s developing the habit of doing this daily check before you get your coffee.
Create Your Script
Here’s a simple script I use:
#!/bin/bash
# Daily sysadmin health check
echo "=== $(date) ==="
echo "Disk Usage:"
df -h | grep -E '9[0-9]%'
echo -e "
Top Memory Processes:"
ps aux --sort=-%mem | head -5
echo -e "
Top CPU Processes:"
ps aux --sort=-%cpu | head -5
echo -e "
Failed Services:"
systemctl list-failed --no-pager
echo -e "
Recent Errors (last hour):"
journalctl --since "1 hour ago" --no-pager | grep -i error | tail -3
Run this every morning at 9 AM. Set it as a cron job, or better yet, do it manually while you wait for your coffee to brew.
The Real Secret
The most important part isn’t the commands—it’s the consistency. When you do this every day, you start to recognize what’s normal for your environment. That baseline knowledge is more valuable than any monitoring tool.
Warning Signs You Should Know
- Sudden spikes in error rates (even if still “low”)
- Processes that restart frequently (usually indicates resource issues)
- Disk usage that increases daily (disk leaks are real)
- Memory usage that never decreases (memory leaks are realer)
When This Isn’t Enough
This 5-minute check won’t catch everything, and that’s not its purpose. It’s designed to catch the 80% of problems that are obvious if you just look.
For the remaining 20%, you’ll need proper monitoring, alerting, and incident response procedures. But start with this—it’s the foundation everything else builds on.
Make It Part of Your Morning Routine
Pick a consistent time, make it non-negotiable, and do it every single day. Your future self (and your sleep schedule) will thank you.
Remember: the best time to find problems is when you’re not in crisis mode. This simple habit has saved me countless late nights and probably my job more than once.
What are your daily health check routines? Share your favorite commands or scripts in the comments below. Let’s build a community of proactive sysadmins.