Why Your Monitoring is Broken (And How to Fix It Before Your Boss Notices)

Last Monday, my phone started buzzing at 3 AM. “CRITICAL: Database server down!” it screamed. I stumbled to my laptop, logged in, and found… nothing wrong. The database was running fine. My monitoring system had been crying wolf for the past month.

Sound familiar? Yeah, monitoring systems are like smoke detectors - they’re either screaming bloody murder all the time, or they’re mysteriously silent right before your house burns down.

After 15 years of dealing with broken alerts, I’ve figured out how to make monitoring actually work. Here’s what I wish someone had told me when I started.

The Problem with Most Monitoring Systems

Most monitoring setups fail for the same reason: they’re designed by people who have never had to respond to a 3 AM alert.

Common mistakes I see everywhere:

- Alerting on things that don’t matter
- No context in alerts (just “ERROR!” with no details)
- The classic “disk space low” alert when you’ve got 90% free
- Getting 47 alerts for the same problem (thanks, dependency chains!)
- Alerts that require 15 minutes of investigation to determine if it’s real

Here’s the thing: every false positive teaches your team to ignore alerts. And when you finally get a real problem, everyone assumes it’s another false alarm.

The 3 AM Test

Before I set up any alert, I ask myself: “Would I want to get paged for this at 3 AM?”

If the answer is no, it doesn’t need to be an alert. It might need a dashboard, a log entry, or a weekly report. But not a page.

Examples of 3 AM worthy alerts:

- Website is down (can’t serve customers)
- Database is unreachable (affects everything)
- Authentication systems are broken (nobody can log in)
- Security breaches detected (the building is on fire)

Examples of NOT 3 AM worthy alerts:

- CPU usage above 80% (happens constantly during normal business hours)
- Single web server instance down but load balancer still works
- Non-critical backups failed (fix it during business hours)
- Log file growing (might be normal)
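Most of the items on this second list still deserve a home, just not a pager. With the Prometheus setup described later in this post, that usually means a recording rule feeding a dashboard panel or a weekly report instead of an alerting rule. A minimal sketch (the rule name is mine; node_cpu_seconds_total comes from node_exporter, which the scrape config below assumes):

```yaml
# recording_rules.yml - precompute "nice to know" signals for dashboards
# instead of paging on them
groups:
  - name: dashboard_signals
    rules:
      # Per-instance CPU usage, graphed in Grafana rather than alerted on
      - record: instance:cpu_usage:percent
        expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```

The same data can still drive a capacity report or a dashboard trend line; it just never touches the pager.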

The Fix: Alerts That Don’t Suck

  1. Start with Business Impact

Before you alert on any technical metric, ask: “What business impact does this have?”

Good alert:

```
ALERT: Payment Processing Down
Server: payments-prod-01
Issue: Cannot process credit card transactions
Impact: $0 revenue for past 5 minutes
Action: Immediately investigate payment gateway connectivity
```

Bad alert:

```
ERROR: TCP connection failed on port 443
```
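The surest way to keep alerts anchored to business impact is to alert on the business signal itself, not just the infrastructure underneath it. A sketch of what that can look like in Prometheus; orders_created_total is a hypothetical metric, not something exposed by the configs later in this post:

```yaml
# Sketch: alert on the business signal itself
# (orders_created_total is a made-up metric - instrument whatever your app exposes)
groups:
  - name: business_signals
    rules:
      - alert: NoOrdersCreated
        # Note: if the exporter disappears this returns no data rather than 0,
        # so pair it with an absent() rule in real deployments
        expr: sum(increase(orders_created_total[10m])) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "No orders created in the last 10 minutes"
          impact: "Revenue has flatlined; customers may be unable to check out"
```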

  2. Use the PagerDuty Rule

Every alert should include the answers to these questions:

- Problem: What’s actually broken?
- Impact: What business impact does this have?
- Guide: What should the on-call person do?
- Expectation: When should this be resolved?

Example of a complete alert:

```
[CRITICAL] Payment Gateway Down - payments-prod-01

PROBLEM: HTTP requests to payment gateway returning 503 errors
IMPACT: Cannot process payments, estimated $500/minute in lost revenue
GUIDE:
  1. Check gateway status page: https://status.stripe.com
  2. Verify DNS resolution: nslookup stripe.com
  3. Test from different network: curl -I https://api.stripe.com
  4. If external issue, notify accounting to suspend online orders
EXPECTATION: Resolution within 15 minutes or escalation to infrastructure team
```
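If you run Prometheus (covered below), one way to enforce this structure is to make the four answers annotations on the rule itself, so every page carries them automatically. A sketch, assuming a blackbox_exporter probe (probe_success) that is not part of the scrape config shown later:

```yaml
# Sketch: Problem / Impact / Guide / Expectation as rule annotations
# (probe_success assumes a blackbox_exporter check of the payment gateway)
groups:
  - name: payments
    rules:
      - alert: PaymentGatewayDown
        expr: probe_success{job="payment-gateway"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          problem: "HTTP requests to the payment gateway are failing"
          impact: "Cannot process payments, estimated $500/minute in lost revenue"
          guide: "Check https://status.stripe.com, verify DNS, curl -I https://api.stripe.com"
          expectation: "Resolution within 15 minutes, then escalate to the infrastructure team"
```

Your AlertManager templates can then render all four fields in the notification body instead of a bare error string.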
  3. Implement Alert Fatigue Prevention

Group related alerts instead of stacking them:

```
# Bad: 47 separate alerts
ALERT: CPU high on web-01
ALERT: CPU high on web-02
ALERT: CPU high on web-03
… and 44 more

# Good: One intelligent alert
ALERT: Web tier under stress
SERVICE: All web servers reporting >85% CPU
IMPACT: Response times degrading, user experience affected
ACTION: Auto-scale web tier or investigate load increase
```
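AlertManager can do most of this collapsing for you: grouping folds related alerts into one notification, and inhibition mutes warnings while a critical alert for the same service is already firing. A sketch of just those pieces (the service label and the placeholder receiver are assumptions, not part of the full config shown later):

```yaml
# Sketch: grouping + inhibition (fragment of an alertmanager.yml)
route:
  group_by: ['alertname', 'service']   # one notification per service, not per instance
  group_wait: 30s                      # wait briefly so related alerts arrive together
  group_interval: 5m
  receiver: 'pagerduty'                # placeholder receiver

receivers:
  - name: 'pagerduty'
    pagerduty_configs:
      - routing_key: 'your-pagerduty-key'

inhibit_rules:
  # While something critical is firing for a service, drop that service's warnings
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['service']
```

group_wait trades thirty seconds of latency for one coherent page instead of 47.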

Use sliding severity:

- Warning: Might become a problem soon
- Critical: Definitely a problem that needs attention
- Emergency: Everything is on fire, call everyone
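In Prometheus terms, sliding severity usually means the same expression at two thresholds with a severity label, so warnings go to email and criticals go to the pager. A sketch using disk space (the warning rule mirrors DiskSpaceLow further down in this post; the critical tier and its 5% threshold are my additions):

```yaml
# Sketch: one signal, two severities (thresholds are illustrative)
groups:
  - name: disk_space_tiers
    rules:
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 15m
        labels:
          severity: warning            # email, fix during business hours
        annotations:
          summary: "Less than 20% disk space left on {{ $labels.instance }}"

      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 5
        for: 5m
        labels:
          severity: critical           # page someone now
        annotations:
          summary: "Less than 5% disk space left on {{ $labels.instance }}"
```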

The Monitoring Stack That Actually Works

Here’s what I’ve found works best for small to medium teams:

  1. Base Layer: Metrics (Grafana + Prometheus)

```yaml
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "alert_rules.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: 'web-servers'
    static_configs:
      - targets: ['web-01:9100', 'web-02:9100', 'web-03:9100']

  - job_name: 'database'
    static_configs:
      - targets: ['postgres-01:9187']
```
  2. Alert Management (AlertManager)

```yaml
# alertmanager.yml
global:
  smtp_smarthost: 'smtp.company.com:587'
  smtp_from: 'alertmanager@company.com'   # required when using email_configs

route:
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'email-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'immediate-alert'
    - match:
        severity: warning
      receiver: 'email-notifications'

receivers:
  - name: 'immediate-alert'
    pagerduty_configs:
      - routing_key: 'your-pagerduty-key'
        description: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

  - name: 'email-notifications'
    email_configs:
      - to: 'oncall@company.com'
        subject: 'Warning: {{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
        body: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
```
  3. Alert Rules That Don’t Suck

```yaml
# alert_rules.yml
groups:
  - name: business_critical
    rules:
      - alert: WebsiteDown
        expr: up{job="web-servers"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Website is down"
          description: "Web server {{ $labels.instance }} has been down for more than 1 minute"
          guide: "Check if server is running, check load balancer, verify DNS"
          impact: "Customers cannot access the website"

      - alert: PaymentProcessingDown
        expr: increase(http_requests_total{job="payment-service",status=~"5.."}[5m]) > 100
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Payment processing degraded"
          description: "Payment service returned {{ $value }} 5xx responses in the last 5 minutes"
          guide: "Check payment gateway status, verify API keys, check network connectivity"
          impact: "Cannot process payments, revenue impact"

  - name: infrastructure_warnings
    rules:
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 < 20
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Disk space getting low"
          description: "Only {{ $value }}% of disk space is left on {{ $labels.instance }}"
          guide: "Clean up log files, archive old data, or expand disk"
          impact: "May cause application failures if space runs out"

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is {{ $value }}% on {{ $labels.instance }}"
          guide: "Check for memory leaks, restart services, or add more RAM"
          impact: "Performance degradation, potential OOM kills"
```

Testing Your Alerts (Before You Need Them)

The graveyard test: Set up a test environment with intentionally broken services and verify:

  1. Do you get the right alerts?
  2. Do you get them quickly enough?
  3. Is the information actionable?
  4. Do you get too many alerts?
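You can answer at least the first question before anything breaks for real: Prometheus ships a rule unit-tester (promtool test rules) that replays synthetic samples against your rule files. A minimal sketch against the WebsiteDown rule above; the file names are assumptions:

```yaml
# alert_tests.yml - run with: promtool test rules alert_tests.yml
rule_files:
  - alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # web-01 reports up for two minutes, then goes down
    input_series:
      - series: 'up{job="web-servers", instance="web-01:9100"}'
        values: '1 1 0 0 0 0'
    alert_rule_test:
      - eval_time: 5m
        alertname: WebsiteDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: web-servers
              instance: web-01:9100
            exp_annotations:
              summary: "Website is down"
              description: "Web server web-01:9100 has been down for more than 1 minute"
              guide: "Check if server is running, check load balancer, verify DNS"
              impact: "Customers cannot access the website"
```

Wire this into CI and a rule whose for: duration or labels drift will fail the build instead of failing you at 3 AM.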

The chaos engineering approach:

```bash
#!/bin/bash
# test-alerts.sh - Intentionally break things to test monitoring

echo "Testing monitoring system..."

# Kill a web server temporarily
ssh web-01 "systemctl stop nginx"
sleep 60
ssh web-01 "systemctl start nginx"

# Fill up disk space
ssh web-01 "dd if=/dev/zero of=/tmp/bigfile bs=1G count=9"
sleep 300
ssh web-01 "rm /tmp/bigfile"

# Simulate high CPU (redirect stderr too so ssh returns immediately)
ssh web-01 "yes > /dev/null 2>&1 &"
sleep 120
ssh web-01 "killall yes"
```

The Phone Number Problem

Here’s a hard truth: your monitoring system is only as good as your on-call rotation.

On-call best practices:

- Rotate regularly (no one should be on call more than 1 week at a time)
- Document escalation procedures
- Practice incident response (run drills)
- Have backups for critical systems
- Set realistic response time expectations

Pro tip: If you’re always the one getting called, it’s either because:

  1. You’re the only one who knows how to fix things (document more!)
  2. Your alerts are broken (fix them!)
  3. You’re too nice to escalate (be more assertive!)

The Dashboard Problem

Alerts tell you when something is wrong. Dashboards tell you why.

Essential dashboards every team needs:

  1. Business metrics: Orders, revenue, user registrations
  2. Infrastructure health: CPU, memory, disk, network
  3. Application performance: Response times, error rates, throughput
  4. Security: Failed logins, unusual traffic patterns, resource access
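If you are running the Grafana + Prometheus stack above, these dashboards do not have to live only in someone’s browser: Grafana can provision datasources and dashboard folders from files on disk, so they survive rebuilds and live in version control. A sketch of the two provisioning files (paths and names are assumptions):

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```

```yaml
# /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: 'team-dashboards'
    type: file
    options:
      path: /var/lib/grafana/dashboards   # dashboard JSON exports checked into git
```

Export each of the four dashboards to JSON, commit them next to your alert rules, and a freshly rebuilt Grafana comes back looking the same.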

Dashboard design principles:

- Show the last 24 hours by default (you care about recent trends)
- Use red/yellow/green colors consistently
- Include time ranges (1h, 6h, 24h, 7d)
- Link related metrics (don’t make me hunt for context)

The Final Truth About Monitoring

Good monitoring is boring. When it’s working, you don’t think about it. When something breaks, you get the right alert at the right time with enough information to fix it quickly.

Bad monitoring is exciting. It screams at you constantly, wakes you up for false alarms, and leaves you guessing when something is actually wrong.

My philosophy: Spend 80% of your monitoring effort on reducing false positives. The remaining 20% will take care of itself.

Quick Implementation Checklist

  1. Identify top 5 business-critical systems
  2. Define clear business impact for each system
  3. Create 3 AM tests for all current alerts
  4. Implement structured alerting (PagerDuty rule)
  5. Set up basic metrics collection (Prometheus)
  6. Configure alert management (AlertManager)
  7. Test alerts in staging environment
  8. Document escalation procedures
  9. Train team on alert response
  10. Review and tune alerts monthly

Remember: The goal of monitoring isn’t to alert on everything. It’s to alert on the right things at the right time with the right context.

Your users don’t care if your CPU usage spikes at 2 AM. They care if the website is down when they try to place their order. Focus on what matters, and your monitoring will actually work.