Dead Man's Switch Monitoring, Explained

June 18, 2026 · PingGuard Team

1. What is a dead man's switch?
2. The same idea, applied to monitoring
3. Why "silence = alarm" beats error-catching
4. When to use one (and when not to)
5. Picking the interval and grace period
6. A concrete example

The term sounds dramatic, but the idea is simple and it solves one of the most common blind spots in infrastructure monitoring: things that fail by not happening.

What is a dead man's switch?

A dead man's switch is a mechanism that triggers an action when a human (or a system) stops doing something, rather than when they actively do it. The classic example is a train: the driver has to keep a pedal or handle held down. If they let go - because they fell asleep, or worse - the switch releases and the train automatically brakes. The default state is "stop." Continued operation requires a continuous signal saying "everything is fine."

The key inversion is this: the absence of a signal is itself the trigger. You do not need to detect a problem directly. You only need to notice that the regular "I'm okay" signal stopped arriving.

The same idea, applied to monitoring

In monitoring, a dead man's switch (usually called heartbeat monitoring) works exactly the same way:

A job, script, or service sends a periodic "I ran successfully" ping to a monitor.
The monitor knows how often it should hear that ping.
If the ping does not arrive on schedule, the monitor raises the alarm.

The job holds the pedal down by checking in. The moment it stops checking in - for any reason at all - the switch releases and you get alerted.
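To make the three steps concrete, here is a minimal sketch of the monitor's side of the arrangement. The class and method names are hypothetical and are not PingGuard's API. It also assumes that a job which has never pinged is treated as healthy, where a real monitor would more likely start the clock when the check is created.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class HeartbeatMonitor:
    """Toy monitor-side state for one job (hypothetical, for illustration)."""
    expected_interval: timedelta
    last_ping: Optional[datetime] = None

    def record_ping(self, now: datetime) -> None:
        # The job "holds the pedal down" by calling this after each successful run.
        self.last_ping = now

    def is_silent(self, now: datetime) -> bool:
        # Assumption: no ping yet means "not started monitoring", so no alarm.
        if self.last_ping is None:
            return False
        # The absence of a ping for longer than the interval is the trigger.
        return now - self.last_ping > self.expected_interval


monitor = HeartbeatMonitor(expected_interval=timedelta(hours=24))
monitor.record_ping(datetime(2026, 6, 18, 2, 30))

on_time = monitor.is_silent(datetime(2026, 6, 19, 2, 0))  # 23.5 hours since last ping
overdue = monitor.is_silent(datetime(2026, 6, 19, 3, 0))  # 24.5 hours since last ping
```

The monitor never inspects the job itself. It only compares the clock against the last check-in, which is why it keeps working when the job's host is down.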

Why "silence = alarm" beats error-catching

The instinct most people have is to catch errors: wrap the job in a try/except, log failures, send an alert when something throws. That is good practice, but it has a fatal gap: the alert can only fire if your code is running in the first place.

If the whole job never starts - the server is down, the scheduler died, the crontab entry was deleted, the container failed to launch - there is no code running to catch anything. Your error handling is irrelevant because it never executed. This is the exact failure mode that takes down backups for weeks without anyone noticing.

A dead man's switch closes that gap because it does not depend on your code running. It depends on something outside your system noticing that the expected check-in never arrived. Silence is detectable even when everything on your side is dead.

In one line: error alerts tell you when a job fails. A dead man's switch tells you when a job does not happen - which is the failure you are least likely to catch any other way.
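The two mechanisms can sit side by side in the same job. The sketch below is one possible shape, with `job`, `send_ping`, and `report_error` as placeholder callables rather than any real library's API. Both paths live inside the process, so if the process never starts, neither runs and only the missing ping reveals it.

```python
from typing import Callable, List


def run_with_heartbeat(job: Callable[[], None],
                       send_ping: Callable[[], None],
                       report_error: Callable[[Exception], None]) -> bool:
    """Run the job and check in only on success (all callables are placeholders)."""
    try:
        job()
    except Exception as exc:
        # Error-catching path: reachable only if this process actually started.
        report_error(exc)
        return False
    # Heartbeat path: reached only after the job finished without raising.
    send_ping()
    return True


events: List[str] = []


def failing_job() -> None:
    raise RuntimeError("disk full")


ok = run_with_heartbeat(lambda: None,
                        lambda: events.append("ping"),
                        lambda exc: events.append(f"error: {exc}"))
failed = run_with_heartbeat(failing_job,
                            lambda: events.append("ping"),
                            lambda exc: events.append(f"error: {exc}"))
```

After these two calls, `events` holds one ping for the successful run and one error report for the failed run. A run that never started would add nothing at all, and the monitor's timer is the only thing left to notice.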

When to use one (and when not to)

Use a dead man's switch for anything that runs on a schedule or should run continuously:

Cron jobs and scheduled tasks (backups, ETL, cleanup, billing runs)
Background workers and queue consumers that should always be processing
Kubernetes CronJobs and systemd timers
Recurring data syncs and report generation
Any "set it and forget it" automation where forgetting is the danger

Do not use it for things you can poll directly. For a website or an API, ordinary uptime monitoring is better - there is a live endpoint to check, and you get richer data (status codes, response times) by reaching out to it. The dead man's switch shines specifically where there is nothing to poll.

In practice, most teams want both: uptime monitoring for their public-facing services, and heartbeat monitoring for their scheduled jobs. The two patterns cover different failure shapes.

Picking the interval and grace period

Two settings make or break a heartbeat monitor:

Expected interval - how often the job should check in. Match it to the schedule. A daily backup checks in roughly every 24 hours.
Grace period - how long to wait past the deadline before alerting. A job that normally takes 10 minutes might legitimately take 40 on a heavy day. Set the grace period so normal variance does not page you, but real failures do.

Too tight and you get false alarms every time a job runs a little long. Too loose and you find out about a dead backup half a day late. Start a little generous, then tighten once you have seen how the job actually behaves.
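The trade-off is easy to see with a little date arithmetic. This sketch assumes the monitor computes its alert deadline as last check-in plus interval plus grace period. Real monitors may anchor the deadline differently, for example to the cron schedule, and the function names and times here are made up for illustration.

```python
from datetime import datetime, timedelta


def alert_deadline(last_ping: datetime, interval: timedelta, grace: timedelta) -> datetime:
    """Latest moment the next check-in may arrive before an alert fires."""
    return last_ping + interval + grace


def should_alert(now: datetime, last_ping: datetime,
                 interval: timedelta, grace: timedelta) -> bool:
    return now > alert_deadline(last_ping, interval, grace)


interval = timedelta(hours=24)

# Yesterday's backup took its usual 10 minutes and checked in at 02:10.
last_ping = datetime(2026, 6, 18, 2, 10)
# Today is a heavy day: the run takes 40 minutes and would check in at 02:40.
slow_finish = datetime(2026, 6, 19, 2, 40)

# A 15-minute grace period puts the deadline at 02:25, so the slow run pages you.
too_tight = should_alert(slow_finish, last_ping, interval, timedelta(minutes=15))
# A 1-hour grace period puts the deadline at 03:10, so the slow run is tolerated.
generous = should_alert(slow_finish, last_ping, interval, timedelta(hours=1))
```

Here `too_tight` is true and `generous` is false: the same slow but healthy run is a false alarm under one setting and a non-event under the other.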

A concrete example

Say you run a nightly export at 2am that should finish within 30 minutes. You would:

Create a heartbeat monitor with an expected interval of 24 hours and a grace period of, say, 1 hour.
Have the export ping the monitor only when it finishes successfully.
If no check-in arrives within 25 hours of the previous one (so by 3am to 3:30am, depending on when the last run finished), the monitor alerts you - whether the job errored, hung, or never started.

# Pings only if the export succeeds
0 2 * * *  /usr/local/bin/nightly-export.sh && curl -fsS https://pingguard.org/heartbeat/your-token

One line of setup, and a category of silent failure that used to be invisible becomes a clear, timely alert.
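If the export is itself a Python script, the check-in can live inside it instead of in the crontab. The following is a rough equivalent of the curl call using only the standard library. The URL is the placeholder from the crontab line above, and the retry count, timeout, and injectable `opener` parameter are assumptions added for illustration and testability.

```python
import urllib.request
from typing import Callable

# Placeholder URL copied from the crontab example; substitute your own token.
HEARTBEAT_URL = "https://pingguard.org/heartbeat/your-token"


def send_heartbeat(url: str = HEARTBEAT_URL,
                   attempts: int = 3,
                   timeout: float = 10.0,
                   opener: Callable = urllib.request.urlopen) -> bool:
    """Try to check in; return True on the first successful response."""
    for _ in range(attempts):
        try:
            with opener(url, timeout=timeout) as response:
                if 200 <= response.status < 300:
                    return True
        except OSError:
            # URLError, HTTPError, and socket timeouts are all OSError subclasses.
            continue
    return False
```

Call `send_heartbeat()` as the last step of the script, after the export has finished without errors. The timeout keeps a slow monitor from hanging the job, and a failed check-in returns `False` rather than raising, so a monitoring outage does not turn a successful export into a crash.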

Set up a dead man's switch in 60 seconds

PingGuard's heartbeat monitoring is free on the starter plan - and lives alongside your website, API, and SSL monitors in one dashboard.
