
Runbooks for Real Incidents: What to Write Before Something Breaks
Short answer
A useful incident runbook tells an engineer what to check, what not to touch, when to escalate, how to verify recovery, and how to hand off the work. It should be written before the outage, not while everyone is already stressed.
A practical incident runbook should include:
- What the runbook is for
- When to use it
- When not to use it
- Target systems and access paths
- Safety checks before commands
- Read-only diagnostic commands
- Approved change commands
- Rollback or recovery steps
- Verification steps
- Escalation rules
- Handoff notes
- Final summary format
The best runbooks are not long essays. They are clear operating guides that help someone move from symptom to safe action.
Why runbooks matter during real incidents
During an incident, people are tired, rushed, and trying to make decisions with incomplete information.
That is the worst time to invent the process.
Without a runbook, teams often:
- Repeat the same checks
- Miss basic verification steps
- Type commands into the wrong system
- Restart services too early
- Save temporary network changes permanently
- Forget rollback conditions
- Close console sessions too soon
- Lose track of who owns which terminal
- Hand off work with no current-state summary
- Declare recovery before the affected path is actually fixed
A good runbook does not remove the need for judgment. It gives judgment a safer structure.
Write for the person who is under pressure
A runbook should be written for someone who is capable but busy.
Assume the person reading it:
- May not know the system as well as you do
- May be joining halfway through the incident
- May have several terminal sessions open
- May be working over SSH, serial console, or a jump host
- May need to decide whether to continue, pause, or escalate
- May be afraid of making the situation worse
That means the runbook should be practical, direct, and easy to scan.
Avoid vague instructions like:
Check the network.
Instead, write something the operator can act on:
Check whether the management VLAN is present on the uplink trunk:
show interfaces trunk
show vlan brief
If the VLAN is missing, do not replace the full allowed VLAN list from memory. Escalate or follow the approved VLAN recovery step.
Start with the runbook purpose
Every runbook should begin with a short purpose statement.
Example:
Purpose:
Use this runbook when SSH access to a network device fails but serial console access is available.
Another example:
Purpose:
Use this runbook when nginx is running but external HTTPS checks fail after a certificate change.
A clear purpose prevents people from using the wrong runbook for the wrong problem.
Define when to use it
Write the trigger conditions clearly.
Example:
Use this runbook when:
- Monitoring shows app-03 HTTPS check failing.
- SSH access to app-03 still works.
- nginx is the suspected service.
- No full host outage is confirmed.
For network equipment:
Use this runbook when:
- A switch is unreachable over SSH.
- The device is reachable through serial console.
- The device appears booted into the normal OS.
- The goal is to restore management access, not perform password recovery.
This keeps the runbook focused.
Define when not to use it
This section is just as important as the trigger conditions.
Example:
Do not use this runbook when:
- The device is in bootloader or ROMMON.
- The hostname does not match the ticket.
- The task requires password recovery.
- A firmware upgrade is currently running.
- No approved change window exists for configuration changes.
For service restarts:
Do not use this runbook when:
- The service controls your only SSH path.
- No fallback access is available.
- The config test fails.
- The maintenance window is not active.
This helps prevent a runbook from being stretched into an unsafe situation.
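The trigger and exclusion lists above can also be kept as data, so a script can tell the operator whether the runbook applies before any command is run. The following is a minimal Python sketch, not a prescribed implementation. The `Applicability` class and the condition names are hypothetical, and it assumes the observed facts are gathered by hand or by your own tooling into a set of strings:

```python
from dataclasses import dataclass, field


@dataclass
class Applicability:
    """Trigger and exclusion conditions for one runbook (names are illustrative)."""
    use_when: set[str] = field(default_factory=set)
    do_not_use_when: set[str] = field(default_factory=set)

    def check(self, observed: set[str]) -> tuple[bool, list[str]]:
        """Return (applies, reasons). Any exclusion wins over every trigger."""
        blocked = sorted(self.do_not_use_when & observed)
        if blocked:
            return False, [f"excluded: {c}" for c in blocked]
        missing = sorted(self.use_when - observed)
        if missing:
            return False, [f"not confirmed: {c}" for c in missing]
        return True, []


# Hypothetical encoding of the switch example above.
switch_ssh_runbook = Applicability(
    use_when={"ssh_unreachable", "serial_console_reachable", "normal_os_booted"},
    do_not_use_when={"in_rommon", "hostname_mismatch", "firmware_upgrade_running"},
)
```

Checking exclusions first is a deliberate choice in this sketch: a device in ROMMON should rule the runbook out even if every trigger condition is also true.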
List the systems involved
A runbook should tell the operator which systems matter.
Use a simple table or list:
Primary system:
app-03
Related systems:
jump-01
load-balancer-01
monitoring
DNS
Access path:
operator laptop -> VPN -> jump-01 -> app-03
For network work:
Primary device:
core-sw-02
Related devices:
core-sw-01
fw-01
jump-01
monitoring host
Access paths:
SSH through jump-01
Serial console through rack controller port 4
This reduces confusion during incidents with multiple terminals.
For multi-session work, see "How to Organize Multiple Device Sessions During an Incident."
Add target verification steps
Before any command that changes state, the runbook should make the operator confirm the target.
For Linux or Unix-like systems:
hostname
hostname -f
whoami
pwd
date
ip addr show
For network devices:
show running-config | include hostname
show version
show clock
show users
show ip interface brief
For serial console work:
Confirm:
- Rack label
- Console controller port
- Device prompt
- Device model
- Ticket target
Runbook note:
Stop if the hostname, prompt, rack label, or ticket target does not match.
For a deeper workflow, see "How to Avoid Working on the Wrong Server or Network Device."
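The stop rule above can be expressed as a comparison between what the ticket says and what the session reports. This is a small Python sketch under stated assumptions: the `target_mismatches` function and the dictionary keys are hypothetical, and the observed values are assumed to be typed in or parsed from commands such as `hostname` by your own tooling.

```python
def target_mismatches(expected: dict[str, str], observed: dict[str, str]) -> list[str]:
    """Compare what the ticket says with what the session reports.

    Both dicts use the same keys (for example "hostname", "console_port").
    A key missing from `observed` counts as a mismatch, because an
    unverified field is not a confirmed one.
    """
    problems = []
    for key, want in expected.items():
        got = observed.get(key)
        if got is None:
            problems.append(f"{key}: not verified (expected {want!r})")
        elif got.strip().lower() != want.strip().lower():
            problems.append(f"{key}: expected {want!r}, saw {got!r}")
    return problems


# Illustrative values: the ticket names core-sw-02, but the prompt says core-sw-01.
ticket = {"hostname": "core-sw-02", "console_port": "4"}
session = {"hostname": "core-sw-01", "console_port": "4"}
for line in target_mismatches(ticket, session):
    print("STOP:", line)
```

An empty list means every listed field matched. Anything else is a reason to stop before a change step.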
Separate read-only checks from change steps
A good runbook makes it obvious which steps are safe observations and which steps change state.
Use headings like:
Read-only checks
and:
Change steps
Example:
READ-ONLY CHECKS
systemctl status nginx --no-pager
sudo nginx -t
journalctl -u nginx --since "30 minutes ago" --no-pager
Then:
CHANGE STEP
Only run this after nginx -t passes and the maintenance window is active:
sudo systemctl reload nginx
For network devices:
READ-ONLY CHECKS
show interfaces trunk
show vlan brief
show logging | last 50
Then:
CHANGE STEP
Only after approval:
configure terminal
interface Gi1/0/24
switchport trunk allowed vlan add 40
end
This helps operators avoid accidentally treating a change command as a diagnostic command.
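If the runbook is ever stored in a structured form, the same separation can be carried as a flag on each step. The sketch below is one possible shape in Python. The `Step` type and `may_run` gate are hypothetical names, and how `checks_passed` and `approved` get set is left to your own process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    command: str
    changes_state: bool  # False = read-only check, True = change step


def may_run(step: Step, *, checks_passed: bool, approved: bool) -> bool:
    """Read-only steps always run; change steps need passing checks and approval."""
    if not step.changes_state:
        return True
    return checks_passed and approved


# The nginx example above, with each command labeled by its effect.
steps = [
    Step("sudo nginx -t", changes_state=False),
    Step("sudo systemctl reload nginx", changes_state=True),
]
```

Making `changes_state` a required field means nobody can add a command to the list without deciding which kind of step it is.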
Include expected output
Commands are more useful when the runbook says what to look for.
Weak:
Run nginx -t.
Better:
Run:
sudo nginx -t
Expected:
syntax is ok
test is successful
If it fails:
Do not reload nginx. Capture the error and escalate.
For network work:
Run:
show interfaces trunk
Expected:
Management VLAN appears in the allowed VLAN list on the expected trunk.
If missing:
Do not replace the full trunk VLAN list from memory. Use the approved narrow recovery step or escalate.
Expected output helps less experienced operators make better decisions.
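Expected output can also be checked mechanically when command output is captured to a log. A minimal Python sketch follows. The `missing_fragments` helper is hypothetical, and the sample text is only shaped like a passing `nginx -t` run; the file path in it is illustrative.

```python
def missing_fragments(output: str, expected: list[str]) -> list[str]:
    """Return the expected fragments that do not appear in the command output."""
    return [frag for frag in expected if frag not in output]


# Sample text shaped like a passing `nginx -t` run; the path is illustrative.
sample = (
    "nginx: the configuration file /etc/nginx/nginx.conf syntax is ok\n"
    "nginx: configuration file /etc/nginx/nginx.conf test is successful\n"
)
expected = ["syntax is ok", "test is successful"]

if missing_fragments(sample, expected):
    print("Do not reload nginx. Capture the error and escalate.")
else:
    print("Config test passed.")
```

Substring matching is deliberately simple here. It confirms that the expected phrases are present, not that nothing else went wrong, so the operator should still read the output.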
Include stop conditions
A runbook should tell people when to stop.
Example:
Stop and escalate if:
- The hostname does not match the ticket.
- The config test fails.
- The device is in bootloader or recovery mode.
- The command output does not match the expected state.
- The change would affect the only management path.
- No rollback path is available.
- A required approver is not present.
This is especially important for remote work, where one wrong change can cut off the only management path.
A confident stop is much safer than improvising under pressure.
Add rollback before the change
Rollback should appear before the change command, not after it.
Use this format:
Rollback trigger:
Rollback owner:
Rollback action (command or procedure):
Verification after rollback:
Example:
Rollback trigger:
nginx reload fails or external HTTPS check fails.
Rollback action:
Restore /etc/nginx/sites-enabled/app.conf.bak-2026-05-21.
Run sudo nginx -t.
Reload nginx only if syntax passes.
Verification:
systemctl status nginx.
Local curl.
External HTTPS check.
For network devices:
Rollback trigger:
Adding VLAN 40 causes unexpected trunk impact.
Rollback action:
Remove VLAN 40 from Gi1/0/24 trunk.
Verification:
show interfaces trunk.
Ping from jump host.
Monitoring status.
Save status:
Do not save until rollback decision is closed.
The operator should know how to back out before they move forward.
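One way to enforce "rollback before the change" is to refuse the change step while any rollback field is blank. The Python sketch below assumes the four fields from the format above. The `RollbackPlan` class and `ready_for_change` function are hypothetical names, not part of any existing tool.

```python
from dataclasses import dataclass, fields


@dataclass
class RollbackPlan:
    trigger: str = ""
    owner: str = ""
    procedure: str = ""
    verification: str = ""

    def blank_fields(self) -> list[str]:
        """Names of fields that are still empty or whitespace."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


def ready_for_change(plan: RollbackPlan) -> bool:
    """A change step is allowed only when every rollback field is filled in."""
    return not plan.blank_fields()


plan = RollbackPlan(
    trigger="nginx reload fails or external HTTPS check fails",
    procedure="Restore the backup config, run nginx -t, reload only if it passes",
    verification="systemctl status nginx, local curl, external HTTPS check",
)
print(plan.blank_fields())  # the owner has not been assigned yet
```

In this example the plan is not ready, because nobody has been named as rollback owner.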
Define verification from outside the system
Local success is not always real success.
A service may be active locally but unreachable for users. A switch may accept a config command but still be unreachable from the management network.
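Include outside verification alongside the local check. Run the local check on the host itself:
curl -I http://127.0.0.1
Then run the outside checks from jump-01 or another host on the affected path:
curl -I https://app.example
ssh user@app-03
nc -vz app-03 443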
For network devices:
Verify from jump host:
- ping management IP
- SSH to management IP
- check monitoring recovery
A verification note:
Do not mark resolved until the affected path is verified from outside the device.
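Where `nc` is not installed on the jump host, a TCP reachability check similar to `nc -vz` can be sketched with the Python standard library. The function names `tcp_reachable` and `path_verified` are hypothetical. The check confirms only that a TCP connection opens; it does not validate TLS or the HTTP response.

```python
import socket


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def path_verified(local_ok: bool, outside_ok: bool) -> bool:
    """Recovery counts only when the outside check passes as well as the local one."""
    return local_ok and outside_ok
```

Run from the jump host, a call such as `tcp_reachable("app-03", 443)` would stand in for the `nc -vz app-03 443` line above.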
Include handoff notes
Incidents often last longer than one person’s shift.
Add a handoff template directly to the runbook:
HANDOFF
Current owner:
New owner:
Target:
Access path:
Current state:
Last command:
Last verified result:
Changes made:
Unsaved changes:
Still running:
Do not do:
Next step:
Rollback:
Open questions:
This prevents the next engineer from guessing.
For more detail, see "Command Handoffs: How to Pass Terminal Work to Another Engineer Safely."
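If handoffs are posted to a ticket or chat channel, the template can be rendered by a small helper that also reports which fields were left unanswered. This is a Python sketch under that assumption. The `render_handoff` function is hypothetical, and the field names are copied from the template above.

```python
HANDOFF_FIELDS = [
    "Current owner", "New owner", "Target", "Access path", "Current state",
    "Last command", "Last verified result", "Changes made", "Unsaved changes",
    "Still running", "Do not do", "Next step", "Rollback", "Open questions",
]


def render_handoff(values: dict[str, str]) -> tuple[str, list[str]]:
    """Return the handoff text plus the list of fields left unanswered.

    Unanswered fields are rendered as "UNKNOWN" so a blank is never
    mistaken for "nothing to report".
    """
    lines, missing = ["HANDOFF"], []
    for name in HANDOFF_FIELDS:
        value = values.get(name, "").strip()
        if not value:
            missing.append(name)
            value = "UNKNOWN"
        lines.append(f"{name}: {value}")
    return "\n".join(lines), missing
```

Writing "UNKNOWN" rather than leaving a gap is a design choice in this sketch: it forces the outgoing engineer to either answer the field or admit it is unanswered.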
Include notes for copy-paste safety
Runbooks often contain commands. That means they can create copy-paste risk.
Add a short warning near risky command blocks:
Before pasting:
- Confirm target hostname or prompt.
- Confirm current mode.
- Read the command.
- Paste one command at a time.
- Do not paste the save command until verification is complete.
For network devices, avoid putting permanent save commands immediately after change blocks unless that is intentionally approved.
Riskier:
configure terminal
interface Gi1/0/24
switchport trunk allowed vlan add 40
end
write memory
Safer:
configure terminal
interface Gi1/0/24
switchport trunk allowed vlan add 40
end
Then verify before any save decision.
For safer paste habits, see "Safe Copy-Paste Habits for SSH and Serial Console Work."
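A runbook review can include a quick scan for save commands inside change blocks. The Python sketch below is a rough lint, not a complete parser. The `save_commands_in` function is a hypothetical name, and it assumes only the two spelled-out save forms listed in `SAVE_COMMANDS`; abbreviated forms would have to be added for your platform.

```python
# Assumption: only these spelled-out forms are checked; extend for your devices.
SAVE_COMMANDS = ("write memory", "copy running-config startup-config")


def save_commands_in(block: str) -> list[int]:
    """Return 1-based line numbers in `block` that would persist the config."""
    hits = []
    for number, line in enumerate(block.splitlines(), start=1):
        # Lowercase and collapse whitespace so spacing and case do not hide a match.
        normalized = " ".join(line.lower().split())
        if normalized.startswith(SAVE_COMMANDS):
            hits.append(number)
    return hits


risky = """configure terminal
interface Gi1/0/24
switchport trunk allowed vlan add 40
end
write memory"""
print(save_commands_in(risky))  # prints [5]
```

A non-empty result is a prompt to move the save command into its own step, after verification.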
Write for SSH, serial console, and browser workspaces
A real incident may move across access methods.
Your runbook should say which access method is expected:
Primary access:
SSH through jump-01
Fallback access:
Serial console through rack controller port 4
Do not close:
Serial console session until SSH and monitoring are stable.
This matters during network outages, firmware upgrades, and risky remote changes.
For console recovery, see "Console Access During a Network Outage: A Practical Recovery Checklist."
Include a final summary format
At the end of the incident, the operator should write a summary.
Add this to the runbook:
FINAL SUMMARY
Problem:
Start time:
End time:
Systems affected:
Root cause or likely cause:
Commands run:
Changes made:
Verification:
Rollback used:
Save/finalize status:
Follow-up:
Example:
FINAL SUMMARY
Problem:
Management SSH to core-sw-02 failed from jump-01.
Systems affected:
core-sw-02 management path.
Likely cause:
VLAN 40 missing from uplink trunk.
Changes made:
Added VLAN 40 back to trunk Gi1/0/24.
Verification:
Ping and SSH from jump-01 restored.
Monitoring recovered.
Rollback used:
No.
Save/finalize status:
Running config changed.
Startup config not updated.
Incident lead to approve save.
Follow-up:
Review previous change that removed VLAN 40.
This turns a stressful incident into a useful operational record.
A practical incident runbook template
Use this as a starting point:
INCIDENT RUNBOOK