In the high-pressure world of IT service delivery, major incidents are inevitable. What sets successful organizations apart is not just how quickly they resolve an incident, but how effectively they communicate throughout the process.

Far too often, the communication during a major incident is vague, incomplete, or overly technical. This leads to confusion, delays in decision-making, and a loss of stakeholder confidence—especially in outsourcing contracts, where transparency and trust are paramount.

That’s why having a clear, structured, and repeatable communication checklist is essential. It helps technical teams focus on what matters most: providing timely, accurate, and business-relevant updates to both internal leadership and external clients.

In this article, I’ll walk you through a practical Major Incident Communication Checklist designed specifically for war rooms and command centers. This checklist ensures that every update—whether it’s the initial notification, ongoing progress report, or final resolution—contains the critical elements that stakeholders need to know.

Whether you’re a Major Incident Manager, Service Delivery Lead, or Technical Team Member, this guide will help you elevate your incident communications from reactive to professional, proactive, and client-aligned.

Initial Notification – What to Include

  1. Incident ID: [Ticket number or reference]
  2. Issue Summary: What is broken? (1-2 lines max)
  3. Impact Summary:
    • Who/what is affected?
    • Number of users/regions impacted
    • Key business functions affected
  4. Start Time: When did the issue begin?
  5. Reported Source: How was the issue detected? (e.g., Monitoring Alert, User Report, etc.)
  6. Current Status: New / Investigating / Mitigation in Progress / Resolved
  7. Technical Teams Involved: (e.g., Network, DB, App Support)
  8. Initial Suspected Cause: Optional; include only if known
  9. Next Update Time: When will stakeholders hear from us again?
  10. Point of Contact: Who is the Major Incident Manager?
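
To make this concrete, here's a minimal Python sketch of how these ten fields could be captured and rendered as a plain-text update. The class name, field names, and example values are illustrative only; in practice you would adapt the format to your own ticketing and messaging tools.

```python
# Minimal sketch of an initial notification as a structured template.
# Field names, example values, and the plain-text rendering are illustrative.
from dataclasses import dataclass


@dataclass
class InitialNotification:
    incident_id: str
    issue_summary: str
    impact_summary: str
    start_time: str
    reported_source: str
    current_status: str
    teams_involved: str
    suspected_cause: str      # optional; leave empty if not yet known
    next_update_time: str
    point_of_contact: str

    def render(self) -> str:
        """Render the notification as a plain-text update."""
        lines = [
            f"Incident ID: {self.incident_id}",
            f"Issue Summary: {self.issue_summary}",
            f"Impact: {self.impact_summary}",
            f"Start Time: {self.start_time}",
            f"Reported Source: {self.reported_source}",
            f"Current Status: {self.current_status}",
            f"Technical Teams Involved: {self.teams_involved}",
        ]
        if self.suspected_cause:
            lines.append(f"Initial Suspected Cause: {self.suspected_cause}")
        lines += [
            f"Next Update: {self.next_update_time}",
            f"Major Incident Manager: {self.point_of_contact}",
        ]
        return "\n".join(lines)


print(InitialNotification(
    incident_id="INC0012345",
    issue_summary="Order API returning HTTP 500 errors",
    impact_summary="EU region, ~2,000 users, online ordering unavailable",
    start_time="09:12 UTC",
    reported_source="Monitoring Alert",
    current_status="Investigating",
    teams_involved="App Support, Network",
    suspected_cause="",                 # not yet known, so it is omitted
    next_update_time="09:45 UTC",
    point_of_contact="Major Incident Manager on duty",
).render())
```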

🛠️ Remediation Actions (e.g., Server Reboots, Service Restarts, Other Infrastructure Actions)

Before Proceeding with a Reboot or Infrastructure Action:

  1. Approval Requirement
    • Confirm if client approval is needed (per contract/SOP)
    • Identify who must approve the action (name, role, contact)
    • If approver is unavailable:
      • Escalate to Incident Manager / Duty Manager
      • Follow the predefined escalation path or invoke emergency authority if permitted (document this decision)
  2. Pre-Reboot Checklist (Cross-Team Validation)
    • Application Team: Is this server currently handling transactions? Is there a risk of data loss?
    • Database Team: Any active DB sessions, locks, transactions? Safe to restart?
    • Middleware Team: Impact on integrations, services, or queues?
    • Monitoring Team: Suppress alerting for the reboot window and re-enable it once the reboot is complete
    • Dependencies Check: Is this server hosting shared services for other systems?
    • Cluster/Failover Configurations: Ensure cluster failover won’t be triggered accidentally
  3. Communication Requirements
    • Notify all impacted stakeholders before reboot (app teams, client POCs, service desk)
    • Include in update:
      • Reason for reboot
      • Expected downtime
      • Expected impact (if any)
      • Time window
      • Reboot plan and rollback plan
  4. Post-Reboot Checklist
    • Application startup confirmation
    • Service health check
    • DB availability check
    • Monitoring re-enabled
    • Confirm service functionality end-to-end
    • Update stakeholders on the successful reboot and current service status
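
If you script any part of this process, the go/no-go decision before the reboot is a natural candidate. The sketch below assumes a simple record of approvals and team sign-offs and blocks the action until both are complete; the team list and data shapes are illustrative, not a standard.

```python
# Minimal sketch of a pre-reboot gate: the reboot proceeds only when an
# approval is recorded and every required team has signed off.
# The team list and the shape of the dictionaries are illustrative.
REQUIRED_SIGN_OFFS = ["Application", "Database", "Middleware", "Monitoring"]


def reboot_go_no_go(approval: dict, sign_offs: dict) -> tuple[bool, list[str]]:
    """Return (go, blockers): go is True only if nothing blocks the reboot."""
    blockers = []
    if not approval.get("approved_by"):
        blockers.append("No client/SOP approval recorded")
    for team in REQUIRED_SIGN_OFFS:
        if not sign_offs.get(team, False):
            blockers.append(f"{team} team has not confirmed it is safe to proceed")
    return (not blockers, blockers)


approval = {"approved_by": "Client POC", "reference": "Change/approval record ID"}
sign_offs = {"Application": True, "Database": True,
             "Middleware": True, "Monitoring": False}

go, blockers = reboot_go_no_go(approval, sign_offs)
print("Proceed with reboot" if go else "HOLD:\n- " + "\n- ".join(blockers))
```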

🔄 Ongoing Updates – Keep Them Consistent

  1. Time of Update: [Timestamp]
  2. Status Summary: What’s changed since last update?
  3. Actions Taken: What have we done so far?
  4. Mitigation Steps (if any): Any short-term fixes in place?
  5. ETR: Estimated Time to Resolution (or say “No ETR yet”)
  6. Next Steps: What's happening now, and what comes next?
  7. Next Update Time: Always include this.
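
Here's a small sketch of that update format. Making the next update time a required argument is deliberate: an update without one shouldn't go out. Again, field names and example values are illustrative only.

```python
# Minimal sketch of an ongoing-update template. The next update time is a
# required argument on purpose: an update without one should not go out.
# Field names and example values are illustrative.
def ongoing_update(timestamp: str, status: str, actions_taken: str,
                   mitigation: str, etr: str, next_steps: str,
                   next_update_time: str) -> str:
    return "\n".join([
        f"Update Time: {timestamp}",
        f"Status: {status}",
        f"Actions Taken: {actions_taken}",
        f"Mitigation: {mitigation or 'None in place yet'}",
        f"ETR: {etr or 'No ETR yet'}",
        f"Next Steps: {next_steps}",
        f"Next Update: {next_update_time}",
    ])


print(ongoing_update(
    timestamp="10:15 UTC",
    status="Mitigation in progress; error rate dropping",
    actions_taken="Faulty node removed from the load balancer pool",
    mitigation="Traffic rerouted to the secondary region",
    etr="",
    next_steps="Database team restarting the affected replica",
    next_update_time="10:45 UTC",
))
```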

Resolution Notification – When Resolved

  1. Time of Resolution: [Timestamp]
  2. What Fixed It: Brief explanation (1-2 lines)
  3. Services Restored: Confirm services are stable
  4. Monitoring Check: Confirm no residual alerts/issues
  5. Next Steps: RCA timeline / post-incident review
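
Finally, a sketch of the resolution notice, built around two hypothetical placeholder checks (service_healthy and monitoring_clear): the message is only produced once both pass, mirroring items 3 and 4 above.

```python
# Minimal sketch: the resolution notice is only produced after the service
# health check and the monitoring check both pass. The two check functions
# are hypothetical placeholders for whatever verification your team runs.
def service_healthy() -> bool:
    return True   # placeholder: e.g. run an end-to-end functional check here


def monitoring_clear() -> bool:
    return True   # placeholder: e.g. confirm no open alerts in monitoring here


def resolution_notice(resolved_at: str, fix_summary: str, rca_eta: str) -> str:
    if not (service_healthy() and monitoring_clear()):
        raise RuntimeError("Do not announce resolution: residual issues detected")
    return "\n".join([
        f"Resolved At: {resolved_at}",
        f"What Fixed It: {fix_summary}",
        "Services Restored: Confirmed stable",
        "Monitoring: No residual alerts",
        f"Next Steps: RCA by {rca_eta}; post-incident review to be scheduled",
    ])


print(resolution_notice("11:02 UTC",
                        "Restarted the affected database service and cleared stuck sessions",
                        "end of week"))
```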

💡 Guiding Principles

  • Never reboot systems without upstream/downstream checks and approvals.
  • Document approvals and escalations.
  • Communicate clearly and proactively.
  • Always validate post-action service health before closing the loop.