• Are staff geographically distributed (i.e., can other regions cover for each other for
extended periods of time)?
• Do you write postmortems? Is there a deadline for when a postmortem must be
completed?
• Is there a standard template for postmortems?
• Are postmortems reviewed to ensure action items are completed?
• If there is a corporate standard practice for this operational responsibility (OR),
what is it and how does this service comply with the practice?
Level 1: Initial
• Outages are reported by users rather than a monitoring system.
• No one is ever oncall, a single person is always oncall, or everyone is always
oncall.
• There is no oncall schedule.
• There is no oncall calendar.
• There is no playbook of what to do for various alerts.
Level 2: Repeatable
• A monitoring system contacts the oncall person.
• There is an oncall schedule with an escalation plan.
• There is a repeatable process for creating the next month's oncall calendar.
• A playbook item exists for any possible alert.
• A postmortem template exists.
• Postmortems are written occasionally but not consistently.
• Oncall coverage is geographically diverse (multiple time zones).
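As one illustration of a repeatable process for creating the next month's oncall calendar, the sketch below rotates a roster one week at a time. It is a minimal example under stated assumptions, not a prescribed method: the function name next_month_oncall, the Monday handoff, and the roster names are all hypothetical, and a real process would also account for vacations, time zones, and the escalation plan.

```python
import calendar
from datetime import date


def next_month_oncall(today, roster, start_index=0):
    """Return {date: person} for each day of the month after `today`.

    Assumption (not from the source): the primary oncall person changes
    every Monday, cycling through `roster` in order.
    """
    # Work out which month comes next, rolling over the year in December.
    year = today.year + 1 if today.month == 12 else today.year
    month = today.month % 12 + 1
    days_in_month = calendar.monthrange(year, month)[1]

    schedule = {}
    index = start_index
    for day in range(1, days_in_month + 1):
        current = date(year, month, day)
        # Hand off on Mondays, except when the month itself starts on one.
        if current.weekday() == 0 and day != 1:
            index += 1
        schedule[current] = roster[index % len(roster)]
    return schedule


# Hypothetical roster; run on any day in January 2024 to plan February.
feb = next_month_oncall(date(2024, 1, 15), ["ana", "bo", "chen"])
print(feb[date(2024, 2, 1)])   # ana (Thursday, first partial week)
print(feb[date(2024, 2, 5)])   # bo  (first Monday handoff)
```

Passing the index of the last person on duty as start_index lets consecutive months continue the same rotation instead of restarting it.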
Level 3: Defined
• Outages are classified by size (i.e., minor, major, catastrophic).
• Limits (and minimums) for how often people should be oncall are defined.
• Postmortems are written for all major outages.
• There is an SLA defined for alert response: initial, hands-on-keyboard, issue
resolved, postmortem complete.
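The outage classes and the four alert-response milestones above could be recorded as data and checked mechanically. The sketch below is one possible shape for that, not a standard. The source names the milestones and the classes but gives no durations, so every time limit here is an invented placeholder, and the names SLA and sla_misses are hypothetical.

```python
from datetime import datetime, timedelta

# Placeholder targets per outage class. The durations are assumptions
# made up for this example; the source does not specify any.
SLA = {
    "minor": {
        "initial": timedelta(minutes=30),
        "hands_on_keyboard": timedelta(hours=1),
        "issue_resolved": timedelta(hours=24),
        "postmortem_complete": timedelta(days=10),
    },
    "major": {
        "initial": timedelta(minutes=5),
        "hands_on_keyboard": timedelta(minutes=15),
        "issue_resolved": timedelta(hours=4),
        "postmortem_complete": timedelta(days=5),
    },
    "catastrophic": {
        "initial": timedelta(minutes=5),
        "hands_on_keyboard": timedelta(minutes=10),
        "issue_resolved": timedelta(hours=2),
        "postmortem_complete": timedelta(days=3),
    },
}


def sla_misses(severity, alert_time, milestones):
    """Return the milestones that were late or have no timestamp yet."""
    misses = []
    for name, limit in SLA[severity].items():
        reached = milestones.get(name)
        if reached is None or reached - alert_time > limit:
            misses.append(name)
    return misses


alert = datetime(2024, 3, 1, 2, 0)
timeline = {
    "initial": datetime(2024, 3, 1, 2, 3),
    "hands_on_keyboard": datetime(2024, 3, 1, 2, 20),
    "issue_resolved": datetime(2024, 3, 1, 4, 0),
}
print(sla_misses("major", alert, timeline))
# ['hands_on_keyboard', 'postmortem_complete']
```

Treating a missing timestamp as a miss keeps unfinished postmortems visible in the same report as late responses.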