As part of your analysis, produce a report showing the most common causes of alerts.
Look for multiple alerts with the same bug ID or ticket numbers. Also look for the most
severe outages and give them special scrutiny, examining the postmortem reports and
recommendations to see if causes or fixes can be clustered or applied to more than one
outage.
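The report described above can be sketched in a few lines of Python. This is only an illustration: the record layout (the `bug_id` and `summary` fields) and the sample data are hypothetical, and a real alert log would come from your own monitoring or ticketing system.

```python
from collections import Counter

# Hypothetical alert log. The field names are assumptions for this sketch;
# "bug_id" holds the bug ID or ticket number attached to the alert, if any.
SAMPLE_ALERTS = [
    {"bug_id": "BUG-101", "summary": "queue backlog"},
    {"bug_id": "BUG-101", "summary": "queue backlog"},
    {"bug_id": "BUG-207", "summary": "disk full"},
    {"bug_id": "BUG-101", "summary": "queue backlog"},
    {"bug_id": None, "summary": "unexplained latency spike"},
    {"bug_id": "BUG-207", "summary": "disk full"},
]


def most_common_causes(alerts, top=5):
    """Return (bug_id, count) pairs, most frequent first.

    Alerts with no bug or ticket ID are skipped here; in practice they
    deserve a separate look, since they have no tracked cause yet.
    """
    counts = Counter(a["bug_id"] for a in alerts if a.get("bug_id"))
    return counts.most_common(top)


if __name__ == "__main__":
    for bug_id, count in most_common_causes(SAMPLE_ALERTS):
        print(f"{bug_id}: {count} alerts")
```

Bugs that appear near the top of this list are natural candidates for the clustering of causes and fixes mentioned above.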
14.5 Being Paged Too Much
A small number of alerts is reasonable, but if the number grows too much, intervention may
be required. What constitutes too many alerts is different for different teams. There should
be an agreed-upon threshold that is tolerated.
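One way to make such a threshold concrete is to count pages per week and flag the weeks that exceed it. The sketch below assumes the threshold is expressed as pages per ISO calendar week; the function names and the unit are illustrative choices, not something the text prescribes.

```python
from collections import Counter
from datetime import date


def pages_per_week(page_dates):
    """Count pages per ISO (year, week) pair."""
    counts = Counter()
    for d in page_dates:
        iso = d.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1])] += 1
    return counts


def weeks_over_threshold(page_dates, threshold):
    """Return the (year, week) pairs whose page count exceeds the threshold."""
    counts = pages_per_week(page_dates)
    return sorted(week for week, n in counts.items() if n > threshold)


if __name__ == "__main__":
    # Hypothetical page dates for illustration only.
    pages = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 3), date(2024, 1, 8)]
    print(weeks_over_threshold(pages, threshold=2))
```

A per-week count is only one possible unit. A team might equally agree on pages per shift or per oncall rotation.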
If the threshold is constantly being violated and things are getting worse, here are some
interventions one may consider:
• If a known bug still results in frequent pages after a certain amount of time (say, two
release cycles), future alerts for it should automatically be directed to the
developers' oncall rotation. If there is no developers' oncall rotation, push to start one.
This aligns motivations to have problems fixed.
• Any alerts received by pager that are not directly related to maintaining the SLA
should be changed from an alert that generates a page to an alert that generates a
ticket in your trouble-ticketing system.
• Meet with the developers about this specific problem. Ensure they understand the
seriousness of the issue. Create shared goals to fix the most frequent or recurring
issues. If it is part of your culture, set up a list of bugs and have bug bash parties or
a Fix-It Week.
• Negotiate to temporarily reduce the SLA. Adjust alerts accordingly. If alerting
thresholds are already not aligned with the SLA (i.e., you receive alerts for
low-priority issues), then work to get them into alignment. Get agreement as to the
conditions under which the temporary reduction will end. It might be a fixed amount of
time, such as a month, or a measurable condition, such as when three successive
releases have been pushed without failure or rollback.
• If all else fails, institute a code yellow: allow the team to defer all other work until
the situation has improved. Set up a metric to measure success and work toward
that goal.
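Two of the interventions above are mechanical enough to sketch in code: routing alerts (ticket versus page, and operations versus developers), and checking a measurable end condition for a temporary SLA reduction. The Python below is a minimal sketch under stated assumptions. The `Alert` fields, the route names, and the choice to treat "after two release cycles" as "two or more releases since the bug was filed" are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumption: "say, two release cycles" from the text, used as a grace period.
BUG_GRACE_RELEASES = 2


@dataclass
class Alert:
    name: str
    affects_sla: bool                   # directly related to maintaining the SLA?
    known_bug_id: Optional[str] = None  # set when a known bug causes this alert
    releases_since_bug_filed: int = 0


def route(alert: Alert, dev_rotation_exists: bool = True) -> str:
    """Decide where an alert goes, following the interventions in the list."""
    # Alerts not tied to the SLA become tickets instead of pages.
    if not alert.affects_sla:
        return "ticket"
    # A known bug that has outlived the grace period pages the developers.
    if (alert.known_bug_id
            and alert.releases_since_bug_filed >= BUG_GRACE_RELEASES
            and dev_rotation_exists):
        return "page-developers"
    return "page-operations"


def reduction_can_end(release_outcomes: List[bool], required: int = 3) -> bool:
    """True when the last `required` releases all went out cleanly.

    release_outcomes is ordered oldest first; True means the release was
    pushed without failure or rollback.
    """
    recent = release_outcomes[-required:]
    return len(recent) == required and all(recent)


if __name__ == "__main__":
    print(route(Alert("disk-usage-warning", affects_sla=False)))
    print(route(Alert("api-errors", True, "BUG-101", 3)))
    print(reduction_can_end([False, True, True, True]))
```

Keeping the routing rule in one small function makes the agreed policy easy to review with the developers, and the grace period can be renegotiated by changing a single constant.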
14.6 Summary
As part of our mission to maintain a service, we must have a way to handle exceptional
situations. To assure that they are handled properly, an oncall rotation is created.