Level 4: Managed
• The oncall pain is shared by the people most able to fix problems.
• How often people are oncall is verified against the policy.
• Postmortems are reviewed.
• There is a mechanism to triage recommendations in postmortems and ensure they
are completed.
• The SLA is actively measured.
Level 5: Optimizing
• Stress testing and failover testing are done frequently (quarterly or monthly).
• “Game Day” exercises (intensive, system-wide tests) are done periodically.
• The monitoring system alerts before outages occur (indications of “sick” systems
rather than “down” systems).
• Mechanisms exist so that any failover procedure not utilized in recent history is
activated artificially.
• Experiments are performed to improve SLA compliance.
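The difference between alerting on "sick" systems and alerting on "down" systems can be illustrated with a minimal sketch. The function name, the two input signals, and the threshold values below are all hypothetical, chosen only for illustration; a real monitoring system would derive its signals and thresholds from its own SLA.

```python
def classify_health(error_rate, p95_latency_ms,
                    warn_error_rate=0.01, warn_latency_ms=500.0):
    """Classify one service as 'down', 'sick', or 'ok'.

    Both thresholds are illustrative assumptions, not recommended values.
    """
    if error_rate >= 1.0:
        # Every request is failing: the outage has already happened.
        return "down"
    if error_rate >= warn_error_rate or p95_latency_ms >= warn_latency_ms:
        # Still serving, but degrading: alert now, before an outage.
        return "sick"
    return "ok"


print(classify_health(error_rate=0.03, p95_latency_ms=120.0))
```

An alert that fires on "sick" gives the oncall person time to act while the service is still answering requests.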
A.3 Monitoring and Metrics (MM)
Monitoring and Metrics covers collecting and using data to make decisions. Monitoring
collects data about a system. Metrics uses that data to measure a quantifiable component of
performance. This includes technical metrics such as bandwidth, speed, or latency; derived
metrics such as ratios, sums, averages, and percentiles; and business goals such as the
efficient use of resources or compliance with a service level agreement (SLA). These topics
are covered in Chapters 16, 17, and 19.
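As a rough illustration of derived metrics, the sketch below computes a sum, an average, and two percentiles from raw latency samples using Python's standard statistics module. The function name, the dictionary keys, and the sample values are made up for this example.

```python
import statistics


def derived_metrics(latencies_ms):
    """Compute a few derived metrics from raw latency samples (milliseconds)."""
    # quantiles(n=100) returns 99 cut points; index 49 is the 50th
    # percentile and index 94 is the 95th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "count": len(latencies_ms),
        "sum_ms": sum(latencies_ms),
        "mean_ms": statistics.mean(latencies_ms),
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
    }


# Hypothetical samples: nine fast requests and one slow outlier.
samples = [12, 15, 11, 14, 250, 13, 16, 12, 14, 13]
metrics = derived_metrics(samples)
print(metrics)
```

With these samples the mean is 37 ms while the median is 13.5 ms, which shows how a single slow request can distort an average and why percentiles are often reported alongside it.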
Sample Assessment Questions
• Is the service level objective (SLO) documented? How do you know your SLO
matches customer needs?
• Do you have a dashboard? Is it in technical or business terms?
• How accurate are the collected data and the predictions? How do you know?
• How efficient is the service? Are machines over- or under-utilized? How is
utilization measured?
• How is latency measured?
• How is availability measured?
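One possible answer to the availability question is to count successful health probes and compare the resulting ratio with the documented SLO. The sketch below assumes probe-based measurement; the function names, the probe counts, and the 99.9 percent target are hypothetical.

```python
def availability(successes, total):
    """Return the fraction of probes that succeeded."""
    if total == 0:
        raise ValueError("no probes recorded")
    return successes / total


def meets_slo(successes, total, slo_target):
    """Return True if measured availability is at or above the target."""
    return availability(successes, total) >= slo_target


# Hypothetical month: 10,000 probes, 5 failures, 99.9% target.
measured = availability(9995, 10000)
print(f"availability: {measured:.4%}, SLO met: {meets_slo(9995, 10000, 0.999)}")
```

Other definitions are equally valid, such as the fraction of user requests served successfully or the fraction of minutes without an outage. The assessment question asks which definition is in use and whether it is written down.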