Assessments - The Practice of Cloud System Administration

Level 4: Managed

• The oncall pain is shared by the people most able to fix problems.

• How often people are oncall is verified against the policy.

• Postmortems are reviewed.

• There is a mechanism to triage recommendations in postmortems and ensure they are completed.

• The SLA is actively measured.

Level 5: Optimizing

• Stress testing and failover testing are done frequently (quarterly or monthly).

• “Game Day” exercises (intensive, system-wide tests) are done periodically.

• The monitoring system alerts before outages occur (indications of “sick” systems rather than “down” systems).

• Mechanisms exist so that any failover procedure not utilized in recent history is activated artificially.

• Experiments are performed to improve SLA compliance.
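
The distinction between “sick” and “down” systems in the Level 5 list can be illustrated with a minimal sketch. The function name, the thresholds, and the choice of error rate and 99th-percentile latency as inputs are all assumptions made for illustration; they are not taken from the book, and a real monitoring system would tune them per service.

```python
def classify_health(error_rate, p99_latency_ms, *,
                    warn_error_rate=0.01,
                    down_error_rate=0.5,
                    warn_latency_ms=500):
    """Return "healthy", "sick", or "down" for one service.

    All thresholds are hypothetical defaults chosen for illustration.
    """
    # Most requests failing: treat the service as down.
    if error_rate >= down_error_rate:
        return "down"
    # Still serving, but degraded: alert now, before an outage occurs.
    if error_rate >= warn_error_rate or p99_latency_ms >= warn_latency_ms:
        return "sick"
    return "healthy"


# Example readings (made-up values).
status_normal = classify_health(error_rate=0.0, p99_latency_ms=120)
status_slow = classify_health(error_rate=0.0, p99_latency_ms=800)
status_failing = classify_health(error_rate=0.9, p99_latency_ms=120)
```

Alerting on the "sick" state is what lets operators act before the service reaches "down".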

A.3 Monitoring and Metrics (MM)

Monitoring and Metrics covers collecting and using data to make decisions. Monitoring collects data about a system. Metrics uses that data to measure a quantifiable component of performance. This includes technical metrics such as bandwidth, speed, or latency; derived metrics such as ratios, sums, averages, and percentiles; and business goals such as the efficient use of resources or compliance with a service level agreement (SLA). These topics are covered in Chapters 16, 17, and 19.
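
As a rough illustration of how derived metrics are built from raw monitoring data, the sketch below turns a list of latency samples into a sum, an average, two percentiles, and a ratio. The sample values, the 200 ms target, and the function names are assumptions for the example; the nearest-rank percentile method is one common choice among several.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_latency(samples_ms, target_ms):
    """Derive a sum, an average, percentiles, and a ratio from raw samples."""
    within_target = sum(1 for s in samples_ms if s <= target_ms)
    return {
        "total_ms": sum(samples_ms),
        "mean_ms": statistics.mean(samples_ms),
        "p50_ms": percentile(samples_ms, 50),
        "p99_ms": percentile(samples_ms, 99),
        "within_target_ratio": within_target / len(samples_ms),
    }


# Hypothetical request latencies in milliseconds.
latencies_ms = [12, 15, 11, 240, 14, 13, 16, 12, 18, 900]
summary = summarize_latency(latencies_ms, target_ms=200)
```

In this made-up data the mean (about 125 ms) is far above the median (14 ms) because of two slow requests, which is one reason percentiles are usually reported alongside averages.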

Sample Assessment Questions

• Is the service level objective (SLO) documented? How do you know your SLO matches customer needs?

• Do you have a dashboard? Is it in technical or business terms?

• How accurate are the collected data and the predictions? How do you know?

• How efficient is the service? Are machines over- or under-utilized? How is utilization measured?

• How is latency measured?

• How is availability measured?
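
One possible way to answer the availability and utilization questions above is sketched here. It assumes availability is defined as the fraction of successful health probes and utilization as mean CPU usage per machine; the `Probe` type, the 20% and 80% bounds, and the sample data are hypothetical, and other definitions (for example, time-based availability) are equally valid.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    timestamp: int  # seconds since the epoch (hypothetical probe time)
    ok: bool        # True if the service answered correctly


def availability(probes):
    """Fraction of probes that succeeded (one possible definition)."""
    if not probes:
        raise ValueError("no probes")
    return sum(1 for p in probes if p.ok) / len(probes)


def utilization_status(cpu_fractions, low=0.2, high=0.8):
    """Label a machine from its CPU samples; the bounds are assumptions."""
    if not cpu_fractions:
        raise ValueError("no samples")
    mean = sum(cpu_fractions) / len(cpu_fractions)
    if mean < low:
        return "under-utilized"
    if mean > high:
        return "over-utilized"
    return "ok"


# Twenty probes taken one minute apart, one of which failed (made-up data).
probes = [Probe(timestamp=60 * i, ok=(i != 7)) for i in range(20)]
measured_availability = availability(probes)

idle_machine = utilization_status([0.05, 0.10, 0.15])
busy_machine = utilization_status([0.90, 0.95])
```

Whichever definition is used, writing it down as code makes it explicit how the number on the dashboard is produced, which helps answer the "How do you know?" questions.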
