- Which RPC limits are in place with these dependencies? (Link to limits and
  confirmation from external groups they can handle the traffic.)
- What will happen if these RPC limits are exceeded?
- For each dependency, list the ticket number where this new service's use of
  the dependency (and QPS rate) was requested and positively acknowledged.
• Monitoring:
- Are all subsystems monitored? Describe the monitoring strategy and document
  what is monitored.
- Does a dashboard exist for all major subsystems?
- Do metrics dashboards exist? Are they in business, not technical, terms?
- Was the number of “false alarm” alerts in the last month less than x?
- Is the number of alerts received in a typical week less than x?
• Documentation:
- Does a playbook exist and include entries for all operational tasks and alerts?
- Have an LRE review each entry for accuracy and completeness.
- Is the number of open documentation-related bugs less than x?
• Oncall:
- Is the oncall schedule complete for the next n months?
- Is the oncall schedule arranged such that each shift is likely to get fewer than
  x alerts?
• Disaster Preparedness:
- What is the plan in case first-day usage is 10 times greater than expected?
- Do backups work and have restores been tested?
• Operational Hygiene:
- Are “spammy alerts” adjusted or corrected in a timely manner?
- Are bugs filed to raise visibility of issues, even minor annoyances or issues
  with commonly known workarounds?
- Do stability-related bugs take priority over new features?
- Is a system in place to ensure that the number of open bugs is kept low?
• Approvals:
- Has marketing approved all logos, verbiage, and URL formats?
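
Several of the criteria above compare a count against a team-chosen threshold x (false alarms in the last month, alerts in a typical week, open documentation bugs). As a rough illustration only, such checks could be evaluated mechanically along these lines. The `Criterion` class, the `failing` helper, and every number shown are hypothetical and not part of the checklist itself.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One numeric readiness check: the observed count must be below a limit."""
    name: str
    observed: int
    limit: int  # the checklist's "x"; the value is chosen by the team

    def passed(self) -> bool:
        # The checklist says "less than x", so the comparison is strict.
        return self.observed < self.limit


def failing(criteria):
    """Return the names of criteria whose observed count is not below the limit."""
    return [c.name for c in criteria if not c.passed()]


# Hypothetical numbers, for illustration only.
criteria = [
    Criterion("false alarms in the last month", observed=3, limit=5),
    Criterion("alerts in a typical week", observed=12, limit=10),
    Criterion("open documentation bugs", observed=0, limit=4),
]

print(failing(criteria))
```

A list of failing names, rather than a single pass/fail flag, keeps the review conversation focused on which specific criterion still needs work.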