- Which RPC limits are in place with these dependencies? (Link to limits and
confirmation from external groups that they can handle the traffic.)
- What will happen if these RPC limits are exceeded?
- For each dependency, list the ticket number where this new service's use of
the dependency (and QPS rate) was requested and positively acknowledged.
• Monitoring:
- Are all subsystems monitored? Describe the monitoring strategy and
document what is monitored.
- Does a dashboard exist for all major subsystems?
- Do metrics dashboards exist? Are they in business, not technical, terms?
- Was the number of “false alarm” alerts in the last month less than x?
- Is the number of alerts received in a typical week less than x?
• Documentation:
- Does a playbook exist and include entries for all operational tasks and alerts?
- Have an LRE review each entry for accuracy and completeness.
- Is the number of open documentation-related bugs less than x?
• Oncall:
- Is the oncall schedule complete for the next n months?
- Is the oncall schedule arranged such that each shift is likely to get fewer than
x alerts?
• Disaster Preparedness:
- What is the plan in case first-day usage is 10 times greater than expected?
- Do backups work and have restores been tested?
• Operational Hygiene:
- Are “spammy alerts” adjusted or corrected in a timely manner?
- Are bugs filed to raise visibility of issues—even minor annoyances or issues
with commonly known workarounds?
- Do stability-related bugs take priority over new features?
- Is a system in place to assure that the number of open bugs is kept low?
• Approvals:
- Has marketing approved all logos, verbiage, and URL formats?
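
Several of the criteria above are numeric ("less than x", "the next n months"), so they could be checked automatically before a launch review. The following is a minimal sketch in Python, not a prescribed tool: the class names, field names, and the function `failed_criteria` are hypothetical, and the thresholds x and n are values each organization would have to choose for itself.

```python
from dataclasses import dataclass


@dataclass
class LaunchMetrics:
    """Observed values for the numeric checklist items (hypothetical names)."""
    false_alarms_last_month: int
    alerts_typical_week: int
    open_doc_bugs: int
    oncall_months_scheduled: int


@dataclass
class LaunchThresholds:
    """The "x" and "n" values from the checklist; chosen by each organization."""
    max_false_alarms: int
    max_weekly_alerts: int
    max_open_doc_bugs: int
    min_oncall_months: int


def failed_criteria(m: LaunchMetrics, t: LaunchThresholds) -> list[str]:
    """Return the names of the numeric criteria that are not met."""
    failures = []
    # The checklist says "less than x", so each comparison is strict.
    if not m.false_alarms_last_month < t.max_false_alarms:
        failures.append("false alarms in the last month")
    if not m.alerts_typical_week < t.max_weekly_alerts:
        failures.append("alerts in a typical week")
    if not m.open_doc_bugs < t.max_open_doc_bugs:
        failures.append("open documentation-related bugs")
    # The oncall schedule must cover at least the next n months.
    if m.oncall_months_scheduled < t.min_oncall_months:
        failures.append("oncall schedule coverage")
    return failures


if __name__ == "__main__":
    # Illustrative numbers only; they are not taken from the text.
    metrics = LaunchMetrics(
        false_alarms_last_month=2,
        alerts_typical_week=10,
        open_doc_bugs=0,
        oncall_months_scheduled=3,
    )
    thresholds = LaunchThresholds(
        max_false_alarms=5,
        max_weekly_alerts=10,
        max_open_doc_bugs=3,
        min_oncall_months=3,
    )
    for name in failed_criteria(metrics, thresholds):
        print("Not met:", name)
```

A check like this covers only the quantitative items. Questions such as whether restores have been tested or whether marketing has approved the logos still need a person to answer them.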