Service-Level Agreements (Networking)

Service-level agreements (SLAs) are contracts that specify the performance parameters within which a network service is provided. Although the contracts usually cover the services telecommunications carriers provide to corporate customers, they can also cover the services an information technology (IT) department provides to other business units within a company.

The SLA might define parameters such as the type of service, the data rate, and the expected performance level in terms of delay, error rate, port availability, and network uptime. Response time for system repair and network restoration can also be incorporated into the SLA, as can financial penalties for noncompliance.

SLAs are available for just about any type of service, from traditional T-carrier services to frame relay. Internet service providers (ISPs) also offer SLAs for IP-based Virtual Private Networks (VPNs), intranets, and extranets. Some ISPs even guarantee levels of accessibility for their dial-up remote-access customers. IBM, for example, offers an SLA for its remote-access customers that guarantees a 95-percent success rate on dial-up connections to the IBM Global Network.


Service Provider SLAs

Among the growing number of service providers that offer SLAs is AT&T. The carrier offers SLAs at no extra charge in three different frame relay environments: domestic, international, and managed.

DOMESTIC The five SLAs for AT&T’s domestic frame relay service include the following measures of network performance:

PROVISIONING If an agreed-on due date is missed for a port or PVC, recurring charges on the port or PVC are free for one month.

RESTORATION TIME If a customer reports a frame relay service outage (even if the problem is with local access) and it is not restored in four hours, recurring charges for the affected ports and PVCs are free for one month.

LATENCY If the customer reports a one-way delay from service interface to service interface (SI to SI) across the frame relay network of more than 60 milliseconds and AT&T can’t fix the problem in 30 days, recurring charges for the affected PVC are free each month until repaired.

THROUGHPUT If 99.99 percent of the packets offered to the frame relay network within a PVC CIR (committed information rate) are not successfully transported through the network and AT&T can’t fix the problem in 30 days, recurring charges on the PVC are free each month until repaired.

NETWORK AVAILABILITY If the customer’s network is not available at least 99.99 percent of the time each month, the customer receives credits commensurate with network size.
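The five measures above can be read as simple threshold tests on a month of measurements. The following Python sketch is illustrative only: the class, field, and function names are hypothetical, and it only flags which measures missed their stated thresholds. It does not model the remedies, such as the 30-day repair window or the size of any credit.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MonthlyMeasurements:
    """Hypothetical container for one month of measurements on a port or PVC."""
    provisioning_due_date_met: bool
    longest_restoration_hours: float
    one_way_delay_ms: float
    delivery_within_cir_percent: float
    availability_percent: float


def breached_measures(m: MonthlyMeasurements) -> List[str]:
    """Return the names of the measures whose stated thresholds were missed."""
    breaches = []
    if not m.provisioning_due_date_met:
        breaches.append("provisioning")
    if m.longest_restoration_hours > 4:          # not restored in four hours
        breaches.append("restoration_time")
    if m.one_way_delay_ms > 60:                  # more than 60 ms one way
        breaches.append("latency")
    if m.delivery_within_cir_percent < 99.99:    # throughput within CIR
        breaches.append("throughput")
    if m.availability_percent < 99.99:           # monthly network availability
        breaches.append("network_availability")
    return breaches


# Made-up sample month: a slow restoration and low availability.
sample = MonthlyMeasurements(True, 5.0, 45.0, 99.995, 99.98)
print(breached_measures(sample))
```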

AT&T also offers a suite of network management tools to support the SLAs. As part of its Customer Network Management (CNM) options, customers use two free Web-based tools: Order Manager and Ticket Manager. Using Order Manager, customers can enter service orders via the Internet and track their progress. With Ticket Manager, customers can enter trouble tickets related to performance metrics. CNM provides weekly and monthly reports on port and PVC usage, discards, network congestion, and other performance factors. CNM can also be used to create exception reports with definable thresholds.

INTERNATIONAL AT&T also offers SLAs for international customers, guaranteeing on-time provisioning and premise-to-premise availability of individual PVCs. The metric for on-time delivery is 100 percent, and there is no limit to the number of orders that can qualify for this credit. If the mutually agreed-to due date is not met, a one-time credit of $500 will be given per missed qualified order.

International network availability is measured from customer premise to customer premise. For example, if a 64-Kbps PVC experiences less than 99.8 percent availability, credits ranging from $25 to $525 will be applied. Credits are allocated based on PVC size and the amount of downtime. Latency and throughput measures for international services are also available.

MANAGED Customers who use AT&T Managed Network Solutions (MNS) to manage their networks are covered by end-to-end SLAs, ensuring that AT&T will implement services on time, deliver predictable and reliable network performance, and provide consistent ongoing support. These comprehensive SLAs, available at no extra cost, apply both to AT&T’s network transport services and to all related equipment up through the router on the customer’s premises.

Individualized performance measures are defined for each customer based on specific network designs, including router, hardware, dial-backup, configuration and transport service designs. If service levels fall below the agreed-upon metrics, AT&T will credit customers for monthly charges and maintenance fees based on the terms outlined in each customer’s contract.

Of course, AT&T is not the only service provider that offers SLAs. Others include Concentric Network, GTE Internetworking, IBM Global Network, Infonet Services Corp., MCI WorldCom, NaviSite Internet Services, Sprint, and UUNET. Eventually, most service providers will feel compelled to offer SLAs, or risk losing business to competitors.

Intracompany SLAs

An intracompany SLA describes the level of service required to support the various corporate applications. The metrics could be online transaction processing (OLTP) response times, batch turnaround times for end-of-day reports, actual hours of system availability, and bandwidth availability. In essence, the SLA documents what a particular group of workers and managers needs from IT to best fulfill its responsibilities to the company.

IT managers usually have access to online software tools to monitor system performance and resource consumption, giving them a general idea of how all systems on the network are behaving at any given moment. This information enables networks and systems to be managed effectively and provides the starting point for developing the SLA.

With information about current performance levels available and the expectations of business units and end users quantified, the IT department can write an effective SLA. At a minimum, this document should contain the following components:

BACKGROUND This section should contain enough information to acquaint a nontechnical reader with the application, and to enable that person to understand current service levels and why they are important to the continued success of the business.

PARTIES This section should identify the parties to the agreement, including the responsible party within IT and the responsible party within the business unit and/or application user group.

SERVICES This section should quantify the volume of the service to be provided by the IT department. The application user group should be able to specify the average and peak rates, and the time of day they occur. The user may be provided with incentives to receive better service, or a reduced cost for service, if peak resource usage periods can be avoided.

TIMELINESS This section should provide a quantitative measure for most applications to let end users know how fast they can expect to get their work accomplished. For OLTP applications, for example, the measure might be stated as “95 percent of transactions processed within two seconds.” For more batch-oriented applications, the measure might be stated as “Reports to be delivered no later than 10:00 a.m. if input is available by 10:00 p.m. the previous evening.”

AVAILABILITY This section should describe when the service will be available to the end users. The end users must be able to specify when they expect the system to be available in order to achieve their specified levels of work. IT must be able to account for both planned and unplanned system unavailability, and work these factors into an acceptable level of performance for end users.

LIMITATIONS This section should describe the limits of IT support during conditions of peak period demand, resource contention by other applications, and general overall application workload intensities. These limitations should be explicitly stated and agreed to by all parties to prevent finger pointing when problems arise.

COMPENSATION Ideally, a chargeback system should be implemented in which end users are charged for the resources they consume to provide the service they expect. This gives business units the incentive to apply management methods that optimize costs and performance. If this is impractical, the costs should still be identified and reported back to the business units to account for IT resources. The frequency and format of this information should also be described.

MEASUREMENT This section should describe the process by which actual service levels will be monitored and compared with the agreed-upon service levels, as well as the frequency of monitoring. It should also include a brief description of the data collection and extrapolation processes, along with the procedure users follow to report problems to IT.

RENEGOTIATION This section should describe how and under what circumstances the SLA can be changed to reflect changes in the environment.
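A timeliness measure such as the OLTP example given under TIMELINESS can be checked mechanically once response times are collected. The following Python sketch is one possible way to do it. The function names and the sample data are hypothetical, and "within two seconds" is assumed to include exactly two seconds.

```python
from typing import Sequence


def fraction_within(response_times_s: Sequence[float], limit_s: float) -> float:
    """Fraction of transactions that completed within the time limit."""
    if not response_times_s:
        raise ValueError("no transactions were measured")
    within = sum(1 for t in response_times_s if t <= limit_s)
    return within / len(response_times_s)


def meets_timeliness_target(response_times_s: Sequence[float],
                            limit_s: float = 2.0,
                            target: float = 0.95) -> bool:
    """True if at least `target` of the transactions finished within `limit_s`."""
    return fraction_within(response_times_s, limit_s) >= target


# Made-up sample: 19 of 20 transactions finish within two seconds.
samples = [0.8] * 19 + [3.5]
print(meets_timeliness_target(samples))
```

The same data can feed the MEASUREMENT section of the SLA, since it shows how far actual service is from the agreed level rather than only whether the target was met.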

When the SLA is ready for implementation, the IT department must implement procedures to determine if service levels are being met. Additionally, IT needs to be able to forecast when the service levels can no longer be met due to growth or other external factors.

Standards

The effort toward standardizing SLAs has gone the farthest with respect to frame relay service. The Frame Relay Forum offers a set of common network service parameters that are described in its Service-Level Definition (SLD) implementation agreement. The SLD defines three metrics that should constitute the main elements of an SLA: delay, frame delivery rate, and connection availability. These metrics are the benchmarks by which network performance can be measured, whether the frame relay network is private or carrier provided.

DELAY Delay metrics describe the time required to transport data from one end of the network to the other. Measuring delay involves three interdependent elements: access line speed, frame size, and wide-area network (WAN) delay. To be useful, measuring and reporting delay should be in the context of these elements.

Access line speed refers to the delay introduced by the line from the user site to the frame relay network, at both the local and remote ends. Access line delay can contribute significantly to the overall delay of the network. For example, a 4000-byte frame takes approximately 500 ms to cross a 64-Kbps line. If both the local and remote access lines run at 64 Kbps, the access lines alone add about a second of delay.

Delay caused by the access line can be managed by increasing line speed or segmenting the data into smaller frames, which is usually handled by the router or frame relay access device (FRAD). Changing the frame size to 128 bytes reduces the access line delay to approximately 16 ms. Alternatively, increasing the line speed to T1 (1.536 Mbps) reduces the 4000-byte frame delay to a more manageable 20 ms or so.
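The access line figures above follow from dividing the frame size in bits by the line rate. The short Python sketch below reproduces that arithmetic. The function name is hypothetical, the T1 rate is taken as 1,536,000 bps, and protocol overhead and queuing are ignored.

```python
def serialization_delay_ms(frame_bytes: int, line_rate_bps: int) -> float:
    """Milliseconds needed to clock one frame onto a line of the given rate."""
    return frame_bytes * 8 * 1000 / line_rate_bps


print(serialization_delay_ms(4000, 64_000))               # 500.0
print(serialization_delay_ms(128, 64_000))                # 16.0
print(round(serialization_delay_ms(4000, 1_536_000), 1))  # 20.8
```

Because the frame crosses an access line at each end, the per-line figure applies twice when estimating end-to-end access delay.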

The third element of delay, network delay, is difficult to manage in a public network scenario. However, measuring delay across the WAN, separately from the delay imposed by the access line, can help the user pinpoint performance problems. Ruling out WAN delay as the cause lets the user focus on other sources of inadequate performance, such as configuration or application difficulties, which account for as much as 70 percent of performance problems.

FRAME DELIVERY RATE Frame relay networks typically categorize frames in two ways: below the Committed Information Rate (CIR) and above CIR. To provide a valid frame delivery rate (FDR), it must be determined whether the measurement is for frames within CIR, in excess of CIR, or for the total number of frames presented to the network for delivery.
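To illustrate why the category matters, the Python sketch below computes a delivery rate separately for frames within CIR, frames in excess of CIR, and all frames offered. It assumes the rate is simply frames delivered divided by frames offered, expressed as a percentage. The class, field, and function names and the counts are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class FrameCounts:
    """Hypothetical per-PVC frame counters for one measurement interval."""
    offered_within_cir: int
    delivered_within_cir: int
    offered_excess: int
    delivered_excess: int


def delivery_rate(delivered: int, offered: int) -> float:
    """Percentage of offered frames delivered (100.0 if none were offered)."""
    if offered == 0:
        return 100.0
    return 100.0 * delivered / offered


def frame_delivery_rates(c: FrameCounts) -> Dict[str, float]:
    """Delivery rate for each of the three categories."""
    return {
        "within_cir": delivery_rate(c.delivered_within_cir, c.offered_within_cir),
        "excess": delivery_rate(c.delivered_excess, c.offered_excess),
        "total": delivery_rate(
            c.delivered_within_cir + c.delivered_excess,
            c.offered_within_cir + c.offered_excess,
        ),
    }


# Made-up counts: the same PVC yields three quite different figures.
counts = FrameCounts(100_000, 99_990, 20_000, 19_000)
for name, rate in frame_delivery_rates(counts).items():
    print(f"{name}: {rate:.2f} percent")
```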

CONNECTION AVAILABILITY Connection availability measures the percentage of time the network connection is accessible to support the communications needs of the network. There are several elements to connection availability: overall availability, mean time to restore (MTTR) in the event a connection is lost, and mean time between service outages (MTBSO).

Overall availability refers to the total time the network connection is available, compared to the total measured time. If a network did not experience any service outages in a 30-day period, then its availability would be expressed as 100 percent. If the network is down for six hours, the availability would be 99.17 percent. Availability can be measured network-wide (all sites together) or individual measurements can be taken for each site and brought together to reflect a total network calculation.
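The availability figures above can be reproduced with a short calculation. In the Python sketch below, the function names are hypothetical, and the network-wide figure assumes that per-site outage hours are pooled over the combined measured time of all sites, which is only one of several ways to roll up site measurements.

```python
from typing import Sequence


def availability_percent(outage_hours: float, period_hours: float) -> float:
    """Available time as a percentage of the total measured time."""
    return 100.0 * (period_hours - outage_hours) / period_hours


def network_availability_percent(site_outage_hours: Sequence[float],
                                 period_hours: float) -> float:
    """Pooled availability across sites (assumed roll-up method)."""
    total_hours = period_hours * len(site_outage_hours)
    return availability_percent(sum(site_outage_hours), total_hours)


THIRTY_DAYS = 30 * 24  # measurement period in hours

print(availability_percent(0, THIRTY_DAYS))            # 100.0
print(round(availability_percent(6, THIRTY_DAYS), 2))  # 99.17
print(round(network_availability_percent([6, 0, 0], THIRTY_DAYS), 2))
```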

Connection MTTR has a direct impact on availability because the longer it takes to restore a connection, the longer the service is unavailable. Most SLAs have specific measurements for MTTR. One method of reducing the impact of a service outage, usually caused by a failure in the local loop, is the use of ISDN as a backup service.

The degree of impact caused by service outages is more apparent if the time between outages is measured. Connection MTBSO measures the availability time between outages. Having four 6-hour outages in one day obviously has a greater impact than one 6-hour outage a week for a month. MTBSO gives the network manager the information needed to evaluate the other availability metrics.
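MTTR and MTBSO can both be derived from a log of outage start and end times. The Python sketch below is a minimal illustration under stated assumptions: MTTR is taken as the mean outage duration, and MTBSO as the mean uptime between the end of one outage and the start of the next, which is one possible reading of the definition above. The function names and the outage logs are made up.

```python
from statistics import mean
from typing import List, Tuple

Outage = Tuple[float, float]  # (start_hour, end_hour) within the period


def mttr_hours(outages: List[Outage]) -> float:
    """Mean time to restore: average outage duration."""
    return mean(end - start for start, end in outages)


def mtbso_hours(outages: List[Outage]) -> float:
    """Mean uptime between consecutive outages (assumed definition)."""
    if len(outages) < 2:
        raise ValueError("need at least two outages to measure a gap")
    ordered = sorted(outages)
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(ordered, ordered[1:])]
    return mean(gaps)


# Two made-up logs with the same total outage time and the same MTTR.
weekly = [(100.0, 106.0), (268.0, 274.0), (436.0, 442.0), (604.0, 610.0)]
clustered = [(100.0, 106.0), (108.0, 114.0), (116.0, 122.0), (124.0, 130.0)]

print(mttr_hours(weekly), mtbso_hours(weekly))
print(mttr_hours(clustered), mtbso_hours(clustered))
```

Both logs show 24 hours of outage and a 6-hour MTTR, but the MTBSO differs sharply, which is the extra information the metric is meant to convey.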

Last Word

To meet the challenges of the new competitive era, telecom service providers and ISPs are exploring new approaches to serving customers. A major step has been taken with service-level agreements that offer performance guarantees and credits to customers if various metrics are not sustained. In some cases, the customer need not even report the problem and provide documentation to support the claim. The carrier or ISP will report the problem to the customer and automatically apply appropriate credits to the invoice, as stipulated in the SLA.

At the same time, SLAs are becoming more important for ensuring the peak performance of enterprise networks. The purpose of the intracompany SLA is to specify, in mutually agreeable metrics, what the various end user groups can expect from IT in terms of resource availability and system response. SLAs also specify what IT can expect from end users in terms of system usage and cooperation in maintaining and refining the service levels over time. SLAs also provide a useful metric against which IT department performance can be measured. How well the IT department fulfilled its obligations, as spelled out in the SLA, can determine future staffing levels, budgets, raises, and bonuses.
