How NTP Avoids Errors (Computer Network Time Synchronization)

Years of accumulated experience running NTP in the Internet suggest that the most common cause of timekeeping errors is a malfunction somewhere on the NTP subnet path from the client to the primary server or its synchronization source. This could be due to broken hardware, software bugs, or configuration errors. Or, it could be an evil mischief maker attempting to expire Kerberos tickets. The approach taken by NTP is a classic case of paranoia and is treatable only by a dose of Byzantine agreement principles. These principles generally require multiple redundant sources together with diverse network paths to the primary servers. In most cases, this requires an engineering analysis of the available servers and the Internet paths specific to each server.

However, the raw time values can show relatively large variations, so it is necessary to accumulate a number of them and determine the most trusted value on a statistical basis. Until a minimum number of samples has accumulated, a server cannot be trusted, and until a minimum number of servers has been trusted, the composite time cannot be trusted. Thus, when the daemon first starts up, there will be a delay until these conditions have been satisfied.

To protect the network and busy servers from implosive congestion, NTP normally starts out with a poll interval of 64 s. The present rules call for at least four samples from each server, and for this to occur for a majority of the configured servers, before setting the clock. Thus, there can be a delay on the order of 4 min before the clock can be considered truly valid. Various distributed network applications tolerate this delay with at least some degree of pain, but if justified by the network load, it is possible to use the burst feature described in this topic. With this feature, the delay is usually no more than 10 s.
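As a rough illustration of these rules, the following minimal sketch in C gates the first clock set on at least four samples from each server and on a majority of the configured servers having reached that mark. It is not taken from the ntpd sources, and all names are hypothetical. With a 64-s poll interval, four samples take about 4 * 64 = 256 s, which is the roughly 4-min delay mentioned above.

#include <stdbool.h>
#include <stdio.h>

#define MIN_SAMPLES 4            /* samples required before a server is trusted */

struct server {
    const char *name;
    int samples;                 /* valid samples received so far */
};

static bool clock_can_be_set(const struct server *srv, int nsrv)
{
    int trusted = 0;

    for (int i = 0; i < nsrv; i++)
        if (srv[i].samples >= MIN_SAMPLES)
            trusted++;

    /* a majority of the configured servers must be trusted */
    return trusted * 2 > nsrv;
}

int main(void)
{
    struct server cfg[] = {
        { "a.example.net", 4 },
        { "b.example.net", 4 },
        { "c.example.net", 1 },
    };

    /* two of three servers trusted: the composite time can be believed */
    printf("clock can be set: %s\n", clock_can_be_set(cfg, 3) ? "yes" : "no");
    return 0;
}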


It can happen that the local time before NTP starts up is relatively far (say, a month) from the composite server time. To conform to the general spirit of extreme reliability and robustness, NTP has a panic threshold of 1,000 s earlier or later than the local time, within which the server time will be believed. If the composite time offset is greater than the panic threshold, the daemon shuts down and sends a message to the log advising the operator to set the clock manually. As with other thresholds, the value can be changed by configuration commands. In addition, the daemon can be told to ignore the panic threshold when setting the clock for the first time, but to observe it on subsequent occasions. This is useful for routers that do not have battery backup clocks.
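A sketch of the panic-threshold decision, again with hypothetical names rather than the actual ntpd logic, might look as follows. The allow_first argument corresponds to telling the daemon to ignore the threshold for the first clock set only.

#include <math.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

#define PANIC_THRESHOLD 1000.0   /* seconds; configurable in practice */

static void check_panic(double offset, bool first_set, bool allow_first)
{
    if (fabs(offset) > PANIC_THRESHOLD && !(first_set && allow_first)) {
        fprintf(stderr, "offset %.0f s exceeds panic threshold; "
                "set the clock manually and restart\n", offset);
        exit(1);
    }
}

int main(void)
{
    check_panic(2500000.0, true, true);    /* about a month off, but waived for the first set */
    check_panic(42.0, false, false);       /* well within the threshold */
    check_panic(2500000.0, false, false);  /* logs a message and shuts down */
    printf("not reached\n");
    return 0;
}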

Another feature, or bug depending on how you look at it, is the behavior when the server time offset is less than the panic threshold but greater than a step threshold of 128 ms. If the composite time offset is less than the step threshold, the clock is disciplined in the manner described, that is, by gradual time and frequency adjustments. However, if the offset is greater than the step threshold, the clock is stepped instead. This might be considered extremely ill mannered, especially if the step is backward in time. To minimize the occasions when this might happen, due, for example, to an extreme network delay transient, the offset is ignored unless it persists beyond the step threshold for a stepout threshold of 900 s. If a succeeding offset less than the step threshold is found before the stepout threshold is reached, the daemon returns to normal operation and amortizes the offset.
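The step-versus-slew decision and the stepout timer described above can be summarized in a small sketch like the following; the names and state are hypothetical and simplified, and the real clock discipline algorithm is considerably more involved. Offsets below the step threshold are slewed; larger offsets are stepped only after they have persisted past the stepout threshold, so a transient spike in network delay does not cause a step.

#include <math.h>
#include <stdio.h>

#define STEP_THRESHOLD    0.128   /* seconds */
#define STEPOUT_THRESHOLD 900.0   /* seconds */

enum action { SLEW, STEP, WAIT };

/* last_in_range records the last time the offset was within the step threshold */
static enum action discipline(double offset, double *last_in_range, double now)
{
    if (fabs(offset) <= STEP_THRESHOLD) {
        *last_in_range = now;      /* back in range: resume normal operation */
        return SLEW;
    }
    if (now - *last_in_range < STEPOUT_THRESHOLD)
        return WAIT;               /* large offset, but not yet persistent */
    return STEP;                   /* persisted past the stepout threshold */
}

int main(void)
{
    static const char *names[] = { "slew", "step", "wait" };
    double last_in_range = 0.0;

    printf("%s\n", names[discipline(0.300, &last_in_range, 100.0)]);  /* wait */
    printf("%s\n", names[discipline(0.300, &last_in_range, 950.0)]);  /* step */
    return 0;
}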

There are important reasons for this behavior. The most obvious is that it can take a long time to amortize the clock to the correct time if the offset is large. Correctness assertions require a limit on the rate at which the clock can be slewed, in the most common case no more than 500 PPM. At this rate, it takes 2,000 s to slew 1 s and over 1 day to slew 1 min. During most of this interval, the system clock error relative to presumably correct network time will be greater than most distributed applications can tolerate. Stepping the clock rather than slewing it when the error is greater than 128 ms is considered the lesser of two evils. With this in mind, the operator can configure the step threshold to larger values as necessary, or even avoid the step entirely and accept the consequences.
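The slew-time arithmetic is easy to verify numerically: at the 500-PPM limit, the time needed to amortize an offset is simply the offset divided by the slew rate.

#include <stdio.h>

int main(void)
{
    const double max_slew = 500e-6;                         /* 500 PPM */

    printf("1 s  -> %.0f s to slew\n", 1.0 / max_slew);     /* 2000 s */
    printf("60 s -> %.0f s to slew (%.1f days)\n",
           60.0 / max_slew, 60.0 / max_slew / 86400.0);     /* 120000 s, about 1.4 days */
    return 0;
}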

When the daemon starts for the first time, it must calibrate the intrinsic frequency correction of the hardware clock. In the general case, it may take a surprisingly long time to determine an accurate correction, in some cases several hours to a day. To shorten this process when the daemon is restarted, the current correction is written to a local file about once per hour. When this file is detected at restart, the frequency correction is immediately set to that value. If the daemon is started without this file, it executes a special calibration procedure designed to calculate the frequency correction directly over a period of 15 min. The procedure begins the first time the clock is set in normal operation and does not adjust the clock during the procedure. After the procedure, the frequency correction is initialized, and the daemon resumes normal operation.
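The frequency-file mechanism amounts to little more than saving and restoring a single number. The following sketch shows the idea; the file name and format here are assumptions for illustration, not the actual ntpd drift-file conventions.

#include <stdbool.h>
#include <stdio.h>

#define DRIFT_FILE "ntp.drift"             /* hypothetical path; the real location varies */

/* write the current frequency correction, roughly once per hour */
static void save_frequency(double ppm)
{
    FILE *fp = fopen(DRIFT_FILE, "w");
    if (fp != NULL) {
        fprintf(fp, "%.3f\n", ppm);        /* correction in parts per million */
        fclose(fp);
    }
}

/* restore the correction at restart; if absent, the 15-min calibration runs instead */
static bool load_frequency(double *ppm)
{
    FILE *fp = fopen(DRIFT_FILE, "r");
    if (fp == NULL)
        return false;
    bool ok = (fscanf(fp, "%lf", ppm) == 1);
    fclose(fp);
    return ok;
}

int main(void)
{
    double ppm = 0.0;

    save_frequency(-23.456);
    if (load_frequency(&ppm))
        printf("restored frequency correction: %.3f PPM\n", ppm);
    else
        printf("no drift file; run the calibration procedure\n");
    return 0;
}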
