Figure 1. Life cycle of a volunteer peer
A HEURISTICS-BASED FAILURE PROBABILITY ESTIMATION
each class's availability (TTF) and unavailability
(Mean Time to Reboot (MTR)) data. While other
works found one or two best-fitting distributions,
this work found different best-fitting distributions
for different classes.
The resource availability prediction methods
reviewed in Section 2 achieve different accuracies
in their selected environments. Since this paper
aims to find optimized task assignments using
estimated task failure probabilities, the
distribution of empirical availability data
provides enough information. Here, a simple and
straightforward heuristics-based failure
probability estimation method is employed.
Availability Prediction
Brevik et al. (2004) assumed a homogeneous
environment and proposed an availability prediction
method based on a fitted Weibull distribution.
This method answers the question: what is the
largest availability duration for a given confidence
value and a desired percentile? Iosup et al. (2007)
proposed a resource availability model that
considers the failure distribution among clusters, the
TTF distribution, the failure duration distribution, and
the distribution of the failure size, which is the
number of failed processors. This model is used to
predict failures in a multi-cluster grid system.
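To make the Weibull-based question concrete, the following is a minimal sketch of how an availability duration can be read off a Weibull TTF model. It is not the method of Brevik et al.: it omits the confidence bound entirely and only inverts the Weibull survival function. The function names and the shape and scale values are assumptions chosen for illustration.

```python
import math

def weibull_cdf(t, shape, scale):
    """P(TTF <= t) for a Weibull distribution with the given shape and scale."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))

def availability_duration(survival_prob, shape, scale):
    """Largest duration t such that P(TTF > t) >= survival_prob.

    Obtained by solving exp(-(t / scale) ** shape) = survival_prob for t.
    """
    if not 0.0 < survival_prob < 1.0:
        raise ValueError("survival_prob must be strictly between 0 and 1")
    return scale * (-math.log(survival_prob)) ** (1.0 / shape)

# Hypothetical parameters: shape < 1 gives a heavy-tailed TTF distribution.
t95 = availability_duration(0.95, shape=0.5, scale=1000.0)
print(t95, 1.0 - weibull_cdf(t95, shape=0.5, scale=1000.0))
```

The second printed value should be approximately 0.95, confirming that the returned duration sits at the requested survival probability.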
Some other works (Ren, 2006; Rood, 2007)
used availability patterns on weekdays and
weekends to predict availability. Nadeem et
al. (2008) used Bayes' rule and the nearest-neighbor
rule to predict resource availability. Mickens et
al. (2006) proposed saturating counter predictors,
state-based history predictors, a linear predictor,
and a hybrid predictor that dynamically selects
the best predictor. These predictors were
evaluated with trace data sets from distributed
servers, peer-to-peer networks, and corporate PCs.
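As an illustration of the general idea behind a saturating counter predictor, the sketch below keeps a small bounded counter that moves up when a machine is observed available and down when it is not, and predicts availability when the counter is in its upper half. The class name, the 2-bit default, and the initial state are assumptions; the exact design used by Mickens et al. may differ.

```python
class SaturatingCounterPredictor:
    """Bounded up/down counter used to predict the next availability state."""

    def __init__(self, bits=2):
        self.max_value = (1 << bits) - 1          # 3 for a 2-bit counter
        self.threshold = (self.max_value + 1) // 2
        self.counter = self.threshold             # start weakly "available"

    def predict(self):
        """Return True if the machine is predicted to be available next."""
        return self.counter >= self.threshold

    def update(self, was_available):
        """Move the counter toward the observed state, saturating at the ends."""
        if was_available:
            self.counter = min(self.max_value, self.counter + 1)
        else:
            self.counter = max(0, self.counter - 1)

# Hypothetical observation history (True = available in that interval).
history = [True, True, False, True, False, False, False]
predictor = SaturatingCounterPredictor()
for observed in history:
    predictor.update(observed)
print(predictor.counter, predictor.predict())
```

Because the counter saturates, a single unavailable interval after a long available streak does not flip the prediction, which is the property that makes this kind of predictor tolerant of short glitches.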
Life Cycle of a Volunteer Peer
The life cycle of a volunteer peer can be modeled
as shown in Figure 1. TTF is the time between a
peer's start/restart and its next failure/shutdown.
DT is the time between a failure and the next peer
restart. Given a statistical distribution of TTF,
the value of its cumulative distribution function
(CDF) at uptime x is the probability that a peer's
TTF is smaller than or equal to x, which equals
the failure probability at uptime x. The failure
probability increases monotonically with time.
Since no single distribution can accurately
characterize resource availability for all systems
in large-scale computing environments
(Nurmi, 2005; Nadeem, 2008),
a heuristics-based mechanism is proposed to
estimate the failure probability at runtime from
gathered TTF data.
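The relationship between the TTF distribution and the failure probability can be sketched with an empirical CDF computed directly from gathered TTF samples. This is only an illustration of the CDF definition above, not the heuristics-based mechanism itself, whose details are not given in this section. The function name and the sample values are hypothetical.

```python
from bisect import bisect_right

def failure_probability(ttf_samples, uptime):
    """Empirical CDF: fraction of observed TTF values <= uptime."""
    if not ttf_samples:
        raise ValueError("need at least one TTF observation")
    ordered = sorted(ttf_samples)
    # bisect_right counts how many samples are <= uptime.
    return bisect_right(ordered, uptime) / len(ordered)

# Hypothetical TTF observations for one peer class, in hours.
ttf_hours = [2.0, 5.0, 5.0, 8.0, 12.0]
print(failure_probability(ttf_hours, 5.0))   # 3 of 5 samples are <= 5.0
```

The estimate is non-decreasing in uptime, matching the statement that the failure probability grows the longer a peer has been up.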