from the local monitor. The local monitor reports the status of each
monitored process to the data collector between the time of process
registration and unregistration at a fixed, specified interval.
Grid users may fail to obtain the results of submitted tasks for three
reasons: task failure, a grid node crash, or a GRB crash. Fault recovery from
task failure is shown in Figure 4.8. The local monitor maintains a slot table
for all monitored processes. When the local monitor detects the failure of a
registered task, it will report the error status to a local resource management
component and disconnect the error process. The component also maintains
a table responsible for recording the scheduled task information. Task
information in the table consists of the program name, parameter files,
identification of the corresponding process, and other execution conditions. Once a
monitored process fails, the resource management component restarts the
same task according to registration information in the table, and registers the
process with a fault-tolerant requirement to the local monitor again.
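
The restart path described above can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the class names, field names, and the `launch` callback (standing in for whatever actually starts a process and returns its identifier) are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class TaskRecord:
    """One row of the scheduled-task table (field names are illustrative)."""
    program: str
    parameter_files: List[str]
    pid: int = -1                 # identification of the corresponding process
    fault_tolerant: bool = True   # the fault-tolerant requirement
    restarts: int = 0


class LocalMonitor:
    """Keeps a slot table of monitored processes, keyed by process id."""

    def __init__(self, manager: "ResourceManager") -> None:
        self.manager = manager
        self.slots: Dict[int, str] = {}   # pid -> last known status

    def register(self, pid: int) -> None:
        self.slots[pid] = "running"

    def report_failure(self, pid: int) -> TaskRecord:
        # Disconnect the failed process, then report the error status
        # to the local resource management component.
        del self.slots[pid]
        return self.manager.on_error(pid)


class ResourceManager:
    """Restarts failed tasks from the information recorded in its table."""

    def __init__(self, launch: Callable[[TaskRecord], int]) -> None:
        self.launch = launch              # hypothetical launcher, returns a pid
        self.monitor = LocalMonitor(self)
        self.table: Dict[int, TaskRecord] = {}

    def schedule(self, program: str, parameter_files: List[str]) -> TaskRecord:
        record = TaskRecord(program, list(parameter_files))
        self._start(record)
        return record

    def _start(self, record: TaskRecord) -> None:
        record.pid = self.launch(record)
        self.table[record.pid] = record
        self.monitor.register(record.pid)  # register with the local monitor

    def on_error(self, failed_pid: int) -> TaskRecord:
        # Restart the same task from its recorded information and
        # register the new process with the local monitor again.
        record = self.table.pop(failed_pid)
        record.restarts += 1
        self._start(record)
        return record
```

In this sketch the task record survives the failure unchanged except for its process identifier and restart count, which mirrors the idea that the table, not the process, is the source of truth for what must be rerun.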
Fault recovery from a grid node crash is shown in Figure 4.9. If the data
collector does not receive information from a local monitor within the valid
period, it considers the grid node hosting that local monitor to have failed
and reports this to the GRB. The GRB starts a thread to search its database
for the tasks located on that grid node and records their error status.
These failed tasks then re-enter the GRB's scheduling queue. The GRB
allocates the required resources to these tasks and reschedules them to
the selected resources.
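
The node-crash path can be sketched in the same spirit. The code below is only an illustration under stated assumptions: `DataCollector`, `Broker`, and their methods are hypothetical names, timestamps are passed in explicitly so the logic is deterministic, the GRB's database is reduced to in-memory dictionaries, and the search runs inline rather than in a separate thread. Resource selection is a simple round-robin placeholder.

```python
from collections import deque
from typing import Deque, Dict, List


class DataCollector:
    """Tracks when each local monitor last reported (hypothetical sketch)."""

    def __init__(self, valid_period: float) -> None:
        self.valid_period = valid_period
        self.last_seen: Dict[str, float] = {}   # node -> time of last report

    def receive(self, node: str, now: float) -> None:
        self.last_seen[node] = now

    def failed_nodes(self, now: float) -> List[str]:
        # A node is considered failed if nothing arrived within the valid period.
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.valid_period
        )


class Broker:
    """Stand-in for the GRB: task placement, status, and scheduling queue."""

    def __init__(self) -> None:
        self.placement: Dict[str, str] = {}     # task -> node
        self.status: Dict[str, str] = {}        # task -> status
        self.queue: Deque[str] = deque()

    def assign(self, task: str, node: str) -> None:
        self.placement[task] = node
        self.status[task] = "scheduled"

    def handle_node_failure(self, node: str) -> List[str]:
        # Search for tasks located on the failed node, record their
        # error status, and put them back into the scheduling queue.
        failed = [t for t, n in self.placement.items() if n == node]
        for task in failed:
            self.status[task] = "error"
            del self.placement[task]
            self.queue.append(task)
        return failed

    def reschedule(self, healthy_nodes: List[str]) -> None:
        # Placeholder policy: round-robin over the nodes still alive.
        if not healthy_nodes:
            return
        i = 0
        while self.queue:
            self.assign(self.queue.popleft(), healthy_nodes[i % len(healthy_nodes)])
            i += 1
```

A real broker would also stop tracking the failed node once it has been reported, so that the same crash is not handled twice; the sketch leaves that out for brevity.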
Fault recovery from the GRB crash is shown in Figure 4.10. In order to
guarantee that the scheduled tasks continue to operate when the GRB
FIGURE 4.8
Fault recovery from task failure. (Diagram components: grid resource broker, grid information index server, data collector, local components, local monitor, Task 1. Arrow labels: Register, Status feedback, Schedule, Report, Reschedule, Monitor.)
 