
Running

When a parallel job is started, it first checks the status of every node and removes any that are not functioning properly from its list of potential nodes. A second thread is then created.

This new thread handles job distribution. It requests a list of free nodes from the machine class, then finds the jobs that have not yet been completed. It matches each job to an available node and requests from the parallel file object a SEPlib tag for a version, or portion, of the parallel file, given the requested section and node. It then builds a command string for the job. In addition to an rsh call to the machine and the program name, the command string includes the global parameters and section parameters discussed above, along with information on how to communicate with the server process. The job's and machine's status files are then updated to note which job was sent to which node.
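The matching and command-building steps might be sketched as follows. This is only an illustration under stated assumptions, not the library's actual code: the function names `match_jobs` and `build_command`, the `banned` mapping (jobs to nodes they have already failed on), and the parameter names `input`, `server_host`, and `server_port` are all hypothetical.

```python
import shlex


def match_jobs(pending, free_nodes, banned):
    """Pair each pending job with a free node it has not already failed on.

    `banned` maps a job name to the set of nodes it must not be sent to
    (hypothetical structure, used here only for illustration).
    """
    assignments = {}
    available = list(free_nodes)
    for job in pending:
        for node in available:
            if node not in banned.get(job, set()):
                assignments[job] = node
                available.remove(node)  # safe: we leave the inner loop at once
                break
    return assignments


def build_command(node, program, tag, global_pars, section_pars,
                  server_host, server_port):
    """Build an rsh argument list for one job.

    All key=value names below are assumptions made for this sketch.
    """
    pars = {**global_pars, **section_pars}  # section values override globals
    args = [program, f"input={tag}"]
    args += [f"{key}={value}" for key, value in pars.items()]
    args += [f"server_host={server_host}", f"server_port={server_port}"]
    remote = " ".join(shlex.quote(a) for a in args)
    return ["rsh", node, remote]
```

A job that cannot be placed on any currently free node is simply left out of the result, so it stays pending until the next pass of the loop.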

These jobs are started by a series of forked processes. Each forked process runs its job, records in the status file whether the job ran to completion or failed with an error, and then exits. The job creation thread runs in a loop, checking for available machines and occasionally (every 5 minutes) checking whether a node has become inaccessible.
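A forked launcher of this kind could look roughly like the sketch below. It assumes a POSIX system and a plain-text status file with one `job state` line per record; the function name `launch` and that file layout are hypothetical and are not taken from the library.

```python
import os
import subprocess


def launch(job, cmd, status_path):
    """Fork a child that runs `cmd` and appends the outcome to a status file.

    Returns the child's pid in the parent; the child never returns.
    """
    pid = os.fork()
    if pid != 0:
        return pid  # parent: go back to distributing jobs

    # Child process: run the job, record the result, then exit.
    code = 1  # treated as an error if the command cannot even be started
    try:
        code = subprocess.call(cmd)
    finally:
        state = "finished" if code == 0 else "error"
        with open(status_path, "a") as status:
            status.write(f"{job} {state}\n")
        os._exit(0)  # leave without running the parent's cleanup handlers
```

The parent would eventually reap each child with `os.waitpid` so that exited launchers do not accumulate as zombie processes.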

The original thread starts an INET server socket. It receives communication, in the form of small ASCII strings, from the slave processes. It recognizes four different messages: job started, job finished, error in the job, and progress of the job. How the serial code sends these messages is discussed later. When the socket server receives a start message, it changes the status of the job from sent to running. A progress message is added to the status file. A finish message changes the job status to finished and marks the node as available. A failure message results in the node status being checked, the job being re-listed as `todo', and an internal list being updated to ensure that the same job is not sent to the same node again. In addition, the last 100 lines of stdout and stderr of the failed job are sent to stdout. All messages are also recorded in each parallel file's status file.
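The message handling described above amounts to a small state machine, which can be sketched as below. The wire format (`kind job node [text]`), the keyword spellings, and the names `handle_message` and `tail` are assumptions made for illustration; the text only says the messages are small ASCII strings.

```python
from collections import deque


def tail(lines, count=100):
    """Return the last `count` lines, as echoed to stdout after a failure."""
    return list(deque(lines, maxlen=count))


def handle_message(msg, job_status, node_free, banned, log):
    """Apply one message from a slave process to the bookkeeping tables.

    Assumed message layout: "<kind> <job> <node> [free text]".
    """
    kind, job, node, *_ = msg.split()
    log.append(msg)  # every message is recorded in the status log
    if kind == "start":
        job_status[job] = "running"       # sent -> running
    elif kind == "progress":
        pass                              # recording it in the log is enough
    elif kind == "finish":
        job_status[job] = "finished"
        node_free[node] = True            # node can take another job
    elif kind == "error":
        job_status[job] = "todo"          # re-list the job
        banned.setdefault(job, set()).add(node)  # never retry on this node
        # A real server would also re-check the node's status here.
    else:
        raise ValueError(f"unknown message kind: {kind!r}")
```

In a server loop, each string read from an accepted connection would be decoded and passed to `handle_message`; keeping the handler free of socket code makes the transitions easy to check in isolation.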

When the job creation thread notes that all jobs have run to completion, it exits. Finally, when the socket server notes that all jobs have finished, it loops through all of the parallel files and combines each output file, and the parallel job exits.


Stanford Exploration Project
10/23/2004