next up previous print clean
Next: Parallel files Up: Parallel objects Previous: Parallel objects

Parallel building blocks

There are several basic objects that are needed to do any remote processing. You need to know how to execute remote commands, what machines to run on, how to execute commands that run on multiple machines at once, and how to send messages between the master process and remote processes.

The SEP.rc is the simplest of these build blocks. It defines two variables, SEP.rc.shell and SEP.rc.cp, which is the shell for a remote process and the copy command. It defaults to using rsh and rcp for these variables but can be set to using the secure alternatives. It also provides the functions cp_to(mach,file_in,file_out) and cp_from(mach,file_in,file_out) which return the command strings needed for transferring a file.

The SEP.mach_base.mach provides the framework for keeping track of what machines are available. It inherits from the SEP.stat_sep.status object to store its current state. It provides a mechanism for testing whether a node is functional. It requires that its children provide a mechanism to create an initial list of machines to run on. It identifies each processor on a machine through a machine label, which takes the form mach-X, where mach is the machine name and X is a processor number associated with that machine. The child classes SEP.mach_file.mach and SEP.mach_list.mach are the simplest two examples, which read their list of available machines from a file or from a supplied list. A future module might interact with a master server allowing a job to shrink or grow based on current computer usage.

In an environment where a master node isn't exporting a disk, or when you don't want to rely on that master node being up, it is necessary to copy a program to all the slave nodes. The class SEP.distribute_prog.distribute provides a mechanism to distribute a given program to these nodes. It copies an executable to the /tmp directory with a unique name, and returns that name to the calling program. It also has a cleaning method to remove the program from the nodes.

The class SEP.pc_base.communicator provides framework for running a job on multiple machines simultaneously. It is initialized with the speed of the network. It expects its children to override the function prep_run which defines how to run a parallel job given the list of nodes, the command line arguments, and how many bytes are going to be processed. The last argument is used as a mechanism to calculate how long a job should take, therefore a mechanism to test whether a job is hung and should be killed. The class SEP.pc_mpich.communicator is the only current example. It uses MPICH as its communication model.

Communicating with a series of remote processes can be a tricky proposition. The standard Unix approach, a socket, has an important limitation in that it can not have more than X, where X is a small number, of processes waiting to establish a connection. This limitation can be reached either by having to many processes talking on a given socket or by the actions brought on by the socket communication taking too long. The library accounts for both these limitations. The basic concept is that a parallel job might spawn several sockets simultaneously. Each socket will communicate with a maximum number of processes (60 by default). The actions taken after receiving a message will be limited writing to a text file. The class SEP.par_msg.msg_object has the ability to read and write a message. Its child, SEP.par_msg.server_msg_obj, receives the message over a socket using SEP.sep_socket.sep_server class.


next up previous print clean
Next: Parallel files Up: Parallel objects Previous: Parallel objects
Stanford Exploration Project
5/3/2005