At the start of a parallel job, several communication threads are forked. Each of these threads handles communication between a set of slave processes (the jobs on remote machines) and the master machine. The master thread then requests a list of all of the machines that are available and checks that each of these machines is functional. It then begins a loop that runs until every job has run to completion.
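The startup sequence can be summarized in a short sketch. The thread class and the machine-object methods shown here (get_machines, check) are hypothetical stand-ins for the SEP internals, not the actual API:

```python
import threading

class CommThread(threading.Thread):
    """Relays traffic between one set of slave processes and the master."""
    def __init__(self, sock):
        threading.Thread.__init__(self)
        self.sock = sock
    def run(self):
        pass  # handle messages between the slaves and the master machine

def start_parallel_job(machine_obj, nthreads):
    # Fork the communication threads first.
    comm_threads = [CommThread(sock=i) for i in range(nthreads)]
    for t in comm_threads:
        t.start()
    # Ask the machine object for the available machines and verify each one.
    machines = [m for m in machine_obj.get_machines() if machine_obj.check(m)]
    # Loop until every job has run to completion (body sketched separately).
    ...
```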
The loop begins by requesting a list of available machine labels from the machine object. It must filter this list if any of the parallel files are of type SEP.pf_copy.parfile and are being used for output, because only a single process can be started on a given node until the file has been created. It then matches the remaining jobs to the available machine labels and requests from each parallel file object a local version of that file. It takes the parameters in global_pars and the task parameters in sect_pars, and adds parameters telling each job how to communicate with the socket it has been assigned. The command line for a given job is then constructed by the command_func routine. By default this routine builds the command line from the program defined at initialization; the function can be overridden for more complex tasks. The master forks a new thread for each job and records that the job has been sent. Each forked thread exits when its job completes; if the job's exit status is nonzero, the job is listed as failing.
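The text above says command_func can be overridden; the sketch below shows what a replacement might look like. The signature used here is an assumption based on the description, not the documented SEP interface; only the idea of merging global_pars, sect_pars, and the socket parameters into a command line comes from the text:

```python
def my_command_func(program, global_pars, sect_pars, socket_pars):
    pars = {}
    pars.update(global_pars)   # parameters shared by every job
    pars.update(sect_pars)     # parameters for this task alone
    pars.update(socket_pars)   # how to talk to the assigned socket
    args = " ".join("%s=%s" % (k, v) for k, v in pars.items())
    return "%s %s" % (program, args)
```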
Once a series of jobs has been started, the master thread reads the files written by the SEP.par_msg.server_msg_obj objects and updates the status of each job. The status messages come in several forms. If a node stops working, its task is guaranteed to be reassigned to another node. If a task fails more than twice (a threshold that is also configurable), the job is exited.
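The failure policy can be illustrated with a small sketch. The bookkeeping names (nfails, label, machine) and the MAX_FAILS constant are illustrative assumptions, not the SEP API:

```python
MAX_FAILS = 2  # the configurable failure threshold

def update_status(jobs, statuses):
    for job in jobs:
        status = statuses.get(job.label)
        if status in ("failed", "node_down"):
            job.nfails += 1
            if job.nfails > MAX_FAILS:
                # the task has failed too many times; exit the whole job
                raise RuntimeError("job %s failed too many times" % job.label)
            job.machine = None  # return the task to the pool for reassignment
```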
There are two extensions to the SEP.pj_base.par_job object. The SEP.pj_simple.par_job class is all that is needed for most parallel jobs; it takes several additional command line arguments.
The final parallel job class, SEP.pj_split.par_job, is useful for many inversion problems. It is initialized with a dictionary, assign_map, linking each job to a machine (or, more precisely, a machine label) that specifies where the job should be run. By always running a specific portion of the dataset on a given node, you can avoid collecting the dataset at each step of the inversion process. It is also useful for tasks such as wave-equation migration velocity analysis, where a large file (in the velocity analysis case, the wavefield) is needed for the calculations. The downside of this approach is that if a node goes down, the job cannot run to completion; it must terminate once it has finished the work on all of the remaining nodes.
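A hypothetical example of pinning dataset sections to machines with an assign_map follows. The constructor arguments shown are assumptions based on the description above, not the documented signature of SEP.pj_split.par_job:

```python
import SEP.pj_split

# Each portion of the dataset always runs on the same node, so intermediate
# results never have to be collected between inversion steps.
assign_map = {
    "section1": "node01",
    "section2": "node02",
    "section3": "node03",
}
job = SEP.pj_split.par_job(name="migration", assign_map=assign_map)
```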