FUTURE

So far we have generally been happy with our new cluster. The speedup on large applications has been consistent with our benchmark results. Nodes still hang occasionally, but less often than on our previous cluster. We plan to expand the number of nodes significantly, but we first need to deal with the hanging nodes and find or develop a batch system that meets our needs.

For nodes that die or hang we can, and do, use checkpoint systems, but these require significant coding by the programmer and are generally suboptimal. A better solution seems to be to use, or build upon, one of the fault-tolerant MPI versions. We should then be able to build a batch system that meets our needs into it, or add one on top of it, without much difficulty.

We envision writing applications as a group of tasks (for example, a set of CMPs, a set of frequencies, or a set of shots). The batch system would:

Initially send a task from the set to each available node.
When a task finishes, check whether another process with higher priority needs the node. If so, give the node to that process; if not, send a new task to the node.
If a task does not finish, resend it to a different processor.

This approach is similar to the model used by SETI@home. It has the advantages of being fault tolerant and of allowing nodes to be reassigned dynamically based on need.
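The three rules of the proposed batch system can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions, not an existing batch system: `run_batch`, `run_task`, `node_wanted_elsewhere`, and `max_attempts` are hypothetical names introduced here, tasks are run one after another rather than in parallel, and a node whose task does not finish is simply assumed to be hung and is dropped from the pool.

```python
from collections import deque


def run_batch(tasks, nodes, run_task, node_wanted_elsewhere, max_attempts=3):
    """Simulate the task-based batch scheme.

    run_task(node, task) -> (finished, value)   # hypothetical callback
    node_wanted_elsewhere(node) -> bool         # hypothetical callback
    Returns (results, released_nodes, failed_tasks).
    """
    pending = deque((task, 0) for task in tasks)  # (task, attempts so far)
    available = deque(nodes)
    results, released, failed = {}, [], []

    while pending and available:
        # Rule 1: hand one pending task to each available node.
        assignments = []
        while pending and available:
            assignments.append((available.popleft(), *pending.popleft()))

        for node, task, attempts in assignments:
            finished, value = run_task(node, task)
            if finished:
                results[task] = value
                # Rule 2: a higher-priority process may claim the node;
                # otherwise the node returns to the pool for a new task.
                if node_wanted_elsewhere(node):
                    released.append(node)
                else:
                    available.append(node)
            elif attempts + 1 < max_attempts:
                # Rule 3: the node is assumed hung and is not reused, so
                # the task is resent to a different node on a later pass.
                pending.append((task, attempts + 1))
            else:
                failed.append(task)

    # Tasks that could not be placed because no nodes remain.
    failed.extend(task for task, _ in pending)
    return results, released, failed


def demo():
    """Six tasks on three nodes, one of which always hangs."""
    hung = {"node2"}

    def run_task(node, task):
        if node in hung:
            return False, None
        return True, task * task  # stand-in for real work on one task

    def node_wanted_elsewhere(node):
        return False  # no competing higher-priority process in this demo

    return run_batch(range(6), ["node1", "node2", "node3"],
                     run_task, node_wanted_elsewhere)
```

In the demo, the task first sent to the hung node is requeued and completed by one of the healthy nodes, so all six tasks finish even though a third of the cluster is lost. A real implementation would need a timeout or heartbeat to decide that a task "does not finish", which this sketch does not model.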

Stanford Exploration Project
6/8/2002