Next: CONCLUSIONS Up: Clapp and Sava: Cluster Previous: BUILDING AND SETUP

FUTURE

So far we have generally been happy with our new cluster. The speedup on large applications has been consistent with our benchmark results. We still see nodes hang occasionally, but less often than with our previous cluster. In the future we plan to significantly expand the number of nodes, but first we need to deal with the hanging nodes and find or develop a batch system that meets our needs.

For nodes that die or hang we can, and do, use checkpointing, but it requires significant coding by the programmer and is generally sub-optimal. A better solution seems to be to use, or build upon, one of the fault-tolerant versions of MPI. We should then be able to build the batch system we need into it, or add one on top of it, fairly easily.
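To illustrate the kind of bookkeeping that checkpointing pushes onto the programmer, here is a minimal sketch in Python. It is not the code used on our cluster: the function name `run_with_checkpoint`, the JSON file layout, and the `every` interval are all assumptions made for the example.

```python
import json
import os


def run_with_checkpoint(n_steps, step_fn, path, every=10, initial_state=0):
    """Run step_fn for n_steps, saving progress to `path` every `every` steps.

    Hypothetical helper: the state must be JSON-serializable, and step_fn
    takes (step_index, state) and returns the new state.
    """
    # Resume from the last checkpoint if one exists.
    if os.path.exists(path):
        with open(path) as f:
            saved = json.load(f)
        start, state = saved["next_step"], saved["state"]
    else:
        start, state = 0, initial_state

    for i in range(start, n_steps):
        state = step_fn(i, state)
        if (i + 1) % every == 0 or i + 1 == n_steps:
            # Write to a temporary file first, then rename, so a crash
            # during the write cannot corrupt the previous checkpoint.
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump({"next_step": i + 1, "state": state}, f)
            os.replace(tmp, path)
    return state
```

Even in this toy form the application has to decide what its state is, how to serialize it, and how often to save. Any work done since the last checkpoint is lost when a node dies, which is part of why the approach is sub-optimal.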

We envision writing applications as a group of tasks (for example, a set of CMPs, a set of frequencies, or a set of shots). The batch system would:

This approach is similar to the model used by SETI@home [*]. Its advantages are that it is fault tolerant and allows dynamic reassignment of nodes based on need.
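The task-based model can be sketched as a small master loop that hands independent tasks to nodes and requeues any task whose node fails. This is only an illustration under stated assumptions, not a design for the batch system. The names `run_task_farm` and `run_on_node` are hypothetical, tasks are run one at a time rather than in parallel, and any exception is treated as a dead or hung node.

```python
from collections import deque


def run_task_farm(tasks, nodes, run_on_node, max_attempts=3):
    """Assign independent tasks to nodes, requeueing tasks from failed nodes.

    Hypothetical sketch: run_on_node(node, task) returns a result or raises.
    Returns (results_by_task, nodes_still_alive).
    """
    pending = deque((task, 0) for task in tasks)  # (task, attempts so far)
    alive = list(nodes)
    results = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no live nodes left")
        task, attempts = pending.popleft()
        node = alive[turn % len(alive)]  # simple round-robin assignment
        turn += 1
        try:
            results[task] = run_on_node(node, task)
        except Exception:
            # Assumption: a failure means the node died or hung, so stop
            # using that node and hand the task to another one later.
            alive.remove(node)
            if attempts + 1 >= max_attempts:
                raise
            pending.append((task, attempts + 1))
    return results, alive
```

Because each task (a shot, a frequency, a CMP) is independent, losing a node costs only the task it was working on, and the list of live nodes can grow or shrink between assignments.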


Stanford Exploration Project
6/8/2002