Next: CONCLUSIONS
Up: Clapp and Sava: Cluster
Previous: BUILDING AND SETUP
So far we have generally been happy with our new cluster. The
speedup on large applications has been consistent with our
benchmark results. We still see nodes occasionally hanging, but
less often than with our previous cluster.
In the future we plan to significantly expand the number of nodes,
but we first need to deal with the problem of hanging nodes and
find or develop a batch system that will meet our needs.
For nodes that die or hang we can, and do, use checkpoint systems,
but these require significant coding by the programmer and are
generally sub-optimal.
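To illustrate the kind of bookkeeping that checkpointing pushes onto the programmer, the following is a minimal sketch in Python, not code from our applications. The function name `run_with_checkpoint`, the JSON checkpoint file, and the `process` callback are all hypothetical choices made for the example. It records each completed task on disk so that a restarted job skips work that already finished.

```python
import json
import os


def run_with_checkpoint(tasks, process, path):
    """Process tasks in order, saving results to `path` after each one.

    Hypothetical helper: results must be JSON-serializable.
    """
    results = {}
    if os.path.exists(path):
        # A previous run left a checkpoint: reload what it finished.
        with open(path) as f:
            results = json.load(f)

    for i, task in enumerate(tasks):
        key = str(i)  # JSON object keys are strings
        if key in results:
            continue  # already done before the restart
        results[key] = process(task)
        # Write to a temporary file, then rename, so that a crash in the
        # middle of the write cannot corrupt the existing checkpoint.
        tmp = path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(results, f)
        os.replace(tmp, path)

    return [results[str(i)] for i in range(len(tasks))]
```

Even in this simple form the application has to decide what state to save, how often, and how to restart from it, which is the coding burden referred to above.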
A better solution seems to be to use, or build upon, one of
the fault-tolerant MPI implementations. On top of such a library
we should be able to build, without much difficulty, a batch
system that will meet our needs.
We envision writing applications as
a group of tasks (for example, a set of CMPs, a set of frequencies,
or a set of shots). The batch system would:
- Initially distribute tasks from the set to the available nodes.
- When a task finishes, check whether another process, with
higher priority, needs the node. If so, the new process gets the node.
If not, a new task is sent to the node.
- If a task does not finish, resend it to a different processor.
This approach is similar to the model used by SETI@home.
It has the advantages that it is fault tolerant and allows dynamic
reassignment of nodes based on need.
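The scheduling rules listed above can be sketched as a small, sequential Python simulation. This is only an illustration under stated assumptions, not an implementation of the planned batch system: the names `run_batch`, `run_task`, `higher_priority_waiting`, and `max_attempts` are hypothetical, tasks are assumed to be hashable, a node on which a task fails is assumed to be dropped from the pool, and a real system would run tasks concurrently (for example over MPI) rather than one at a time.

```python
from collections import deque


def run_batch(tasks, nodes, run_task,
              higher_priority_waiting=lambda node: False,
              max_attempts=3):
    """Sequentially simulate the task-farm loop (hypothetical interface).

    run_task(node, task) returns True if the task finished on that node
    and False if the node hung or died while working on it.
    """
    pending = deque((task, 0) for task in tasks)  # (task, failed attempts)
    available = deque(nodes)
    done = {}        # task -> node that completed it
    unfinished = []  # tasks that could not be completed
    released = []    # nodes handed over to a higher-priority process

    while pending and available:
        node = available.popleft()
        task, attempts = pending.popleft()
        if run_task(node, task):
            done[task] = node
            if higher_priority_waiting(node):
                released.append(node)    # give the node away
            else:
                available.append(node)   # node takes another task
        elif attempts + 1 < max_attempts:
            # Assumption: the failed node is not reused, and the task
            # is resent to whichever node becomes free next.
            pending.append((task, attempts + 1))
        else:
            unfinished.append(task)

    # Anything still queued when the nodes run out is also unfinished.
    unfinished.extend(task for task, _ in pending)
    return done, unfinished, released


shots = ["shot%d" % i for i in range(5)]
done, unfinished, released = run_batch(
    shots, ["n0", "n1", "n2"],
    run_task=lambda node, task: node != "n1")  # pretend n1 always hangs
print(done, unfinished, released)
```

In this example the task first sent to the hanging node `n1` is requeued and completed by another node, so every shot finishes even though one node is lost.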
Stanford Exploration Project
6/8/2002