Next: Global parameters
Up: DESIGN
Previous: Distributing and collecting
One of the biggest problems with running on cheap hardware
is that the failure rate is high. Writing migration code that
can figure out where it died and continue from there is challenging.
When the migration is part of a larger inversion problem,
the task becomes even more difficult. One of the goals
of the library is to make check-pointing easier.
Any thread can write a status parameter to a distributed
tag. The status parameter is written to the thread's local sections
rather than to the global tag, so clobbering of the global tag's
text file isn't an issue.
Restarting becomes a much simpler matter. You can request
the status parameter from each section with a single call.
Figuring out what portion of the job has finished and what
portion remains becomes trivial.
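
The idea can be illustrated with a minimal sketch in Python. This is not
the library's interface: the function names (write_status, read_statuses,
remaining_sections), the one-file-per-section layout, and the status
strings "done", "started", and "pending" are all assumptions made for
illustration. Each section gets its own status file, so no two writers
touch the same file, and a restart reads every section's status in one call.

```python
import os
import tempfile


def write_status(tag_dir, section, status):
    """Record `status` for one section in that section's own file.

    Hypothetical layout: one small status file per section, so
    concurrent writers never share (and never clobber) a file.
    """
    path = os.path.join(tag_dir, f"section_{section}.status")
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        f.write(status + "\n")
    # Rename over the old file so a crash mid-write cannot leave
    # a half-written status behind.
    os.replace(tmp, path)


def read_statuses(tag_dir, n_sections):
    """Return the status of every section with a single call.

    A section that never wrote a status is reported as "pending".
    """
    statuses = []
    for section in range(n_sections):
        path = os.path.join(tag_dir, f"section_{section}.status")
        try:
            with open(path) as f:
                statuses.append(f.read().strip())
        except FileNotFoundError:
            statuses.append("pending")
    return statuses


def remaining_sections(tag_dir, n_sections):
    """List the sections a restarted job still has to process."""
    statuses = read_statuses(tag_dir, n_sections)
    return [i for i, s in enumerate(statuses) if s != "done"]


def demo():
    """Simulate a job that died after finishing sections 0 and 2."""
    with tempfile.TemporaryDirectory() as tag_dir:
        write_status(tag_dir, 0, "done")
        write_status(tag_dir, 2, "done")
        write_status(tag_dir, 3, "started")  # died partway through
        return remaining_sections(tag_dir, 5)
```

Calling demo() returns [1, 3, 4]: section 3 was started but never
finished, and sections 1 and 4 were never touched, so a restart
would redo only those three.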
Stanford Exploration Project
5/23/2004