
Initialization

There are three mechanisms for initializing a new distributed dataset. The first is a call to the library. The call specifies which axis to distribute, how many sections to create, and what distribution pattern to use. The library determines the number of threads it is running on and distributes the sections to those threads in a round-robin fashion. Normally you would expect a single section for each thread, but this is not required, and in certain situations it might not even be desirable.
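The round-robin placement can be sketched as follows. This is only an illustration of the idea; round_robin_assign and its arguments are names invented here, not the library's actual call.

    def round_robin_assign(n_sections, n_threads):
        """Assign section indices to threads in round-robin order.

        Returns a list where entry i holds the section indices owned by
        thread i.  With as many sections as threads each thread gets
        exactly one section; with more sections the extras wrap around.
        """
        owners = [[] for _ in range(n_threads)]
        for section in range(n_sections):
            owners[section % n_threads].append(section)
        return owners

    # Example: six sections distributed over four threads.
    # Thread 0 owns sections [0, 4], thread 1 owns [1, 5],
    # threads 2 and 3 own one section each.
    print(round_robin_assign(6, 4))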

The second mechanism is to read a distributed dataset from a file. When a distributed dataset is initialized from a tag, the library tries to preserve data locality: sections are assigned to threads on the machine where they were written, as much as possible without sacrificing load balance. For example, suppose a dataset was broken into six sections, three written on machine A and three on machine B, and a job is then started on machines A, B, and C. Machines A and B will each be assigned two sections that were written locally, while C will be assigned one section from the original A group and one from the original B group. Whenever a dataset is initialized from a tag, only the master node reads the tag; its contents are then sent to all slave processes. This keeps every process from having to access, and simultaneously read, a single file, which is useful when scaling to a large number of processors.
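A locality-preserving assignment might look roughly like the greedy sketch below. The function name locality_assign, its arguments, and the quota scheme are assumptions made for illustration, not the library's actual algorithm; the point is only that local sections are handed out first, and remote sections only fill whatever room the load-balance quota leaves.

    from collections import defaultdict

    def locality_assign(section_hosts, process_hosts):
        """Assign sections to processes, preferring local sections.

        section_hosts: list mapping section index -> host it was written on.
        process_hosts: list mapping process rank -> host it runs on.
        """
        n_sec, n_proc = len(section_hosts), len(process_hosts)
        # Per-process quota that keeps the overall load balanced.
        quota = [n_sec // n_proc + (1 if r < n_sec % n_proc else 0)
                 for r in range(n_proc)]
        assignment = [[] for _ in range(n_proc)]

        # Group unassigned sections by the host that wrote them.
        by_host = defaultdict(list)
        for s, host in enumerate(section_hosts):
            by_host[host].append(s)

        # Pass 1: give each process sections written on its own host.
        for rank, host in enumerate(process_hosts):
            while by_host[host] and len(assignment[rank]) < quota[rank]:
                assignment[rank].append(by_host[host].pop())

        # Pass 2: spread the remaining (remote) sections over processes
        # that still have room under their quota.
        leftovers = [s for secs in by_host.values() for s in secs]
        for rank in range(n_proc):
            while leftovers and len(assignment[rank]) < quota[rank]:
                assignment[rank].append(leftovers.pop())
        return assignment

    # The scenario from the text: six sections, three written on host A
    # and three on host B, read back by one process each on A, B, and C.
    sections = ["A", "A", "A", "B", "B", "B"]
    processes = ["A", "B", "C"]
    print(locality_assign(sections, processes))
    # -> A and B each keep two local sections; C picks up one from each group.

The sketch makes load balance the hard constraint and locality the preference, which matches the behavior described above: no process is overloaded just to keep reads local.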

The final method is to create a distributed dataset from another distributed dataset. If you are running on multiple threads and initialize a new dataset from an already distributed dataset, the new dataset inherits the original dataset's distribution pattern.
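A minimal sketch of this inheritance, assuming the distribution is represented as a section-to-thread owner map (clone_distribution is a hypothetical name, not the library's API):

    def clone_distribution(existing_owners):
        """Build the owner map for a new dataset that inherits an existing
        dataset's distribution pattern: each section of the new dataset is
        placed on the thread owning the corresponding original section."""
        return [list(sections) for sections in existing_owners]

    # Owner map of an existing dataset: thread i owns the listed sections.
    original = [[0, 4], [1, 5], [2], [3]]
    print(clone_distribution(original))  # the new dataset reuses the same map

Because corresponding sections of the two datasets live on the same thread, per-section operations between them require no communication.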

