A distributed dataset can be read from a file. When a distributed dataset is initialized from a tag, the system tries to preserve data locality: sections are assigned to threads on the same machine where they were written, as far as possible without sacrificing load balance. For example, suppose a dataset has been broken into six sections, three written on machine A and three on machine B, and you then start a job on machines A, B, and C. A and B will each be assigned two of their locally written sections, while C will be assigned one section originally from A and one originally from B. Whenever a dataset is initialized from a tag, only the master node reads the tag; its contents are then sent to all slave processes. This avoids every process needing access to, and simultaneously reading, a single file, which can be useful when scaling to a large number of processors.
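The assignment policy described above can be sketched as a two-pass greedy algorithm. This is a minimal illustration, not the library's actual implementation; the function name `assign_sections` and the dictionary-based representation are hypothetical. The first pass keeps each section on its home machine while that machine is under its fair-share quota; the second pass spreads the remaining sections over the least-loaded machines.

```python
import math
from collections import defaultdict

def assign_sections(section_homes, job_machines):
    """Hypothetical sketch of locality-aware section assignment.

    section_homes: {section_id: machine the section was written on}
    job_machines:  machines participating in the current job
    """
    # Fair-share quota: no machine takes more than its balanced share.
    quota = math.ceil(len(section_homes) / len(job_machines))
    assignment = defaultdict(list)
    leftover = []
    # Pass 1: keep a section local when its home machine is in the job
    # and still has quota left.
    for sec, home in section_homes.items():
        if home in job_machines and len(assignment[home]) < quota:
            assignment[home].append(sec)
        else:
            leftover.append(sec)
    # Pass 2: give the remaining sections to the least-loaded machines.
    for sec in leftover:
        target = min(job_machines, key=lambda m: len(assignment[m]))
        assignment[target].append(sec)
    return dict(assignment)

# The six-section example from the text: three sections written on A,
# three on B, and a job started on A, B, and C.
sections = {"a1": "A", "a2": "A", "a3": "A",
            "b1": "B", "b2": "B", "b3": "B"}
result = assign_sections(sections, ["A", "B", "C"])
```

Running this on the example yields two local sections each for A and B, with C picking up one section from each of the original groups, matching the behavior described above.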
The final method of creating a distributed dataset is from another distributed dataset. If you are running on multiple threads and initialize a new dataset from an already distributed one, the new dataset inherits the original dataset's distribution pattern.
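Inheriting a distribution pattern can be pictured as copying the section-to-thread mapping from the source dataset, so the derived data stays co-located with the data it was computed from. The class `DistributedDataset` and its `section_map` attribute below are hypothetical names for illustration only, assuming the pattern is representable as a section-to-thread map.

```python
class DistributedDataset:
    """Hypothetical sketch of a dataset split into sections across threads."""

    def __init__(self, data, section_map=None, source=None):
        self.data = data
        if source is not None:
            # Initializing from another distributed dataset: inherit its
            # distribution pattern by copying the section-to-thread map.
            self.section_map = dict(source.section_map)
        else:
            self.section_map = dict(section_map or {})

# A dataset with an explicit distribution, and one derived from it.
base = DistributedDataset(data=[1, 2, 3],
                          section_map={0: "thread-0", 1: "thread-1"})
derived = DistributedDataset(data=[x * 2 for x in base.data], source=base)
```

Here `derived` ends up with the same section-to-thread mapping as `base`, so operations pairing the two datasets need no data movement.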