The second thing we have to be concerned with is the speed of our FFT. We generally use FFTW (Frigo and Johnson, 1999), which can handle axes of any length, but transforming an axis of length 512 will be significantly faster than transforming an axis of length 511. The easiest way to handle this problem is to make a list of `good' axis lengths (e.g., products of small prime factors, such as powers of two) and only decrease the offset domain at depths where the next lowest `good' axis length is reached.
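As a sketch of this idea, the helpers below find the nearest `good' FFT lengths above and below a given size. The choice of allowed prime factors (2, 3, 5, and 7 here) is an assumption for illustration; FFTW handles other factors, just less efficiently.

```python
def _is_good(n, primes):
    """True if all prime factors of n are in `primes`."""
    for p in primes:
        while n % p == 0:
            n //= p
    return n == 1

def next_good_size(n, primes=(2, 3, 5, 7)):
    """Smallest length m >= n built only from the allowed primes."""
    while not _is_good(n, primes):
        n += 1
    return n

def prev_good_size(n, primes=(2, 3, 5, 7)):
    """Largest length m <= n built only from the allowed primes.
    This is the length to shrink the offset axis to as the domain decreases."""
    while n > 1 and not _is_good(n, primes):
        n -= 1
    return n
```

With these helpers, the offset axis stays at `prev_good_size` of the current domain and is only shrunk when the domain drops to the next lower `good' length, so every FFT runs at a fast size.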
Finally, when this process is parallelized to run on a Linux cluster, load balancing becomes more difficult. Each frequency block has a different number of depth steps, a different interpolation cost, and a fixed I/O cost. Ensuring that each processor finishes at approximately the same time is a non-linear optimization problem. Presently I assign costs for depth steps, interpolation, and I/O, try 1000 different random assignments, and select the one that shows the least cost differential between the nodes.
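A minimal sketch of this randomized balancing follows. It assumes each block's depth-step, interpolation, and I/O costs have already been summed into a single per-block cost; the function names and trial count are illustrative, not the actual implementation.

```python
import random

def balance_blocks(block_costs, n_nodes, n_trials=1000, seed=0):
    """Randomly assign frequency blocks to nodes, keeping the assignment
    with the smallest spread between the busiest and idlest node.

    block_costs : per-block total cost (depth steps + interpolation + I/O)
    Returns (assignment, spread), where assignment[i] is the node for block i.
    """
    rng = random.Random(seed)
    best_assign, best_spread = None, float("inf")
    for _ in range(n_trials):
        assign = [rng.randrange(n_nodes) for _ in block_costs]
        loads = [0.0] * n_nodes
        for cost, node in zip(block_costs, assign):
            loads[node] += cost
        spread = max(loads) - min(loads)  # cost differential between nodes
        if spread < best_spread:
            best_assign, best_spread = assign, spread
    return best_assign, best_spread
```

Random search is crude but cheap relative to the migration itself; a greedy longest-processing-time heuristic or simulated annealing would be natural refinements.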