
Next: Different Precisions Up: Exploration of Design Options Previous: Multiple Stencil Operators

Processing Multiple Time Steps

Instead of placing multiple concurrent cores on the chip, another strategy is to process multiple time steps in one pass. Figure 4 shows the basic structure of a circuit that processes three time steps in one pass. The three units each process one time step, with the output of each unit serving as the input of the next. The example in the figure uses a 3-by-3-by-3 `cube' stencil. In general, computing the wave-field data in slice $n$ requires the wave-field data in slices $(n+1)$, $n$, and $(n-1)$ from the previous time step. Therefore, when unit $1$ starts processing slice $n$, unit $2$ can start processing slice $(n-1)$. Meanwhile, unit $2$ needs intermediate buffers to store the results for slices $(n-1)$ and $n$ from unit $1$.
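The chaining of units can be illustrated with a minimal Python sketch. It is not the circuit from the paper: each slice is reduced to a single number, the stencil is a hypothetical three-point average with made-up coefficients `c0` and `c1`, and slices outside the array are assumed to be zero. All function names are illustrative. The sketch only shows that a unit holding two buffered slices can start producing output one slice behind its input, and that chaining three such units gives the same result as three separate passes.

```python
def stencil_unit(slices, c0=0.5, c1=0.25):
    """One pipeline unit: advance a stream of slices by one time step.

    Output slice n needs input slices n-1, n and n+1, so the unit keeps
    two buffered values and lags its input by one slice.
    Coefficients c0 and c1 are illustrative assumptions.
    """
    below = 0.0     # slice n-1 (assumed zero before the first slice)
    centre = None   # slice n, waiting for slice n+1 to arrive
    for above in slices:
        if centre is not None:
            yield c0 * centre + c1 * (below + above)
            below = centre
        centre = above
    if centre is not None:
        # Last slice: the missing slice n+1 is assumed to be zero.
        yield c0 * centre + c1 * (below + 0.0)


def one_pass(field, n_steps=3):
    """Chain n_steps units so the data is streamed through only once."""
    stream = iter(field)
    for _ in range(n_steps):
        stream = stencil_unit(stream)
    return list(stream)


def reference_step(field, c0=0.5, c1=0.25):
    """Plain (non-streaming) version of one time step, for comparison."""
    n = len(field)
    out = []
    for i in range(n):
        below = field[i - 1] if i > 0 else 0.0
        above = field[i + 1] if i < n - 1 else 0.0
        out.append(c0 * field[i] + c1 * (below + above))
    return out


def step_by_step(field, n_steps=3):
    """Apply the reference step n_steps times, one full pass per step."""
    for _ in range(n_steps):
        field = reference_step(field)
    return field


a0 = [0.0, 0.0, 8.0, 0.0, 0.0, 0.0]
print(one_pass(a0, 3) == step_by_step(a0, 3))
```

In this sketch the only state a unit carries between slices is `below` and `centre`, which mirrors the two intermediate slice buffers described for unit $2$.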

Figure 4. Basic circuit structure for processing multiple time steps ($a_i$ denotes the wave-field data in the time step $i$). $\mathbf{[NR]}$

An advantage of processing multiple time steps over instantiating multiple stencil operators is that the performance is not constrained by the memory bandwidth: the unit for each time step after the first takes its input from the previous unit and therefore does not consume the memory bandwidth of the FPGA.

However, on the data side, because we apply 3D blocking to the array, processing multiple time steps requires extra data items to start with. Given a convolution stencil with $ns$ non-zero lags in each direction, processing $n$ time steps in one pass for an $nx\times ny$ block requires starting with an array of size $(nx+2\times n\times ns)\times(ny+2\times n \times ns)$. For example, for 10 time steps on a $100\times 100$ block, the data overhead is 44% for the `cube' and 156% for the `star'.
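The overhead figures can be checked with a short Python sketch of the size formula above. The function names are illustrative. The value $ns=1$ for the `cube' follows from its 3-by-3-by-3 extent; $ns=3$ for the `star' is an assumption, inferred from the quoted 156% figure rather than stated in the text.

```python
def padded_size(nx, ny, n_steps, ns):
    """Input block size needed to produce an nx-by-ny block after n_steps.

    Each time step consumes ns points on every side, so the halo grows
    by 2 * ns per step in each direction.
    """
    halo = 2 * n_steps * ns
    return nx + halo, ny + halo


def data_overhead_percent(nx, ny, n_steps, ns):
    """Extra input data, as a percentage of the nx-by-ny output block."""
    px, py = padded_size(nx, ny, n_steps, ns)
    return 100 * (px * py - nx * ny) / (nx * ny)


# 10 time steps on a 100x100 block
print(data_overhead_percent(100, 100, 10, ns=1))  # `cube'
print(data_overhead_percent(100, 100, 10, ns=3))  # `star' (ns=3 is assumed)
```

Because the halo scales with $n \times ns$, the overhead grows quickly for small blocks, which is the trade-off against BRAM usage discussed next.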

Meanwhile, as the unit at each time step needs to store the results of the previous time step, this approach also increases the demand for BRAM resources. Therefore, to increase the number of time steps, we need to reduce the blocking size, which increases the cost of streaming overlapping data items and requires a larger number of streams.

Another advantage of this multiple-time-step architecture is that we can improve the order of accuracy in time at relatively small cost. For example, unit $3$ in Figure 4, instead of receiving only the previous wave-field data $a_2$ from unit $2$, can receive the wave-field data $a_2$ and $a_1$ from units $2$ and $1$ to achieve 4th-order accuracy in time. The cost of raising the time order is the extra buffer needed to store the wave-field data from unit $1$ and the increased number of adders and multipliers.

Figure 5 shows the estimated performance of FPGA convolution designs that process multiple time steps in one pass. The `star' stencil and the 2nd- and 4th-order `cube' stencils are compared. For this approach, the `cube' stencil performs much better than the `star' stencil because of its smaller BRAM requirement (the `star' needs to buffer six slices for the convolution operation, while the `cube' needs to buffer only two). Given the constraint on logic slices, the FPGA can fit eight time steps for the `star', and six and five time steps for the 2nd- and 4th-order `cube', respectively. The `star' reaches its peak performance, an 11x speedup, with four time steps; beyond that, performance degrades as more time steps are added. The 2nd-order `cube' improves steadily up to a 29x speedup with six time steps. The 4th-order `cube' achieves a 25x speedup with five time steps.



2009-05-05