Next: FPGA Acceleration Up: Pell and Clapp: Accelerating Previous: Streaming processors

Computing with FPGAs

FPGAs are chips based on Complementary Metal Oxide Semiconductor (CMOS) technology. They contain logic that can be configured to implement any digital circuit, together with a limited number of memory elements, including RAMs and registers. In fact, FPGAs can be reconfigured several times per second, offering a flexible substrate for application-specific circuits. The price of reconfigurability is a 10x lower clock frequency compared to today's state-of-the-art Pentium and Opteron processors. Modern FPGAs contain on the order of 10⁵ independent logic cells, all of which can operate in parallel. This massive parallelism more than compensates for the 10x reduction in clock frequency versus a state-of-the-art CPU, delivering orders of magnitude more compute power within a reasonable power budget.

FPGAs have shown excellent potential as hardware accelerators for a wide class of applications. Compute-intensive algorithms can be mapped directly into parallel FPGA hardware, tightly coupled to a conventional CPU through a high-speed I/O bus, enabling key hotspots in an application to be accelerated by over an order of magnitude.

The performance potential of FPGAs arises from exploiting stream processing. In a typical CPU, instructions are executed sequentially (Figure 1). Despite the high clock frequency, data throughput can be quite limited since there is limited scope for parallelism, even in modern superscalar processors with vector (SIMD) units. For many algorithms a streaming approach (Figure 2) delivers significant benefits.

FPGA stream processors operate continuously on streams of data. Data is transferred to the accelerator once, over a high-speed I/O bus such as PCI Express, and then passes from one processing element to the next as it is required for each operation. The FPGA circuit computes one or more results every cycle, without any of the control overhead associated with CPU conditionals, loops, etc. On-chip memory implements a custom "perfect cache" which retains data on-chip for precisely as long as it is required for the computation. A large number of compute units operating in parallel overcome the compute limitations of the CPU, while the on-chip storage structure and MISD (multiple instruction, single data) operation significantly mitigate the memory limitations of the CPU.

Stream processors show potential for accelerating seismic applications operating on large datasets, since only a small fraction of the data needs to be stored on-chip at any one time. This makes the approach scalable to multi-dimensional problems with tens of gigabytes of data, since the primary storage medium remains CPU main memory.

FPGAs are usually regarded as hard to program, with building FPGA accelerators essentially being a matter of hardware design. We develop this accelerator at a higher level of abstraction using the ASC compiler (Mencer, 2006). ASC, A Stream Compiler for FPGAs, provides a software-like interface to FPGA design based on C++, while retaining the performance of hand-designed circuits. At the top level, ASC code closely resembles C code, allowing a relatively low-cost transition from a C-based software implementation to the FPGA hardware implementation. One key difference between ASC and a conventional imperative programming language is that, by default, all operations are performed in parallel and all operators are vector operations performed on streams of data.

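The "perfect cache" idea described above can be illustrated with a small software model in plain C. This is only a sketch under stated assumptions: the function name stream_sum3 and the 3-point sum it computes are hypothetical choices made for illustration, not part of ASC or of the accelerator described here. Each input element is read from main memory exactly once, and a three-element window stands in for the on-chip storage that holds data only for as long as the computation needs it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical 3-point sum: out[j] = in[j] + in[j+1] + in[j+2].
 * Software model only; a real FPGA pipeline would do the shift and
 * the additions in dedicated hardware on every clock cycle.
 * Returns the number of results written (n - 2 for n >= 3, else 0).
 */
size_t stream_sum3(const int *in, size_t n, int *out)
{
    assert(in != NULL && out != NULL);

    int window[3] = {0, 0, 0};   /* stands in for on-chip registers */
    size_t produced = 0;

    for (size_t i = 0; i < n; i++) {
        /* One new value enters the window per step; the oldest leaves. */
        window[0] = window[1];
        window[1] = window[2];
        window[2] = in[i];       /* each input is read exactly once */

        /* Once the window is full, one result is produced per step. */
        if (i >= 2)
            out[produced++] = window[0] + window[1] + window[2];
    }
    return produced;
}
```

Called on the five inputs 1, 2, 3, 4, 5, this model writes the three results 6, 9, 12. The storage held between steps is three integers regardless of the input length, which is the property that lets the streaming approach scale to datasets much larger than the on-chip memory.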
To transfer code to an FPGA accelerator we identify loops to be accelerated, then re-write those loops in ASC code, replacing the original loop with code which transfers data to/from the accelerator. For example, a C loop can describe a vector increment operation as below:

int i;
int a[SIZE], b[SIZE];
for (i = 0; i < SIZE; i++)
    b[i] = a[i] + 1;

This can be rewritten for FPGA implementation as:

STREAM_START;
    HWint a(IN), b(OUT);    
    b = a + 1; // Loop is implicit
STREAM_END;

The loop has been replaced with STREAM_START and STREAM_END declarations, which identify the boundaries of the code to be implemented on the FPGA. The integer arrays a and b are declared as hardware integer (HWint) variables, one input and one output; the loop over their elements is implicit. This ASC code can be compiled using GCC, producing an executable which, when executed, generates an FPGA circuit.
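On the software side, the original loop is replaced by code that transfers data to and from the accelerator. The text does not show that host code, so the following is a minimal software model in plain C rather than the actual ASC interface: the names stream_kernel, increment_kernel and stream_run are hypothetical, and the loop inside stream_run merely stands in for the transfer over the I/O bus and the per-element hardware computation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type for a per-element kernel; it models the code
 * written between STREAM_START and STREAM_END. */
typedef int (*stream_kernel)(int);

/* Software model of the kernel "b = a + 1". */
int increment_kernel(int a)
{
    return a + 1;
}

/*
 * Hypothetical stand-in for the call that replaces the original loop:
 * stream n inputs through the kernel and collect n outputs.
 * On real hardware this would be a data transfer, not a CPU loop.
 */
void stream_run(stream_kernel kernel, const int *in, int *out, size_t n)
{
    assert(kernel != NULL && in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++)
        out[i] = kernel(in[i]);
}
```

With this model, the vector increment becomes a single call, stream_run(increment_kernel, a, b, SIZE), which makes explicit that the calling program sees whole arrays going in and coming out while the element-wise loop lives inside the accelerator.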

Figure 1: When computing with a microprocessor, instructions are executed sequentially on data items retrieved from memory.

Figure 2: Computing with an FPGA involves "streaming" data through an FPGA which has been configured to implement a function. Here two input arrays are transferred, processed, and a single output array is produced.


Stanford Exploration Project
5/6/2007