Seismic imaging using GPGPU accelerated reverse time migration |
Nvidia's ``Compute Unified Device Architecture'' (CUDA) is a software interface and compiler technology for general-purpose GPU programming (Nvidia, 2008). CUDA comprises a software interface, a utility toolkit, and a compiler suite designed to give programs access to the massively parallel capabilities of the modern GPU without requiring the programmer to express logical operations as graphics instructions. The latest release of CUDA, version 2.1, exposes certain features available only in the Tesla T10 GPU series. Below, all specifications are given for the T10 GPU using CUDA 2.1 software. For easy reference, Table 1 summarizes the terminology and acronyms that apply to the software and hardware tiers. An acronym guide is also provided in Table 2 in the Appendix.
CUDA programs have two parts: ``host'' code, which will run on the main computer's CPU(s); and ``device'' code, which is compiled and linked with the Nvidia driver to run on the GPU device. Most device code is a ``kernel,'' the basic functional design block for parallelized device code. Kernels are prepared and dispatched by host code. When the kernel is dispatched, the host code specifies parallelism parameters, and the kernel is assigned to independent threads which are mapped to device hardware for parallel execution.
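The host/device split can be illustrated with a minimal sketch; the kernel name, array size, and scale factor here are illustrative, not taken from the migration code itself.

```cuda
#include <cuda_runtime.h>

// Device code: a kernel that scales each array element in place.
// Each thread computes its own global index and handles one element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    // Host code specifies the parallelism parameters at dispatch time:
    // 256 threads per block, and enough blocks to cover all n elements.
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();  // wait for the kernel to finish
    cudaFree(d_data);
    return 0;
}
```

The `<<<blocks, threadsPerBlock>>>` launch syntax is where the host fixes the block and thread counts described above; the kernel body itself is oblivious to how many copies of it run in parallel.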
The coarsest unit of kernel parallelism is the ``block,'' which contains many threads running the same code. Each block maps to a hardware multiprocessor. Blocks subdivide a large problem into manageable units that execute independently; inter-block synchronization and communication are difficult without using the expensive global memory space or coarse barriers. Each block contains up to 512 threads, organized into sub-groups, or ``warps,'' of up to 32 threads. At this level of parallelism, shared memory and thread synchronization are very cheap, and dedicated hardware instructions exist for synchronizing threads. As of compute capability 1.3, available on the Tesla T10, warp ``voting'' instructions enable single-cycle inter-thread control. As an extra performance boost, threads executing the same instructions are optimized by the ``Single Instruction, Multiple Thread'' (SIMT) hardware, which shares the Instruction Fetch (IF) and Decode (DEC) logic and pipelines operations efficiently. If conditional control flow requires different instructions, the threads must serialize some of these pipeline stages; peak performance is therefore achieved when all conditional control flow is identical for the threads in a warp. In Finite Difference Time Domain (FDTD) wave-propagation code, it is generally possible to keep all threads operating in SIMT mode. The boundary conditions at the edges of thread blocks, and at the edges of the simulation space, are currently the only exceptions to this SIMT mode.
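The SIMT behaviour and its block-edge exceptions can be sketched with a simplified 2-D FDTD update kernel. The tile size, field names, stencil coefficients, and the zero (rigid) domain boundary below are assumptions for illustration only, not the paper's actual scheme; the launch is assumed to cover a grid whose dimensions are multiples of the tile size, so no thread exits before the barrier.

```cuda
#include <cuda_runtime.h>

#define TILE 16  // illustrative tile size; one block updates a TILE x TILE patch

// Sketch of one second-order 2-D FDTD time step. Interior threads all follow
// the same instruction path (full SIMT); only threads on the tile edges
// diverge briefly to load the one-cell halo, and only threads at the domain
// boundary diverge to apply the (assumed) zero boundary condition.
__global__ void fdtd_step(const float *p_prev, const float *p_cur,
                          float *p_next, const float *vel2dt2,
                          int nx, int ny)  // assumes nx, ny multiples of TILE
{
    __shared__ float tile[TILE + 2][TILE + 2];  // tile plus one-cell halo

    int ix = blockIdx.x * TILE + threadIdx.x;
    int iy = blockIdx.y * TILE + threadIdx.y;
    int tx = threadIdx.x + 1, ty = threadIdx.y + 1;
    int idx = iy * nx + ix;

    tile[ty][tx] = p_cur[idx];
    // Block-edge threads load the halo; domain edges use a zero boundary.
    if (threadIdx.x == 0)        tile[ty][0]        = (ix > 0)      ? p_cur[idx - 1]  : 0.0f;
    if (threadIdx.x == TILE - 1) tile[ty][TILE + 1] = (ix < nx - 1) ? p_cur[idx + 1]  : 0.0f;
    if (threadIdx.y == 0)        tile[0][tx]        = (iy > 0)      ? p_cur[idx - nx] : 0.0f;
    if (threadIdx.y == TILE - 1) tile[TILE + 1][tx] = (iy < ny - 1) ? p_cur[idx + nx] : 0.0f;

    __syncthreads();  // cheap intra-block barrier before reading neighbours

    float lap = tile[ty][tx - 1] + tile[ty][tx + 1]
              + tile[ty - 1][tx] + tile[ty + 1][tx]
              - 4.0f * tile[ty][tx];

    p_next[idx] = 2.0f * tile[ty][tx] - p_prev[idx] + vel2dt2[idx] * lap;
}
```

Note how the only conditional control flow sits in the halo loads and the domain-boundary tests, matching the two SIMT exceptions identified above; the stencil computation itself is branch-free and identical for every thread in every warp.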