In the not too distant past, graphics card producers started writing software that allowed use of their hardware for purposes other than graphical display. This is called General Purpose Computing on Graphics Processing Units. The “C” is left out of the acronym, which becomes GPGPU. The graphics processor producers provide software development kits, comprising hardware drivers, a compiler that generates instructions for the Graphics Processing Unit from source written in a language such as C, and function libraries that provide the application programming interface between the CPU and GPU. NVIDIA calls their version of this “CUDA”, which is the acronym for “Compute Unified Device Architecture.” A couple folks in the Prototype Lab got the opportunity to do a project using CUDA. One person worked on a project that did map algebra computations on raster data, a natural fit for graphics cards, because they are built to handle 2D and 3D cell-based data structures. He reported on his results some time ago. I was able to work with feature class geometries, data structures with a less direct fit into the grid-cell based world.
The overall processing model is that a program running on the host machine passes off the data, stored as an array, to the memory on the GPU using CUDA API runtime functions. The GPU is called the “device”. The CPU machine is called the “host”. Code compiled for the GPU gets loaded on the GPU and run. The output data are written to device memory as they are processed, and then copied back to host memory. This is a bit like a batch-processing model, but there are applications that appear to work in real time. This model has changed somewhat with later NVIDIA CUDA drivers. The GPU can now do direct reads and writes to host memory. In other words, the code running on the GPU can use a pointer to regular memory on the host. This eliminates the need to make the data transfer from host memory to device memory. The catch is that the host memory must be allocated as a block of what is called “pinned” or “page-locked” memory, that is, memory that cannot be paged out by the operating system.
The computational model is data-parallel, where a single instruction is performed on many data elements at the same time. Looking at it from the GPU, each input datum is processed by a single thread of execution, and all threads currently executing perform the same instruction at the same time. The acronym is, SIMT, Single Instruction Multiple Threads. In a perfectly data-parallel computation, the state of the datum on one thread cannot have any dependency on the state of a datum on another thread. Conditional branching is not allowed either, because some threads would be processing different instructions than other threads. In the face of an imperfect world, CUDA allows some degree of both inter-thread data dependency and conditional branching. The price is processing time, because either threads must be synchronized, or some threads have to be processed while other threads wait.
The CUDA compiler understands standard ANSII C and the several extensions to the language NVIDIA has added. The CUDA runtime functions can be called in a C++ application. Functions that are to be run on the GPU in SIMT mode are called “kernels”. A special declaration specifier is used when declaring the kernel function, and special syntax is used when calling the kernel function from the host. Other functions callable within a kernel when running on the GPU are defined with another special declaration specifier. These are called “device functions”. A kernel cannot call another kernel, nor can a device function executing on the GPU call another kernel. Memory cannot be allocated inside a kernel or device functions with malloc.
GPGPU with CUDA
To the uninitiated GPGPU with CUDA is a bit daunting. To the initiated, it still is not what could be called easily approachable. Experience and reading the programming manual multiple times are the great teachers. There are two APIs supported by CUDA. One is based on C with a few extensions added by NVIDIA, and is called “C for CUDA”. The other uses ANSI C without extensions, and is called the “Driver API”. The Driver API is finer-grained, and not necessary for most applications. There is a third way to generate object code for the GPU, and that is with CUDA’s instruction set architecture, ISA, called “PTX”. PTX is assembler for the NVIDIA GPU. The CUDA compiler, called “NVCC” produces PTX code from both C for CUDA and Driver API source, and then turns PTX into object code with the hooks necessary to make object code that will link with object code from the regular C/C++ compiler on the host. The complexity of the C for CUDA API stems from the different types of memory available for use on the GPU, the threading model, how the multiprocessor handles threads and instructions, and the interaction among the three. The complexity is there so that individual applications can obtain peak performance from the GPU – down to worrying about clock cycles and memory latency. Access times for each memory type, segment size, alignment of data, order in which data may be read or written, how and where data are read and written, and caching all can be taken into account.
At the hardware level there are multiple multiprocessors each of which has eight processing cores. The graphics card has a large amount of 32-bit addressable device memory, called “global” or “linear” memory. This is the target for any memory transfer from host to device. Small amounts of linear memory are set aside as read-only constant memory (writes are done from the host before kernel invocation), and as local memory which is opaquely allocated by the NVCC compiler. Global memory is also used to store textures, which are normally used for graphics, but can be exploited by GPGPU. Each multiprocessor has its own on-chip memory divided into shared memory filled from global memory, constant cache memory filled from constant memory, and texture cache memory filled from textures. Current NVIDIA architecture does not allow for caching of global memory at the multi-processor. Each processing core has its own memory called register memory. Register memory is divided up by the threads so that variables local to each thread are stored in register memory. If there are too many local variables to be stored in register memory, the NVCC compiler allocates local memory for the excess. Local memory access is as slow as global memory access. Each processing core clicks through program instructions on groups of 32 threads called a “warp”, and each thread must keep its own state in register variables.
The threading hierarchy is flexible and designed to let the allocation of threads fit the structure of the data and the computations using the data. Threads are allocated at runtime each time a kernel is called. Threads are allocated in a 1-, 2- or 3-dimensional group called a “thread block”. This lets the threads in a thread block be indexed in a typical i-j-k fashion to match input data arrays laid out as vectors, matrices, or fields. The maximum number of threads per thread block is 512. The CUDA driver matches thread blocks to processor cores and ultimately to warps which get scheduled by the SIMT thread scheduler. Thread blocks execute independently and in any order across any number of processor cores, but an individual thread block processes on one processor core. The number of registers required for automatic variables defined in device functions may limit the number of threads per block, because register memory is limited in size.
Any but the smallest problems will require more than one thread block. Thread blocks are arrayed in a two-dimensional “grid”. The size, in thread blocks, of each dimension of a grid is specified at runtime each time a kernel is called. It is not worth doing GPGPU unless there is a grid of thread blocks with at least twice as many thread blocks as there are processing cores. Because thread blocks are switched in and out of being processed on a core, the scheduler can mask latency of a global and local memory access on an individual thread block by switching in another thread block for processing while the first one waits for data to arrive.
Invoking the kernel tells the GPU how many threads to process, by specifying the number of thread blocks in the grid, and the number of threads per thread block. Kernel invocations are asynchronous, so processing on the host does not stop when a kernel is invoked. If the desire is for host processing to stop after the kernel is called, there is kernel-wide synchronization, meaning all thread blocks must complete the calculations being done on their threads, i.e. the kernel completes. Once synchronized, another kernel with another grid can be run. The new grid can share the results of the previous grid, because its thread blocks can access updated and synchronized global memory, and do further processing.
Every thread knows what thread block within a grid, and what thread within a thread block it is. This is done through internal read-only variables. The grid dimension variable, gridDim, is a built-in read-only variable with an x and y dimension that tells the number of thread blocks in the grid in each dimension. The block index, blockIdx, is a built-in read-only variable with an x and y component that locates a thread block within the grid. The block dimension variable, blockDim, is a built-in read-only variable with three components, x, y, and z, that return the width, height, and depth of a thread block in threads. The thread index variable, threadIdx, is a built-in read-only variable with three components, x, y, and z, that locates a thread within a thread block. All thread blocks in a grid have the same dimensions.
Thread blocks executing independently on their own processor core are not just for the convenience of the thread scheduler, they also offer a structured way for using fast, but limited memory called “shared memory”. Shared memory is located on the multiprocessors, and is analogous to CPU cache memory. Accessing shared memory is very fast, just a few clock cycles. Each thread block has access to its own shared memory, and cannot access shared memory on any other thread block. The lifetime of shared memory is that of its thread block. Shared memory allows the threads within the thread block to read and write data to each other.
The maximum on current devices is 512 threads per thread block. A grid can have up to 65,535 thread blocks in each x and y dimension, giving a possible 4GB of thread blocks.
For my first try at GPGPU, I chose geodesic point buffering. I made buffer polygons by computing points at a user-defined buffer distance on a series of azimuths clockwise around the origin point. Depending on the buffer distance, from 30 to 360 points are generated for each buffer. For origin points, I used several featureclasses in longitude-latitude coordinates on the WGS84 Geographic Coordinate System. The size of the featureclasses ranged from 1,400 to 1,500,000 records. The calculations for “point-at-distance-and-azimuth” and “distance, forward azimuth, and back azimuth between two points” used the Projection Engine.
I used Visual Studio 2008 to make an Active Template Library Simple COM object for the host code. This let me do the testing from a VBA application in ArcMap. Two other files were coded, and compiled using the CUDA compiler. One file had a single function called from the COM class. This function provides the interoperation between the host and the device. Interoperation comprises memory allocation on the GPU, filling GPU memory with input data, kernel invocation, and copying memory from the GPU containing the results of the computation back to host memory. The second file contained the code to be run on the GPU. This code comprised the kernel code and device functions called by the kernel. This structure allowed separate timing for the memory allocation and transfer, and separate timing for the kernel execution. The exact same code that was compiled for the GPU was also compiled into the COM class. This allowed separate timing for comparing buffer geometry generation on the GPU versus buffer geometry generation on the CPU. The output polygon featureclass was made from the original point featureclass using the Geoprocessing Buffer Tool. The geometries of the output featureclass were overwritten by the geometries made by the GPU and CPU buffer generation functions. The input point geometries were read from a featurecursor, and the output geometries were written by a featurecursor. This allowed separate timings of the Geoprocessing buffering, the CPU buffering from reading by the cursor through writing by the cursor, and the GPU buffering from reading by the cursor through writing by the cursor.
The hardware used was a Dell Precision T3400 Convertible MiniTower, Intel Dual Core E8500 3.16GHz 64-bit Processor with 13MB L2 Cache, 525W power supply, 4GB of 800MHz DDR2 ECC SDRAM Memory. An NVIDIA Quadro FX 500 video card was used for the display. An NVIDIA Quadro FX4800 GPU with 1.5 GB RAM, Compute Capability 1.3 for the GPU computing. NVIDIA GPUs with Compute Capability 1.3 can do native double precision arithmetic. Double precision is slower than arithmetic using the 4-byte float type, but it lets us use our data as they are stored. The well-equipped computer was nice to have, but the 525 W power supply was the necessary component in order to supply power to the Quadro FX4800. The Quadro FX4800 did not have a video monitor plugged into it. In the interest of getting some reliable timings, I stopped every service I could find that might suddenly start intensive processing on the machine, such as virus detection services. I did not remove the internet connection from the wall. The processes I had running were ArcMap, Excel, and Visual Studio. I used the debug build of the COM component, and ran ArcMap as a debugging session from Visual Studio. The NVCC compiler does not have an option to make a debug build of the device code. No other user-initiated processes were running.
The GPU cannot make function calls to the Projection Engine .dll directly, because the instruction sets are different. It was necessary to recompile the Projection Engine source with the NVCC compiler. I altered the Projection Engine source to remove code that kept newly computed longitudes within the range of -180 to 180 degrees. This eliminated the need for conditional branching on every input origin point to catch newly generated coordinates for a buffer that might cross the -180 or 180-degree meridians. Successive buffer coordinates with longitudes that switch from near -180 to near 180, or vice versa, make self-intersecting polygons that span the globe. Somewhat the same problem exists for buffers having an origin point at one of the Poles, or having an origin point near enough so that the resulting buffer encompasses one of the Poles. In this case, extra points need to be added to the buffers to keep them from self-intersecting and closing incorrectly. This means a branch in processing within individual threads must happen.
Threads are processed in batches of 32 called “warps.” When there is a branch, threads in one branch are processed while the other threads wait. If all the threads in a warp take the same branch then none will wait. Fortunately, origin points near or at the poles are rare, so not very many warps will split the processing paths for their threads.
Point buffering must map an input datum, the contents at an array index, to one thread. The buffer calculation on each thread produces many new points for each input point. That means there has to be a mapping between an input array index, a thread, and the output array indices. Fortunately, just which core of which multiprocessor executes the instructions on which warp comprised of which threads in which thread block is handled by the CUDA runtime. So is the scheduling for when and in what order the processing happens. That still leaves how the input point at a certain index in the input array gets assigned to a thread, and how the resulting output points get placed into the output array starting at the correct address. It would be nice if the point at index ‘i’ of the input array could be assigned to thread ‘i’ for processing and put into ‘beginning address’ + ‘i’ * ‘number-of-points-per-buffer’ of the output array.
Geodesic point buffering uses one-dimensional linear memory mapping. The origin points were read into a C-style array from the shape field of the point featureclass using a cursor. The input array type was a 16-byte CUDA runtime type called a “double2”. This is the same as a WKSPoint. An array of the same size and type was allocated on the device using a CUDA runtime function equivalent to malloc. The contents of the input array were transferred from the host to the device using another CUDA runtime function equivalent to memcpy. A second C-style array of unsigned char the same size as the number of input points was allocated on the host and the device. This array was used as a bitmap, to designate whether an input origin point was, at the North Pole, at the South Pole, within the buffer distance of the North Pole, within the buffer distance of the South Pole, or not any of these if no bits were set. The bits of each unsigned char in the bitmap array were set per each thread in parallel on the device, and copied back to the array allocated on the host. The bitmap is necessary, because output buffer polygons near or at the Poles will have an extra four points added. The number of output buffer polygons is the same as the number of input origin points, but they have many vertices each. The number of vertices depends on the buffer distance and proximity to the Poles. The number of vertices required for each buffer due to the buffer distance is known when memory is allocated before the kernel is run, but the number of vertices that need be added to the buffer polygon due to proximity to the Poles is not. That means memory for the extra four points per buffer (64 bytes) will be allocated even though the contents may remain undefined for most buffer polygons. The bitmap lets both the kernel and the host code know how to fill and read the output array. The buffer geometries are written to the output array on the device, and copied to the output array on the host, then written to the output featureclass using a cursor.
All buffer geometry calculations on the GPU used one-dimensional thread blocks of 128 threads. This was the largest number of threads, divisible by 32, that could be specified. Specifying thread blocks with more threads caused kernel invocation to fail. The limited amount of on-processor memory available was the cause. Core memory has to be shared with register memory for local/automatic variables, and the program instructions have to reside on core memory. Because of this limitation, as problem size grew all I could do is use more 128-thread thread blocks, so that a 1:1 correspondence between input points and threads could be maintained.
The programmer is responsible for constructing the correspondence between threads and the data the threads work on. In this application, the origin point at index ‘i’ in the input array, the unsigned char at index ‘i’ in the bitmap array, and the ‘i-th’ set of vertices in the output array all have to be the same ‘i’, but that doesn’t mean that the ‘i-th’ thread on the device has to process the ‘i-th’ index in the input arrays. It is just simpler to conceive of, and to write the code if it does. The kernel for geodesic point buffering was written this way.
Reads and writes to global memory can be speeded up if the 16 threads of a half-warp can be read from or written to contiguous locations in global memory, and the data at those locations is sized so that each read or write can fetch or put 32, 64, or 128 bytes at a time. This is called “coalesced” memory access. For example, a double2 type is 16 bytes. Each thread needs to read one double2 origin point when constructing a buffer. The thread ids for the threads on a half-warp will map to adjacent indices in the input origin-point array. That means that the 16 threads of a half-warp can have their input data read in two 128-byte fetches from global memory, instead of 16 16-byte fetches. That saves a couple thousand clock cycles. The writes from the half-warp to the output bitmap (unsigned char) could be done in a single 16-byte write, but coalesced reads and writes are done in chunks of 32 bytes at a minimum, so 16 bytes of the 32-byte transaction would be empty, and presumably not written, or the transaction many span the entire warp – the documentation is unclear.
Writes to the output vertex array happen inside a for-loop being executed by each thread. The for-loop generates buffer polygon vertex coordinates. Writes from the threads on the half-warp will not occur at adjacent addresses in the output array. This means that they will not be coalesced, so there will be 16 individual writes to global memory for each index in the for-loop when it is executed across a half-warp. This is not good, but there is not much that can be done. The use of shared memory to store the entire vertex set for each thread, and then writing to global memory from shared memory after the loop would help. This does not work, because the maximum size of shared memory on the processor is not large enough to store the output coordinates for every thread of a single half-warp, let alone all the warps in a thread block. Shared memory is shared among all the threads in a thread block. The maximum number of threads for a thread block is 512, so the number of warps may be as many as 16.
I buffered 14 different point featureclasses ranging from 1,439 to 1,452,501 records. For each featureclass, I made geodesic buffers of 30, 100, and 300 miles, which had 120, 180, and 360 vertices respectively. The number of vertices generated ranged from 478,080 to 522,887,400. Each buffer calculation was run on the CPU and the GPU using the same source code, so that timing of the buffer calculations could be directly compared. The buffer calculation on the GPU involved the extra step of transferring the input and output arrays to and from the GPU. The timing for I/O and the timing for GPU computation (running the kernel) were computed separately and summed for comparison to the CPU timing. The output buffer featureclass was made by using the Buffer Geoprocessing Tool on the input point featureclass. The time for the Buffer GP tool to run was also recorded.
As the featureclasses being buffered became larger, memory allocation problems arose. Getting memory on the GPU, which had 1.5 GB, was not the problem. Getting enough memory on the host machine, which had 4 GB, was. This happened as the input featureclasses neared 50,000 records which would need an output array of around 300MB when making 360-vertex buffers. At this point I changed the program to run the buffer polygon generation code on both the CPU and GPU in several iterations, and added a function that would repeatedly halve the amount of memory asked for upon a failure of a malloc. The allocated output array memory was filled repeatedly as batches of input points were given to the buffer polygon generation code. The allocated output array was filled as many as 64 times when working with the largest input featureclass. Later I changed to a recursive binary-search memory allocation function that allocated as much memory for the output array as possible. This reduced the number of times the output array had to be filled for the largest input featureclass from 64 to 42. The largest piece of contiguous memory I ever was able to allocate on the host machine was slightly more than 350MB.
Read – calculate – write times, i.e. geodesic point buffer featureclass generation, scaled linearly with the number of output vertices on the CPU. (see chart below)
Calculating just the geometry for a geodesic point buffer featureclass scaled linearly with the number of output vertices on the CPU. (see chart below)
Calculate geometry scaled linearly with the number of output vertices, but there was some variability at higher numbers of output vertices. This happened when the allocated memory on the host had to be filled repeatedly. Because two different memory allocation functions were used, different amounts of memory were allocated for the output arrays during different runs using the same input featureclass, even though the total number of output vertices was the same. The more memory that could be allocated meant the output vertex array had to be filled fewer times. Fewer times meant fewer memory transfers from host to device and back, and fewer kernel calls. This translated into less time needed to calculate the output geometries. This is especially evident in the three runs that made output geometries having 500-million vertices. Times for the three runs were; 29.9 seconds with 24 fills, 34.7 seconds with 42 fills, and 38.4 seconds with 64 fills. (see chart below)
Kernel part of calculate geometry scaled linearly with the number of output vertices. There was still some variability due to the number of times a kernel was run, but it was not nearly as much as the variability when the memory I/O is included. (see chart below)
CPU to GPU ratios
Except for output buffer featureclasses with relatively few records, the time to make a geodesic point buffer featureclass on the CPU took between 1.5 and 2 times as long as doing the same thing on the GPU. The scattering of GPU to CPU ratios show no pattern as the size of the output featureclass increased. (see chart below)
The same pattern happened when plotting the ratio of time to calculate the buffer geometries on the CPU to the same calculation on the GPU, leaving out database reads and writes. Aside from output featureclasses of small size, the ratio ranged between 21 and 33 with most falling between 25 and 30. The scattering of the ratios showed no pattern as the number of records in the output featureclass increased. Three of the four times the GPU was over 30 times faster than the CPU were for output featureclasses of less than 100-million vertices, but the highest ratio of over 33 times faster happened with the largest output featureclass that had over 500-million vertices. (see chart below)
Kernel to Total calculate geometry time (included memory copies)
Part of the time spent calculating buffer geometries on the GPU is spent transferring the input and output arrays to and from the memory on the GPU, and part of the time is spent doing the calculation. With relatively few output vertices, the amount of time transferring data is much greater than the time calculating the geometries. As there are more and more vertices, the ratio of data transfer time to computation time settles into a range where around 50% to 70% of the time is the kernel doing the geometry calculation. (see chart below).
I recorded the execution times for 445 kernel invocations that calculated between 120 and 21,652,200 vertices each. Execution times scale in a nearly linear fashion. (see chart below)
The Geoprocessing Buffer Tool makes geodesic point buffers when the input featureclass is in a Geographic Coordinate System. I used the Geoprocessing Buffer Tool to make all the output featureclasses, and then I changed the geometries with the geodesic buffering code run on the CPU and GPU. The chart below shows processing times for all three. The GP Buffer Tool was used twice for each test, once when making the output featureclass for running the buffering code on the CPU, and once when running the code on the GPU. The same very small buffer distance was used every time the GP Buffer Tool was run. The yellow triangles and black diamonds are the elapsed times for each time the GP Buffer Tool was run. The magenta squares are the elapsed times to make the geodesic buffers using the CPU. The cyan Xs are elapsed times to make the geodesic buffers using the GPU. The processing times for the CPU and GPU rise in a linear fashion as number of output vertices increase. The processing times for the GP Buffer Tool seem to be less consistent. A couple times, at around 200-million output vertices, and again at 500-million output vertices, the GP Buffer Tool was faster than the geodesic buffer code running on the CPU. Once, the GP Buffer Tool was nearly as fast as the buffer code running on the GPU. I cannot account for this behavior of the GP Buffer Tool.
The obvious and significant conclusion is that using the GPU as a coprocessor can speed up computation times for this particular data-parallel problem by a large amount, 25 to 30 times. Even when diluted by the Geodatabase cursor reads and writes, the time to make geodesic buffers is 1.5 to 2 times faster on the GPU than the same buffer computations done on the CPU. Times for the same code run on the CPU and GPU increased in a linear fashion as the number of output vertices increased. This linear increase persisted, with some variance due to multiple memory reads and writes and multiple kernel calls were needed, because of memory availability on the host computer. Processing times did not plateau, or have any range of output vertices where times increased or decreased more than the general linear trend.
I used the simplest CUDA implementation for this problem. It was to serve as a baseline for comparison to other implementations. No effort was made to reduce latency of reads and writes to global memory by figuring out ways to make coalesced writes, or by increasing the number of thread blocks used. More thread blocks can be used to help cover memory latency. I did not attempt to write the kernel and device functions using 32-bit floating point numbers instead of 64-bit floating point. Single precision math is about 8 times faster than double precision math. On-processor shared memory is very limited on current NVIDIA GPUs, so no attempt was made to restructure the problem so that it could use shared memory to avoid uncoalesced writes to global memory. I did not try to use access to page-locked memory on the host machine by the GPU. This would avoid the memory copies from host to device, which occupy from 30 to 50% of kernel computation time. I did not experiment with placement of conditional branching or with warp-vote functions to see if a single branch on large blocks of code is better than repeated branches inside loops within large blocks.
Contributed by Tom G.