I. Introduction into GPU programming.
1. What are GPU and CUDA?
2. Selecting GPU.
3. Setting up development environment.
4. Combined use of Cuda, C++ and boost::python.
5. Debugging of boost::python binary using Visual Studio.
6. Debugging of boost::python/Cuda binary using Visual Studio.
7. Using printf in device code.
II. Exception safe dynamic memory handling in Cuda project.
III. Calculation of partial sums in parallel.
IV. Manipulation of piecewise polynomial functions in parallel.
V. Manipulation of localized piecewise polynomial functions in parallel.
Downloads. Index. Contents.

What are GPU and CUDA?

A GPU is a graphics card with a programmable interface. Insert it into a slot on your motherboard, install the drivers, and you gain access to upward of 100 processing units sharing 1GB or more of random access memory. Tasks that used to take seconds now take milliseconds: you go from a noticeable delay to no delay at all. This is what happens even when GPU processing is an afterthought: for an average PC system, spend $400 on a mid-range GPU card and you are in a new world. It is also possible to assemble a system that drives a stack of high-end GPU units. For an investment of below $100K one can assemble a supercomputer capable of providing real-time quantitative information for industrial-size portfolios.

CUDA is a software interface for GPUs. Not every graphics card supports CUDA. CUDA allows GPU coding in a version of C++. It is possible to build applications that run concurrently on the CPU and one or more GPUs. There is a straightforward interface for memory exchange between the CPU and GPUs. For later versions of GPUs, there is an interface for memory mapping between the CPU and GPUs. There are barrier, event and stream-based synchronizations, atomic arithmetic, and constructs for seamless scalability. Threads are extremely lightweight. For example, it makes sense to create 256 threads to add two vectors in 256 dimensions.
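To make the last point concrete, here is a minimal sketch of vector addition with one thread per component. The names add_kernel, check and add_on_gpu are hypothetical and chosen for illustration only; error handling is deliberately simple and device buffers are not released if an error is thrown.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// Thread i computes component i of the sum; there is no loop in device code.
__global__ void add_kernel(const float* a, const float* b, float* c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard for sizes that are not a multiple of the block size
        c[i] = a[i] + b[i];
    }
}

// Hypothetical helper: turn a CUDA error code into a C++ exception on the host.
static void check(cudaError_t status)
{
    if (status != cudaSuccess) {
        throw std::runtime_error(cudaGetErrorString(status));
    }
}

// Hypothetical host wrapper: copy inputs to the GPU, launch, copy the result back.
// For brevity the device buffers are not released if an error is thrown.
std::vector<float> add_on_gpu(const std::vector<float>& a, const std::vector<float>& b)
{
    if (a.size() != b.size()) {
        throw std::invalid_argument("vectors must have the same size");
    }
    const int n = static_cast<int>(a.size());
    std::vector<float> c(a.size());
    if (n == 0) {
        return c;
    }
    const size_t bytes = a.size() * sizeof(float);

    float* d_a = nullptr;
    float* d_b = nullptr;
    float* d_c = nullptr;
    check(cudaMalloc(&d_a, bytes));
    check(cudaMalloc(&d_b, bytes));
    check(cudaMalloc(&d_c, bytes));
    check(cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice));
    check(cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;                         // threads per block
    const int blocks = (n + threads - 1) / threads;  // a single block when n == 256
    add_kernel<<<blocks, threads>>>(d_a, d_b, d_c, n);
    check(cudaGetLastError());

    // cudaMemcpy waits for the kernel to finish before copying.
    check(cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d_a));
    check(cudaFree(d_b));
    check(cudaFree(d_c));
    return c;
}
```

Calling add_on_gpu with two vectors of 256 elements launches exactly one block of 256 threads, one thread per dimension.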

One crucial feature is the presence of very fast cache memory (shared memory, in CUDA terminology) of significant size on every multiprocessor. For example, we no longer do matrix multiplication by applying the definition directly. Instead, we copy matrix blocks into this memory in parallel and then do block-matrix multiplication in parallel. For numerical techniques this means that methods based on triangular factorization are no longer a good way to solve equations because they are designed for sequential calculation. Matrix multiplication, on the other hand, is ideally suited to this technology.
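The block-matrix idea can be sketched as follows. This is a minimal illustration, assuming square row-major matrices and a tile size of 16; the names TILE, matmul_tiled, check and matmul_on_gpu are hypothetical, and the code is not tuned for any particular GPU.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

constexpr int TILE = 16;  // assumed tile size; each block has TILE x TILE threads

// C = A * B for square n x n row-major matrices.
// Each block copies one tile of A and one tile of B into shared memory,
// multiplies the tiles, and moves on to the next pair of tiles.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    const int row = blockIdx.y * TILE + threadIdx.y;
    const int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        const int a_col = t * TILE + threadIdx.x;
        const int b_row = t * TILE + threadIdx.y;
        // Every thread loads one element of each tile; out-of-range entries are zero.
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // both tiles are fully loaded

        for (int k = 0; k < TILE; ++k) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // all threads are done with the tiles before they are overwritten
    }
    if (row < n && col < n) {
        C[row * n + col] = sum;
    }
}

// Hypothetical helper: turn a CUDA error code into a C++ exception on the host.
static void check(cudaError_t status)
{
    if (status != cudaSuccess) {
        throw std::runtime_error(cudaGetErrorString(status));
    }
}

// Hypothetical host wrapper. Device buffers are not released if an error is thrown.
std::vector<float> matmul_on_gpu(const std::vector<float>& A, const std::vector<float>& B, int n)
{
    if (n <= 0 || A.size() != static_cast<size_t>(n) * n || B.size() != A.size()) {
        throw std::invalid_argument("expected two n x n matrices");
    }
    const size_t bytes = A.size() * sizeof(float);
    std::vector<float> C(A.size());

    float* d_A = nullptr;
    float* d_B = nullptr;
    float* d_C = nullptr;
    check(cudaMalloc(&d_A, bytes));
    check(cudaMalloc(&d_B, bytes));
    check(cudaMalloc(&d_C, bytes));
    check(cudaMemcpy(d_A, A.data(), bytes, cudaMemcpyHostToDevice));
    check(cudaMemcpy(d_B, B.data(), bytes, cudaMemcpyHostToDevice));

    const dim3 block(TILE, TILE);
    const dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(d_A, d_B, d_C, n);
    check(cudaGetLastError());

    check(cudaMemcpy(C.data(), d_C, bytes, cudaMemcpyDeviceToHost));
    check(cudaFree(d_A));
    check(cudaFree(d_B));
    check(cudaFree(d_C));
    return C;
}
```

Each element of A and B is read from global memory once per tile instead of once per multiplication, which is where the gain over the direct definition comes from.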

To understand the limitations of the technology one needs to understand the notion of a warp. Device (GPU) code is executed in groups of 32 threads (each group is called a "warp") controlled by the same command sequence. Thus, every flow control operation (if, while, for) has the potential to split the warp and introduce a substantial performance penalty. If too much such divergence is encountered, the CUDA runtime throws a "global stack overflow" exception. Such an exception requires restarting the CUDA runtime. Even though flow control commands are available in device code, the programmer is expected to shift most flow control to the host (CPU) and submit code with a minimal amount of flow control to the device. An elaborate example of such separation is presented in the section (Scalar product in N dimensions). Naturally, there is no exception throwing or handling in device code.
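One way to read this advice is sketched below: a sum of array elements in which the loop runs on the host and every kernel launch performs a single flat step. This is an illustrative assumption, not the implementation from the section referenced above; the names fold_step, check and sum_on_gpu are hypothetical.

```cuda
#include <cuda_runtime.h>
#include <stdexcept>
#include <vector>

// One folding step: the first `half` elements absorb the elements above them.
// The only flow control left in device code is the range guard.
__global__ void fold_step(float* data, int half, int len)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < half && i + half < len) {
        data[i] += data[i + half];
    }
}

// Hypothetical helper: turn a CUDA error code into a C++ exception on the host.
static void check(cudaError_t status)
{
    if (status != cudaSuccess) {
        throw std::runtime_error(cudaGetErrorString(status));
    }
}

// Hypothetical host wrapper. The device buffer is not released if an error is thrown.
float sum_on_gpu(const std::vector<float>& values)
{
    const int n = static_cast<int>(values.size());
    if (n == 0) {
        return 0.0f;
    }
    const size_t bytes = values.size() * sizeof(float);
    float* d_data = nullptr;
    check(cudaMalloc(&d_data, bytes));
    check(cudaMemcpy(d_data, values.data(), bytes, cudaMemcpyHostToDevice));

    const int threads = 256;
    // The loop lives on the host; each iteration launches one loop-free kernel.
    // Launches on the same stream run in order, so each step sees the previous result.
    for (int len = n; len > 1;) {
        const int half = (len + 1) / 2;  // number of elements that survive this step
        const int blocks = (half + threads - 1) / threads;
        fold_step<<<blocks, threads>>>(d_data, half, len);
        check(cudaGetLastError());
        len = half;
    }

    float result = 0.0f;
    check(cudaMemcpy(&result, d_data, sizeof(float), cudaMemcpyDeviceToHost));
    check(cudaFree(d_data));
    return result;
}
```

The trade-off is one kernel launch per step. Within a step every thread of a warp follows the same command sequence, except possibly in the last warp, where the range guard may switch some threads off.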

Copyright 2007