
I. Introduction into GPU programming.
II. Exception safe dynamic memory handling in Cuda project.
1. Allocating and deallocating device memory. ots::cuda::memory::Device class.
2. Accessing device memory. ots::cuda::memory::Host class.
3. Crossing host/device boundary. ots::cuda::memory::DeviceHandler class.
4. Accessing memory in __device__ code. ots::cuda::memory::Block class.
5. Handling two dimensional memory blocks. Do not use cudaMallocPitch.
6. Allocation of memory from Host scope.
7. Tagged data. Compiler assisted data verification.
III. Calculation of partial sums in parallel.
IV. Manipulation of piecewise polynomial functions in parallel.
V. Manipulation of localized piecewise polynomial functions in parallel.

Accessing memory in __device__ code. ots::cuda::memory::Block class.


As explained in the previous section, a kernel function that operates on a device memory block receives information about that block in the form of a DeviceHandler instance. For example,

__global__ void MyKernel( ots::cuda::memory::DeviceHandler<int> handler )
{
    //some code
}

Once inside the kernel, the DeviceHandler instance needs to be passed to the constructor of the decorator class Block. For example,

__global__ void MyKernel( ots::cuda::memory::DeviceHandler<int> handler )
{
    ots::cuda::memory::Block<int> block(handler);
    //some code
}

The Block instance may then be used to access the memory. The following is the prototype of the class Block.

template <typename dataType>
class Block
{
public:

    typedef dataType type;
    typedef typename ots::config::Index::type index;
    typedef Block<type> block;
    typedef Handler<type> handler;
    typedef ::ots::cuda::io::device::OStream ostream;

private:

    type* theData;
    index theSize;

public:

    __device__ Block() : theData(NULL), theSize(0) {}

    __device__ explicit Block( const handler& x )
        : theData(x.theData), theSize(x.theSize) {}

    __device__ Block( const block& x )
        : theData(x.theData), theSize(x.theSize) {}

    __device__ block& operator=( const block& x )
        { theData=x.theData; theSize=x.theSize; return *this; }

    __device__ Block( index size, type* data )
        : theData(data), theSize(size) {}

    __device__ bool isValid() const { return theData!=NULL && theSize>0; }

    __device__ index size() const { return theSize; }

    __device__ type get( index pos ) const { return theData[pos]; }

    __device__ void put( index pos, type x ) { theData[pos]=x; }

    __device__ bool isValidIndex( index pos ) { return 0<=pos && pos<theSize; }

    __device__ void copyFrom( const block& b )
    {
        index m=min(theSize,b.theSize);
        for( index i=0; i<m; ++i )
            theData[i]=b.get(i);
    }

    __device__ void assign( type x )
    {
        for( index i=0; i<theSize; ++i )
            theData[i]=x;
    }

    __device__ void print( ostream& os ) const
    {
        namespace out=::ots::cuda::io::device;
        out::Endl endl;
        os<<"Block<dataType>"<<endl;
        os<<"size="<<theSize<<endl;
        for( index i=0; i<theSize; ++i )
            os<<theData[i]<<endl;
        os<<"end Block<dataType>"<<endl;
    }
};

Note that we make no attempt to control possible index-out-of-bounds situations. This is deliberate: device code should contain minimal flow control, so bounds should be checked from host code. The correctness of such checks may be verified by running the cuda-memcheck utility.





Copyright 2007