6. Kernel (functions) - Source Code

6.1. Include

include iostream: C++ input/output streams [3]

include cuda.h defines the public host functions and types for the CUDA driver API.

include cuda_runtime.h: defines everything cuda_runtime_api.h does, as well as built-in type definitions and function overlays for the CUDA language extensions and device intrinsic functions.

include device_launch_parameters.h: defines functions to launch the kernel on gpu

inlude cmath: The header cmath declares a set of functions to compute common mathematical operations and transformations.

include engine.h is necessary to make a connection to matlab.

include windows.h: is a Windows-specific header file for the C/C++ programming language. [6]

include fstream: This is an input/output stream class and it is necessary to operate on files.

include string: The header introduces string types, character traits and a set of converting functions

include myPerms.h: See 6.1.6

6.2. Connection to Matlab-Engine

The command Engine *m_pEngine = engOpen(NULL); starts a MATLAB engine session. It returns a pointer to an engine handle, or NULL if the open fails. If it does not return NULL, the command std::cout<<"Success. Matlab is there."<<endl; is executed. A window opens, which displays "Sucess. Matlab is there.". See also here.

6.3. Passing the constants from matlab to C++

In pragma region CopyConstantDataToHost the constants are read in from matlab.

Important functions are mxArray, engGetVariable, mxGetScalar, mxGetField and mxGetData.

6.4. Allocate GPU buffers

cudaMalloc (devP, size) allocates size bytes of linear memory on the device and returns in *devP a pointer to the allocated memory. The allocated memory is suitably aligned for any kind of variable. The memory is not cleared. cudaMalloc() returns cudaErrorMemoryAllocation in case of failure. [5]

Parameters:

devP - Pointer to allocated device memory

size - Requested allocation size in bytes

Important: Please pay attention to the difference between single and double precision!

6.5. Copy data to GPU

cudaMemcpy() copies count bytes from the memory area pointed to by source memory adress to the memory area pointed to by the destination with a special type of transfer (here: cudaMemcpyDeviceToHost). [4]

Important: Please pay attention to the difference between single and double precision!

6.6. Launching the kernel on GPU

The number of interation steps are calculated from NumLogs and NumStepsPerLog. These parameters are set in SimConfig_dp_dist.m.

NumLogs $ \cdot $ NumStepsPerLog $ \cdot $ 7 = number of iterations steps

The following table shows the work mechanism of the storage of each iteration step. As you can see it is a permutation. Each data move one step to the left, the oldest data on the left side is deleted and the new data is saved on the right side.

$iteration\ step\\memory\ adress$ 1 2 3 4 5 6
1 $f_0^{t_0}$ $f_1^{t_0+\Delta t}$ $f_2^{t_0+2\Delta t}$ $f_3^{t_0+3\Delta t}$ $f_4^{t_0+4\Delta t}$ $f_5^{t_0+5\Delta t}$
2 $f_1^{t_0+\Delta t}$ $f_2^{t_0+2\Delta t}$ $f_3^{t_0+3\Delta t}$ $f_4^{t_0+4\Delta t}$ $f_5^{t_0+5\Delta t}$ $f_0^{t_0+6\Delta t}$
3 $f_2^{t_0+2\Delta t}$ $f_3^{t_0+3\Delta t}$ $f_4^{t_0+4\Delta t}$ $f_5^{t_0+5\Delta t}$ $f_0^{t_0+6\Delta t}$ $f_1^{t_0+7\Delta t}$
4 $f_3^{t_0+3\Delta t}$ $f_4^{t_0+4\Delta t}$ $f_5^{t_0+5\Delta t}$ $f_0^{t_0+6\Delta t}$ $f_1^{t_0+7\Delta t}$ $f_2^{t_0+8\Delta t}$
5 $f_4^{t_0+4\Delta t}$ $f_5^{t_0+5\Delta t}$ $f_0^{t_0+6\Delta t}$ $f_1^{t_0+7\Delta t}$ $f_2^{t_0+8\Delta t}$ $f_3^{t_0+9\Delta t}$
6 $f_5^{t_0+5\Delta t}$ $f_0^{t_0+6\Delta t}$ $f_1^{t_0+7\Delta t}$ $f_2^{t_0+8\Delta t}$ $f_3^{t_0+9\Delta t}$ $f_4^{t_0+10\Delta t}$
7 $f_0^{t_0+6\Delta t}$ $f_1^{t_0+7\Delta t}$ $f_2^{t_0+8\Delta t}$ $f_3^{t_0+9\Delta t}$ $f_4^{t_0+10\Delta t}$ $f_5^{t_0+11\Delta t}$
In a three nested for-loop the main kernels are executed.

To understand these kernel, we have to take a look at the paramters in the angle brackets. The first number in those parameters represents the number of parallel blocks in which we would like the device to execute our kernel. In this case, we’re passing the value N for this parameter. For example, if we launch with kernel<<<N,1>>>(), you can think of the runtime creating N copies of the kernel and running them in parallel. We call each of these parallel invocations a block. The CUDA runtime allows these blocks to be split into threads. Inside the angle brackets, the second parameter actually represents the number of threads per block we want the CUDA runtime to create on our behalf. [7]

übersichtsschema.jpg
CurrentTotal<<< 1 , 1 >>>(d_Ik, d_I, d_dI);

RunIONBash6<<< SIM.NumBlocks, SIM.SizeLocal, SIM.SizeLocal_bytes >>>
         (d_xP,
         d_yP,
         d_z[perm_curr[6]], 
         d_x[perm_curr[5]], 
         d_y[perm_curr[5]], 
         d_z[perm_curr[5]], 
         d_y[perm_curr[4]],
         d_z[perm_curr[4]],
         d_y[perm_curr[3]],
         d_z[perm_curr[3]],
         d_y[perm_curr[2]],
         d_z[perm_curr[2]],
         d_y[perm_curr[1]],
         d_z[perm_curr[1]],
         d_y[perm_curr[0]],
         d_U[perm_curr[5]], 
         SIM.e_2 );

RunRLCBash6<<< 1, 1 >>>(d_UP,
         d_VP,
         d_g[perm_curr[6]],
         d_U[perm_curr[5]],                  
         d_V[perm_curr[5]],
         d_g[perm_curr[5]],
         d_V[perm_curr[4]],   
         d_g[perm_curr[4]],
         d_V[perm_curr[3]],   
         d_g[perm_curr[3]],
         d_V[perm_curr[2]],   
         d_g[perm_curr[2]],
         d_V[perm_curr[1]],
         d_g[perm_curr[1]],   
         d_V[perm_curr[0]],               
         d_I, 
         d_dI);

Current<<< SIM.NumBlocks, SIM.SizeLocal >>>
         (d_xP, 
         d_yP,
         d_Ik);

CurrentTotal<<< 1 , 1 >>>(d_Ik, d_IP, d_dIP);

RunIONBash6_P<<< SIM.NumBlocks, SIM.SizeLocal, SIM.SizeLocal_bytes >>>
         (d_x[perm_curr[6]], 
         d_y[perm_curr[6]], 
         d_z[perm_curr[6]], 
         d_x[perm_curr[5]], 
         d_y[perm_curr[5]], 
         d_z[perm_curr[5]], 
         d_y[perm_curr[4]],
         d_z[perm_curr[4]],
         d_y[perm_curr[3]],
         d_z[perm_curr[3]],
         d_y[perm_curr[2]],
         d_z[perm_curr[2]],
         d_y[perm_curr[1]],
         d_xP,
         d_yP,
         d_UP, 
         SIM.e_2 );

RunRLCBash6_P<<< 1, 1 >>>
         (d_U[perm_curr[6]],
         d_V[perm_curr[6]],
         d_g[perm_curr[6]],
         d_U[perm_curr[5]],                  
         d_V[perm_curr[5]],
         d_g[perm_curr[5]],
         d_V[perm_curr[4]],   
         d_g[perm_curr[4]],
         d_V[perm_curr[3]],   
         d_g[perm_curr[3]],
         d_V[perm_curr[2]],   
         d_g[perm_curr[2]],
         d_V[perm_curr[1]],
         d_UP,
         d_VP,
         d_IP, 
         d_dIP);

The function cudaDeviceSynchronize(); is between each of these kernel. It blocks until the device has completed all preceding requested tasks. cudaDeviceSynchronize() returns an error if one of the preceding tasks has failed. If the cudaDeviceScheduleBlockingSync flag was set for this device, the host thread will block until the device has finished its work.

A detailed explanation of the kernels can be found in the next section.

Important: Please pay attention to the difference between single and double precision!

6.7. Copy data to CPU

(see 6.5) Type of transfer (here: cudaMemcpyHostToDevice)

6.8. Save the Data

The function myFile_Ik.write(reinterpret_cast(SIM.Ik), sizeof(double)*SIM.SizeGlobal) shall attempt to write sizeof(double)*SIM.SizeGlobal bytes from the buffer to the file Ik (e.g. C:\Users\DarkNemo\Documents\MATLAB\CUDA\Cloud8\Output1). Analog for x, y, I, U and V. Then all files will ne closed.

//Pos5

ofstream myFile_Pos5((PATH.NewInputPath +"\\Pos5.bin").c_str(), ios::out | ios::binary );

gpuErrchk(cudaMemcpy( SIM.x5, d_x[perm_curr[6]],

sizeof(double)*SIM.SizeGlobal*3, cudaMemcpyDeviceToHost));

myFile_Pos5.write( reinterpret_cast(SIM.x5), sizeof(double)*SIM.SizeGlobal*3 );

myFile_Pos5.close();

-> analog: Velo0 bis Velo5, I0, U3, V0 bis V5, (e.g. C:\Users\DarkNemo\Documents\MATLAB\CUDA\Cloud8\Input2).

Important: Please pay attention to the difference between single and double precision!

6.9. Delete data and free up disk space

delete SIM.Ik;

engClose(m_pEngine); //Quit MATLAB engine session

Back Table of Contents Forward


-- EvaMartenstein - 20 Apr 2015
Topic revision: r19 - 2015-06-05, EvaMartenstein
 
This site is powered by FoswikiCopyright © by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
Ideas, requests, problems regarding GSI Wiki? Send feedback
Imprint (in German)
Privacy Policy (in German)