Writing Efficient Code for OpenCL Applications

This is part 2 of the OpenCL Webinar Series put out by Intel. There was some good information about optimization in general, but a lot of it focused on how to optimize OpenCL running on 3rd Gen Intel Core processors. I jotted down some optimization notes that are applicable to all processors.

Avoid Invariants in the OpenCL Kernel
Anything that is independent of the kernel's execution (invariants, constants) should be computed once on the host and passed in as an argument.

Bad:
__kernel void test1 (__global int* data, int2 size, int base) {
       int offset = size.y * base + size.x;
       float offsetF = (float)offset;
       …
}

Good:
__kernel void test1 (__global int* data, int offset, float offsetF) {
      …
}

Avoid Initialization in the Kernel
Move one-time initialization to the host or into a separate kernel.

Bad:
__kernel void something (__global int* data) {
         size_t tid = get_global_id(0);
         if (0 == tid) {
                 //Do Something
         }
        barrier(CLK_GLOBAL_MEM_FENCE);
}
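One way to restructure this is to split the one-time setup into its own tiny kernel (enqueued once before the main kernel, or replaced entirely by a host-side clEnqueueWriteBuffer), so the main kernel needs no barrier. A sketch, with hypothetical kernel names:

```c
// Enqueued once, before the main kernel.
__kernel void init_data(__global int* data) {
        // Do the one-time setup here.
        data[0] = 0;
}

// The main kernel can now assume data is initialized; no barrier needed.
__kernel void something(__global int* data) {
        size_t tid = get_global_id(0);
        // ...work on data[tid]...
}
```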

Use the Built-in Functions
e.g. dot, hypot, clamp, etc.
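For example, instead of hand-rolling a dot product or a manual min/max clamp, the built-ins usually map to optimized instructions. A sketch:

```c
__kernel void builtins_demo(__global const float4* a,
                            __global const float4* b,
                            __global float* out) {
        size_t i = get_global_id(0);
        // dot() instead of a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w
        float d = dot(a[i], b[i]);
        // clamp() instead of fmin(fmax(d, 0.0f), 1.0f)
        out[i] = clamp(d, 0.0f, 1.0f);
}
```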

Trade Accuracy for Speed
If the output is still accurate enough, look at the performance and use "mad" or "native_sin".
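A sketch of both: mad(a, b, c) may be faster than a * b + c at some cost in rounding guarantees, and native_sin() trades precision for speed relative to sin():

```c
__kernel void fast_math(__global const float* x, __global float* out,
                        float scale, float bias) {
        size_t i = get_global_id(0);
        // mad() may fuse the multiply-add; its accuracy is implementation-defined.
        float t = mad(x[i], scale, bias);
        // native_sin() is faster but less precise than sin().
        out[i] = native_sin(t);
}
```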

Get rid of edge conditions

Bad:
__kernel void myKernel(__global int* data, int maxRow, int maxCol){
         int row = get_global_id(0);
         int col = get_global_id(1);

         if (row > maxRow){
                 …
         }
         else if (col > maxCol) {
                  …
         }
         else{
                  …
         }
}

Use the NDRange to work within a smaller range and avoid the conditionals.
Use padded buffers so every work-item has valid data to operate on.

Get rid of edge conditions in general
Use logical operations instead of comparisons and branches.
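OpenCL's select() built-in lets you replace a branch with straight-line code: select(x, y, c) returns y where the condition is true and x where it is false. A sketch:

```c
__kernel void branchless(__global const float* a, __global const float* b,
                         __global float* out) {
        size_t i = get_global_id(0);
        // Instead of: out[i] = (a[i] > b[i]) ? a[i] : b[i];
        out[i] = select(b[i], a[i], a[i] > b[i]);
}
```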

Reduce number of registers to increase parallelism

Avoid Byte/Short Loads and Stores
Load and store in larger chunks,
e.g. uchar -> uint4
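For example, instead of reading one uchar per work-item, each work-item can read 16 bytes at once through a uint4-typed pointer (assuming the buffer size is a multiple of 16 bytes and suitably aligned). A sketch:

```c
// Slower: one byte per work-item.
__kernel void copy_bytes(__global const uchar* src, __global uchar* dst) {
        size_t i = get_global_id(0);
        dst[i] = src[i];
}

// Faster: 16 bytes per work-item, launched with 1/16th the work-items.
__kernel void copy_chunks(__global const uint4* src, __global uint4* dst) {
        size_t i = get_global_id(0);
        dst[i] = src[i];
}
```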

Don’t use too many barriers in the kernel
A barrier asks every work-item in the work-group to wait until all of them reach it.
If a work-item never reaches the barrier, the kernel may hang.

Use the vector types (float8, double4, int4, etc) for better performance on the CPU
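On the CPU, vector types map naturally onto SIMD registers (e.g. a float8 onto a 256-bit AVX register). A sketch of an element-wise scale using float8:

```c
__kernel void scale8(__global const float8* in, __global float8* out,
                     float factor) {
        size_t i = get_global_id(0);
        // One float8 operation can compile to a single SIMD instruction on CPUs.
        out[i] = in[i] * factor;
}
```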

The preferred work-group size for kernels is 64 or 128 work-items.
- Make the work-group size a multiple of 8.
- Query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE parameter by calling clGetKernelWorkGroupInfo.
- Match the number of work-groups to the number of logical cores.
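The query looks roughly like this on the host (a sketch with error handling omitted; the kernel and device are assumed to have been created elsewhere):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Assumes `kernel` and `device` were created earlier in the program. */
void print_preferred_multiple(cl_kernel kernel, cl_device_id device) {
    size_t preferred = 0;
    clGetKernelWorkGroupInfo(kernel, device,
                             CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                             sizeof(preferred), &preferred, NULL);
    printf("Preferred work-group size multiple: %zu\n", preferred);
}
```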

Buffer Objects vs. Image Objects
CPU device:
– Avoid using Image objects and use Buffer objects instead.
