Writing Efficient Code for OpenCL Applications

This is part 2 of the OpenCL Webinar Series put out by Intel. There was some good information about optimization in general, but a lot of it focused on how to optimize OpenCL running on the 3rd Gen Intel Core Processor. I jotted down some optimization notes that are applicable to all processors.

Avoid Invariants in OpenCL Kernels
Move anything that doesn’t depend on the kernel execution (invariants, constants) to the host.

Bad:
__kernel void test1 (__global int* data, int2 size, int base) {
       int offset = size.y * base + size.x;
       float offsetF = (float)offset;
       …
}

Good:
__kernel void test1 (__global int* data, int offset, float offsetF) {
      …
}

Avoid initialization in the Kernel
Move one-time initialization to the host or to a separate kernel.

Bad:
__kernel void something (__global int* data) {
         size_t tid = get_global_id(0);
         if (0 == tid) {
                 // Do something (one-time initialization)
         }
         barrier(CLK_GLOBAL_MEM_FENCE);
}

Use the Built-in Functions
e.g. dot, hypot, clamp, etc.

Trade Accuracy vs Speed
Once the output is correct, look at the performance. If you can tolerate reduced precision, use faster variants such as “mad” or “native_sin”.

Get rid of edge conditions

Bad:
__kernel void myKernel(__global int* data, int maxRow, int maxCol){
         int row = get_global_id(0);
         int col = get_global_id(1);

         if (row > maxRow){
                 …
         }
         else if (col > maxCol) {
                  …
         }
         else{
                  …
         }
}

Use the NDRange: launch the kernel over a smaller range so the conditionals are never needed.
Or use padded buffers, so that accesses past the edge land on valid padding instead of needing a check.
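
A minimal sketch of the padded-buffer idea, written as host-side Python rather than OpenCL so it can run on its own. The helper names (pad_grid, neighbor_sum) and the 3x3 neighborhood are assumptions made for illustration, not something from the webinar. The point is that once the host adds a border, the per-element code can read its neighbors without any edge checks.

```python
def pad_grid(grid, pad, fill=0):
    """Return a copy of a 2D list surrounded by `pad` cells of `fill`."""
    width = len(grid[0]) + 2 * pad
    padded = [[fill] * width for _ in range(pad)]          # top border
    for row in grid:
        padded.append([fill] * pad + list(row) + [fill] * pad)
    padded.extend([fill] * width for _ in range(pad))      # bottom border
    return padded


def neighbor_sum(padded, pad, row, col):
    """Sum the 3x3 neighborhood of cell (row, col) with no bounds checks.

    (row, col) are coordinates in the original, unpadded grid.
    """
    r, c = row + pad, col + pad
    return sum(padded[r + dr][c + dc] for dr in (-1, 0, 1) for dc in (-1, 0, 1))


grid = [[1, 2], [3, 4]]
padded = pad_grid(grid, pad=1)
corner = neighbor_sum(padded, 1, 0, 0)  # works even at the corner of the grid
```

The cost is a slightly larger buffer; the benefit is that every work item runs the same branch-free code.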

Get rid of edge conditions in general
Use logical ops instead of comparison/conditions.

Reduce number of registers to increase parallelism

Avoid Byte/Short Load and Stores
Load and store in larger chunks,
e.g. uchar -> uint4

Don’t use too many barriers in kernel
A barrier asks every work item in the work group to wait until all the others have reached it.
If some work item never reaches the barrier (for example, because it sits inside a conditional), the code may hang.

Use the vector types (float8, double4, int4, etc) for better performance on the CPU

Preferred work group size for kernels is 64 or 128 work items.
-Make the work group size a multiple of 8.
-Query the CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE parameter by calling clGetKernelWorkGroupInfo.
-Match the number of work groups to the number of logical cores.

Buffer Object vs. Image Objects
CPU Device:
– Avoid using Image Object and use Buffer object instead.

Brushing up on Probabilities, Localization, and Gaussian

Gaussian: It is a bell curve characterized by its mean and variance. It is unimodal and symmetric. The area under the Gaussian adds up to 1.
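
A small sketch of the one-dimensional Gaussian density, to make the properties above concrete. The function name and the numbers used are my own choices for illustration.

```python
import math


def gaussian(mu, sigma2, x):
    """Density of a 1D Gaussian with mean `mu` and variance `sigma2` at `x`."""
    return math.exp(-0.5 * (x - mu) ** 2 / sigma2) / math.sqrt(2.0 * math.pi * sigma2)


# Crude Riemann sum over [-20, 20] for mean 0 and variance 4:
# the area under the curve should come out very close to 1.
step = 0.01
area = sum(gaussian(0.0, 4.0, -20.0 + step * i) * step for i in range(4001))

# Symmetry: points equally far from the mean have the same density.
left, right = gaussian(10.0, 4.0, 8.0), gaussian(10.0, 4.0, 12.0)
```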

Variance: a measure of uncertainty. Large variance = more spread = more uncertain.

Bayes Rule
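
For reference, Bayes Rule in its general form, followed by the form used in grid localization. The symbols are my own choice of notation: X_i is the i-th grid cell and Z is the sensor measurement.

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
\qquad
P(X_i \mid Z) = \frac{P(Z \mid X_i)\, P(X_i)}{\sum_j P(Z \mid X_j)\, P(X_j)}
```

The denominator in the second form is just the normalizer that makes the probabilities over all cells sum to 1.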

Localization:

Involves “move” (motion) step and “sense” (measurement) step.

Motion (move): First the robot moves. We use convolution to get the probability that the robot ended up in each grid location. This applies the theorem of total probability: for each cell, sum over every possible previous location the probability of having been there times the probability of moving from there to the current cell.

Measurement (sense): Then the robot senses the environment. We multiply the prior probability of each grid location by the likelihood of the sensor reading at that location (one value if the cell matches the reading, a hit, and a smaller one if it doesn’t, a miss), then normalize so the probabilities sum to 1. This product followed by normalization is Bayes Rule.
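
A minimal sketch of the two steps on a one-dimensional cyclic grid, in the spirit of the course exercises. The world layout and the probabilities (p_hit, p_miss, p_exact, p_overshoot, p_undershoot) are made-up values for illustration.

```python
def sense(p, world, measurement, p_hit=0.6, p_miss=0.2):
    """Measurement step: multiply each cell by the likelihood, then normalize."""
    weighted = [
        prior * (p_hit if cell == measurement else p_miss)
        for prior, cell in zip(p, world)
    ]
    total = sum(weighted)
    return [w / total for w in weighted]


def move(p, step, p_exact=0.8, p_overshoot=0.1, p_undershoot=0.1):
    """Motion step: convolve the belief with an inexact motion model.

    The grid is cyclic, so indices wrap around.
    """
    n = len(p)
    return [
        p_exact * p[(i - step) % n]                # arrived exactly
        + p_overshoot * p[(i - step - 1) % n]      # moved one cell too far
        + p_undershoot * p[(i - step + 1) % n]     # moved one cell too little
        for i in range(n)
    ]


world = ["green", "red", "red", "green", "green"]
prior = [0.2] * 5                        # start with maximum uncertainty
after_sense = sense(prior, world, "red")
after_move = move(after_sense, 1)
```

Sensing sharpens the belief (the red cells gain probability), while moving blurs it again because the motion is uncertain.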

*Side note: for the grid-based localization method (histogram method), memory grows exponentially with the number of state variables (x, y, z, theta, roll, pitch, yaw, etc.)
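
A quick back-of-the-envelope illustration of that growth, assuming (hypothetically) 100 grid cells per state variable:

```python
def grid_cells(cells_per_dimension, num_state_variables):
    """Number of histogram cells needed for a full grid over the state space."""
    return cells_per_dimension ** num_state_variables


# 1, 2, 3 and 6 state variables at 100 cells each.
sizes = {d: grid_cells(100, d) for d in (1, 2, 3, 6)}
```

Going from three state variables to six takes the grid from a million cells to a trillion.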

 

Moonrise Kingdom

My roommate, Nicole, invited me to the showing of Moonrise Kingdom as part of her birthday celebration. I had seen the trailer and gathered that it’s a film about two kids in New England who decide to run away together. The film takes place in the sixties. I would have dismissed the movie if it weren’t for the cast and the pretty movie trailer (and the 94% Rotten Tomatoes rating also helped). I swept aside my Asian guilt (I thought about spending this evening coding) and joined her and her friends for the movie showing.

Moonrise Kingdom, whose title sounds like a Chinese martial arts movie, is so pretty. It’s like candy to your eyes and ears. It’s so pretty that I wanted to take frames out of it and post them around my room. Every scene is like a vintage Polaroid photo. An advertisement from the 60s. It’s as if Wes Anderson shot the movie through Instagram.

The plot is simple and yet each moment is full of innocence, wonderment, and adventure. And Wes Anderson portrays children as complete human beings with full mental faculties and emotional complexity! But frankly all that stands out in my memory is the color palette. The sepia and pink hues, the golden fields, washed out blues, red and green standing out against the desaturated background. There is such a decisive and consistent look throughout the film.

I guess one thing that slightly bothered me was the visual imagery of the night time scene. The rain is pouring and it’s the evening. And either the director or the post-production crew decided to blue-filter the sh#$ out of it. So everyone’s faces look blue. Like ghosts. I didn’t particularly like the look but maybe it was the look they were going for.

I appreciated that the film didn’t water down childhood. I find that childhood portrayed through Hollywood is either idealistic, fantastical, really sad and dreary, or at some other end of the spectrum. The subtlety and complexity are usually entirely missing. Wes Anderson deals with it very delicately, being careful not to shift to the extremes I mentioned. And the acting from these kids is amazing. I loved the gaze of the main character (the girl). It’s always distant and full of meaning, and we can never really guess all that goes on inside her head.

Overall, I loved it. Two hours well spent and coding could wait.

Intel SDK for OpenCL Applications Webinar Series 2012

Intel hosted a webinar on running OpenCL on Intel Core processors. The webinar I attended this morning (9am, July 11th) is the first of a three-part series on this topic. It was well organized and educational, and I think the next seminar will be even more useful (since it deals with programming using OpenCL). I took notes during the webinar to get you up to speed in case you want to attend the next two seminars.

* July 18: Writing Efficient Code for OpenCL Applications <http://link.software-dispatch.intel.com/u.d?V4GtisPHZ8Stq7_bNj1hJ=3231>
* July 25: Creating and Optimizing OpenCL Applications <http://link.software-dispatch.intel.com/u.d?K4GtisPHZ8Stq7_bNj1hS=3241>

OpenCL: Allows us to swap out loops with kernels for parallel processing.

Introduction: Intel’s 3rd Generation Core Processor.

  • Interoperability between CPUs and HD Graphics.
  • Device 1: maps to the four cores of the Intel processor (CPUs)
  • Device 2: Intel HD Graphics.
  • Allows access to all compute units available within the system (unified compute model – CPU and HD Graphics)
  • Good for multi-socket CPUs – if you want to divide the OpenCL code according to the underlying memory architecture.
  • Supported on Windows 7 and Linux.

General Electric’s use of OpenCL

  • GE uses OpenCL for image reconstruction for medical imaging (O(n^3) – O(n^4))
  • Need unified programming model for CPUs and GPUs
  • OpenCL is most flexible (across all CPU and GPUs) – good candidate for unified programming language.
  • Functional Portability: take OpenCL application and run it on multiple hardware platforms and expect it to produce correct results.
  • Performance Portability: functional Portability + Deliver performance close to entitlement performance (10-20%)
  • Partial Portability: functional Portability + only host code tuning is required.
  • Benefits of OpenCL:
    • C like language – low learning curve
    • easy abstraction of host code (developers focus on kernel only)
    • easy platform abstraction (don’t need to decide platform right away.)
    • development resource versatility (suitable for mult. platforms)
  • Uses combination of buffers (image buffers and their customized ones). Image buffers allow them to use unique part of GPU.
  • Awesome chart that compares various programming models:

Image courtesy of Intel SDK for OpenCL Webinar.

Intel OpenCL SDK: interoperable with Intel Media SDK with no copy overhead on Intel HD Graphics.

Intel Media SDK: hardware accelerated video encode/decode and predefined set of pre-processing filters

Thank you UC Berkeley Visual Computing Center for letting me know about this webinar series!

Caramel

Ever since I heard about this movie, a foreign film from Lebanon that takes place in Beirut and is written by, directed by, and starring Nadine Labaki (a female director to boot), I wanted to see it but kept putting it off. It came up again on my Netflix watch list so I finally decided to watch it tonight.

And it whirled me into its world: a small beauty salon in Beirut where people come to share their personal drama, joy, worries, and hopes. It’s a pastiche of different stories that involve women who work in the beauty salon and their frequent customers.

And you will never guess what the caramel is used for. Hint: it’s not for eating.

There are many poignant moments in the movie and what I appreciated the most was the honesty with which these poignant stories are shared. Unlike in Hollywood movies, where the story takes expected turns to satisfy the audience, Caramel has moments of disappointment, disillusionment, but ultimately hope. The beauty of life shown through its most mundane, everyday moments. I loved it.

I was reminded of a quote from Gogol’s Dead Souls, which strangely fits this movie:

And for a long time yet, led by some wondrous power, I am fated to journey hand in hand with my strange heroes and to survey the surging immensity of life, to survey it through the laughter that all can see and through the tears unseen and unknown by anyone.

AI Course on Udacity

I’ve been taking an artificial intelligence course on Udacity (http://www.udacity.com/courses), an online course taught by Sebastian Thrun. The course is called “Programming a Robotic Car”. One of my co-workers pointed out that the course covers exactly what I need to learn – probabilities, Kalman Filter, Particle Filter, and SLAM. I will be blogging about my progress with the course and the insights I picked up from it.

Unlike OpenCourseWare from MIT and other webcasts offered, Udacity is much more interactive. I didn’t find myself bored or distracted (though I’m taking a break right now to write this blog) because it has short quizzes (they are easy and very short) to recap the concepts covered in the video. The videos also focus on insight and don’t dwell on the mathematical formulation of the problem unless it’s absolutely necessary. And as a visual learner, I find Sebastian Thrun’s drawings very helpful in understanding the concepts.

I wish every class offered online were as good as these. I hope you find them useful.

Understanding FastSLAM

The SLAM I’m talking about has nothing to do with poetry or basketball. I’m “investigating” (read “learning on the fly”) the SLAM algorithm (Simultaneous Localization and Mapping). One of my co-workers forwarded me two papers that I should read (http://tinyurl.com/6vfvuxg), both of which are co-authored by Sebastian Thrun of Google X (clearly I am very excited to point this out). I think it’s pretty awesome that reading research papers is part of my job.

To understand FastSLAM (the version of SLAM in the papers), I needed to understand the particle filter and the Kalman Filter. Here are one-sentence summaries based on Wikipedia articles:

particle filter: Uses differently weighted samples of distribution to determine probability of an ‘event happening’ (some hidden parameter) at a specific time given all observations up to that time.

*note to self: similar to importance sampling: particle filter is more flexible for dynamic models that are non-linear.

Kalman Filter: Takes in a noisy input and, using various measurements (from sensors, control input, things known from physics), recursively updates the estimates (they call it the system’s state) to be more accurate. Example: a truck has a GPS that estimates its position to within a few meters. The estimate is noisy, but we can take into account the speed and direction over time (via wheel revolutions and the angle of the steering wheel) to update the estimated position to be more accurate.
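
A minimal one-dimensional sketch of that recursion, loosely following the truck example. The GPS readings, odometry steps and variances are made-up numbers, and the function names are my own. The update step fuses the current estimate with a noisy measurement; the predict step shifts the estimate by the known motion and adds the motion’s uncertainty.

```python
def update(mean, var, meas_mean, meas_var):
    """Fuse the current estimate with a measurement (product of two Gaussians)."""
    new_mean = (meas_var * mean + var * meas_mean) / (var + meas_var)
    new_var = 1.0 / (1.0 / var + 1.0 / meas_var)
    return new_mean, new_var


def predict(mean, var, motion, motion_var):
    """Shift the estimate by the motion; uncertainties add up."""
    return mean + motion, var + motion_var


# Start with a very uncertain position, then alternate GPS updates
# (variance 4.0) with odometry predictions (variance 2.0).
mean, var = 0.0, 1000.0
for gps, odometry in zip([5.0, 6.0, 7.0], [1.0, 1.0, 1.0]):
    mean, var = update(mean, var, gps, 4.0)
    mean, var = predict(mean, var, odometry, 2.0)
```

Note that the variance after an update is always smaller than either input variance, which is why combining the GPS with the motion information beats the GPS alone.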

*note to self: Kalman Filter assumes linear dynamics and Gaussian noise.

In terms of flexibility, it can be described this way (from least flexible to most):

Kalman Filter < Extended Kalman Filter < Particle Filter

FastSLAM is a Bayesian formulation. It essentially boils down to this:
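
For reference, the factored posterior as I understand it from the FastSLAM paper. The notation is an assumption on my part: s^t is the robot’s path up to time t, θ_k is the location of landmark k, and z^t, u^t, n^t are the histories of measurements, controls and data associations.

```latex
p(s^t, \theta \mid z^t, u^t, n^t)
  = p(s^t \mid z^t, u^t, n^t) \prod_{k=1}^{K} p(\theta_k \mid s^t, z^t, u^t, n^t)
```

The first factor is the path posterior and the product on the right is one independent estimator per landmark, conditioned on the path.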


The particle filter is used to estimate the path of the robot (it’s given by the posterior probability p(s^t | z^t, u^t, n^t), where the superscript t means the whole history up to time t). First, construct a temporary set of particles from the robot’s previous position and the control input. Then sample from this set with probability proportional to each particle’s importance factor (the particle’s weight). Finding the weight of each particle is quite involved. I’ll let you refer to the actual papers for the derivation.
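
A toy sketch of those two steps for a robot on a line. The weights here are simply made up; in FastSLAM they come from the derivation in the papers, which this sketch does not attempt. The function names and the noise level are also my own assumptions.

```python
import random


def propose(particles, control, noise_std, rng):
    """Build the temporary set: apply the control to each particle, plus noise."""
    return [p + control + rng.gauss(0.0, noise_std) for p in particles]


def resample(particles, weights, rng):
    """Draw a new set of the same size, with probability proportional to weight."""
    return rng.choices(particles, weights=weights, k=len(particles))


rng = random.Random(0)                 # fixed seed so the sketch is repeatable
previous = [0.0, 1.0, 2.0, 3.0]        # previous position hypotheses
temporary = propose(previous, control=1.0, noise_std=0.1, rng=rng)
weights = [0.0, 0.1, 0.6, 0.3]         # hypothetical importance weights
new_particles = resample(temporary, weights, rng)
```

Particles with high weight tend to be duplicated and particles with low weight tend to disappear, so the set concentrates on the likely paths.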

After we have path estimates, we can solve for the landmark location estimates (the right side of the equation). Through a series of equalities, the authors arrive at:


FastSLAM updates the above equation using the Kalman Filter.

The main advantage of FastSLAM is that it runs in O(M log K) time instead of O(MK), where M is the number of particles and K is the number of landmarks. I’ve had trouble understanding this part, but here goes: each particle contains the estimates of K landmarks (and each estimate is a Gaussian). Resampling particles requires copying the data inside the particle (K Gaussians if we have K landmarks). Instead of copying over all K landmark location estimates, FastSLAM does a partial copy of only the Gaussians that need to be updated. Also, the conditional independence of the landmark locations given the robot’s path makes it easy to set up for parallel computing.
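
Here is a sketch of how a partial copy can cost O(log K), based on my understanding that the paper stores each particle’s landmark Gaussians in a balanced binary tree. The code below is not the authors’ implementation: the Node class and helper names are made up, and plain tuples stand in for the Gaussians. Updating one landmark creates new nodes only along the path from the root to that leaf; every other subtree is shared with the parent particle.

```python
class Node:
    """Tree node: inner nodes have children, leaves hold a landmark estimate."""
    __slots__ = ("left", "right", "value")

    def __init__(self, left=None, right=None, value=None):
        self.left, self.right, self.value = left, right, value


def build(values, lo=0, hi=None):
    """Build a balanced tree whose leaves hold values[lo:hi]."""
    if hi is None:
        hi = len(values)
    if hi - lo == 1:
        return Node(value=values[lo])
    mid = (lo + hi) // 2
    return Node(build(values, lo, mid), build(values, mid, hi))


def get(node, index, size):
    """Look up landmark `index` in a tree built over `size` landmarks."""
    lo, hi = 0, size
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if index < mid:
            node, hi = node.left, mid
        else:
            node, lo = node.right, mid
    return node.value


def updated(node, index, value, lo, hi):
    """Return a new root with leaf `index` replaced, copying only one path."""
    if hi - lo == 1:
        return Node(value=value)
    mid = (lo + hi) // 2
    if index < mid:
        return Node(updated(node.left, index, value, lo, mid), node.right)
    return Node(node.left, updated(node.right, index, value, mid, hi))


K = 8
parent = build([("landmark", k) for k in range(K)])
# The resampled child observes landmark 5 and refines only that estimate.
child = updated(parent, 5, ("landmark", 5, "refined"), 0, K)
```

The parent particle is untouched, and the child shares the whole left half of the tree with it, so the copy costs about log2(K) new nodes instead of K Gaussians.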

Stay tuned for breakdown of FastSLAM2.0 …