ATI Stream and OpenCL Q&A Interview


Post by Apoptosis »

Simon Solotko & Ben Sander recently had a Q&A on the power of ATI Stream technology and the elegant, standards-based interface now available with OpenCL, which ships with ATI Stream SDK 2.0 for GPU.

Ben, what have we created with OpenCL and what does it do?

Ben: Sure, with OpenCL we created a C-based interface for programming a range of parallel processors. Developers write OpenCL Kernels, sub-routines which they seek to accelerate or offload, and embed these in their applications. OpenCL includes a runtime component which allows these OpenCL Kernels to be compiled at runtime for either a CPU or a GPU. AMD has contributed to the development of the OpenCL specification and written the implementation for x86 processors and GPUs - a runtime environment which compiles the code just ahead of execution, then schedules and executes it at runtime.
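To make the idea of a Kernel concrete, here is a minimal sketch of what one looks like in OpenCL-C. The kernel name and arguments are our own illustration, not from the interview; each work-item squares one element of an array.

```
/* Illustrative OpenCL-C kernel (names are hypothetical).
 * Each work-item handles one array element. */
__kernel void square(__global const float *in,
                     __global float *out)
{
    size_t i = get_global_id(0);  /* index of this work-item */
    out[i] = in[i] * in[i];
}
```

The same source string can be handed to the OpenCL runtime and compiled for whichever device - CPU or GPU - is present in the platform.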

What are the benefits of being able to compile an application for a CPU or a GPU?

Ben: Developers can write one piece of code and easily support a variety of compute devices in the platform - CPUs and GPUs, from multiple vendors. Code can be load-balanced between CPU and GPU depending on the capabilities of the final platform. For example, we expect that some applications, or parts of applications, will run faster on the CPU than the GPU, while others perform better on the GPU. Finally, the OpenCL CPU implementation leverages the CPU's hardware debug features to provide excellent debug capabilities, using familiar debug environments, at full CPU speed.

When exactly during runtime is the Kernel compiled?

Ben: There are specific commands within the body of your application which you call to compile the Kernel, and direct it to be compiled for the CPU or GPU. At that point, the Kernel code is translated into a binary. The binary later executes natively when the Kernel is called. The code is not interpreted in the hot spot of the loop, it's not like Java in that regard.
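The commands Ben mentions can be sketched with the standard OpenCL host API. This is an abbreviated fragment, not a complete program: it assumes `ctx`, `device`, and the kernel source string `src` already exist, and omits error handling.

```
/* Sketch of runtime compilation (assumes ctx, device, src; no error checks). */
cl_int err;
cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
err = clBuildProgram(prog, 1, &device, NULL, NULL, NULL); /* compile now, for this device */
cl_kernel k = clCreateKernel(prog, "square", &err);       /* binary runs natively when called */
```

Because `clBuildProgram` is an explicit call in the application, the developer controls exactly when the translation to a device binary happens; after that point the Kernel executes natively, with no interpretation in the hot loop.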

So the code within a Kernel looks like C but can be compiled to execute on the GPU?

Ben: Exactly. Because a GPU looks and functions differently from a CPU, however, you have to think differently when you write a Kernel for the GPU - at that point, you are executing your code directly on the GPU. There are constraints imposed on Kernel code to accommodate the specialized functionality of the GPU. Kernels are based on C99 with extensions provided by OpenCL-C for vectors and address spaces.
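The two extensions Ben names - vector types and address-space qualifiers - look like this in an illustrative kernel (the names here are ours, not from the interview):

```
/* Illustrative kernel showing two OpenCL-C extensions to C99:
 * vector types (float4) and address-space qualifiers (__global). */
__kernel void scale4(__global float4 *data, float factor)
{
    size_t i = get_global_id(0);
    data[i] *= factor;   /* one multiply operates on four floats at once */
}
```

Everything else in the body is ordinary C99, which is why the code reads like C even though it executes on the GPU.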

Give me some examples of the special ways in which the C code within a Kernel is different from the standard code in the body of the application?

Ben: To understand writing a Kernel it is important to understand that the code is actually executing on a GPU, despite the fact that the functions you are performing are syntactically the same as other C code. A GPU has a small, fast cache (local memory) and a larger main GPU memory (global memory). You move data in blocks, and complete as much of the task on that block as possible before moving the block out and moving the next block in. With a GPU we have a lot of compute bandwidth relative to memory bandwidth, making it advantageous to do as much as you can with the data while it is in the cache. With OpenCL the blocking process does not necessarily get easier, but you can control it from C code.

How do we move data from main memory to the GPU memory for use by a Kernel function?

Ben: A Kernel cannot copy data in from main memory; that is done in your application code. So there are standard functions to copy memory into GPU memory from the application, and pointers to this memory can then be passed to a Kernel function. The Kernel function can then copy memory into the fast cache, or "local" memory.
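On the host side, those standard functions are the OpenCL buffer calls. This fragment is a sketch, not a complete program: it assumes `ctx`, `queue`, `kernel`, `host_data`, and `n` already exist, and omits error handling.

```
/* Sketch of staging application data into GPU global memory. */
cl_int err;
size_t bytes = n * sizeof(float);
cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
err = clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, bytes,
                           host_data, 0, NULL, NULL);      /* copy to GPU memory */
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);     /* hand it to the Kernel */
```

Inside the Kernel, that buffer arrives as a `__global` pointer, and the Kernel itself decides what to stage into `__local` memory.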

This sounds a bit complicated, but I have to remind myself that this is all standard C code. We are discussing the optimizations that make something run fast on the GPU, and the memory management tools, now available within standard C through the OpenCL library, to do that.

Ben: That's right. The magic is that a Kernel is C code which the runtime component of OpenCL compiles to run on a GPU or CPU, with some extra tools to ensure it can take full advantage of the extremely high compute-to-memory-bandwidth ratio of the fast, parallel math engine of the GPU.

So as time goes on, we anticipate that people will write and optimize many useful Kernels which will simplify the development of complex applications?

Ben: Yes. It is relatively straightforward to port applications written for other GPGPU languages like Brook+ and CUDA to OpenCL. This is a huge step forward from proprietary GPU code: you now have a standard way to get at GPU compute and memory from C in a platform-independent way.
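For a sense of how mechanical such a port can be, here is an illustrative kernel with the common CUDA-to-OpenCL correspondences noted in comments (the kernel itself is our own example):

```
/* Rough CUDA-to-OpenCL correspondences when porting a kernel:
 *   CUDA:    int i = blockIdx.x * blockDim.x + threadIdx.x;
 *   OpenCL:  size_t i = get_global_id(0);
 * __global__ becomes __kernel, and __shared__ becomes __local. */
__kernel void copy_array(__global const float *src, __global float *dst)
{
    size_t i = get_global_id(0);
    dst[i] = src[i];
}
```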

With ATI Stream technology and the standardization of the programming model with OpenCL for GPU, almost any aspiring GPGPU developer can download the tools necessary to get started and develop platform-independent software fueled by the power of the evolved GPU. If you are ready to get started with OpenCL, you can begin with AMD's OpenCL resource page here.