Visitor yurtesen
676 Views
Registered: 11-13-2018

OpenCL Vector Add Example question

Hello,

I am new to FPGAs. I have been looking at the OpenCL examples and have a few questions. The example vector addition kernel looks like this:

#define BUFFER_SIZE 256
kernel __attribute__((reqd_work_group_size(1, 1, 1)))
void vector_add(global int* c,
                global const int* a,
                global const int* b,
                       const int n_elements)
{
    int arrayA[BUFFER_SIZE];
    int arrayB[BUFFER_SIZE];
    for (int i = 0 ; i < n_elements ; i += BUFFER_SIZE)
    {
        int size = BUFFER_SIZE;
        if (i + size > n_elements) size = n_elements - i;
        readA: for (int j = 0 ; j < size ; j++) arrayA[j] = a[i+j];
        readB: for (int j = 0 ; j < size ; j++) arrayB[j] = b[i+j];
        vadd_writeC: for (int j = 0 ; j < size ; j++) c[i+j] = arrayA[j] + arrayB[j];
    }
}

 

1- The host code uses `enqueueTask`, which runs the kernel as a single work-item. Why not use `enqueueNDRangeKernel`? Does the vector addition even get parallelized when run like this?

2- What is the purpose of `readA` and `readB`? Why not directly compute `c[i+j] = a[i+j] + b[i+j]`?

3- Does this have something to do with how an FPGA should be used with OpenCL? When I program vector addition on a GPU, I would simply do:

    int i = get_global_id(0);
    c[i] = a[i] + b[i];

and the kernel would be run with a global size that matches the length of the vectors.

4- Why is the FPGA example so different? Are there any pointers or documentation about this?

Thanks!

3 Replies
Xilinx Employee
672 Views
Registered: 07-16-2008

Re: OpenCL Vector Add Example question

I'd recommend that you take a look at the SDAccel Optimization Guide:

http://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_2_xdf/ug1207-sdaccel-optimization-guide.pdf

 

You must send the command to start the kernel in one of two ways:

• Using the clEnqueueNDRangeKernel API for the data parallel case.
• Using the clEnqueueTask API for the task parallel case.

For the data parallel case, Xilinx recommends that you carefully choose the global and local work sizes for your host code and kernel so that the global work size is a small multiple of the local work size. Ideally, the global work size is the same as the local work size.

For the task parallel case, Xilinx recommends that you minimize the calls to clEnqueueTask. Ideally, you should finish the entire workload in a single call to clEnqueueTask.

clEnqueueTask is equivalent to calling clEnqueueNDRangeKernel with work_dim = 1, global_work_offset = NULL, global_work_size[0] set to 1, and local_work_size[0] set to 1.
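
To make the equivalence above concrete, here is a minimal host-side model in plain C. It is not OpenCL API code, and the names `model_ndrange_1d`, `model_task`, `vadd_per_item`, and `vadd_single_task` are hypothetical, used only for illustration. The model treats a 1-D NDRange launch as a loop that calls the kernel body once per global ID, and a task as the same launch with a global size of 1. Under that model, a GPU-style body handles one element per work-item, while a task-style body must loop over all elements itself.

```c
#include <stddef.h>

/* Arguments shared by both kernel styles (hypothetical struct). */
struct vadd_args {
    int *c;
    const int *a;
    const int *b;
    int n_elements;
};

/* In this model, a kernel body receives its global ID and its arguments. */
typedef void (*kernel_body)(size_t global_id, struct vadd_args *args);

/* Model of a 1-D NDRange launch: one call of the body per global ID.
   A real device may run these calls in parallel; here they run in order. */
static void model_ndrange_1d(kernel_body body, size_t global_size,
                             struct vadd_args *args)
{
    for (size_t gid = 0; gid < global_size; gid++)
        body(gid, args);
}

/* Model of a task launch: an NDRange with exactly one work-item. */
static void model_task(kernel_body body, struct vadd_args *args)
{
    model_ndrange_1d(body, 1, args);
}

/* GPU style: each work-item adds the single element at its global ID. */
static void vadd_per_item(size_t global_id, struct vadd_args *args)
{
    args->c[global_id] = args->a[global_id] + args->b[global_id];
}

/* Task style: the only work-item loops over the whole vector. */
static void vadd_single_task(size_t global_id, struct vadd_args *args)
{
    (void)global_id; /* always 0 for a task */
    for (int i = 0; i < args->n_elements; i++)
        args->c[i] = args->a[i] + args->b[i];
}
```

Calling `model_ndrange_1d(vadd_per_item, n, &args)` and `model_task(vadd_single_task, &args)` fills `c` with the same values; the difference is only where the loop over the elements lives.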

 

The purpose of "readA" and "readB" is to increase the performance of the kernel.

If the kernel read the a and b values directly from global (DDR) memory in every loop iteration, the data transfer would be very inefficient. The kernel therefore prefetches a block of the a and b vectors into private memory (implemented as BRAM) before the calculation.
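
As a rough illustration of the prefetch pattern described above, the following plain C function mirrors the kernel's structure on the host: copy a block of `a` and `b` into small local arrays, add them, and write the block of `c`. The function name `vector_add_blocked` is hypothetical, and ordinary C arrays stand in for BRAM, so this only checks the indexing and the handling of a partial last block; it says nothing about FPGA timing. The usual rationale, stated here as an assumption about the tool rather than something confirmed in this thread, is that a simple loop over consecutive addresses of one array can be turned into a burst transfer by the compiler.

```c
#define BUFFER_SIZE 256

/* Host-side model of the blocked kernel: prefetch, then compute. */
static void vector_add_blocked(int *c, const int *a, const int *b,
                               int n_elements)
{
    int arrayA[BUFFER_SIZE]; /* stands in for private memory (BRAM) */
    int arrayB[BUFFER_SIZE];

    for (int i = 0; i < n_elements; i += BUFFER_SIZE) {
        /* The last block may be shorter than BUFFER_SIZE. */
        int size = BUFFER_SIZE;
        if (i + size > n_elements)
            size = n_elements - i;

        /* Prefetch: each loop touches one array at consecutive addresses. */
        for (int j = 0; j < size; j++)
            arrayA[j] = a[i + j];
        for (int j = 0; j < size; j++)
            arrayB[j] = b[i + j];

        /* Compute from the local copies and write the result block. */
        for (int j = 0; j < size; j++)
            c[i + j] = arrayA[j] + arrayB[j];
    }
}
```

A host test bench could call `vector_add_blocked` on the same inputs as the device kernel and compare the two output vectors element by element.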

You can also try to apply dataflow optimization to further increase performance.
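
One way to picture the dataflow optimization mentioned above is to split each block's work into three stages that pass data through small buffers: read, add, and write. The sketch below is plain C that runs sequentially on the host, so it shows only the code structure, not any concurrency; the stage and function names are hypothetical. In an actual SDAccel kernel the stages would additionally be marked for dataflow (the SDAccel documentation describes an `xcl_dataflow` attribute for this), and the guide should be checked for the exact placement and restrictions.

```c
#define BUFFER_SIZE 256

/* Stage 1: read one block of both inputs into local buffers. */
static void read_stage(const int *a, const int *b, int offset, int size,
                       int *bufA, int *bufB)
{
    for (int j = 0; j < size; j++) {
        bufA[j] = a[offset + j];
        bufB[j] = b[offset + j];
    }
}

/* Stage 2: add the two local buffers into a local result buffer. */
static void add_stage(const int *bufA, const int *bufB, int size, int *bufC)
{
    for (int j = 0; j < size; j++)
        bufC[j] = bufA[j] + bufB[j];
}

/* Stage 3: write the local result buffer back to the output vector. */
static void write_stage(int *c, int offset, int size, const int *bufC)
{
    for (int j = 0; j < size; j++)
        c[offset + j] = bufC[j];
}

/* Sequential host model: the three stages run one after another per block.
   A dataflow implementation could overlap them; this model does not. */
static void vector_add_staged(int *c, const int *a, const int *b,
                              int n_elements)
{
    int bufA[BUFFER_SIZE];
    int bufB[BUFFER_SIZE];
    int bufC[BUFFER_SIZE];

    for (int i = 0; i < n_elements; i += BUFFER_SIZE) {
        int size = BUFFER_SIZE;
        if (i + size > n_elements)
            size = n_elements - i;

        read_stage(a, b, i, size, bufA, bufB);
        add_stage(bufA, bufB, size, bufC);
        write_stage(c, i, size, bufC);
    }
}
```

Keeping each stage in its own function with explicit input and output buffers makes the producer/consumer relationship between the stages visible, which is the property a dataflow optimization relies on.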

-------------------------------------------------------------------------
Don't forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------
Visitor yurtesen
625 Views
Registered: 11-13-2018

Re: OpenCL Vector Add Example question

Yes, OpenCL on FPGAs seems confusing :)

But the example does not appear to be task parallel. The SDAccel documentation says you need to use the xcl_dataflow attribute for task parallelism inside a kernel, and the example does not use it. As I understand it, the attribute should be placed just before the first loop:

    for (int i = 0 ; i < n_elements ; i += BUFFER_SIZE)

Otherwise, how will the compiler know which loops to parallelize?

About prefetching the data: the prefetch itself uses a loop, so you are still fetching one element at a time and storing the elements in a local array one by one. Shouldn't that give the same performance while only using more resources?

Also, if prefetching really helps, why doesn't the example write the results to a local `arrayC` in BRAM first and then copy them back to `c` in yet another loop?

The SDAccel programmers guide says: "Xilinx® recommends using the clEnqueueTask command to execute the kernel over the entire range of input data set using the maximum number of work group items". So is using clEnqueueNDRangeKernel not recommended?

Xilinx Employee
614 Views
Registered: 07-16-2008

Re: OpenCL Vector Add Example question

As you said, you can apply optimization techniques to the kernel code to increase performance, for example dataflow, loop pipelining, and burst data transfer.

Compare the profile reports before and after each change to see how much the performance improves.

 
