Registered: ‎09-02-2019

barrier(CLK_GLOBAL_MEM_FENCE) ends computation in accelerated function

I have the following problem. When barrier(CLK_LOCAL_MEM_FENCE) is placed in my accelerated function (kernel), the instructions after the barrier are not executed at all, and I would like to know why. In theory, a barrier should force every work-item in a work-group to wait until all of them have completed the operations before the barrier, and only then allow them to continue with the operations after it. In my case, only the operations before the barrier are performed.
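The expected barrier behavior described above can be sketched on the host side. This is a minimal model in plain C++, not OpenCL code: the function name `emulate_work_group` and its parameters are hypothetical, and the barrier is modelled by splitting the kernel body into two loops so that every work-item finishes the first phase before any work-item starts the second.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side model of ONE work-group (an assumption for
// illustration; this is not how any particular OpenCL runtime is implemented).
// first_global_id: global id of the first work-item in the group
// local_size:      number of work-items in the group
std::vector<float> emulate_work_group(std::size_t first_global_id,
                                      std::size_t local_size) {
    std::vector<float> a(local_size, 0.0f);

    // Phase 1: every work-item runs the statements before the barrier.
    for (std::size_t lid = 0; lid < local_size; ++lid) {
        a[lid] = static_cast<float>(first_global_id + lid);
    }

    // The barrier sits here: no work-item continues until phase 1 is
    // complete for all work-items in the group.

    // Phase 2: every work-item runs the statements after the barrier.
    for (std::size_t lid = 0; lid < local_size; ++lid) {
        a[lid] = static_cast<float>((first_global_id + lid) * 2);
    }
    return a;
}
```

Under this model, calling `emulate_work_group(1000, 4)` yields the doubled indices for work-items 1000 to 1003, which is the result the kernel is expected to produce.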

I use the SDx environment for programming FPGA devices. I have run the code in CPU emulation, hardware emulation, and System mode, and checked that the maximum work-group size isn't exceeded.
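The work-group size check mentioned above can be written as a small host-side helper. This is a sketch under assumptions: the name `ndrange_is_valid` is hypothetical, it covers a one-dimensional range only, and it applies the OpenCL 1.x rule that the global size must be a multiple of the local size. The maximum would come from the value reported for CL_DEVICE_MAX_WORK_GROUP_SIZE.

```cpp
#include <cstddef>

// Hypothetical helper: returns true when a 1-D launch configuration respects
// the device limit and the OpenCL 1.x divisibility rule.
bool ndrange_is_valid(std::size_t global_size,
                      std::size_t local_size,
                      std::size_t max_work_group_size) {
    if (global_size == 0 || local_size == 0) {
        return false;  // an empty range or empty work-group cannot be launched
    }
    if (local_size > max_work_group_size) {
        return false;  // work-group larger than the device allows
    }
    // Global size must be a whole number of work-groups.
    return global_size % local_size == 0;
}
```

For the launch used in this post, the call would be `ndrange_is_valid(1024*1024, 1024, max)`, where `max` is whatever the device reports.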

This is my code:

#include <stdlib.h>
#include <fstream>
#include <iostream>
#include "vadd.h"

//TARGET_DEVICE macro needs to be passed from gcc command line
#if defined(SDX_PLATFORM) && !defined(TARGET_DEVICE)
    #define STR_VALUE(arg)      #arg
    #define GET_STRING(name) STR_VALUE(name)
    #define TARGET_DEVICE GET_STRING(SDX_PLATFORM)
#endif

int main(int argc, char* argv[]) {

    const char *target_device_name = TARGET_DEVICE;

    if(argc != 2) {
        std::cout << "Usage: " << argv[0] <<" <xclbin>" << std::endl;
        return EXIT_FAILURE;
    }

    char* xclbinFilename = argv[1];

    std::vector<cl::Device> devices;
    cl::Device device;
    std::vector<cl::Platform> platforms;
    bool found_device = false;

    // Query the available platforms, then search the Xilinx platform for the target device
    cl::Platform::get(&platforms);
    for(size_t i = 0; (i < platforms.size()) && (found_device == false); i++){
        cl::Platform platform = platforms[i];
        std::string platformName = platform.getInfo<CL_PLATFORM_NAME>();
        if ( platformName == "Xilinx"){
            platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);

            //Traversing All Devices of Xilinx Platform
            for (size_t j = 0 ; j < devices.size() ; j++){
                device = devices[j];
                std::string deviceName = device.getInfo<CL_DEVICE_NAME>();
                if (deviceName == target_device_name){
                    found_device = true;
                    break;
                }
            }
        }
    }
    if (found_device == false){
       std::cout << "Error: Unable to find Target Device "
           << target_device_name << std::endl;
       return EXIT_FAILURE;
    }

    // Creating Context and Command Queue for selected device
    cl::Context context(device);
    cl::CommandQueue q(context, device, CL_QUEUE_PROFILING_ENABLE);

    // Load xclbin 
    std::cout << "Loading: '" << xclbinFilename << "'\n";
    std::ifstream bin_file(xclbinFilename, std::ifstream::binary);
    bin_file.seekg (0, bin_file.end);
    unsigned nb = bin_file.tellg();
    bin_file.seekg (0, bin_file.beg);
    char *buf = new char [nb];
    bin_file.read(buf, nb);

    // Creating Program from Binary File
    cl::Program::Binaries bins;
    bins.push_back({buf,nb});
    devices.resize(1);
    cl::Program program(context, devices, bins);

    std::cout << "CL_DEVICE_MAX_WORK_GROUP_SIZE : " << device.getInfo<CL_DEVICE_MAX_WORK_GROUP_SIZE>() << "\n";

    cl_int err;
    float* a = new float[1024*1024];
    cl::Buffer buffer_a(context, CL_MEM_READ_WRITE, sizeof(float) * 1024*1024);
    cl::Kernel krnl_test(program,"krnl_test");
    krnl_test.setArg(0, buffer_a);
    err = q.enqueueNDRangeKernel(krnl_test, cl::NullRange, cl::NDRange(1024*1024), cl::NDRange(1024), NULL, NULL);
    if (err != CL_SUCCESS)
        std::cerr << err << std::endl;
    q.enqueueReadBuffer(buffer_a, CL_TRUE, 0, sizeof(float) * 1024*1024, a);
    for (int i = 1000; i < 1040; i++)
        std::cout << i << "     " << a[i] << std::endl;

    return 0;
}

and my kernel code:

kernel void krnl_test(global float* a)
{
    int index = get_global_id(0);
    a[index] = index;
    barrier(CLK_LOCAL_MEM_FENCE);
    a[index] = index*2;
}


The value of each element of array a should be its index multiplied by 2, but in the actual results each element's value is just its index.
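The expectation above can be checked over the whole array rather than by reading 40 printed values. The following is a minimal sketch: `count_mismatches` is a hypothetical helper, and it assumes the buffer has already been read back into a host-side vector. Exact float comparison is safe here because every value up to 2 * (1024*1024 - 1) is below 2^24 and therefore exactly representable as a float.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: counts the elements that do not hold 2 * index.
std::size_t count_mismatches(const std::vector<float>& a) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        // Expected result of the statement after the barrier.
        if (a[i] != static_cast<float>(i * 2)) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A return value of 0 would mean the statement after the barrier ran for every work-item. With the behavior reported in this post, every element except index 0 would be counted as a mismatch.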
