Contributor
7,890 Views
Registered: 04-17-2012

OpenCL Application hangs when many kernels are sequentially enqueued

Hi,

 

I have an application with 4 kernels (K1, K2, K3, K4) that are enqueued sequentially:

 

Host -> clEnqueueNDR(K1) -> clEnqueueNDR(K2) ... -> clEnqueueNDR(K4) -> Host

 

 

After every call to clEnqueueNDRangeKernel() and before the next kernel is enqueued, clFinish() is called to make sure that previous commands have finished execution. My command_queue is NOT configured with CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE.

 

The problem is that execution hangs right after the first kernel is enqueued.

This problem disappears when I enqueue only one kernel (KX), no matter which one:

Host -> clEnqueueNDR(KX) -> Host

 

Is there a way to discover the cause of these hangs, or a good place to start looking?


Best,
L30nardo SV
9 Replies
Xilinx Employee
7,884 Views
Registered: 02-03-2016

Re: OpenCL Application hangs when many kernels are sequentially enqueued

Hey leonardo.solis,

 

Does this happen in software emulation, hardware emulation, or on the board? My first thought (specifically if this occurs in software emulation) is that you may have a stack overflow caused by the combined size of the arrays in two or more of the kernels. If that is the issue, you can run "ulimit -s unlimited" in the shell before launching the host program to remove the stack limit.
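
To confirm which stack limit the emulation process actually runs with, the host program can query it at startup. The following is only a sketch: it relies on the POSIX getrlimit() call, and the helper names get_stack_soft_limit() and print_stack_limit() are made up for this example and are not part of any Xilinx tool.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: stores the soft stack limit (in bytes) in *soft_bytes.
 * Returns 0 on success and -1 if getrlimit() fails.
 * An unlimited stack is reported as RLIM_INFINITY. */
int get_stack_soft_limit(rlim_t *soft_bytes)
{
    struct rlimit lim;

    assert(soft_bytes != NULL);
    if (getrlimit(RLIMIT_STACK, &lim) != 0)
        return -1;
    *soft_bytes = lim.rlim_cur;
    return 0;
}

/* Hypothetical helper: prints the limit in a human-readable form. */
void print_stack_limit(void)
{
    rlim_t soft;

    if (get_stack_soft_limit(&soft) != 0) {
        perror("getrlimit");
        return;
    }
    if (soft == RLIM_INFINITY)
        printf("stack limit: unlimited\n");
    else
        printf("stack limit: %llu bytes\n", (unsigned long long)soft);
}
```

Calling print_stack_limit() at the top of main() shows whether the "ulimit -s unlimited" setting reached the process. Note that bash's "ulimit -s" reports the value in 1024-byte units, while getrlimit() reports bytes.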

 

Thanks,

Spenser

Contributor
7,880 Views
Registered: 04-17-2012

Re: OpenCL Application hangs when many kernels are sequentially enqueued

Hi Spenser,

 

I forgot to mention: this is happening in the sw-emulation flow.

Yes, my kernels pass ~46 arguments, and some of them are relatively large arrays.

 

 

I will try this out and post the results soon.

 

Thanks for the quick answer!


Best,
L30nardo SV
Contributor
7,836 Views
Registered: 04-17-2012

Re: OpenCL Application hangs when many kernels are sequentially enqueued

Hi again Spenser,

 

I have set the stack size to unlimited, but I still have the same problem during sw-emulation.

As for hw-emulation, it hangs in the same way as sw-emulation.

I haven't tried running on the board yet, since I want to debug my application first.

 

Any hints would be appreciated.


Best,
L30nardo SV
Xilinx Employee
7,813 Views
Registered: 02-03-2016

Re: OpenCL Application hangs when many kernels are sequentially enqueued

Hey leonardo.solis,

 

Do your kernels access global memory, and if so, were the buffers allocated with the correct size? Perhaps it's a segfault.

For the hw_emu failure, how long did you wait before deciding that the kernel is hung? Minutes, hours, days? Hw_emu can take a very long time depending on the kernel (possibly even weeks or months given a large enough problem size).

 

Thanks,

Spenser

Contributor
7,722 Views
Registered: 04-17-2012

Re: OpenCL Application hangs when many kernels are sequentially enqueued


@spenserg wrote:

Hey leonardo.solis,

 

Do your kernels access global memory, and if so, were the buffers allocated with the correct size? Perhaps it's a segfault.


Thanks for the reply Spenser,

 

Yes, my kernels access global memory. This is how I allocate buffers (my kernel args are only global and private):

 

size_t size_floatgrids;
cl_mem mem_dockpars_fgrids;
...
size_floatgrids = 7689500; // size in bytes
...

mallocBufferObject(context, CL_MEM_READ_WRITE, size_floatgrids, &mem_dockpars_fgrids);
memcopyBufferObjectToDevice(command_queue, mem_dockpars_fgrids, (void *)cpu_floatgrids, size_floatgrids);
...
setKernelArg(kernel1, 7, sizeof(mem_dockpars_fgrids), &mem_dockpars_fgrids);
...

Printing sizeof(mem_dockpars_fgrids) gives 8, which I think is correct, since cl_mem is an opaque handle (a pointer, so 8 bytes on a 64-bit host).

My implementation of the functions above is:

int mallocBufferObject(cl_context context, cl_mem_flags flags, size_t size, cl_mem* mem){
  cl_int err;
  // Pass &err so that a failed allocation reports its OpenCL error code.
  cl_mem local_mem = clCreateBuffer(context, flags, size, NULL, &err);
  if (!local_mem || err != CL_SUCCESS){
    printf("Error: clCreateBuffer() %d\n", err);fflush(stdout);
    return EXIT_FAILURE;
  }

  *mem = local_mem;
  return CL_SUCCESS;
}

int memcopyBufferObjectToDevice(cl_command_queue cmd_queue, cl_mem dest, void* src, size_t size){
  cl_int err;
  err = clEnqueueWriteBuffer(cmd_queue,dest,CL_TRUE,0,size,src,0,NULL,NULL);
  if (err != CL_SUCCESS){
    printf("Error: clEnqueueWriteBuffer() %d\n", err);fflush(stdout);
    return EXIT_FAILURE;
  }

  return CL_SUCCESS;
}

int setKernelArg(cl_kernel kernel, cl_uint num, size_t size, const void *ptr){
  cl_int err;
  err = clSetKernelArg(kernel,num,size,ptr);
  if (err != CL_SUCCESS){
    printf("Error: clSetKernelArg() %d\n", err);fflush(stdout);
    return EXIT_FAILURE;
  }
  return CL_SUCCESS;
}
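
The helpers above print the raw cl_int value, which is hard to read at a glance. A small lookup like the following can turn the number into its symbolic name. It is only a sketch: cl_error_name() is a hypothetical helper, it covers just a subset of the codes, and the numeric values are hard-coded from the standard CL/cl.h header so that the snippet compiles without OpenCL installed.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper: maps a subset of OpenCL error codes to their names.
 * The numeric values mirror the definitions in the Khronos CL/cl.h header. */
const char *cl_error_name(int err)
{
    switch (err) {
    case   0: return "CL_SUCCESS";
    case  -4: return "CL_MEM_OBJECT_ALLOCATION_FAILURE";
    case  -5: return "CL_OUT_OF_RESOURCES";
    case  -6: return "CL_OUT_OF_HOST_MEMORY";
    case -30: return "CL_INVALID_VALUE";
    case -34: return "CL_INVALID_CONTEXT";
    case -36: return "CL_INVALID_COMMAND_QUEUE";
    case -38: return "CL_INVALID_MEM_OBJECT";
    case -48: return "CL_INVALID_KERNEL";
    case -49: return "CL_INVALID_ARG_INDEX";
    case -50: return "CL_INVALID_ARG_VALUE";
    case -51: return "CL_INVALID_ARG_SIZE";
    case -52: return "CL_INVALID_KERNEL_ARGS";
    case -54: return "CL_INVALID_WORK_GROUP_SIZE";
    case -61: return "CL_INVALID_BUFFER_SIZE";
    default:  return "unlisted OpenCL error code";
    }
}
```

In the helpers, the printf calls could then print cl_error_name(err) next to the number, for example printf("Error: clSetKernelArg() %d (%s)\n", err, cl_error_name(err)).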

 

I have tried other, smaller projects using these custom functions and they work fine. But in this project, execution still hangs after launching the first kernel during cpu-emul.

I am currently enabling only two kernels. The first one takes 46 arguments and the second one takes 4; in both cases, the args are cl_mem objects or scalars. I wonder whether the number of arguments is too large for cpu-emul. Is this likely to be the problem, due to some device limit?

For kernel 1, I am allocating arrays with the following sizes:

 

Size of dockpars_fgrids:                7689500 bytes.
Size of dockpars_conformations_current:   38400 bytes.
Size of dockpars_energies_current:          600 bytes.
Size of dockpars_conformations_next:      38400 bytes.
Size of dockpars_energies_next:             600 bytes.
Size of dockpars_evals_of_new_entities:     600 bytes.
Size of dockpars_prng_states:             19200 bytes.
Size of atom_charges_const:                 360 bytes.
Size of atom_types_const:                    90 bytes.
Size of intraE_contributors_const:        24384 bytes.
Size of VWpars_AC_const:                    784 bytes.
Size of VWpars_BD_const:                    784 bytes.
Size of dspars_S_const:                      56 bytes.
Size of dspars_V_const:                      56 bytes.
Size of rotlist_const:                    16384 bytes.
Size of ref_coords_x_const:                 360 bytes.
Size of ref_coords_y_const:                 360 bytes.
Size of ref_coords_z_const:                 360 bytes.
Size of rotbonds_moving_vectors_const:      384 bytes.
Size of rotbonds_unit_vectors_const:        384 bytes.
Size of ref_orientation_quats_const:       1600 bytes.

 



For the hw_emu failure, how long did you wait before deciding that the kernel is hung? Minutes, hours, days? Hw_emu can take a very long time depending on the kernel (possibly even weeks or months given a large enough problem size).


Well, I waited no longer than 20 minutes for hw-emul. I will run it again.

 

I am looking forward to hearing from you.

 

 

 


Best,
L30nardo SV
Xilinx Employee
7,717 Views
Registered: 02-03-2016

Re: OpenCL Application hangs when many kernels are sequentially enqueued

Hey leonardo.solis,

 

I don't see any obvious errors in the code you provided. If you go through the HLS logs, do you see any messages suggesting out-of-bounds accesses? Also, are you using pipes?

 

I don't believe there is a limit on the number of arguments that can be passed to a kernel.

 

Thanks,

Spenser

Contributor
7,683 Views
Registered: 04-17-2012

Re: OpenCL Application hangs when many kernels are sequentially enqueued

Hey Spenser,

 

I have compiled for hw emulation and looked for the HLS files here: /ofdock_GUI_2/solutions/solution_1/impl/kernels/

As far as I can see, there are no messages suggesting out-of-bounds accesses. Please find those files (one for each kernel) attached. By the way, I am not using pipes.

May I send you my kernel source code so you can have a look at it?

And what about the amount of data allocated in global memory? Is there any limit to that?


Best,
L30nardo SV
Xilinx Employee
7,676 Views
Registered: 02-03-2016

Re: OpenCL Application hangs when many kernels are sequentially enqueued

Hey leonardo.solis,

 

Yes, you can send me the kernel code, but please also send me the full host code as well as either a Tcl script or a Makefile. Furthermore, if the code can be released publicly, it would be good to post it here so that others can learn from your experience.

 

There's at least 4 GB of memory on all cards.  According to the sizes you provided earlier, you are not hitting this limit.
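
As a quick check of that statement, the 21 buffer sizes listed earlier for kernel 1 can be added up and compared with 4 GB. This is just a sketch: the array and function names are hypothetical, and it only counts the sizes quoted in this thread, not any other allocations the application may make.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Buffer sizes in bytes, copied from the list posted for kernel 1. */
static const uint64_t kernel1_buffer_bytes[] = {
    7689500, 38400, 600, 38400, 600, 600, 19200, 360, 90, 24384, 784,
    784, 56, 56, 16384, 360, 360, 360, 384, 384, 1600
};

/* Hypothetical helper: sum of all listed buffer sizes, in bytes. */
uint64_t kernel1_total_bytes(void)
{
    uint64_t total = 0;
    size_t n = sizeof kernel1_buffer_bytes / sizeof kernel1_buffer_bytes[0];

    assert(n == 21); /* one entry per line of the posted list */
    for (size_t i = 0; i < n; ++i)
        total += kernel1_buffer_bytes[i];
    return total;
}

/* Hypothetical helper: returns 1 if the total fits in capacity_bytes, else 0. */
int fits_in_global_memory(uint64_t capacity_bytes)
{
    return kernel1_total_bytes() <= capacity_bytes;
}
```

The listed sizes add up to 7,833,646 bytes (roughly 7.5 MiB), so fits_in_global_memory(4ULL * 1024 * 1024 * 1024) returns 1 with a very wide margin.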

 

Thanks,

Spenser

 

Contributor
7,634 Views
Registered: 04-17-2012

Re: OpenCL Application hangs when many kernels are sequentially enqueued

Thanks Spenser.

 

Once I am allowed to, I will make my code publicly available.


Regarding hw-emul, it hangs at the same statement as in cpu-emul and has been stuck there for 1 day so far...

Is there a way to know whether it is still processing or actually hung?
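
One host-side idea I am considering is to replace the blocking clFinish() with a bounded poll, so that the host can at least log progress and give up after a chosen time. The sketch below is an assumption-laden illustration, not something from the Xilinx tools: wait_with_timeout(), poll_fn and countdown_done() are hypothetical names, and in a real host program the callback would wrap clGetEventInfo() with CL_EVENT_COMMAND_EXECUTION_STATUS and compare the result to CL_COMPLETE. A timeout cannot prove that a kernel is hung; it only bounds the wait.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: returns true once the work has completed. */
typedef bool (*poll_fn)(void *ctx);

/* Polls done(ctx) up to max_polls times, sleeping interval_ms between polls.
 * Returns true if the work completed, false if the poll budget ran out. */
bool wait_with_timeout(poll_fn done, void *ctx, long interval_ms, long max_polls)
{
    struct timespec pause = {
        .tv_sec = interval_ms / 1000,
        .tv_nsec = (interval_ms % 1000) * 1000000L
    };

    assert(done != NULL && interval_ms >= 0 && max_polls >= 0);
    for (long i = 0; i < max_polls; ++i) {
        if (done(ctx))
            return true;
        nanosleep(&pause, NULL);
    }
    return false;
}

/* Stand-in for a real status query: reports completion after *ctx polls. */
bool countdown_done(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0)
        --*remaining;
    return *remaining == 0;
}
```

With a real event-status callback, a call such as wait_with_timeout(cb, &event, 1000, 600) would wait for roughly ten minutes before returning false, at which point the host could print which kernel was in flight instead of blocking forever.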


Best,
L30nardo SV