02-21-2017 08:27 PM
Hi all,
I have used pipes to add two vectors, i.e. c[i] = a[i] + b[i].
On the host I used the CL_MEM_USE_HOST_PTR | CL_MEM_EXT_PTR_XILINX options for the a and b buffers. Three kernels are connected through two pipes: the first (read_data_kernel) reads a and b element by element and pushes them into a pipe, the second (add_data_kernel) pops the elements, adds them, and pushes the result into another pipe, and the last one writes the result (c) into host memory.
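For reference, a minimal sketch of the three-kernel pipe structure described above (kernel names are taken from the post; the data type, pipe depths, and the SDAccel pipe-attribute syntax are assumptions, and the actual attached code may differ):

```c
// Hypothetical sketch only -- pipe depth, int data type, and the
// xcl_reqd_pipe_depth attribute are assumptions based on the SDAccel
// OpenCL pipe extension.
pipe int p0 __attribute__((xcl_reqd_pipe_depth(32)));
pipe int p1 __attribute__((xcl_reqd_pipe_depth(32)));

__kernel void read_data_kernel(__global const int *a,
                               __global const int *b, int n) {
    for (int i = 0; i < n; i++) {
        write_pipe_block(p0, &a[i]);   // push a[i], then b[i], into the pipe
        write_pipe_block(p0, &b[i]);
    }
}

__kernel void add_data_kernel(int n) {
    for (int i = 0; i < n; i++) {
        int x, y;
        read_pipe_block(p0, &x);       // pop a[i] and b[i]
        read_pipe_block(p0, &y);
        int s = x + y;
        write_pipe_block(p1, &s);      // push the sum into the second pipe
    }
}

__kernel void write_data_kernel(__global int *c, int n) {
    for (int i = 0; i < n; i++) {
        read_pipe_block(p1, &c[i]);    // drain results back to host memory
    }
}
```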
And this is the timing diagram after hardware emulation. My questions are:
Why does read_data_kernel start only after a delay, so that add_data_kernel stalls while it waits for data to arrive from read_data_kernel through the pipe?
Shouldn’t all three kernels start at the same time?
Is there any way to reduce this delay?
The code is also attached.
Thanks
Mohammad
02-22-2017 08:10 PM
When running on the actual hardware, the kernel start-up time is about 30us, which unfortunately can't be hidden. In your case, it's about 1% of your computation time, which shouldn't be that much of an overhead.
02-21-2017 09:40 PM
I also observed a similar issue in the past. I guess the delay you are seeing between kernel starts is due to kernel initialization (the host configuring the kernel registers before starting it). So I believe this is just an initial delay and will be negligible if you run with a large data count.
Change your DATA_LENGTH to something larger (for example 65536) and check the overall timeline.
02-21-2017 09:44 PM
02-21-2017 10:45 PM
Hi @heeran
The Xilinx pipe example also shows the same behaviour (maybe it's not an issue).
This is the timing diagram from HW emulation on a KU115.
The add stage starts first; after a delay the input stage starts, and after a further delay the output stage starts.
Thanks
Mohammad
02-21-2017 10:53 PM
02-22-2017 04:40 AM
Hi @heeran
I think you are right, and the delay is all about kernel initialisation.
As far as I can tell, it depends on the number of kernel arguments and their types; global pointers seem to cause the most delay.
This delay is not always negligible. I spent some time implementing an efficient histogram algorithm on the FPGA, and it now takes less than 4 ms to process 67 MB, not counting this delay. But the delay is about 15 ms, which is a huge overhead.
Now my questions are
What parameters have an impact on this delay?
Is there any way to reduce or hide this delay behind other computations?
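One general OpenCL pattern worth trying is to enqueue several launches back to back, so that the host-side setup of launch i+1 overlaps the execution of launch i; whether the per-launch register configuration can actually be overlapped this way is device- and runtime-dependent. A hedged host-side sketch (NBATCH, queue, kernel, and set_batch_args are hypothetical names; error checking omitted):

```c
/* Hypothetical sketch: pipeline launch overhead across batches.
 * NBATCH, queue, kernel, and set_batch_args() are assumptions,
 * not names from the attached code. */
cl_event ev[NBATCH];
for (int i = 0; i < NBATCH; i++) {
    set_batch_args(kernel, i);           /* clSetKernelArg for batch i   */
    clEnqueueTask(queue, kernel, 0, NULL, &ev[i]);
}
/* With non-blocking enqueues, the setup of batch i+1 can proceed
 * while batch i executes (if the runtime supports it). */
clWaitForEvents(NBATCH, ev);
```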
And interestingly, kernel profiling (using CL_QUEUE_PROFILING_ENABLE) does not report this delay: it says the kernel execution time is 4 ms (in my histogram design) while it actually takes about 20 ms. (With two kernels, the profiling actually reports a high execution time for the kernel that does the computation, not for the one that receives the data.)
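That behaviour is consistent with how OpenCL event profiling works: the usual start-to-end interval covers only kernel execution, while launch and configuration overhead shows up between the QUEUED and START timestamps. A sketch of measuring both, assuming an event ev returned by clEnqueueTask on a queue created with CL_QUEUE_PROFILING_ENABLE (error checking omitted):

```c
/* Assumes cl_event ev from a launch on a profiling-enabled queue. */
cl_ulong queued, started, ended;
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_QUEUED,
                        sizeof(queued), &queued, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                        sizeof(started), &started, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                        sizeof(ended), &ended, NULL);
/* "Execution time" as usually reported: START to END. */
double exec_ms  = (ended - started) * 1e-6;
/* Launch/setup overhead, which START-END misses: QUEUED to START. */
double setup_ms = (started - queued) * 1e-6;
```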
Thanks
Mohammad