2,061 Views
Registered: ‎11-05-2017

Run many instances of same kernel

Currently my OpenCL kernel only takes up 1% of the resources on the board. How can I run 100 of them at once?

Should I create 100 command queues for the same kernel and enqueue 100 kernels? Will this run them concurrently?

Should I make 100 copies of the same kernel with slightly different names? (half-joking)

Should I try to make my kernel process 100 items per iteration? (this seems hard and slow)

On a GPU, my kernel uses too many registers per work-group, so I can't even utilize all of the compute units at once. On an FPGA, I have so many leftover resources that the limit of 16 compute units would not be enough to maximize resource usage.

0 Kudos
1 Solution

Accepted Solutions
Xilinx Employee
2,952 Views
Registered: ‎07-18-2014

Re: Run many instances of same kernel

Hi @christian1188,

To run multiple instances of a kernel, you have to generate that many compute units using the --nk option of xocc:

--nk <kernelName>:<number of instances>

Please refer to the example below for multiple compute units:

https://github.com/Xilinx/SDAccel_Examples/tree/master/getting_started/clk_freq/large_loop_ocl

For SDAccel, the number of compute units is restricted to a maximum of 10, so at most 10 compute units can run concurrently.

In your case (100 parallel operations), I would suggest doing as much parallel computation as possible inside a single kernel instance, and only then adding multiple compute units.

You can refer to the watermarking example below, in which a single kernel performs all of the image processing but operates on 16 pixels concurrently using loop unrolling:

https://github.com/Xilinx/SDAccel_Examples/blob/master/vision/watermarking/src/krnl_watermarking.cl

So effectively, it is equivalent to 16 GPU work-items, each operating on a single pixel.
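
As a rough illustration of the loop-unrolling idea above (this is not the actual watermarking kernel), here is a minimal plain-C model of a kernel that handles 16 items per outer iteration. The names process_pixel, process_image, and LANES are hypothetical, and the per-pixel operation is only a placeholder. In OpenCL C the inner loop would typically carry an unroll hint such as __attribute__((opencl_unroll_hint(16))); whether the tool can actually build 16 parallel copies depends on the loop body and on memory bandwidth.

```c
#include <stddef.h>
#include <stdint.h>

#define LANES 16  /* assumed unroll factor, matching the 16 pixels mentioned above */

/* Hypothetical per-pixel operation; stands in for the real kernel body. */
static inline uint32_t process_pixel(uint32_t p) {
    return p ^ 0x00FFFFFFu;
}

/* Plain-C model of a single kernel instance that walks the whole buffer.
 * On the FPGA, the inner loop is the one that would be unrolled so that
 * LANES copies of process_pixel() exist side by side in hardware. */
void process_image(const uint32_t *in, uint32_t *out, size_t n) {
    size_t full = n - (n % LANES);  /* largest multiple of LANES <= n */

    for (size_t base = 0; base < full; base += LANES) {
        for (int lane = 0; lane < LANES; ++lane) {  /* unroll candidate */
            out[base + lane] = process_pixel(in[base + lane]);
        }
    }

    /* Leftover elements when n is not a multiple of LANES. */
    for (size_t i = full; i < n; ++i) {
        out[i] = process_pixel(in[i]);
    }
}
```

The fixed trip count of the inner loop is the point of this layout: a loop with a compile-time bound is much easier for the tool to unroll fully than one whose bound depends on runtime data.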

 

Here is my understanding of your questions:

Should I create 100 command queues for the same kernel and enqueue 100 kernels? Will this run them concurrently?

[Heera]: Multiple command queues won't help if the xclbin binary has only a single kernel instance in the FPGA. All requests will run sequentially.

Should I make 100 copies of the same kernel with slightly different names? (half-joking)

[Heera]: 100 copies are not possible, as the maximum is restricted to 10. If you need 10 instances, you can create multiple instances of the same kernel using the --nk option. There is no need for different names.

Should I try to make my kernel process 100 items per iteration? (this seems hard and slow)

[Heera]: Yes, this is the recommended way for the FPGA architecture. Do as many operations in parallel as you can inside a single kernel using loop unrolling.

I hope this helps.

-Heera

3 Replies
1,941 Views
Registered: ‎11-05-2017

Re: Run many instances of same kernel

@heeran So a single command queue is all that is needed when using multiple compute units? Or are multiple command queues required? And is CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE required?

0 Kudos
Xilinx Employee
1,926 Views
Registered: ‎07-18-2014

Re: Run many instances of same kernel

Hi @christian1188,
A single command queue can handle multiple compute units. You can refer to the example below:
https://github.com/Xilinx/SDAccel_Examples/tree/master/getting_started/clk_freq/large_loop_ocl
where multiple compute units are created for the good kernel and scheduled in parallel using a global size of (4,1,1). It does not require CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE.
Out-of-order execution is needed when multiple commands are placed into a single command queue and you expect them to run concurrently. You can refer to the example below for this:
https://github.com/Xilinx/SDAccel_Examples/tree/master/getting_started/host/concurrent_kernel_execution_ocl
This example demonstrates two approaches to running kernels concurrently: a single command queue with out-of-order execution, and multiple command queues.
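
To make the global size (4,1,1) case a bit more concrete: with four work-groups in flight, each one needs to know which slice of the data it owns. The sketch below is a plain-C model of that index arithmetic only, not code from the linked example. The names chunk_t and chunk_for_group are hypothetical, and it assumes one work-group per compute unit, with the group index coming from get_group_id(0) inside a real kernel.

```c
#include <stddef.h>

/* Half-open range [begin, end) of items owned by one work-group. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Hypothetical helper: split n items as evenly as possible across
 * num_groups work-groups. Inside an OpenCL kernel, group_id would come
 * from get_group_id(0) and num_groups from get_num_groups(0). */
chunk_t chunk_for_group(size_t n, size_t num_groups, size_t group_id) {
    size_t base = n / num_groups;    /* minimum items per group */
    size_t extra = n % num_groups;   /* first `extra` groups take one more */
    chunk_t c;

    c.begin = group_id * base + (group_id < extra ? group_id : extra);
    c.end = c.begin + base + (group_id < extra ? 1 : 0);
    return c;
}
```

For 100 items and 4 groups this gives each group 25 consecutive items. When the count does not divide evenly, the earliest groups each take one extra item, so no group differs from another by more than one.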

I hope it will help.

-Heera

0 Kudos