SDAccel: difference between sub-devices and multiple compute units
I have a question about the difference between sub-devices and multiple compute units (CUs). I have tried both on my design and found that they behave similarly: there seems to be some scheduling going on underneath. If I use multiple compute units (Xilinx example here) with one out-of-order command queue, the number of tasks launched on each CU differs. If I use sub-devices (example here), the number of tasks launched on each CU still differs, even though I explicitly manage multiple sub-queues.
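To make clear what I expected, here is a toy model in plain Python. It is not SDAccel or OpenCL code and says nothing about how the real runtime works; the function names and task durations are made up for illustration. It contrasts one shared queue, where each task goes to whichever CU is free first, with one queue per CU, where I pin task i to CU i % num_cus myself.

```python
import heapq


def shared_queue_dispatch(durations, num_cus):
    """Toy model: one shared queue, each task goes to the CU that frees up first."""
    # Heap of (time_when_free, cu_index); ties go to the lower CU index.
    free_at = [(0, cu) for cu in range(num_cus)]
    heapq.heapify(free_at)
    counts = [0] * num_cus
    for d in durations:
        t, cu = heapq.heappop(free_at)
        counts[cu] += 1
        heapq.heappush(free_at, (t + d, cu))
    makespan = max(t for t, _ in free_at)
    return counts, makespan


def per_queue_dispatch(durations, num_cus):
    """Toy model: one queue per CU, task i is pinned to CU i % num_cus."""
    counts = [0] * num_cus
    busy = [0] * num_cus
    for i, d in enumerate(durations):
        cu = i % num_cus
        counts[cu] += 1
        busy[cu] += d
    return counts, max(busy)


# Hypothetical task durations: one long task followed by short ones.
durations = [4, 1, 1, 1, 1, 1]
print(shared_queue_dispatch(durations, 2))  # ([2, 4], 5)
print(per_queue_dispatch(durations, 2))     # ([3, 3], 6)
```

In this model the shared queue gives uneven per-CU task counts, which matches what I see with multiple compute units. The pinned queues give fixed, even counts, which is what I expected from sub-devices but do not observe.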
More importantly, there is a performance difference. In my limited tests with hardware emulation, the sub-device implementation always performs better than the multiple-compute-unit one (I tested with a sufficiently large dataset).
My questions are: is there an implementation difference between the two in the runtime, and why do they perform differently? Also, is hardware emulation close enough to real on-board performance?