02-20-2020 10:18 AM
I am working on some fairly complex C++ kernels that use 8 or 16 HBM ports.
All of my processing is done using ap_uint<1024>, and right now so are my interfaces to HBM banks.
Now I wonder: would there be any benefit (in either area or performance) to changing the interfaces to 512 bits?
Of course, the rest of the processing would remain 1024 bits wide.
PS: Working on Alveo U280 on Linux with Vitis
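To make the question concrete, here is a minimal sketch of what "512-bit interface, 1024-bit processing" implies: the kernel has to assemble each 1024-bit value from two consecutive 512-bit reads. This is plain C++ using arrays of 64-bit limbs as stand-ins for ap_uint<512> and ap_uint<1024>; the names Word512, Word1024 and pack1024, and the choice to put the first word in the low half, are assumptions made for illustration only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Portable stand-ins for ap_uint<512> and ap_uint<1024> (assumption: real HLS
// code would use the ap_uint types from ap_int.h instead).
using Word512  = std::array<std::uint64_t, 8>;
using Word1024 = std::array<std::uint64_t, 16>;

// Combine two consecutive 512-bit memory words into one 1024-bit value.
// The first word read fills the low half (an arbitrary choice for this sketch).
Word1024 pack1024(const Word512& lo, const Word512& hi) {
    Word1024 out{};
    for (std::size_t i = 0; i < 8; ++i) {
        out[i]     = lo[i];
        out[i + 8] = hi[i];
    }
    return out;
}
```

With this arrangement the 1024-bit datapath receives a new value every two interface reads, unless the interface side is clocked or pipelined to compensate.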
02-20-2020 10:25 AM - edited 02-20-2020 10:28 AM
The HBM ports are 256-bit AXI3 interfaces that try to run at 450 MHz.
If you go with a 512-bit interface from your kernel, you might have a harder time closing timing, or, since v++ scales the kernel clock automatically, you might close timing at a lower frequency.
If you go with 1024 bits you will use more logic, so depending on the density of your design, it might be harder to place.
So yes, I believe area and performance are the only considerations I would see as well. You want the design to run as fast as possible, so I think the clock speed your kernels achieve is going to be a result of this choice.
Edit: I think we should first look at what frequency your 1024-bit kernel closes timing at, and then decide whether it would be beneficial to try to close timing at twice that frequency or more after downsizing to 512 bits. Or maybe your 1024-bit kernel produces more data than the HBM can consume, which might be causing a bottleneck that you can free up by downsizing that interface.
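As a rough back-of-the-envelope check of the numbers above, the sketch below computes the theoretical peak bandwidth of one 256-bit port at 450 MHz and the kernel clock at which a 512-bit or 1024-bit kernel port would match it. The function names are hypothetical, and the figures ignore AXI protocol overhead and real HBM efficiency, so treat them as upper bounds rather than measurements.

```cpp
// Theoretical peak bandwidth of a bus in GB/s:
// (width in bits / 8) bytes per cycle, times the clock in MHz, divided by 1000.
constexpr double peak_gbps(int width_bits, double freq_mhz) {
    return (width_bits / 8.0) * freq_mhz / 1000.0;
}

// Kernel clock (MHz) at which a kernel-side port of the given width moves as
// many bits per second as one HBM port (defaults: 256 bits at 450 MHz, the
// figures quoted in this thread).
constexpr double matching_freq_mhz(int kernel_width_bits,
                                   int hbm_width_bits = 256,
                                   double hbm_freq_mhz = 450.0) {
    return hbm_freq_mhz * hbm_width_bits / kernel_width_bits;
}
```

Under these assumptions one port peaks at 14.4 GB/s, a 512-bit kernel port matches it at 225 MHz, and a 1024-bit port matches it at 112.5 MHz. A 1024-bit kernel that closes timing well above 112.5 MHz can therefore, in principle, offer more data than a single port can absorb.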
02-20-2020 12:42 PM
If you happen to implement both versions, could you share whether you saw any benefit in effective DRAM bandwidth from using 1024b versus 512b?
I couldn't get anything routed using 1024b on the U50, so I was curious.
02-20-2020 12:45 PM
@mice101 - Have you tried routing adjacent kernels in different SLRs?
Or were you unable to place even a single kernel using 1024 bits?
02-20-2020 01:22 PM
02-22-2020 05:01 AM
02-23-2020 06:28 PM
Thanks for sharing your result, @benedetto73 . I suppose you reduced the number of compute units by half?
One suggestion for your problem: Could you check if the frequency of the bitstream matches the actual programmed frequency?
On my U50, I found that they were different, and the card hangs when the difference is too large. I think there is a bug there.
02-25-2020 02:27 PM
When you get to it, see if the hang is reported in xbutil query or in dmesg as an AXI firewall trip. I bet it's not, in which case you are likely running into an ERT bug.
If you want a quick test, create a file called xrt.ini and place it in the same directory as the executable that programs the xclbin to the card. Make the contents of that file as follows:
[Runtime]
ert = false
To confirm that the change has taken effect, check dmesg for a message similar to the following:
[ 2571.861208] xocl 0000:d8:00.1: dev ffff92c28d5ee098, exec_cfg_cmd: scheduler config ert(0), dataflow(0), slots(16), cudma(0), cuisr(0), cdma(0), cus(2)
ert(0) is the key you are looking for.
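If you want to automate that check, a small helper along these lines could pull the ert value out of a dmesg line shaped like the one above. The function name parse_ert_flag and the convention of returning -1 when nothing is found are assumptions for this sketch, and it relies only on the "ert(N)" pattern visible in the sample message.

```cpp
#include <cstddef>
#include <string>

// Return the number inside "ert(...)" in an xocl scheduler-config log line,
// or -1 if the pattern is missing or the value is not a plain decimal number.
int parse_ert_flag(const std::string& line) {
    const std::string key = "ert(";
    const std::size_t start = line.find(key);
    if (start == std::string::npos) return -1;

    const std::size_t value_pos = start + key.size();
    const std::size_t end = line.find(')', value_pos);
    if (end == std::string::npos || end == value_pos) return -1;

    int value = 0;
    for (std::size_t i = value_pos; i < end; ++i) {
        if (line[i] < '0' || line[i] > '9') return -1;
        value = value * 10 + (line[i] - '0');
    }
    return value;
}
```

A result of 0 corresponds to the ert(0) case described above, meaning the embedded scheduler is disabled.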
This bug seems to be fixed in the upcoming release of XRT + Platforms.
*Note: The *.INI file is only read by XRT when a new xclbin is programmed onto the card, so rerunning a program or running a program that uses the same xclbin as the last will not force the platform's scheduler to reconfigure itself.