07-05-2016 12:36 AM
I am using a char array of size 307200 in my design, and I have directed the tool to store it in BRAMs. As far as I can see it should fit in 134 BRAMs, but when it was synthesized it used 256 BRAMs, which corresponds to rounding 307200 up to the next power of two (524288). Is there any way to tell the compiler not to round up to the nearest power of two?
07-05-2016 01:27 AM
One option is to force HLS to partition the array in a logical way. For example, you could do this:
uint8_t buffer[307200]; // factor of 5 because 307200 = 5 * 61440, and 61440 rounds up to a power of 2 (65536)
#pragma HLS ARRAY_PARTITION variable=buffer dim=1 block factor=5
That should result in it splitting the buffer into five 61440*8-bit arrays (each padded up to 65536 deep), and (hopefully) it'll correctly recognise that four BRAM_18K blocks can be combined into one 64K*1 RAM. Then each array will occupy 4*8 = 32 BRAM_18K blocks, and there'll be five of them, so you'll have a total of 160 RAMs.
Getting it down below that may be difficult. Technically, for 307200 elements you can have four blocks of 64KB (as above, each using 32 RAMs) plus a final block of 44KB. This last block should only require three RAMs in series times eight in parallel (i.e. 24 total), which cuts RAM usage down to 152 RAMs. However, unless you actually move that last block into a whole separate array, I don't think you'll have much luck persuading HLS to do this.
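To make the "separate array" idea concrete, here's a minimal sketch (the array and function names are my own, not anything HLS generates): four 64KB block arrays plus a separate 44KB tail array, with the top address bits selecting which one is accessed. Because each array is declared at its real size, HLS can size the tail's storage independently of the 64KB blocks.

```cpp
#include <cstdint>

static uint8_t blocks[4][65536]; // 4 blocks * 32 BRAM_18K each in 16K*1 mode
static uint8_t tail[45056];      // 3 in series * 8 in parallel = 24 BRAM_18K

// Address decode: bits [18:16] pick the block, bits [15:0] index within it.
uint8_t read_pixel(uint32_t addr) {
    uint32_t blk = addr >> 16;    // addr / 65536
    uint32_t off = addr & 0xFFFF; // addr % 65536
    return (blk < 4) ? blocks[blk][off] : tail[off];
}

void write_pixel(uint32_t addr, uint8_t v) {
    uint32_t blk = addr >> 16;
    uint32_t off = addr & 0xFFFF;
    if (blk < 4) blocks[blk][off] = v;
    else         tail[off] = v;
}
```

Since 65536 is a power of two, the decode is just bit slicing (no dividers), so the extra logic cost is tiny.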
The absolute minimum when the RAMs are in 16K mode is 150. To get below that, like aiming for your desired 134 BRAMs, you have to be in 18K mode. This means that either you go for very wide RAM elements (72-bit RAM, nine bytes per element) or you go for relatively narrow elements (9-bit is the minimum) but have partial bytes in each element (eg. for 9-bit elements you store 1.125 bytes per element). Both approaches are messy, and both will require division to determine the correct addresses - which will not be good for resource consumption or speed. They can be implemented if you're really short on RAM, but I would definitely not recommend them.
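For completeness, here's a sketch of why the wide-element packing is messy (the layout and names here are my assumptions, purely illustrative): with nine bytes per 72-bit element, every byte address must be divided by 9 to find its element and lane, and 9 is not a power of two, so this costs real divider logic rather than bit slicing.

```cpp
#include <cstdint>

static const uint32_t BYTES_PER_ELEM = 9;      // one 72-bit element = 9 bytes
static uint8_t wide_mem[34134][BYTES_PER_ELEM]; // ceil(307200 / 9) = 34134 elements

uint8_t read_byte(uint32_t addr) {
    uint32_t elem = addr / BYTES_PER_ELEM; // divide by a non-power-of-2:
    uint32_t lane = addr % BYTES_PER_ELEM; // extra area and latency in hardware
    return wide_mem[elem][lane];
}

void write_byte(uint32_t addr, uint8_t v) {
    wide_mem[addr / BYTES_PER_ELEM][addr % BYTES_PER_ELEM] = v;
}
```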
07-05-2016 08:15 AM
My algorithm runs in a streaming fashion only. I actually call my accelerator multiple times for a single image, with different scaled versions of the original image. I am facing some data-transfer issues, so instead of transferring the data multiple times, I want to transfer the whole image once, re-scale it internally, and use that.
07-06-2016 03:53 AM
Hah, I do the same thing for much the same reason. I need to produce scaled versions of the original image several hundred times, and I'd prefer not to burn the (off-chip) RAM bandwidth needed to do this via an AXI Master. I also needed very high speed (reading 10+ pixels per clock cycle) and that is only practical with properly-partitioned block RAMs.
With that said, I'm now looking into whether there are better approaches, since this one really doesn't work with larger images (eg. 1920*1080*24-bit = absolute minimum of 2700 BRAM_18K blocks).
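All the BRAM counts in this thread come from the same calculation: total bits divided by the bits per RAM, rounded up. A tiny helper (my own, just for checking the arithmetic) reproduces the 134, 150 and 2700 figures above:

```cpp
#include <cstdint>

// Minimum number of RAM primitives needed to hold total_bits,
// given the usable capacity of one primitive in bits.
uint64_t min_brams(uint64_t total_bits, uint64_t bram_bits) {
    return (total_bits + bram_bits - 1) / bram_bits; // ceiling division
}
```

For example, 307200 bytes = 2457600 bits needs 134 BRAM_18K blocks at full 18Kb capacity, but 150 when they're used in 16K*1 mode, and a 1920*1080*24-bit frame needs 2700.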
07-06-2016 12:09 PM
I see this issue posted a lot. I'm not sure how to force HLS to be as efficient as one can be with a memory. Instead I make an external interface and instantiate the memory in HDL where I can control it. It's silly that this is necessary but it does work.