siva_krishna
Visitor

Vivado HLS BRAM usage problem

Hello,

    I am using a char array of size 307200 in my design, and I have directed HLS to store it in BRAMs. By the math it should fit in 134 BRAMs (307200 * 8 bits / 18 Kb). But after synthesis it uses 256 BRAMs, because the tool rounds the array depth up to the next power of two above 307200. Is there any way to tell the compiler not to round up to the nearest power of two?

5 Replies
u4223374
Advisor

One option is to force HLS to partition the array in a logical way. For example, you could do this:

 

#include <stdint.h>

uint8_t buffer[327680]; // 327680 = 5 * 65536, i.e. 5 * a power of 2.
#pragma HLS ARRAY_PARTITION variable=buffer block factor=5 dim=1

That should result in HLS splitting the buffer into five 65536*8-bit arrays, and (hopefully) it'll correctly recognise that four BRAM_18K blocks can be combined into one 64K*1 RAM. Then each array will occupy 4*8 = 32 BRAM_18K blocks, and with five of them you'll have a total of 160 RAMs.

 

Getting it down below that may be difficult. Technically, for 307200 elements you can have four blocks of 64K elements (as above, each using 32 RAMs) plus a final block of 44K (45056) elements. That last block should only require three RAMs in series times eight in parallel (i.e. 24 total), which cuts RAM usage down to 152 RAMs. However, unless you actually move that last block into a whole separate array, I don't think you'll have much luck persuading HLS to do this.
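A hand-split version of that layout might look like the sketch below, assuming all accesses go through helper functions (the names buf0..buf3, buf_tail, read_pixel and write_pixel are mine, not anything HLS generates):

```cpp
#include <stdint.h>

// Four full 64K-element blocks (each should map to 32 BRAM_18K in 16K*1 mode)...
static uint8_t buf0[65536], buf1[65536], buf2[65536], buf3[65536];
// ...plus one 45056-element tail block (307200 - 4*65536), roughly 24 BRAM_18K.
static uint8_t buf_tail[45056];

// Address decode: bits above 16 select the block, low 16 bits index within it.
uint8_t read_pixel(uint32_t addr) {
    uint32_t block  = addr >> 16;     // addr / 65536
    uint32_t offset = addr & 0xFFFF;  // addr % 65536
    switch (block) {
        case 0:  return buf0[offset];
        case 1:  return buf1[offset];
        case 2:  return buf2[offset];
        case 3:  return buf3[offset];
        default: return buf_tail[offset]; // block 4, offset < 45056
    }
}

void write_pixel(uint32_t addr, uint8_t v) {
    uint32_t block  = addr >> 16;
    uint32_t offset = addr & 0xFFFF;
    switch (block) {
        case 0:  buf0[offset] = v; break;
        case 1:  buf1[offset] = v; break;
        case 2:  buf2[offset] = v; break;
        case 3:  buf3[offset] = v; break;
        default: buf_tail[offset] = v; break;
    }
}
```

Because the block boundary is a power of two, the decode is just a shift and a mask rather than a real division.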

 

The absolute minimum when the RAMs are in 16K mode is 150 (307200*8/16384). To get below that, e.g. to reach your calculated 134 BRAMs, you have to be in 18K mode. This means that either you go for very wide RAM elements (72-bit RAM, nine bytes per element) or you go for relatively narrow elements (9-bit is the minimum) but with partial bytes in each element (e.g. for 9-bit elements you store 1.125 bytes per element). Both approaches are messy, and both will require division to determine the correct addresses, which will not be good for resource consumption or speed. They can be implemented if you're really short on RAM, but I would definitely not recommend them.
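For the 72-bit (nine bytes per element) variant, the address arithmetic mentioned above looks like this; byte_to_elem and byte_to_lane are illustrative names, and the division/modulo by a non-power-of-two is exactly the costly part being warned about:

```cpp
#include <stdint.h>

// With 9 bytes packed per 72-bit element, a flat byte index maps to
// an element index (division) and a byte lane within it (remainder).
// 307200 bytes / 9 bytes per element = 34134 elements, rounded up.
static const uint32_t BYTES_PER_ELEM = 9;

uint32_t byte_to_elem(uint32_t byte_idx) { return byte_idx / BYTES_PER_ELEM; }
uint32_t byte_to_lane(uint32_t byte_idx) { return byte_idx % BYTES_PER_ELEM; }
```

In hardware a divide-by-9 typically becomes a multiply-by-reciprocal plus shifts, which costs DSPs/LUTs and adds latency on every access.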

muzaffer
Teacher

I recognize that number :-) and ask you this question: are you sure you need to store a full frame to run your algorithm? Is there a way you can re-structure it to go over some limited number of rows (N*640 where N<480) in a rolling fashion?
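A rolling N-row buffer along those lines could be sketched as below; N, ROW_W, push_row and get are placeholder names of mine, and the wrap is a simple modulo on the row index:

```cpp
#include <stdint.h>

static const int ROW_W = 640; // 640*480 frame, per the thread
static const int N     = 8;   // rows kept on chip (N < 480)

// Only N*640 bytes of BRAM instead of a full 307200-byte frame.
static uint8_t rows[N][ROW_W];

// Store an incoming row; rows older than N lines are overwritten.
void push_row(int y, const uint8_t *src) {
    for (int x = 0; x < ROW_W; x++) rows[y % N][x] = src[x];
}

// Read a pixel from one of the most recent N rows.
uint8_t get(int y, int x) { return rows[y % N][x]; }
```

Whether this works depends on the algorithm's vertical reach: it only helps if no computation ever needs a row more than N lines behind the current one.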
siva_krishna
Visitor

My algorithm runs in a streaming fashion only. Actually, I call my accelerator multiple times for a single image, with different scaled versions of the original image, and I am facing some data-transfer issues. So, instead of transferring the data multiple times, I want to transfer the whole image once, rescale it internally, and use that.

u4223374
Advisor

Hah, I do the same thing for much the same reason. I need to produce scaled versions of the original image several hundred times, and I'd prefer not to burn the (off-chip) RAM bandwidth needed to do this via an AXI Master. I also needed very high speed (reading 10+ pixels per clock cycle) and that is only practical with properly-partitioned block RAMs.

 

With that said, I'm now looking into whether there are better approaches, since this one really doesn't scale to larger images (e.g. 1920*1080*24-bit needs an absolute minimum of 2700 BRAM_18K blocks).
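That 2700 figure is just the raw bit count divided by BRAM_18K capacity; a quick sanity check (min_bram18k is my helper, not a tool function):

```cpp
#include <stdint.h>

// Minimum BRAM_18K blocks to hold `bits` of data, ignoring aspect-ratio
// limits: each BRAM_18K holds 18432 bits (16384 usable in 16K*1 mode).
uint32_t min_bram18k(uint64_t bits) {
    return (uint32_t)((bits + 18431) / 18432); // ceiling division
}
```

The same formula reproduces the 134-BRAM figure from the original question for a 307200-byte frame.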

jprice
Scholar

I see this issue posted a lot. I'm not sure how to force HLS to be as efficient with a memory as one can be by hand. Instead, I create an external interface and instantiate the memory in HDL, where I can control it. It's silly that this is necessary, but it does work.
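One way to get that external interface from HLS is to make the array a top-level argument instead of a local, so the RAM itself lives in your HDL; this is only a sketch, and the function and port names are mine:

```cpp
#include <stdint.h>

// The frame buffer is a top-level argument, not a local array.
// With an ap_memory interface, HLS emits address/ce/we/data ports for
// `frame`, and you connect them to a BRAM you instantiate yourself
// in the surrounding HDL, sized exactly how you want it.
void top(uint8_t frame[307200], int idx, uint8_t *out) {
#pragma HLS INTERFACE ap_memory port=frame
    *out = frame[idx];
}
```

In C simulation the argument behaves like an ordinary array; the pragma only changes the generated RTL ports.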
