gdg
Explorer
6,278 Views
Registered: 03-22-2017

Memory Structures in Vivado HLS (newer documentation?)


Are there any newer documents or guidelines on memory structures in Vivado HLS?

So far I have found only the following:

Implementing Memory Structures for Video Processing in the Vivado HLS Tool

https://www.xilinx.com/support/documentation/application_notes/xapp793-memory-structures-video-vivado-hls.pdf

 

Thank you

5 Replies
u4223374
Advisor
6,248 Views
Registered: 04-26-2015

What sort of thing are you expecting? I don't think there have been any significant new features added or changes made since XAPP793 was written.

gdg
Explorer
6,238 Views
Registered: 03-22-2017

I was just wondering whether any new features or guidelines were out. I am interested in memory management in Vivado HLS (everything involving DRAM, BRAM, and FFs).

My references so far are the XAPP793 document, the 2017.1 Vivado User Guide (UG902), and the "SDSoC Environment Optimization Guide" (UG1235):

https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_1/ug902-vivado-high-level-synthesis.pdf

https://www.xilinx.com/support/documentation/sw_manuals/xilinx2016_4/ug1235-sdsoc-optimization-guide.pdf

 

u4223374
Advisor
9,288 Views
Registered: 04-26-2015

Well, there are basically six and a half types of memory in HLS. In order of decreasing size:

Off-chip SDRAM - accessed either via the Zynq PS or a MIG. You get to it with an AXI Master in HLS, which essentially acts like a big array. SDRAM bandwidth is very high for a single interface (e.g. on modern boards over 20GB/s is practical using 64-bit DDR4). However, you typically only get one interface - so that bandwidth has to be shared between everything that uses it. Access latency is also relatively high, especially when the accesses are random. For decent performance you need to do burst transfers, which deliver one word per clock cycle.
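
For example, a quick sketch of the AXI Master + burst pattern (function and port names are made up):

#include <string.h>

// Read 256 words from SDRAM over an AXI Master port, modify them, and
// write them back. The memcpy() calls are what let HLS infer burst
// transfers - one word per clock cycle - rather than 256 slow
// individual accesses.
void process_block(volatile int *ddr) {
#pragma HLS INTERFACE m_axi depth=256 port=ddr offset=slave
#pragma HLS INTERFACE s_axilite port=return
    int buf[256];                                      // on-chip copy (block RAM)
    memcpy(buf, (const int *)ddr, 256 * sizeof(int));  // burst read
    for (int i = 0; i < 256; i++) buf[i] += 1;
    memcpy((int *)ddr, buf, 256 * sizeof(int));        // burst write
}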

UltraRAM - on the larger UltraScale+ chips Xilinx has included a new type of block RAM, called UltraRAM. Each UltraRAM stores 4096*72 bits, which is 16 times the size of an 18K block RAM. The downside is that it's far less flexible than block RAM: the interface is always 72 bits wide, and the two ports give you something between SDP and TDP mode on a block RAM. The big limitation of UltraRAM - that the two ports share the same clock - tends not to matter for HLS, since C/C++ designs can't do multiple clocks anyway. The number of these blocks is fairly limited, so use them with care. UltraRAM can provide up to two elements per cycle, which (at 200MHz, which seems reasonable for many HLS designs on UltraScale) works out to about 3.6GB/s per block. Random access is fine, provided HLS can figure out the next address one cycle early (UltraRAM has an inherent one-cycle read delay).
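
If you want to force an array into UltraRAM, a minimal sketch - assuming a recent enough Vivado HLS whose RESOURCE pragma accepts the XPM_MEMORY uram core (names are made up):

#include <ap_int.h>

// A 4096 x 72-bit array is exactly one UltraRAM block.
void uram_copy(ap_uint<72> in[4096], ap_uint<72> out[4096]) {
    static ap_uint<72> big_buf[4096];
#pragma HLS RESOURCE variable=big_buf core=XPM_MEMORY uram
    for (int i = 0; i < 4096; i++) big_buf[i] = in[i];   // one port writing
    for (int i = 0; i < 4096; i++) out[i] = big_buf[i];  // one port reading
}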

Block RAM - the "normal" memory resource. Block RAM tends to be perfect for things like line buffers in image processing, and it's also plentiful on most chips (e.g. even the low-end Zynq 7020 has over 280 blocks). Note that HLS counts block RAM in 18K units, whereas the Xilinx datasheets count 36K units - so HLS will always report twice as many block RAMs as the datasheet does (the Zynq 7020 datasheet says 140 36K blocks; HLS says 280 18K blocks). Block RAM has two ports, each of which can be 1, 2, 4, 9, or 18 bits wide (with depths of 16K, 8K, 4K, 2K, and 1K respectively). In simple dual-port mode (one read port, one write port) the width can go up to 36 bits. As far as I know, HLS always makes both ports the same width. It is important to remember that block RAM hardware can't do arbitrary port widths, and the depths are always powers of two: while a 14-bit, 1300-element array technically occupies less than one block RAM's capacity (18,432 bits), it's going to get mapped into a pair of 2K*9-bit RAMs. Like UltraRAM, block RAM has an inherent one-cycle read delay - worth keeping in mind.
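
To make that rounding concrete, a tiny sketch of the 14-bit, 1300-element case (names are made up):

#include <ap_int.h>

// 1300 x 14-bit = 18,200 bits, just under one 18K block (18,432 bits).
// But depth rounds up to 2048 and width splits into 2 x 9 bits, so this
// lands in two 2K x 9-bit block RAMs.
ap_uint<14> lookup(ap_uint<11> addr) {  // addr must be < 1300
    static ap_uint<14> table[1300];
#pragma HLS RESOURCE variable=table core=RAM_1P_BRAM
    return table[addr];  // one-cycle read latency, like any block RAM
}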

LUT RAM - RAM built from LUTs. HLS tends not to use this automatically, but you can force it with the RESOURCE pragma. The main advantage of LUT RAM is that it saves your block RAM; when an array is only a few hundred elements it's often not worth burning a block RAM on it. As far as I know, HLS assumes the same one-cycle latency for this (even though the hardware doesn't require it) and can only go up to dual-port (the hardware can do a quad-port LUT RAM).
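
For example, a minimal sketch (names are made up):

#include <ap_int.h>

// Force a small array into LUT RAM so it doesn't waste an 18K block.
ap_uint<8> small_lut(ap_uint<7> addr, ap_uint<8> value, bool write) {
    static ap_uint<8> table[128];
#pragma HLS RESOURCE variable=table core=RAM_2P_LUTRAM
    if (write) table[addr] = value;
    return table[addr];
}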

Shift registers - very similar to LUT RAM in terms of resources (they're also LUT-based). Unlike a RAM, a shift register isn't great for random access - but it can shift all its elements along by one place in a single cycle, and you can read all of the elements in a single cycle. This makes it ideal for sliding windows across images. Generally I find that if you fully partition an array and use an appropriate access pattern, HLS will turn it into a shift register.
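
The pattern looks something like this (a sketch, names are made up):

#include <ap_int.h>

// Fully-partitioned 5-tap window: HLS recognises the shift-by-one
// access pattern and builds a shift register rather than a RAM.
ap_uint<11> window_sum(ap_uint<8> pixel_in) {
    static ap_uint<8> window[5];
#pragma HLS ARRAY_PARTITION variable=window complete
    for (int i = 4; i > 0; i--) {
#pragma HLS UNROLL
        window[i] = window[i - 1];  // shift every element in one cycle
    }
    window[0] = pixel_in;
    // All five taps are readable in the same cycle.
    return window[0] + window[1] + window[2] + window[3] + window[4];
}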

Registers - if you fully partition an array in all dimensions and it's not using a shift-register access pattern, then HLS will map it to individual registers. Registers give you unlimited access to every element instantly. The downside is that every "port" is a big multiplexer: if you've got 128 elements in an array and you want to be able to pull out 32 of them in any cycle, then HLS is going to build 32 128-to-1 multiplexers to achieve that. Resource consumption will go up very, very fast! Keep this for either very small arrays, or for arrays where you're using constant indices and the "array" is just there because the code looks nicer (e.g. compared to having a hundred separate variables).
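
For example, a sketch (names are made up):

#include <ap_int.h>

// Fully partitioned in all dimensions -> 16 individual registers.
ap_uint<8> pick(ap_uint<8> data_in[16], ap_uint<4> idx) {
#pragma HLS ARRAY_PARTITION variable=data_in complete dim=0
    static ap_uint<8> regs[16];
#pragma HLS ARRAY_PARTITION variable=regs complete dim=0
    for (int i = 0; i < 16; i++) {
#pragma HLS UNROLL
        regs[i] = data_in[i];  // all 16 writes can happen in one cycle
    }
    return regs[idx];  // variable index -> one 16-to-1, 8-bit multiplexer
}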

Registers #2 - you can treat the arbitrary-precision integer types as fully-partitioned arrays of 1-bit values, and access them in the same way (with an advantage over an array of bool or ap_uint<1>: you can assign them to each other, AND/OR/XOR them, etc.). If you do variable-index selection then HLS will complain, but it's no worse than a fully partitioned array of 1-bit values, and in any case the multiplexers tend to be small since they're all only one bit wide.
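
Something like this (a sketch, names are made up):

#include <ap_int.h>

// Treat ap_uint<128> as a fully-partitioned array of 128 1-bit values.
ap_uint<1> bit_test(ap_uint<128> a, ap_uint<128> b, ap_uint<7> idx) {
    ap_uint<128> both = a & b;  // whole-vector AND, one gate per bit
    both[0] = 1;                // constant index: free
    return both[idx];           // variable index: a 128-to-1, 1-bit mux
                                // (HLS may warn, as noted above)
}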

 

 

As far as I can tell, the only new "feature" since XAPP793 was written is UltraRAM - and that's only relevant if you've spent a lot of money on one of the larger UltraScale+ chips.

Accepted as solution.

veekshitha
Adventurer
1,545 Views
Registered: 08-13-2019

How do I use LUT RAM? How do I force the design to use it?

nithink
Xilinx Employee
1,532 Views
Registered: 09-04-2017

@veekshitha You can use the HLS RESOURCE pragma: specify the core as RAM_1P_LUTRAM or RAM_2P_LUTRAM, depending on your requirements.
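
For example (the array name is hypothetical):

#pragma HLS RESOURCE variable=my_array core=RAM_2P_LUTRAM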

Thanks,

Nithin
