I have an image processor entity that has 15 or more arrays similar to the one below, but all are different sizes and dimensions. I'm trying to figure out how to get Vivado to use "LUTRAM" for as much of my design as possible to reduce LUT usage.
I'm using this in the port declaration, but it doesn't seem to do anything.
An example of one of the 15-20 different signals is below:
I've also tried using this in the xdc file:
set_property RAM_STYLE DISTRIBUTED [get_cells sReference_window_reg]
I'm using reset in the process. I have declared initial values for all of these signals. What else specifically needs to be done to get Vivado to use BRAM/LUTRAM for these signals? (I definitely shouldn't have to take each signal and make it a separate entity with wr_en, addr, data_in, etc. That would be totally crazy.) There should be a way to tell Vivado, "Instead of using LUTs for this signal, use BRAM."
Inferring specific types of RAM with VHDL is done by using the attribute statement and by structuring your VHDL in specific ways. You will find many examples of this in the section called “RAM HDL Coding Guidelines” found on about page 110 of UG901. Note also the link on page 110 of UG901 called “Coding Examples”.
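For example, the attribute placement looks roughly like this (a minimal sketch; the entity, type, and signal names here are placeholders, not your actual code). The key point is that ram_style goes on the signal in the architecture's declarative region, not in the port declaration:

```vhdl
library ieee;
use ieee.std_logic_1164.all;

entity attr_demo is
    port( clk : in std_logic );
end attr_demo;

architecture rtl of attr_demo is
    -- hypothetical stand-in for one of your 15-20 arrays
    type window_t is array (0 to 24) of std_logic_vector(5 downto 0);
    signal sReference_window : window_t;

    -- ram_style must be attached to the SIGNAL, here in the
    -- architecture declarative region; putting it on a port
    -- declaration has no effect on RAM inference
    attribute ram_style : string;
    attribute ram_style of sReference_window : signal is "distributed";
begin
end rtl;
```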
I've seen that guide. It doesn't list a bunch of requirements for what my design has to have for a particular signal to be instantiated as LUTRAM/BRAM. It just has a bunch of LUTRAM/BRAM examples as separate entities. If I'm going to use separate entities, I may as well use the memory IP generator, right?
The design I'm working on is a stereo vision system. I have it working with good timing, but I can only process up to a 16 pixel disparity, which isn't very practical. My LUT usage increases as I increase the max search distance (max pixel disparity), or if I increase the kernel size. My LUTRAM and BRAM is at 1% though! I need to somehow transfer as much as possible onto LUTRAM/BRAM.
I was previously storing the first kernel_size - 1 image rows in what I thought would be BRAM by creating the memory with the BLOCK MEMORY GENERATOR. At that point I was using 0.5 of BRAM. I then switched to using a simple dual-port distributed ram (using the Distributed Memory Generator). The design still works and runs as expected on the device, but there is still no increase in LUTRAM usage, and BRAM usage is still 0.50 (1%). How could that be?
I just converted another large, 2d array to instead use a distributed memory entity. For some reason, synthesis is ignoring the fact that it is a distributed memory and still transforming the logic into nothing but LUTs. Why?
When I use the following VHDL from UG901, the array, RAM, is implemented as LUTRAM - and the Vivado Project Summary shows that I have LUTRAM utilization.
-- Single-Port RAM with Asynchronous Read (Distributed RAM)
-- File: rams_dist.vhd
library ieee;
use ieee.std_logic_1164.all;
use ieee.std_logic_unsigned.all;

entity rams_dist is
    port(
        clk : in std_logic;
        we  : in std_logic;
        a   : in std_logic_vector(5 downto 0);
        di  : in std_logic_vector(15 downto 0);
        do  : out std_logic_vector(15 downto 0)
    );
end rams_dist;

architecture syn of rams_dist is
    type ram_type is array (63 downto 0) of std_logic_vector(15 downto 0);
    signal RAM : ram_type;
    attribute ram_style : string;
    attribute ram_style of RAM : signal is "distributed";
begin
    process(clk)
    begin
        if (clk'event and clk = '1') then
            if (we = '1') then
                RAM(conv_integer(a)) <= di;
            end if;
        end if;
    end process;

    do <= RAM(conv_integer(a));
end syn;
If I'm going to use separate entities, I may as well use the memory IP generator, right?
Many synthesis engines will recognize the VHDL component, rams_dist.vhd, shown above. That is, rams_dist.vhd is portable code (except for maybe the attribute of "distributed"). However, using the Xilinx IP generator does not make portable code.
How could that be?
Please open your implemented design to see if your arrays have been implemented as "distributed memory".
...synthesis is ignoring the fact that it is a distributed memory and still transforming the logic into nothing but LUTs. Why?
Logic will be implemented as LUTs but arrays can be implemented as LUTRAM (distributed memory).
Imagine that, in the example project you posted, the logic wasn't interpreted as LUTRAM. What would you do to find out why? That is the critical question. How do you find out exactly why an entity, or distributed memory IP, is not being treated as memory?
...logic wasn't interpreted as LUTRAM
I think when you say "logic" you are referring to all parts of your HDL code. To me, logic means "combinational logic" or "boolean logic", which is only part of your code - and which synthesis transforms into LUTs (and not LUTRAM). Another part of your code is the arrays of data, which can (depending on how you write your code) be synthesized as either LUTRAM or BRAM.
I'm trying to figure out how to get Vivado to use "LUTRAM" for as much of my design as possible to reduce LUT usage.
This comment is from your original post. I don't understand it. As shown in UG474, the 7-Series FPGAs have both SLICEL and SLICEM blocks. The SLICEL has ordinary LUTs that are used to synthesize "combinational logic". The SLICEM has special LUTs that can be used to synthesize combinational logic, or distributed memory (LUTRAM), or other stuff. Synthesis will use the LUTs from SLICEM and SLICEL as it sees best. That is, LUTs from both SLICEM and SLICEL could be used as ordinary LUTs. That is, the SLICEM LUT is not restricted to be LUTRAM.
How do you find out exactly why an entity, or distributed memory IP, is not being treated as memory?
Vivado synthesis does amazing things to optimize your design. As you know, if some of your code is not being used then synthesis will simply remove it from the project. If you suspect this is happening then start putting DONT_TOUCH on things that seem to be disappearing - and see if they reappear. Also, if synthesis finds that a section of code is doing exactly the same thing as another section of code then it will remove one of the sections of code. These are just a few examples of how synthesis can do things that cause unexpected results.
So, again I ask you, have you opened your synthesized design to physically look at things that are unexpected? Are things actually missing or are things being synthesized into structures that don't make sense to you? Can you post a screen shot from "Open Synthesized Design" of an array that wasn't synthesized as memory?
You're coming at this the wrong way; that is not how synthesizers work.
You're supposed to think about how your HDL maps to the available logic, not how the logic should map to your code.
Synthesis tools (not just Vivado) use code templates to map HDL to logic resources. If your code can't be mapped, it will just build it out of registers/logic regardless of what you tell it.
You need to think about the underlying logic design BEFORE you write any HDL.
Ok, I opened up the design, which is absolutely massive and hardly readable. I have several 1d, 2d, and 3d arrays that are used in this pipelined stereo depth processor. LUTRAM and BRAM usage shows as 1%. In a design that is far too large to read all at once, what are the specific steps needed to find out why a given multidimensional array is not being instantiated as memory? As a first step, is there some way to view only the nets and cells that are directly related/connected to the array?
Thank you for the screen shots of things from “Open Synthesized Design”.
As a first step, is there some way to view only the nets and cells that are directly related/connected to the array?
David, I’m still not sure what you are worried about. Why do you say “how to get Vivado to use "LUTRAM" for as much of my design as possible to reduce LUT usage.”. Is your LUT usage very high? How high? Are you worried about running out of LUTs?
What is your LUTRAM usage? As I mentioned earlier, LUTRAM is constructed from special LUTs found in the SLICEM blocks of the FPGA. However, synthesis will use these SLICEM LUTs as ordinary LUTs if needed. So, if your LUTRAM usage is low that means you have lots of SLICEM LUTs available.
Yes, my LUT usage is maxed out. Even if it wasn't maxed out, I would just increase the disparity search width, and/or the search kernel size, until LUT usage was maxed out. Am I going down a dead end by trying to reduce LUT usage by converting arrays to distributed memory (LUTRAM) entities? I've converted two arrays out of about 8 arrays to use distributed memory so far but the highest search width that I can fit on the FPGA is actually going down...
I've noticed that LUTRAM usage is completely incorrectly shown in the post-synthesis utilization summary. That would have been nice to know earlier. The correct LUTRAM utilization is actually shown in the post-implementation utilization summary. Not sure why the post-synthesis utilization can't figure out that a distributed memory IP is going to end up being LUTRAM and therefore show LUTRAM utilization with at least a little accuracy...
Am I going down a dead end by trying to reduce LUT usage by converting arrays to distributed memory (LUTRAM) entities? Should I be using BRAM instead?
I've noticed that LUTRAM usage is completely incorrectly shown in the post-synthesis utilization summary.
Yes, many things are only approximated (and sometimes badly approximated) at post-synthesis time. You will find this is true with Timing Analysis and BRAM usage too.
Am I going down a dead end by trying to reduce LUT usage by converting arrays to distributed memory (LUTRAM) entities? Should I be using BRAM instead?
Yes and Yes. If you need more ordinary LUTs for your work, then use BRAM for memory instead of LUTRAM. Then, synthesis can use the special LUTRAM LUTs as ordinary LUTs. BRAM (especially big BRAM) may use some LUTs but not as many as LUTRAM.
On about page 118 of UG901, you will find HDL that can be used to infer Simple Dual-Port BRAM with a read and write latency of 1 clock cycle. Be aware that some configurations for BRAM (especially those created with the Block Memory Generator IP) have read latency of more than 1 clock cycle.
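A sketch along the lines of that UG901 simple dual-port template (my own simplified version, with my own names and widths, not the exact UG901 code):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity simple_dp_bram is
    port(
        clk   : in  std_logic;
        wea   : in  std_logic;
        addra : in  std_logic_vector(10 downto 0);
        dia   : in  std_logic_vector(5 downto 0);
        addrb : in  std_logic_vector(10 downto 0);
        dob   : out std_logic_vector(5 downto 0)
    );
end simple_dp_bram;

architecture syn of simple_dp_bram is
    type ram_type is array (0 to 2047) of std_logic_vector(5 downto 0);
    signal RAM : ram_type;
    attribute ram_style : string;
    attribute ram_style of RAM : signal is "block";
begin
    process(clk)
    begin
        if rising_edge(clk) then
            -- write port a: one write, at one address, per clock
            if wea = '1' then
                RAM(to_integer(unsigned(addra))) <= dia;
            end if;
            -- read port b: the REGISTERED (synchronous) read is what
            -- makes BRAM possible, giving the 1-clock read latency
            dob <= RAM(to_integer(unsigned(addrb)));
        end if;
    end process;
end syn;
```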
1. I haven't finished trying it out yet, but why would an implementation from UG901 have more desirable performance than their simple dual-port BRAM IP? Wouldn't Xilinx's IP intuitively represent the best implementation of a given design?
2. What is the benefit of having ena and enb (enable) signals? They just seem to be these unnecessary enables that have to be turned on and off but don't provide any actual benefit. For example, when you want to write, instead of just needing:
wea <= '1';
you have to pointlessly do this instead:
ena <= '1';
enb <= '0';
wea <= '1';
What benefit does this provide? I'm always going to be either reading or writing. ena and enb will never both be '0' at the same time. What do I get by having enables?
Maybe you're getting too hung up on the templates?
The main difference between LUTRAM and BRAM is that LUTRAM allows asynchronous reads. BRAM must use a synchronised (registered) version of the read/write address to infer BRAM properly.
The templates just give you access to all of the BRAM features; you don't always need en or we for it to infer properly - the key point is the behaviour. Your original code probably lacks the behaviour required to infer BRAMs, so you are going to have to change your code to make it infer properly.
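The read-style difference can be shown side by side (a minimal sketch with illustrative names; note that in practice an array with any asynchronous read cannot go to BRAM, so the two outputs here are really two separate design choices shown together for comparison):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity read_styles is
    port(
        clk      : in  std_logic;
        we       : in  std_logic;
        addr     : in  std_logic_vector(5 downto 0);
        di       : in  std_logic_vector(5 downto 0);
        do_async : out std_logic_vector(5 downto 0);  -- LUTRAM-style read
        do_sync  : out std_logic_vector(5 downto 0)   -- BRAM-style read
    );
end read_styles;

architecture syn of read_styles is
    type ram_type is array (0 to 63) of std_logic_vector(5 downto 0);
    signal RAM : ram_type;
begin
    process(clk)
    begin
        if rising_edge(clk) then
            if we = '1' then
                RAM(to_integer(unsigned(addr))) <= di;
            end if;
            -- registered read: eligible for BRAM
            do_sync <= RAM(to_integer(unsigned(addr)));
        end if;
    end process;

    -- asynchronous read: forces distributed RAM (LUTRAM)
    do_async <= RAM(to_integer(unsigned(addr)));
end syn;
```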
BTW – what FPGA are you using? For all my comments, I've assumed you are using a Xilinx 7-Series FPGA.
..why would an implementation from UG901 have more desirable performance than their simple dual-port BRAM IP?
I’ve found the two implementations to be very similar for small BRAM. The IP does better for large BRAM. The 7-Series FPGAs have BRAM blocks with sizes 36Kb (RAMB36E1) and 18Kb (RAMB18E1) – see UG473. So, by large BRAM I’m referring to an array that uses more than 8 of the RAMB36E1. Also, as I mentioned before, there is the issue of code portability. That is, the VHDL from UG901 used to infer memory will be recognized by many different FPGA toolsets whereas Xilinx IP is recognized only by Xilinx tools.
What is the benefit of having ena and enb (enable) signals?
Good question! BRAM is one of the more power-hungry components inside the FPGA. So, you can save power by setting ena=0 and enb=0 when you are not using the BRAM. By default Vivado implementation does something called BRAM Power Optimization. That is, implementation will automatically connect circuits to ena and enb that will power down the BRAM when it “thinks” you are not using the BRAM. These circuits use LUTs and (somewhat annoyingly) you get the circuits even if your VHDL permanently sets ena=1 and enb=1. To prevent implementation from adding the Power Optimization circuits on ena and enb, you must go to implementation settings and select the opt_design directive, NoBramPowerOpt. You'll find more information about NoBramPowerOpt and other BRAM peculiarities in <this> post. There are other implementation settings (directives) that you can explore to reduce LUT (combinational logic) usage. These are described on about page 58 of UG904.
Another thing you can do to reduce LUT usage is to use the DSP48 for math (multiplication, addition, and subtraction) instead of letting math be done in the FPGA fabric (with LUTs). For more information see the USE_DSP attribute on about page 60 of UG901. Division is notoriously difficult and uses lots of FPGA resources – unless you are dividing by a power-of-2, which is just bit shifting. Make sure you are not calculating constant values over and over. Instead, store these constants in BRAM.
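For the USE_DSP attribute, the placement looks something like this (a hedged sketch; the entity, signal names, and widths are my own, only the attribute itself is from UG901):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity dsp_add is
    port(
        clk  : in  std_logic;
        a, b : in  unsigned(5 downto 0);
        s    : out unsigned(6 downto 0)
    );
end dsp_add;

architecture rtl of dsp_add is
    signal sum : unsigned(6 downto 0);
    -- ask synthesis to map this math onto a DSP48
    -- instead of building an adder from fabric LUTs
    attribute use_dsp : string;
    attribute use_dsp of sum : signal is "yes";
begin
    process(clk)
    begin
        if rising_edge(clk) then
            sum <= resize(a, 7) + resize(b, 7);
        end if;
    end process;

    s <= sum;
end rtl;
```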
Finally, you may want to Google “xilinx reduce lut usage”. You’ll get many hits that indicate you are not alone in trying to reduce LUT usage. The person who wrote <this> post seems to have done a good study of the problem.
Thanks for the help Mark.
I'm using an XC7A35T-1CPG236C (Digilent Cmod A7-35T).
1. If my logic never has a situation where ena and enb are '0', and I'm not concerned about power, shouldn't I just save a few LUTs and remove ena and enb from the BRAM design? However, I don't want to do anything that is going to mysteriously make Vivado fail to implement the simple dual-port BRAM entity as BRAM.
2. I don't currently do any division and very little to no multiplication from what I remember. Just a lot of 6-bit addition and some minus one subtraction. Is it worth it to use a lookup ROM for the 6-bit addition or is it just as efficient to use the DSP48 (assuming I'm successful in getting Vivado to actually honor the USE_DSP attribute)?
I'm using an XC7A35T-1CPG236C
Thanks. FYI – the XC7A200T has more than 6x the slices (and LUTs) - see Table 4 in Xilinx document, DS180.
…shouldn't I just save a few LUTs and remove ena and enb from the BRAM design?
No. The problem is that the BRAM component physically has the ena and enb pins. Unless you use NoBramPowerOpt, Vivado will automatically add shutdown circuits (built from LUTs) to these pins - even if you tie them high in your VHDL. -but, as you say, we are only talking about a few LUTs.
Is it worth it to use a lookup ROM for the 6-bit addition or is it just as efficient to use the DSP48...
For only 6-bit addition, I suspect that both methods will be equally good at saving LUTs. However, as you say, we sometimes have trouble getting USE_DSP to work. Also, using ROM should make your code more portable.
How would you convert a 3-dimensional array, in a triple-nested for loop, to use a BRAM entity without using multiplication to specify the address?
In the simplified example below, I'd like to convert diff_array to use the dual-port BRAM entity, but I want to somehow avoid using multiplication to calculate the correct address to use. Is there a trick that you know of for this situation?
for k in 0 to search_width loop
    for i in 0 to window_size loop
        for j in 0 to window_size loop
            diff_array(k,i,j) <= unsigned(abs(signed(unsigned(a(i)(j))) - signed(unsigned(b(i)(j+k)))));
        end loop;
    end loop;
end loop;
Assuming this is inside a clocked process, this cannot be a ram of any sort.
This is simply an array of computed values, all of which are calculated on every clock cycle. In a RAM, only 1 location is available per clock cycle. You would never use a for loop for a ram. And using anything other than a 1D array would just be luck if Vivado managed to synthesise it. Using a 2D (or 3D) type to infer a ram is going to complicate things and most likely confuse Vivado (or just not work).
Without some real code, it is hard to understand the context. But I still think you would be better served taking a step back from your code and looking at your architecture - you should be able to work out where RAMs are needed before you write any HDL.
Is there a trick that you know of for this situation?
Take a look at the section called “3D RAM Inference” on about page 151 of UG901. There, it is shown how to make good use of the BRAM ena and enb pins, which become the third dimension of the array.
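A rough sketch in the spirit of that UG901 section (my own simplified version, not the exact UG901 code - names, widths, and dimensions are all illustrative): the largest dimension becomes the address, and the small third dimension becomes a slice-select that drives the enables.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ram_3d is
    port(
        clk  : in  std_logic;
        we   : in  std_logic;
        sel  : in  unsigned(2 downto 0);    -- third dimension (0..7)
        addr : in  unsigned(10 downto 0);   -- largest dimension (0..2047)
        di   : in  std_logic_vector(5 downto 0);
        dout : out std_logic_vector(5 downto 0)
    );
end ram_3d;

architecture rtl of ram_3d is
    type ram_t  is array (0 to 2047) of std_logic_vector(5 downto 0);
    type ram3_t is array (0 to 7) of ram_t;
    signal mem : ram3_t;
begin
    process(clk)
    begin
        if rising_edge(clk) then
            if we = '1' then
                -- only the selected slice is written this cycle
                mem(to_integer(sel))(to_integer(addr)) <= di;
            end if;
            -- registered read keeps the array eligible for BRAM
            dout <= mem(to_integer(sel))(to_integer(addr));
        end if;
    end process;
end rtl;
```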
Somewhat related is another tip called an array of constants, which can be done in VHDL as follows:
type T_ARR1 is array(3 downto 0) of unsigned(1 downto 0);
constant conarr1 : T_ARR1 := ("11","00","10","01"); -- array of constants
Vivado synthesis will sometimes not store array, conarr1, in memory (ie. LUT ROM or block ROM). Instead, bits of conarr1 will be synthesised as connections to logic '1' or '0' - or simply absorbed into the look-up tables of LUTs that result from combinational logic using conarr1.
Do you think it would be equivalent in terms of BRAM and LUT usage to use nested for generate loops (generating simple dual port entities) instead of the 3d dual port BRAM example?
A single BRAM can replace a huge LUTRAM, or many smaller LUTRAMs.
The fact you are talking about generate loops makes me wonder if the architecture is suitable.
Yes you can infer BRAMs inside a generate loop, but each generate iteration would contain 1 BRAM.
Do you think it would be equivalent in terms of BRAM and LUT usage....
I’m not sure but I expect that both ways can be made to have similar LUT usage. You will have many other ideas like this for reducing LUT usage. -best to make some small Vivado projects to test your ideas (so you can quickly get through to implementation and see actual LUT usage). Finding what works best is all part of the fun!
A bigger problem when converting 3d arrays to BRAM in these nested for loops is that you can't set the address and data_in within them because there is only one instance. In other words, when the loops are unrolled, it reveals that you are trying to set address and data_in to several different values at once (in one clock tick). The solution to that is to either pipeline the operation, so that each iteration of the loop happens over one cycle, or to create as many BRAMs as there are loop cycles. Both of those solutions seem to defeat the purpose of converting the array to BRAM because you just end up with lots of tiny little BRAM instances that are only as large as the data in one of the 3d array elements. Is there another solution I am missing?
I guess you are just ignoring my posts.
Again you are showing you're working backwards - you have the code, and you're trying to fit it to the chip. This is a design problem, not a coding problem. You need to go back a few steps, and have a look at the design from the top down. You know you have rams available, and pipelines, so you really need to see how you can re-arrange the algorithm to fit the available architecture.
If you have loops, it is too late to use ram of pretty much any sort - you have a big set of parallel accesses. You need to re-design it to pipeline it - this will likely mean removing the loops altogether.
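Removing the loops means turning k/i/j into counters, so that exactly one element is produced (and one RAM location written) per clock. A minimal sketch of that idea (all names and generics are mine, not your code):

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity pipelined_scan is
    generic(
        SEARCH_WIDTH : natural := 16;
        WINDOW_SIZE  : natural := 5
    );
    port(
        clk : in  std_logic;
        we  : out std_logic;  -- one write strobe per clock
        kk  : out natural range 0 to SEARCH_WIDTH - 1;
        ii  : out natural range 0 to WINDOW_SIZE - 1;
        jj  : out natural range 0 to WINDOW_SIZE - 1
    );
end pipelined_scan;

architecture rtl of pipelined_scan is
    signal k : natural range 0 to SEARCH_WIDTH - 1 := 0;
    signal i : natural range 0 to WINDOW_SIZE - 1  := 0;
    signal j : natural range 0 to WINDOW_SIZE - 1  := 0;
begin
    -- counters replace the nested for loops: one (k,i,j) index per
    -- clock, so downstream logic computes one difference and performs
    -- one RAM write per cycle instead of all of them at once
    process(clk)
    begin
        if rising_edge(clk) then
            if j = WINDOW_SIZE - 1 then
                j <= 0;
                if i = WINDOW_SIZE - 1 then
                    i <= 0;
                    if k = SEARCH_WIDTH - 1 then
                        k <= 0;
                    else
                        k <= k + 1;
                    end if;
                else
                    i <= i + 1;
                end if;
            else
                j <= j + 1;
            end if;
        end if;
    end process;

    we <= '1';  -- always writing in this trivial sketch
    kk <= k; ii <= i; jj <= j;
end rtl;
```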
Let’s say you need an array with dimensions, 8 x 6 x 2048. For the 6x2048 part you can instantiate one RAMB18E1 to hold 2048 values that are each 6-bits wide. When writing to this BRAM you will need 11 address (ADDRA) lines, 6 data (DINA) lines, 1 enable-A (ENA) line, and 1 write-enable (WEN) line. For the 8x part of this array, you then instantiate eight of these (6x2048) RAMB18E1. Note that these eight RAMB18E1 can share the ADDRA, DINA, and ENA lines, but each must have its own unique WEN(i) line. Writing a value to this array consists of setting ENA=1, placing values on ADDRA and on DINA, and then asserting one of the WEN(i). This can all be done in one clock cycle.
Does this give you what you need? -or do you need to write each bit of the 6-bit wide value individually?
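The same 8 x 6 x 2048 arrangement can be sketched by inference rather than direct RAMB18E1 instantiation (a hedged sketch; all names are mine): eight 2048 x 6 RAMs share the address and data lines, and a one-hot write enable selects which bank is written.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity banked_ram is
    port(
        clk  : in  std_logic;
        we   : in  std_logic_vector(7 downto 0);  -- one-hot bank write enable
        addr : in  unsigned(10 downto 0);         -- shared address
        din  : in  std_logic_vector(5 downto 0);  -- shared data
        sel  : in  unsigned(2 downto 0);          -- bank to read
        dout : out std_logic_vector(5 downto 0)
    );
end banked_ram;

architecture rtl of banked_ram is
    type ram_t     is array (0 to 2047) of std_logic_vector(5 downto 0);
    type bank_do_t is array (0 to 7) of std_logic_vector(5 downto 0);
    signal bank_do : bank_do_t;
    signal sel_r   : unsigned(2 downto 0);
begin
    g_bank : for b in 0 to 7 generate
        signal ram : ram_t;
    begin
        process(clk)
        begin
            if rising_edge(clk) then
                if we(b) = '1' then                   -- unique WE per bank
                    ram(to_integer(addr)) <= din;
                end if;
                bank_do(b) <= ram(to_integer(addr));  -- registered read
            end if;
        end process;
    end generate;

    -- register the bank select to match the 1-clock read latency
    process(clk)
    begin
        if rising_edge(clk) then
            sel_r <= sel;
        end if;
    end process;

    dout <= bank_do(to_integer(sel_r));
end rtl;
```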
I think the main question is "when can an RTL array be converted to a RAM".
The answer is "When the operations done to that RTL array are consistent with the capabilities of the RAM".
The following are the (basic) capabilities of a RAM:
- a RAM has a small number of ports - generally one or two
- each port can perform at most one access (one read or one write, at one address) per clock
- writes are always synchronous to a clock
- reads are synchronous (BRAM) or may be asynchronous (distributed RAM/LUTRAM)
Only if your RTL code follows all these rules is an RTL array eligible for being packed into a RAM. If any one of them is violated, then the synthesis tool will (correctly) not infer a RAM in place of the array - it will be forced to use individual flip-flops.
So, when an array is not packed into RAMs, you need to ask yourself - does my access pattern conform to all the above rules.
In rare cases, even if you conform to all the rules, the structure of the RTL code is too complex for the synthesis tool to recognize that all the rules have been met. As a result, Xilinx provides templates (which have been mentioned by others) that are known to be properly recognized by the synthesis tool for inferring RAMs. The templates help you ensure that you have met all the above requirements. If you find that you cannot convert the access to your array into a structure that conforms to the templates, then you are probably violating at least one of the rules, which is why you are not getting RAMs.
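The contrast can be made concrete with a small (entirely illustrative) example: one array whose access pattern conforms to the rules, and one that touches every element on a single clock and is therefore forced into flip-flops.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ram_rules is
    port(
        clk  : in  std_logic;
        we   : in  std_logic;
        clr  : in  std_logic;
        addr : in  std_logic_vector(5 downto 0);
        din  : in  std_logic_vector(5 downto 0);
        dout : out std_logic_vector(5 downto 0)
    );
end ram_rules;

architecture rtl of ram_rules is
    type arr_t is array (0 to 63) of std_logic_vector(5 downto 0);
    signal good_arr : arr_t;  -- conforms to the rules: can become a RAM
    signal bad_arr  : arr_t;  -- violates them: must become flip-flops
begin
    process(clk)
    begin
        if rising_edge(clk) then
            -- one write, at one address, per clock: RAM-eligible
            if we = '1' then
                good_arr(to_integer(unsigned(addr))) <= din;
            end if;
            dout <= good_arr(to_integer(unsigned(addr)));

            -- every element written on one clock: NOT RAM-eligible
            -- (bad_arr is never read here; it exists only to show
            -- the violating access pattern)
            if clr = '1' then
                for n in 0 to 63 loop
                    bad_arr(n) <= (others => '0');
                end loop;
            end if;
        end if;
    end process;
end rtl;
```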
Yep. It is a little hard to understand at first but it seems like the general rule is that you take whatever the largest dimension is and make that the address. So a 5x5 kernel, sliding across 50 pixels, storing 6 bits, would turn into a two-dimensional for generate statement (along the 5x5 kernel dimensions), with an address bit width of ceil(log2(50)), and a data width of 6 bits. I don't know if Vivado will actually try to create 25 wildly under-utilized BRAMs, or if it will somehow combine things though...
Actually, that doesn't work in this case. See the example below. The commented-out line is the 3d array that I am attempting to replace with BRAM. grayscale_diff_wr_addr_array is of course always set to the highest value of k. I don't think this particular optimization can be done because it all happens on one clock tick.
-- move the 5x5 search window to the right k pixels
for k in 0 to search_width_minus_1 loop
    -- compare the 5x5 reference window to the 5x5 section of the search window
    for i in 0 to window_size_minus_1 loop
        for j in 0 to window_size_minus_1 loop
            grayscale_diff_ena_array(i,j)     <= '1';
            grayscale_diff_enb_array(i,j)     <= '0';
            grayscale_diff_wr_en_array(i,j)   <= '1';
            grayscale_diff_wr_addr_array(i,j) <= std_logic_vector(to_unsigned(k, grayscale_diff_wr_addr_array(i,j)'length));
            grayscale_diff_d_in_array(i,j)    <= unsigned(abs(signed(unsigned(reference_window(i)(j))) - signed(unsigned(search_window(i)(j+k)))));
            -- grayscale_diff_array(k,i,j) <= unsigned(abs(signed(unsigned(reference_window(i)(j))) - signed(unsigned(search_window(i)(j+k)))));
        end loop;
    end loop;
end loop;
- RAMs with a larger number of read ports can be constructed using "parallel" RAMs, but you can never get more than one or two write ports in a RAM. Period.
I was wondering about this as well. How would I create a 25-port ROM (that represents the absolute value of the difference between two 6-bit values)? I want to somehow just use one instance of that ROM but be able to read from potentially 25 different addresses at once. I'm also worried that Vivado will just take the effort I spent creating the "25-port ROM" and convert it into a bunch of LUTs anyway, similar to what it would do if I just created a giant constant that stored those values.
Given that you are talking about pixels, I assume you're doing image processing. Is this on live video?
Either way, I assume your data source is providing data serially in some way or another? If this is the case, it should be quite straight forward to construct your NxN pixel matrix, using N shift registers and N BRAMs. Data gets pushed into the shift register and BRAM on entry into the system. At the same time, the 2nd port of the BRAM plays out the last line of data into another shift register and another BRAM.
From all the shift registers you have your N taps, and together you get your NxN kernel.
Using simple counters (for input position) you can work out what logic is needed to mux in data at the edges of the image (for pixel replication or simply inserting 0 or White)
Multiple kernels just needs multiple copies of this arrangement.
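One stage of that arrangement can be sketched like this (a hedged sketch; all names, widths, and the line length are my own assumptions): pixels shift through a small register chain for the horizontal taps, while a BRAM-eligible array delays the stream by one full image line to feed the next kernel row.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity line_buffer_stage is
    generic(
        LINE_LEN : natural := 640;  -- pixels per image row (assumed)
        TAPS     : natural := 5     -- kernel width
    );
    port(
        clk      : in  std_logic;
        px_in    : in  std_logic_vector(5 downto 0);
        taps_out : out std_logic_vector(TAPS*6 - 1 downto 0);
        line_out : out std_logic_vector(5 downto 0)  -- delayed one full line
    );
end line_buffer_stage;

architecture rtl of line_buffer_stage is
    type sr_t is array (0 to TAPS - 1) of std_logic_vector(5 downto 0);
    signal sr : sr_t;
    type line_t is array (0 to LINE_LEN - 1) of std_logic_vector(5 downto 0);
    signal line_ram : line_t;
    signal wr_addr  : natural range 0 to LINE_LEN - 1 := 0;
begin
    process(clk)
    begin
        if rising_edge(clk) then
            -- horizontal taps: a plain shift register
            sr(0) <= px_in;
            for t in 1 to TAPS - 1 loop
                sr(t) <= sr(t - 1);
            end loop;

            -- line delay: one write and one registered read per clock,
            -- so this array can infer as a BRAM (read-first behaviour)
            line_out <= line_ram(wr_addr);
            line_ram(wr_addr) <= px_in;
            if wr_addr = LINE_LEN - 1 then
                wr_addr <= 0;
            else
                wr_addr <= wr_addr + 1;
            end if;
        end if;
    end process;

    g_tap : for t in 0 to TAPS - 1 generate
        taps_out((t+1)*6 - 1 downto t*6) <= sr(t);
    end generate;
end rtl;
```

Chaining these stages (line_out of one feeding px_in of the next) gives the N shift registers and N BRAMs described above.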
I want to somehow just use one instance of that ROM but be able to read from potentially 25 different addresses at once.
That's exactly the point - you can't.
This is the fundamental difference between an array of flip-flops/LUTs and a RAM/ROM. An array of flip-flops or LUTs can have any number of its elements written/read on a single clock. A RAM can only do one operation per port.
It is specifically because of this limitation (one access per port) that a RAM is so much more compact than a multidimensional array of flip-flops.
I am looking at your code snippets. These are all RTL loops that are modifying many or all elements of the array on the same clock. These are fundamentally not implementable in a RAM.