Navindaxon
Visitor
Registered: 06-30-2020

Why is Vivado unable to synthesize one of these RAMs, but can synthesize the other?

I will provide 2 RAM instantiations below. The first is a fairly typical example of a dual port RAM. It has all the standard features, such as synchronous write and read, and so forth. The second is identical, except that the read element is asynchronous.

When I run a testbench, I see the outputs are identical, except that the output for the first is delayed by a single clock cycle compared to the second. This makes sense, given that VHDL's delta cycle makes process assignments at the end of the process, and uses old values first.

So my question is this: why does the second example cause the following error:

[Synth 8-3391] Unable to infer a block/distributed RAM for 'RAM_reg' because the memory pattern used is not supported.
Failed to dissolve the memory into bits because the number of bits (131072) is too large.
Use 'set_param synth.elaboration.rodinMoreOptions {rt::set_parameter dissolveMemorySizeLimit 131072}' to allow the memory to be dissolved into individual bits

And, if I set the dissolveMemorySizeLimit to the appropriate value, using the TCL console, why does synthesis suddenly go from a few seconds to hours or more? Is there a way to do the asynchronous read (which would be slightly preferable to me) that doesn't run into this sort of problem? I have successfully used the memory2 design on a smaller scale in a separate part of the project, and it works fine.

I am currently using a Basys 3 board for this implementation.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity memory1 is
generic (
	K : integer := 13;
	W : integer := 16
);
port (
	clk : in std_logic;
	enA, enB : in std_logic;
	i, j: in std_logic_vector(K-1 downto 0);
	inA, inB : in std_logic_vector(W-1 downto 0);
	outA, outB: out std_logic_vector(W-1 downto 0)
);
end memory1;

architecture Behavioral of memory1 is

type RAM_TYPE is array (0 to 2**K-1) of std_logic_vector(W-1 downto 0);
signal RAM : ram_type;

begin
process(clk)
begin
	if clk'event and clk='1' then
	
		if enA = '1' then
			RAM(to_integer(unsigned(i))) <= inA;
		end if;
		outA <= RAM(to_integer(unsigned(i)));	
		
		if enB = '1' then
			RAM(to_integer(unsigned(j))) <= inB;
		end if;
		outB <= RAM(to_integer(unsigned(j)));
		
	end if;
end process;

end Behavioral;

And a second one:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity memory2 is
generic (
	K : integer := 13;
	W : integer := 16
);
port (
	clk : in std_logic;
	enA, enB : in std_logic;
	i, j: in std_logic_vector(K-1 downto 0);
	inA, inB : in std_logic_vector(W-1 downto 0);
	outA, outB: out std_logic_vector(W-1 downto 0)
);
end memory2;

architecture Behavioral of memory2 is

type RAM_TYPE is array (0 to 2**K-1) of std_logic_vector(W-1 downto 0);
signal RAM : ram_type;

begin
outA <= RAM(to_integer(unsigned(i)));
outB <= RAM(to_integer(unsigned(j)));

process(clk)
begin
	if clk'event and clk='1' then
	
		if enA = '1' then
			RAM(to_integer(unsigned(i))) <= inA;
		end if;
		
		if enB = '1' then
			RAM(to_integer(unsigned(j))) <= inB;
		end if;
		
	end if;
end process;

end Behavioral;

 


Accepted Solutions
maps-mpls
Mentor
Registered: 06-20-2017

Vivado is genuinely trying to help you.  Really, it is. 

But if you insist, it will give you some rope, and if you take it, well, in all fairness, it did try to warn you.

What kind of RAM were you hoping to infer?  BRAM?  Distributed RAM?  RAM composed of flip-flops and read muxes on the asynchronous read side?  Because registers with huge muxes are what you described in memory2's architecture.

>This makes sense, given that VHDL's delta cycle makes process assignments at the end of the process, and uses old values first.

The only role delta cycles play in your memory2 description is when both ports write to the same address on the same clock edge: in simulation, inB (the last transaction) would win.

Synthesizers work on pattern matching.  Simulators work by executing code.  Simulators can simulate things that cannot yet be synthesized.  In your case, it can be synthesized.  But a quick look at UG901 shows that they don't have a recommended pattern for a distributed ram with two write ports and two asynchronous read ports.  Indeed, if you look at UG474, there is no distributed memory that supports two write ports. 

This leaves you with registers, since block ram, while supporting dual write ports, only supports synchronous reads. 

Since you're effectively describing a dual port synchronous write memory with a dual port asynchronous read composed of flip-flops for the writes and huge read muxes for the read, you have to consider:

Each read data bit will need an 8192-to-1 mux with 13 bits of select, for a combinational circuit with 8205 inputs per output bit.  Ignoring F7-muxes for the sake of quick analysis, this will take 6 levels of LUTs (log base 6 of 8205, rounded up, since a LUT has at most 6 inputs).

maps-mpls_0-1618542172755.png

6 levels of LUTs wouldn't be so bad, except that you are condensing 8205 bits down to a single bit, all of that times 32 (16 data bits for each of your two read ports).  To put it another way, you have a boolean equation with 8205 inputs.  Huge k-map if you did this by hand.  Fortunately, synthesizers on modern computers can manage the boolean algebra quickly.
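
As a rough check of the numbers above, here is the arithmetic written out. It is a back-of-the-envelope sketch that, like the estimate itself, ignores the F7 muxes and treats every LUT as a generic 6-input function; the first line is the bit count that also appears in the Synth 8-3391 message.

```latex
\begin{align*}
\text{memory bits} &= 2^{13} \times 16 = 131072 \\
\text{inputs per output bit} &= 2^{13} + 13 = 8192 + 13 = 8205 \\
\text{LUT levels} &\approx \lceil \log_6 8205 \rceil = \lceil 5.03 \rceil = 6
  \quad (6^5 = 7776 < 8205 \le 6^6 = 46656) \\
\text{output bits} &= 2 \times 16 = 32
\end{align*}
```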

The real problem is the congestion in the general purpose routing the synthesizer anticipates, in my estimation.  The synthesizer can map a large boolean expression to LUTs, but I suspect it is the efforts it makes to simplify logic in anticipation of congestion that will pop up during implementation that is causing it to make herculean efforts and consume processing time.  If you have timing constraints, as you should, it will be even worse.

You could also code it this way (might as well add a reset to your ram, since you're describing flip flops):

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity memory2 is
generic (
  G_ADDR_BITS : integer := 13;
  G_DATA_BITS : integer := 16
);
port (
  iCLK   : IN  std_logic;
  iRST   : IN  std_logic;
  iENA   : IN  std_logic;
  iENB   : IN  std_logic;
  iAddrA : IN  std_logic_vector(G_ADDR_BITS-1 downto 0);
  iAddrB : IN  std_logic_vector(G_ADDR_BITS-1 downto 0);
  iDataA : IN  std_logic_vector(G_DATA_BITS-1 downto 0);
  iDataB : IN  std_logic_vector(G_DATA_BITS-1 downto 0);
  oDataA : OUT std_logic_vector(G_DATA_BITS-1 downto 0);
  oDataB : OUT std_logic_vector(G_DATA_BITS-1 downto 0)
);
end entity memory2;

architecture danger of memory2 is
  type REG_ARRAY_TYPE is array (0 to 2**G_ADDR_BITS-1) of std_logic_vector(G_DATA_BITS-1 downto 0);
  signal REG_ARRAY : REG_ARRAY_TYPE;
begin -- architecture

  process(iCLK)
  begin
    if rising_edge(iCLK) then
      if iRST = '1' then
        REG_ARRAY <= (others => (others => '0'));
      else
        if iENA = '1' then
          REG_ARRAY(to_integer(unsigned(iAddrA))) <= iDataA;
        end if;
        if iENB = '1' then
          REG_ARRAY(to_integer(unsigned(iAddrB))) <= iDataB;
        end if;
      end if;
    end if;
  end process;

oDataA <= REG_ARRAY(to_integer(unsigned(iAddrA)));
oDataB <= REG_ARRAY(to_integer(unsigned(iAddrB)));
end architecture danger;

This should get rid of your warning and possibly the need for a special Tcl command, since it will be clear to the synthesizer you intend to have flip flops because you coded the reset.  

But you will still have your boolean functions of 8205 inputs, and all the problems that entails.

It will still take some time to synthesize, because you're creating all sorts of congestion in the general purpose routing with those huge read muxes.  And if you have timing constraints, you're making it hard on the timing driven synthesis to meet timing based on estimated routing delays.

If you can live with a single write port, you could do this to get your two asynchronous read ports:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

entity memory2 is
generic (
  G_ADDR_BITS : integer := 13;
  G_DATA_BITS : integer := 16
);
port (
  iCLK   : IN  std_logic;
  iRST   : IN  std_logic;
  iENA   : IN  std_logic;
  iAddrA : IN  std_logic_vector(G_ADDR_BITS-1 downto 0);
  iAddrB : IN  std_logic_vector(G_ADDR_BITS-1 downto 0);
  iDataA : IN  std_logic_vector(G_DATA_BITS-1 downto 0);
  oDataA : OUT std_logic_vector(G_DATA_BITS-1 downto 0);
  oDataB : OUT std_logic_vector(G_DATA_BITS-1 downto 0)
);
end entity memory2;

architecture will_infer_distributed of memory2 is
  type tMEM_ARRAY is array (0 to 2**G_ADDR_BITS-1) of std_logic_vector(G_DATA_BITS-1 downto 0);
  signal REG_ARRAY : tMEM_ARRAY;
begin -- architecture 

  process(iCLK)
  begin
    if rising_edge(iCLK) then
      if iENA = '1' then
        REG_ARRAY(to_integer(unsigned(iAddrA))) <= iDataA;
      end if;
    end if;
  end process;
oDataA <= REG_ARRAY(to_integer(unsigned(iAddrA)));
oDataB <= REG_ARRAY(to_integer(unsigned(iAddrB)));
end architecture will_infer_distributed;

And it will synthesize fairly quickly, into distributed memory (see UG474).

As you continue to hone your understanding of VHDL, it is important to also consider the synthesis user guide and your target library.  In the case of Xilinx, this is UG901 for synthesis, and UG474 and UG768 for your target library.  And it is also important to remember that synthesizers work by pattern matching, especially if you're trying to infer something complex (like, say, a DSP48), and sometimes the synthesizer writers did not anticipate a style of code you believe reasonably describes the higher order primitive.

Understanding delta cycles and simulation in general is important, but it is mostly irrelevant in this particular problem. 

You want to code at as high a level as is appropriate to your problem, as you have.  But learn to recognize the types of issues described above based on your understanding of the target library and your experience.

Vivado has a great tool that once upon a time used to cost a lot of money and was only available in 3rd party synthesizers:  An ability to convert your HDL to schematics--either post synthesis, or what the Flow Navigator calls "RTL Analysis" but which, to be honest, is sometimes buggy in its rendering.  And it is not the same analysis that takes place during synthesis as far as I can tell.

Use these tools, and pay particular attention to your post-synthesis netlist schematic, and you'll come to recognize the sorts of issues I described above on your own.  You can become a guru in no time (at least until the next bit of technology comes out).  In the old days it would take an engineer much longer to learn the relationship between their coding style, the netlist that results, and the relationship to the target library.

Another possibility, if you don't like spending time in user guides, is the distributed memory generator in the IP catalog.  The dual port ram has a single data input (single write port).  This is a clue, in that if something so simple wasn't offered there, it might not be possible.

A way you could have troubleshot this problem would have been to reduce your generic K from 13 to something small like 3 or 4, and your generic W to 1 or 2, in a dummy throw away project, and look at the schematic for a single output bit:

maps-mpls_1-1618544532630.png

And then increase your address bits by 1, and try again.  Then another bit, and so on.
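
One way to set up that throwaway experiment is a thin wrapper that instantiates the design with small generics. This is only a sketch: the entity name memory2_small and the generics K_SMALL and W_SMALL are made up for illustration, and it assumes the original memory2 from the question (generics K and W, ports clk, enA, enB, i, j, inA, inB, outA, outB) compiles in library work.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical throwaway top level: shrinks the question's memory2 so that
-- the post-synthesis schematic for one output bit stays small enough to read.
entity memory2_small is
generic (
  K_SMALL : integer := 3;  -- address bits: start at 3, then try 4, 5, ...
  W_SMALL : integer := 1   -- data bits: a single output bit is enough
);
port (
  clk        : in  std_logic;
  enA, enB   : in  std_logic;
  i, j       : in  std_logic_vector(K_SMALL-1 downto 0);
  inA, inB   : in  std_logic_vector(W_SMALL-1 downto 0);
  outA, outB : out std_logic_vector(W_SMALL-1 downto 0)
);
end entity memory2_small;

architecture wrapper of memory2_small is
begin
  -- Direct entity instantiation of the original design with reduced generics.
  u_mem : entity work.memory2
    generic map (K => K_SMALL, W => W_SMALL)
    port map (
      clk  => clk,
      enA  => enA,  enB  => enB,
      i    => i,    j    => j,
      inA  => inA,  inB  => inB,
      outA => outA, outB => outB
    );
end architecture wrapper;
```

Setting memory2_small as the top module and bumping K_SMALL by one between runs lets you watch the read mux and the register count grow with each added address bit.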

In any event, try to recast your ultimate problem.  Adding a register (or two) of pipelining to your reads can often be accommodated in a logic design, and will allow your HDL to be very efficiently mapped to BRAM.  You will just have to adjust whatever is consuming the read data.  Unless your next write or read depends on your current read, in which case pipelining might not be all that useful.
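
A minimal sketch of what "a register (or two) of pipelining" on the reads could look like, written in the same single-process style as the question's memory1. The entity name memory_pipelined and the signals rdA and rdB are hypothetical, and whether Vivado maps this to BRAM with the second register absorbed into the block is an assumption to confirm in the synthesis report.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;

-- Hypothetical example: synchronous read plus one extra output register,
-- giving a read latency of two clocks on each port.
entity memory_pipelined is
generic (
  G_ADDR_BITS : integer := 13;
  G_DATA_BITS : integer := 16
);
port (
  iCLK   : in  std_logic;
  iENA   : in  std_logic;
  iENB   : in  std_logic;
  iAddrA : in  std_logic_vector(G_ADDR_BITS-1 downto 0);
  iAddrB : in  std_logic_vector(G_ADDR_BITS-1 downto 0);
  iDataA : in  std_logic_vector(G_DATA_BITS-1 downto 0);
  iDataB : in  std_logic_vector(G_DATA_BITS-1 downto 0);
  oDataA : out std_logic_vector(G_DATA_BITS-1 downto 0);
  oDataB : out std_logic_vector(G_DATA_BITS-1 downto 0)
);
end entity memory_pipelined;

architecture sketch of memory_pipelined is
  type tMEM_ARRAY is array (0 to 2**G_ADDR_BITS-1) of std_logic_vector(G_DATA_BITS-1 downto 0);
  signal MEM      : tMEM_ARRAY;
  signal rdA, rdB : std_logic_vector(G_DATA_BITS-1 downto 0);
begin

  process(iCLK)
  begin
    if rising_edge(iCLK) then
      -- Port A: synchronous write, then stage-1 synchronous read
      if iENA = '1' then
        MEM(to_integer(unsigned(iAddrA))) <= iDataA;
      end if;
      rdA <= MEM(to_integer(unsigned(iAddrA)));

      -- Port B: same pattern
      if iENB = '1' then
        MEM(to_integer(unsigned(iAddrB))) <= iDataB;
      end if;
      rdB <= MEM(to_integer(unsigned(iAddrB)));

      -- Stage 2: extra pipeline register on the read data
      oDataA <= rdA;
      oDataB <= rdB;
    end if;
  end process;

end architecture sketch;
```

Whatever consumes oDataA and oDataB then has to expect the data two clocks after the address is presented, which is the adjustment mentioned above.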

*** Destination: Rapid design and development cycles *** Unappreciated answers get deleted, unappreciative OPs get put on ignored list ***

Navindaxon
Visitor
Registered: 06-30-2020

Thank you for your detailed response. It has helped me to visualize what the tools are trying to do, as I'm learning VHDL and FPGA design in general. I think, rather than deal with the lengthy synthesis process, I will simply add some control logic to account for the synchronous read, and leave it at that. In the meantime, I shall take your advice and create small scale versions of these circuits so that I can examine precisely what the system is doing.

maps-mpls
Mentor
Registered: 06-20-2017

>Thank you for your detailed response.

You are welcome, and good luck! 

P.S.  The comment about putting reset was a bit tongue-in-cheek.  The consequences are valid, but in general, you shouldn't reset things unless you really need them to be reset.  By default, they will be reset.  The issue with your memory2 is that the synthesizer recognizes it as an attempt to infer RAM.

Another possible solution, assuming you really need to have two ports to write to the memory, is to use the will_infer_distributed architecture and add another module in front of it to arbitrate writes to the RAM.  Indeed, if you did a distributed memory generator with an AXI interface, you could just use the AXI interconnect to do the multi-port arbitration.  But if you're doing this as a learning exercise, writing a two-port arbiter for the distributed memory you inferred would be a good one.
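
As a starting point for that exercise, here is a minimal sketch of a fixed-priority write arbiter. Everything in it is an assumption made for illustration: the entity name write_arbiter, the output names oWrEn, oWrAddr, oWrData and oStallB, and the policy itself (port A always wins, port B is asked to hold its request for a cycle). A real design might prefer round-robin or some other scheme.

```vhdl
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Hypothetical fixed-priority arbiter: merges two write requesters onto the
-- single write port of a memory such as will_infer_distributed.
entity write_arbiter is
generic (
  G_ADDR_BITS : integer := 13;
  G_DATA_BITS : integer := 16
);
port (
  iENA    : in  std_logic;
  iENB    : in  std_logic;
  iAddrA  : in  std_logic_vector(G_ADDR_BITS-1 downto 0);
  iAddrB  : in  std_logic_vector(G_ADDR_BITS-1 downto 0);
  iDataA  : in  std_logic_vector(G_DATA_BITS-1 downto 0);
  iDataB  : in  std_logic_vector(G_DATA_BITS-1 downto 0);
  oWrEn   : out std_logic;
  oWrAddr : out std_logic_vector(G_ADDR_BITS-1 downto 0);
  oWrData : out std_logic_vector(G_DATA_BITS-1 downto 0);
  oStallB : out std_logic  -- '1' when B lost arbitration and must retry
);
end entity write_arbiter;

architecture sketch of write_arbiter is
begin
  -- A write happens whenever either requester asks for one.
  oWrEn   <= iENA or iENB;
  -- Port A has priority; port B is only forwarded when A is idle.
  oWrAddr <= iAddrA when iENA = '1' else iAddrB;
  oWrData <= iDataA when iENA = '1' else iDataB;
  -- B must hold its request when both ask in the same cycle.
  oStallB <= iENA and iENB;
end architecture sketch;
```

Note that in will_infer_distributed the write address and the port A read address share iAddrA, so driving it from oWrAddr means the port A read sees the arbitrated address during a write; how to handle that is left as part of the exercise.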
