edgar_lemaire
Observer
800 Views
Registered: ‎02-28-2019

VHDL for-loop in Vivado: How is it inferred? (strange resource utilization)


Hello,

I am implementing an adder tree in VHDL for hardware synthesis using Vivado. The goal is for all the additions within a stage to be performed in parallel. I have implemented my adder tree with successive for loops in a clocked process. The following code is a similar example with two adder stages:

 

type sum_stage1_type is array (0 to 10-1) of std_logic_vector(7 downto 0);
type sum_stage2_type is array (0 to 5-1) of std_logic_vector(7 downto 0);

signal sum_stage1 : sum_stage1_type;
signal sum_stage2 : sum_stage2_type;

	
[...]
	process(i_clk, i_rst)
	
	begin
		if(rising_edge(i_clk)) then
			if(i_rst='1') then
				
				sum_stage1 <= (others => (others => '0'));
				sum_stage2 <= (others => (others => '0'));

			else
				for i in 0 to 10-1 loop
					sum_stage1(i) <= w(i*2) + w(i*2+1);
				end loop;
				for i in 0 to 5-1 loop
					sum_stage2(i) <= sum_stage1(i*2) + sum_stage1(i*2+1);
				end loop;
			end if;
		end if;
	end process;

 

This code is functional (tested with ModelSim) and passes Vivado synthesis, but the resource utilization looks strange. When I increase the number of stages and the number of adders per stage, the resource utilization hardly changes. I am thus wondering whether the adder tree is correctly inferred, with all adders operating in parallel. It seems plausible to me that Vivado inferred it with a "time multiplexing" approach in order to save resources.

What do you think about this?

 

Thanks for your time,

 

Edgar

 

1 Solution

Accepted Solutions
730 Views
Registered: ‎01-22-2015

@edgar_lemaire 

...something strange regarding the resource utilization... I am thus wondering if the adder-tree is correctly inferred, with all adders operating in parallel..... it seems plausible that Vivado inferred it with a "time multiplexing" philosophy in order to save resources.

You've omitted a few details in your example VHDL.  So, I tried to add them as shown below.

    type sum_stage1_type is array (0 to 10-1) of unsigned(7 downto 0);
    type sum_stage2_type is array (0 to 5-1) of unsigned(7 downto 0);
    type w_type is array (0 to 20-1) of unsigned(7 downto 0);
    signal sum_stage1 : sum_stage1_type;
    signal sum_stage2 : sum_stage2_type;
    signal w : w_type;
     .....
	process(i_clk, i_rst)
	begin
		if(rising_edge(i_clk)) then
			if(i_rst='1') then		
				sum_stage1 <= (others => (others => '0'));
				sum_stage2 <= (others => (others => '0'));
			else
				for i in 0 to 10-1 loop
					sum_stage1(i) <= w(i*2) + w(i*2+1);
				end loop;
				for i in 0 to 5-1 loop
					sum_stage2(i) <= sum_stage1(i*2) + sum_stage1(i*2+1);
				end loop;
			end if;
		end if;
	end process;

 

This is pretty straightforward VHDL, so I doubt that synthesis will have any trouble with it.  As @maps-mpls says, Vivado synthesis can sometimes do amazing optimization.  However, these optimizations can be hard for us to understand.

Anyway, your VHDL specifies that everything inside the process must complete in one clock cycle.  So, there is not much room for the time-multiplexing you mention.  Things happen as follows:

  1. After reset-release, the 1st execution of the process will use 10 adders to calculate the values of sum_stage1(i).  Also, 5 adders are used to calculate values of sum_stage2(i), but these values will all be 0 because the reset values (0) of sum_stage1(i) are being used.
  2. After reset-release, the 2nd execution of the process will use the values of sum_stage1(i) from the 1st execution of the process to calculate non-zero values for sum_stage2(i).

So, there is a latency of two clock cycles associated with the calculation of sum_stage2(i).
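
The latency described above can be checked in simulation. The following is a minimal sketch, not the code from the original project: the package name adder_tree_pkg, the entity names adder_tree and adder_tree_tb, and the port names i_w and o_sum are assumptions made for illustration, and the reset is treated as synchronous (so only i_clk is in the sensitivity list). All inputs are held at 1, so each stage-2 sum should read 0 after the first clock edge following reset release and (1+1)+(1+1) = 4 after the second.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical package holding the array types used on the ports.
package adder_tree_pkg is
  type w_type          is array (0 to 20-1) of unsigned(7 downto 0);
  type sum_stage2_type is array (0 to 5-1)  of unsigned(7 downto 0);
end package adder_tree_pkg;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.adder_tree_pkg.all;

entity adder_tree is
  port (
    i_clk : in  std_logic;
    i_rst : in  std_logic;
    i_w   : in  w_type;
    o_sum : out sum_stage2_type
  );
end entity adder_tree;

architecture rtl of adder_tree is
  type sum_stage1_type is array (0 to 10-1) of unsigned(7 downto 0);
  signal sum_stage1 : sum_stage1_type;
  signal sum_stage2 : sum_stage2_type;
begin
  process (i_clk)
  begin
    if rising_edge(i_clk) then
      if i_rst = '1' then
        sum_stage1 <= (others => (others => '0'));
        sum_stage2 <= (others => (others => '0'));
      else
        for i in sum_stage1'range loop
          sum_stage1(i) <= i_w(i*2) + i_w(i*2+1);
        end loop;
        for i in sum_stage2'range loop
          -- Reads the value sum_stage1 held before this clock edge.
          sum_stage2(i) <= sum_stage1(i*2) + sum_stage1(i*2+1);
        end loop;
      end if;
    end if;
  end process;
  o_sum <= sum_stage2;
end architecture rtl;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.adder_tree_pkg.all;

entity adder_tree_tb is
end entity adder_tree_tb;

architecture sim of adder_tree_tb is
  signal clk  : std_logic := '0';
  signal rst  : std_logic := '1';
  signal done : boolean   := false;
  signal w    : w_type    := (others => to_unsigned(1, 8));  -- every input is 1
  signal s    : sum_stage2_type;
begin
  clk <= not clk after 5 ns when not done else '0';  -- clock stops when done

  dut : entity work.adder_tree
    port map (i_clk => clk, i_rst => rst, i_w => w, o_sum => s);

  check : process
  begin
    wait until rising_edge(clk);  -- reset is sampled on this edge
    rst <= '0';
    wait until rising_edge(clk);  -- 1st edge after reset release
    wait for 1 ns;
    assert s(0) = 0 report "stage 2 should still be 0" severity failure;
    wait until rising_edge(clk);  -- 2nd edge after reset release
    wait for 1 ns;
    assert s(0) = 4 report "stage 2 should be (1+1)+(1+1)" severity failure;
    report "latency check passed";
    done <= true;
    wait;
  end process check;
end architecture sim;
```

Because both loops sit in the same clocked process, every signal assignment takes effect at the clock edge, which is why the second loop sees the previous cycle's sum_stage1 and the two stages behave as a pipeline.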

Vivado RTL ANALYSIS (Open Elaborated Design) gives you a nice schematic of what's happening - as shown below.

[Image: sum_stage_RTL.jpg]

In summary, have faith that Vivado is doing synthesis and implementation correctly for your VHDL, and try not to worry about the optimization that is occurring.

Cheers,
Mark


7 Replies
maps-mpls
Mentor
760 Views
Registered: ‎06-20-2017

The synthesizer may be making optimizations you did not realize were possible.  (Synthesizers are good at this.)  Does your post-synthesis netlist produce the correct results in simulation?

*** Destination: Rapid design and development cycles ***

edgar_lemaire
Observer
684 Views
Registered: ‎02-28-2019

Hello @maps-mpls ,

Indeed, the post-synthesis simulation gives correct results. I have viewed the adder-tree signals, and I don't think there is any time-multiplexing involved after all. However, it is quite difficult to visualize the signals, as Vivado splits them into various sub-signals. Is there a directive I can put in the VHDL code so that Vivado keeps the "full signals" for later visualization in simulation? Something like the "set debug" used to probe the signals in hardware.

Thanks for your help,

 

Edgar
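
On the question above about keeping signals visible after synthesis: the thread does not give an answer, but one option worth trying is Vivado's string-valued synthesis attributes, which can be set directly in VHDL. The sketch below is only an illustration under assumptions: the entity keep_signals_demo and its ports are hypothetical, dont_touch asks synthesis not to optimize away or rename the signal, and mark_debug flags it for hardware probing. Whether this gives the grouped bus view wanted in a post-synthesis simulation is not confirmed here, since the netlist itself remains bit-level.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical one-adder example; only the attribute lines matter.
entity keep_signals_demo is
  port (
    i_clk : in  std_logic;
    i_a   : in  unsigned(7 downto 0);
    i_b   : in  unsigned(7 downto 0);
    o_sum : out unsigned(7 downto 0)
  );
end entity keep_signals_demo;

architecture rtl of keep_signals_demo is
  signal sum_reg : unsigned(7 downto 0);

  -- Ask Vivado synthesis to preserve this signal in the netlist.
  attribute dont_touch : string;
  attribute dont_touch of sum_reg : signal is "true";

  -- Flag the signal for debug probing in hardware.
  attribute mark_debug : string;
  attribute mark_debug of sum_reg : signal is "true";
begin
  process (i_clk)
  begin
    if rising_edge(i_clk) then
      sum_reg <= i_a + i_b;
    end if;
  end process;

  o_sum <= sum_reg;
end architecture rtl;
```

Preserving signals this way blocks some optimizations, so it is probably best limited to the few signals that need inspecting.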

drjohnsmith
Teacher
678 Views
Registered: ‎07-09-2009

Can I suggest that the VHDL way would be to use a for ... generate statement, not a C-like for loop:

https://www.ics.uci.edu/~jmoorkan/vhdlref/generate.html

As said above, this should be fairly simple code for the tools to understand, but the for loop might not be generating what you intended. For instance, in FPGAs "registers are free", and there are also special DSP units that are great at big adders.
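
For comparison, here is a rough sketch of what the for ... generate style suggested above could look like for the same two stages. It is an illustration under assumptions rather than anyone's actual code: the entity name adder_tree_gen is made up, and the inputs and outputs are packed into flat unsigned vectors (i_w, o_sum) only so that the example needs no separate package. Each generate iteration creates its own small clocked process driving one array element.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity adder_tree_gen is
  port (
    i_clk : in  std_logic;
    i_rst : in  std_logic;
    i_w   : in  unsigned(20*8-1 downto 0);  -- 20 packed 8-bit inputs
    o_sum : out unsigned(5*8-1 downto 0)    -- 5 packed 8-bit sums
  );
end entity adder_tree_gen;

architecture rtl of adder_tree_gen is
  type sum_stage1_type is array (0 to 10-1) of unsigned(7 downto 0);
  type sum_stage2_type is array (0 to 5-1)  of unsigned(7 downto 0);
  signal sum_stage1 : sum_stage1_type;
  signal sum_stage2 : sum_stage2_type;
begin
  -- Stage 1: one registered adder per pair of inputs.
  gen_stage1 : for i in 0 to 10-1 generate
    process (i_clk)
    begin
      if rising_edge(i_clk) then
        if i_rst = '1' then
          sum_stage1(i) <= (others => '0');
        else
          sum_stage1(i) <= i_w((2*i+1)*8-1 downto 2*i*8)
                         + i_w((2*i+2)*8-1 downto (2*i+1)*8);
        end if;
      end if;
    end process;
  end generate gen_stage1;

  -- Stage 2: one registered adder per pair of stage-1 sums.
  gen_stage2 : for i in 0 to 5-1 generate
    process (i_clk)
    begin
      if rising_edge(i_clk) then
        if i_rst = '1' then
          sum_stage2(i) <= (others => '0');
        else
          sum_stage2(i) <= sum_stage1(2*i) + sum_stage1(2*i+1);
        end if;
      end if;
    end process;

    -- Pack each registered sum into the flat output vector.
    o_sum((i+1)*8-1 downto i*8) <= sum_stage2(i);
  end generate gen_stage2;
end architecture rtl;
```

Both styles describe 10 + 5 registered adders, since a for loop with constant bounds in a clocked process is unrolled by synthesis, so the choice between them is mostly a matter of readability.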

<== If this was helpful, please feel free to give Kudos, and close if it answers your question ==>
edgar_lemaire
Observer
674 Views
Registered: ‎02-28-2019

Hello markg@prosensing.com and thank you very much for such a detailed response.
Indeed, your schematic is much easier to interpret than mine (which includes much more logic from the rest of my architecture), and here I can see that the adder tree is correctly inferred.

As I said in my previous comment, I performed a post-synthesis simulation which shows no sign of time-multiplexing, so I think everything is okay with how Vivado infers the IP.

Thank you again for your time and very enlightening answer.

 

Regards,

 

Edgar

edgar_lemaire
Observer
666 Views
Registered: ‎02-28-2019

Hello @drjohnsmith, thanks for your reply.
Indeed, it seems a good way to implement the adder tree. I will try this solution and report my findings on the difference between my initial way of coding the adder tree and your suggestion.
Concerning DSPs, I don't think it would be feasible, as I will have a very large number of 8-bit adders in my architecture (something like 10 000), so I guess there would not be enough DSPs on the board.

 Thanks a lot for your help,

 

Regards,

 

Edgar

drjohnsmith
Teacher
637 Views
Registered: ‎07-09-2009

@Edgar_lemaire2 

 

I'm not even certain there will be enough logic in an FPGA to implement 80 000 adders.

Do you really need to do all the additions on every clock, or could you time multiplex the work,

so use 1000 adders to do 10 000 additions?
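
As a rough illustration of the time-multiplexing idea, the sketch below reuses a single adder to sum a group of values over several clock cycles instead of instantiating one adder per pair. Everything here is assumed for the example: the entity name serial_accumulator, the generics G_COUNT and G_WIDTH, and the i_valid / o_valid handshake are hypothetical and not taken from the thread.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity serial_accumulator is
  generic (
    G_COUNT : positive := 10;  -- hypothetical: values summed per result
    G_WIDTH : positive := 12   -- 8 bits + ceil(log2(10)) = 12 bits
  );
  port (
    i_clk   : in  std_logic;
    i_rst   : in  std_logic;
    i_valid : in  std_logic;                       -- i_data is valid this cycle
    i_data  : in  unsigned(7 downto 0);
    o_sum   : out unsigned(G_WIDTH-1 downto 0);
    o_valid : out std_logic                        -- pulses when o_sum is updated
  );
end entity serial_accumulator;

architecture rtl of serial_accumulator is
  signal acc   : unsigned(G_WIDTH-1 downto 0);
  signal count : natural range 0 to G_COUNT-1;
begin
  process (i_clk)
  begin
    if rising_edge(i_clk) then
      o_valid <= '0';  -- default, overridden when a result is ready
      if i_rst = '1' then
        acc   <= (others => '0');
        count <= 0;
      elsif i_valid = '1' then
        if count = G_COUNT-1 then
          -- Last value of the group: publish the total and start over.
          o_sum   <= acc + i_data;
          o_valid <= '1';
          acc     <= (others => '0');
          count   <= 0;
        else
          -- The same adder is reused on every cycle.
          acc   <= acc + i_data;
          count <= count + 1;
        end if;
      end if;
    end if;
  end process;
end architecture rtl;
```

The trade-off is throughput: with the defaults, one result is produced every 10 valid input cycles, in exchange for one adder instead of a whole tree.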

    

Just to highlight: add two 8-bit numbers and you get a 9-bit result,

add 10 000 8-bit numbers and you get a really big result.
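
The bit-growth point can be made concrete. With numeric_std, "+" on two 8-bit unsigned operands returns an 8-bit result, so the carry is silently dropped unless the operands are widened first. Below is a minimal sketch, with a made-up entity name add_full_width, that keeps the full 9-bit sum. More generally, summing N unsigned values of W bits needs W + ceil(log2(N)) bits, so 10 000 8-bit values need 8 + 14 = 22 bits.

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add_full_width is
  port (
    i_a   : in  unsigned(7 downto 0);
    i_b   : in  unsigned(7 downto 0);
    o_sum : out unsigned(8 downto 0)  -- 9 bits: 255 + 255 = 510 fits
  );
end entity add_full_width;

architecture rtl of add_full_width is
begin
  -- Widen both operands to 9 bits so the carry out is kept.
  o_sum <= resize(i_a, 9) + resize(i_b, 9);
end architecture rtl;
```

In an adder tree this means each stage would be one bit wider than the one before it, unless overflow is known to be impossible for the data being summed.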

 

Also note, regarding size tests:

be very aware that the tools synthesize your code and implement it to your timing constraints, and provided the design fits and meets them, they STOP.

The major delay in an FPGA is the routing, so as designs get bigger, if you need to talk across the chip, the timings get worse.

The answer to the above is to pipeline; remember registers are "free" in an FPGA.

If you are using only a small amount of logic in a chip, the way it's implemented by the tools could be very different from how it's implemented for a lot of logic.

Many have been caught out by this in the past.

If you want a BIG design, you have to design the system to scale well.

 
