UPGRADE YOUR BROWSER

We have detected your current browser version is not the latest one. Xilinx.com uses the latest web technologies to bring you the best online experience possible. Please upgrade to a Xilinx.com supported browser:Chrome, Firefox, Internet Explorer 11, Safari. Thank you!

cancel
Showing results for 
Search instead for 
Did you mean: 
Voyager
Voyager
1,626 Views
Registered: ‎10-07-2011

Generic AddReduce function

Jump to solution

Hi folks,

 

Is anybody aware of a way to code a generic pipelined "AddReduce" function (or entity)?

 

I often have to add M values of N bits each. So far, I've been doing this the long way by cascading tailored adder IPs. For example, assuming M=5 and N=8:

 

Dessin1.png

 

So, on the above:

  . Level 1 adder IPs (+) are 8 bits in and 9 bits out

  . Level 2 adder IPs are 9 bits in and 10 bits out

  . Level 3 adder IPs are 10 bits in and 11 bits out; Result S is 11 bits

  . DelayLine depth is always matching the latency of the adder IP and register.

 

Of course, the required number of levels is depending on M. The configuration of the various adders is depending on N and on the number of levels needed (which is depending on M).

 

If either M or N is changed, the whole adder structure is broken.

 

Question: Is there a way to code such AddReduce function in a generic way, and also make it suitable for both signed and unsigned data (overloading)?

 

Thnaks for helping!

 

Claude

 

 

0 Kudos
1 Solution

Accepted Solutions
Voyager
Voyager
1,656 Views
Registered: ‎10-07-2011

Re: Generic AddReduce function

Jump to solution

Hello all,

 

I finally decided to move forward using the Xilinx Adder IP rather than the '+' operator. It's a bit annoying but it leads to optimal results, breaking long operands and avoiding long carry chains. Even though I didn't test it, I feel sysnthesis will implement the '+' operator has a huge fully combinational adder, while the Adder IP is implementing it as a pipeline.

 

To make things generic, I didn't generate the Adder IP from the IP Catalog. I'm rather instantiating by myself from the Xilinx c_addsub library. I only use the "Fabric" implementation (ie LUTs and FFs rather than DSPs). In this mode, the equation I provided for the latency is working. It is used to set the depth of the DelayLine blocks shown on the figure on the original post. The registers are not needed because they are embedded into the Adder IP.

 

All in all, this is working nicely and reliably so far. I should add that VHDL-2008 is needed as all STD_LOGIC_VECTOR (operands and result) are left unconstrained.

 

Claude

View solution in original post

0 Kudos
6 Replies
Scholar jmcclusk
Scholar
1,585 Views
Registered: ‎02-24-2014

Re: Generic AddReduce function

Jump to solution

I know how to do this using recursive VHDL  (that's the easy part).    The tricky part is the interface object definition.

 

Can you use VHDL-2008?    This allows a definition for an interface object like:

 

type unsigned_array is array(natural range <>) of unsigned;

entity AddReduce_unsigned is
port (
     A : in unsigned_array;
     clk : in std_logic;
     Q : out unsigned;
);
end entity;

if you can't use VHDL-2008,  then VHDL-93 requires a fall back to something weird and clunky, like a 2D array of std_logic;

 

Here is a solution, which synthesizes nicely using VHDL-2008 and Vivado 2018.2

 

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;
library xil_defaultlib;
use xil_defaultlib.Add_Reduce.all;

entity add_reduce_unsigned is
    Port ( clk : in STD_LOGIC;
           A : in unsigned_array;
           Q : out unsigned
           );
end add_reduce_unsigned;

architecture Behavioral of add_reduce_unsigned is

component add_reduce_unsigned is
    Port ( clk : in STD_LOGIC;
           A : in unsigned_array;
           Q : out unsigned
           );
end component;

constant C_word_len : natural := A(A'high)'length;

begin

gen: case A'length generate
    when 1 | 2 | 3 =>               -- ternary addition is same as binary for Series 7 and higher
        proc: process(clk) is
        begin
        if rising_edge(clk) then
            if A'length = 1 then
                Q <= resize( A(A'low),Q'length);
            elsif A'length = 2 then
                Q <= resize( A(A'high),Q'length) + resize( A(A'low),Q'length);
            else
                Q <= resize( A(A'high),Q'length) + resize( A(A'high - 1),Q'length) + resize(A(A'low),Q'length);
            end if;
        end if;
        end process;
    when 4 | 5 | 6 =>
        gen_1bit: if A(A'high)'length = 1 generate   -- special case, when we are just adding bits instead of words
            proc: process(clk) is
            begin
            if rising_edge(clk) then
                if A'length = 4 then
                    Q <= resize(A(A'high),Q'length)+resize(A(A'high-1),Q'length)+resize(A(A'high-2),Q'length)+resize(A(A'low),Q'length);
                elsif A'length = 5 then
                    Q <= resize(A(A'high),Q'length)+resize(A(A'high-1),Q'length)+resize(A(A'high-2),Q'length)+resize(A(A'high-3),Q'length)+resize(A(A'low),Q'length);
                else
                    Q <= resize(A(A'high),Q'length)+resize(A(A'high-1),Q'length)+resize(A(A'high-2),Q'length)+resize(A(A'high-3),Q'length)+resize(A(A'high-4),Q'length)+resize(A(A'low),Q'length);
                end if;
            end if;
            end process;        
        else generate
            signal tmp1, tmp2 : unsigned(Q'length-2 downto 0);
        begin
            g1: entity xil_defaultlib.add_reduce_unsigned
                port map( clk => clk,
                    A => A(A'high downto A'high - A'length/2 + 1),
                    Q => tmp1 );               
            g2: entity xil_defaultlib.add_reduce_unsigned
                port map( clk => clk,
                    A => A(A'high - A'length/2 downto A'low),
                    Q => tmp2 );
            proc: process(clk) is
            begin
            if rising_edge(clk) then
                Q <= resize(tmp1,Q'length) + resize(tmp2,Q'length);
            end if;
            end process; 
        end generate gen_1bit;                    
    when others =>
        signal tmp1, tmp2, tmp3 : unsigned(Q'length-2 downto 0);
        begin
            g1: entity xil_defaultlib.add_reduce_unsigned
                port map( clk => clk,
                    A => A(A'high downto A'high - A'length/3 + 1),
                    Q => tmp1 );               
            g2: entity xil_defaultlib.add_reduce_unsigned
                port map( clk => clk,
                    A => A(A'high - A'length/3 downto A'low + A'length/3),
                    Q => tmp2 );
            g3: entity xil_defaultlib.add_reduce_unsigned
                port map( clk => clk,
                    A => A(A'low + A'length/3 - 1 downto A'low),
                    Q => tmp3 );    
            proc: process(clk) is
            begin
            if rising_edge(clk) then
                Q <= resize(tmp1,Q'length) + resize(tmp2,Q'length) + resize(tmp3,Q'length);
            end if;
            end process;            
end generate gen;
end Behavioral;

 

 

And here is a simple example of using this component to sum 24 values of 32 bits to produce a 37 bit sum:

 

-- VHDL-2008 code.   MUST SET THIS FILE TO VHDL-2008 in Vivado
library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.NUMERIC_STD.ALL;
library xil_defaultlib;
use xil_defaultlib.Add_Reduce.all;

entity test_add is
    Port ( clk : in STD_LOGIC;
           A : in unsigned_array(23 downto 0)(31 downto 0);
           sum : out unsigned(36 downto 0));
end test_add;

architecture Behavioral of test_add is

signal B : A'subtype;

begin

proc: process(clk) is
begin
if rising_edge(clk) then
    B <= A;
end if;
end process;

g1: entity xil_defaultlib.add_reduce_unsigned
    port map( clk => clk,
        A => B,
        Q => sum );   
        
end Behavioral;

 

 

EDIT:  I reworked the code, because it was performing poorly with summing single bits.  There was also something weird about the process using a tmp variable, since I saw weird feedback loops in the netlist.    Now when summing 512 single bits into a 12 bit sum, it runs at 500 MHz in a -1 Ultrascale device (it might run faster, but 500 MHz seems fast enough).    Summing 24 inputs of 32 bits to a 37 bit sum runs at 500 MHz too!!    Xilinx has devoted a ton of effort at making the carry chains really fast, and this result demonstrates that.   This kind of speed simply isn't possible with the older 40 and 90 nm devices.  

Don't forget to close a thread when possible by accepting a post as a solution.
1,549 Views
Registered: ‎01-08-2012

Re: Generic AddReduce function

Jump to solution

Years ago I designed one of these that was optimised for large (i.e. in the hundreds) M and N = 1 (i.e. single bit).  This operation is also known as "population count" or "bit count".

 

The obvious RTL implementation using adders performed very badly after synthesis, as the tools were (and still are) incapable of seeing the big picture and transforming the whole problem into something that can be implemented efficiently.  I don't think this will change, BTW.

 

Instead, I structured my code as a Wallace tree, and it all worked very well.  (Wallace was building high performance multipliers, and the addition of the subproducts looks just like a popcount.)

 

Conclusion: If you are writing a truly generic AddReduce function, you must take the size of the thing into account and be prepared to use different architectures for different sizes.

 

EDIT: looking back through my notes, I can see that it was pretty fast:

Post-PAR Performance in virtex 6 (-1 speed grade)
512 inputs : 585 LUTs, 5.271 ns clock period

I did not record the number of pipeline stages used though.

Voyager
Voyager
1,536 Views
Registered: ‎10-07-2011

Re: Generic AddReduce function

Jump to solution

Hi @jmcclusk,

 

Many thanks for the code. And to answer your questions, yes, I can use VHDL-2008 (fortunately) and unconstrained arrays.

 

I understand the recursive nature of your code, breaking the data array until there is no more than 3 values in each segment. However, in the "when 1 | 2 | 3 =>" case, I see the following:

    variable tmp : unsigned(Q'range) := (others => '0')

    for i in A'range loop
        tmp := tmp + resize(A(i),Q'length);
    end loop;

    if rising_edge(clk) then
        Q <= tmp;
    end if;

It's not clear to me how the above loop is getting synthesized. I would expect an up-to-two levels fully combinatorial adder in front of register Q (the first level is optimized away because tmp is initialized to 0). If this is right, I'd be afraid about the overall delay associated to the carry chain over the two levels. For example, what if A is consisting of 3x 64-bit values? Is the maximum achievable clock rate going down with increasing bit width?

 

I was also thinking of using the '+' oprator and the "ieee.numeric_std" library but again, I'm afraid of the overall carry chain delay in the case of large "Q'length". Do you have any opinion on this carry chain delay thing?

 

I was also hoping to end-up with something that would work for both SIGNED and UNSIGNED data. Hence, I was targetting to have that performed in some function to enable overloading.

 

Finally, I thought I could use the Xilinx $INSTALL/data/ip/Xilinx/c_addsub_v___ library (that is the library that is being used when an Adder/Subtractor IP is created from the IP Catalog) to benefit from the Xilinx adder IP (that is fully pipelined and optimized), but there is no way for me to know (predict) what the IP latency is going to be once it is configured with the proper parameters. Hence, no way to adjust the depth of the DelayLines (see figure in the original post) when they're needed.

 

Anyhow, I will give your stuff a try a come back here if successful.

 

Thanks for helping!

 

Claude

0 Kudos
Voyager
Voyager
1,533 Views
Registered: ‎10-07-2011

Re: Generic AddReduce function

Jump to solution

Hi @allanherriman,

 

Thanks for your comments. Never heard of Wallace Tree before. I had a look at it this morning but it's still not clear to me how this could help. I'm probably missing something.

 

Initially, I was really looking at coding the structure shown in the figure (see original post). This is easy to make that structure generic. What is hard is to make efficient adders (ie adders that will still work when clocked at "high" frequencies). Adders must be efficiently pipelined and the number of pipeline stages (ie latency) will vary depending on M and N.

 

When Xilinx Adder/Subtractor IP is created from the IP catalog, the latency can be set by the user. However, there is a minimal value that is varying based on the configuration parameters. This makes the IP unusable for me because I have no way to know of the minimal latency to adjust the depth of the DelayLines (see figure).

 

So, I'm left with either:

1. Use the '+' operator which I guess will result in a fully combinatorial adder, with fmax going down with increasing data bit width.

2. Create my own basic adder block that would break data into chunks (eg 32 bit data into 4x 8-bit) and pipeline the process. Being my stuff, I could have an equation in the instantiating file to set the depth of the DelayLines.

 

But at the end, I'm really not sure either of these paths will make me successful reaching my goal to end-up with something overloadable for both SIGNED and UNSIGNED data.

 

Any comment?

 

Cheers,

 

Claude

0 Kudos
Scholar jmcclusk
Scholar
1,521 Views
Registered: ‎02-24-2014

Re: Generic AddReduce function

Jump to solution

Hi @allanherriman and @chevalier ,   I've reworked the code to address bit compression (now it will do 6:3 compressors) when given an array of single bits.   It's not completely optimal, since there are unused carry inputs on the carry adder logic which could absorb additional single inputs.    It does however nicely do ternary tree addition, and the speed is decent with reasonably sized operands (32 bit ternary adders run at 500 MHz in -1 speed grade Ultrascale).   

 

EDIT:   Tried adding an array of 512 registers, each 32 bits long, producing a 43 bit sum.   This definitely won't run at 500 MHz, but will run slightly over 400 MHz in a -1 Ultrascale (A XCVU080 device).    This still seems very fast to me, actually.    Now that I think about it,   this could be used with some tinkering to generate multiplier-free filters very easily.  

Don't forget to close a thread when possible by accepting a post as a solution.
0 Kudos
Voyager
Voyager
1,657 Views
Registered: ‎10-07-2011

Re: Generic AddReduce function

Jump to solution

Hello all,

 

I finally decided to move forward using the Xilinx Adder IP rather than the '+' operator. It's a bit annoying but it leads to optimal results, breaking long operands and avoiding long carry chains. Even though I didn't test it, I feel sysnthesis will implement the '+' operator has a huge fully combinational adder, while the Adder IP is implementing it as a pipeline.

 

To make things generic, I didn't generate the Adder IP from the IP Catalog. I'm rather instantiating by myself from the Xilinx c_addsub library. I only use the "Fabric" implementation (ie LUTs and FFs rather than DSPs). In this mode, the equation I provided for the latency is working. It is used to set the depth of the DelayLine blocks shown on the figure on the original post. The registers are not needed because they are embedded into the Adder IP.

 

All in all, this is working nicely and reliably so far. I should add that VHDL-2008 is needed as all STD_LOGIC_VECTOR (operands and result) are left unconstrained.

 

Claude

View solution in original post

0 Kudos