UPGRADE YOUR BROWSER

We have detected your current browser version is not the latest one. Xilinx.com uses the latest web technologies to bring you the best online experience possible. Please upgrade to a Xilinx.com supported browser:Chrome, Firefox, Internet Explorer 11, Safari. Thank you!

cancel
Showing results for 
Search instead for 
Did you mean: 
Highlighted
Explorer
Explorer
1,817 Views
Registered: ‎11-29-2015

What is the most space-efficient count_ones function?

Given that the requirement is to count ones in one clock tick, for a std_logic_vector of any size, what function will use the least amount of LUTS?

I just know of the method below, but I'm reading that it performs poorly in terms of area vs other designs, but the other designs all have a fixed input bit width.

function count_ones(s : std_logic_vector) return integer is
      variable temp : natural := 0;
    begin
      for i in s'range loop
        if s(i) = '1' then temp := temp + 1; 
        end if;
      end loop;
      
      return temp;
    end function count_ones;
0 Kudos
27 Replies
Historian
Historian
1,798 Views
Registered: ‎01-23-2009

Re: What is the most space-efficient count_ones function?

While there are some exceptions, in general the RTL coding style you use to describe combinatorial logic does not affect the number of LUTs required for the function. Synthesis is very good at doing Boolean minimization, so regardless of what form the logic starts at (after elaboration) synthesis will likely find a similar result (in terms of usage) after it has completed its minimization.

Avrum

0 Kudos
Explorer
Explorer
1,781 Views
Registered: ‎11-29-2015

Re: What is the most space-efficient count_ones function?

I have a 300bit vector which somehow results in 4500 adders... Isn't there a more efficient way to count 300 bits?

Module census_vector 
Detailed RTL Component Info : 
+---Adders : 
	   2 Input      9 Bit       Adders := 4500  
+---XORs : 
	   2 Input    300 Bit         XORs := 30    
+---Registers : 
	               11 Bit    Registers := 2     
	                5 Bit    Registers := 1     
	                1 Bit    Registers := 2     
+---Muxes : 
	   2 Input     11 Bit        Muxes := 1     
	   2 Input      9 Bit        Muxes := 41    
	   2 Input      5 Bit        Muxes := 20    
	   2 Input      4 Bit        Muxes := 12    
	   2 Input      3 Bit        Muxes := 6     
	   2 Input      2 Bit        Muxes := 3     
	   2 Input      1 Bit        Muxes := 2 
0 Kudos
Explorer
Explorer
1,754 Views
Registered: ‎11-29-2015

Re: What is the most space-efficient count_ones function?

I found the code below which was reported to be very space efficient but it I don't quite follow the logic enough to expand it out to 300 bits without coping out and just adding 10 them together. How would you modify this function to accept a 300 bit input?

function count_ones_32_bit(s : std_logic_vector) return integer is
    
        variable temp  : unsigned (31 downto 0);
        variable temp1 : unsigned (31 downto 0);
        variable temp2 : unsigned (31 downto 0);
        variable temp3 : unsigned (31 downto 0);
        variable temp4 : unsigned (31 downto 0);
        variable temp5 : unsigned (31 downto 0);

    begin
        temp  := unsigned(s);
        temp1 := (temp  AND X"5555_5555") + ( ( temp  srl 1)  AND X"5555_5555");  -- 0..2 out  x16
        temp2 := (temp1 AND X"3333_3333") + ( ( temp1 srl 2)  AND X"3333_3333");  -- 0..4 out  x8
        temp3 := (temp2 AND X"0707_0707") + ( ( temp2 srl 4)  AND X"0707_0707");  -- 0..8 out  x4
        temp4 := (temp3 AND X"001f_001f") + ( ( temp3 srl 8)  AND X"001f_001f");  -- 0..16 out x2
        temp5 := (temp4 AND X"0000_003f") + ( ( temp4 srl 16) AND X"0000_003f");  -- 0..32 out

        return to_integer(temp5(5 downto 0));
    end count_ones_32_bit;
0 Kudos
Scholar brimdavis
Scholar
1,722 Views
Registered: ‎04-26-2012

Re: What is the most space-efficient count_ones function?

@david12341234   "How would you modify this function to accept a 300 bit input?"

I have always liked the "shift, mask, and add" approach because it is simple yet produces near-optimal results with most synthesizers - that said, I haven't checked the results in Vivado.

Below is a version using an entity with unconstrained ports that works to 512 bits ( limited only by the internal mask table size, see FIXME note in code).

This code hasn't been simulated, so it may have stupid errors...

The input port lengths are sanity checked using assertions, so be sure you turn on synthesis assertions in Vivado.

--
-- <bit_count.vhd>
--

---------------------------------------------------------------
--
-- (C) COPYRIGHT 2018  Brian Davis
--
-- Code released under the terms of the MIT license
-- https://opensource.org/licenses/MIT
--
---------------------------------------------------------------

--
-- bit counting 
--
--   - uses classical software shift-and-mask addition technique to count bits
--     XST optimizes this into a cascade of small adders
--
--   - unconstrained input vector supported by using a loop with a signal array
--       - input is currently limited to 512 bits by the mask array size
--

library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;
  use ieee.math_real.all;


entity bit_count is
  port
    (   
      din    : in  std_logic_vector;
      count  : out std_logic_vector
    );

end bit_count;


architecture looped_array of bit_count is

  --
  -- FIXME: currently using a constant array of masks good to 512 bits
  --        should just create these on the fly with a function call in the loop
  --
  -- note mask indexing starts at one to match loop range
  --
  type t_mask_array is array(1 to 9) of unsigned(511 downto 0);

  constant mask : t_mask_array :=
    (
      X"5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555",
      X"3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333",
      X"0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707",
      X"000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f",
      X"0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f",
      X"0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f",
      X"0000_0000_0000_0000_0000_0000_0000_007f_0000_0000_0000_0000_0000_0000_0000_007f_0000_0000_0000_0000_0000_0000_0000_007f_0000_0000_0000_0000_0000_0000_0000_007f",
      X"0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_00ff_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_00ff",
      X"0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_01ff_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_01ff"
    );

  --
  -- number of adder stages needed to cover the input vector length
  --
  constant STAGES : natural := natural( ceil( log2( real(din'length) ) ));

  --
  -- note there are STAGES+1 entries in array
  --
  type t_sum_array is array(0 to STAGES) of unsigned(din'range);

  signal sums : t_sum_array;

begin

  --
  -- port size sanity check 
  -- enable synthesis assertions when using Vivado
  --
  assert din'length <= mask(1)'length
    report   "Input vector size exceeds internal mask array size"
    severity ERROR;

  assert count'length >= STAGES
    report   "Output count vector size is too small to hold result"
    severity ERROR;

  --
  -- count bits by masking and adding adjacent bit groups
  --
  process(din,sums)
    begin

      --
      -- initialize index zero of the sum array to the data input
      --
      sums(0) <= unsigned(din);

      --
      -- stage loop
      --   note that the loop index starts at 1
      --
      stage_loop: for i in 1 to STAGES loop

        sums(i) <=   (   sums(i-1)                AND mask(i)(din'length-1 downto 0) ) 
                   + ( ( sums(i-1) srl 2**(i-1) ) AND mask(i)(din'length-1 downto 0) );

      end loop;

      count <= std_logic_vector( sums(STAGES)( count'length-1 downto 0 ) );

    end process;

end looped_array;

-Brian

0 Kudos
Explorer
Explorer
1,708 Views
Registered: ‎11-29-2015

Re: What is the most space-efficient count_ones function?

@brimdavis

I created this function based on your entity. It is hacky in that it only works with a 512 bit input. It would be nice to improve this to accept 512 bits or less, or even to accept any vector length.

function count_512_ones(din : std_logic_vector) return integer is
        --
        -- FIXME: currently using a constant array of masks good to 512 bits
        --        should just create these on the fly with a function call in the loop
        --
        -- note mask indexing starts at one to match loop range
        --
        type t_mask_array is array(1 to 9) of unsigned(511 downto 0);

        constant mask : t_mask_array :=
            (
            X"5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555",
            X"3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333",
            X"0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707",
            X"000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f",
            X"0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f",
            X"0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f",
            X"0000_0000_0000_0000_0000_0000_0000_007f_0000_0000_0000_0000_0000_0000_0000_007f_0000_0000_0000_0000_0000_0000_0000_007f_0000_0000_0000_0000_0000_0000_0000_007f",
            X"0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_00ff_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_00ff",
            X"0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_01ff_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_01ff"
            );

        --
        -- number of adder stages needed to cover the input vector length
        --
        -- constant STAGES : natural := natural( ceil( log2( real(din'length) ) ));
        constant STAGES : natural := 9;

        --
        -- note there are STAGES+1 entries in array
        --
        type t_sum_array is array(0 to STAGES) of unsigned(511 downto 0);

        variable sums : t_sum_array;
    begin
        --
        -- initialize index zero of the sum array to the data input
        --
        sums(0) := unsigned(din);

        --
        -- stage loop
        --   note that the loop index starts at 1
        --
        stage_loop: for i in 1 to STAGES loop
            sums(i) := (sums(i-1) AND mask(i)(din'length-1 downto 0)) + ((sums(i-1) srl 2**(i-1)) AND mask(i)(din'length-1 downto 0) );
        end loop;

        return to_integer(sums(STAGES)(511 downto 0));
    end count_512_ones;
0 Kudos
Scholar brimdavis
Scholar
1,704 Views
Registered: ‎04-26-2012

Re: What is the most space-efficient count_ones function?

@david12341234   "It is hacky in that it only works with a 512 bit input. It would be nice to improve this to accept 512 bits or less, or even to accept any vector length."

I haven't reviewed your code, but my original example sized itself to an unconstrained std_logic_vector input port, and would work with any vector length of 512 bits or less.

EDIT: if you restore the original log2 code for STAGES, and change your function's last line to something like the following, I think it should work for any input vector up to 512 bits:

return to_integer(sums(STAGES)(STAGES-1 downto 0));

( I would also note that in practice to meet timing constraints, deep adder trees benefit from a pipeline register in each stage; this is easy to do in my entity version, but adds STAGES clock cycles of latency. )

Also FYI, an elegant recursive bit counting function was posted by @josyb  on the late Programmable Planet forums:

-- countbits.vhd
-- 07/12/2012 jb
-- recursive counting of bits in a std_logic_vector (of arbitrary length)

    function count_bits( s : std_logic_vector ) return natural is
        begin
            if (s'length = 1) then
                if ( s = "1") then
                    return 1 ;
                else
                    return 0 ;
                end if ;
            else
                return    count_bits( s(s'high 
                                                 downto s'low + s'length / 2) ) 
                         + count_bits( s(s'low + s'length / 2 - 1 
                                                 downto s'low ) ) ;
            end if ;		
        end function ;

I haven't tried the recursive version in Vivado, but sometimes synthesis tools have problems with, or get excruciatingly slow, upon encoutering deep recursion.

 -Brian

0 Kudos
Observer josyb
Observer
1,694 Views
Registered: ‎05-13-2015

Re: What is the most space-efficient count_ones function?

Brian,

thanks for the mention. Do you have more of that 'All Programmable Planet' stuff? That site just disappeared from the web without a trace ...

You may want to look at this page: https://github.com/josyb/recurse

It is about recursion in MyHDL but most of it applies to VHDL too.

It also shows that the recursive implementation will have the least levels of logic. 

 

Regards,

Josy

 

Scholar brimdavis
Scholar
1,685 Views
Registered: ‎04-26-2012

Re: What is the most space-efficient count_ones function?

@josyb "Do you have more of that 'All Programmable Planet' stuff?"

Sadly, I only have copies of stuff from the threads that I'd posted to.

"It also shows that the recursive implementation will have the least levels of logic. "

The looped shift-and-mask code that I posted above relies on the synthesis tool to optimize out masked-off bits of the carry chain, so it ends up generating the same tree of small-ish adders as does your recursive version.

It's not as elegant as the recursive version, but if I replace the hardcoded mask array with a function to create the masks on the fly, I think it would at least get an honorable mention :)

"You may want to look at this page: https://github.com/josyb/recurse"

Thanks for the link! I've been meaning to experiment with MyHDL.

-Brian

 

0 Kudos
Scholar brimdavis
Scholar
1,674 Views
Registered: ‎04-26-2012

Re: What is the most space-efficient count_ones function?

Earlier, I wrote "if I replace the hardcoded mask array with a function to create the masks on the fly"

Here's a version of the shift-and-mask code (entity version) without the maximum-of-512-bit input vector length, also fixed a bug in the assertion check:

EDIT: bogus syntax check errors (red underlines) in the Vivado text editor regarding math.real functions can be safely ignored, as Vivado Synthesis supports elaboration time computations using the math.real functions.

--
-- <bit_count.vhd>
--

---------------------------------------------------------------
--
-- (C) COPYRIGHT 2018  Brian Davis
--
-- Code released under the terms of the MIT license
-- https://opensource.org/licenses/MIT
--
---------------------------------------------------------------

--
-- bit counting 
--
--   - uses classical software shift-and-mask addition technique to count bits
--     XST optimizes this into a tree of small adders
--

library ieee;
  use ieee.std_logic_1164.all;
  use ieee.numeric_std.all;
  use ieee.math_real.all;


entity bit_count is
  port
    (   
      din    : in  std_logic_vector;
      count  : out std_logic_vector
    );

end bit_count;


architecture looped_function of bit_count is

  constant SIZE : natural := din'LENGTH;
  
  --
  -- produce bit mask arrays to add consecutive groups of bits, e.g.
  --
  -- STAGE  MASK
  --   1    X"5555_5555"
  --   2    X"3333_3333"
  --   3    X"0707_0707"
  --   4    X"000f_000f"
  --
  function bit_mask( stage : natural ) return unsigned is
    variable mask : unsigned(SIZE-1 downto 0);

    constant STRIDE : natural := 2**(stage);
    variable i : natural := 0;

  begin

     mask := ( others => '0');

     while i < SIZE loop

       for j in 0 to stage-1 loop                        -- EDIT: fixed loop index 
         if i+j < SIZE then mask(i+j) := '1';  end if;   -- EDIT: fixed non-power-of-two bug
       end loop;

       i := i + STRIDE;

     end loop;

     return mask;
  end;


  --
  -- number of adder stages needed to cover the input vector length
  --
  constant STAGES : natural := natural( ceil( log2( real(SIZE) ) ));

  --
  -- note there are STAGES+1 entries in array
  --
  type t_sum_array is array(0 to STAGES) of unsigned(din'range);

  signal sums : t_sum_array;

begin

  --
  -- port size sanity check 
  -- enable synthesis assertions when using Vivado
  --
  assert count'LENGTH > STAGES
    report   "Output count vector size is too small to hold result"
    severity ERROR;

  --
  -- count bits by masking and adding adjacent bit groups
  --
  process(din,sums)
    begin

      --
      -- initialize index zero of the sum array to the data input
      --
      sums(0) <= unsigned(din);

      --
      -- stage loop
      --   note that the loop index starts at 1
      --
      stage_loop: for i in 1 to STAGES loop

        sums(i) <=   (   sums(i-1)                AND bit_mask(i) ) 
                   + ( ( sums(i-1) srl 2**(i-1) ) AND bit_mask(i) );

      end loop;

    end process;

  count <= std_logic_vector( sums(STAGES)( count'length-1 downto 0 ) );

end looped_function;

 

0 Kudos
Teacher xilinxacct
Teacher
1,642 Views
Registered: ‎10-23-2018

Re: What is the most space-efficient count_ones function?

@david12341234

That sounds like a fun challenge... What are the utilization numbers you are trying to beat?  What is the balance of resources you are looking for? (e.g. reduce adders, ...) In other words what resources are most precious OR is it just 'generally' uses a small amount of resources (and mix will do)? A 300 bit specific solution is acceptable? Any other criteria?

0 Kudos
Explorer
Explorer
1,637 Views
Registered: ‎11-29-2015

Re: What is the most space-efficient count_ones function?

@brimdavis

 

Value 507 is out of range 0 to 506 at the line marked below

function bit_mask( stage : natural ) return unsigned is
        variable mask : unsigned(SIZE-1 downto 0);
        constant STRIDE : natural := 2**(stage);
        variable i : natural := 0;
    begin
        mask := ( others => '0');
        while i < SIZE loop
            for j in 0 to STRIDE/2-1 loop
------->     mask(i+j) := '1';   <----------------
            end loop;
            i := i + STRIDE;
        end loop;
        return mask;
    end;
0 Kudos
Scholar brimdavis
Scholar
1,621 Views
Registered: ‎04-26-2012

Re: What is the most space-efficient count_ones function?

@david12341234  "Value 507 is out of range 0 to 506 at the line marked below"

I tested that second entity version for inputs of 256/512/1024, but those were all power of two; probably need to add another if to limit the index of the inner loop, something like:

  for j in 0 to STRIDE/2-1 loop
     if i+j < SIZE then mask(i+j) := '1';  end if;
  end loop;

If that doesn't fix it, post your latest function code and caller and I'll look at it.

-Brian

Scholar brimdavis
Scholar
1,615 Views
Registered: ‎04-26-2012

Re: What is the most space-efficient count_ones function?

For reference, here's a list of various bit counting techniques from comp.arch.fpga/comp.lang.vhdl/comp.lang.verilog, where this question pops up periodically:

  Counting the number of ones present:  https://groups.google.com/d/topic/comp.lang.verilog/_-xAzd8yfzI/discussion

  Counting ones:   https://groups.google.com/d/topic/comp.arch.fpga/2DhAH2VBhKg/discussion

 The following thread includes some synthesis result summaries:

  Loop and manually unrolled loop: https://groups.google.com/group/comp.lang.vhdl/msg/01e25cf637b280db

  A function with a loop: https://groups.google.com/group/comp.lang.vhdl/msg/fcb25e7e5613bffe

  Explicit coding of adder structure: https://groups.google.com/group/comp.lang.vhdl/msg/48d337742b3d37a9

  Hybrid, lookup tables + adder https://groups.google.com/group/comp.lang.vhdl/msg/410a3b0c2e7c2402

  My mask-and-add-vectors 32 bit version: https://groups.google.com/group/comp.lang.vhdl/msg/aec4118508169efb

  Results comparison in LUTs: https://groups.google.com/group/comp.lang.vhdl/msg/6924f9a1e0f04336

And for completeness, josyb's page on recursive bit counting that was mentioned up thread:  https://github.com/josyb/recurse

 -Brian

Explorer
Explorer
1,586 Views
Registered: ‎11-29-2015

Re: What is the most space-efficient count_ones function?

@brimdavis

I saw that thread you mentioned originally, before posting here. The consensus seems to be that the mask-and-add technique is the best (Fastest synthesis time. Lowest resource usage.) The recursive version took hours to synthesize.

Here is the updated function that I was originally looking for in my first post, based on your examples. Works for any size input, odd or even. I appreciate your advice.

function count_ones(din : std_logic_vector) return integer is
        constant SIZE : natural := din'LENGTH;

        function bit_mask( stage : natural ) return unsigned is
            variable mask : unsigned(SIZE-1 downto 0);
            constant STRIDE : natural := 2**(stage);
            variable i : natural := 0;
        begin
            mask := ( others => '0');
            while i < SIZE loop
                for j in 0 to STRIDE/2-1 loop
                    if i+j < SIZE then mask(i+j) := '1';  end if;
                end loop;
                i := i + STRIDE;
            end loop;
            return mask;
        end;
        -- number of adder stages needed to cover the input vector length
        constant STAGES : natural := natural( ceil( log2( real(SIZE) ) ));

        -- note there are STAGES+1 entries in array
        type t_sum_array is array(0 to STAGES) of unsigned(SIZE-1 downto 0);

        variable sums : t_sum_array;
    begin
        -- initialize index zero of the sum array to the data input
        sums(0) := unsigned(din);
        -- stage loop
        --   note that the loop index starts at 1
        stage_loop: for i in 1 to STAGES loop
            sums(i) := (sums(i-1) AND bit_mask(i)(SIZE-1 downto 0)) + ((sums(i-1) srl 2**(i-1)) AND bit_mask(i)(SIZE-1 downto 0));
        end loop;
        return to_integer(sums(STAGES)(STAGES-1 downto 0));
    end count_ones;
0 Kudos
Scholar brimdavis
Scholar
1,549 Views
Registered: ‎04-26-2012

Re: What is the most space-efficient count_ones function?

@david12341234   "Here is the updated function that I was originally looking for in my first post, based on your examples."

One correction to the bit_mask() function, I made a mistake (mostly harmless) in the j loop index, the code should read:

    for j in 0 to stage-1 loop

The original code produced the correct results, but the bit masks were wider than they needed to be (which the synthesizer then optimized away).

In Vivado 2017.4 targeting an Artix7, I see the following LUT usage:

  •  256 bits  386 LUTs
  •  128 bits  189 LUTs
  •   32 bits   42 LUTs

-Brian

 

0 Kudos
Teacher xilinxacct
Teacher
1,543 Views
Registered: ‎10-23-2018

Re: What is the most space-efficient count_ones function?

@brimdavis

Well done Brian!

You have gone the extra mile on this one, with all the links and updates. May I ask one favor, I would love to try this out on the UltraScale, and to insure I don’t miss one of the ‘fixes’, (e.g. end of with your final solution) would it be possible to upload the Vivado project?

Thanks in advance

0 Kudos
Scholar brimdavis
Scholar
1,523 Views
Registered: ‎04-26-2012

Re: What is the most space-efficient count_ones function?

@xilinxacct  "would it be possible to upload the Vivado project"

Code, testbench, simulation makefile (GHDL/GTKWave), and Vivado Artix7 project are checked in here:

  https://github.com/brimdavis/fpga_stuff/tree/master/bit_count

Note that the above repository contains my entity version of the shift-and-mask bit counting code.

If I have time someday, I'll add a package with a few of the bit counting variants as functions; as I said up thread, I usually avoid functions for potentially deep trees because they are more difficult to pipeline (i.e. you have to surround them with N stages of registers and turn on the synthesis retiming options.)

-Brian

Teacher xilinxacct
Teacher
1,516 Views
Registered: ‎10-23-2018

Re: What is the most space-efficient count_ones function?

@brimdavis

@david12341234

Thanks... for those who care... The UltraScale utilization come in at...

256 = 387 LUTs 126 carries

128 = 190 LUTs 62 carries

32 = 47 LUTs 15 carries

0 Kudos
Mentor hgleamon1
Mentor
1,484 Views
Registered: ‎11-14-2011

Re: What is the most space-efficient count_ones function?

Just to be a bit extreme (the original query was least amount of LUTs), you could use a BRAM as a look up table, thereby not using any LUTs at all.

Routing a 300+ bit address might be awkward, though and may depend, in a real system, on other BRAM usage.

 

----------
"That which we must learn to do, we learn by doing." - Aristotle
0 Kudos
Explorer
Explorer
1,463 Views
Registered: ‎11-29-2015

Re: What is the most space-efficient count_ones function?

@hgleamon1

Wouldn't you need 2^300 addresses though?! Is there a trick to somehow only use 300 addresses/elements?

0 Kudos
Mentor hgleamon1
Mentor
1,456 Views
Registered: ‎11-14-2011

Re: What is the most space-efficient count_ones function?

Yes, for a 300 bit address there would be a lot of data and probably impractically so (might work for simulation purposes but then you wouldn't care about LUT usage!)..

The memory will of course have the same output for a given number of addresses (e.g. every possible "one-hot" address will return 0b1) and could therefore itself be subject to some optimisation.

You could probably use the same function as described previously to populate the RAM/ROM at compile time if the size of vector is known at compile time.

As I implied in my original response, it's a bit extreme and probably more an academic exercise but for a smaller address vector you could use a BRAM to completely eliminate LUT usage.

----------
"That which we must learn to do, we learn by doing." - Aristotle
0 Kudos
Historian
Historian
1,433 Views
Registered: ‎01-23-2009

Re: What is the most space-efficient count_ones function?

Just to put this in perspective - a RAM with 300 bits that can return the number from 0 to 300 would be 1.83x10^91 bits of memory.

For comparison, the total amount of computer memory on the planet is estimated to be 2x10^21 bits...

Avrum

0 Kudos
Mentor hgleamon1
Mentor
1,420 Views
Registered: ‎11-14-2011

Re: What is the most space-efficient count_ones function?

I apologise for derailing the thread slightly but my point was really about approaching the issue from a completely different perspective, rather than about identifying a physically implementable solution, i.e. in principle a solution that uses the least amount of LUTs is one that uses no LUTs at all.

 

I trust that was understood :)

----------
"That which we must learn to do, we learn by doing." - Aristotle
0 Kudos
Scholar jmcclusk
Scholar
1,299 Views
Registered: ‎02-24-2014

Re: What is the most space-efficient count_ones function?

I wrote my own implementation of a count_ones module in VHDL using recursion, and I get rather different results, Brian.    

Targeting an Artix XC7A200TFFG1156-1 device with Vivado 2018.2 :

Width  LUTS   Latency   Fmax
4096   2916     7       257.6 MHz
 256    288     5       300.0 MHz
 128     98     4       312.5 MHz
  32     37     3       323.4 MHz

Getting these numbers was not easy, since Vivado is reluctant to pack the LUT's the way I like using behavorial RTL.  And even when packed properly, the placement of the LUT's is dubious at times.  I may be able to improve the timing a bit by some judiciously inserted RLOC constraints.

I find the LUT usage oddly high for 256 bit widths...  I'm not sure why this is happening.

John

 

Don't forget to close a thread when possible by accepting a post as a solution.
0 Kudos
Scholar brimdavis
Scholar
1,287 Views
Registered: ‎04-26-2012

Re: What is the most space-efficient count_ones function?

@jmcclusk  "I wrote my own implementation of a count_ones module in VHDL using recursion, and I get rather different results"

I haven't tried JosyB's above recursive function in Vivado yet; David had reported up-thread that Vivado synthesis of that recursion took hours to finish.

Re. your lower CLB counts, it would be interesting to see the synthesized structure - is it building ternary adders or small lookup tables for the early stages, as opposed to a binary adder tree?

I set out ~20 years ago intending to write an optimized (4-LUT) structural bit count, but the shift-and-mask adder inference worked well enough that I never got around to it...

-Brian

0 Kudos
Scholar jmcclusk
Scholar
1,255 Views
Registered: ‎02-24-2014

Re: What is the most space-efficient count_ones function?

Well, my previous results are all just garbage, because of a couple of problems..  First of all, it turns out you can make a 5:2 compressor in 2 luts by using a LUT6_2 primitive, but behavioral code just won't get you there.   Secondly Vivado is rather resistant to implementing small adders of 3 or 4 bits, and seems to prefer a LUT based solution, rather than a Carry4.   Furthermore, constructing an optimal tree is remarkably difficult, because reasonable choices at the branch points of the tree can result in bad situations at the leaves.   I'm considering trying to rewrite the recursion to go from the leaves down to the root, which is remarkably difficult...  I'm still mulling over the approach..

And lastly...  it turns out there's a little known secret about ternary adders.....  you get TWO carry inputs instead of just one, which is very handy for sucking up those left-over leaf bits.   But it also complicates the matter of building an optimal tree...   i'm beginning to think that searching the space of tree topologies is NP-complete.   

and finally...   Vivado is not very consistent about LUT reporting...   It reports a LUT6_2 as 2 luts in the Statistics of Cell properties..    The project summary page seems to avoid this issue, but I think I'm going to have to write a TCL script to actually search the netlist to avoid this double counting problem with LUT6_2 elements.   The Project summary tab seems to report correctly...  but is it really? 

Now I'm getting 25 LUTS for a 32 bit compressor, which is an outstanding number, from all results I've seen on the web.   We'll see.   I'll eventually post this code when I've tested it further and cleaned it up a bit.

256 bit compressor takes 232 LUTs...   it synthesizes and routes in a few minutes, so I can't complain about the speed.

Don't forget to close a thread when possible by accepting a post as a solution.
0 Kudos
Scholar brimdavis
Scholar
1,243 Views
Registered: ‎04-26-2012

Re: What is the most space-efficient count_ones function?

@jmcclusk  "256 bit compressor takes 232 LUTs...   it synthesizes and routes in a few minutes, so I can't complain about the speed."

It's interesting that your recursive function isn't giving Vivado indigestion, what Vivado version are you using?

I hacked up my looped shift and mask code to try a ternary add; but looking at the synthesis results, it's not actually inferring any ternary adders- I think the shift/mask is breaking the inference thereof.

But the recoding did shave a few LUTS over the original binary tree code for the smaller widths, 32 bits is now 36 LUTs.

Although to be a fair comparison to the results in older devices, we'd probably really want to break out the detailed counts of LUT4, LUT6, carries, etc. for each version.

-Brian

0 Kudos