**UPGRADE YOUR BROWSER**

We have detected your current browser version is not the latest one. Xilinx.com uses the latest web technologies to bring you the best online experience possible. Please upgrade to a Xilinx.com supported browser:Chrome, Firefox, Internet Explorer 11, Safari. Thank you!

Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- Community Forums
- :
- Forums
- :
- Vivado RTL Development
- :
- Synthesis
- :
- What is the most space-efficient count_ones functi...

- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page

Highlighted
##

david12341234

Explorer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-01-2018 04:44 PM

1,817 Views

Registered:
11-29-2015

What is the most space-efficient count_ones function?

Given that the requirement is to count ones in one clock tick, for a std_logic_vector of __any size__, what function will use the least amount of LUTS?

I just know of the method below, but I'm reading that it performs poorly in terms of area vs other designs, but the other designs all have a fixed input bit width.

function count_ones(s : std_logic_vector) return integer is variable temp : natural := 0; begin for i in s'range loop if s(i) = '1' then temp := temp + 1; end if; end loop; return temp; end function count_ones;

27 Replies

avrumw

Historian

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-01-2018 05:31 PM

1,798 Views

Registered:
01-23-2009

Re: What is the most space-efficient count_ones function?

While there are some exceptions, in general the RTL coding style you use to describe **combinatorial** **logic** does not affect the number of LUTs required for the function. Synthesis is very good at doing Boolean minimization, so regardless of what form the logic starts at (after elaboration) synthesis will likely find a similar result (in terms of usage) after it has completed its minimization.

Avrum

david12341234

Explorer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-01-2018 07:35 PM

1,781 Views

Registered:
11-29-2015

Re: What is the most space-efficient count_ones function?

I have a 300bit vector which somehow results in 4500 adders... Isn't there a more efficient way to count 300 bits?

Module census_vector Detailed RTL Component Info : +---Adders : 2 Input 9 Bit Adders := 4500 +---XORs : 2 Input 300 Bit XORs := 30 +---Registers : 11 Bit Registers := 2 5 Bit Registers := 1 1 Bit Registers := 2 +---Muxes : 2 Input 11 Bit Muxes := 1 2 Input 9 Bit Muxes := 41 2 Input 5 Bit Muxes := 20 2 Input 4 Bit Muxes := 12 2 Input 3 Bit Muxes := 6 2 Input 2 Bit Muxes := 3 2 Input 1 Bit Muxes := 2

david12341234

Explorer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-02-2018 12:25 AM

1,754 Views

Registered:
11-29-2015

Re: What is the most space-efficient count_ones function?

I found the code below which was reported to be very space efficient but it I don't quite follow the logic enough to expand it out to 300 bits without coping out and just adding 10 them together. How would you modify this function to accept a 300 bit input?

function count_ones_32_bit(s : std_logic_vector) return integer is variable temp : unsigned (31 downto 0); variable temp1 : unsigned (31 downto 0); variable temp2 : unsigned (31 downto 0); variable temp3 : unsigned (31 downto 0); variable temp4 : unsigned (31 downto 0); variable temp5 : unsigned (31 downto 0); begin temp := unsigned(s); temp1 := (temp AND X"5555_5555") + ( ( temp srl 1) AND X"5555_5555"); -- 0..2 out x16 temp2 := (temp1 AND X"3333_3333") + ( ( temp1 srl 2) AND X"3333_3333"); -- 0..4 out x8 temp3 := (temp2 AND X"0707_0707") + ( ( temp2 srl 4) AND X"0707_0707"); -- 0..8 out x4 temp4 := (temp3 AND X"001f_001f") + ( ( temp3 srl 8) AND X"001f_001f"); -- 0..16 out x2 temp5 := (temp4 AND X"0000_003f") + ( ( temp4 srl 16) AND X"0000_003f"); -- 0..32 out return to_integer(temp5(5 downto 0)); end count_ones_32_bit;

brimdavis

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-02-2018 09:30 AM - edited 12-02-2018 09:31 AM

1,722 Views

Registered:
04-26-2012

Re: What is the most space-efficient count_ones function?

@david12341234 "How would you modify this function to accept a 300 bit input?"

I have always liked the "shift, mask, and add" approach because it is simple yet produces near-optimal results with most synthesizers - that said, I haven't checked the results in Vivado.

Below is a version using an entity with unconstrained ports that works to 512 bits ( limited only by the internal mask table size, see FIXME note in code).

This code hasn't been simulated, so it may have stupid errors...

The input port lengths are sanity checked using assertions, so be sure you turn on synthesis assertions in Vivado.

-- -- <bit_count.vhd> -- --------------------------------------------------------------- -- -- (C) COPYRIGHT 2018 Brian Davis -- -- Code released under the terms of the MIT license -- https://opensource.org/licenses/MIT -- --------------------------------------------------------------- -- -- bit counting -- -- - uses classical software shift-and-mask addition technique to count bits -- XST optimizes this into a cascade of small adders -- -- - unconstrained input vector supported by using a loop with a signal array -- - input is currently limited to 512 bits by the mask array size -- library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all; use ieee.math_real.all; entity bit_count is port ( din : in std_logic_vector; count : out std_logic_vector ); end bit_count; architecture looped_array of bit_count is -- -- FIXME: currently using a constant array of masks good to 512 bits -- should just create these on the fly with a function call in the loop -- -- note mask indexing starts at one to match loop range -- type t_mask_array is array(1 to 9) of unsigned(511 downto 0); constant mask : t_mask_array := ( X"5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555", X"3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333", X"0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707", X"000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f", X"0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f", X"0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f", X"0000_0000_0000_0000_0000_0000_0000_007f_0000_0000_0000_0000_0000_0000_0000_007f_0000_0000_0000_0000_0000_0000_0000_007f_0000_0000_0000_0000_0000_0000_0000_007f", X"0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_00ff_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_00ff", X"0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_01ff_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_01ff" ); -- -- number of adder stages needed to cover the input vector length -- constant STAGES : natural := natural( ceil( log2( real(din'length) ) )); -- -- note there are STAGES+1 entries in array -- type t_sum_array is array(0 to STAGES) of unsigned(din'range); signal sums : t_sum_array; begin -- -- port size sanity check -- enable synthesis assertions when using Vivado -- assert din'length <= mask(1)'length report "Input vector size exceeds internal mask array size" severity ERROR; assert count'length >= STAGES report "Output count vector size is too small to hold result" severity ERROR; -- -- count bits by masking and adding adjacent bit groups -- process(din,sums) begin -- -- initialize index zero of the sum array to the data input -- sums(0) <= unsigned(din); -- -- stage loop -- note that the loop index starts at 1 -- stage_loop: for i in 1 to STAGES loop sums(i) <= ( sums(i-1) AND mask(i)(din'length-1 downto 0) ) + ( ( sums(i-1) srl 2**(i-1) ) AND mask(i)(din'length-1 downto 0) ); end loop; count <= std_logic_vector( sums(STAGES)( count'length-1 downto 0 ) ); end process; end looped_array;

-Brian

david12341234

Explorer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-02-2018 11:29 AM

1,708 Views

Registered:
11-29-2015

Re: What is the most space-efficient count_ones function?

I created this function based on your entity. It is hacky in that it only works with a 512 bit input. It would be nice to improve this to accept 512 bits or less, or even to accept any vector length.

function count_512_ones(din : std_logic_vector) return integer is -- -- FIXME: currently using a constant array of masks good to 512 bits -- should just create these on the fly with a function call in the loop -- -- note mask indexing starts at one to match loop range -- type t_mask_array is array(1 to 9) of unsigned(511 downto 0); constant mask : t_mask_array := ( X"5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555_5555", X"3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333_3333", X"0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707_0707", X"000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f_000f", X"0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f_0000_001f", X"0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f_0000_0000_0000_003f", X"0000_0000_0000_0000_0000_0000_0000_007f_0000_0000_0000_0000_0000_0000_0000_007f_0000_0000_0000_0000_0000_0000_0000_007f_0000_0000_0000_0000_0000_0000_0000_007f", X"0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_00ff_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_00ff", X"0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_01ff_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_0000_01ff" ); -- -- number of adder stages needed to cover the input vector length -- -- constant STAGES : natural := natural( ceil( log2( real(din'length) ) )); constant STAGES : natural := 9; -- -- note there are STAGES+1 entries in array -- type t_sum_array is array(0 to STAGES) of unsigned(511 downto 0); variable sums : t_sum_array; begin -- -- initialize index zero of the sum array to the data input -- sums(0) := unsigned(din); -- -- stage loop -- note that the loop index starts at 1 -- stage_loop: for i in 1 to STAGES loop sums(i) := (sums(i-1) AND mask(i)(din'length-1 downto 0)) + ((sums(i-1) srl 2**(i-1)) AND mask(i)(din'length-1 downto 0) ); end loop; return to_integer(sums(STAGES)(511 downto 0)); end count_512_ones;

brimdavis

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-02-2018 11:54 AM - edited 12-02-2018 12:00 PM

1,704 Views

Registered:
04-26-2012

Re: What is the most space-efficient count_ones function?

@david12341234 "It is hacky in that it only works with a 512 bit input. It would be nice to improve this to accept 512 bits or less, or even to accept any vector length."

I haven't reviewed your code, but my original example sized itself to an unconstrained std_logic_vector input port, and would work with any vector length of 512 bits or less.

EDIT: if you restore the original log2 code for STAGES, and change your function's last line to something like the following, I think it should work for any input vector up to 512 bits:

return to_integer(sums(STAGES)(STAGES-1 downto 0));

( I would also note that in practice to meet timing constraints, deep adder trees benefit from a pipeline register in each stage; this is easy to do in my entity version, but adds STAGES clock cycles of latency. )

Also FYI, an elegant recursive bit counting function was posted by @josyb on the late Programmable Planet forums:

-- countbits.vhd -- 07/12/2012 jb -- recursive counting of bits in a std_logic_vector (of arbitrary length) function count_bits( s : std_logic_vector ) return natural is begin if (s'length = 1) then if ( s = "1") then return 1 ; else return 0 ; end if ; else return count_bits( s(s'high downto s'low + s'length / 2) ) + count_bits( s(s'low + s'length / 2 - 1 downto s'low ) ) ; end if ; end function ;

I haven't tried the recursive version in Vivado, but sometimes synthesis tools have problems with, or get excruciatingly slow, upon encoutering deep recursion.

-Brian

josyb

Observer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-02-2018 12:16 PM

1,694 Views

Registered:
05-13-2015

Re: What is the most space-efficient count_ones function?

Brian,

thanks for the mention. Do you have more of that 'All Programmable Planet' stuff? That site just disappeared from the web without a trace ...

You may want to look at this page: https://github.com/josyb/recurse

It is about recursion in MyHDL but most of it applies to VHDL too.

It also shows that the recursive implementation will have the least levels of logic.

Regards,

Josy

brimdavis

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-02-2018 12:41 PM

1,685 Views

Registered:
04-26-2012

Re: What is the most space-efficient count_ones function?

@josyb "Do you have more of that 'All Programmable Planet' stuff?"

Sadly, I only have copies of stuff from the threads that I'd posted to.

"It also shows that the recursive implementation will have the least levels of logic. "

The looped shift-and-mask code that I posted above relies on the synthesis tool to optimize out masked-off bits of the carry chain, so it ends up generating the same tree of small-ish adders as does your recursive version.

It's not as elegant as the recursive version, but if I replace the hardcoded mask array with a function to create the masks on the fly, I think it would at least get an honorable mention :)

"You may want to look at this page: https://github.com/josyb/recurse"

Thanks for the link! I've been meaning to experiment with MyHDL.

-Brian

brimdavis

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-02-2018 01:46 PM - edited 12-03-2018 03:24 PM

1,674 Views

Registered:
04-26-2012

Re: What is the most space-efficient count_ones function?

Earlier, I wrote "if I replace the hardcoded mask array with a function to create the masks on the fly"

Here's a version of the shift-and-mask code (entity version) without the maximum-of-512-bit input vector length, also fixed a bug in the assertion check:

EDIT: bogus syntax check errors (red underlines) in the Vivado text editor regarding math.real functions can be safely ignored, as Vivado Synthesis supports elaboration time computations using the math.real functions.

-- -- <bit_count.vhd> -- --------------------------------------------------------------- -- -- (C) COPYRIGHT 2018 Brian Davis -- -- Code released under the terms of the MIT license -- https://opensource.org/licenses/MIT -- --------------------------------------------------------------- -- -- bit counting -- -- - uses classical software shift-and-mask addition technique to count bits -- XST optimizes this into a tree of small adders -- library ieee; use ieee.std_logic_1164.all; use ieee.numeric_std.all; use ieee.math_real.all; entity bit_count is port ( din : in std_logic_vector; count : out std_logic_vector ); end bit_count; architecture looped_function of bit_count is constant SIZE : natural := din'LENGTH; -- -- produce bit mask arrays to add consecutive groups of bits, e.g. -- -- STAGE MASK -- 1 X"5555_5555" -- 2 X"3333_3333" -- 3 X"0707_0707" -- 4 X"000f_000f" -- function bit_mask( stage : natural ) return unsigned is variable mask : unsigned(SIZE-1 downto 0); constant STRIDE : natural := 2**(stage); variable i : natural := 0; begin mask := ( others => '0'); while i < SIZE loop for j in 0 to stage-1 loop -- EDIT: fixed loop index if i+j < SIZE then mask(i+j) := '1'; end if; -- EDIT: fixed non-power-of-two bug end loop; i := i + STRIDE; end loop; return mask; end; -- -- number of adder stages needed to cover the input vector length -- constant STAGES : natural := natural( ceil( log2( real(SIZE) ) )); -- -- note there are STAGES+1 entries in array -- type t_sum_array is array(0 to STAGES) of unsigned(din'range); signal sums : t_sum_array; begin -- -- port size sanity check -- enable synthesis assertions when using Vivado -- assert count'LENGTH > STAGES report "Output count vector size is too small to hold result" severity ERROR; -- -- count bits by masking and adding adjacent bit groups -- process(din,sums) begin -- -- initialize index zero of the sum array to the data input -- sums(0) <= unsigned(din); -- -- stage loop -- note that the loop index starts at 1 -- stage_loop: for i in 1 to STAGES loop sums(i) <= ( sums(i-1) AND bit_mask(i) ) + ( ( sums(i-1) srl 2**(i-1) ) AND bit_mask(i) ); end loop; end process; count <= std_logic_vector( sums(STAGES)( count'length-1 downto 0 ) ); end looped_function;

xilinxacct

Teacher

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-02-2018 04:19 PM

1,642 Views

Registered:
10-23-2018

Re: What is the most space-efficient count_ones function?

david12341234

Explorer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-02-2018 04:35 PM

1,637 Views

Registered:
11-29-2015

Re: What is the most space-efficient count_ones function?

Value 507 is out of range 0 to 506 at the line marked below

function bit_mask( stage : natural ) return unsigned is variable mask : unsigned(SIZE-1 downto 0); constant STRIDE : natural := 2**(stage); variable i : natural := 0; begin mask := ( others => '0'); while i < SIZE loop for j in 0 to STRIDE/2-1 loop -------> mask(i+j) := '1'; <---------------- end loop; i := i + STRIDE; end loop; return mask; end;

brimdavis

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-02-2018 04:50 PM

1,621 Views

Registered:
04-26-2012

Re: What is the most space-efficient count_ones function?

@david12341234 "Value 507 is out of range 0 to 506 at the line marked below"

I tested that second entity version for inputs of 256/512/1024, but those were all power of two; probably need to add another **if** to limit the index of the inner loop, something like:

for j in 0 to STRIDE/2-1 loop if i+j < SIZE then mask(i+j) := '1'; end if; end loop;

If that doesn't fix it, post your latest function code and caller and I'll look at it.

-Brian

brimdavis

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-02-2018 05:10 PM

1,615 Views

Registered:
04-26-2012

Re: What is the most space-efficient count_ones function?

For reference, here's a list of various bit counting techniques from comp.arch.fpga/comp.lang.vhdl/comp.lang.verilog, where this question pops up periodically:

Counting the number of ones present: https://groups.google.com/d/topic/comp.lang.verilog/_-xAzd8yfzI/discussion

Counting ones: https://groups.google.com/d/topic/comp.arch.fpga/2DhAH2VBhKg/discussion

The following thread includes some synthesis result summaries:

Loop and manually unrolled loop: https://groups.google.com/group/comp.lang.vhdl/msg/01e25cf637b280db

A function with a loop: https://groups.google.com/group/comp.lang.vhdl/msg/fcb25e7e5613bffe

Explicit coding of adder structure: https://groups.google.com/group/comp.lang.vhdl/msg/48d337742b3d37a9

Hybrid, lookup tables + adder https://groups.google.com/group/comp.lang.vhdl/msg/410a3b0c2e7c2402

My mask-and-add-vectors 32 bit version: https://groups.google.com/group/comp.lang.vhdl/msg/aec4118508169efb

Results comparison in LUTs: https://groups.google.com/group/comp.lang.vhdl/msg/6924f9a1e0f04336

And for completeness, josyb's page on recursive bit counting that was mentioned up thread: https://github.com/josyb/recurse

-Brian

david12341234

Explorer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-02-2018 07:21 PM

1,586 Views

Registered:
11-29-2015

Re: What is the most space-efficient count_ones function?

I saw that thread you mentioned originally, before posting here. The consensus seems to be that the mask-and-add technique is the best (Fastest synthesis time. Lowest resource usage.) The recursive version took hours to synthesize.

Here is the updated function that I was originally looking for in my first post, based on your examples. Works for any size input, odd or even. I appreciate your advice.

function count_ones(din : std_logic_vector) return integer is constant SIZE : natural := din'LENGTH; function bit_mask( stage : natural ) return unsigned is variable mask : unsigned(SIZE-1 downto 0); constant STRIDE : natural := 2**(stage); variable i : natural := 0; begin mask := ( others => '0'); while i < SIZE loop for j in 0 to STRIDE/2-1 loop if i+j < SIZE then mask(i+j) := '1'; end if; end loop; i := i + STRIDE; end loop; return mask; end; -- number of adder stages needed to cover the input vector length constant STAGES : natural := natural( ceil( log2( real(SIZE) ) )); -- note there are STAGES+1 entries in array type t_sum_array is array(0 to STAGES) of unsigned(SIZE-1 downto 0); variable sums : t_sum_array; begin -- initialize index zero of the sum array to the data input sums(0) := unsigned(din); -- stage loop -- note that the loop index starts at 1 stage_loop: for i in 1 to STAGES loop sums(i) := (sums(i-1) AND bit_mask(i)(SIZE-1 downto 0)) + ((sums(i-1) srl 2**(i-1)) AND bit_mask(i)(SIZE-1 downto 0)); end loop; return to_integer(sums(STAGES)(STAGES-1 downto 0)); end count_ones;

brimdavis

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-03-2018 03:22 PM

1,549 Views

Registered:
04-26-2012

Re: What is the most space-efficient count_ones function?

@david12341234 "Here is the updated function that I was originally looking for in my first post, based on your examples."

One correction to the bit_mask() function, I made a mistake (mostly harmless) in the j loop index, the code should read:

for j in 0 to stage-1 loop

The original code produced the correct results, but the bit masks were wider than they needed to be (which the synthesizer then optimized away).

In Vivado 2017.4 targeting an Artix7, I see the following LUT usage:

- 256 bits 386 LUTs
- 128 bits 189 LUTs
- 32 bits 42 LUTs

-Brian

xilinxacct

Teacher

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-03-2018 03:41 PM

1,543 Views

Registered:
10-23-2018

Re: What is the most space-efficient count_ones function?

Well done Brian!

You have gone the extra mile on this one, with all the links and updates. May I ask one favor, I would love to try this out on the UltraScale, and to insure I don’t miss one of the ‘fixes’, (e.g. end of with your final solution) would it be possible to upload the Vivado project?

Thanks in advance

brimdavis

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-03-2018 06:20 PM

1,523 Views

Registered:
04-26-2012

Re: What is the most space-efficient count_ones function?

@xilinxacct "would it be possible to upload the Vivado project"

Code, testbench, simulation makefile (GHDL/GTKWave), and Vivado Artix7 project are checked in here:

https://github.com/brimdavis/fpga_stuff/tree/master/bit_count

Note that the above repository contains my entity version of the shift-and-mask bit counting code.

If I have time someday, I'll add a package with a few of the bit counting variants as functions; as I said up thread, I usually avoid functions for potentially deep trees because they are more difficult to pipeline (i.e. you have to surround them with N stages of registers and turn on the synthesis retiming options.)

-Brian

xilinxacct

Teacher

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-03-2018 06:46 PM

1,516 Views

Registered:
10-23-2018

Re: What is the most space-efficient count_ones function?

Thanks... for those who care... The UltraScale utilization come in at...

256 = 387 LUTs 126 carries

128 = 190 LUTs 62 carries

32 = 47 LUTs 15 carries

hgleamon1

Mentor

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-04-2018 09:56 AM

1,484 Views

Registered:
11-14-2011

Re: What is the most space-efficient count_ones function?

Just to be a bit extreme (the original query was least amount of LUTs), you could use a BRAM as a look up table, thereby not using any LUTs at all.

Routing a 300+ bit address might be awkward, though and may depend, in a real system, on other BRAM usage.

----------

"That which we must learn to do, we learn by doing." - Aristotle

"That which we must learn to do, we learn by doing." - Aristotle

david12341234

Explorer

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-04-2018 10:41 PM

1,463 Views

Registered:
11-29-2015

Re: What is the most space-efficient count_ones function?

hgleamon1

Mentor

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-04-2018 11:12 PM

1,456 Views

Registered:
11-14-2011

Re: What is the most space-efficient count_ones function?

Yes, for a 300 bit address there would be a lot of data and probably impractically so (might work for simulation purposes but then you wouldn't care about LUT usage!)..

The memory will of course have the same output for a given number of addresses (e.g. every possible "one-hot" address will return 0b1) and could therefore itself be subject to some optimisation.

You could probably use the same function as described previously to populate the RAM/ROM at compile time if the size of vector is known at compile time.

As I implied in my original response, it's a bit extreme and probably more an academic exercise but for a smaller address vector you could use a BRAM to completely eliminate LUT usage.

----------

"That which we must learn to do, we learn by doing." - Aristotle

"That which we must learn to do, we learn by doing." - Aristotle

avrumw

Historian

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-05-2018 08:22 PM - edited 12-05-2018 08:23 PM

1,433 Views

Registered:
01-23-2009

Re: What is the most space-efficient count_ones function?

Just to put this in perspective - a RAM with 300 bits that can return the number from 0 to 300 would be 1.83x10^91 bits of memory.

For comparison, the total amount of computer memory on the planet is estimated to be 2x10^21 bits...

Avrum

hgleamon1

Mentor

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-06-2018 04:22 AM

1,420 Views

Registered:
11-14-2011

Re: What is the most space-efficient count_ones function?

I apologise for derailing the thread slightly but my point was really about approaching the issue from a completely different perspective, rather than about identifying a physically implementable solution, i.e. in principle a solution that uses the least amount of LUTs is one that uses no LUTs at all.

I trust that was understood :)

----------

"That which we must learn to do, we learn by doing." - Aristotle

"That which we must learn to do, we learn by doing." - Aristotle

jmcclusk

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-13-2018 02:41 PM

1,299 Views

Registered:
02-24-2014

Re: What is the most space-efficient count_ones function?

I wrote my own implementation of a count_ones module in VHDL using recursion, and I get rather different results, Brian.

Targeting an Artix XC7A200TFFG1156-1 device with Vivado 2018.2 :

Width LUTS Latency Fmax

4096 2916 7 257.6 MHz

256 288 5 300.0 MHz

128 98 4 312.5 MHz

32 37 3 323.4 MHz

Getting these numbers was not easy, since Vivado is reluctant to pack the LUT's the way I like using behavorial RTL. And even when packed properly, the placement of the LUT's is dubious at times. I may be able to improve the timing a bit by some judiciously inserted RLOC constraints.

I find the LUT usage oddly high for 256 bit widths... I'm not sure why this is happening.

John

Don't forget to close a thread when possible by accepting a post as a solution.

brimdavis

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-13-2018 04:33 PM

1,287 Views

Registered:
04-26-2012

Re: What is the most space-efficient count_ones function?

@jmcclusk "I wrote my own implementation of a count_ones module in VHDL using recursion, and I get rather different results"

I haven't tried JosyB's above recursive function in Vivado yet; David had reported up-thread that Vivado synthesis of that recursion took hours to finish.

Re. your lower CLB counts, it would be interesting to see the synthesized structure - is it building ternary adders or small lookup tables for the early stages, as opposed to a binary adder tree?

I set out ~20 years ago intending to write an optimized (4-LUT) structural bit count, but the shift-and-mask adder inference worked well enough that I never got around to it...

-Brian

jmcclusk

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-14-2018 05:06 PM

1,255 Views

Registered:
02-24-2014

Re: What is the most space-efficient count_ones function?

Well, my previous results are all just garbage, because of a couple of problems.. First of all, it turns out you can make a 5:2 compressor in 2 luts by using a LUT6_2 primitive, but behavioral code just won't get you there. Secondly Vivado is rather resistant to implementing small adders of 3 or 4 bits, and seems to prefer a LUT based solution, rather than a Carry4. Furthermore, constructing an optimal tree is remarkably difficult, because reasonable choices at the branch points of the tree can result in bad situations at the leaves. I'm considering trying to rewrite the recursion to go from the leaves down to the root, which is remarkably difficult... I'm still mulling over the approach..

And lastly... it turns out there's a little known secret about ternary adders..... you get TWO carry inputs instead of just one, which is very handy for sucking up those left-over leaf bits. But it also complicates the matter of building an optimal tree... i'm beginning to think that searching the space of tree topologies is NP-complete.

and finally... Vivado is not very consistent about LUT reporting... It reports a LUT6_2 as 2 luts in the Statistics of Cell properties.. The project summary page seems to avoid this issue, but I think I'm going to have to write a TCL script to actually search the netlist to avoid this double counting problem with LUT6_2 elements. The Project summary tab seems to report correctly... but is it really?

Now I'm getting 25 LUTS for a 32 bit compressor, which is an outstanding number, from all results I've seen on the web. We'll see. I'll eventually post this code when I've tested it further and cleaned it up a bit.

256 bit compressor takes 232 LUTs... it synthesizes and routes in a few minutes, so I can't complain about the speed.

Don't forget to close a thread when possible by accepting a post as a solution.

brimdavis

Scholar

- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Email to a Friend
- Report Inappropriate Content

12-14-2018 06:49 PM

1,243 Views

Registered:
04-26-2012

Re: What is the most space-efficient count_ones function?

@jmcclusk "256 bit compressor takes 232 LUTs... it synthesizes and routes in a few minutes, so I can't complain about the speed."

It's interesting that your recursive function isn't giving Vivado indigestion, what Vivado version are you using?

I hacked up my looped shift and mask code to try a ternary add; but looking at the synthesis results, it's not actually inferring any ternary adders- I think the shift/mask is breaking the inference thereof.

But the recoding did shave a few LUTS over the original binary tree code for the smaller widths, 32 bits is now 36 LUTs.

Although to be a fair comparison to the results in older devices, we'd probably really want to break out the detailed counts of LUT4, LUT6, carries, etc. for each version.

-Brian