cancel
Showing results for 
Show  only  | Search instead for 
Did you mean: 
Highlighted
Explorer
Explorer
615 Views
Registered: ‎11-29-2015

Simplest way to pipeline this for loop?

I'm looking for the general, widely used method for pipelining for loops in a way that doens't add a lot of confusion or tons of extra code. The example below is the original for loop.

library IEEE;
	use IEEE.STD_LOGIC_1164.ALL;
	use IEEE.NUMERIC_STD.ALL;
	use IEEE.STD_LOGIC_UNSIGNED.ALL;
        use work.custom_functions.all;
	use work.custom_package.all;

entity lowset_diff_finder is
	Port (
        clk_from_camera             : in  std_logic;
        iDiff_array                 : in type_diff_array;
        oDisparity_conclusion       : out natural range 0 to cSearch_width_minus_1+1
    );
end lowset_diff_finder;

architecture Behavioral of lowset_diff_finder is
   
begin
    process (clk_from_camera)
        variable vLowest_diff_so_far        : unsigned(log2(cCensus_vector_length)-1 downto 0) := (others => '1');
    begin
        if rising_edge(clk_from_camera) then 
            
            vLowest_diff_so_far := (others => '1');
            oDisparity_conclusion <= 0;

            for k in 0 to cSearch_width_minus_1 loop
                if iDiff_array(k) < vLowest_diff_so_far then
                    vLowest_diff_so_far := iDiff_array(k);	
                    oDisparity_conclusion <= k;
                end if;
            end loop; -- for k in 0 to cSearch_width-cKernel_width loop

        end if;       -- if rising_edge(clk_from_camera) then
    end process;
end Behavioral;

I'm guessing I could do something like the code below, but not sure if this is the simplest way to do it, and it gets kind of confusing to follow. The idea is that the oDisparity_conclusion would be correct, but it would just be delayed by the total number of pipeline stages. Is this the cleanest way to do it?

 

library IEEE;
	use IEEE.STD_LOGIC_1164.ALL;
	use IEEE.NUMERIC_STD.ALL;
	use IEEE.STD_LOGIC_UNSIGNED.ALL;
        use work.custom_functions.all;
	use work.custom_package.all;

entity lowset_diff_finder is
	Port (
        clk_from_camera             : in  std_logic;
        iDiff_array                 : in type_diff_array;
        oReal_disparity_conclusion  : out natural range 0 to cSearch_width_minus_1+1
    );
end lowset_diff_finder;

architecture Behavioral of lowset_diff_finder is
    signal sPipeline_stage : natural range 0 to cSearch_width_minus_1 := 0;
    type type_diff_array_of_arrays is array (0 to cSearch_width_minus_1) of type_diff_array;
    signal sDiff_array_of_arrays                 : type_diff_array_of_arrays;
    type type_lowest_diff_so_far_array is array (0 to cSearch_width_minus_1) of unsigned(log2(cCensus_vector_length)-1 downto 0);
    signal sLowest_diff_so_far_array        : type_lowest_diff_so_far_array := (others => (others => '1'));
    type type_real_disparity_conclusion_array is array (0 to cSearch_width_minus_1) of natural range 0 to cSearch_width_minus_1;
    signal sReal_disparity_conclusion_array        : type_real_disparity_conclusion_array := (others => 0);
begin
    process (clk_from_camera)
    begin
        if rising_edge(clk_from_camera) then 

            sLowest_diff_so_far_array(sPipeline_stage) <= (others => '1');
            if sPipeline_stage = cSearch_width_minus_1 then
                sPipeline_stage <= 0;
            else
                sPipeline_stage <= sPipeline_stage + 1;
            end if;
            
            sDiff_array_of_arrays(sPipeline_stage) <= iDiff_array;

            if  sDiff_array_of_arrays(sPipeline_stage)(sPipeline_stage) < sLowest_diff_so_far_array(sPipeline_stage) then
                sLowest_diff_so_far_array(sPipeline_stage) <= sDiff_array_of_arrays(sPipeline_stage)(sPipeline_stage);	
                sReal_disparity_conclusion_array(sPipeline_stage) <= sPipeline_stage;
            end if;

            oReal_disparity_conclusion <= sReal_disparity_conclusion_array(sPipeline_stage);
        
        end if;       -- if rising_edge(clk_from_camera) then
    end process;
end Behavioral;
0 Kudos
2 Replies
Highlighted
Scholar
Scholar
562 Views
Registered: ‎08-01-2012

You have two bits of code here that are not functionally the same. The first  code looks like a direct port of C. This is likely to create an inefficent and slow circuit in an FPGA because it does the entire search in a single clock cycle.

The second one is a real attempt to create a circuit that would actually work in an FPGA. The extra clock cycles add latency, but probably increase the maximum clock speed to actually make it run rather fast.

There is no general way to convert any given C code style loop. It will really depend on the algorithm. And this is where engineering comes in. Can you create a solution that works efficently in hardware?

 

In your code, You are doing a 1 by 1 search. This will work but will be slow, and you can only input a new data-set every N clock cycles (where N in the size of the data set). If you used a merge sort, while the resource usage is larger, you can input a new data set on every clock. It still has an N clock latency, but given that you can input a new whole data set every clock, the throughput is Nx larger.

0 Kudos
Highlighted
Explorer
Explorer
540 Views
Registered: ‎11-29-2015

@richardhead

The general problem here is that I'm trying to find the minimum sum of one's of an array of census vectors (iDiff_array). (A census vector can be thought of as just an array of bits). I don't actually care what the specific minimum value is though. I just care about the index which contains the census vector with the fewest number of 1's, (since that represents the disparity between the reference image and the search image).

Is there some fast and space-efficient way to determine the index of the particular census vector (in an array of census vectors) which contains the least amount of 1's, without actually using adders to count them? I'm aware of the bit masking method but that returns the acutal number of ones in the census vector, which I don't necessarily need, and therefore may be wasting resources to find.

There must be some algorithmic advantage to not actually needing the minimum value itself, but instead needing the index of the census vecor with the least number of ones. 

The problem can be summarized to this:
Find the index of the vector containing the least number of ones, using only bitwise operations:
0: 0101
1: 1111
2: 0100
3: 1101
(The answer would be index 2)

0 Kudos