02-05-2019 05:21 PM
I'm looking for the general, widely used method for pipelining for loops in a way that doens't add a lot of confusion or tons of extra code. The example below is the original for loop.
library IEEE; use IEEE.STD_LOGIC_1164.ALL; use IEEE.NUMERIC_STD.ALL; use IEEE.STD_LOGIC_UNSIGNED.ALL; use work.custom_functions.all; use work.custom_package.all; entity lowset_diff_finder is Port ( clk_from_camera : in std_logic; iDiff_array : in type_diff_array; oDisparity_conclusion : out natural range 0 to cSearch_width_minus_1+1 ); end lowset_diff_finder; architecture Behavioral of lowset_diff_finder is begin process (clk_from_camera) variable vLowest_diff_so_far : unsigned(log2(cCensus_vector_length)-1 downto 0) := (others => '1'); begin if rising_edge(clk_from_camera) then vLowest_diff_so_far := (others => '1'); oDisparity_conclusion <= 0; for k in 0 to cSearch_width_minus_1 loop if iDiff_array(k) < vLowest_diff_so_far then vLowest_diff_so_far := iDiff_array(k); oDisparity_conclusion <= k; end if; end loop; -- for k in 0 to cSearch_width-cKernel_width loop end if; -- if rising_edge(clk_from_camera) then end process; end Behavioral;
I'm guessing I could do something like the code below, but not sure if this is the simplest way to do it, and it gets kind of confusing to follow. The idea is that the oDisparity_conclusion would be correct, but it would just be delayed by the total number of pipeline stages. Is this the cleanest way to do it?
library IEEE; use IEEE.STD_LOGIC_1164.ALL; use IEEE.NUMERIC_STD.ALL; use IEEE.STD_LOGIC_UNSIGNED.ALL; use work.custom_functions.all; use work.custom_package.all; entity lowset_diff_finder is Port ( clk_from_camera : in std_logic; iDiff_array : in type_diff_array; oReal_disparity_conclusion : out natural range 0 to cSearch_width_minus_1+1 ); end lowset_diff_finder; architecture Behavioral of lowset_diff_finder is signal sPipeline_stage : natural range 0 to cSearch_width_minus_1 := 0; type type_diff_array_of_arrays is array (0 to cSearch_width_minus_1) of type_diff_array; signal sDiff_array_of_arrays : type_diff_array_of_arrays; type type_lowest_diff_so_far_array is array (0 to cSearch_width_minus_1) of unsigned(log2(cCensus_vector_length)-1 downto 0); signal sLowest_diff_so_far_array : type_lowest_diff_so_far_array := (others => (others => '1')); type type_real_disparity_conclusion_array is array (0 to cSearch_width_minus_1) of natural range 0 to cSearch_width_minus_1; signal sReal_disparity_conclusion_array : type_real_disparity_conclusion_array := (others => 0); begin process (clk_from_camera) begin if rising_edge(clk_from_camera) then sLowest_diff_so_far_array(sPipeline_stage) <= (others => '1'); if sPipeline_stage = cSearch_width_minus_1 then sPipeline_stage <= 0; else sPipeline_stage <= sPipeline_stage + 1; end if; sDiff_array_of_arrays(sPipeline_stage) <= iDiff_array; if sDiff_array_of_arrays(sPipeline_stage)(sPipeline_stage) < sLowest_diff_so_far_array(sPipeline_stage) then sLowest_diff_so_far_array(sPipeline_stage) <= sDiff_array_of_arrays(sPipeline_stage)(sPipeline_stage); sReal_disparity_conclusion_array(sPipeline_stage) <= sPipeline_stage; end if; oReal_disparity_conclusion <= sReal_disparity_conclusion_array(sPipeline_stage); end if; -- if rising_edge(clk_from_camera) then end process; end Behavioral;
02-06-2019 02:00 AM
You have two bits of code here that are not functionally the same. The first code looks like a direct port of C. This is likely to create an inefficent and slow circuit in an FPGA because it does the entire search in a single clock cycle.
The second one is a real attempt to create a circuit that would actually work in an FPGA. The extra clock cycles add latency, but probably increase the maximum clock speed to actually make it run rather fast.
There is no general way to convert any given C code style loop. It will really depend on the algorithm. And this is where engineering comes in. Can you create a solution that works efficently in hardware?
In your code, You are doing a 1 by 1 search. This will work but will be slow, and you can only input a new data-set every N clock cycles (where N in the size of the data set). If you used a merge sort, while the resource usage is larger, you can input a new data set on every clock. It still has an N clock latency, but given that you can input a new whole data set every clock, the throughput is Nx larger.
02-06-2019 03:57 PM - edited 02-06-2019 04:14 PM
The general problem here is that I'm trying to find the minimum sum of one's of an array of census vectors (iDiff_array). (A census vector can be thought of as just an array of bits). I don't actually care what the specific minimum value is though. I just care about the index which contains the census vector with the fewest number of 1's, (since that represents the disparity between the reference image and the search image).
Is there some fast and space-efficient way to determine the index of the particular census vector (in an array of census vectors) which contains the least amount of 1's, without actually using adders to count them? I'm aware of the bit masking method but that returns the acutal number of ones in the census vector, which I don't necessarily need, and therefore may be wasting resources to find.
There must be some algorithmic advantage to not actually needing the minimum value itself, but instead needing the index of the census vecor with the least number of ones.
The problem can be summarized to this:
Find the index of the vector containing the least number of ones, using only bitwise operations:
(The answer would be index 2)