Visitor esi2st
Registered: ‎11-14-2018

Unrolling loop with a runtime variable latency

I am trying to do a design as follows:

function_loop

loop_1:UNROLLED [i=0:const_1]

      loop2:UNROLLED [j=0:const_2]

                while(x[i][j] !=0) PIPELINED

                         do_stuff

               end while

       end loop2

end loop1

end function_loop

The idea is that the while loop should take 0 to 8 iterations to finish for each x value. So I would like to create parallel modules that each take a variable number of clock cycles to finish (<=8), and as soon as they have all finished (in <=8 cycles), a new iteration of function_loop starts.

Has anyone worked with a similar concept? What are the implications of implementing it as written above?

UPDATE: I have another design idea that perhaps makes the hardware's decision making easier and less ambiguous:

function_loop:

       fetch x matrix [const_1][const_2]

       while (x matrix != 0)

             loop_1:UNROLLED [i=0:const_1]

                   loop2:UNROLLED [j=0:const_2]

                         do_stuff

                  end loop

             end loop

         end while

end function_loop
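
For readers who want to try the second idea in software first, here is a minimal C sketch of it. It is not the poster's actual code: ROWS, COLS, any_nonzero, run_until_zero and the body of do_stuff are all assumptions made for illustration, and do_stuff is assumed to move each element one step closer to zero so that the while loop terminates.

```c
#define ROWS 4   /* stands in for const_1 (assumed value) */
#define COLS 4   /* stands in for const_2 (assumed value) */

/* Hypothetical do_stuff: moves one element a step closer to zero. */
static int do_stuff(int v) { return v > 0 ? v - 1 : 0; }

/* Returns 1 while at least one element of x is still nonzero. */
static int any_nonzero(int x[ROWS][COLS]) {
    int i, j, flag = 0;
    for (i = 0; i < ROWS; i++) {
        for (j = 0; j < COLS; j++) {
            flag |= (x[i][j] != 0);
        }
    }
    return flag;
}

/* Second idea: one shared while loop around the two unrolled loops.
 * Returns the number of passes the while loop needed. */
int run_until_zero(int x[ROWS][COLS]) {
    int i, j, passes = 0;
    while (any_nonzero(x)) {
        for (i = 0; i < ROWS; i++) {        /* loop_1 */
#pragma HLS UNROLL
            for (j = 0; j < COLS; j++) {    /* loop2 */
#pragma HLS UNROLL
                x[i][j] = do_stuff(x[i][j]);
            }
        }
        passes++;
    }
    return passes;
}
```

With this do_stuff, the number of passes equals the largest value in the matrix, which matches the "as soon as they all finish" behaviour described above.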

I will compare the second idea and post my results.

Regards,

Sherif

4 Replies
Explorer
Registered: ‎07-18-2018

Re: Unrolling loop with a runtime variable latency

What do you mean by trying to have a variable runtime?

If you want to unroll a loop by 8, the tool makes 8 copies of the loop body in hardware. So loop A (i = 0; i < 8; i++) becomes 8 versions of A:

A_0
A_1
A_2
A_3
A_4
A_5
A_6
A_7

If it takes 1 cycle to complete A, then the entire loop takes 1 cycle. If it takes 5 cycles, the entire loop takes 5 cycles. If you don't need all 8 copies, for example because the data you are feeding it doesn't divide evenly by 8, you should just ignore the results of the unused copies.
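
As a rough illustration of "ignore the results of the unused copies", here is a small C sketch. The names LANES, A and process_chunk, and the doubling operation inside A, are made up for the example; only the UNROLL pragma is a real Vivado HLS directive.

```c
#define LANES 8  /* unroll factor from the example above */

/* Hypothetical loop body A: any single-lane operation; here, doubling. */
static int A(int v) { return 2 * v; }

/* Processes one chunk of up to LANES inputs, of which only the first
 * n are valid. All lanes compute; results of lanes >= n are discarded. */
void process_chunk(const int in[LANES], int out[LANES], int n) {
    int i;
    for (i = 0; i < LANES; i++) {
#pragma HLS UNROLL
        int r = A(in[i]);          /* every copy always computes */
        out[i] = (i < n) ? r : 0;  /* unused copies contribute nothing */
    }
}
```

The hardware cost is the same whether n is 1 or 8; the mask only decides which results are kept.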

If you have nested loops, you should use loop_flatten to turn them into a single loop, and then unroll it. In that case, if you have two nested loops of 8 and there is room on the device, unroll it by 64 and do the entire loop in a single shot (assuming there is no data dependency between results).
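
A sketch of what that looks like when the flattening is done by hand: two nested loops of 8 become one loop of 64 that is then fully unrolled. The function name add_all and the element-wise add are placeholders chosen for the example, not anything from the original design.

```c
#define N 8

/* Two nested loops of N written as a single flattened loop of N*N
 * iterations, then fully unrolled. The element-wise add is a
 * placeholder for an operation with no dependency between iterations. */
void add_all(int a[N][N], int b[N][N], int c[N][N]) {
    int k;
    for (k = 0; k < N * N; k++) {
#pragma HLS UNROLL
        int i = k / N;  /* row index recovered from the flat index */
        int j = k % N;  /* column index recovered from the flat index */
        c[i][j] = a[i][j] + b[i][j];
    }
}
```

Note that 64 parallel copies also need 64 parallel accesses to each array, so in practice the arrays would probably have to be partitioned as well; that part is left out here.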

But if we had a little more of an example of what you are trying to achieve, it would be easier to understand how to unroll it to get that to happen.

Visitor esi2st
Registered: ‎11-14-2018

Re: Unrolling loop with a runtime variable latency

I don't want to unroll a loop by 8. I have an outer unrolled loop, and inside each unrolled copy is a conditional loop that executes 0-8 times. So the body of the unrolled loop has a latency of 0-8. My design question is: what is the best way to do this, and what are the drawbacks? I gave two code examples of how I want to do it.
Xilinx Employee
Registered: ‎01-09-2008

Re: Unrolling loop with a runtime variable latency

The second idea is better because it makes your intent clearer to Vivado HLS (VHLS).

Be careful: if the product of the ranges of the 2 loops is large you may end up with a very large design... if it succeeds!

The code should be:

loop_0: PIPELINED [k=0:7] // fixed bound of 8 iterations; k is not used by the inner loops

             loop_1:[i=0:const_1] // Will be automatically UNROLLED

                   loop2:[j=0:const_2]// Will be automatically UNROLLED

                         if(x[i][j] != 0) do_stuff

                  end loop

             end loop

         end loop
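
In C, the structure above might look roughly like the following sketch. ROWS, COLS, MAX_ITER, run_fixed_bound and the body of do_stuff are assumptions for illustration; the point is only the fixed outer bound of 8 with the condition moved into an if.

```c
#define ROWS 4      /* stands in for const_1 (assumed value) */
#define COLS 4      /* stands in for const_2 (assumed value) */
#define MAX_ITER 8  /* worst-case number of iterations per element */

/* Hypothetical do_stuff: moves one element a step closer to zero. */
static int do_stuff(int v) { return v - 1; }

/* Fixed-bound version: the outer loop always runs MAX_ITER times and
 * each element is guarded by its own condition. */
void run_fixed_bound(int x[ROWS][COLS]) {
    int k, i, j;
    for (k = 0; k < MAX_ITER; k++) {          /* loop_0 */
#pragma HLS PIPELINE
        for (i = 0; i < ROWS; i++) {          /* loop_1 */
            for (j = 0; j < COLS; j++) {      /* loop2 */
                if (x[i][j] != 0) {
                    x[i][j] = do_stuff(x[i][j]);
                }
            }
        }
    }
}
```

The trade-off is that this always spends 8 iterations, even when every element finishes earlier, and an element that would need more than 8 steps is simply left unfinished.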

==================================
Olivier Trémois
XILINX EMEA DSP Specialist
Explorer
Registered: ‎07-18-2018

Re: Unrolling loop with a runtime variable latency

The tool can't conditionally unroll a loop. The latency of an unrolled loop is how long it takes to execute its contents. So if you have two loops:

 

 for(i = 0; i< 8; i++)
       for(j = 0; j< 8; j++)
           <Some Logical operation>

           
and you unroll them both, you will end up with 64 parallel copies of <Some Logical operation>. If that operation has a latency of 1, the entire latency will be 1. If that operation has a latency of 8, the entire operation will have a latency of 8.

If <Some Logical operation> takes a variable amount of time (likely because it also contains a while loop), you will need to give the tool a trip count (the LOOP_TRIPCOUNT directive) so that the report can estimate how long it will take. In general, though, while statements are best avoided.
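
A minimal sketch of a variable-length while loop annotated with a trip count is shown here. The function drain and its arguments are invented for the example; the pragma is the standard Vivado HLS LOOP_TRIPCOUNT directive, which only affects the latency numbers in the report and does not limit how many iterations the hardware actually runs.

```c
/* Repeatedly subtracts step from a until a is no longer positive and
 * returns how many iterations that took. The early return for a
 * non-positive step is an added safeguard against an endless loop. */
int drain(int a, int step) {
    int trips = 0;
    if (step <= 0) {
        return 0;
    }
    while (a > 0) {
#pragma HLS LOOP_TRIPCOUNT min=0 max=8
        a -= step;
        trips++;
    }
    return trips;
}
```

If a can start above 8 * step, the loop will run longer than the report suggests, so the max value has to come from what you know about your data.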

But it would help to know the goal of the problem you are trying to solve, along with an example of what you are getting from the tool versus what you expect the tool to do.

 

For example a really simple version might look like:

#define SIZE 8

void unroll (int A[SIZE][SIZE], int B[SIZE][SIZE], int C[SIZE][SIZE]) {
	int i, j;
	int matrix_A[SIZE][SIZE];
	int matrix_B[SIZE][SIZE];
	int tmp;

	/* Burst-read the inputs into local arrays to operate on */
	for(i = 0; i < SIZE; i++) { //ROWS
		for(j = 0; j < SIZE; j++) { //COLS
			matrix_A[i][j] = A[i][j];
		}
	}
	for(i = 0; i < SIZE; i++) { //ROWS
		for(j = 0; j < SIZE; j++) { //COLS
			matrix_B[i][j] = B[i][j];
		}
	}

	//DO SOMETHING
	PIPELINE:for(i = 0; i < SIZE; i++)
	{
		for(j = 0; j < SIZE; j++)
		{
			tmp = 0; //defined result if the while loop runs zero times
			//Variable number of iterations; only terminates if matrix_B[i][j] > 0
			VARIABLE_LENGTH:while(matrix_A[i][j] > 0 ) {
				tmp = matrix_A[i][j] + matrix_B[i][j]; //value returned in C
				matrix_A[i][j] = matrix_A[i][j] - matrix_B[i][j];
			}
			C[i][j] = tmp;
		}
	}
}

After we tell the tool how long that loop might run, the report shows that the PIPELINE directive on the top loop unrolled the inner loops as far as possible. But each parallel operation can still take up to 8 iterations at 3 cycles each:

[Image: LATENCY.PNG]

But this is just one possible example; what the actual code is trying to do will change what we would want to do.

 
