esi2st

Visitor

01-23-2019 05:40 AM - edited 01-23-2019 05:46 AM

473 Views

Registered:
11-14-2018

Unrolling loop with a runtime variable latency

I am trying to do a design as follows:

    function_loop
        loop_1: UNROLLED [i=0:const_1]
            loop_2: UNROLLED [j=0:const_2]
                while (x[i][j] != 0) PIPELINED
                    do_stuff
                end while
            end loop_2
        end loop_1
    end function_loop

The idea is that the while loop takes 0 to 8 iterations to finish for each x value. I would like to create parallel modules that each take a variable number of clock cycles to finish (<= 8); as soon as they have all finished (in <= 8 cycles), a new iteration of function_loop starts.

Has anyone worked with a similar concept? What are the implications of implementing it as written above?
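A minimal C sketch of this first idea, for illustration only. `ROWS`, `COLS`, and the decrementing `do_stuff` are hypothetical stand-ins (for const_1, const_2, and the real work), and inputs are assumed to lie in 0..8 so that each while loop ends within 8 iterations. The pragmas are standard Vivado HLS directives; a plain C compiler ignores them. As far as I know, unrolling the outer loops gives one copy of the while loop per element, but that alone does not guarantee the tool will schedule those copies concurrently.

```c
#define ROWS 4        /* assumed stand-in for const_1 */
#define COLS 4        /* assumed stand-in for const_2 */

/* Hypothetical do_stuff: step the value one closer to zero. */
static int do_stuff(int v) { return v - 1; }

/* Runs one pass of function_loop and returns the largest number of
 * while-loop iterations any single element needed. */
int function_loop_v1(int x[ROWS][COLS]) {
    int worst = 0;
loop_1:
    for (int i = 0; i < ROWS; i++) {
#pragma HLS UNROLL
    loop_2:
        for (int j = 0; j < COLS; j++) {
#pragma HLS UNROLL
            int n = 0;
            while (x[i][j] != 0) {
#pragma HLS PIPELINE
#pragma HLS LOOP_TRIPCOUNT min=0 max=8
                x[i][j] = do_stuff(x[i][j]);
                n++;
            }
            if (n > worst) worst = n;
        }
    }
    return worst;
}
```

The return value is only there to make the "slowest element decides" behaviour visible in C simulation.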

UPDATE: I have another design idea that perhaps makes the hardware's decision making easier and less ambiguous:

    function_loop:
        fetch x matrix [const_1][const_2]
        while (x matrix != 0)
            loop_1: UNROLLED [i=0:const_1]
                loop_2: UNROLLED [j=0:const_2]
                    do_stuff
                end loop_2
            end loop_1
        end while
    end function_loop

I will compare the second idea and post my results.

Regards,

Sherif

4 Replies

evant_nq

Explorer

01-23-2019 07:06 AM

456 Views

Registered:
07-18-2018

Re: Unrolling loop with a runtime variable latency

What do you mean by trying to have a variable runtime?

If you unroll a loop by 8, the tool makes 8 copies of the loop body in hardware. So loop A (i = 0; i < 8; i++) becomes 8 versions of A:

A_0

A_1

A_2

A_3

A_4

A_5

A_6

A_7

If it takes 1 cycle to complete A, then the entire loop takes 1 cycle. If it takes 5 cycles, the entire loop takes 5 cycles. If you don't need all 8 copies, perhaps because the data you are feeding it doesn't divide nicely by 8, you should just ignore the results of the unused portion.
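As a rough illustration of the point above, here is a small sketch of a fully unrolled loop. The function name `add_one` and the body are made up for the example; the `UNROLL` and `ARRAY_PARTITION` pragmas are standard Vivado HLS directives and are ignored by an ordinary C compiler. The partitioning is there because, as far as I know, the 8 copies can only run in the same cycle if all 8 array elements can be accessed at once.

```c
#define N 8

/* Loop A unrolled into 8 copies: each copy handles one element. */
void add_one(const int in[N], int out[N]) {
#pragma HLS ARRAY_PARTITION variable=in complete dim=1
#pragma HLS ARRAY_PARTITION variable=out complete dim=1
A:
    for (int i = 0; i < N; i++) {
#pragma HLS UNROLL
        out[i] = in[i] + 1;   /* copy A_i of the loop body */
    }
}
```

The C behaviour is unchanged by the pragmas; only the generated hardware differs.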

If you have nested loops, you should use loop_flatten to turn them into a single loop, and then unroll it. In that case, if you have two nested loops of 8 and there is room in the device, unroll by 64 and do the entire loop in a single shot (assuming there is no data dependency between results).

But if we had a little more of an example of what you are trying to achieve, it would be easier to understand how to unroll it to get that to happen.

esi2st

Visitor

01-23-2019 07:59 AM

444 Views

Registered:
11-14-2018

Re: Unrolling loop with a runtime variable latency

oliviert

Xilinx Employee

01-23-2019 10:34 AM

428 Views

Registered:
01-09-2008

Re: Unrolling loop with a runtime variable latency

The second idea is better, because it makes your intent easier for VHLS to understand.

Be careful: if the product of the ranges of the 2 loops is large you may end up with a very large design... if it succeeds!

The code should be:

    loop_0: PIPELINED [k=0:7]
        loop_1: [i=0:const_1]   // Will be automatically UNROLLED
            loop_2: [j=0:const_2]   // Will be automatically UNROLLED
                if (x[i][j] != 0) do_stuff
            end loop_2
        end loop_1
    end loop_0
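The structure above could look roughly like this in C. This is a sketch under assumptions: `ROWS` and `COLS` stand in for const_1 and const_2, and `do_stuff` is a hypothetical placeholder that steps a value toward zero. `PIPELINE` and `ARRAY_PARTITION` are standard Vivado HLS pragmas; pipelining loop_0 is what causes the two inner loops to be unrolled.

```c
#define ROWS 4        /* assumed stand-in for const_1 */
#define COLS 4        /* assumed stand-in for const_2 */
#define MAX_ITERS 8   /* fixed trip count replacing the while loop */

/* Hypothetical do_stuff: step the value one closer to zero. */
static int do_stuff(int v) { return v - 1; }

void function_loop_fixed(int x[ROWS][COLS]) {
#pragma HLS ARRAY_PARTITION variable=x complete dim=0
loop_0:
    for (int k = 0; k < MAX_ITERS; k++) {
#pragma HLS PIPELINE
    loop_1:
        for (int i = 0; i < ROWS; i++) {
        loop_2:
            for (int j = 0; j < COLS; j++) {
                /* Elements that are already done simply idle. */
                if (x[i][j] != 0)
                    x[i][j] = do_stuff(x[i][j]);
            }
        }
    }
}
```

The trade-off is that every call takes all 8 iterations, even when every element finishes early, in exchange for a latency the tool can determine statically.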

==================================

Olivier Trémois

XILINX EMEA DSP Specialist

evant_nq

Explorer

01-23-2019 12:27 PM

411 Views

Registered:
07-18-2018

Re: Unrolling loop with a runtime variable latency

The tool can't conditionally unroll a loop. The latency of an unrolled loop is how long it takes to execute its contents. So if you have two loops:

    for (i = 0; i < 8; i++)
        for (j = 0; j < 8; j++)
            <Some Logical operation>

and you unroll them both, you will end up with 64 parallel versions of <Some Logical operation>. If that operation has a latency of 1, the entire latency will be 1. If that operation has a latency of 8, the entire operation will have a latency of 8.

If <Some Logical operation> takes a variable amount of time (likely because it also contains a while loop), you will need to give the tool a trip count for that loop to get some expectation of how long it will take. But generally, doing anything with while statements is not desirable.
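To make the trip-count remark concrete, here is a hedged sketch with two hypothetical helper functions (both names are made up for this example). `run_while` keeps the variable-length loop and annotates it with the Vivado HLS `LOOP_TRIPCOUNT` pragma, which as far as I know only affects the latency numbers in the report, not the generated hardware. `run_bounded` is one common way to avoid the while statement: a fixed 8-iteration loop with a guard. The two agree only when the while loop would have finished within 8 iterations, and `b` is assumed to be positive.

```c
#define MAX_ITERS 8

/* Variable-length version: the tool cannot know the latency, so the
 * trip count is supplied for reporting purposes. */
int run_while(int a, int b) {
    int tmp = 0;
    while (a > 0) {
#pragma HLS LOOP_TRIPCOUNT min=0 max=8
        tmp = a + b;
        a = a - b;
    }
    return tmp;
}

/* Fixed-bound version: always MAX_ITERS iterations, with the loop
 * condition turned into a guard on the body. */
int run_bounded(int a, int b) {
    int tmp = 0;
    for (int k = 0; k < MAX_ITERS; k++) {
        if (a > 0) {
            tmp = a + b;
            a = a - b;
        }
    }
    return tmp;
}
```

The bounded form gives the tool a known latency at the cost of always spending the worst-case number of iterations.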

But it helps to know what the goal of the problem being solved is, and to see an example of what you are getting from the tool versus what you expect the tool to be doing.

For example a really simple version might look like:

    #define SIZE 8

    void unroll(int A[SIZE][SIZE], int B[SIZE][SIZE], int C[SIZE][SIZE]) {
        int i, j;
        int matrix_A[SIZE][SIZE];
        int matrix_B[SIZE][SIZE];
        int tmp;

        /* Burst-read the data into local arrays to operate on */
        for (i = 0; i < SIZE; i++) {         /* rows */
            for (j = 0; j < SIZE; j++) {     /* cols */
                matrix_A[i][j] = A[i][j];
            }
        }
        for (i = 0; i < SIZE; i++) {         /* rows */
            for (j = 0; j < SIZE; j++) {     /* cols */
                matrix_B[i][j] = B[i][j];
            }
        }

        /* Do something */
        PIPELINE: for (i = 0; i < SIZE; i++) {
            for (j = 0; j < SIZE; j++) {
                tmp = 0;  /* defined result if the while loop runs zero times */
                /* Only terminates if matrix_B[i][j] > 0 */
                VARIABLE_LENGTH: while (matrix_A[i][j] > 0) {
                    tmp = matrix_A[i][j] + matrix_B[i][j];  /* value returned in C */
                    matrix_A[i][j] = matrix_A[i][j] - matrix_B[i][j];
                }
                C[i][j] = tmp;
            }
        }
    }

Once we tell the tool how long that loop might run, the reports show that the PIPELINE directive at the top unrolled the loops as much as possible. But we still have the possibility that each parallel operation will take up to 8 iterations at 3 cycles each.

But this is just one possible example; what the actual code is trying to do will change what we would want to do.