Visitor
Registered: ‎11-18-2014

DATAFLOW

Hi

 

I wish to use the HLS DATAFLOW pragma to accelerate this function. I have three stages:

  - read: fetch data from RAM into local_buf

  - execute: computation on local_buf

  - write: put local_buf back into RAM

 

I execute these three stages multiple times, and I want to overlap the communication (read and write) with the computation (execute).

To this end, I wrote the following code:

 

void dummy_bad(DATATYPE* physmem) {
#pragma HLS INTERFACE m_axi port=physmem offset=off
#pragma HLS INTERFACE ap_ctrl_hs port=return

#pragma HLS DATAFLOW

	DATATYPE local_buf[1024];

	main_loop: for(int k=0; k<512; k++) {
		//
		//	 Read stage
		//
		read : {
			memcpy(local_buf,
			       physmem+1024*k,
			       1024*sizeof(DATATYPE));
		}

		//
		//	 Execute stage
		//
		execute : {
			for(int i=0; i<1024; i++) {
				#pragma HLS PIPELINE II=1
				local_buf[i] *= 4.2;
			}
		}

		//
		//	 Write stage
		//
		write : {
			memcpy(physmem+1024*k,
			       local_buf,
			       1024*sizeof(DATATYPE));
		}

	}
	return;
}

If I understand the documentation in UG902 correctly, I expect Vivado HLS to duplicate the local_buf array in ping-pong style so that it can execute the three stages in parallel.

 

However, Vivado HLS reports a latency for main_loop that shows it executes the three stages sequentially.

I have attached a drawing showing first what Vivado HLS does and then what I expected. The blue and red colors indicate the ping and pong copies of local_buf used by each stage.
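
For readers unfamiliar with the term, the ping-pong scheme described above can be modeled in plain software C. This is only a sketch of the buffer alternation, not HLS code and not what the tool generates: the function name `pingpong_scale`, the small `BLOCK` size, and the `int` element type are assumptions made for illustration. In plain C the stages still run one after another; the model only shows which of the two buffers each stage touches at each step.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK 4   /* hypothetical small block size; the post uses 1024 */

/* Software-only model of ping-pong buffering: while block k is computed
 * from one buffer, block k+1 is loaded into the other buffer.
 * Hardware could overlap these steps; plain C runs them in sequence. */
void pingpong_scale(const int *src, int *dst, size_t n_blocks, int factor)
{
    int buf[2][BLOCK];

    assert(src != NULL && dst != NULL);
    if (n_blocks == 0)
        return;

    /* Prologue: fill the "ping" buffer with block 0. */
    for (size_t i = 0; i < BLOCK; i++)
        buf[0][i] = src[i];

    for (size_t k = 0; k < n_blocks; k++) {
        size_t cur = k % 2;    /* buffer the execute stage works on */
        size_t nxt = 1 - cur;  /* buffer the read stage refills     */

        /* Read stage for block k+1 (skipped after the last block). */
        if (k + 1 < n_blocks)
            for (size_t i = 0; i < BLOCK; i++)
                buf[nxt][i] = src[(k + 1) * BLOCK + i];

        /* Execute and write stages for block k. */
        for (size_t i = 0; i < BLOCK; i++)
            dst[k * BLOCK + i] = factor * buf[cur][i];
    }
}
```

Because the read of block k+1 never touches the buffer that block k is computed from, the two stages have no data conflict, which is the property that would let hardware run them at the same time.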

 

Thanks in advance for your answer.

 

Nicolas E.

dataflow.png
3 Replies
Visitor
Registered: ‎11-18-2014

Re: DATAFLOW

I also tried duplicating the memory port and local_buf manually, without any more success.

 

void dummy(DATATYPE* physmem_in, DATATYPE* physmem_out) {
#pragma HLS INTERFACE m_axi port=physmem_in offset=off
#pragma HLS INTERFACE m_axi port=physmem_out offset=off
#pragma HLS INTERFACE ap_ctrl_hs port=return

#pragma HLS DATAFLOW

	DATATYPE local_buf_in[1024], local_buf_out[1024];

	for(int k=0; k<512; k++) {
		//
		//	 Read stage
		//
		read : {
			memcpy(local_buf_in,
			       physmem_in+1024*k,
			       1024*sizeof(DATATYPE));
		}

		//
		//	 Execute stage
		//
		execute : {
			for(int i=0; i<1024; i++) {
				#pragma HLS PIPELINE II=1
				local_buf_out[i] = 4.2*local_buf_in[i];
			}
		}

		//
		//	 Write stage
		//
		write : {
			memcpy(physmem_out+1024*k,
			       local_buf_out,
			       1024*sizeof(DATATYPE));
		}

	}
	return;
}
Xilinx Employee
Registered: ‎08-17-2011

Re: DATAFLOW


@nestibal wrote:

I also tried duplicating the memory port and local_buf manually, without any more success.

 


Hi there

 

That's a good starting point, as the data needs to flow between blocks: you can't reuse the same buffer all over the place, or use the same top-level interface to both read and write. The Vivado HLS tool should give you warnings about that!

Note that outside the IP you can connect the two AXI master interfaces to the same AXI interconnect.

 

Actually, the intention of your design is to have the dataflow inside the k-loop, not outside of it.

If it's outside, then there is only one process (the large for loop), and as such there is nothing to apply dataflow to.

If you try to put it inside, it should not work today: you can't have it inside the k-loop. This kind of loop will need to happen outside the IP, or you need to change your code to this:

 

 

#include <string.h>

typedef unsigned int DATATYPE;

void top(DATATYPE* physmem_in, DATATYPE* physmem_out) {
#pragma HLS INTERFACE m_axi port=physmem_in
#pragma HLS INTERFACE m_axi port=physmem_out
#pragma HLS INTERFACE ap_ctrl_hs port=return

#pragma HLS DATAFLOW
    DATATYPE local_buf_in[512*1024], local_buf_out[512*1024];

#pragma HLS stream variable=local_buf_in depth=2048
#pragma HLS stream variable=local_buf_out depth=2048
#warning "*** really need to do a cosim to confirm the depth used is correct! ***"

    //	 Read stage
read :
    for(int k=0; k<512; k++) {
        memcpy(&local_buf_in[k*1024],
                physmem_in+1024*k,
                1024*sizeof(DATATYPE));
    }

    //	 Execute stage
execute :
    for(int k=0; k<512; k++) {
        for(int i=0; i<1024; i++) {
#pragma HLS PIPELINE II=1
            local_buf_out[1024*k+i] = 4.2*local_buf_in[1024*k+i];
        }
    }

    //	 Write stage
write :
    for(int k=0; k<512; k++) {
        memcpy(physmem_out+1024*k,
                &local_buf_out[1024*k],
                1024*sizeof(DATATYPE));
    }
    return;
}

 

The latency is reported as 528,897 cycles, which seems correct given that you need at least 1024*512 == 524,288 cycles.
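
The lower bound quoted here can be checked with a couple of lines of C. This is a minimal sketch, assuming one element is accepted per cycle as requested by `PIPELINE II=1`; the helper names `min_cycles` and `overhead_cycles` are hypothetical and not part of any Xilinx API.

```c
#include <assert.h>

/* Minimum cycles for a loop nest pipelined at initiation interval `ii`:
 * one new element is accepted every `ii` cycles. */
long min_cycles(long blocks, long block_size, long ii)
{
    assert(blocks >= 0 && block_size >= 0 && ii >= 1);
    return blocks * block_size * ii;
}

/* Cycles in a reported latency beyond that lower bound. */
long overhead_cycles(long reported, long blocks, long block_size, long ii)
{
    return reported - min_cycles(blocks, block_size, ii);
}
```

With the numbers from this post, `min_cycles(512, 1024, 1)` gives 524,288 and `overhead_cycles(528897, 512, 1024, 1)` gives 4,609, so the reported latency is less than 1% above the bound.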

You should check the tutorial and the user guide to further understand what happens here.

 

- Hervé

Visitor
Registered: ‎10-15-2015

Re: DATAFLOW

@herver Hi, I have the same problem. I used your code, which works, but the latency is higher.

If I synthesize exactly your code, the latency is nearly doubled:

Latency=1051159, Interval=528898

(Part=xc7z020clg484-1, Period=10)
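
The two figures above measure different things, and a small C sketch may help in reading them. It assumes the usual meaning of the report fields: latency is the number of cycles from start to finish of one invocation, and interval is the number of cycles before the next invocation can start. The helper name `total_cycles` is hypothetical. The sketch does not explain why the latency is higher in this run.

```c
#include <assert.h>

/* Total cycles to complete n back-to-back invocations of a block,
 * assuming a new invocation can start every `interval` cycles and
 * each invocation takes `latency` cycles from start to finish. */
long total_cycles(long latency, long interval, long n)
{
    assert(latency >= 0 && interval >= 0 && n >= 1);
    return latency + (n - 1) * interval;
}
```

Under that assumption, a single call costs `total_cycles(1051159, 528898, 1)` = 1,051,159 cycles, while two back-to-back calls cost `total_cycles(1051159, 528898, 2)` = 1,580,057 cycles. The reported interval of 528,898 is one cycle more than the latency of 528,897 quoted earlier in the thread.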

Thanks
