reaiken (Explorer)

How to properly dataflow functions in HLS?


I am trying to make an IP block that takes a 480x18 streaming input and sends it to an AXI master output in bursts of 18, using Vivado HLS 2018.3.

I cannot get the functions to run in parallel properly. I would expect the block to read the first 18 streaming words and immediately start sending them out on the AXI master, but instead it appears to have a latency of three entire 18x480 frames before anything comes out.

I have tried to duplicate the canonical form 1 shown on page 148 of UG902 v2019.1, but I am not sure I have coded it correctly for parallel operation.

My design is split into three functions: a read function to read the AXI-Stream input, a transfer function to move the 18-word data blocks from the input function to the output function (based on guidance from UG902), and an output function to write the data out on the AXI master.

All these functions seem to work in cosimulation, and the correct AXI master data is coming out, but there is far more latency than I would expect. I tried adding a #pragma HLS PIPELINE statement in each of the inner for loops, but it didn't change anything.

 

This is my code.  WIDTH and HEIGHT are defined as 18 and 480 in the header file.

 

#include "rdma.h"

void rdma (AXI_STREAM &s_axis, volatile int *m_axi)
{
#pragma HLS INTERFACE m_axi depth=491520 port=m_axi offset=off max_write_burst_length=32 name=m_axi
#pragma HLS INTERFACE axis register both port=s_axis
#pragma HLS STREAM variable=s_axis depth=8640 dim=1

#pragma HLS DATAFLOW
	int dataIn[WIDTH];
	int dataOut[WIDTH];

	getData(s_axis, dataIn);
	transferData(dataIn, dataOut);
	sendData (m_axi, dataOut);
}

void getData (AXI_STREAM &s_axis, int *dataIn)
{
	for(int row=0;row<HEIGHT;row++)
	{
		for(int col=0;col<WIDTH;col++)
		{
			ap_axiu<24,1,1,1> valIn = s_axis.read();
			dataIn[col] = (int)valIn.data;
		}
	}
}

void transferData(int *dataIn, int *dataOut)
{
	for(int row=0;row<HEIGHT;row++)
	{
		for(int col=0;col<WIDTH;col++)
		{
			dataOut[col] = dataIn[col];
		}
	}
}

void sendData (volatile int *m_axi, int *dataOut)
{
	int dataBurst[WIDTH];

	for(int row=0;row<HEIGHT;row++)
	{
		for(int col=0;col<WIDTH;col++)
		{
			dataBurst[col] = dataOut[col];
		}
		memcpy((int *)(m_axi+row*STRIDE), dataBurst, WIDTH*sizeof(int));
	}
}

 

 

This is the cosimulation output waveform: [cosim.jpg]

And this is the schedule: [schedule.jpg]

Synthesis warnings: [warnings.jpg]

 

Any help would be greatly appreciated!

Accepted Solution

u4223374 (Advisor)

Does it pass C simulation correctly? That's the first step. I'm going to go with "no", because your buffers need to be 480*18 rather than just 480 elements long.

I think you need to do three things:

 

(1) Resize your buffers to 480*18 and change the loops to move 480*18 pixels into or out of the buffers.

(2) Use the STREAM pragma to make both buffers streams rather than ping-pong buffers.

(3) Run C simulation and make sure it's right. A design that passes C sim won't necessarily pass cosim or work in hardware (e.g. if you've specified an incorrect pragma) - but a design that doesn't pass C sim is virtually guaranteed not to work in cosim or in hardware.

 

I think that'll probably make it work.
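For reference, here is a rough sketch of what (1) and (2) might look like on your top level (the depth=32 FIFO setting is just an example value, and the loop bodies of getData/transferData/sendData would also need to index the full HEIGHT*WIDTH range):

#include "rdma.h"

void rdma (AXI_STREAM &s_axis, volatile int *m_axi)
{
#pragma HLS INTERFACE m_axi depth=491520 port=m_axi offset=off max_write_burst_length=32 name=m_axi
#pragma HLS INTERFACE axis register both port=s_axis

#pragma HLS DATAFLOW
	// Full-frame buffers, but the STREAM pragma turns them into FIFOs
	// instead of 480*18-word ping-pong BRAMs.
	int dataIn[HEIGHT*WIDTH];
	int dataOut[HEIGHT*WIDTH];
#pragma HLS STREAM variable=dataIn depth=32 dim=1
#pragma HLS STREAM variable=dataOut depth=32 dim=1

	getData(s_axis, dataIn);        // reads all HEIGHT*WIDTH words from the stream
	transferData(dataIn, dataOut);  // copies them element by element
	sendData(m_axi, dataOut);       // writes one WIDTH-word burst per row
}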

 


11 Replies

reaiken (Explorer)

@u4223374 

First of all, thank you for responding to my message! People like you, and the Xilinx employees on these forums, who take the time to answer questions for those of us out here in the trenches are what make this forum a great resource.

As for the buffers, I have them set WIDTH deep, which is actually 18, not 480. I did that because I wanted it to synthesize a ping-pong buffer only 18 words deep, so it would read 18 words in the inner for loop, transfer them to an 18-word buffer to be processed by the output side's inner for loop, and repeat until all 480 lines are transferred. I was trying to minimize block RAM usage, so I didn't want the buffers to be 480x18.

Perhaps this is not the way things work in HLS?  

Anyway, I'll try your suggestions and see what happens.   Thanks!

 

reaiken (Explorer)

@u4223374 

Well, I do believe it is working! I'm getting results that look much more like what I expected. The data is now coming out after three blocks of 18 are written in.

Also, the implementation only takes 2 BRAMs! I had thought it would have to buffer up the entire frame if the arrays were made 18x480, but apparently it doesn't when the STREAM pragma is added (I checked, and it takes 50 BRAMs without the STREAM pragma).

I think I am lacking a fundamental understanding of how HLS C synthesis handles data transfer between functions. My idea of synthesizing a ping-pong buffer of size 18 was obviously flawed. I'll have to read up more on the STREAM pragma to see why it reduced the BRAM usage from 50 down to 2.

Below is the new cosim output waveform. As you can see, the AXI master is sending bursts of 18 (the AXI-Stream slave sends repeated 0-17 counts).

I do see a delay of 52 clocks between address changes (I have a memory stride of 1024, so the address changes after every burst of 18), and it is exerting back-pressure on the incoming AXI-Stream.

Anyway, thank you for your help, I'll mark this one closed and go study the user's guides some more.

 

 

[cosim2.jpg]

 

u4223374 (Advisor)

@reaiken I think I can explain.

 

The first thing to understand is that, as above, HLS is aiming to match the behaviour of a single-threaded C program. You can do this with 18-element buffers, but not how you were doing it. Instead the process has to be something like this:

#include "rdma.h"

void rdma (AXI_STREAM &s_axis, volatile int *m_axi)
{
#pragma HLS INTERFACE m_axi depth=491520 port=m_axi offset=off max_write_burst_length=32 name=m_axi
#pragma HLS INTERFACE axis register both port=s_axis
#pragma HLS STREAM variable=s_axis depth=8640 dim=1



	for (int row = 0; row < HEIGHT; row++) {
		#pragma HLS DATAFLOW
		int dataIn[WIDTH];
		int dataOut[WIDTH];
		getData(s_axis, dataIn);
		transferData(dataIn, dataOut);
		sendData (m_axi, dataOut);
	}
}

void getData (AXI_STREAM &s_axis, int *dataIn)
{
	for(int col=0;col<WIDTH;col++)
	{
		ap_axiu<24,1,1,1> valIn = s_axis.read();
		dataIn[col] = (int)valIn.data;
	}
}

void transferData(int *dataIn, int *dataOut)
{
	for(int col=0;col<WIDTH;col++)
	{
		dataOut[col] = dataIn[col];
	}
}

void sendData (volatile int *m_axi, int *dataOut)
{
	int dataBurst[WIDTH];

	for(int col=0;col<WIDTH;col++)
	{
		dataBurst[col] = dataOut[col];
	}
	memcpy((int *)(m_axi+row*STRIDE), dataBurst, WIDTH*sizeof(int));
}

I think that'll simulate properly (and also synthesize properly). However, for synthesis it's not ideal, because every time HLS has to start one of those functions there's an overhead of a couple of clock cycles - and when the whole thing is only 18 cycles long, a couple of cycles matters.

 

The better approach is to use "big" buffers, but then tell HLS that you actually want it to just stream data between the functions (that's what the STREAM pragma is for). It can do this with a FIFO, and even a half-BRAM (HLS calls it one BRAM, but really it's half of one of the FPGA's BRAM blocks) gives you plenty of FIFO length for this.

 

 

With that said - I'm not very fond of dataflow, and I reckon you could cut the time down significantly further by removing dataflow. I can try to do a demo if you want.

 

reaiken
Explorer
Explorer
2,450 Views
Registered: ‎07-18-2011

 

@u4223374 

That code also worked, but I had to either add a static row variable to the sendData function (incremented after the memcpy) or put the outer row loop back into that function to get the STRIDE correct.

Also, the time between bursts increased to 72 clocks, but adding an HLS STREAM  pragma to the dataIn and dataOut buffers brought it back down to 52 clocks between bursts.    Removing the DATAFLOW pragma increased the delay between bursts to 128 clocks, so I left it in.
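Pieced together, the modified version looks roughly like this (a sketch rather than the exact code; the FIFO depth of 32 and the wrap-around on the static counter are my own assumptions):

	for (int row = 0; row < HEIGHT; row++) {
		#pragma HLS DATAFLOW
		int dataIn[WIDTH];
		int dataOut[WIDTH];
		#pragma HLS STREAM variable=dataIn depth=32 dim=1
		#pragma HLS STREAM variable=dataOut depth=32 dim=1
		getData(s_axis, dataIn);
		transferData(dataIn, dataOut);
		sendData(m_axi, dataOut);
	}

void sendData (volatile int *m_axi, int *dataOut)
{
	// A static row counter replaces the row argument, so the burst address
	// still advances by STRIDE on every call.
	static int row = 0;
	int dataBurst[WIDTH];

	for(int col=0;col<WIDTH;col++)
	{
		dataBurst[col] = dataOut[col];
	}
	memcpy((int *)(m_axi+row*STRIDE), dataBurst, WIDTH*sizeof(int));
	row = (row == HEIGHT-1) ? 0 : row+1;   // increment after the memcpy, wrap at the end of a frame
}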

It looks better behaved at the start now, and the initial delay to the first output burst is much shorter.

 

 

[cosim3.jpg]

 

u4223374 (Advisor)

@reaiken Would you mind feeding this code to your simulation and seeing what happens?

 

#include "rdma.h"

void rdma (AXI_STREAM &s_axis, volatile int *m_axi)
{
#pragma HLS INTERFACE m_axi depth=491520 port=m_axi offset=off max_write_burst_length=32 name=m_axi
#pragma HLS INTERFACE axis register both port=s_axis
#pragma HLS STREAM variable=s_axis depth=8640 dim=1

	
	int dataA[WIDTH];
	int dataB[WIDTH];

	for (int row = 0; row <= HEIGHT; row++) {
		bool readEn = (row < HEIGHT);
		bool writeEn = (row > 0);
		
		if (row & 0x01) {
			getData(s_axis, dataA,readEn);
			sendData (m_axi, dataB,row,writeEn);
		} else {
			getData(s_axis, dataB,readEn);
			sendData (m_axi, dataA,row,writeEn);
		}
	}
}

void getData (AXI_STREAM &s_axis, int *dataIn, bool enable)
{
	if (!enable) {
		return;
	}
	for(int col=0;col<WIDTH;col++)
	{
		ap_axiu<24,1,1,1> valIn = s_axis.read();
		dataIn[col] = (int)valIn.data;
	}
}


void sendData (volatile int *m_axi, int *dataOut, int row, bool enable)
{
	if (!enable) {
		return;
	}
	memcpy((int *)(m_axi+row*STRIDE), dataOut, WIDTH*sizeof(int));
}

This is how I'd normally implement it (no dataflow).

reaiken (Explorer)

 

@u4223374 

That works, except I had to change two things to get the output data correct on the first and last bursts, because the first buffer has to be filled before the data can be read out, and one last burst has to be sent after the loop is complete, since there is a lag of one burst cycle in the processing.

I changed the memcpy statement in the sendData function to this:

     memcpy((int *)(m_axi+(row-1)*STRIDE), dataOut, WIDTH*sizeof(int));

and added this statement after the for loop in the top function to send the last data burst:

     sendData(m_axi, dataA, HEIGHT, 1); 

Now we are down to 35 clock cycles between bursts instead of 52, a definite improvement over the DATAFLOW method.  

 

[cosim4.jpg]

 

 

u4223374 (Advisor)

@reaiken I thought I had already dealt with the 1-cycle lag; you'll notice that my loop does one more iteration than your original one (I use <= instead of <) with only the read function enabled for the first iteration and only the write function enabled for the last iteration.

 

Good catch on the row used for writing, I definitely should have seen that.
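Folding that addressing fix into the earlier loop, the final version would look something like this (a sketch; passing row-1 at the call site is equivalent to changing the memcpy inside sendData, and the write enable keeps the out-of-range index on the first pass from ever being used):

	for (int row = 0; row <= HEIGHT; row++) {
		bool readEn = (row < HEIGHT);    // no read on the final, drain-only pass
		bool writeEn = (row > 0);        // no write on the first, fill-only pass

		if (row & 0x01) {
			getData(s_axis, dataA, readEn);
			sendData(m_axi, dataB, row-1, writeEn);   // dataB was filled during the previous iteration
		} else {
			getData(s_axis, dataB, readEn);
			sendData(m_axi, dataA, row-1, writeEn);
		}
	}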

reaiken (Explorer)

 

@u4223374 

Yes, you are correct, I missed the "<=" in your code.  That is more elegant than my solution.

Again, thank you for all your help; I have learned a lot through this exchange. Being new to HLS, and coming from a Verilog background, I can see it requires a different mindset.

I really like the way HLS handles AXI streams, masters, and AXI-Lite controls. That alone will save an enormous amount of time when developing new hardware IP blocks. Also, the test bench simulation in C is much faster than coding up Verilog and writing a test bench to simulate it, and the ability to simulate using actual video images is great.

I am, however, a bit disappointed that HLS DATAFLOW didn't synthesize a ping-pong buffer with less delay between output bursts. It seems to me you could read buffer B while writing buffer A, and vice versa, without the serial function dependency, and end up with continuous reads and writes after filling the initial 18-word buffer, with no extra delay between bursts and no back-pressure on the AXI-Stream. I thought that was what the DATAFLOW pragma was all about: creating parallel processing from the serial C functions. Or am I just missing something here?

 

 

u4223374 (Advisor)

@reaiken 

It's definitely a different mindset - although it's crucial to remember that while HLS is different from HDL, it's also very different from normal C/C++. If you start treating it like a regular C compiler then your projects invariably either fail or become very inefficient (in resources and performance) very quickly.

 

It's a massive time-saver for complex algorithms, especially when you start getting into things like floating-point math. The ability to very quickly adjust the throughput/resources mix is extremely handy; I've had a couple of designs where I can make either a fast block or a resource-efficient block just by moving a pipeline pragma up or down one loop level.
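As an illustration of that tradeoff (hypothetical arrays, not code from this thread): moving the PIPELINE pragma from the inner loop to the outer loop makes HLS unroll the inner loop completely, trading multipliers for throughput.

	int pixel[HEIGHT][WIDTH], scaled[HEIGHT][WIDTH];
	int gain = 3;

	// Resource-efficient: only the inner loop is pipelined, so one multiplier
	// handles one pixel per cycle and a row takes roughly WIDTH cycles.
	for (int row = 0; row < HEIGHT; row++) {
		for (int col = 0; col < WIDTH; col++) {
			#pragma HLS PIPELINE II=1
			scaled[row][col] = gain * pixel[row][col];
		}
	}

	// Fast: pipelining the outer loop fully unrolls the inner loop, so all
	// WIDTH multiplies happen in parallel and a new row can start every cycle
	// (provided the arrays are partitioned to give enough memory ports),
	// at the cost of WIDTH multipliers.
	for (int row = 0; row < HEIGHT; row++) {
		#pragma HLS PIPELINE II=1
		for (int col = 0; col < WIDTH; col++) {
			scaled[row][col] = gain * pixel[row][col];
		}
	}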

 

However, as you've found, you do sometimes get into a situation where you can see exactly what should happen, but HLS can't figure out what you mean. That can get very frustrating.

 

In your case, I think HLS is probably spending half its time setting up and closing down AXI Master transfers and dataflow pipelines. I've got a feeling that older versions (e.g. 2015.4) actually did this in fewer cycles, although they were less flexible and less reliable. To be honest, all the HLS projects I've done have ended up discarding dataflow and implementing the layout I posted above.

 

 

khalid17160 (Visitor)

@u4223374 Can you share an example of a good design that shows how to use HLS efficiently? It always happens that while writing I end up writing it like normal C/C++ code.
It would be a great help.
