Participant
715 Views
Registered: ‎05-28-2020

Simultaneous DDR memory reading problem

Hello everyone,

We are implementing a design on the PYNQ-Z1 that reads two separate frames simultaneously from two different DDR memory locations using two Video Frame Buffer Read (v2.1) IPs. The two reads must be synchronized, but we cannot get them to read at the same time. When both IPs are started together, neither of them works; when they are started individually, one after another, they do work. We therefore concluded that multiple frame read/write IPs cannot access DDR memory simultaneously.

So we followed a Xilinx forum thread, linked below. As that thread suggests, we monitored and controlled the states of the tuser, tready, tlast and tvalid signals using an AXI GPIO IP and also added an AXI4-Stream Data FIFO IP for buffering, but to no avail.

Did we miss something? Or is there another way to achieve this?

 

The forum link:

https://forums.xilinx.com/t5/Video-and-Audio/How-to-synchronize-multiple-VDMA-MM2S-cores/m-p/835732

 

Thank You,

14 Replies
Explorer
683 Views
Registered: ‎07-18-2011

 

@Nikhil_Thapa 

What do you mean when you say multiple IPs are started simultaneously and don't work? 

None of the frame buffer read or write IPs truly start at exactly the same time; they are all initialized in sequence and then started in sequence by the controlling program. The interrupts occur when they have finished processing a frame of video, and the SW decides how to configure the next frame read or write.

DDR writes and reads also never occur simultaneously; they are processed in an orderly fashion by the memory controller (the MIG, or the PS DDR controller on a Zynq device such as the PYNQ-Z1). It is up to the HW to buffer enough data using FIFOs to be able to process it as needed.

Processing multiple frames simultaneously is done all the time in IP such as the Video Mixer, for instance. What function are you trying to accomplish?

 

Participant
631 Views
Registered: ‎05-28-2020

 

@reaiken 

First of all, we appreciate your reply.

We agree with you. We had already run the Frame Buffer IPs in sequence, and they worked. They also generated interrupts in order after processing each frame. There was no problem up to that point.
However, we are doing a pixel operation. We have stored two image frames at separate locations in DDR memory, and we use a Vivado HLS IP for the pixel operation, so the two frame reads need to be synchronized with each other. The block diagram is attached below as Fig(1).

Later, we added FIFOs to synchronize them; the block diagram for this is attached as Fig(2). But we still could not achieve synchronization, so we are asking for your help.
Are we missing something? Or is our approach incorrect?

Thank you,

 

Fig(1). Before adding FIFOs

image.png

 

Fig(2). After adding FIFOs

image.png

Explorer
611 Views
Registered: ‎07-18-2011

@Nikhil_Thapa 

The problem is likely that your custom HLS IP has stalled the AXI stream reads because it is waiting on one or the other of the streams.

I would recommend dropping the frame buffer read and FIFO IP blocks altogether.   They are not needed in your situation, and are just wasting resources.

Since you are already doing your own custom HLS IP for processing, you can just read the DDR directly for the two frames you want to read synchronously, process them, and then stream them out.   You can use a 1-line buffer so you will have fast burst read accesses.  If you DATAFLOW the read and write functions, you won't have to wait for 2 lines to fill before you start processing.

Something like this:

    void functionName(AXI_STREAM &m_axis, volatile int *m_axi_1, volatile int *m_axi_2);

and define each of the two line buffers like this:

    int dataIn1[WIDTH];
    #pragma HLS STREAM variable=dataIn1 dim=1

Set up an outer row loop to process each line...

and read both memories using memcpy:

    memcpy((void *)dataIn1, (const int *)(m_axi_1+(row*stride)), width*sizeof(int));  // read one complete line from memory 1
    memcpy((void *)dataIn2, (const int *)(m_axi_2+(row*stride)), width*sizeof(int));  // read the same line from memory 2

Then process your two data buffers inline in your column loop where you do your video processing and stream them out on m_axis.

Since you are using frame buffer write IP to put the images in memory, you will know the frames you are currently writing, so you can control the frame synchronization in your output side custom IP by writing the proper starting read addresses to m_axi_1 and m_axi_2 during their interrupt.
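
A plain C++ software model of the approach above may make the addressing clearer. It is only a sketch: `processTwoFrames`, `combinePixels`, and the addition used as the pixel operation are hypothetical placeholders rather than code from this thread, and the HLS interface pragmas are left out so that it compiles with an ordinary compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical per-pixel operation; the thread does not say what the real one is.
static int combinePixels(int a, int b) { return a + b; }

// Software model: copy one line from each frame into its own line buffer,
// then walk both buffers column by column so the frames stay in lock step.
// stride is the distance in words between the starts of consecutive lines.
std::vector<int> processTwoFrames(const int *frame1, const int *frame2,
                                  int width, int height, int stride)
{
    assert(width > 0 && height > 0 && stride >= width);
    std::vector<int> line1(width), line2(width), out;
    out.reserve(static_cast<std::size_t>(width) * height);

    for (int row = 0; row < height; ++row) {
        // One burst-style copy per frame per line, as memcpy would be in HLS.
        std::memcpy(line1.data(), frame1 + row * stride, width * sizeof(int));
        std::memcpy(line2.data(), frame2 + row * stride, width * sizeof(int));

        for (int col = 0; col < width; ++col)
            out.push_back(combinePixels(line1[col], line2[col]));
    }
    return out;
}
```

Because both line buffers are filled before the column loop runs, the pixel operation always sees matching pixels from the two frames, regardless of how the memory controller orders the two reads.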

Participant
568 Views
Registered: ‎05-28-2020

Hi @reaiken 

We are very thankful for your informative reply. We tried to generate a custom Vivado HLS IP that incorporates your points, but we got an error while synthesizing it.

image_2020_10_02T11_00_53_906Z.png

 

We did what you described in your reply and also included the DATAFLOW pragma, but the error persists. What might be the reason for this?

 

 

Explorer
556 Views
Registered: ‎07-18-2011

@Nikhil_Thapa 

DATAFLOW is rather tricky. You have to obey certain rules or it won't work. Variables used in the DATAFLOW region have to be defined after the DATAFLOW pragma. You also have to code it a certain way to get the line buffers to work correctly. This is all explained in the HLS user guides and tutorials UG871, UG902, and UG1270. I highly recommend you study these and work through all the tutorials.

This is how you would code up a simple 2-pixel-per-clock, 32-bit Y/CbCr VDMA in HLS using DATAFLOW. You can modify it to include two m_axi interfaces. The parameters are exposed on the AXI4-Lite (s_axilite) interface, so HLS will create a device driver that lets you set the raster size and memory read location in your interrupt service routine.

Notice how the DATAFLOW pragma is located inside the outer row loop, and "currentRow" is also defined in the DATAFLOW region. If I were to use "row" instead of "currentRow" it would flag the error you are seeing, even though they are effectively the same value. Variables used in the DATAFLOW region must be defined in that region, just like the dataIn[] and dataOut[] arrays are.

Note also the HLS PIPELINE pragma at the top of loop2 in the sendData routine. This is critical for achieving a low enough latency to process an entire line of video in real time; otherwise it will be too slow. Try compiling with and without this pragma and compare the results.

The HLS LOOP_TRIPCOUNT pragmas are added so you get meaningful latency results after synthesis, because width, stride, and height are variables and synthesis cannot calculate true worst-case numbers. It will either return "?" or base the estimate on the highest value of the variable's type, such as 2048 for an 11-bit uint, when you may only ever see 1080 iterations of the loop at most.

Beware that the code below may contain errors! I synthesized it to check for syntax errors, but I didn't write a test bench to see whether it actually performs the intended function. It should at least give you an idea of how to code DATAFLOW for memory input and streaming output.

Header file vdma.h:

    #ifndef SRC_VDMA_H_
    #define SRC_VDMA_H_

    #include <stdio.h>
    #include <string.h>
    #include "hls_stream.h"
    #include "hls_video.h"
    #include "ap_axi_sdata.h"

    #define WIDTH       1920
    #define HEIGHT      1080
    #define STRIDE      1920

    #define YUV_BLACK   0x80008000
    #define YUV_WHITE   0x80FF80FF
    #define YUV_RED     0xD4416441
    #define YUV_GREEN   0x3A704870
    #define YUV_BLUE    0x7223D423

    typedef ap_uint<32>             uint_axi;    // define the axi bus data width
    typedef ap_axiu<32,1,1,1>       axis_SC;     // define a 32-bit Y/CbCr axi stream with side channel data
    typedef hls::stream<axis_SC>    AXI_STREAM;

    void vdma(AXI_STREAM &s_axis, volatile int *m_axi_1, int width, int height, int stride);
    void getData(volatile int *m_axi_1, int *dataIn, int width, int height, int stride, int row);
    void transferData(int *dataIn, int *dataOut, int width);
    void sendData(AXI_STREAM &m_axis, int *dataIn, int width, int row);

    #endif    // SRC_VDMA_H_

 

 

Source file vdma.cpp:

    #include "vdma.h"

    // Top level: reads one video line per row iteration from DDR and streams it out.
    void vdma(AXI_STREAM &s_axis, volatile int *m_axi_1, int width, int height, int stride)
    {
    #pragma HLS INTERFACE s_axilite port=return bundle=A
    #pragma HLS INTERFACE s_axilite port=width bundle=A
    #pragma HLS INTERFACE s_axilite port=height bundle=A
    #pragma HLS INTERFACE s_axilite port=stride bundle=A
    // This port is only read, so the read burst length is the setting that matters.
    #pragma HLS INTERFACE m_axi depth=4147200 port=m_axi_1 offset=slave bundle=A max_read_burst_length=64    // 1920*1080*2 for Y/CbCr
    #pragma HLS INTERFACE axis register both port=s_axis
    #pragma HLS STREAM variable=s_axis depth=2073600 dim=1    // 1920x1080 = 2073600

        loop1: for (int row = 0; row < height; row++)
        {
            #pragma HLS LOOP_TRIPCOUNT max=1080
            #pragma HLS DATAFLOW

            // Everything used by the DATAFLOW region is declared inside it.
            int dataIn[WIDTH/2];
            #pragma HLS STREAM variable=dataIn dim=1
            int dataOut[WIDTH/2];
            #pragma HLS STREAM variable=dataOut dim=1

            int currentRow = row;
            int widthDiv2 = width >> 1;      // 2 pixels per clock
            int strideDiv2 = stride >> 1;    // 2 pixels per clock

            getData(m_axi_1, dataIn, widthDiv2, height, strideDiv2, currentRow);
            transferData(dataIn, dataOut, widthDiv2);
            sendData(s_axis, dataOut, widthDiv2, currentRow);
        }
    }

    // Burst-read one complete line from memory into the line buffer.
    void getData(volatile int *m_axi_1, int *dataIn, int width, int height, int stride, int row)
    {
        memcpy((void *)dataIn, (const int *)(m_axi_1 + (row * stride)), width * sizeof(int));
    }

    // Pass-through stage; per-pixel processing would go here.
    void transferData(int *dataIn, int *dataOut, int width)
    {
        for (int col = 0; col < width; col++)
        {
            #pragma HLS LOOP_TRIPCOUNT max=960    // 1920 at 2 pixels per clock
            dataOut[col] = dataIn[col];
        }
    }

    // Stream one line out, marking start of frame (TUSER) and end of line (TLAST).
    void sendData(AXI_STREAM &m_axis, int *dataIn, int width, int row)
    {
        axis_SC video;

        loop2: for (int col = 0; col < width; col++)
        {
            #pragma HLS LOOP_TRIPCOUNT max=960    // 1920 at 2 pixels per clock
            #pragma HLS PIPELINE
            if ((row == 0) && (col == 0))
                video.user = 1;
            else
                video.user = 0;

            if (col == (width - 1))
                video.last = 1;
            else
                video.last = 0;

            video.data = dataIn[col];

            m_axis << video;    // send video to AXI4-Stream
        }
    }
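
Since the code above was not run against a test bench, a small software model can help check the intended behaviour before synthesis. The sketch below is plain C++ rather than HLS code: `Beat` and `modelVdma` are hypothetical stand-ins for `axis_SC` and `vdma`, and a `std::vector` takes the place of the AXI4-Stream.

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Hypothetical stand-in for one AXI4-Stream beat (data plus TUSER and TLAST).
struct Beat {
    int  data;
    bool user;  // start of frame: first word of the first line
    bool last;  // end of line: last word of every line
};

// Model of the row loop: copy one line from memory, then emit it beat by beat.
// width and stride are counted in 32-bit words, as in getData and sendData.
std::vector<Beat> modelVdma(const int *mem, int width, int height, int stride)
{
    assert(width > 0 && height > 0 && stride >= width);
    std::vector<int> line(width);
    std::vector<Beat> out;

    for (int row = 0; row < height; ++row) {
        std::memcpy(line.data(), mem + row * stride, width * sizeof(int));
        for (int col = 0; col < width; ++col) {
            Beat b;
            b.data = line[col];
            b.user = (row == 0 && col == 0);
            b.last = (col == width - 1);
            out.push_back(b);
        }
    }
    return out;
}
```

Running the model on a small frame whose stride is larger than its width is a quick way to confirm that the padding words are skipped and that TLAST lands on the last word of each line.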

 

 

 

Participant
475 Views
Registered: ‎05-28-2020

Hi @reaiken 

We are very thankful.

Your informative reply and your code helped us better understand how to apply the DATAFLOW pragma. We were able to solve our issue and generate the custom HLS IP. But we ran into another issue.

We added our HLS IP to the Vivado block design and generated the bitstream successfully. In the design, the output stream of the HLS IP is connected to the AXI4-Stream to Video Out IP. When running the design on the board, the ILA showed that the HLS IP was working and generating correct data. The AXI4-Stream to Video Out IP was receiving correct data and timing information on its input side, but it was not generating any data on its output side. We also checked the locked, underflow and overflow status signals; they were all LOW. The ILA capture is attached below for your information.

We are stuck here. Is this caused by the HLS IP? Could you give us any information or suggestions?

Thank You,

 

ILA status:

image.png

Explorer
442 Views
Registered: ‎07-18-2011

@Nikhil_Thapa 

Do you have the resets implemented correctly?

The AXI4-Stream to Video Out IP requires an active-high reset on the vid_io_out_reset pin, while the Video Timing Controller requires an active-low reset on the resetn pin.

You can drive both of these resets from a single Processor System Reset IP block.

Make sure your clocking is set up correctly.  I typically use independent clock mode with separate AXI and Video clocks.

If that isn't the problem, post your block diagram; it is difficult to tell what is wrong from just the ILA output.

Participant
423 Views
Registered: ‎05-28-2020

Hi @reaiken ,

"The AXI4-Stream to Video Out IP requires an active-high reset on the vid_io_out_reset pin, while the Video Timing Controller requires an active-low reset on the resetn pin."

We have not connected any reset signals to these pins. We assumed the design would work without them.

"Make sure your clocking is set up correctly. I typically use independent clock mode with separate AXI and Video clocks."

We have also configured the AXI4-Stream to Video Out IP in independent clock mode, with a 200 MHz AXI clock and a 25.175 MHz pixel clock for 640x480 video.

Below is the relevant part of our block diagram for you to check. The NUC IP is the custom IP we built by following your previous replies.

Thank You,

 

axi4stream issue.png
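
As a sanity check on the clock figures above, the pixel clock should equal the total raster size (active area plus blanking) times the frame rate. The sketch below assumes the common 640x480 VGA timing of 800 total pixels per line and 525 total lines, which the thread does not state; `frameRateHz` is a hypothetical helper.

```cpp
#include <cassert>

// Frame rate produced by a pixel clock for a given total raster size
// (active area plus horizontal and vertical blanking).
double frameRateHz(double pixelClockHz, int totalPixelsPerLine, int totalLines)
{
    assert(totalPixelsPerLine > 0 && totalLines > 0);
    return pixelClockHz / (static_cast<double>(totalPixelsPerLine) * totalLines);
}
```

With those assumed totals, `frameRateHz(25.175e6, 800, 525)` comes to roughly 59.94 Hz, so the chosen pixel clock is consistent with 640x480 at a nominal 60 Hz, and the 200 MHz AXI clock leaves ample margin for feeding the stream.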

Explorer
418 Views
Registered: ‎07-18-2011

@Nikhil_Thapa 

You must connect the resets or it won't work properly.  You have them floating with no pin tie-offs.

Use the Processor System Reset IP block with the peripheral_reset (active-high) connected to the AXI-Stream to Vid Out vid_io_out_reset pin and the peripheral_aresetn (active-low) connected to the VTC resetn pin.

You must also connect your aresetn pin to the AXI reset.

Participant
348 Views
Registered: ‎05-28-2020

@reaiken ,

As you suggested, we tied all the reset pins to the corresponding reset signals, but it still did not work; all the status signals stayed LOW. It may be due to the HLS IP: when we stopped that IP, the AXI4-Stream to Video Out IP reported underflow.
We generated another HLS IP by exactly following the code in your previous reply, but it showed the same issue. We believe the HLS IP should work as is. Are we missing anything when generating the HLS IP? We also followed UG902.

 

Explorer
336 Views
Registered: ‎07-18-2011

@Nikhil_Thapa 

Did you also enable global pin tie-offs for all the other floating pins, like the ce signals?

In my experience, it is always best to start debugging a video path in small steps, starting at the back.    I would instantiate a Video Test Pattern generator configured for your video format and connect that directly to the AXI-Stream to Video Out IP and get that section working, then connect your custom IP and debug it.

 

Scholar
335 Views
Registered: ‎03-28-2016

@Nikhil_Thapa 

I have seen that some of the Xilinx IP (DMA and VDMA primarily) require that the "TKeep" signal be properly set in the AXI4-Stream interface.  The width of TKeep is determined by the width of TData.  TKeep has 1 bit for each byte of TData.  If TData is 32-bit then TKeep is 4-bit.  In almost all cases, TKeep should always be set to all "1"s.  By default, it will be set to all "0"s.  Try setting the TKeep to all "1" in your HLS IP.

Ted Booth | Tech. Lead FPGA Design Engineer | DesignLinx Solutions
https://www.designlinxhs.com
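
The TKeep rule above is easy to express as arithmetic: one bit per byte of TData, all set. A minimal sketch, where `tkeepWidth` and `tkeepAllOnes` are hypothetical helper names; in an HLS design that uses `ap_axiu`, the corresponding field is `keep`, which would need to be assigned this value on every beat.

```cpp
#include <cassert>
#include <cstdint>

// Number of TKEEP bits for a TDATA bus of the given width: one bit per byte.
unsigned tkeepWidth(unsigned tdataBits)
{
    assert(tdataBits > 0 && tdataBits % 8 == 0 && tdataBits <= 512);
    return tdataBits / 8;
}

// TKEEP value that marks every byte of TDATA as valid (all ones).
std::uint64_t tkeepAllOnes(unsigned tdataBits)
{
    unsigned bits = tkeepWidth(tdataBits);
    // Shifting a 64-bit value by 64 is undefined, so handle that case apart.
    return (bits == 64) ? ~std::uint64_t{0} : ((std::uint64_t{1} << bits) - 1);
}
```

For the 32-bit stream used in this thread, that gives a 4-bit TKeep with the value 0xF.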
Participant
271 Views
Registered: ‎05-28-2020

@reaiken ,

Thanks for the reply.
As you suggested, we used a TPG and the HLS IP to debug the video path, with an AXI4-Stream Switch IP to select between the TPG stream and the HLS IP stream. With the TPG stream selected we got output, but with the HLS IP stream selected we got none. However, the ILA still showed data with valid handshaking.

@tedbooth ,
Thanks for the reply.
We regenerated the HLS IP with the TKEEP signal set to all "1"s, but that did not change anything either. The ILA waveform is included below.

Furthermore, we tested the HLS IP with the Video Processing Subsystem (VPSS) IP by feeding the HLS IP stream into it, and saw the same thing: data was arriving, but the VPSS produced garbled output, as if there were no stream from the HLS IP. We also set the IP parameters so that the HLS IP generated black pixel data (all zeros), but the VPSS output did not change.
From these test results, we suspect there may be an issue in the AXI4-Stream data interface of the HLS IP.

@reaiken , could you also generate an HLS IP from your code and test it on your side?

Thanks,

 

ILA status waveform:

image.png

Participant
241 Views
Registered: ‎05-28-2020

Hello Everyone,

We have now solved this issue ourselves.
While going through the ILA waveform of the HLS IP, we inspected the TUSER and TLAST signals in particular. We found that TLAST never changed state: it stayed LOW the whole time, even though data was coming out of the HLS IP. After studying the AXI4-Stream protocol, we concluded that the issue was a missing packet boundary, which prevented the downstream IP from detecting the actual data packets.
We then checked our HLS IP code and found a problem in how TLAST was asserted. We corrected it and regenerated the IP, and it started working.
Thank you all for helping, directly or indirectly, to figure out our issue, and many thanks for joining this conversation and sharing your knowledge.
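
The check that uncovered the problem can also be scripted against captured stream data. A minimal sketch, assuming the usual AXI4-Stream video convention that TUSER marks the first pixel of a frame and TLAST marks the last pixel of each line; `Beat` and `checkFrame` are hypothetical names, and the beats would come from a simulation or an exported ILA capture.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Sideband signals of one transferred beat (TVALID and TREADY both high).
struct Beat { bool user; bool last; };

// Returns an empty string when one frame of beats has well-formed packet
// boundaries, otherwise a short description of the first problem found.
std::string checkFrame(const std::vector<Beat> &beats, int width, int height)
{
    assert(width > 0 && height > 0);
    const std::size_t w = static_cast<std::size_t>(width);
    if (beats.size() != w * static_cast<std::size_t>(height))
        return "wrong number of beats";

    for (std::size_t i = 0; i < beats.size(); ++i) {
        const bool firstOfFrame = (i == 0);
        const bool endOfLine = ((i + 1) % w == 0);
        if (beats[i].user != firstOfFrame) return "TUSER misplaced";
        if (beats[i].last != endOfLine)    return "TLAST misplaced";
    }
    return "";
}
```

A capture in which TLAST stays LOW, as described above, fails at the last beat of the first line, which points directly at the missing packet boundary.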