stgateizo
Explorer
Explorer
956 Views
Registered: ‎10-07-2016

Video Processing Subsystem V2.3: Bandwidth issue...

Dear Xilinx experts,

I am stuck with the Xilinx Video Processing Subsystem and need some good advice to find the root cause of my design issue. I would be glad if anybody could help me…

 

Introduction:

I use the Video Processing Subsystem (VPSS) in my design to downscale 4K video to an arbitrary output resolution and to do frame rate conversion, using external DDR3 memory that is connected to the VPSS via the MIG. The AXI4-Stream input of the VPSS is driven by a 4K source outside the FPGA, but some self-written IP cores sit in front of the VPSS and do video pre-processing such as video composition. The AXI4-Stream output of the VPSS is connected directly to a Xilinx HDMI 2.0 Tx IP core.

 

What is the problem?

The AXI4-Stream input of the Video Processing Subsystem cannot accept my data fast enough. As a consequence, I see TREADY going low quite often after the first line, which is accepted at full speed (see ILA records below). The VPSS throttles the input stream so heavily that the IP core in front of the VPSS stops streaming because its internal FIFO overflows.


ILA record of axis-streaming input interface of video processing subsystem:
Trigger when SOF = 1 AND TREADY = 1 AND TVALID = 1

stgateizo_0-1619075915295.png

 

Zoomed view of above ILA record. ILA record shows the start of the second line, and the throttling of the VPSS…

stgateizo_1-1619075915301.png

When I look at the last ILA record above, it looks like there is a bandwidth bottleneck somewhere. The problem is that I can’t find it in my design.

 

My expectations:

My expectation is, that the Video Processing Subsystem has enough bandwidth, in order to process the input stream. As you can see in the first ILA record, the IP-core which is providing the input stream for the VPSS, will provide each active video data line as one uninterrupted data burst. During the horizontal and vertical blanking phase, no data will be transferred. So my expectation is, that the VPSS should be able to handle the way, how the previous stage is providing the axis streaming data. The first line of data seems to be buffered inside the VPSS, since it can be accepted with full speed. But why will the second line be throttled so heavily?

Bandwidth Considerations:

The axis streaming input interface of the VPSS is 2 pixel wide (48bit), and runs with an axi-streaming clock of 300MHz, in order to support up to 4Kp60. => AXI-Streaming Input Bandwidth = 300MHz * 48bit = 14400Mbit/s = 1800Mbyte/s

The AXI4 memory-mapped interface of the VPSS is a 256-bit wide data bus, which runs with the user interface clock of the MIG, which is 200MHz. => AXI4-Memory Mapped Interface Bandwidth = 200MHz * 256bit = 51200Mbit/s = 6400Mbyte/s
The external DDR3-memory interface comes with a 32-bit data bus, which is clocked with 800MHz. => DDR3-Memory Bandwidth = 2x 800MHz * 32bit = 51200Mbit/s = 6400Mbyte/s

From the bandwidth point of view I do not see a problem, even if I go with 4Kp60 into the VPSS and with 4Kp60 out of the VPSS => 2x 14400Mbit/s = 28800Mbit/s = 3600Mbyte/s. This is roughly 56% of the total available DDR3 memory peak bandwidth.
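The peak-bandwidth figures above can be re-derived with a few lines of Python. This is only a sketch of the arithmetic in the post: the helper name `peak_mbit_per_s` is made up for illustration, and the numbers are theoretical peaks that ignore DDR3 refresh, read/write turnaround and AXI protocol overhead.

```python
# Peak-bandwidth budget using the clock and bus-width figures quoted in the post.
MBIT = 1_000_000

def peak_mbit_per_s(clock_hz, bus_bits, transfers_per_clock=1):
    """Theoretical peak throughput in Mbit/s (hypothetical helper)."""
    return clock_hz * bus_bits * transfers_per_clock // MBIT

axis_in = peak_mbit_per_s(300_000_000, 48)    # 2 pixels x 24 bit per clock
axi_mm = peak_mbit_per_s(200_000_000, 256)    # MIG user-interface side
ddr3 = peak_mbit_per_s(800_000_000, 32, 2)    # double data rate: 2 transfers per clock

worst_case = 2 * axis_in                      # 4Kp60 written plus 4Kp60 read back
print(axis_in, axi_mm, ddr3, worst_case)
print(f"share of DDR3 peak: {worst_case / ddr3:.1%}")
```

Sustained DDR3 throughput is normally well below the theoretical peak, so the resulting share should be read as a best case.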

 

What does the configuration of the Video Processing Subsystem look like?

stgateizo_2-1619075915306.png

 

 

How is the Video Processing Subsystem configured by Microblaze?

 

------ SUBSYSTEM INPUT/OUTPUT CONFIG ------
->INPUT
        Color Format:     RGB
        Color Depth:      8
        Pixels Per Clock: 2
        Mode:             Progressive
        Frame Rate:       60Hz
        Resolution:       3840x2160 [Custom Mode]
        Pixel Clock:      554964 kHz

->OUTPUT
        Color Format:     RGB
        Color Depth:      8
        Pixels Per Clock: 2
        Mode:             Progressive
        Frame Rate:       50Hz
        Resolution:       1920x1080 [Custom Mode]
        Pixel Clock:      123750 kHz

Zoom Mode: OFF
Pip  Mode: OFF

Data Flow Map: VidIn -> SCALER-H -> SCALER-V -> VDMA -> LBOX -> CSC -> VidOut

 

What does the connection to the MIG look like?

stgateizo_3-1619075915315.png

 

stgateizo_4-1619075915321.png

 

 

Axi_interconnect_0 configuration:

stgateizo_5-1619075915324.png

 

stgateizo_6-1619075915327.png

 

stgateizo_7-1619075915331.png

 

 

MIG configuration Summary:

 

stgateizo_8-1619075915336.jpeg

 

stgateizo_9-1619075915341.jpeg

I’m currently a bit puzzled why it does not work?
Any ideas what else I can check?

Kind regards

stgateizo

0 Kudos
33 Replies
florentw
Moderator
Moderator
778 Views
Registered: ‎11-09-2015

HI @stgateizo 

I believe the output stream will have an impact on the input stream even if the AXI VDMA is present in the pipeline. So it would be interesting to see the behaviour on the output stream.

If the AXI VDMA is in the pipeline, it is interesting to have a view of the memory interface to see if this could be the bottleneck.

One other recommendation with the VPSS is to do everything step by step. I am not sure why the CSC is shown in the pipeline, as input and output are both RGB. But you can first try to use the VPSS as a pass-through, 4k60 in and out. Is it working?

Then you can use the scaler but without converting the frame rate (4k60 to 1080p60). Is it working?

I do not have a magic answer. You need to understand where the bottleneck might be coming from and which configurations trigger it.


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
0 Kudos
florentw
Moderator
Moderator
777 Views
Registered: ‎11-09-2015

BTW. Looking at your screenshot, it seems that you have a misunderstanding. This is not the second line which is throttling but the second frame (as per the tuser - Start of Frame).

This would point me to really look at the output stream.

I am not really sure how the VDMA is configured inside the VPSS, but I am not sure it will skip any frames. This means that you have to output as many frames as you input. So the first frame might be buffered fine, but when you get to the second one, you might be limited by your output.


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
0 Kudos
stgateizo
Explorer
Explorer
762 Views
Registered: ‎10-07-2016

Hello Florent,
thank you for your help.

Here are my remarks:
[Florent] I believe the output stream will have an impact on the input stream even if the AXI VDMA is present in the pipeline. So it would be interesting to see the behaviour on the output stream.
[stgateizo] Why should the output stream have any influence on the input stream? When I use frame rate conversion, I would not expect any influence. But you are right, we should also take a look at the output stream to get more information. I will add the output stream to the ILA and take a record...

[Florent] If the AXI VDMA is in the pipeline, it is interesting to have a view of the memory interface to see if this could be the bottleneck.
[stgateizo] Yes, you are right, but I have a BRAM utilization issue. Currently I use 90% of the BRAMs. I can only add an ILA to the memory interface if I remove the current ILA. So let's first look at the output stream...

[Florent] One other recommendation with the VPSS is to do everything step by step. I am not sure why the CSC is shown in the pipeline as input and output are RGB. But then you can try first use the VPSS as a path-through 4k60 in and out. Is it working?
[stgateizo] Good idea, but it is not possible, since the timing which is going into the VPSS is not synchronous to the timing provided by the HDMI Tx IP-core. So it will not work without framerate conversion. This would only work, if the pixel-clock of the HDMI IP-core would be the same as provided from the external timing generator, which is not the case...

[Florent] BTW. Looking at your screenshot, it seems that you have a misunderstanding. This is not the second line which is throttling but the second frame (as per the tuser - Start of Frame).
[stgateizo] Florent, why do you think it is the second frame and not the second line? I can see the signal Tlast toggling, which means an EOL = End Of Line. Tuser corresponds to SOF = Start Of Frame.  Can it be that you mixed it up?

Kind regards
stgateizo

0 Kudos
stgateizo
Explorer
Explorer
738 Views
Registered: ‎10-07-2016

Hi Florent,

below you can see an ILA record (same trigger condition as before), but this time with the streaming output included. Unfortunately I cannot increase the sample memory of the ILA in order to record more lines. But as you can see, it looks okay at first glance.

stgateizo_0-1619092608658.png

Additionally, I have routed the control signals of the AXI4-Stream input and output interfaces to IO pins, so that I can also record those signals with my mixed signal oscilloscope. The external oscilloscope allows me to record many more lines, even whole frames...
In the oscilloscope screenshot below, you can see the following signals:

AXI-Streaming Input Interface:
D4 = TLAST = End Of Line
D5 = TUSER = Start Of Frame
D6 = TVALID
D7 = TREADY

AXI-Streaming Output Interface:
D8 = TLAST = End Of Line
D9 = TUSER = Start Of Frame
D10 = TVALID
D11 = TREADY

AXI-Streaming Input Interface of the module before the Video Processing Subsystem.
D0 = TLAST = End Of Line
D1 = TUSER = Start Of Frame
D2 = TVALID
D3 = TREADY

Trigger on SOF signal at the input of the VPSS.

stgateizo_1-1619092964161.png


Please note that the hperiod of the output stream is longer than the hperiod of the input stream.
Although the frame rates of the input stream and output stream do not differ very much, the hperiod differs much more, which has to do with the input and output resolution:
Input timing = 3840x2160@60Hz
Output timing = 1920x1080@50Hz
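To put rough numbers on the hperiod difference, here is a small sketch. It only counts active lines, because the blanking (total line count) of the two custom timings is not given in this thread, so the results are lower bounds on the line rate and upper bounds on the line period. The helper name is hypothetical.

```python
def min_line_rate_hz(active_lines, frame_rate_hz):
    """Lower bound on the line rate: active lines only, vertical blanking ignored."""
    return active_lines * frame_rate_hz

in_rate = min_line_rate_hz(2160, 60)    # input timing 3840x2160@60Hz
out_rate = min_line_rate_hz(1080, 50)   # output timing 1920x1080@50Hz

# Upper bounds on the horizontal period, in microseconds.
max_in_hperiod_us = 1e6 / in_rate
max_out_hperiod_us = 1e6 / out_rate

print(in_rate, out_rate)
print(round(max_in_hperiod_us, 2), round(max_out_hperiod_us, 2))
print("line-rate ratio:", in_rate / out_rate)
```

Under these assumptions the input delivers more than twice as many lines per second as the output, even though the frame rates are close.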

By the way, the output seems to work well, since I can see a pure red image on the display, which is connected to HDMI2.0 Tx output...

When I trigger several times, I can see sometimes also this behavior...

stgateizo_2-1619093741818.png

Hmmmmm ???

Kind regards

stgateizo.

0 Kudos
stgateizo
Explorer
Explorer
730 Views
Registered: ‎10-07-2016

Hi Florent,
as you can see in the driver instance of the video processing subsystem below, the CSC is present. Maybe it is always in, in a Full Fledged Design.

stgateizo_3-1619094520162.png

As you can see below, it will convert from RGB to RGB, or in other words it will do nothing...

stgateizo_4-1619094722874.png

Below you can see also the settings for the input and output video stream:

stgateizo_5-1619094806992.png

stgateizo_6-1619094865536.png

And here are the settings for PIP, Zoom etc. which I have not touched, and which is probably not enabled...

stgateizo_7-1619094947863.png

As you can see above, the Frame Buffer Start Address is also correctly set to 0x8000_0000, which is the address where my DDR3 memory starts.

Furthermore, the following marked components do not have a valid pointer. Instead they have a null pointer, which means that those components will not be used in the full-fledged design, right?

stgateizo_0-1619095747130.png

This affects the horizontal and vertical chroma resampler, as well as the deinterlacer. Looks good from my point of view...

The VDMA is programmed as shown below...

stgateizo_0-1619096454765.png

stgateizo_1-1619096503341.png

Can you see any programming issue here ??? I can't...

One question, why does the VidIn and Vidout structure of the input and output stream contain the framerate information?
The framerate for the input is given by the external source, and the framerate of the output stream is determined by the timing adjusted at HDMI Tx IP-core. So for what reason does the VPSS require this information?

Kind regards

stgateizo

0 Kudos
florentw
Moderator
Moderator
717 Views
Registered: ‎11-09-2015

HI @stgateizo 

Yes, sorry, I mixed up tlast and tuser while looking at the ILA capture. But my point still stands: the line that passes without throttling is not necessarily the first line, it is just the line before the one that is throttled.

For me, this means that the problem is not coming directly from the VPSS but from another element. Would you be able to capture the ready and valid signals (AR, R, AW and W channels) of the memory interface to see whether the memory is applying back-pressure?

Or can you try with lower resolutions?


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
0 Kudos
stgateizo
Explorer
Explorer
699 Views
Registered: ‎10-07-2016

Hi Florent,

I will try to add the control signals of the memory interface to my mixed signal oscilloscope, but this will take some time, since I first have to create a debug module that allows me to bring those signals out to the IOs...
Maybe I will have some results for you tomorrow....

Florent, did you take a look at the programmed parameters which I have posted before?
Do you think that the VDMA is programmed correctly ?

Kind regards

stgateizo

0 Kudos
florentw
Moderator
Moderator
688 Views
Registered: ‎11-09-2015

HI @stgateizo 

Florent, did you take a look at the programmed parameters which I have posted before?
Do you think that the VDMA is programmed correctly ?

No I missed that.

"CSC is present" means that it is included in the hardware. But there is an AXI4-Stream switch inside the VPSS, so the video stream should not go through the CSC if it is not needed.

I am not used to looking directly at the registers of the VPSS, but I do not see anything really wrong, specifically when looking at the VDMA.

 


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
0 Kudos
stgateizo
Explorer
Explorer
657 Views
Registered: ‎10-07-2016

Hi Florent,

stupid question: how can I route the memory interface signals like awready, awvalid, etc. to a Concat block (see below), in order to combine them with the AXI4-Stream signals into a debug vector that I can route to IO pins?

I have no idea how I can do this.

stgateizo_0-1619103179265.png

for the AXI-Streaming signals I have created a little module (see below), which you can hook into an AXI-streaming bus, and which just forwards all AXI-streaming signals 1 by 1, and which allows you to bring out the control signals in parallel for debugging purposes.
But this technique is not possible for an AXI4 Full MM interface, since you get then an address assignment issue...

stgateizo_1-1619103486090.png

So how can I get access to the AXI4 Full MM control signals?

 

Regards

stgateizo

 

 

 

 

 

0 Kudos
florentw
Moderator
Moderator
633 Views
Registered: ‎11-09-2015

HI @stgateizo 

You can theoretically create the same type of IP by setting the AXI interface mode to monitor in the IP packager. I wrote an article about this a while ago:

https://forums.xilinx.com/t5/Design-and-Debug-Techniques-Blog/AXI-Basics-5-Create-an-AXI4-Lite-Sniffer-IP-to-use-in-Xilinx/ba-p/1064306 

 


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
0 Kudos
stgateizo
Explorer
Explorer
627 Views
Registered: ‎10-07-2016

Hi Florent,

you are right, this should work. By the way, I'm just trying another interesting possibility. When you select an input pin of the Concat block and click on "Make Connection", you get a window with a list of signals which you are allowed to connect. At the bottom of this window there is a checkbox named "Hide pins belonging to Interfaces". When you uncheck this box, you will see all interface signals, including signals like awready and awvalid of the memory interface. When you select such a signal, it is possible to make a connection to this interface signal. The connection is not displayed in the block diagram, but when you select the connected input pin of the Concat block, you can see in the properties window that it is connected to this interface signal. I'm not sure if it works, but I will give it a try. If it does not work, I will follow your suggestion...


Regards

stgateizo

0 Kudos
stgateizo
Explorer
Explorer
608 Views
Registered: ‎10-07-2016

Hi Florent,
will go on with your suggestion, as you can see below....

stgateizo_0-1619107798203.png

Kind regards

stgateizo

0 Kudos
stgateizo
Explorer
Explorer
578 Views
Registered: ‎10-07-2016

Hi Florent,
the control signals of the VPSS memory interface look strange, as you can see below:

AXI-Streaming Input Interface:
D0 = TLAST = End Of Line
D1 = TUSER = Start Of Frame
D2 = TVALID
D3 = TREADY

AXI-Streaming Output Interface:
D4 = TLAST = End Of Line
D5 = TUSER = Start Of Frame
D6 = TVALID
D7 = TREADY

AXI4-Memory Control Signals
D8 = AWVALID
D9 = AWREADY
D10 = WVALID
D11 = WREADY
D12 = ARVALID
D13 = ARREADY
D14 = BVALID
D15 = BREADY

stgateizo_0-1619111137538.png

We see only one AWVALID pulse at the beginning, but AWREADY is low all the time. The same is true for WVALID and WREADY. So it looks like the first write access does not take place, since the MIG or the interconnect fabric is not ready to accept a write transfer.
But the read transactions do not look correct either. There are ARVALID pulses, but ARREADY is also low all the time. Only BREADY is high, but that belongs to the write response channel, not to the read data return path, doesn't it?

So I think we have a problem with the MIG connection. This is strange, since I'm able to run a DDR3 memory test successfully with the Microblaze, which is connected to S00_AXI in the axi_interconnect_0 fabric. Hmmmm ???

stgateizo_56-1619112047837.png

Does it have something to do with the interconnect fabric settings (see pictures in previous posts)?

Kind regards

stgateizo

 

0 Kudos
stgateizo
Explorer
Explorer
575 Views
Registered: ‎10-07-2016

Hi Florent,

I have to correct myself!
The DDR3 memory test performed by the Microblaze no longer works once the VPSS starts accessing the DDR3 memory!
But the DDR3 memory test passes successfully when the VPSS is not started.

In other words, as soon as the VPSS starts accessing the DDR3 memory, the Microblaze is no longer able to access the DDR3 memory properly either.
So we should definitely focus on the memory connection...

Kind regards
stgateizo

0 Kudos
stgateizo
Explorer
Explorer
481 Views
Registered: ‎10-07-2016

Hi Florent,
the AMBA specification says that a write address transfer (initiated by the VPSS) starts with AWVALID going high, and AWVALID must then stay high until the AXI interconnect fabric responds with AWREADY = 1.

stgateizo_2-1619163081281.png


But this does not happen in my case...
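The handshake rule can be turned into a small check on captured traces. The following Python sketch assumes the VALID and READY signals have already been reduced to one 0/1 sample per AXI clock cycle (an oscilloscope capture would first have to be resampled), and the function name is made up for illustration. Note that a VALID pulse of a single clock cycle is legal if READY is high in that same cycle.

```python
def valid_dropped_before_ready(valid, ready):
    """Return the cycle indices where VALID fell without a completed handshake.

    valid, ready: sequences of 0/1 values, one entry per clock cycle.
    """
    violations = []
    for i in range(1, len(valid)):
        # Previous cycle: VALID asserted but not accepted (READY low).
        pending = valid[i - 1] and not ready[i - 1]
        if pending and not valid[i]:
            violations.append(i)
    return violations

# VALID is held until READY arrives: no violation.
ok_trace = valid_dropped_before_ready([0, 1, 1, 1, 0], [0, 0, 0, 1, 0])
# VALID drops after one cycle while READY never went high: violation at cycle 2.
bad_trace = valid_dropped_before_ready([0, 1, 0, 0, 0], [0, 0, 0, 0, 0])
print(ok_trace, bad_trace)
```

If the captured READY signal itself is unreliable, the check will report violations that did not happen on the bus, so the capture path should be verified first.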

AXI4-Memory Control Signals

D8 = AWVALID
D9 = AWREADY
D10 = WVALID
D11 = WREADY
D12 = ARVALID
D13 = ARREADY
D14 = BVALID
D15 = BREADY

stgateizo_0-1619162652124.png

Below is another record. This is a zoomed view of the very first write access, when AWVALID goes high. It goes high for 5ns, which corresponds to 1 clock cycle at 200MHz.
But I do not see AWREADY going high. There is also another strange thing: WVALID is toggling without waiting for WREADY. Something is going completely wrong here!

stgateizo_3-1619163697871.png

The sampling rate was 1.25GSamples/s, while the memory interface runs at 200MHz. So the sampling rate is actually fast enough to catch AWREADY, even if it is only 1 clock cycle long...
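As a quick sanity check of that statement, the number of oscilloscope samples per AXI clock cycle can be computed as follows. This is plain arithmetic on the two figures quoted above and says nothing about whether the signal actually reaches the IO pin.

```python
import math

sample_rate_hz = 1.25e9   # oscilloscope sampling rate from the post
axi_clock_hz = 200e6      # MIG user-interface clock from the post

clock_period_ns = 1e9 / axi_clock_hz
samples_per_clock = sample_rate_hz / axi_clock_hz
# A pulse lasting one full clock cycle is hit by at least this many samples.
guaranteed_samples = math.floor(samples_per_clock)

print(clock_period_ns, samples_per_clock, guaranteed_samples)
```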

Is this a problem of the VPSS?

Kind regards

stgateizo

 

0 Kudos
stgateizo
Explorer
Explorer
462 Views
Registered: ‎10-07-2016

Hi Florent,

I have added an ILA to the memory interface of the VPSS, and I have triggered on the very first AWVALID access after starting the VPSS.
I don't know why, but in the ILA record I can definitely see the AWREADY signal going high.

Please click on the picture to open it with full resolution...

stgateizo_0-1619174731025.png

When I compare it with the oscilloscope record below, I wonder why I can't see AWREADY going high there.
AXI4-Memory Control Signals
D8 = AWVALID
D9 = AWREADY
D10 = WVALID
D11 = WREADY
D12 = ARVALID
D13 = ARREADY
D14 = BVALID
D15 = BREADY

stgateizo_1-1619174992482.png


Here is the zoomed view during the trigger point (AWVALID = 1)

stgateizo_2-1619176937193.png

As you can see, AWVALID and WVALID match the ILA record perfectly. But AWREADY and WREADY do not match.

Is my AXI-MM Sniffer not correct?

stgateizo_3-1619177213092.png

With s_axi_mm expanded...

stgateizo_4-1619177244270.png

Interface Mode of s_axi_mm is set to "monitor"

stgateizo_5-1619177391146.png

VHDL Code:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;

-- Uncomment the following library declaration if using
-- arithmetic functions with Signed or Unsigned values
--use IEEE.NUMERIC_STD.ALL;

-- Uncomment the following library declaration if instantiating
-- any Xilinx leaf cells in this code.
--library UNISIM;
--use UNISIM.VComponents.all;

entity axi_mm_debug is
port( axi_mm_clk : in std_logic;

s_axi_mm_awaddr : in std_logic_vector ( 31 downto 0 );
s_axi_mm_awlen : in std_logic_vector ( 7 downto 0 );
s_axi_mm_awsize : in std_logic_vector ( 2 downto 0 );
s_axi_mm_awburst : in std_logic_vector ( 1 downto 0 );
s_axi_mm_awlock : in std_logic_vector ( 0 to 0 );
s_axi_mm_awcache : in std_logic_vector ( 3 downto 0 );
s_axi_mm_awprot : in std_logic_vector ( 2 downto 0 );
s_axi_mm_awqos : in std_logic_vector ( 3 downto 0 );
s_axi_mm_awvalid : in std_logic_vector ( 0 to 0 );
s_axi_mm_awready : in std_logic_vector ( 0 to 0 );
s_axi_mm_wdata : in std_logic_vector ( 255 downto 0 );
s_axi_mm_wstrb : in std_logic_vector ( 31 downto 0 );
s_axi_mm_wlast : in std_logic_vector ( 0 to 0 );
s_axi_mm_wvalid : in std_logic_vector ( 0 to 0 );
s_axi_mm_wready : in std_logic_vector ( 0 to 0 );
s_axi_mm_bresp : in std_logic_vector ( 1 downto 0 );
s_axi_mm_bvalid : in std_logic_vector ( 0 to 0 );
s_axi_mm_bready : in std_logic_vector ( 0 to 0 );
s_axi_mm_araddr : in std_logic_vector ( 31 downto 0 );
s_axi_mm_arlen : in std_logic_vector ( 7 downto 0 );
s_axi_mm_arsize : in std_logic_vector ( 2 downto 0 );
s_axi_mm_arburst : in std_logic_vector ( 1 downto 0 );
s_axi_mm_arlock : in std_logic_vector ( 0 to 0 );
s_axi_mm_arcache : in std_logic_vector ( 3 downto 0 );
s_axi_mm_arprot : in std_logic_vector ( 2 downto 0 );
s_axi_mm_arqos : in std_logic_vector ( 3 downto 0 );
s_axi_mm_arvalid : in std_logic_vector ( 0 to 0 );
s_axi_mm_arready : in std_logic_vector ( 0 to 0 );
s_axi_mm_rdata : in std_logic_vector ( 255 downto 0 );
s_axi_mm_rresp : in std_logic_vector ( 1 downto 0 );
s_axi_mm_rlast : in std_logic_vector ( 0 to 0 );
s_axi_mm_rvalid : in std_logic_vector ( 0 to 0 );
s_axi_mm_rready : in std_logic_vector ( 0 to 0 );

awvalid : out std_logic;
awready : out std_logic;
wvalid : out std_logic;
wready : out std_logic;
arvalid : out std_logic;
arready : out std_logic;
bvalid : out std_logic;
bready : out std_logic);
end axi_mm_debug;

architecture Behavioral of axi_mm_debug is

begin

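-- Write address (AW) channel handshake
awvalid <= s_axi_mm_awvalid(0);
awready <= s_axi_mm_awready(0);
-- Write data (W) channel handshake
wvalid <= s_axi_mm_wvalid(0);
wready <= s_axi_mm_wready(0);
-- Read address (AR) channel handshake
arvalid <= s_axi_mm_arvalid(0);
arready <= s_axi_mm_arready(0);
-- Write response (B) channel handshake: driven by bvalid/bready, not rvalid
bvalid <= s_axi_mm_bvalid(0);
bready <= s_axi_mm_bready(0);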

end Behavioral;

 

I'm a bit puzzled why I see differences here between the ILA and the oscilloscope.
Can it be that the monitor interface does only record the outputs of the VPSS?

Kind regards

stgateizo

 

0 Kudos
florentw
Moderator
Moderator
453 Views
Registered: ‎11-09-2015

Hi @stgateizo 

To be honest I have no idea why you are seeing differences between the ILA and with the oscilloscope.

But the ILA is interesting here.

We can see that on the write channel, the DDR is taking a really long time before accepting the write. This might really be where the bottleneck is coming from.

I know that when using the Zynq/ZynqMP DDR we recommend using the QoS (Quality of Service) settings to give priority to some ports. I am not sure whether there are the same settings in the interconnect/smartconnect or MIG controller, but this is something which might be worth looking at.


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
0 Kudos
stgateizo
Explorer
Explorer
439 Views
Registered: ‎10-07-2016

Hi Florent,

By the way, how deep is this input FIFO ?

In the VDMA driver structure, I have found a LineBufferDepth of 512 samples. I assume this is indeed the input buffer size?

stgateizo_1-1619181214725.png

If this is really the size of the AXI4-Stream input FIFO, then I would expect the FIFO to overrun before the VPSS initiates the first memory write cycle.

Please look at the oscilloscope record below:

stgateizo_1-1619183357336.png

 

I would tend to say that more than one complete line is written to the VPSS before it starts initiating the first memory write transfer!

One line consists of 3840 pixels, which corresponds to 1920 samples. If the input buffer has a depth of 512 samples, I would say that this should cause a FIFO overflow.
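The comparison between one video line and the reported buffer depth can be sketched like this. It assumes that LineBufferDepth is counted in the same unit as the 2-pixel stream samples, which this thread does not confirm.

```python
pixels_per_line = 3840
pixels_per_clock = 2
line_buffer_depth = 512   # LineBufferDepth from the VDMA driver structure (unit assumed)

samples_per_line = pixels_per_line // pixels_per_clock
fraction_of_line = line_buffer_depth / samples_per_line

print(samples_per_line)
print(f"buffer holds {fraction_of_line:.1%} of one line")
```

Under that assumption the VDMA buffer alone holds well under one line, so any additional lines accepted before the first memory write would have to be stored elsewhere in the pipeline.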

But the main question is: why does the VPSS not start the memory write transfer earlier?

Kind regards
stgateizo

 




 

0 Kudos
florentw
Moderator
Moderator
419 Views
Registered: ‎11-09-2015

Hi @stgateizo 

I would assume there are multiple line buffers in the VPSS. The one you have found is only in the VDMA IP, which buffers the data before sending it to the memory.

But the scaler might have one because it needs multiple line to start processing data.

I do not know if there is an overflow for the line buffer but this should be linked to the TREADY signal from the VPSS

So this is something that you need to keep in mind. First you do the scaler operation, which needs to buffer, I would say, 2 lines. Then the video stream goes to the AXI VDMA, which buffers a few samples to be able to send bursts to the memory.


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
0 Kudos
stgateizo
Explorer
Explorer
412 Views
Registered: ‎10-07-2016

Hi Florent,

yes, I agree. I completely forgot, that the scaler is sitting in front of the VDMA.

stgateizo_0-1619185113953.png

Okay, this explains why we see the very first memory write with a big delay compared to the SOF of the input stream. Furthermore, if the scaler is sitting in front of the VDMA, the video is downscaled from 4K to FHD (1920x1080) before it reaches the VDMA. This means that not much memory bandwidth is required to transfer a frame, maybe 500MByte/s. This is nothing compared to the peak bandwidth of 6400MByte/s...
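A rough estimate of the memory traffic behind the scaler could look like the sketch below. It assumes 3 bytes per RGB pixel packed without padding, frames written at the 60Hz input rate and read at the 50Hz output rate, and no protocol or stride overhead. None of these details are confirmed in this thread, so the result is only an order-of-magnitude figure.

```python
width, height = 1920, 1080
bytes_per_pixel = 3                      # 8-bit RGB, assumed to be packed without padding
frame_bytes = width * height * bytes_per_pixel

write_bytes_per_s = frame_bytes * 60     # frames arrive at the input frame rate
read_bytes_per_s = frame_bytes * 50      # frames leave at the output frame rate
total_bytes_per_s = write_bytes_per_s + read_bytes_per_s

ddr3_peak_bytes_per_s = 6_400_000_000    # peak figure quoted earlier in the thread
share_of_peak = total_bytes_per_s / ddr3_peak_bytes_per_s

print(write_bytes_per_s / 1e6, read_bytes_per_s / 1e6, total_bytes_per_s / 1e6)
print(f"share of DDR3 peak: {share_of_peak:.1%}")
```

With these assumptions the combined read and write traffic stays at a small fraction of the DDR3 peak, which is consistent with the estimate in the post.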

Since the very first write transfer is properly completed (with BVALID = 1), I wonder why we do not see the next memory write transfer.

I assume we don't see the next transfer, since the IP-core which is sitting in front of the VPSS is too much throttled, which in turn causes this IP-core, to stop streaming video data after roughly 1 to 2 lines, because of an internal FIFO overflow.

So we come back to the initial question: "Why is TREADY on the AXI4-Stream input throttling the traffic so heavily?"


Kind regards

stgateizo

0 Kudos
florentw
Moderator
Moderator
401 Views
Registered: ‎11-09-2015

Hi @stgateizo 

I am not sure if this is just a coincidence, but the write transfer seems to start right after a tlast on the read side. It might be how the VPSS is doing its synchronisation. Can you try multiple captures at this point to see if this was just random or if it always happens?

PS. Maybe a better solution for you would be to use the VPSS as individual elements based on your requirements (scaler only, csc only). This way you would have more control over the AXI VDMA.


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
0 Kudos
stgateizo
Explorer
Explorer
392 Views
Registered: ‎10-07-2016

Hi Florent,

[Florent] I am not sure if this is just a coincidence but the write transfer seems to be starting on after a tlast on the read side.

[stgateizo] It's just a coincidence...

stgateizo_0-1619190408114.png

Regards

stgateizo

0 Kudos
florentw
Moderator
Moderator
384 Views
Registered: ‎11-09-2015

Hi @stgateizo 

Ok...Then I have absolutely no clue of what is going on...

Can you try to enable the log in the VPSS to see if it displays any error? Refer to advice #4 in the following article:

https://forums.xilinx.com/t5/Design-and-Debug-Techniques-Blog/Video-Series-27-Getting-started-with-the-Video-Processing/ba-p/960896 


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
0 Kudos
stgateizo
Explorer
Explorer
253 Views
Registered: ‎10-07-2016

Hi Florent,

I have added the XVprocSs_LogDisplay(VpssPtr) function two times in my code.

The first time is right after the initialization of the VPSS, after I have called the function XVprocSs_CfgInitialize. I get the following log message in the UART console:

VPSS log
-----------
Info: Subsystem start init
Info: Topology is Full Fledged
Info: Reset_AxiS init
Info: Reset_AxiMM init
Info: Router init
Info: Csc init
Info: HScaler init
Info: VScaler init
Info: LetterBox init
Info: Video DMA init
Info: Subsystem reset
Info: Zoom window set
Info: Pip window set
Info: Subsystem ready
log end

The second time is a bit later, right after I have called the function XVprocSs_SetSubsystemConfig(). This time I get the following log message:

VPSS log
-----------
Info: Subsystem reset
Info: Subsystem configuration is valid
Info: Full mode - Set scale_down mode
Info: Full mode - Video Routing Table setup OK
Info: Subsystem reset
Info: Full mode - Video Router setup OK
Info: Full mode - Video Data Flow setup OK
Info: Subsystem start
log end
-----------

Does this really help us?


Regards stgateizo

 

0 Kudos
florentw
Moderator
Moderator
246 Views
Registered: ‎11-09-2015

Hi @stgateizo 

This does not help. Sometimes it does as it can show invalid configurations...

I am just trying to find anything which could help us finding the root cause of the behaviour.

Did you try different resolution combinations? Do you always have the same behaviour?

Did you refer to the VPSS example design? You might want to make sure you are initialising the VPSS using the same steps


Florent
Product Application Engineer - Xilinx Technical Support EMEA
**~ Don't forget to reply, give kudos, and accept as solution.~**
0 Kudos
stgateizo
Explorer
Explorer
244 Views
Registered: ‎10-07-2016

Hi Florent,
I would like to clarify if my following understanding is correct. Please can you check...

When I look at the block diagram of the Vivado example design below, I can see that the VPSS and the VTPG are reset by an AXI_GPIO.
When I look into the Microblaze code of the example design, I can see that all GPIO outputs are reset (set low) for a short time (roughly 1000 clock cycles), always before the VPSS and the VTPG are programmed.

I assume it is not sufficient to reset the design once ?

In my design I only reset the VPSS once at startup, but I call the function XVprocSs_Reset() each time, before I program the VPSS...

Does the function XVprocSs_Reset() do the same as a reset pulse on the VPSS ???

stgateizo_0-1619447608957.png

 

Regards
stgateizo

0 Kudos
stgateizo
Explorer
Explorer
231 Views
Registered: ‎10-07-2016

Hi Florent,

see my answers below:

[Florent] Do you always have the same behaviour?
[stgateizo] Yes, it is reproducible.

[Florent] Did you refer to the VPSS example design? You might want to make sure you are initialising the VPSS using the same steps
[stgateizo] Yes, the only thing that is different is the VPSS reset circuit, as mentioned in my previous post. The initialization routine seems to work, otherwise I would not get a video at the lower resolution of 2560x1024@60Hz (see below).

[Florent] Did you try different resolution combinations?
[stgateizo] So far I have tried only 3840x2160@60Hz (progressive), but I can also switch to an internal VTPG as the source. I get the same problem with the VTPG when I set the output timing to 3840x2160@60Hz.

But when I set the VTPG to 2560x1024@60Hz, the VPSS seems to work, since I can see the video properly at the HDMI Tx output interface!
Unfortunately, the VTPG does not provide any moving test images, so it is hard to say whether the video is live, but at least I get the expected picture on the HDMI display.

So we can say that the VPSS is actually working, but not with 3840x2160@60Hz. Tomorrow I will also try other resolutions...

So it really sounds like a bandwidth issue. When I record the AXI-Streaming input of the VPSS with my oscilloscope at 2560x1024@60Hz, I can see the same start as with a 4K timing. TREADY is throttling, but this time the bandwidth is still sufficient to process the video stream without triggering a FIFO overrun in the preceding stage.

AXI-Streaming Input Interface:
D0 = TLAST = End Of Line
D1 = TUSER = Start Of Frame
D2 = TVALID
D3 = TREADY

AXI-Streaming Output Interface:
D4 = TLAST = End Of Line
D5 = TUSER = Start Of Frame
D6 = TVALID
D7 = TREADY

AXI4-Memory Control Signals:
D8 = AWVALID
D9 = AWREADY (not working for some reason)
D10 = WVALID
D11 = WREADY (not working for some reason)
D12 = ARVALID
D13 = ARREADY (not working for some reason)
D14 = BVALID
D15 = BREADY (not working for some reason)

First few lines after receiving a SOF signal at the input of the VPSS (D0-D4) with 2560x1024@60Hz:

stgateizo_2-1619454264751.png

 

Zoomed view at SOF, to see TREADY throttling

stgateizo_1-1619454211091.png

When you zoom further into the record above, you can see that a data transfer lasting one clock cycle is followed by a pause of two clock cycles, and so on. This means I get only 1/3 of the peak bandwidth of the AXI-Streaming input interface.
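
To put the 1/3 figure into numbers, here is a minimal Python sketch that derives the handshake utilization from a sampled TVALID/TREADY pattern and compares the resulting sustained pixel rate with the active pixel rate of both video modes. The AXI4-Stream clock of 300 MHz is purely an assumption (it is not stated in this thread), the function names are made up for illustration, and blanking as well as the first line being accepted at full speed are ignored.

```python
# Sketch only: the clock frequency below is an assumption, not a value from the thread.
ASSUMED_AXIS_CLOCK_HZ = 300_000_000   # hypothetical AXI4-Stream clock
SAMPLES_PER_CLOCK = 2                 # VPSS setting mentioned in this thread


def handshake_utilization(samples):
    """Return the fraction of clock cycles in which TVALID and TREADY are both high."""
    transfers = sum(1 for tvalid, tready in samples if tvalid and tready)
    return transfers / len(samples)


def sustained_pixel_rate(clock_hz, samples_per_clock, utilization):
    """Pixels per second the interface accepts at the given handshake utilization."""
    return clock_hz * samples_per_clock * utilization


def active_pixel_rate(width, height, fps):
    """Active pixels per second of a video mode (blanking ignored)."""
    return width * height * fps


# One transfer cycle followed by two stalled cycles, as seen in the zoomed record.
trace = [(1, 1), (1, 0), (1, 0)] * 100
utilization = handshake_utilization(trace)
sustained = sustained_pixel_rate(ASSUMED_AXIS_CLOCK_HZ, SAMPLES_PER_CLOCK, utilization)

modes = {"2560x1024@60": (2560, 1024, 60), "3840x2160@60": (3840, 2160, 60)}
keeps_up = {name: active_pixel_rate(*m) <= sustained for name, m in modes.items()}

for name, ok in keeps_up.items():
    print(f"{name}: needs {active_pixel_rate(*modes[name]) / 1e6:.1f} Mpixel/s, "
          f"sustained {sustained / 1e6:.1f} Mpixel/s, keeps up: {ok}")
```

Under the assumed clock, a 1-in-3 handshake pattern would still be enough for 2560x1024@60Hz but not for 3840x2160@60Hz, which would match the observed behaviour. With a different clock the margins change, so the real value should be substituted before drawing conclusions.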


Kind regards
stgateizo.

stgateizo
Explorer
Registered: ‎10-07-2016

Hi Florent,
I'm just comparing the example design with my design. One difference is the data width of the AXI memory interface of the VPSS. The example design uses 512 bits, while my design has only 256 bits. Both IP cores are configured to support 2 samples per clock, both support the same maximum resolution, and both are configured as a full-fledged design. The only difference is the supported bits per color component: I support 8 bits and the example design supports 10 bits. Maybe this is the reason why the VPSS memory interface in the example design is twice as wide...

VPSS configuration of example design:

stgateizo_2-1619508399046.png

VPSS configuration of my design:

stgateizo_3-1619508463135.png

But anyhow, as already mentioned in an earlier post, the memory bandwidth shouldn't be the problem, since we need roughly 2x500 MByte/s (for 1920x1080@60Hz), but we have 6400 MByte/s in total...
By the way, the data width of the AXI interconnect and the MIG is also limited to 256 bits in my design, while the example design has 512 bits.
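
The memory bandwidth estimate can be checked with a few lines of Python. This is only a rough sketch: the 4 bytes per pixel is an assumption (it happens to reproduce the "roughly 500 MByte/s" figure, whereas tightly packed 8-bit RGB would need 3 bytes), the function name is made up for illustration, and the 6400 MByte/s is taken as quoted without accounting for DDR3 efficiency losses.

```python
# Sketch only: bytes per pixel is an assumption, not a value confirmed in the thread.
ASSUMED_BYTES_PER_PIXEL = 4            # hypothetical 32-bit pixel packing in memory
AVAILABLE_BYTES_PER_S = 6_400_000_000  # theoretical figure quoted in the thread


def frame_buffer_bandwidth(width, height, fps, bytes_per_pixel):
    """Bytes per second for one direction (write or read) of a frame buffer."""
    return width * height * fps * bytes_per_pixel


# Downscaled 1920x1080@60Hz stream: written once and read once per frame.
one_way = frame_buffer_bandwidth(1920, 1080, 60, ASSUMED_BYTES_PER_PIXEL)
total = 2 * one_way
share = total / AVAILABLE_BYTES_PER_S

print(f"one direction: {one_way / 1e6:.1f} MByte/s")
print(f"write + read:  {total / 1e6:.1f} MByte/s")
print(f"share of quoted memory bandwidth: {share:.1%}")

# For comparison only: the same calculation for a full 3840x2160@60Hz stream.
uhd_one_way = frame_buffer_bandwidth(3840, 2160, 60, ASSUMED_BYTES_PER_PIXEL)
print(f"3840x2160@60Hz, one direction: {uhd_one_way / 1e6:.1f} MByte/s")
```

Under these assumptions, writing and reading the 1080p stream uses well under a quarter of the quoted memory bandwidth, which supports the view that the raw DDR3 bandwidth is not the limiting factor.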

Kind regards

florentw
Moderator
Registered: ‎11-09-2015

Hi @stgateizo 

The data width for the memory interface of the VPSS is auto-propagated:

(You might need to remove the VPSS for this to work.) If you connect the AXI-MM to a 512-bit interface, the width will be propagated automatically when you run block automation.

There is one thing you might want to try: can you increase the max resolution of the VPSS to something higher than what you are actually supporting? For example, if the max resolution you are planning to support is 3840x2160, can you set the VPSS to 3900x2200? I think I have seen weird issues on the forums when the VPSS was running at the max resolution supported. I was not able to reproduce them, but I guess it is worth a shot.


Florent
Product Application Engineer - Xilinx Technical Support EMEA