Visitor jroullier
Visitor
8,133 Views
Registered: ‎03-30-2016

Video Processing with ARM on Zynq (Bare Metal)

Hello,

 

I want to do some basic software processing on a video flow using ARM (bare metal).

Right now, I have achieved the following :

 

Test Pattern Generator -> VDMA S2MM (HP0) to Address Space 0 ->  ARM memcpy frame from Address Space 0 to Address Space 1 -> VDMA MM2S (HP0) from Address Space 1 -> HDMI Out

 

The S2MM VDMA generates an interrupt each time it starts a frame, so I know the previous one is ready. Once the first frame is ready (init), the interrupt routine copies it to Address Space 1 (using memcpy...) and launches the MM2S VDMA.

 

Both VDMAs are in master mode with fsync.

I'm using 3 write frame buffers and 3 read frame buffers so basically the following is happening :

 

Frame buffers 0, 1, 2 in Address Space 0

Frame buffers A, B, C in Address Space 1

 

VDMA S2MM     SOFT (ARM)     VDMA MM2S

write in 0    -              -
write in 1    copy 0 to A    -
write in 2    copy 1 to B    read A
write in 0    copy 2 to C    read B
write in 1    copy 0 to A    read C
...
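The schedule in the table above can be sketched as three index functions. This is only an illustration of the rotation: the function names and the -1 "idle" convention are made up for this sketch and are not part of the VDMA driver. Indices 0, 1, 2 on the read side stand for buffers A, B, C.

```c
#define NUM_BUFFERS 3

/* Buffer index touched by each stage during frame period n (0-based).
 * A return value of -1 means the stage is still idle during start-up.
 * These helpers are hypothetical, for illustration only. */

/* S2MM writes buffers 0, 1, 2, 0, 1, ... in Address Space 0. */
static int s2mm_write_index(int n)
{
    return n % NUM_BUFFERS;
}

/* The ARM copies the frame written in the previous period. */
static int arm_copy_index(int n)
{
    return (n >= 1) ? (n - 1) % NUM_BUFFERS : -1;
}

/* MM2S reads the frame copied in the previous period
 * (0 = A, 1 = B, 2 = C in Address Space 1). */
static int mm2s_read_index(int n)
{
    return (n >= 2) ? (n - 2) % NUM_BUFFERS : -1;
}
```

Each stage runs one period behind the previous one, so the copy has exactly one frame period to finish before the S2MM side comes back around to a buffer that is still being read.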

 

 

This works fine as long as the memcpy completes within one video frame period.

The thing is, for a resolution of 1280x720 pixels (32-bit RGB, the 8 MSBs are not used) I see a lot of frame drops/tearing (it doesn't happen at lower resolutions), so I suspect the memcpy isn't finished when the next frame arrives. I think the VDMA detects that something is reading the frame buffer it is trying to write, so it skips that frame buffer.
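To put rough numbers on that suspicion, here is a small sketch of the per-frame budget. The 60 fps figure used in the note below is an assumption (the frame rate is not stated in this post), and the helper names are hypothetical.

```c
#include <stdint.h>

/* Size of one frame in bytes. */
static uint32_t frame_bytes(uint32_t width, uint32_t height,
                            uint32_t bytes_per_pixel)
{
    return width * height * bytes_per_pixel;
}

/* Time available to process one frame, in microseconds (rounded down). */
static uint32_t frame_period_us(uint32_t fps)
{
    return 1000000u / fps;
}

/* DDR traffic generated by the CPU copy alone, in bytes per second:
 * every frame is read once and written once. */
static uint64_t copy_traffic_bytes_per_s(uint32_t bytes_per_frame,
                                         uint32_t fps)
{
    return 2ull * bytes_per_frame * fps;
}
```

For 1280x720 at 4 bytes per pixel, frame_bytes gives 3,686,400 bytes. At an assumed 60 fps the copy has to finish in about 16.6 ms and moves roughly 442 MB/s through the CPU, on top of the traffic from the two VDMA channels.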

 

As I want to process the video data (simple case: invert it with a bitwise NOT), I tried to replace the memcpy with the following:

 

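for (int y = 0; y < vsize; y++)
{
    for (int x = 0; x < hsize; x++)
    {
        /* Byte offset of pixel (x, y) inside the frame, 4 bytes per pixel */
        u32 pixel_offset = offset_frame + 4 * (x + y * hsize);
        u32 data = Xil_In32(ADRESS_SPACE_0 + pixel_offset);
        data = ~data & 0xffffff;  /* invert RGB, leave the unused 8 MSBs at 0 */
        Xil_Out32(ADRESS_SPACE_1 + pixel_offset, data);
    }
}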

 

 

which is way slower than memcpy (I see even more frame tearing and frame drops even at lower resolutions).
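One likely reason for the slowdown is that Xil_In32 and Xil_Out32 perform a single volatile 32-bit access per call, with the address recomputed for every pixel, so the compiler cannot unroll or vectorize the loop. A minimal sketch of the same inversion through plain pointers is shown below. It assumes the frame buffers sit in cacheable DDR; on the Zynq you would then need cache maintenance around the call, for example Xil_DCacheInvalidateRange on the source and Xil_DCacheFlushRange on the destination (from xil_cache.h). The function name invert_frame is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Invert the 24 RGB bits of every pixel and clear the unused top 8 bits.
 * src and dst point at the start of one frame; num_pixels = hsize * vsize.
 * Assumption: on the target, the caller invalidates the data cache for src
 * before the call and flushes it for dst afterwards, so that the CPU and
 * the VDMA see the same data. */
static void invert_frame(const uint32_t *src, uint32_t *dst,
                         size_t num_pixels)
{
    for (size_t i = 0; i < num_pixels; i++) {
        dst[i] = ~src[i] & 0x00FFFFFFu;
    }
}
```

With a simple linear loop like this and optimization enabled, the compiler is free to unroll it and use wider loads and stores. Whether that reaches memcpy speed would have to be measured on the board.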

 

Questions are :

- Can I possibly access and process data as fast as the memcpy ?

- Are there better ways to process large video data using the ARM in bare metal ?

 

Cheers,

 

Jeremy

 

PS : forgive any language mistakes as English isn't my main language

6 Replies
Visitor jroullier
Visitor
8,125 Views
Registered: ‎03-30-2016

Re: Video Processing with ARM on Zynq (Bare Metal)

I just figured out I posted in the wrong section of the forum...
Hope someone can fix it
Xilinx Employee
Xilinx Employee
7,637 Views
Registered: ‎08-02-2011

Re: Video Processing with ARM on Zynq (Bare Metal)

Hi Jeremy,

 

You're learning first hand the value of the FPGA! :)

 

The fact is, touching every pixel sequentially with the processor (to/from DDR) is expensive. As resolutions/frame rates go up, that becomes harder and harder for a processor to handle.

 

Yeah, you can probably do some things to optimize and solve the tearing, but your frame rate is still going to be relatively low at HD resolutions.

 

The general philosophy is to partition the design such that the PL does the low-level pixel processing (e.g. filtering, color space conversion, edge detection) in parallel on the FPGA in a streaming fashion, which is relatively inexpensive. Then the higher-order algorithms (decision making, object tracking, etc.) run on the ARM and process 'regions of interest' or metadata that has been extracted from the whole image.

 

What exactly are you trying to implement?

Visitor jroullier
Visitor
5,726 Views
Registered: ‎03-30-2016

Re: Video Processing with ARM on Zynq (Bare Metal)

Hi bwiec,

 

Thanks for replying and sorry for the delay.

 

My work (a 6-month internship) aimed to benchmark 3 approaches to developing video processing algorithms :

- hardware (RTL)

- software with the CPU

- so called "hybrid" with Vivado HLS

 

I managed to get about 5 FPS at most in Full HD with the CPU running the 3x3 window filter (with interrupt-based triggering, manual cache flush/invalidate, opt3, ...), which is consistent with the theory. I estimate that such a filter needs approximately 150 CPU clock cycles per pixel.
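For readers who want to see where the per-pixel cost comes from, here is a minimal sketch of a 3x3 window filter. The thread does not say which kernel was used, so a plain box (mean) filter on an 8-bit single-channel image is assumed here, and the name box3x3 is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* 3x3 mean filter on an 8-bit, single-channel, row-major image.
 * Border pixels have no full 3x3 neighbourhood and are copied unchanged.
 * Assumption: this stands in for the unspecified filter in the post. */
static void box3x3(const uint8_t *src, uint8_t *dst,
                   size_t width, size_t height)
{
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {
            size_t idx = y * width + x;
            if (x == 0 || y == 0 || x == width - 1 || y == height - 1) {
                dst[idx] = src[idx];
                continue;
            }
            /* Sum the nine neighbours; the maximum is 9 * 255 = 2295. */
            uint32_t sum = 0;
            for (size_t ny = y - 1; ny <= y + 1; ny++) {
                for (size_t nx = x - 1; nx <= x + 1; nx++) {
                    sum += src[ny * width + nx];
                }
            }
            dst[idx] = (uint8_t)(sum / 9u);
        }
    }
}
```

Every interior output pixel costs nine loads, nine additions, a division and a store, plus loop and addressing overhead and any cache misses, which is how a simple window filter can end up costing on the order of a hundred cycles per pixel.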

 

In fact, my work proved the benefits of hardware-accelerated computation :)

 

Regards,

 

Jeremy

Xilinx Employee
Xilinx Employee
5,263 Views
Registered: ‎08-02-2011

Re: Video Processing with ARM on Zynq (Bare Metal)

Hi Jeremy,

 

Sorry I missed this... for some reason the email notifications from the forum haven't been working for me for a couple of weeks.

 

Anyway, I'm glad you were able to make progress and draw a successful conclusion!

 

Have a great day.

Visitor bruce_li
Visitor
1,820 Views
Registered: ‎12-27-2017

Re: Video Processing with ARM on Zynq (Bare Metal)

Hi, jroullier

I'm puzzled about how to set up the VDMA registers to achieve the video processing flow you described. Could you give me some more specific information about this? Thank you very much.

Visitor bruce_li
Visitor
1,815 Views
Registered: ‎12-27-2017

Re: Video Processing with ARM on Zynq (Bare Metal)

Hi, bwiec

I'm puzzled about how to set up the VDMA registers to achieve the video processing flow described in this post. Could you give me some more specific information about this? Thank you very much.
