Contributor
Registered: 01-27-2015

Zynq System AXI Performance Improvement

Hi to all!

I'm working on a Background Subtractor module on a ZedBoard with a Zynq 7Z020.
The system runs PetaLinux and OpenCV to provide the images, and I have accelerated part of the OpenCV MOG background subtractor to run on the FPGA.
But I'm not achieving great performance and I don't understand why. A description of the system follows.
I need to exchange with the FPGA an input image, an output image and 4 vectors (mean, weight, variance and modesUsed).
I chose to use AXI Stream with the VDMA IP for the in/out images, AXI Master for the in/out vectors, and AXI Lite to pass some minor variables and control signals.
The structure is as follows:

 

[Attached block design: fpga_structure.png]

I use the HP0 port for the in/out images and HP1 for the in/out vectors.
Due to FPGA memory limitations I cannot process the entire frame, so I divide it into 30 slices.
My C++ code exchanges data as described in the following pseudocode:

Video processing loop
	extract data and frame from input video with OpenCV
	memcpy entire input image to RAM 
	memcpy entire input vector1 to RAM
	memcpy entire input vector2 to RAM
	memcpy entire input vector3 to RAM
	memcpy entire input vector4 to RAM
	
	FPGA Loop 0 to 29  (send the commands to perform the operations in the FPGA)
		write input slice vector1 to buffer
		write input slice vector2 to buffer
		write input slice vector3 to buffer
		write input slice vector4 to buffer 
		write input slice image to FIFO

		Read the inputs, process the algorithm and prepare the outputs

		write output slice image to FIFO
		write output slice vector1 to RAM
		write output slice vector2 to RAM
		write output slice vector3 to RAM
		write output slice vector4 to RAM
	End FPGA Loop
	provide processed data and frame to output video with OpenCV
End Video processing loop

The Vivado HLS IP latency from the Performance Estimation is under 1 ms. I did some calculations:

One slice = 1 ms; 30 slices = 30 ms

Doubling that to allow for moving the data, I estimated 60 ms per frame.

Processing estimate = 1 / 60 ms ≈ 16.6 fps

I'm getting around 2 fps and I don't understand why.

What is the problem? Why is it so slow? How can I improve it?
Any suggestion is welcome! Many thanks in advance!

 

 

6 Replies
Xilinx Employee
Registered: 10-04-2016

Hi @gicgatv ,

From what you have described, it's hard to say why the performance isn't meeting your expectations. I guess I'd start with the basics:

1. Can you translate your desired performance into MB per second of read and write traffic? I want to see if this is something Zynq 7000 can achieve.

2. Double check some of the basics of your AXI system. What are the bus widths and clock frequencies of S_AXI_HP0/1 on the Zynq 7000? What is your DDR speed and data width? (The DDR format is obviously fixed by the ZedBoard.)

From there we can talk about how to look at AXI Traffic and see if your masters are blocking each other in getting to DRAM.

Regards,

Deanna

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------
Contributor
Registered: 01-27-2015

Dear Deanna
Thank you very much for your answer!!
I'll answer your points in order, as you did.

1) I have estimated my bandwidth needs in MB/s as follows:

Input = (image_in + vectors_in) * 30 (slices) * 16 (fps) = (23.040 KB + 238.080 KB) * 30 * 16 = 125.338 MB/s
Output = (image_out + vectors_out) * 30 (slices) * 16 (fps) = (7.680 KB + 238.080 KB) * 30 * 16 = 117.965 MB/s

2) The S_AXI_HP0/1 bus width is 32 bits @ 100 MHz, while the ZedBoard DDR3 memory interface runs at up to 533 MHz (1066 Mb/s).

Thanks to your suggestion I have prepared a version with 64-bit bus widths.

Reading UG585, page 661, paragraph "22.3.4 Interconnect Throughput Bottlenecks", it should be better to use port 0 and port 2.
So I have changed to 64-bit buses on the HP0/HP2 ports, but at the moment I see no great speed improvement.

I was also wondering about using all 4 ports. Reading UG585, page 662, I should have 4 ports @ 1,200 MB/s.

Should I use an AXI Interconnect block for each port? Or is using two still a good option?
Should I also raise the general clock, for example up to 250 MHz (the maximum option allowed by the Zynq processing system)?

Many thanks in advance, Best regards

Giuseppe

Xilinx Employee
Registered: 10-04-2016

Hi @gicgatv,

With the numbers you have provided, I don't think the AXI Interfaces are the bottleneck. You should have enough bandwidth. 

That said, I'm not sure if you have a hardware or a software issue. I'll suggest two approaches you could take to narrow down the issue.

Do you have a way to characterize how much time your software is spending setting up and servicing the memcpys and VDMA transfers? From your pseudo-code, it seems like you have a lot of processor overhead setting up the vectors and setting up the VDMAs to transfer images compared to the time it takes for the Background Subtractor to do its work.
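For example, here is a rough sketch of the kind of instrumentation I mean, using std::chrono (the stage functions named in the comments are just placeholders for your own copy and VDMA-setup code):

#include <chrono>
#include <cstdio>

// Small helper that returns how long a stage took, in milliseconds.
template <typename F>
double time_ms(F &&stage)
{
    auto t0 = std::chrono::steady_clock::now();
    stage();
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}

// Inside the per-frame loop (copy_inputs_to_ram, run_fpga_slices and
// copy_outputs_from_ram stand in for your existing code):
//   double t_in   = time_ms([&] { copy_inputs_to_ram(); });
//   double t_fpga = time_ms([&] { run_fpga_slices(); });
//   double t_out  = time_ms([&] { copy_outputs_from_ram(); });
//   std::printf("copy-in %.1f ms, FPGA %.1f ms, copy-out %.1f ms\n",
//               t_in, t_fpga, t_out);

That breakdown will tell us whether the time is going into the copies, the VDMA setup, or the accelerator itself.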

Another option would be to add System ILAs to the AXI interfaces on the Background Subtractor. Trigger on whatever kicks off the processing of a frame. (Maybe this is a write to a register in the Background Subtractor?) What you are looking for in the trace is unexpected idles--this would suggest a hardware issue. For example, does the Background Subtractor get held off when it tries to read its input vectors? Or is the image data not available when you expect it to be? 

Regards,

Deanna

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------
Contributor
Registered: 01-27-2015

Dear Deanna,
Thanks a billion for your answer!

I have measured the C++ software times and you were right: it depends more on the CPU part.

A big problem is the memcpy between the CPU and RAM, because it takes around 80-100 ms to copy the image and vectors. This means that for each frame I spend around 160-200 ms on copies, while the FPGA reads from RAM, processes and writes back to RAM in 8.9 ms per slice of data, so for each frame the FPGA needs around 267 ms (8.9 * 30). Together that is roughly 430-470 ms per frame, which is consistent with the ~2 fps I see. In Vivado HLS the performance estimation of the algorithm is now 850 us, while the entire process measured from software is 8.9 ms per slice.

So the problem is the data copy. Is there a way to copy faster? A different memcpy instruction?

I made an experiment copying only one big vector with the same total amount of memory as the 4 vectors. It reduced the copy time by around 50%.
So I understood that one bigger memcpy is faster than 4 smaller memcpys. For that reason I'm thinking of making a structure change.
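For reference, this is roughly the shape of the test I ran (the buffer names are illustrative, ordinary heap vectors are used here just to show the idea, and in my real code the destination is the mapped DMA buffer; the element counts are the ones listed below, times 2 bytes for 16-bit data):

#include <chrono>
#include <cstdio>
#include <cstring>
#include <vector>

int main()
{
    // four separate buffers versus one buffer of the combined size
    const std::size_t n1 = 69120 * 2, n2 = 23040 * 2, n3 = 23040 * 2, n4 = 7680 * 2;
    std::vector<char> src1(n1), src2(n2), src3(n3), src4(n4);
    std::vector<char> dst1(n1), dst2(n2), dst3(n3), dst4(n4);
    std::vector<char> src_big(n1 + n2 + n3 + n4), dst_big(src_big.size());

    auto t0 = std::chrono::steady_clock::now();
    std::memcpy(dst1.data(), src1.data(), n1);      // four small copies
    std::memcpy(dst2.data(), src2.data(), n2);
    std::memcpy(dst3.data(), src3.data(), n3);
    std::memcpy(dst4.data(), src4.data(), n4);
    auto t1 = std::chrono::steady_clock::now();
    std::memcpy(dst_big.data(), src_big.data(), src_big.size());  // one big copy
    auto t2 = std::chrono::steady_clock::now();

    std::printf("4 small copies: %.3f ms, 1 big copy: %.3f ms\n",
                std::chrono::duration<double, std::milli>(t1 - t0).count(),
                std::chrono::duration<double, std::milli>(t2 - t1).count());
    return 0;
}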
What do you suggest?
I need a structure able to contain:
- a 16-bit fixed-point vector of 69120 components
- a 16-bit fixed-point vector of 23040 components
- a 16-bit fixed-point vector of 23040 components
- a 16-bit integer vector of 7680 components

Do you suggest some different approach?

Many thanks in advance.

Xilinx Employee
Registered: 10-04-2016

Hi @gicgatv ,

I'm glad to hear you got to the root of the problem. 

There are ways to do hardware-assisted memory copies. The AXI CDMA IP is a good fit for this requirement.

https://www.xilinx.com/support/documentation/ip_documentation/axi_cdma/v4_1/pg034-axi-cdma.pdf
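In simple (non-scatter-gather) mode a copy is just a few register writes, roughly like this (a sketch only, not tested; "cdma" stands for the already-mapped CDMA register space, src/dst must be physical addresses of physically contiguous buffers, and the register offsets are the ones documented in PG034):

#include <cstdint>

// Simple-mode AXI CDMA copy sketch. Offsets per PG034:
// 0x04 CDMASR (status), 0x18 SA (source), 0x20 DA (destination), 0x28 BTT (bytes to transfer).
void cdma_copy(volatile uint32_t *cdma, uint32_t src_phys, uint32_t dst_phys, uint32_t bytes)
{
    cdma[0x18 / 4] = src_phys;           // SA: source address
    cdma[0x20 / 4] = dst_phys;           // DA: destination address
    cdma[0x28 / 4] = bytes;              // BTT: writing this starts the transfer

    while ((cdma[0x04 / 4] & 0x2) == 0)  // poll the CDMASR idle bit
        ;                                // busy-wait until the copy completes
}

The copy is then done by the DMA engine in hardware instead of by the ARM core.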

Another option would be to ask this question on the HLS Forums to see if there is a better way to structure your HLS code to avoid the memcopy requirement. 

Regards,

Deanna

-------------------------------------------------------------------------
Don’t forget to reply, kudo, and accept as solution.
-------------------------------------------------------------------------
Contributor
Registered: 01-27-2015

Hi Deanna,

Thank you for your answer!

I made some tests in C++ only, and the easiest and best solution to reduce the memcpy to just one is to use a vector of structs like this:

struct t_data {
    float_9bits mean[9];
    float_9bits variance[3];
    float_4bits weight[3];
    ap_int<16>  modesUsed;
};
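With this layout all four vectors for one pixel sit next to each other in memory, so the whole model can be moved in a single memcpy. A simplified sketch of what I mean (dma_buffer is a placeholder for the real mapped buffer; a vector of 7680 t_data entries matches the sizes above, since 9*7680 = 69120 and 3*7680 = 23040):

#include <cstring>
#include <vector>

// one memcpy for mean, variance, weight and modesUsed together,
// instead of four separate copies
void copy_model_to_fpga(void *dma_buffer, const std::vector<t_data> &model)
{
    std::memcpy(dma_buffer, model.data(), model.size() * sizeof(t_data));
}

One thing I still have to check is that the struct packing and alignment the ARM compiler produces match what HLS expects on the FPGA side, otherwise the single copy would land the fields in the wrong places.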

I changed all the code to use the struct, and in the Vivado HLS simulation it works with good results.

But when I try to synthesize the design I run into this error:

Abnormal program termination (11)

and I cannot go any further. :o(

Any idea?

I'm asking about this error on the Vivado HLS forum:

https://forums.xilinx.com/t5/Vivado-High-Level-Synthesis-HLS/Abnormal-program-termination-11-Vivado-HLS-2017-4/m-p/993210

Many thanks in advance!