Registered: ‎10-25-2014

FFT performance on microzed 7020

Hi,

My RTL design is composed of 4 FFT engines (radix-4, 2048 points max, 16 bits per component, configurable scaling), each connected to the PS through an AXI DMA (no scatter-gather, 256 bursts in both directions).
System clock is 100MHz.
I am using a microzed 7020.

The SW starting the FFT looks like this

 

int runFFT(unsigned int mask /*#FFTs running in parallel*/)
{
	int i;

	for (i = 0; i < elements; i++)
	{
		XAxiDma_IntrDisable(&dataStreamer[i], XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DEVICE_TO_DMA);
		XAxiDma_IntrDisable(&dataStreamer[i], XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DMA_TO_DEVICE);

		XAxiDma_Reset(&dataStreamer[i]);

		XAxiDma_IntrEnable(&dataStreamer[i], XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DEVICE_TO_DMA);
		XAxiDma_IntrEnable(&dataStreamer[i], XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DMA_TO_DEVICE);
	}

	// start data transfer
	// Data destination
	for (i = 0; i < afi->elements; i++)
	{
		if ((mask & (1<<i)))
		{
			if (XAxiDma_SimpleTransfer(&axiFftElem->dataStreamer[i], ((u32) out_ptr_phy), size, XAXIDMA_DEVICE_TO_DMA))
			{
				printf("Error in %s while initializing data transfer DEVICE_TO_DMA\n", __func__);
				return AXIFFTIF_ERR;
			}
		}
	}
	// Data source
	for (i = 0; i < afi->elements; i++)
	{
		if ((mask & (1<<i)))
		{
			if (XAxiDma_SimpleTransfer(&axiFftElem->dataStreamer[i], ((u32) in_ptr_phy), size, XAXIDMA_DMA_TO_DEVICE))
			{
				printf("Error in %s while initializing data transfer XAXIDMA_DMA_TO_DEVICE\n", __func__);
				return AXIFFTIF_ERR;
			}
		}
	}
	
	// wait for transfer end
	for (i = 0; i < afi->elements; i++)
	{
		if ((mask & (1<<i)))
		{
			int pending = 0;
			int status;
			
			status = read(uioData[i], (void *) &pending, sizeof(int));

			if (status < 0) {
				printf("Error in %s; status = %s\n", __func__, strerror(errno));
				return AXIFFTIF_ERR;
			}
		}
	}
	// acknowledge irq
	for (i = 0; i < afi->elements; i++)
	{
		if ((mask & (1<<i)))
		{
			int reenable = 1;
			XAxiDma_IntrAckIrq(&[i], XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DEVICE_TO_DMA);
			XAxiDma_IntrAckIrq(&[i], XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DMA_TO_DEVICE);
			write(uioData[i], (void *) &reenable, sizeof(int));
		}
	}

	// disable irq
	for (i = 0; i < afi->elements; i++)
	{
		if ((mask & (1<<i)))
		{
			XAxiDma_IntrDisable(&afi->element[i].dataStreamer, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DEVICE_TO_DMA);
			XAxiDma_IntrDisable(&afi->element[i].dataStreamer, XAXIDMA_IRQ_ALL_MASK, XAXIDMA_DMA_TO_DEVICE);
		}
	}
		
	return 0;
}



I call this function 10^6 times and measure the overall time; I repeat this activating 1 to 4 FFT engines and get these results:
- 1 FFT => 52us per FFT
- 2 FFTs in parallel => 36us per FFT
- 3 FFTs in parallel => 32us per FFT
- 4 FFTs in parallel => 32us per FFT

This means that using more than 2 accelerators in parallel doesn't bring any benefit.

Could you point out any weak point in the driver, or do you think this is close to the maximum I can get in HW (considering FFT throughput and xbar congestion)?


Thanks,
Max


Accepted Solutions
Teacher muzaffer
Registered: ‎03-31-2012

Re: FFT performance on microzed 7020

How are the AXI DMA blocks connected to the Zynq DDR? Are they all going through a single interconnect to a single HP port, maybe? Try connecting them to different HP ports and see if that helps.
- Please mark the Answer as "Accept as solution" if information provided is helpful.
Give Kudos to a post which you think is helpful and reply oriented.
14 Replies
Xilinx Employee
Registered: ‎08-02-2011

Re: FFT performance on microzed 7020

Hi Max,

 

There are numerous assumptions here, but just looking at theoretical numbers, transferring 2K samples (for 2K point FFT) across a 100MHz interface will take you 2048/100e6 = 20.48us. If you also account for SW setup time of DMAs, DDR latency, FFT latency, IRQ service time, etc, I think 32us latency is not unreasonable.
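The back-of-the-envelope figure above can be reproduced with a small helper. This is only a sketch: it assumes one sample crosses the interface per clock cycle with no stalls, and the name `stream_time_us` is made up for illustration.

```c
#include <stddef.h>

/* Ideal time, in microseconds, to stream `samples` samples across an
 * interface clocked at `clock_hz`. Assumes one sample per clock cycle
 * and no stalls; real DMA transfers add setup time and memory latency. */
double stream_time_us(size_t samples, double clock_hz)
{
    return (double)samples / clock_hz * 1e6;
}
```

For the numbers in this thread, `stream_time_us(2048, 100e6)` gives about 20.48.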

 

This means that using more than 2 accelerators in parallel doesn't bring any benefit.

Make sure not to confuse latency and throughput. Putting 2 accelerators in parallel will have no effect on latency for a given FFT frame. However, since you can now process 2 frames simultaneously, your throughput doubles.

 

Perhaps if you give some more details about what exactly you're trying to accomplish, we can advise on architecture decisions.

www.xilinx.com
Registered: ‎10-25-2014

Re: FFT performance on microzed 7020

Thank you for your reply.

I have a number (4) of FFT engines in PL.
My test wanted to prove that using these guys in parallel would cost me the same time as using only one, but give me 4x throughput.

The test SW consists of a loop which:
- resets DMA
- enables the irq
- sets RX pointer, length and size
- sets TX pointer, length and size => this starts the streaming
- waits for irq
- acknowledges the irq
- disables the irq
Each of these steps is done for all FFTs, i.e. "sets RX pointer, length and size" means that I write the configuration of the S2MM side of all DMAs.

I expected results like
- only 1 FFT active => x
- 2 FFTs => x + (configuration overhead + interrupt handling overhead)
- 3 FFTs => x + (configuration overhead + interrupt handling overhead) * 2
- 4 FFTs => x + (configuration overhead + interrupt handling overhead) * 3

What I got is
- 1 FFT => 52us
- 2 FFTs => 72us
- 3 FFTs => 96us
- 4 FFTs => 130us

So it seems like I get a benefit when going from 1 to 2 engines, but adding more doesn't help: the average duration per FFT stays about the same as with 2 FFTs (32-36us).
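The arithmetic behind the measured totals can be checked with a short sketch. The helper names `per_fft_us` and `marginal_cost_us` are hypothetical; they only divide and subtract the figures quoted in this post.

```c
#include <stddef.h>

/* Average time per FFT when `engines` engines share one call that takes
 * `total_us` microseconds. */
double per_fft_us(double total_us, unsigned engines)
{
    return total_us / (double)engines;
}

/* Extra time (us) added by each additional engine:
 * out[i] = total_us[i + 1] - total_us[i].
 * `out` must have room for n - 1 values. */
void marginal_cost_us(const double *total_us, size_t n, double *out)
{
    for (size_t i = 0; i + 1 < n; i++)
        out[i] = total_us[i + 1] - total_us[i];
}
```

With the totals 52, 72, 96 and 130us, the per-FFT averages come out as 52, 36, 32 and 32.5us, and each added engine costs 20, 24 and 34us respectively. That growing increment is larger than a fixed configuration-plus-interrupt overhead would suggest.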

Xilinx Employee
Registered: ‎08-02-2011

Re: FFT performance on microzed 7020

Ohhh I see. Thanks for the clarification. Yeah, I'd also expect you to be able to do better than that, in that case.

 

How exactly are you measuring the time taken? Processor cycles while you're in runFFT? Or do you have some timer in the PL measuring from data start to IRQ?

 

Looks like you're in Linux? Another interesting test would be to run this in bare metal. That would help delineate HW from SW.

Registered: ‎10-25-2014

Re: FFT performance on microzed 7020

I start the FFT (or the group of FFTs) 10^6 times, I get the system time before and after the loop and I calculate the average per FFT.
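The measurement described above could look roughly like the following sketch. It assumes `gettimeofday` is the system-time source (the later results in this thread name it); the helper names `elapsed_us` and `avg_us_per_fft` are made up for illustration.

```c
#include <sys/time.h>

/* Microseconds elapsed between two gettimeofday() samples. */
long long elapsed_us(struct timeval start, struct timeval end)
{
    return (long long)(end.tv_sec - start.tv_sec) * 1000000LL
         + (long long)(end.tv_usec - start.tv_usec);
}

/* Average time per FFT over `iterations` loop passes, each pass starting
 * `engines` FFTs at once.
 *
 * Intended use (not compiled here):
 *   gettimeofday(&t0, NULL);
 *   for (n = 0; n < iterations; n++) runFFT(mask);
 *   gettimeofday(&t1, NULL);
 *   avg = avg_us_per_fft(t0, t1, iterations, engines);
 */
double avg_us_per_fft(struct timeval start, struct timeval end,
                      long iterations, int engines)
{
    return (double)elapsed_us(start, end) / ((double)iterations * (double)engines);
}
```

Note that a wall-clock average like this includes the DMA setup, the blocking `read` on the UIO device, and the interrupt acknowledge, not just the time the hardware spends on the transform.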

 

Could it be that congestion over xbar slows down the execution?

Xilinx Employee
Registered: ‎08-02-2011

Re: FFT performance on microzed 7020

Hmmm interesting. It's possible that is the case, but it's difficult to say without more details. It could be a number of things. Memory bandwidth might also be an issue.

Are you able to drop in some ILAs to see where the holdup is happening?
Registered: ‎10-25-2014

Re: FFT performance on microzed 7020

These are the numbers I collected trying to profile the software (no ILA available)

1 FFT

Avg pre = 7    Avg calc = 25    Avg post = 5

2 FFTs

Avg pre = 11    Avg calc = 36    Avg post = 8

3 FFTs

Avg pre = 17    Avg calc = 54    Avg post = 11

4 FFTs

Avg pre = 22    Avg calc = 67    Avg post = 14
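One way to read the "Avg calc" column is to ask how much of the additional engines' work appears to run back-to-back instead of overlapping. The metric below is an illustrative assumption, not something taken from the thread, and `serialized_fraction` is a hypothetical name.

```c
/* Fraction of the extra engines' work that appears serialized:
 *   0.0 -> calc time stays constant as engines are added (perfect overlap)
 *   1.0 -> calc time grows as engines * calc_one (no overlap at all)
 * `calc_one` is the calc time with one engine, `calc_n` the calc time
 * with `engines` engines. Only meaningful for engines >= 2. */
double serialized_fraction(double calc_one, double calc_n, unsigned engines)
{
    return (calc_n - calc_one) / ((double)(engines - 1) * calc_one);
}
```

Applied to the calc values 25, 36, 54 and 67, this gives roughly 0.44, 0.58 and 0.56 for 2, 3 and 4 engines, i.e. about half of each extra engine's transfer is not overlapping with the others.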

 

"Avg pre" and "avg post" are the preparation and cleanup: they increase almost linearly and this is fine.

"Avg calc" was expected to be almost constant.

 

---

 

I modified the RTL:

- pipeline architecture instead of radix4

- PL clock @200MHz instead of 100 (I have timing failures but I tried it anyway)

Result: no change.

Is this reasonable?

 

---

 

All DMAs are connected to the ACP port

Teacher muzaffer
Registered: ‎03-31-2012

Re: FFT performance on microzed 7020

It seems you are assuming memory bandwidth is infinite. Include the bandwidth available through ACP and see if you can make sense of your numbers.
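A rough way to put a number on the memory traffic is sketched below. It assumes each FFT point is a complex sample made of two components, that every frame is read once and written once, and that the profiled times are in microseconds (the post does not state the unit). The helper names are hypothetical.

```c
/* Bytes moved for one FFT frame: `points` complex samples read from memory
 * and the same number written back. Each sample is assumed to consist of
 * two components of `bytes_per_component` bytes (2 for 16 bits). */
unsigned long bytes_per_fft(unsigned long points, unsigned long bytes_per_component)
{
    return 2UL * points * 2UL * bytes_per_component; /* 2 directions x 2 components */
}

/* Average bandwidth in MB/s (10^6 bytes/s) needed to move the data of
 * `engines` FFTs within `window_us` microseconds. Bytes per microsecond
 * is numerically equal to MB/s. */
double required_mb_per_s(unsigned engines, unsigned long points,
                         unsigned long bytes_per_component, double window_us)
{
    return (double)engines * (double)bytes_per_fft(points, bytes_per_component) / window_us;
}
```

Under those assumptions one 2048-point frame moves 16384 bytes, so one engine in a 25us window needs about 655 MB/s and four engines in a 67us window need about 978 MB/s. Comparing such figures with what the chosen PS port can actually sustain is the exercise suggested here.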
Registered: ‎10-25-2014

Re: FFT performance on microzed 7020

I connected each AXIDMA to a different HP port and now results are reasonable.

 

One last thing I cannot understand: I tried increasing the PL clock (currently 100MHz), but I always get timing failures within the AXI interconnect module.

Is there a way to reach higher speed here or not? 

Teacher muzaffer
Registered: ‎03-31-2012

Re: FFT performance on microzed 7020

Try adding register slices to the ports which are failing timing in the interconnect.

Registered: ‎10-25-2014

Re: FFT performance on microzed 7020

Is it possible to activate the extra register through the IP configuration options (how?) or do I have to do it manually?

Teacher muzaffer
Registered: ‎03-31-2012

Re: FFT performance on microzed 7020

Yes. Adding register slices is in one of the interconnect configuration pages, maybe advanced, not exactly sure.
Registered: ‎10-25-2014

Re: FFT performance on microzed 7020

By setting the optimization strategy to maximize performance I've been able to reach 166MHz.

Adding register slices didn't bring anything.

I attach the timing report.

Registered: ‎10-25-2014

Re: FFT performance on microzed 7020

I connected each FFT engine to the memory through a different port. At ~150MHz results are
 
Using 1 accelerator(s)
Avg time per loop (using gettimeofday): 55 us
Using 2 accelerator(s)
Avg time per loop (using gettimeofday): 62 us
Using 3 accelerator(s)
Avg time per loop (using gettimeofday): 69 us
Using 4 accelerator(s)
Avg time per loop (using gettimeofday): 77 us
 
which seem consistent to me.
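To express these loop times as a throughput gain, a small helper such as the following can be used. It is a sketch; `throughput_speedup` is a hypothetical name, and it simply compares FFTs completed per unit time against the single-engine case.

```c
/* Throughput gain of running `engines` FFTs per loop pass relative to one
 * engine: (engines / loop_n_us) divided by (1 / loop_one_us). */
double throughput_speedup(double loop_one_us, double loop_n_us, unsigned engines)
{
    return (double)engines * loop_one_us / loop_n_us;
}
```

With the loop times 55, 62, 69 and 77us, the per-FFT times are 55, 31, 23 and 19.25us, and the gain over a single engine is roughly 1.77x, 2.39x and 2.86x for 2, 3 and 4 engines.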