Contributor
Registered: 09-05-2018

AXI4 Lite Memory Mapped Performance: poor

I am trying to improve the performance of a memory-mapped slave on my AXI4 bus.  On a Zynq 7014S, this peripheral is connected to an M_AXI_GP port on the PS.  

It takes around 24 clock cycles to do a single 32-bit register write.  At my clock of 200 MHz this is almost 120 ns or, going by ARM clock cycles, around 80 cycles for the operation to complete.  The write itself completes quickly, within 4 cycles, but the processor is blocked from doing another operation for the remaining 20 cycles.

Now I am aware that a single operation will have higher overheads than a burst operation (and burst performance seems OK) but this seems like a huge amount of latency.

Approximately 4 cycles seem to be expended in the AXI peripheral completing the write operation - this seems OK to me.

Approximately 9 cycles in total seem to be expended in the protocol converter or interconnect (I have tried both); the protocol converter is faster by 1 clock cycle, it appears.  This is in addition to the 4 cycles that the peripheral takes so the interconnect eats 5-6 cycles.  It would be nice if this could be reduced.

Approximately 13 cycles seem to be taken up in the ARM domain doing unknown things, possibly waiting for the operation to complete or finding a slot for the next operation.  This is the biggest mystery to me.  Can this be improved?  

The ILA capture below shows how I measured the performance of this system.

 

ILA AXI Performance.png
Accepted solution: the reply further down that changes the memory attributes with Xil_SetTlbAttributes.
14 Replies
Adventurer
Registered: 03-15-2012

Re: AXI4 Lite Memory Mapped Performance: poor

I don't think you can speed it up. The latency is probably this large because of clock domain crossings. I think you will have to use bursts and read/write your data in larger chunks. Connecting a BRAM (controller) directly to the AXI3 port is quite easy.

PS: For writes, it may help to experiment with AWCACHE. I think setting it to "bufferable" allows the interconnect to acknowledge the write request (BRESP) itself, instead of waiting for the sink to do so.

Scholar
Registered: 05-21-2015

Re: AXI4 Lite Memory Mapped Performance: poor

@tom667,

It might depend upon how you have the interconnect set up.  Some interconnect settings are deliberately slow for the purpose of using less area.  As @dm78 pointed out, clock domain crossings can be painful, costing up to two cycles of each clock in latency.

There are some other alternatives.  For example, here's an open source AXI to AXI-Lite bridge that has a lower latency.  Each beat has a latency of (roughly) 3 clock cycles, including one cycle for the slave to respond, which beats the 9 cycles from Xilinx.  Even better, the core can maintain 100% throughput.

The following charts show the write burst performance,

axi2axil-write-burst.png

read burst performance,

axi2axil-read-burst.png

and read performance with singletons.

axi2axil-read-single.png

So ... it is possible to do better.

Dan

Contributor
Registered: 09-05-2018

Re: AXI4 Lite Memory Mapped Performance: poor

I have set my interconnect to be Performance optimised, but it does not seem to make much difference.  I have also used a direct Protocol Converter, in an attempt to see whether using this over the Interconnect would help; it does not seem to have helped.

The primary issue seems to be that the ARM/PS block stalls for a large number of cycles when doing any AXI operation.  The interconnects may not be that fast, but most of the time is wasted by the ARM.  Is there anything that can be done to improve performance there?  Would there be any benefit in moving some structures to OCM to free up Central Interconnect bandwidth?  And while I don't want to spend a 64-bit High Performance (HP) port on this (I already use two for other ~500 MB/s parts of my design), is that likely to be faster, since it has a direct path through the memory controller?

 

Xilinx Employee
Registered: 02-01-2008

Re: AXI4 Lite Memory Mapped Performance: poor

The issue is probably due to how the space is mapped. I didn't notice if you are using zynq7 or mpsoc.

Normally, the MMU marks all PL address ranges as 'Strongly Ordered'. This means that the ARM will stall until the access response completes.

If you are using baremetal, you can use Xil_SetTlbAttributes() to change the MMU attributes. Or, you can modify the standalone BSP file translation_table.S. This file will also show you various attribute values.

 

Explorer
Registered: 08-02-2019

Re: AXI4 Lite Memory Mapped Performance: poor

Hi @dgisselq ,

I am having the same problem with AXI-Lite and the AXI Interconnect.

I found a solution using full AXI, but I'm not sure whether it will solve my problem, because in the end I still have to go through the AXI Interconnect.

If the interconnect is the bottleneck, the problem will remain.

In your post you mentioned burst mode, but AXI-Lite does not support bursts.

Did you mean replacing AXI-Lite with full AXI, or using the bridge you linked, which you say sustains high throughput?


Regards

Saban

Explorer
Registered: 08-02-2019

Re: AXI4 Lite Memory Mapped Performance: poor

Hello everybody,

I set the memory attributes with Xil_SetTlbAttributes, as @johnmcd suggested in his post.

It made my AXI-Lite slave 4x faster than before. I strongly recommend it.

I added only three lines of code to my bare-metal application, and a register write went from 20 clock cycles down to 5.

MyXil_SetTlbAttributes(0x43C00000, 0xC06);
mtcp(XREG_CP15_INVAL_UTLB_UNLOCKED, 0);
dsb();

Related Xilinx Article

Saban

AXI-Lite_Slave_Write_Timing_5_Clock_Cycle_After_Changing_Memory_As_Shareable.png
AXI-Lite_Slave_Write_Timing_20_Clock_Cycle_Without_Changing_Memory_As_Shareable.png

Scholar
Registered: 05-21-2015

Re: AXI4 Lite Memory Mapped Performance: poor

@sabankocal,

To achieve higher throughput through the same interconnect, you could either use AXI (full) with burst mode, or AXI-Lite following a bridge that maintains burst performance.

The AXI to AXI-Lite bridge I linked earlier can handle burst mode on the AXI side with no throughput loss to the AXI-Lite slave side--if the slave is up to it. 

Is that your question?

Dan

Explorer
Registered: 08-02-2019

Re: AXI4 Lite Memory Mapped Performance: poor

Hi @dgisselq ,

Yes, that was exactly what I was asking.

Thanks a lot.

Saban

Explorer
Registered: 08-02-2019

Re: AXI4 Lite Memory Mapped Performance: poor

Hi @dgisselq ,

This thread is really about the slave port. I got acceptable results for my AXI-Lite slave port by calling MyXil_SetTlbAttributes.

Now I have a similar performance issue with my AXI-Lite master port. A register write used to take 20 clock cycles; after I changed the HP0 configuration from 64-bit to 32-bit it takes 14 clock cycles. That is still too slow for my timing.

Following your post, I am trying to use your AXI-Lite to AXI bridge in a Vivado block design. I inserted it as a module between my custom IP and the ZYNQ7 HP0 port. Vivado then adds a SmartConnect IP core, as you can see in the attachment.

I tried to package it as an IP, but Vivado could not identify the port names automatically. Before I lose a lot of time on this, I want to ask you:

How can I add axilite2axi.v as a bridge in my block design?

Thanks in advance.

Saban

How_Can_I_Use_axilite_2_axi_In_Block_Design.png

Scholar
Registered: 05-21-2015

Re: AXI4 Lite Memory Mapped Performance: poor

@sabankocal,

Beware that a lot of Zynq processing systems are actually AXI3 systems.  To avoid the interconnect, you'll need to match clocks, resets, ID widths, data widths, and address widths (Zynq is ... 40 bits?) while also creating AXI3.  In other words, you'll have to expand the AxLOCK signals to two bits, drop the top four bits of the AxLEN signal (it's zero anyways), and add an M_AXI_WID signal.  The WID signal should match AWID--it's constant anyway.

That should help you get closer.

Dan

Explorer
Registered: 08-02-2019

Re: AXI4 Lite Memory Mapped Performance: poor

Hi @dgisselq ,

Thanks a lot for your detailed guidance.

So far I have used AXI-Lite master and slave interfaces based on Xilinx's reference designs. Until this week that was fast enough for me, but now I need more performance. The AXI-Lite slave is fast enough; the master is not.

I have never gone deep into the signaling of the AXI interface. We are connecting to the ZYNQ7 HP0 (64-bit/32-bit) AXI3 interface. I compared the AXI3 signals with ours, as you suggested. Your points are all correct, but there are some other differences as well, and I have no experience with those.

I had thought I could use the design you recommended as plug and play. I have full respect for its developer's effort, and it may well work efficiently with other parts.

That is why I want to try a different approach.

Thanks again.

Saban

Explorer
Registered: 08-02-2019

Re: AXI4 Lite Memory Mapped Performance: poor

Hi everybody,

I tried many different options to solve the AXI-Lite master performance issue, and I want to share some good news.

  1. Compared with AXI-Lite, full AXI is much more complicated.
  2. Full AXI needs more resources than AXI-Lite.
  3. If you have enough resources and throughput matters more to you, then you are in the same situation as I am. I can share my full AXI master benchmark results with you.

I tried some other full AXI master implementations from the internet, but in the end I chose Xilinx's example code again.

If you already use Xilinx's AXI-Lite master example code, you can easily migrate your design to the full AXI master example design.

My test results with ramp data:

  1. With Burst Len set to 16 and DATA_WIDTH set to 32 bits (4 bytes), 16 transfers take only 34 clock cycles (see attachment).
  2. With Burst Len set to 256 (the maximum value) and DATA_WIDTH set to 32 bits (4 bytes), 256 transfers take only 274 clock cycles (see attachment).

At first I thought full AXI was very complex and that I would have to handle a lot of control signals myself, but with Xilinx's reference design it is unbelievably easy to use. I would call it a "plug and play" design.

To start a transaction, you just drive init_axi_txn high, and the reference design handles all the other control signals for you.

If you are not interested in the implementation details of this design, the rest of this post will be boring for you, so feel free to stop reading here.

How can you use Xilinx's full AXI master interface reference design?

You need to make some initial configurations according to your needs:

  • Burst Length: how many transfers (beats) you want to execute back to back in one burst.
  • Target Slave Base Address: the start address of your target. You only need to give the start address; the reference design generates the following addresses.
  • C_MASTER_LENGTH: inside the master code there are two tricky parameters, C_MASTER_LENGTH and C_NO_BURSTS_REQ. C_MASTER_LENGTH sets the total amount of data sent per trigger and therefore the total number of bursts.


localparam integer C_MASTER_LENGTH	= 12;
// total number of burst transfers is master length divided by burst length and burst size
localparam integer C_NO_BURSTS_REQ = C_MASTER_LENGTH-clogb2((C_M_AXI_BURST_LEN*C_M_AXI_DATA_WIDTH/8)-1);

 

For example: with a 32-bit (4-byte) DATA_WIDTH, a burst length of 256, and C_MASTER_LENGTH set to 12 (2^12 = 4096), each trigger of the design sends 4096 bytes in total.

  • Every burst contains 256 * 4 bytes (32-bit DATA_WIDTH) = 1024 bytes.
  • The 4096 bytes are therefore sent as 4 bursts. With these values the formula above gives C_NO_BURSTS_REQ = 2, which is the base-2 logarithm of the burst count (2^2 = 4). I added the related screenshot.

 

Saban

AXI_Master_FULL_Dummy_Values_16_Transactions_Takes_34_Clock_Cycle.png
AXI_Master_FULL_Dummy_Values_Amazing_Performance_256_Transactions_Takes_274_Clock_Cycles.png
AXI_Master_FULL_Dummy_Values_Amazing_Performance_256_Transactions_Totaly_4096Bytes_ILA_View.png

Scholar
Registered: 05-21-2015

Re: AXI4 Lite Memory Mapped Performance: poor

@sabankocal,

Be aware, when using Xilinx's AXI master demo IP core that there are some restrictions it doesn't necessarily check for and avoid.  In particular, bursts are not allowed to cross 4kB boundaries.  The master demo never checks for this.  It doesn't need to--since the base address is properly aligned on a 4kB boundary.  If you adjust the base address to anything other than a 4kB boundary, it might become a problem you'll need to work around in order to remain protocol compliant.

Dan

Explorer
Registered: 08-02-2019

Re: AXI4 Lite Memory Mapped Performance: poor

Hi Dan,

You are right. The example defines a 4 kB total transfer by default and never mentions any requirement to align it to a 4 kB boundary.

 

localparam integer C_MASTER_LENGTH	= 12; //means 2^12 = 4096bytes = 4kB

 

Thanks a lot for your important hint.

Saban