Visitor
Improve IO performance for MicroBlaze to IO register using AXI


Hi all,

I am trying to improve the speed of IO operations. I have to implement a relatively complex protocol over SPI, and I realized that controller register accesses are eating up a lot of my time. Besides looking for code overhead within the drivers, I also tried to find out how long such a register access takes. I did a toggle experiment on an IO port by generating consecutive write operations:

[Image: tklosa_0-1604333862681.jpeg — listing of the consecutive GPIO write operations]
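(For reference, a minimal sketch of what such a sequence of back-to-back writes might look like with the standalone BSP; the code in the screenshot above is the authoritative version. The macro name XPAR_AXI_GPIO_0_BASEADDR and the 0x0/0x4 register offsets are assumptions based on a default AXI GPIO instance.)

    #include "xil_io.h"
    #include "xparameters.h"

    /* Assumed macro name for the AXI GPIO instance; adjust to whatever
     * xparameters.h generates for your design. */
    #define GPIO_BASE  XPAR_AXI_GPIO_0_BASEADDR
    #define GPIO_DATA  0x0  /* channel-1 data register (AXI GPIO default) */
    #define GPIO_TRI   0x4  /* channel-1 tri-state register               */

    static void toggle_test(void)
    {
        Xil_Out32(GPIO_BASE + GPIO_TRI, 0x0);   /* all channel-1 bits as outputs */

        /* Each Xil_Out32 boils down to a single store (swi) on MicroBlaze,
         * so the edge-to-edge spacing seen on the scope corresponds to the
         * time one AXI write takes to complete. */
        Xil_Out32(GPIO_BASE + GPIO_DATA, 0x1);
        Xil_Out32(GPIO_BASE + GPIO_DATA, 0x0);
        Xil_Out32(GPIO_BASE + GPIO_DATA, 0x1);
        Xil_Out32(GPIO_BASE + GPIO_DATA, 0x0);
    }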

The compiler really translates each write into a single asm instruction. Looking at the scope output, I get

[Image: tklosa_1-1604334227191.jpeg — oscilloscope capture of the toggling pin]

That is 120 ns for the execution of one asm instruction. My system runs at 100 MHz, so the instruction takes 12 clocks. I am looking for hints to reduce this. Below is information about my system and the things I have already tried:

- Artix-7 100-2 on a TE0713 module made by Trenz

- The MicroBlaze system is based on the example design provided by Trenz for this module

- Code runs from external DDR3 memory

- Input clock for the DDR controller is 200 MHz, memory at 400 MHz, CPU at 100 MHz (this seems to be the usual approach)

- Cache is enabled

- Moving the code sections (.text, .stack, ...) to internal block RAM running at 100 MHz has almost no effect on speed

- Compiler optimization is -O3, but this is not relevant here since we are already looking at the asm

- The GPIO unit is the AXI GPIO (2.0) provided with Vivado

- Between the CPU and the GPIO there is an AXI Interconnect (2.1) with 1 slave interface for the MicroBlaze core and 22 master interfaces for peripheral devices

- There is no DMA or other access on the bus at the same time
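(Since the question is how long one register access takes, here is a minimal sketch of how the same number could be cross-checked in software with an AXI Timer. This assumes an axi_timer instance is present in the design and clocked from the same 100 MHz AXI clock, neither of which the post states; the macro names are likewise assumptions.)

    #include "xparameters.h"
    #include "xstatus.h"
    #include "xtmrctr.h"
    #include "xil_io.h"
    #include "xil_printf.h"

    #define GPIO_BASE  XPAR_AXI_GPIO_0_BASEADDR   /* assumed macro name           */
    #define TIMER_ID   XPAR_TMRCTR_0_DEVICE_ID    /* assumes an AXI Timer is there */
    #define N_WRITES   1000

    static XTmrCtr Timer;

    static void measure_gpio_write(void)
    {
        if (XTmrCtr_Initialize(&Timer, TIMER_ID) != XST_SUCCESS)
            return;
        XTmrCtr_SetOptions(&Timer, 0, 0);   /* plain free-running up-counter */
        XTmrCtr_Reset(&Timer, 0);
        XTmrCtr_Start(&Timer, 0);

        u32 start = XTmrCtr_GetValue(&Timer, 0);
        for (int i = 0; i < N_WRITES; i++)
            Xil_Out32(GPIO_BASE, i & 1);    /* back-to-back data-register writes */
        u32 end = XTmrCtr_GetValue(&Timer, 0);

        /* With the timer on the same 100 MHz AXI clock, cycles per write is
         * roughly (end - start) / N_WRITES, minus a little loop overhead. */
        xil_printf("%u cycles for %d writes\r\n", end - start, N_WRITES);
    }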

 

Thank you,
Thomas

3 Replies
Scholar

@tklosa ,

That AXI GPIO module is known for terrible AXI4-lite performance. This AXI-lite design will outperform it by a couple of clock cycles, but certainly not by 10+.

The interconnect itself is also known for poor AXI4-lite performance.  Neither MicroBlaze nor ARM issues writes to the bus all that fast; they wait instead for any prior write to complete.

Perhaps something like this might help explain why your design takes so long to toggle GPIO pins, even though the explanation is for Wishbone + ZipCPU rather than the (proprietary and opaque, thus hard to analyze) MicroBlaze.

Dan

 

Visitor

Hi Dan, and anyone else who might read this.

Thanks for the help. I may give the ZipCPU a try, maybe in upcoming projects. It seems there is no misconfiguration or other simple change I could make to avoid the wait or arbitration clocks.

I put an ILA probe on the AXI interface after the controller and on the AXI interface between the bus mux (interconnect) and the IO block.

[Image: tklosa_0-1604419927552.png — ILA capture of the AXI transactions]

It seems the bus switch takes 4 clocks to hand the transaction over, and the IO block requires 5 clocks to complete the operation.
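(A rough budget from these numbers, where the split is an assumption since the MicroBlaze side of the transaction is not visible in the capture: roughly 4 clocks in the interconnect plus 5 clocks in the GPIO slave is about 9 clocks on the bus, leaving on the order of 3 of the 12 observed clocks, i.e. 120 ns at 100 MHz, for the MicroBlaze itself to issue the store and retire the write response.)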

Scholar (Accepted Solution)

@tklosa ,

The AXI protocol requires that there can be no combinatorial paths between inputs and outputs. 

[Image: axi-spec-registered.png — excerpt from the AXI specification on registered interfaces]

That means that the minimum number of clocks to go through any bridge is two--one in, and one out.  The minimum number of clocks through any slave is one.  Interconnects often require 2-3 more clocks just to handle the logic required within them.

If you want performance, you will need to do one of ...

  1. Pipeline your interactions.  By this I mean issuing the next request before the result has been returned from the first one.  Some CPUs can do this, such as the ZipCPU.  The problem with doing this is that the CPU will continue several instructions beyond one that might return a bus error--simply because it can't stop on a bus error until the result finally gets returned.
  2. Use AXI (full) bursts to achieve multiple reads/writes at a time.  To my knowledge, neither MicroBlaze nor ARM does this.  Some CPUs do this when dealing with cache accesses, but GPIO devices cannot be cached--so that doesn't help here.
  3. Get rid of the CPU--for the reasons noted above.
  4. Make sure there are no AXI translations.  The AXI to AXI-lite translation taking place within the interconnect will cost at least another two clock periods.
  5. Increase your system clock rate.  This means increasing the rate of the CPU, the AXI interconnect, the GPIO controller and more.  Increasing one of these but not all of them will cause the system to slow down every time it crosses a clock domain boundary.

I could go on, but realistically the bottom-line answer is not to use the CPU for any sort of real-time or time-critical application.

Dan
