sallwine
Adventurer

Xil_In32 taking excessively long time to read two registers


Hello - I'm using a KC705 with Vivado 2018.3.  I've built a MicroBlaze subsystem with DDR, GPIO, UART, timers, and an interrupt controller, and it all works fine.  The MicroBlaze is clocked at 200 MHz from the DDR3 ui_clk (1/4 of the DDR3 clock).  Outside the MicroBlaze, in VHDL, I have a 32-bit counter free-running at 8 ns that continuously writes to a 32-bit input GPIO register.

I have a function defined in C as:

double TickNow()
{
    return (double)Xil_In32(EUB_COUNTER_REGISTER);   // free-running counter register
}

This returns the value of the counter register.  In main.c, I read the register twice and print the delta:

//----- first test, measure timer delay
double tick1 = TickNow();
double tick2 = TickNow();
double delta_t = DeltaTimeNs(tick1, tick2);
xil_printf("time delay t1 %u t2 %u d %u = %u ns \r\n",
           (u32)tick1, (u32)tick2, ((u32)tick2 - (u32)tick1), (u32)delta_t);
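
DeltaTimeNs just converts the tick delta to nanoseconds using the 8 ns counter period (sketch only, the real code may differ slightly):

double DeltaTimeNs(double tick1, double tick2)
{
    return (tick2 - tick1) * 8.0;   // 8 ns per counter tick
}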

When this code is run, it takes >40 usec to execute two register reads, a subtract, and then a print.  This seems very long.

From the PuTTY console:

time delay t1 2568499072 t2 2568505516 d 6444 = 51552 ns

time delay t1 3969736537 t2 3969743121 d 6584 = 52672 ns

time delay t1 476546419 t2 476552701 d 6282 = 50256 ns

Any ideas why this simple code execution could be taking so long?

Thanks - Scott

 

12 Replies
sallwine
Adventurer
I changed all the doubles to u32 (roughly as sketched below); this made a 40 usec improvement, but it still seems very slow for a processor clocked at 200 MHz.
• time delay t1 42226722 t2 42228117 d 1395 = 11160 ns
• time delay t1 46721330 t2 46722724 d 1394 = 11152 ns
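The u32 version of the read helper is roughly (sketch only, same register macro as before, just no double conversion):

u32 TickNow()
{
    return Xil_In32(EUB_COUNTER_REGISTER);   // read the free-running counter register directly
}

The delta-to-ns math is unchanged: the counter still ticks every 8 ns (1395 ticks x 8 ns = 11160 ns, matching the first line above).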
dgisselq
Scholar

I wouldn't expect super high speed.  Other experiments suggest 1-3 MHz is about as good as you should be able to get.  Still, this is much slower than that.

A couple of questions.  First, how are you crossing the clock boundary from the 8 ns counter domain to the bus clock?  There are a lot of ways to do this wrong, and many of the right ways don't update all that often.

Second, is there any way you can get a trace from within the AXI subsystem showing both AXI requests to read from your device?  I think you might learn a lot from such a trace.

Third, what sort of AXI peripheral are you using?  Remember, there's a couple clocks lost every time a bus format change takes place.  Examples of such format changes would include AXI to AXI-lite, or even AXI of one data width to an AXI port of another.

Dan

sallwine
Adventurer
1.) I'm just reading the value of the register; I'm not clock crossing, since the processor is running the application. I know I'd see a variance of 20-24 nsec, but I'm way off that mark.
2.) When you say trace, are you asking to use the MDM to trace the AXI request, or to drop in an ILA to capture it? I'll do that this afternoon, with the ILA that is.
3.) The AXI peripheral is an AXI GPIO, configured as input only, and it's connected to an AXI Interconnect. Everything appears to be plain AXI, not AXI-Lite. I do have two DMA engines running four AXI-Stream Data FIFOs, but those aren't in the loop yet when this counter test is run.
dgisselq
Scholar

@sallwine,

Yes, my suggestion above was to use Xilinx's ILA or something similar.

I just checked in Vivado 2018.3, and Xilinx's AXI GPIO is an AXI-Lite core.  A quick look through the code reveals that it can only handle one read or one write every 5 clocks (it cannot handle both a read and a write at the same time).  So let's assume a single clock domain and count in bus clocks (not the 8 ns counter clock, but your bus clock rate, which you said was 20 ns, right?).  Then we can estimate:

  • The CPU cannot issue more than one peripheral load/store at a time, since 1) peripheral memory cannot be cached, and 2) peripheral I/Os cannot be rolled back
  • Judging from my own work building an AXI interconnect, it will take two clocks to send a transaction and another to receive one.
  • Going from AXI to AXI-lite will likely cost another two clocks out and one clock in return, based on my own AXI->WB bridge work
  • If you had to cross clock domains, that would cost a minimum of two clocks out and two more clocks in return, assuming an asynchronous FIFO implementation.  A simpler and lighter implementation might cost two more clocks.  I'll skip this count below, though, since you said you were doing everything on the same clock (even though I'm confused as to what clock rate you are using).
  • The AXI GPIO requires 5 clocks per transaction

Putting all this together, a pair of subsequent reads should take a minimum of 11 clocks.  If you are running your AXI infrastructure at 50 MHz (i.e. the 20 ns you cited above), the fastest you could read two values would be about 4.5 MHz.
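
(Spelling that arithmetic out: 11 clocks x 20 ns = 220 ns, and 1 / 220 ns works out to roughly 4.5 MHz.)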

The numbers you have cited above are still much slower than this.

Let me ask, are you running the background DMA on this core as well?  The AXI-GPIO core cannot handle both reads and writes at the same time--one of the two will stall.

Another possibility is that you are trying to bridge a counter across clock domains, and some of the bits are arriving earlier than others.  Were that the case, though, I'd expect the delay to be highly unpredictable.  The numbers you have cited above appear to be somewhat predictable.  Perhaps there's something more that you aren't sharing?

Dan

sallwine
Adventurer
I'm running place-and-route on a system ILA in the BD now. I have an AXI slot and a native port so I can look at the count values at the same time. To clarify, my uB BD is connected to the DDR3 MIG clock, which is 200 MHz, not 50 MHz. On the AXI Interconnect I have 17 master ports, including the controls for the two DMAs. Could that be part of my problem? I'm also fixing the 8 ns counter so it sits on the 5 ns clock; the counter will be on the same clock as the uB BD, 5 ns.
sallwine
Adventurer
I put the counter on the 200 MHz domain. It's still taking 11 usec to read the two registers and then print out the results. My first attempt to put a system ILA in the BD failed timing, so I'm working on that now to see the AXI transactions. But it's still an order of magnitude slower than I'd think it should be.
dgisselq
Scholar

@sallwine,

I just noticed something I was missing before: you are casting the counter's value to a double.  That might be the problem.  If you simply read the value in as an integer, do you still have the problem?

Dan

sallwine
Adventurer

I have changed them to u32 from double.  When I had them as doubles it was 50 usec; changing them to u32 got me down to 11 usec.  I'll try just casting them as int.  Thanks.

dgisselq
Scholar

There shouldn't be any difference between uint32_t's and int's.  It was a thought, though.

Dan

dgisselq
Scholar

Another thing to try would be to remove the subroutine calls, and just call the read command in-line.  If the compiler doesn't optimize the subroutine calls, the various jumps will cost several cycles each.  If this fixes things, then try turning on all compiler optimizations.
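
For example, the two reads might go directly in main(), something like this (just a sketch, reusing the register name from your code above):

u32 t1 = Xil_In32(EUB_COUNTER_REGISTER);   // first read, no subroutine call
u32 t2 = Xil_In32(EUB_COUNTER_REGISTER);   // second read immediately after
xil_printf("raw delta %u ticks \r\n", t2 - t1);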

Dan

sallwine
Adventurer

Finally able to get the Xil_In32 reads below 1 usec, ~550 ns per read.  Putting the command in-line helped a little bit, but what made the most difference was building in the SDK in release mode instead of debug mode.  There must be something the SDK does in debug mode that slows things down a lot.  Now I'm going to try clocking the uB at 256 MHz instead of 200 MHz to improve on the 550 ns.

(Accepted solution)

dgisselq
Scholar

@sallwine,

I'm so glad you figured this out!  I had been wondering what the difference might be.

I know from my own ZipCPU efforts that there's a very significant difference between the code generated when using "gcc -g" versus "gcc -O3".  "gcc -g" isn't likely to inline, and every value accessed will go to memory instead of the processor's local register set.  Hence, for an instruction A = B + C, the CPU would read B from memory into a register, read C from memory into another, add them, and then store the result back into memory.  Even if the next instruction uses A, B, or C, in "-g" mode gcc would load the values again from memory.  Highly inefficient, but ... that's gcc with "-g" and no optimizations.  I don't know if Vivado uses gcc or not, although I do know that GCC support for MicroBlaze was upstream when I last built GCC support for the ZipCPU.
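
If you want to see the difference for yourself, one (hypothetical) experiment is to compile a trivial function with and without optimization and diff the assembly, e.g. "mb-gcc -O0 -S test.c" versus "mb-gcc -O2 -S test.c" (mb-gcc being the MicroBlaze GCC that ships with the SDK), on something like:

int sum_twice(int b, int c)
{
    int a = b + c;   // at -O0, b and c get loaded from the stack and a gets stored back
    return a + a;    // at -O0, a is reloaded from memory; at -O2 everything stays in registers
}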

Again, I'm glad to hear you figured this out!

Dan
