7,413 Views
Registered: ‎10-09-2013

Problems branching to dynamically loaded routine using bare metal approach


I've only been working with the Zynq/ARM Cortex-A9 for a few months (but I have 20+ years of experience writing embedded code on other processors/DSPs), so bear with me on my long-winded explanation. My application is a bare-metal monitor program that tests various functions of the Zynq SoC using commands transmitted over UART. The monitor is a "port" of a prior monitor written for a different processor family. The monitor allows probing memory, hardware registers, etc., and also provides the capability to upload programs dynamically using text-based (S-record) files to specific locations in memory and then branch to them. I've gotten all of the functionality of the monitor to work on a ZedBoard except for the "branching" part, and I have hit a wall. Please don't recommend building all of the routines into the monitor or using an OS; those aren't options available to me, or I would have done that already.

 

Single-stepping through the code revealed that my dynamically loaded program is correct and in the proper location, yet once the processor branches to that location the registers no longer update, and execution eventually ends in a "data abort exception". I set up an SVC/exception handler approach, thinking that this was caused by user mode operation and that supervisor mode would correct it, but that didn't work as I expected. Ironically, the SVC approach sometimes magically starts to work after several failed attempts, but cycling power on the ZedBoard puts it back into a non-working state, so there's obviously some strange trick I'm missing that gets it into the right mode somehow. Attempts to disable the "execute never" function of the ARM using "MRC" and "MCR" coprocessor instructions haven't been successful either, so I must be doing something wrong.

 

A very helpful Xilinx rep tried to help and suggested a different approach based on the FSBL template routine "FsblHandoffExit()", which accepts an address, disables the instruction cache and MMU, and then branches to the supplied address. This looked like the solution, and I had been trying similar stuff, so it was promising. The problem is that for some reason executing the "MCR" instruction to disable the cache/MMU causes the memory where I stored my routine to be corrupted. I can then put some instructions into the corrupted locations manually and they will execute, so this appears to be the right track, but how do I keep the loaded routine from being corrupted? I've tried multiple locations, and it appears that virtually every memory location outside of my monitor program space gets corrupted.

 

This type of operation is crucial to all processors so I doubt I'm the first to have this problem.  Heck, Linux does this every time it calls a module.  Does anyone have any ideas or examples of how to fix this?

 

I appreciate any help you can provide.

10 Replies
Visitor wojciec
7,402 Views
Registered: ‎01-09-2014

Re: Problems branching to dynamically loaded routine using bare metal approach


Michael,

 

What you explained is a fairly sophisticated program. Loading code dynamically into the memory and jumping to it isn't as easy as it may sound.

 

You also didn't give enough details about the monitor architecture. You mention something about SVC, so I assume your monitor is running in protected mode (with/without MMU?).

 

I'd first try to run everything in unprotected mode and get the code executed, so that jumping between supervisor/user code isn't necessary. So no SVCs etc.

 

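I'd start with a simple approach: write 5-10 instructions (maybe make a special SREC file for this) and see if I can get them executed. The instructions could be as simple as:

start:
      b    uart_loop                    /* branch to the test code */
uart_loop:
      mov  r1, #'x'                     /* character to send */
      ldr  r2, =uart_base               /* UART base address (placeholder symbol) */
      str  r1, [r2, #TXFIFO_OFFSET]     /* write it to the TX FIFO (placeholder offset) */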

 

Now in terms of the memory:

 

The best would be to do stuff without MMU first. I think it's in theory possible to have no MMU and caches enabled, so upon writing code, I'd do all necessary i-cache flushes, L2 flushes and memory barriers (ISB/DSB).

 

With the MMU, to make sure I'm executing the correct stuff, I'd make sure the MMU page table entry covering the location of your program marks the memory as "normal". There should be no "no-execute" stuff anywhere. Then upon writing I'd make sure the i-cache is flushed. If the L2 is used, I'd flush the L2 too. I'd follow each individual action with the DSB/ISB memory barriers.
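To make the page-table check concrete, here is a minimal C sketch that picks apart an ARMv7-A short-descriptor first-level "section" entry (the 1 MB mapping format) and reports the execute-never bit and the memory-type bits. The struct and function names are hypothetical, and it assumes the table uses 1 MB sections; how the TEX/C/B bits translate into "normal" memory depends on the TEX remap setting, so the sketch only extracts the raw bits.

```c
#include <stdbool.h>
#include <stdint.h>

/* Raw fields of an ARMv7-A short-descriptor first-level section entry. */
typedef struct {
    bool     is_section;  /* bits[1:0] == 0b10 and bit 18 == 0 */
    bool     xn;          /* bit 4: execute-never */
    bool     bufferable;  /* bit 2: B */
    bool     cacheable;   /* bit 3: C */
    unsigned tex;         /* bits[14:12]: TEX */
    bool     shareable;   /* bit 16: S */
} section_attrs;

/* Extract the attribute bits from one first-level table entry. */
static section_attrs decode_section(uint32_t entry)
{
    section_attrs a;
    a.is_section = ((entry & 0x3u) == 0x2u) && ((entry & (1u << 18)) == 0u);
    a.bufferable = (entry & (1u << 2)) != 0u;
    a.cacheable  = (entry & (1u << 3)) != 0u;
    a.xn         = (entry & (1u << 4)) != 0u;
    a.tex        = (entry >> 12) & 0x7u;
    a.shareable  = (entry & (1u << 16)) != 0u;
    return a;
}

/* Index of the first-level entry that maps a given virtual address. */
static uint32_t l1_index(uint32_t vaddr)
{
    return vaddr >> 20;   /* one entry per 1 MB */
}
```

On the target, the entry to inspect would be the word at index `l1_index(load_address)` in the first-level table; if `xn` comes back set for that entry, instruction fetches from the region will abort.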

 

HTH,

 

Wojciech

Teacher muzaffer
7,401 Views
Registered: ‎03-31-2012

Re: Problems branching to dynamically loaded routine using bare metal approach

Is the location you are copying your code to cacheable? Assuming you are able to read back and verify the data you write, is there a chance that the location is marked XN?
7,394 Views
Registered: ‎10-09-2013

Re: Problems branching to dynamically loaded routine using bare metal approach

Wojciech,
I agree the code is sophisticated. The monitor is running in user mode, as set up by the default Xilinx SDK template for bare metal. I have a very short routine in an SREC file that just adds two numbers and transmits a character on the UART. My initial code simply jumped to the location and caused a data abort, and single-stepping showed that the code wouldn't change the general-purpose registers even though the PC changed. I modified the code to use SVC in an attempt to run in supervisor mode, hoping that would allow the code to execute, but it didn't. Reading the coprocessor c1 register indicates the MMU is enabled.
Modifying the code to remove the SVC approach and disable the MMU appears to allow me to jump to the code and run it, EXCEPT that the action of disabling the MMU causes the code to be corrupted. I am performing the DSB/ISB after disabling the MMU as well. I can modify the corrupted code by hand at that point and it will execute, but that obviously isn't useful. The MMU page table entries are in the Xilinx code; that's a good suggestion, so I'll look there and see if "execute never" is set.
7,394 Views
Registered: ‎10-09-2013

Re: Problems branching to dynamically loaded routine using bare metal approach

It is possible since that portion of the code is in the Xilinx template. I'll try to find it.
7,361 Views
Registered: ‎10-09-2013

Re: Problems branching to dynamically loaded routine using bare metal approach

Thanks for the ideas. The Xilinx routines were correctly leaving the XN features off and the TLB looked OK, so I started researching everything I could find on the memory system. The problem was caused by cache synchronization. I modified my assembly routine to clean the data cache prior to invalidating the Icache and the branch predictor array (also called the BTB). Here's the assembly snippet from my file RunNewCode.S:

RunNewCode:
mov lr, r0 /* move the dest addr into link reg */
mcr 15,0,r0,cr7,cr11,1 /*clean D cache to PoU based on addr */
dsb
mcr 15,0,r0,cr7,cr5,0 /*invalidate I cache */
mcr 15,0,r0,cr7,5,6 /*invalidate branch predictor (BTB) */
bx lr /* branch to new code that was loaded */


The code is then called from C using
RunNewCode((unsigned long) addr);
The value in addr is the location where my new code is loaded and the compiler places it in r0 when it calls the assembly code.

Once again, thanks for the ideas and I hope this solution helps someone else.
9,698 Views
Registered: ‎10-09-2013

Re: Problems branching to dynamically loaded routine using bare metal approach

Oops. I left out the dsb and isb instructions in my snippet. It should read as :
RunNewCode:
mov lr, r0 /* move the dest addr into link reg */
mcr 15,0,r0,cr7,cr11,1 /*clean D cache to PoU based on addr */
dsb
mcr 15,0,r0,cr7,cr5,0 /*invalidate I cache */
mcr 15,0,r0,cr7,5,6 /*invalidate branch predictor (BTB) */
dsb
isb
bx lr /* branch to new code that was loaded */


Cut and paste got me again :)
7,299 Views
Registered: ‎10-09-2013

Re: Problems branching to dynamically loaded routine using bare metal approach-Spoke too soon...


It seems that the cache synchronization wasn't a permanent fix. It does allow my dynamically loaded code to execute some instructions, but when I replaced my "test" code with the actual code I need to run, the behavior changed. After 20 or so instructions the ARM stops executing in the same manner as before, i.e. single-step mode shows that the registers eventually stop being updated and the processor hangs. This makes me think the processor executes until some number of prefetched instructions have been performed and then stops. Any ideas on how to keep this from happening? Is it possible to execute code without using the Dcache and Icache at all?
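One thing that may be worth checking here (an editorial sketch, not something confirmed in this thread): the `c7, c11, 1` operation used in RunNewCode cleans only the single data-cache line containing the address in r0, so a routine longer than one line would need the clean repeated for every line it occupies. The C sketch below walks a loaded range line by line. The 32-byte line size is the Cortex-A9 L1 value; `TARGET_CORTEX_A9`, `sync_loaded_code`, and the counter are hypothetical names, and the host-side counter exists only so the loop can be checked off-target.

```c
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 32u     /* Cortex-A9 L1 cache line size */

static unsigned lines_cleaned;   /* host-side stand-in for the real cache op */

/* Clean one D-cache line to the point of unification, by address. */
static void clean_dcache_line_to_pou(uintptr_t mva)
{
#if defined(TARGET_CORTEX_A9)    /* hypothetical build flag for the real target */
    __asm__ volatile ("mcr p15, 0, %0, c7, c11, 1" : : "r" (mva) : "memory");
#else
    (void)mva;
    lines_cleaned++;
#endif
}

/* Clean every line overlapping [start, start + len), then resynchronize.
 * Assumes start + len does not wrap around the address space. */
static void sync_loaded_code(uintptr_t start, size_t len)
{
    uintptr_t mask = CACHE_LINE_BYTES - 1u;
    uintptr_t addr = start & ~mask;          /* round down to a line boundary */
    uintptr_t end  = start + len;

    for (; addr < end; addr += CACHE_LINE_BYTES)
        clean_dcache_line_to_pou(addr);

#if defined(TARGET_CORTEX_A9)
    __asm__ volatile ("dsb" : : : "memory");
    __asm__ volatile ("mcr p15, 0, %0, c7, c5, 0" : : "r" (0) : "memory"); /* invalidate I cache */
    __asm__ volatile ("mcr p15, 0, %0, c7, c5, 6" : : "r" (0) : "memory"); /* invalidate branch predictor */
    __asm__ volatile ("dsb" : : : "memory");
    __asm__ volatile ("isb" : : : "memory");
#endif
}
```

For example, a 20-instruction (80-byte) routine loaded at a line-aligned address spans three lines, so a single clean at the entry address would leave the later lines untouched. The Xilinx `xil_cache.h` header also declares range-based calls such as `Xil_DCacheFlushRange()`, which may be a simpler route than hand-written assembly.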

Xilinx Employee
7,290 Views
Registered: ‎07-31-2008

Re: Problems branching to dynamically loaded routine using bare metal approach-Spoke too soon...


Hi Michael,

 

it will help if you can provide some additional information about the "routines":

  • Are they complete programs compiled with Xilinx SDK?
  • Do they have their own stack or do they share the stack of the monitor?
  • After completion of the routines will the monitor resume?

For self-modifying code (your routines are a form thereof), you need to flush the data cache and invalidate the instruction cache after downloading the code through the UART. Assuming that you are using Xilinx-provided sources, some pseudo-code can look like this:

 

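----

#include "xil_cache.h"

/* Load address of the routine; no value is given here, choose one that suits your memory map. */
#define ROUTINE_CODE_ADDRESS

u32 *routine_code;
void (*f)(void);

routine_code = (u32 *) ROUTINE_CODE_ADDRESS;
f = (void (*)(void)) ROUTINE_CODE_ADDRESS;

download_routine(routine_code);   /* receive the code over the UART */

Xil_DCacheFlush();                /* write the downloaded code back to memory */
Xil_ICacheInvalidate();           /* discard any stale instructions */
f();                              /* transfer control to the routine */

----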

 

If the routine is a complete standalone application compiled by SDK, you actually want to disable both caches before transferring control. Replace the flush and invalidate code lines with:

Xil_DCacheDisable();
Xil_ICacheDisable();

 

- Peter

 

7,264 Views
Registered: ‎10-09-2013

Re: Problems branching to dynamically loaded routine using bare metal approach-Spoke too soon...

Peter,
To answer your questions: 1. Yes, the routines are compiled using Xilinx SDK. 2. Unique stack although I considered using the same one as the monitor. 3. The routines call a clean exit that returns to the monitor and at least that works.

Your pseudo code is very close to what my code looks like. The solution was in your final paragraph. I thought I was correctly disabling the Dcache and Icache in some assembly code but when I substituted the Xil_DCacheDisable() and Xil_ICacheDisable() function calls for my assembly code it started to work. At least it now executes the complete routine I loaded rather than just a portion of it and it returns to the monitor correctly. I was surprised since the assembly I used came from an example but I prefer to have it as a C call anyway.

Thanks for your help. Have a cookie on me!
Michael
Xilinx Employee
2,182 Views
Registered: ‎07-31-2008

Re: Problems branching to dynamically loaded routine using bare metal approach-Spoke too soon...


Okay, glad this works.

 

Here is some additional background.

 

The standalone BSP makes the assumption that the processor is coming out of reset or a reset-like state meaning that caches are turned off, the MMU is disabled, and core register settings are in a state that allows the CPU to execute code (but not much more). This is true for MicroBlaze, Cortex-A9, and PowerPC.

 

The code in boot.S starting at the label _boot sets up additional functionality for the CPU. This is different between Cortex-A9, MicroBlaze, and PowerPC. For the Cortex-A9, the caches, the VFPU, the MMU, and other features are enabled. The code in boot.S invalidates the caches before it enables them. I'm pretty certain this is what got you, i.e. the caches were invalidated but not cleaned/flushed first. As a result, valid data in the data cache was discarded and you saw stale data.
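The difference between invalidating and cleaning can be illustrated with a toy model in C: one word of memory fronted by a single write-back cache line. This is only a sketch of the general write-back behaviour, not of the Cortex-A9 hardware, and all names in it are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model: one memory word behind one write-back cache line. */
typedef struct {
    uint32_t memory;   /* contents of main memory */
    uint32_t cached;   /* contents of the cache line */
    bool     valid;    /* line holds a copy of the word */
    bool     dirty;    /* line is newer than memory */
} toy_cache;

/* A store lands in the cache only; memory is not updated yet. */
static void toy_write(toy_cache *c, uint32_t value)
{
    c->cached = value;
    c->valid  = true;
    c->dirty  = true;
}

/* Clean: write a dirty line back to memory, keep the line. */
static void toy_clean(toy_cache *c)
{
    if (c->valid && c->dirty) {
        c->memory = c->cached;
        c->dirty  = false;
    }
}

/* Invalidate: drop the line without writing it back. */
static void toy_invalidate(toy_cache *c)
{
    c->valid = false;
    c->dirty = false;
}

/* A load hits the cache if the line is valid, otherwise reads memory. */
static uint32_t toy_read(const toy_cache *c)
{
    return c->valid ? c->cached : c->memory;
}
```

Writing a value and then invalidating leaves the old memory contents visible, which matches the "corrupted" code seen in this thread; writing, cleaning, and then invalidating preserves the new value.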

 

The code in crt0.S starting at the label _start sets up the C runtime environment. It sets up the stack and zeroes the BSS and SBSS sections.

 

 

This gives you a number of scenarios for your loader (disclaimer, I have not tested any of this):

  1. Start from _boot (the beginning of your downloaded routine): for this to work you should disable both caches, save the stack pointer before calling the routine, and restore the stack pointer after returning from the routine (actually you want to save / restore all registers because boot.S assumes it can use any register)
  2. Start from _start (at an offset from your downloaded routine): for this to work you should flush the data cache and invalidate the instruction cache, and save/restore registers as in (1). You will continue to use the MMU and cache setup from the monitor. You will use a separate stack.
  3. Start from main (at an offset from your downloaded routine): for this to work you should flush the data cache and invalidate the instruction cache. No need to save / restore registers as it is just a regular C function call. You will continue to use the stack, MMU, and cache setup from the monitor.

 

Obviously for all of this you have to carefully partition your memory. Once the routine overwrites the monitor code, data, or stack area all bets are off.
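As a rough illustration of the partitioning check, a loader could refuse any download whose destination range touches the monitor's own regions. The sketch below is plain C with hypothetical names (`region`, `load_is_safe`); the reserved ranges would have to come from the monitor's linker script, and it assumes no range wraps around the end of the address space.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A half-open address range [base, base + size). */
typedef struct {
    uintptr_t base;
    size_t    size;
} region;

/* True if the two ranges share at least one byte. */
static bool regions_overlap(region a, region b)
{
    if (a.size == 0 || b.size == 0)
        return false;
    return a.base < b.base + b.size && b.base < a.base + a.size;
}

/* True if the load range avoids every reserved (monitor) region. */
static bool load_is_safe(region load, const region *reserved, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        if (regions_overlap(load, reserved[i]))
            return false;
    }
    return true;
}
```

The monitor could run this check on each S-record's address and length before writing anything to memory, and reject the upload if it returns false.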

 

- Peter

 
