01-09-2014 08:15 AM
I've only been working with the Zynq/ARM Cortex-A9 for a few months (though I have 20+ years of experience writing embedded code on other processors/DSPs), so bear with me on my long-winded explanation. My application is a bare-metal monitor program that tests various functions of the Zynq SoC using commands transmitted over UART. The monitor is a "port" of a prior monitor written for a different processor family. It allows probing memory, hardware registers, etc., and also provides the capability to upload programs dynamically as text-based (S-record) files to specific locations in memory and then branch to the loaded routine. I've gotten all of the monitor's functionality to work on a Zedboard except for the "branching" part, and I have hit a wall. Please don't recommend building all of the routines into the monitor or using an OS; those aren't options available to me, or I would have done that already.
Single-stepping through the code shows that my dynamically loaded program is correct and in the proper location, yet once the processor branches to that location the registers no longer update and execution eventually ends in a "data abort exception". I set up an SVC/exception handler approach, thinking that this was caused by user-mode operation and that supervisor mode would correct it, but that didn't work as I expected. Oddly, the SVC approach sometimes magically starts to work after several failed attempts, but cycling power on the Zedboard puts it back into the non-working state, so there's obviously some trick I'm missing that gets the processor into the right mode. Attempts to disable the "execute never" functions of the ARM using "MRC" and "MCR" instructions to the coprocessor haven't been successful either, so I must be doing something wrong.
A very helpful Xilinx rep suggested a different approach based on the FSBL template routine "FsblHandoffExit()", which accepts an address, disables the instruction cache and MMU, and then branches to the supplied address. This looked like the solution, and I had been trying similar things, so it was promising. The problem is that, for some reason, executing the "MCR" instruction to disable the cache/MMU causes the memory where I stored my routine to be corrupted. I can then put some instructions into the corrupted locations manually and they will execute, so this appears to be the right track, but how do I keep the loaded routine from being corrupted? I've tried multiple locations, and it appears that virtually every memory location outside of my monitor program space gets corrupted.
This type of operation is crucial to all processors so I doubt I'm the first to have this problem. Heck, Linux does this every time it calls a module. Does anyone have any ideas or examples of how to fix this?
I appreciate any help you can provide.
01-09-2014 11:44 AM
What you explained is a fairly sophisticated program. Loading code dynamically into the memory and jumping to it isn't as easy as it may sound.
You also didn't give enough details about the monitor architecture. You mention something about SVC, so I assume your monitor is running in protected mode (with/without MMU?).
I'd first try to run everything in unprotected mode and get the code executed, so that jumping between supervisor/user code isn't necessary. So no SVCs etc.
I'd start with a simple approach: write 5-10 instructions (maybe make a special SREC file for this) and see if I can get them executed. The instructions would be as simple as:
put 'x' in r1
put uart_base in r2
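A minimal sketch of such a test payload, hand-assembled into ARM (A32) instruction words so it can be written to memory without a toolchain. The UART base 0xE0001000 and the FIFO offset 0x30 are assumptions about the hardware, and the final store and return instructions are my guess at what a complete test needs. All function and macro names are made up for illustration. Compare the words against your assembler's output before relying on them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed addresses: verify both against your own hardware design. */
#define UART_BASE      0xE0001000u  /* assumption: Zynq PS UART base address */
#define UART_FIFO_OFF  0x30u        /* assumption: TX/RX FIFO register offset */

/* mov rd, #imm8 (condition "always", no rotation) */
static uint32_t mov_imm8(unsigned rd, unsigned imm8)
{
    assert(rd < 16 && imm8 < 256);
    return 0xE3A00000u | (rd << 12) | imm8;
}

/* ldr rt, [pc, #off12] (pc reads as this instruction's address + 8) */
static uint32_t ldr_pc(unsigned rt, unsigned off12)
{
    assert(rt < 16 && off12 < 4096);
    return 0xE59F0000u | (rt << 12) | off12;
}

/* str rt, [rn, #off12] */
static uint32_t str_imm(unsigned rt, unsigned rn, unsigned off12)
{
    assert(rt < 16 && rn < 16 && off12 < 4096);
    return 0xE5800000u | (rn << 16) | (rt << 12) | off12;
}

#define BX_LR 0xE12FFF1Eu  /* bx lr: return to the caller */

/* Fills out[] with the five-word test payload and returns the word count. */
static size_t build_test_payload(uint32_t out[5])
{
    out[0] = mov_imm8(1, 'x');              /* put 'x' in r1 */
    out[1] = ldr_pc(2, 4);                  /* put uart_base in r2 (loads out[4]) */
    out[2] = str_imm(1, 2, UART_FIFO_OFF);  /* write r1 to the UART FIFO */
    out[3] = BX_LR;                         /* return to the monitor */
    out[4] = UART_BASE;                     /* literal read by out[1] */
    return 5;
}
```

If the 'x' shows up on the terminal and the monitor prompt returns, the load, branch, and return paths all work.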
Now in terms of the memory:
The best would be to do stuff without the MMU first. I think it's in theory possible to run with the MMU off and the caches still enabled, so after writing the code I'd do all the necessary i-cache flushes, L2 flushes and memory barriers (ISB/DSB).
With the MMU on, to make sure I'm executing the correct stuff, I'd make sure the MMU page table entry covering your program marks the memory as "normal". There should be no "no-execute" stuff anywhere. Then, after writing, I'd make sure the i-cache is flushed. If the L2 is used, I'd flush the L2 too. I'd follow each individual action with the DSB/ISB memory barriers.
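The ordering described above could be sketched roughly as follows. The hook table and every name in it are hypothetical. On the target each hook would wrap the real cache or barrier operation (with the Xilinx standalone BSP, the routines in xil_cache.h are the likely candidates). The stand-ins here only record call order so the sequence can be checked on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical hook table; none of these names come from a real library. */
typedef struct {
    void (*dcache_clean_range)(uintptr_t addr, size_t len); /* L1 data cache */
    void (*l2_clean_range)(uintptr_t addr, size_t len);     /* NULL if no L2 */
    void (*icache_invalidate_all)(void);
    void (*dsb)(void);
    void (*isb)(void);
} cache_ops_t;

/* Make freshly written code visible to instruction fetch. */
static void sync_loaded_code(const cache_ops_t *ops, uintptr_t addr, size_t len)
{
    assert(ops != NULL);
    ops->dcache_clean_range(addr, len);  /* push the new bytes out of L1 */
    ops->dsb();
    if (ops->l2_clean_range != NULL) {
        ops->l2_clean_range(addr, len);  /* and out of L2, if it is in use */
        ops->dsb();
    }
    ops->icache_invalidate_all();        /* drop any stale instructions */
    ops->dsb();
    ops->isb();                          /* refetch after this point */
}

/* Host-side stand-ins: each one appends a letter to a trace buffer. */
static char trace[16];
static size_t trace_len;

static void note(char c)
{
    if (trace_len < sizeof trace)
        trace[trace_len++] = c;
}

static void h_dclean(uintptr_t a, size_t n) { (void)a; (void)n; note('D'); }
static void h_l2(uintptr_t a, size_t n)     { (void)a; (void)n; note('L'); }
static void h_iinv(void) { note('I'); }
static void h_dsb(void)  { note('b'); }
static void h_isb(void)  { note('i'); }

static const cache_ops_t host_ops = { h_dclean, h_l2, h_iinv, h_dsb, h_isb };
```

Calling `sync_loaded_code(&host_ops, load_address, length)` right after the last S-record byte is stored, and before the branch, is the intended use. `load_address` and `length` stand for whatever your loader tracks.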
01-09-2014 11:48 AM
01-09-2014 02:22 PM
01-09-2014 02:23 PM
01-14-2014 07:44 AM
01-14-2014 07:46 AM
01-22-2014 02:19 PM
It seems that the cache synchronization wasn't a permanent fix. It does allow my dynamically loaded code to execute some instructions, but once my "test" code was replaced with the actual code I need to run, the behavior changed. After 20 or so instructions the ARM stops executing in the same manner as before, i.e. single-step mode shows that the registers eventually stop being updated and the processor hangs. This makes me think the processor executes until some number of prefetched instructions have been performed and then stops. Any ideas on how to keep this from happening? Is it possible to execute code without using the dcache and icache at all?
01-22-2014 10:46 PM
It will help if you can provide some additional information about the "routines":
For self-modifying code (and your routines are a form of it), you need to flush the data cache and invalidate the instruction cache after downloading the code through the UART. Assuming that you are using the Xilinx-provided sources, some pseudo-code can look like this:
routine_code = (u32 *) ROUTINE_CODE_ADDRESS
If the routine is a complete standalone application compiled by the SDK, you actually want to disable both caches before transferring control. Replace the flush and invalidate code lines with:
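A rough sketch of the two handoff variants, under stated assumptions. The hook table and all names are hypothetical; on the target the hooks would wrap the matching cache routines, and any barriers are assumed to live inside them. Flushing the data cache before disabling it is an extra precaution in this sketch, not something spelled out in the advice above. The stand-ins only record call order, so the logic can be checked on a PC.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*routine_fn)(void);

/* Hypothetical hook table; none of these names come from a real library. */
typedef struct {
    void (*dcache_flush)(void);       /* clean and invalidate the data cache */
    void (*icache_invalidate)(void);
    void (*dcache_disable)(void);
    void (*icache_disable)(void);
} handoff_ops_t;

/* standalone != 0: the routine sets up caches itself, so hand over with
 * both caches off. standalone == 0: keep the caches on and just sync them. */
static void run_routine(const handoff_ops_t *ops, uintptr_t entry, int standalone)
{
    assert(ops != NULL && entry != 0);
    ops->dcache_flush();              /* downloaded bytes must reach memory */
    if (standalone) {
        ops->dcache_disable();
        ops->icache_disable();
    } else {
        ops->icache_invalidate();
    }
    ((routine_fn)entry)();            /* branch to the loaded code */
}

/* Host-side stand-ins: each one appends a letter to a trace buffer. */
static char trace[16];
static size_t trace_len;

static void note(char c)
{
    if (trace_len < sizeof trace)
        trace[trace_len++] = c;
}

static void h_flush(void)       { note('F'); }
static void h_iinv(void)        { note('I'); }
static void h_doff(void)        { note('d'); }
static void h_ioff(void)        { note('i'); }
static void dummy_routine(void) { note('R'); }

static const handoff_ops_t host_ops = { h_flush, h_iinv, h_doff, h_ioff };
```

On the host, `run_routine(&host_ops, (uintptr_t)&dummy_routine, 1)` exercises the standalone path; on the target `entry` would be the load address of the routine.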
01-23-2014 08:27 AM
01-23-2014 10:36 PM
Okay, glad this works.
Here is some additional background.
The standalone BSP makes the assumption that the processor is coming out of reset or a reset-like state meaning that caches are turned off, the MMU is disabled, and core register settings are in a state that allows the CPU to execute code (but not much more). This is true for MicroBlaze, Cortex-A9, and PowerPC.
The code in boot.S starting at the label _boot sets up additional functionality for the CPU. This is different between Cortex-A9, MicroBlaze, and PowerPC. For the Cortex-A9 the caches, VFPU, the MMU, and other features are enabled. The code in boot.S invalidates the caches before it enables them. I'm pretty certain this is what got you, i.e. the caches were invalidated but not cleaned/flushed first. As a result, valid data in the data cache got discarded and you saw stale data.
The code in crt0.S starting at the label _start sets up the C runtime environment. It sets up the stack and zeroes the BSS and SBSS sections.
This gives you a number of scenarios for your loader (disclaimer, I have not tested any of this):
Obviously for all of this you have to carefully partition your memory. Once the routine overwrites the monitor code, data, or stack area all bets are off.
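One way to guard against that is to have the loader refuse any download that touches the monitor's own regions. A minimal sketch, assuming the monitor knows its code, data, and stack ranges (for example from linker symbols); the type and function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* A memory range given by start address and size in bytes. */
typedef struct {
    uint32_t start;
    uint32_t size;
} region_t;

/* True if the two ranges share at least one byte. The ends are computed
 * in 64 bits so a range ending at 0xFFFFFFFF does not wrap around. */
static bool regions_overlap(region_t a, region_t b)
{
    uint64_t a_end = (uint64_t)a.start + a.size;
    uint64_t b_end = (uint64_t)b.start + b.size;
    return a.size != 0 && b.size != 0 && a.start < b_end && b.start < a_end;
}

/* True if the load range avoids every reserved (monitor-owned) range. */
static bool load_is_safe(region_t load, const region_t *reserved, size_t n)
{
    assert(reserved != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (regions_overlap(load, reserved[i]))
            return false;
    }
    return true;
}
```

The check would run once per S-record, before any byte of that record is written, with `reserved` listing the monitor's code, data, and stack areas.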