Visitor

TMR Subsystem - Recovery


Hi,

 

I am working with a design that has a Triple Modular Redundancy (TMR) subsystem with three MicroBlaze cores.
Reading the documentation (PG268 - v1.0), I found the following:

 

Recovery of the MicroBlaze Subsystem
1. The executing software is interrupted by the break signal.
2. The software break handler stores all internal MicroBlaze registers in RAM.
3. The software performs a reset of the entire MicroBlaze subsystem excluding the TMR Managers by executing a SUSPEND instruction.
4. The reset restores the TMR Manager to Voting (FT-mode) state.
5. The software starts executing from the reset vector, and reads the TMR Manager First Failing Register (FFR).
6. If the FFR indicates that one MicroBlaze sub-block is faulty, a recovery should be done. If the register holds any other value, the software should not attempt a recovery.
7. The software clears the TMR Manager FFR.
8. The software restores all registers from RAM and executes an RTBD instruction to return from the break handler, resuming nominal execution at the place where the break occurred.

First of all, I was not able to find the Example Design mentioned in the Product Guide. Can you provide a link for that?

 

Secondly, I have questions regarding:
5. The software starts executing from the reset vector, and reads the TMR Manager First Failing Register (FFR).

Does this mean that the startup code has to read the TMR Manager FFR and act accordingly?
6. If the FFR indicates that one MicroBlaze sub-block is faulty a recovery should be done. If the register holds any other value, the software should not attempt a recovery.

So if, for instance, FFR[2] is 1 (lockstep mismatch 2-3), what should be done?

 

Thanks in advance for your attention.


Cheers,

Emanuele

 


Accepted Solutions
Xilinx Employee

You can generate the example design in Vivado yourself. You do it by right-clicking on an instance of the TMR Manager, and selecting Open IP Example Design. Look in Chapter 5 of PG268 for details.

 

5. Correct. FFR bit 3 indicates whether the reset was due to a recovery or a cold reset. If this is a cold reset, the normal initialization and branch to the main function should be performed. If this is a recovery, an additional check that only one sub-block is faulty (FFR bits 2-0 = 011, 101, or 110, and bits 20-16 = 00000) is recommended. If that is not the case, the TMR subsystem is in the Fatal state, recovery is not meaningful, and the desired software behavior depends on your application. Since you cannot trust the software, the safest approach is to enter an infinite loop and not attempt anything else. Another option might be to try a cold reset. In any case, in the Fatal state, the usual assumption is that logic outside the subsystem must handle recovery.

 

6. See above. FFR bits 2-0 = 100, 010, or 001 indicate that more than one sub-block is faulty, since otherwise two processors would agree on which one is faulty. In this case recovery is not meaningful. To clarify with an example: if bits 2-0 = 011, processor 1 is faulty, because processors 2 and 3 both agree that they have a mismatch with processor 1.
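
The checks described in points 5 and 6 could be sketched in C roughly as follows. This is a minimal illustration, not code from PG268 or the example design. The function and macro names are hypothetical, and reading the FFR from hardware is left out, so the function simply takes the register value. It also assumes that bit 3 set means "reset due to recovery" (the thread does not state the polarity), and that 101 and 110 map to processors 2 and 3 by analogy with the 011 example. Please check both against your revision of PG268.

```c
#include <stdint.h>

/* Bit fields as described in this thread; verify against PG268. */
#define FFR_LOCKSTEP_MASK 0x00000007u /* bits 2-0: pairwise lockstep mismatches */
#define FFR_RECOVERY_BIT  0x00000008u /* bit 3: ASSUMED 1 = reset caused by recovery */
#define FFR_FATAL_MASK    0x001F0000u /* bits 20-16: must be 00000 to recover */

typedef enum {
    TMR_COLD_RESET, /* normal initialization, then branch to main */
    TMR_RECOVER,    /* exactly one sub-block faulty: recovery is meaningful */
    TMR_FATAL       /* anything else: do not attempt a recovery */
} tmr_action_t;

/* Classify an FFR value. On TMR_RECOVER, *faulty_cpu is set to 1, 2 or 3;
 * otherwise it is set to 0. */
tmr_action_t tmr_classify_ffr(uint32_t ffr, int *faulty_cpu)
{
    *faulty_cpu = 0;

    if ((ffr & FFR_RECOVERY_BIT) == 0u) {
        return TMR_COLD_RESET;
    }
    if ((ffr & FFR_FATAL_MASK) != 0u) {
        return TMR_FATAL;
    }

    switch (ffr & FFR_LOCKSTEP_MASK) {
    case 0x3u: /* 011: mismatches 1-2 and 1-3, so processor 1 is faulty */
        *faulty_cpu = 1;
        return TMR_RECOVER;
    case 0x5u: /* 101: mismatches 1-2 and 2-3, so processor 2 (inferred) */
        *faulty_cpu = 2;
        return TMR_RECOVER;
    case 0x6u: /* 110: mismatches 1-3 and 2-3, so processor 3 (inferred) */
        *faulty_cpu = 3;
        return TMR_RECOVER;
    default:   /* 000, 001, 010, 100, 111: no single faulty sub-block */
        return TMR_FATAL;
    }
}
```

With this sketch, the case from the question (only FFR[2] set after a recovery reset, i.e. bits 3-0 = 1100) falls into the default branch and is treated as Fatal.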

 

I have made a note to explain this in more detail in the next issue of PG268.

Scholar

You need a version of Vivado that is recent enough to include the IP (it is part of its managed IP library). That would be 2017.4.

 

 

Austin Lesea
Principal Engineer
Xilinx San Jose
Visitor

Ok, thanks, I was trying to access the example design from the IP Documentation.

Anyway, the Example Design does not contain any C/C++ code at all, so all my questions regarding the software recovery still remain.

Scholar

If one block is in error, something is wrong with it. The results from the voting are still good, but if the fault is left uncorrected, it could eventually lead to an error, as no voting is occurring.

 

Usually, one is also using the SEM IP to find and fix bit upsets in parallel (it is part of the same bitstream). The SEM IP will find the flipped bit, hopefully fix it, and indicate that all is well (in effect, 'recovery' of the bad uB has occurred and the TMR results will all be good again). If you did not want to use the SEM IP (I cannot think of one reason NOT to), you would need to assert PROG_b or power cycle as soon as the TMR status indicates an issue (though the immediate results are trustworthy).

 

The SEM IP will halt on an uncorrectable error, and the device must then be restarted (by asserting PROG_b or power cycling).

 

So if the TMR voter is clean (everyone is happy) and the SEM IP is running (RBCRC is good), then the results of the uBlaze are valid and guaranteed good (or at least as good as your code). If RBCRC is bad, there is an uncorrectable error, or the TMR voting status is bad, one must assume the results cannot be trusted. (The immediate result when one block is bad is trustworthy, but you are operating on borrowed time: any further fault means the results are invalid, unless you can return to RBCRC being good by flipping back the flipped bit.)

 

Recovery of your application (in the event it knows at least one block is not good yet) is done by your software at a higher level than this hardware, and depends on how you act on errors in your system.

 

So, for example, if this TMR uB was on the IR nose-wheel landing camera of a commercial jet, one bad vote might be reason to warn the pilot, but not to abort the landing (pull up and go around not required). But if the error is uncorrectable, or has endured for some time (say 200 ms), then it is pull up and go around time (VFR required, cannot use the IR camera). Every application has its own set of requirements.
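
To make the landing-camera example concrete, the policy could be sketched as a small decision function. This is purely illustrative: every name here is hypothetical, the 200 ms limit is just the figure from the example, and how the mismatch and uncorrectable flags are obtained from the TMR and SEM status is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define ERROR_PERSIST_LIMIT_MS 200u /* example threshold, application specific */

typedef enum {
    POLICY_CONTINUE, /* voter clean: results are good */
    POLICY_WARN,     /* one bad vote: warn, but keep using the results */
    POLICY_ABORT     /* uncorrectable or persistent error: stop trusting results */
} policy_t;

/* Decide what the application should do, given the current error status and
 * how long the voting mismatch has persisted. */
policy_t tmr_policy(bool vote_mismatch, bool uncorrectable, uint32_t error_age_ms)
{
    if (uncorrectable) {
        return POLICY_ABORT;
    }
    if (!vote_mismatch) {
        return POLICY_CONTINUE;
    }
    if (error_age_ms >= ERROR_PERSIST_LIMIT_MS) {
        return POLICY_ABORT;
    }
    return POLICY_WARN;
}
```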

 

 

 

 

Austin Lesea
Principal Engineer
Xilinx San Jose