09-08-2020 09:13 AM
I'm currently running a fault injection campaign using the SEM IP on a Spartan-7 and an Artix-7.
My SEM IP is configured in "enhanced repair" mode.
I realized that some configuration bits are not correctable when I inject a fault into them (around 5% of the injected bits among the essential bits of my design).
I> N C000E7A7C3
Some injections also generate a double error at detection, leading to the same result.
I can provide you with the list of addresses of all the configuration bits that show this type of behaviour on my current device (xc7s50csga324).
My questions are the following:
How can an injection of one error result in an uncorrectable error?
Why is the detected error not at the same address as the injected error?
Is this phenomenon only due to the fault injection, or can it also be generated by an SEU?
As we have very little information about the link between configuration memory and the FPGA resources, any help in understanding the origin of the problem or how to avoid it would be appreciated.
Thank you in advance.
09-08-2020 09:52 AM
If you flip two non-adjacent bits in a frame, the enhanced repair will fail.
Some config RAM bits affect more than one signal (bit). The most common one is the GLUT_MASK bit for a CLB. It switches the readback and operation of a LUT to that of a LUTRAM or SRL, so all bits in the LUT become '1' on readback (LUTRAM/SRL operation depends on other CRAM bits as well). If you really need to dig into this, you should request support from your distributor FAE or, if you have one, your Xilinx FAE.
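The rule in the first sentence of this reply can be sketched as a small check. This is only an illustration, assuming enhanced repair handles a single flipped bit or two flipped bits at adjacent positions in one frame, and nothing else. The function name `is_repairable` and the representation of an upset as a list of bit offsets within a frame are hypothetical, chosen for this example; they are not part of the SEM IP.

```python
def is_repairable(flipped_bits):
    """Return True if the upset pattern in ONE frame is assumed repairable.

    flipped_bits: iterable of bit offsets inside a single frame
    (hypothetical representation).
    Assumption: one flipped bit, or two flipped bits at adjacent
    offsets, can be repaired; every other pattern cannot.
    """
    bits = sorted(set(flipped_bits))
    if len(bits) <= 1:
        return True  # nothing to repair, or a single-bit error
    if len(bits) == 2:
        return bits[1] - bits[0] == 1  # only adjacent pairs
    return False  # three or more bits differ from the expected frame


# A single control bit that changes how many other bits read back
# would look like the last case, even though only one bit was injected.
print(is_repairable([10]))       # single bit
print(is_repairable([10, 11]))   # adjacent pair
print(is_repairable([10, 40]))   # non-adjacent pair
```

Under these assumptions, one injected bit that alters the readback of several other bits in the same frame falls into the unrepairable case, which would be consistent with a single injection producing an uncorrectable report.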
09-11-2020 07:54 AM
Thank you very much for your answer!
I'm actually trying to understand why the SEM IP gets stuck so often during my neutron beam tests, and eventually to see whether there is any way to mitigate this phenomenon.
Since there is a rather large number of these single points of failure (according to my error injection campaign), they seem to be the main source of failure of the scrubbing system (more probable than multiple errors in the same frame).
But the behavior can be very different from one bit to another:
-Double error in the same frame
-Single errors that are continuously corrected
-SEM IP stuck
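For bookkeeping during an injection campaign, the three behaviours above could be tallied along these lines. This is a rough sketch, not a parser for the real monitor output: the record type `InjectionRecord`, its fields, and the category labels are all hypothetical, and the records would have to be filled in from whatever log the campaign actually produces.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class InjectionRecord:
    """One injection and what was reported afterwards (hypothetical format)."""
    injected_frame: int
    # One (frame_address, bits_in_error) pair per detection report, in order.
    detections: List[Tuple[int, int]] = field(default_factory=list)
    # False if the controller never went back to scanning after the injection.
    resumed_scanning: bool = True


def classify(record: InjectionRecord) -> str:
    """Map a record to one of the behaviours listed above, or a benign case."""
    if not record.resumed_scanning:
        return "stuck"
    if not record.detections:
        return "not_detected"
    if any(bits >= 2 for _, bits in record.detections):
        return "double_error_in_frame"
    per_frame = Counter(frame for frame, _ in record.detections)
    if any(count > 1 for count in per_frame.values()):
        return "repeatedly_corrected"
    return "corrected"


def tally(records):
    """Count how many injections fall into each behaviour."""
    return Counter(classify(r) for r in records)
```

Keeping the injected frame next to the detected frames in each record also makes it easy to list the injections whose detection was reported at a different address.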
SEUs on the GLUT_MASK bit seem to be a good explanation for CRC errors. Can we do something about it?
Do you have other examples of configuration bits that could cause the types of behaviors listed above?
Thank you again for your answer!
09-11-2020 08:09 AM
As I noted, if you want details, you will have to engage with your distributor FAE or directly with a Xilinx FAE.
You must have one amazing neutron source! The SEM IP itself has a cross-section on the same order as the device SEFI cross-section, that is, the probability that the device itself completely goes nuts, stops altogether, or restarts by itself. So, all in all, if you worry about SEE, the SEM IP is useful for mitigating upsets in CRAM for a running design (recovering from functional failures in your design).
If the basic device SEFI cross-section is too high for your needs, you have the wrong architecture for your system. You either need to go to a more robust device (UltraScale+ has much lower cross-sections), or you need some form of redundancy (or both).
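To put the cross-section comparison into numbers, the usual relation is that the expected event count equals cross-section times fluence. The sketch below only encodes that arithmetic; the function names are hypothetical and no real device figures are included, so the inputs must come from the device reliability data and the beam facility's dosimetry.

```python
def expected_events(cross_section_cm2: float, fluence_per_cm2: float) -> float:
    """Expected number of events: N = sigma * fluence."""
    return cross_section_cm2 * fluence_per_cm2


def expected_cram_upsets(bit_cross_section_cm2: float,
                         n_config_bits: int,
                         fluence_per_cm2: float) -> float:
    """Expected CRAM upsets, scaling a per-bit cross-section by the bit count."""
    device_cross_section = bit_cross_section_cm2 * n_config_bits
    return expected_events(device_cross_section, fluence_per_cm2)
```

Comparing `expected_events` for the SEFI cross-section with the number of stuck-controller events actually seen in a run gives a quick plausibility check on whether those events can be attributed to the controller itself.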
09-14-2020 05:24 AM
This is very interesting information!
It means that I probably misinterpreted the phenomena observed during my irradiation campaign. Not having been able to record the monitor output of the SEM IP, I assumed that the persistent errors (no recovery before complete reconfiguration) in my design, which contains neither feedback loops nor SRL/LUTRAM, came mostly from SEUs in the configuration memory that were no longer corrected by the SEM IP.
This is why I wanted to study these "blocking" phenomena of the memory scrubbing system. This interpretation must surely be wrong given the number of such events observed and almost no SEFI over the whole test campaign.
I'll have to look more deeply into the mechanisms that can lead to this kind of persistent error.
Unfortunately, as we are academic workers, we don't have any FAE to help us.
Thank you for your precious help!
09-14-2020 05:58 AM
Study is encouraged,
Are you part of the Xilinx University Program? If so, your professor who is registered in the program is able to get help. While at Xilinx (for 20 years), I helped many students with their degrees. The SEM IP was developed by myself and Ken Chapman for Virtex 4, where it was known as App Note 864. It has been fully supported IP since Virtex 6. It gets beam tested, and verified for proper operation in all modes for each new technology node. It is a fundamental building block in safety critical systems, security, and space markets. While only able to address type 2 blocks (configuration RAM), it is the first step in improving availability and reliability, and preventing fail-safe systems from failing unsafely. I teach its use in my embedded class at UCSC Extension (among other things).
The longest beam test campaign was by CERN: more than 6 months of data on an Artix series device. Many hundreds of thousands of upsets, and nothing new was discovered.