Soft Errors and Availability

by Xilinx Employee on ‎02-17-2011 11:50 AM

Availability is defined as the time a system is working divided by the total time (working plus not working). So, for example, a system with 0.99999 availability is unavailable for about 5.26 minutes (roughly five minutes and 15 seconds) a year.  “5-nines” is a common requirement in the telecommunications industry, usually applied to the trunk side of the equipment. The line side often has a lower availability requirement, as it is allowed to be out of service more often. Traditionally, out of service meant a hardware failure, and a hardware failure meant replacement of a circuit card. In order to meet the requirement there was a “hot-standby” circuit card ready to take over in the event of a failure.
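As a quick check of the arithmetic, here is a minimal Python sketch that converts an availability figure into downtime per year. The function names and the 365-day year are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a 365-day year


def availability(uptime_s, downtime_s):
    """Working time divided by total time (working plus not working)."""
    return uptime_s / (uptime_s + downtime_s)


def downtime_per_year_s(avail):
    """Seconds per year the system may be out of service at this availability."""
    return (1.0 - avail) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for nines in (3, 4, 5):
        avail = 1.0 - 10.0 ** -nines
        minutes, seconds = divmod(downtime_per_year_s(avail), 60)
        print(f"{nines}-nines: {int(minutes)} min {seconds:.0f} s per year")
```

For 5-nines this gives a budget of a little over 315 seconds a year, which is the number every outage (hardware or soft error) has to fit inside.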

 


Enter the Soft Error

 

Cosmic Rays from Novae (super and not so super) and protons from our Sun impact our atmosphere and create streams of neutrons (and a few protons) that make it all the way to the earth’s surface where they may disrupt microelectronics.

 

The following link details the latest family’s robustness to these soft error effects:

 

http://forums.xilinx.com/t5/PLD-Blog/First-Look-Rosetta-on-Virtex-6-and-Spartan-6-FPGA/ba-p/88018

 

For the latest data on Single Event Upset (SEU), see:

 

http://www.xilinx.com/support/documentation/user_guides/ug116.pdf

 

page 21.

 

 

Testing

 

I will not go over the Rosetta program here, but it is the best, most accurate data, published quarterly.  No other manufacturer does this, which should tell you that Xilinx is far ahead of everyone (ASIC, ASSP, FPGA).

 

http://www.xilinx.com/support/documentation/white_papers/wp286.pdf

 

Remember to always use ug116.pdf (above) for the latest data.

 

 

De-rating

 

Not every particle strike that results in a transient or flipped bit causes a functional failure. De-rating refers to the ratio of functional failures to upsets. In an FPGA device, this ratio ranges from one functional failure in ten upsets (1:10) to one in eighty (1:80), depending on the nature of the customer’s design. For example, a fully unrolled 256-bit AES encryption and decryption core (all actions performed in parallel) that fills an entire FPGA device is about 1:11. An FPGA device programmed to watch how you are driving and help you avoid dangerous situations (like a car stopped ahead of you) is about 1:80. Why is this?

 

Depending on the number of signals that are critical to the function of the design at a clock edge, the de-rating can vary between the above two extremes. Since in any design roughly 90% of the FPGA device is completely unused (all those potential interconnects are not used), it makes sense that the de-rating cannot be worse than 1:10.  But how can it be 1:80?  Not every signal is critical on each clock edge. Examples of this are: self-check logic that has no effect on function if it fails, startup logic that is used only once, error handling (like ‘404’ “page not found”) which is used only in exceptions, and bits of wires which just do not matter (like address bit 4 of the stack pointer when the stack is never deeper than 7).
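To see what the de-rating does to a failure rate, here is a minimal sketch. The raw upset rate used in the example is a made-up placeholder, not a figure from ug116.pdf; the function names are also assumptions.

```python
def functional_failure_rate(raw_upset_rate, derating):
    """Scale a raw upset rate by the de-rating.

    derating is the number of upsets per functional failure,
    so 10 means 1:10 and 80 means 1:80.
    """
    if derating < 1:
        raise ValueError("derating must be at least 1 upset per failure")
    return raw_upset_rate / derating


def mean_time_to_functional_failure(raw_upset_rate, derating):
    """Reciprocal of the functional failure rate, in the same time unit."""
    return 1.0 / functional_failure_rate(raw_upset_rate, derating)


if __name__ == "__main__":
    raw = 0.5  # hypothetical: upsets per device per year
    for derating in (10, 11, 80):
        years = mean_time_to_functional_failure(raw, derating)
        print(f"1:{derating} -> one functional failure every {years:.0f} years")
```

Substitute the raw rate for your device and environment; the point is only that the same raw data can imply an eight times different functional failure rate depending on the design.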

 

 

SEU IP Core

 

Released recently, and now fully supported, the SEU monitor IP core finds and fixes upsets:

 

http://www.xilinx.com/support/documentation/ip_documentation/sem_ug764.pdf

 

Additionally, through the use of an additional SPI Flash device, the core identifies bits which are essential to the operation of the FPGA device.  Unfortunately, Xilinx is only able to determine if these bits would affect the hardware operation of the FPGA device; we cannot determine if the bit flip would affect your design.  This feature identifies roughly one in three upsets as “essential.”  Identifying the upset as “critical” to your design is still a subject of research. If needed, the core may be used to walk through the list of essential bits, with you keeping track of which bits really were critical.  A new map can then be made which informs you only when a functional bit flip has occurred. This is done by using the “error-injection” feature of the SEU Monitor IP core, which allows you to flip any bit and then see what your system does as a consequence of the flip.
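The walk through the essential bits can be sketched as a simple loop. This is only an outline of the bookkeeping: `flip_bit` and `design_ok` are hypothetical callbacks you would have to supply for your own test bench, `FakeDevice` is a toy stand-in so the sketch runs, and none of this is the SEU Monitor IP's actual interface (see the user guide for that).

```python
def find_critical_bits(essential_bits, flip_bit, design_ok):
    """Flip each essential bit in turn and record the ones that break the design.

    flip_bit(bit)  -- hypothetical hook that toggles one configuration bit
    design_ok()    -- hypothetical hook that returns True if the design still works
    """
    critical = set()
    for bit in essential_bits:
        flip_bit(bit)  # inject a single upset
        try:
            if not design_ok():
                critical.add(bit)
        finally:
            # Flip the bit back so that errors never accumulate.
            # A real system may also need a reset after a failure.
            flip_bit(bit)
    return critical


class FakeDevice:
    """Toy stand-in for a device under test (illustration only)."""

    def __init__(self, truly_critical):
        self.truly_critical = set(truly_critical)
        self.flipped = set()

    def flip_bit(self, bit):
        self.flipped ^= {bit}  # toggle membership

    def design_ok(self):
        return not (self.flipped & self.truly_critical)


if __name__ == "__main__":
    device = FakeDevice(truly_critical={3, 17})
    found = find_critical_bits(range(30), device.flip_bit, device.design_ok)
    print(sorted(found))
```

The resulting set is the "new map": an upset on a bit outside it can be repaired and logged without raising an alarm.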

 

 

OK – But How Does this Affect Availability?

 

Imagine that every time an upset occurs, you take yourself out of service, reconfigure, and start over.  Instead of suffering a functional failure for one in 10 to one in 80 upsets, you are out of service one in one: on every upset.  This might be prudent if the system can tolerate a short outage. The driver assistance design is a good case in point: 100 milliseconds of outage is OK, not detecting that stopped truck ahead of you is not OK.  However, in the majority of cases, the already low level of upsets does not warrant getting your customers upset with each and every upset (pun intended: it can be very upsetting).

 

The most common level of mitigation for upsets is just to log that it happened. The system log then contains the time and date of the event. For perhaps one in 10 to one in 80 upsets there will be a matching outage, during which the system took some remedial action at a higher level. The upset is also repaired, as repair has been shown to improve the mean time to functional failure by an additional 30% or so.
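The trade-off between the two strategies can be put into numbers with a small sketch. Every input here is an assumption chosen for illustration: the upset rate, the 30-second recovery time after a functional failure, and the function name. Only the 100 ms reconfiguration outage and the 1:10 and 1:80 de-rating figures echo the text.

```python
FIVE_NINES_BUDGET_S = (1 - 0.99999) * 365 * 24 * 3600  # about 315 s per year


def annual_downtime_s(upsets_per_year, outages_per_upset, outage_s):
    """Expected seconds out of service per year from soft errors alone."""
    return upsets_per_year * outages_per_upset * outage_s


if __name__ == "__main__":
    upsets_per_year = 2.0     # hypothetical raw upset rate for one device
    reconfig_outage_s = 0.1   # 100 ms, as in the driver assistance example
    failure_outage_s = 30.0   # hypothetical recovery time after a functional failure

    # Strategy 1: go out of service and reconfigure on every upset (one in one).
    every = annual_downtime_s(upsets_per_year, 1.0, reconfig_outage_s)
    print(f"reconfigure on every upset: {every:.2f} s/year")

    # Strategy 2: log and repair; an outage happens only on a functional failure.
    for derating in (10, 80):
        logged = annual_downtime_s(upsets_per_year, 1.0 / derating, failure_outage_s)
        print(f"log and repair at 1:{derating}: {logged:.2f} s/year")

    print(f"5-nines budget: {FIVE_NINES_BUDGET_S:.0f} s/year")
```

Which strategy costs less downtime depends entirely on how long each kind of outage lasts and on how your customers experience it, so plug in your own figures before drawing conclusions.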

 

If the logging strategy does not meet the availability requirement, then using the essential bits feature is unlikely to meet the requirement, unless the design can transfer operation smoothly over to a spare unit while the unit struck repairs itself and is ready to go back into service.

 

 

“I Do Not Have Redundancy”

 

Some legacy systems made no use of redundancy; as you can quickly surmise, they probably never met their availability requirement, and never will.  A hardware failure alone would be sufficient to violate the requirement, as dispatching a technician for repair and replacement means many minutes of down-time before service is restored.  Soft errors, while more frequent than hardware failures (see page 20 of ug116.pdf above), are rarely what causes a non-redundant system to miss its availability criteria; the lack of redundancy is the cause.

 

Even if triple-modular redundancy with triplicated voters is used to fully mitigate the effects of the soft error, the first hardware failure will cause the system to exceed its allowed down-time and so miss its availability requirement.

 

Bottom line: if availability is a requirement, consider a redundant architecture. Use the SEU Monitor IP to find and fix, log, and transfer function gracefully to the spare unit in the event of an essential bit upset.

 

If availability is not the problem, but reliability is (mean time between failures), then consider find and fix, and log;  do not attempt to stop, or reconfigure, or start over. In the case where you must know if you are OK or not, consider the use of the essential bits as the starting point to find the actual critical bits by flipping them one by one, and looking for functional failure. This might only be needed for safety critical systems where redundancy to a spare unit is too costly, takes too much power, or takes too much space.

 

With the addition of a separate module of RTL (Verilog or VHDL) to perform a watch-dog function, the restart option may be taken only when the watch-dog notes that the repair has not returned the system to a functional state. The last “single point of failure” then becomes the watch-dog, which is usually made as small and as simple as possible.  For example, the SEU Monitor IP itself fails in roughly 1 of 1500 upsets (as measured in a neutron beam), so that the likelihood of the SEU Monitor itself failing drops below that of a hardware failure.

 

As soon as the cause of failure drops below the hardware failure rate you are “done,” as nothing you can do will improve matters.  Adding anything only increases the hard failure rate--more hardware, more probability of hardware failure!  Often, engineers do not realize that there is a point of diminishing returns and do not know when to stop.  It has been suggested that once any large source of failures is reduced to the background, underlying hardware failure rate, you are done.

 

If the resulting failure rate is too high, or availability too low, then you must be designing a human space flight vehicle or a nuclear reactor, in which case there are other clear requirements and methods to use, and criteria to meet.  Remember, if you do have a safety critical system which may affect human life or the environment if it fails, you need to contact Xilinx.  We need to inform you of the best methods to use the device if it is used to do these sorts of tasks, and what standards apply to its intended use that must be met.

Comments
by justin0000brown on ‎03-07-2011 01:49 PM

Thanks Austin for the informative article.  It does give much insight into Xilinx's assertiveness in preventing SEUs.  Do the soft error calculations / Mb posted in the links include the de-rating mentioned above?  If not, then would I actually only see 1/10 or less of the failures from the calculation posted in the link?

 

Justin

by Xilinx Employee on ‎03-07-2011 02:00 PM

Justin,

 

The figures in ug116.pdf are before any derating, the raw data.

 

Since derating depends on the placed and routed RTL unique to a customer, derating cannot be predicted (at the present time).  This is a problem I am working on, but it is very difficult to know when your RTL is working or broken, especially when I do not have the RTL in question.

 

The goal (in future) is to create the tools necessary to measure the derating in ISE.  Right now, all we can offer is the guidance given in the blog posting:  derating is always greater than 10:1, and is never more than 100:1 (for a device at or near "full").

 

Use of the SEU Monitor IP allows for upset insertion, so if ~1000 upsets are inserted randomly, one can get a pretty good idea of what the derating is by just inserting errors at random  until the functionality is broken.  After each inserted error, the function is checked, and if no problem is found, the flipped bit is flipped back.  If you allow errors to accumulate, you are not really measuring what is needed.

 

 

 

About the Author
  • Austin graduated from UC Berkeley in 1974 and 1975 with his BS EECS in Electromagnetic (E&M) Theory and MS EECS in Communications and Information Theory. He worked in the telecommunications field for 20 years designing optical, microwave, and copper-based transmission systems. Austin joined the IC Design department for the Virtex product line at Xilinx in 1998. His role for the last four years is working for Xilinx Research Labs, where he is looking beyond the present technology issues. Austin has 69 patents.