When it comes to safety-critical systems (planes, trains, automobiles) or to safe industrial controls (windmills, factory steam plants), the designer must take a number of facts into account. First, failures are always an “option”1 – they can, and will, happen. Second, when a failure occurs, it is your job to make sure that it does not result in damage or loss of life.
A commercial system, where a failure results in inconvenience or frustration, calls for a different design philosophy, and the designer must recognize which case applies. In a commercial system (server, router, telephone system, cell phone switch), a failure is extremely unlikely to result in damage, destruction, or loss of life. Depending on the service grade (number of users affected) and the use of the equipment (a business telephone switch versus a 911 service center), a more stringent set of rules may apply. But if the system is deemed non-critical and it fails, the result is loss of service: not very pleasant, but acceptable. Of course, a reputation for a low failure rate always helps keep customers happy (and paying for your products).
Here I will first discuss systems where failure is acceptable (loss of service), and then I will discuss systems where failure is unacceptable.
Systems that are ‘OK’ to Fail
Typically, designers wish to make their designs as reliable as possible. The hard failure rates of Xilinx FPGA devices, by technology, are listed in Table 1-16 of the Quarterly Reliability Report2. Take the hard failure rate for the 7 series devices, listed as 24 FIT (May 8, 2012). One FIT is one failure per billion device-hours, so 24 FIT equates to one failure per 4,757 years. With 10,000 systems in the field, you would expect one customer field failure return every 174 days, on average. This is the baseline, do-nothing-special failure rate for a system consisting of just one component, a Xilinx 28 nm 7 series FPGA device. Of course, no system consists of just one component: there are printed circuit boards, power supplies, connectors, LEDs, switches, and so forth.
A typical system, with all of its associated components, is going to have a much higher failure rate than the Xilinx FPGA device alone. Typical numbers range from 1,000 FIT (failures per billion hours) to 10,000 FIT, of which the 7 series FPGA device is a small part. That equates to one failure every 114 years down to one every 11.4 years per system, or, for the 10,000-system fleet above, from four days down to less than a day between customer returns. Again, it is an inconvenience, but no one is harmed and no damage results.
Not all components have a constant failure rate. Most devices exhibit what is known as a ‘bathtub’ curve in their failure rate over time: the initial period of life sees more failures, known as ‘infant mortality.’ After this comes a long period with a lower failure rate, known as the ‘normal operating life.’ Near the end of the component’s lifetime, the failure rate increases again; this period is known as the ‘end of life.’
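The bathtub shape can be sketched as a piecewise failure rate. The breakpoints and FIT values below are purely illustrative, not data for any real part:

```python
def bathtub_fit(age_years):
    """Illustrative piecewise failure rate (FIT) over a component's life."""
    if age_years < 1:
        # Infant mortality: elevated early failures, falling toward the floor.
        return 5000 - 4000 * age_years
    elif age_years < 15:
        # Normal operating life: low, roughly constant failure rate.
        return 1000
    else:
        # End of life: wear-out mechanisms push the rate back up.
        return 1000 + 800 * (age_years - 15)

print(bathtub_fit(0))    # 5000 FIT at power-on
print(bathtub_fit(10))   # 1000 FIT mid-life
print(bathtub_fit(20))   # 5000 FIT well into wear-out
```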
A laptop computer, for example, has a typical service life of three years. Failures which occur immediately after purchase or after three years dominate the failures that one experiences, with far fewer failures in between.
Xilinx designs its commercial and industrial products for a minimum 15-year life. Depending on junction temperature, the achievable lifetime may be longer or shorter. Please consult the reliability report and user guides.
Systems that are NOT ‘OK’ to Fail
In a system where life or the environment is threatened, there is still an allowed failure rate. It may be a very stringent requirement, but since no system is perfect and no failsafe design exists, there is always a number, however tough it may be to achieve.
Consider the power-control system in a windmill. If it fails, the windmill may throw a blade, totally destroying the machine and potentially injuring or killing a person nearby. Let us suppose, for the sake of this blog, that the allowed failure rate is 10 FIT, or one failure every 11,416 years. The first thing to recognize is that no single device is reliable enough to meet this requirement; the system will have to include some form of redundancy.
Since the hardware failure rate of the components is too high to meet the requirement, the designer will have to duplicate all of the critical components, subsystems, and power supplies, and provide some means of verifying that there are no failures. Such systems are often designed so that when they do fail, they fail safely.
Imagine two completely separate assemblies, each with a failure rate of 10,000 FIT, each checking the other through a duplicated communications channel (in this case, perhaps just two UARTs on four wires in each assembly). The remaining single points of failure are the four wires and the UARTs, but any such failure would be detectable: the subassembly would no longer be able to talk to its twin, and the system could immediately shut everything down safely.
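A minimal sketch of that cross-check logic follows. The class name and the one-second timeout are my assumptions; a real system would run this over the duplicated UART links with a timeout chosen for the application:

```python
HEARTBEAT_TIMEOUT_S = 1.0  # assumed detection window for a silent twin

class CrossCheck:
    """Each assembly runs one of these, fed by heartbeats from its twin."""

    def __init__(self, now):
        self.last_heartbeat = now

    def heartbeat(self, now):
        # Called whenever a message arrives from the twin assembly.
        self.last_heartbeat = now

    def peer_alive(self, now):
        # If the twin has gone silent too long, assume it (or the link)
        # has failed, and shut everything down safely.
        return (now - self.last_heartbeat) <= HEARTBEAT_TIMEOUT_S

check = CrossCheck(now=0.0)
check.heartbeat(now=0.5)
print(check.peer_alive(now=1.0))   # True: heard from the twin 0.5 s ago
print(check.peer_alive(now=2.0))   # False: silent too long, fail safe
```

Note that a dead link and a dead twin look the same here; both correctly trigger the safe shutdown.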
There is still the possibility that both assemblies fail at the same time. The rate of that joint failure is the product of the two individual failure rates and the length of time during which a failure of one can go undetected by the other. With each assembly at 10,000 FIT (one failure per 100,000 hours) and failures detected within an hour, both fail in the same undetected window roughly once every 10 billion hours, or about once every 1.1 million years. That comfortably beats the 10 FIT requirement.
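That joint-failure estimate can be written out explicitly. The one-hour detection window is an assumption; shrinking it improves the number proportionally:

```python
HOURS_PER_YEAR = 8760

def joint_failure_mtbf_years(fit_a, fit_b, detect_window_hours):
    """MTBF in years for two redundant assemblies failing within the same
    undetected window: joint rate = lambda_a * lambda_b * window."""
    lam_a = fit_a / 1e9   # failures per hour
    lam_b = fit_b / 1e9
    joint_rate_per_hour = lam_a * lam_b * detect_window_hours
    return 1 / joint_rate_per_hour / HOURS_PER_YEAR

# Two 10,000 FIT assemblies, cross-check detects a failure within an hour:
years = joint_failure_mtbf_years(10_000, 10_000, 1.0)
print(round(years))   # roughly 1.1 million years between joint failures
```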
As another example, the space shuttle used five computers: as long as all five agreed with one another, the crew could launch. After launch, the system was deemed safe as long as three of the five worked (agreed).
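The shuttle's agreement rule is, at its core, quorum voting. The toy version below is my own sketch of the idea, not NASA's actual algorithm:

```python
from collections import Counter

def vote(outputs, quorum):
    """Return the value agreed on by at least `quorum` channels, else None
    (no safe consensus: the system should fall back or shut down)."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= quorum else None

# Pre-launch: all five channels must agree.
print(vote([42, 42, 42, 42, 42], quorum=5))    # 42
# In flight: three of five suffice, tolerating two faulty channels.
print(vote([42, 42, 42, 7, None], quorum=3))   # 42
# Only two agree: no quorum, so no trusted output.
print(vote([42, 42, 7, 7, None], quorum=3))    # None
```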
As another example, commercial airplanes use three-way redundancy for their flight controls.
When am I Done?
Designers often forget that designing for reliability is an endless task unless they have a clearly stated goal. In the above examples, if the soft failure rate (from atmospheric neutrons) were 1 FIT for an element in a 7 series device (the actual rate for a PicoBlaze processor), what should be done? Well, the general rule that I have seen used for a very long time now is that once any failure rate is less than the hardware failure rate, you are done.
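That stopping rule fits in one line (the function name is mine):

```python
def mitigation_done(soft_fit, hardware_fit):
    """Stop mitigating a failure mode once its rate falls below the
    irreducible hardware failure rate of the system."""
    return soft_fit < hardware_fit

# 1 FIT soft-error rate against a 24 FIT hard-failure floor: done.
print(mitigation_done(soft_fit=1.0, hardware_fit=24.0))    # True
# A 50 FIT soft-error rate would still dominate: keep working.
print(mitigation_done(soft_fit=50.0, hardware_fit=24.0))   # False
```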
If the final system does not meet the requirement, then you have the wrong architecture (no redundancy, for example).
There are Standards for that, now….
Fortunately, designers now have international standards for industrial, aeronautic, automotive, medical, and other systems3: IEC 61508 is the umbrella standard that applies to all of them at the highest level, with each industry having its own derived standard. The one thing to remember about these standards is that the individual components and their manufacturers are not certified (or certifiable); it is the final system and its manufacturer that must be certified.
Anyone who claims to be able to sell you an ISO 26262-certified (the automotive functional-safety standard) FPGA device is lying; there is no such thing. The same is true for DO-254, the standard for airborne electronic hardware. It is the entire system and its design that must meet the standard, not the components. Of course, if there were one FPGA device in the system and nothing else, it would be easy to find the failure rate (just look it up in UG116, if it is a Xilinx component). But that is never the case; all systems are more than just one device.