Fault tolerant design

This article contains general theory of fault tolerant design. For specific implementations, see fault tolerant system.

Fault tolerant design refers to a method for designing a system so it will continue to operate, possibly at a reduced level, rather than failing completely, when one component of the system fails. The term is most commonly used to describe computer systems designed to lose little or no time due to problems either in the hardware or the software. An example in another field is a motor vehicle designed so it will continue to be drivable if one of the tires is punctured.

=Methods=

*Fault tolerant components. If each component, in turn, can continue to function when one of its subcomponents fails, the total system can continue to operate as well. Using the motor vehicle example, some cars have run-flat tires, which contain a solid rubber core that allows them to be used if the surface is punctured. They can only be used for a limited time at a reduced speed, but this is still a substantial improvement over traditional tires.

*Redundancy. This means having backup components which automatically kick in should one component fail. For example, large cargo trucks can lose a tire without any major consequences. They have so many tires that no one tire is critical (with the exception of the front tires, which are used to steer).
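In software, redundancy of this kind is often expressed as automatic failover. The following is a minimal sketch, not a definitive implementation: the names `call_with_failover`, `AllComponentsFailed`, `primary`, and `backup` are hypothetical, and a "component" is assumed to be any function that raises an exception when it fails.

```python
from typing import Callable, List, Sequence


class AllComponentsFailed(Exception):
    """Raised when the primary and every backup have failed."""


def call_with_failover(components: Sequence[Callable[[], str]]) -> str:
    """Try each redundant component in order and return the first result.

    The first entry plays the role of the primary; the others are backups
    that take over automatically when an earlier component raises.
    """
    errors: List[Exception] = []
    for component in components:
        try:
            return component()
        except Exception as exc:  # treat any exception as a component fault
            errors.append(exc)
    raise AllComponentsFailed(f"{len(errors)} component(s) failed")


def primary() -> str:
    raise RuntimeError("primary is down")  # simulated fault


def backup() -> str:
    return "served by backup"


result = call_with_failover([primary, backup])
print(result)
```

As with the cargo truck, the caller never sees the failure of a single component; only the failure of every component in the list is reported.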

=Disadvantages=

The advantages of fault tolerant design are obvious, but what are the disadvantages?

  • Interference with fault detection. To continue the same example, with either of the fault tolerant systems it may not be obvious to the driver when a tire has been punctured. This is usually handled with a separate automated fault detection system. In the case of the tire, an air pressure monitor detects the loss of pressure and notifies the driver. The alternative is a manual fault detection system, such as manually inspecting all tires at each stop.
  • Reduction of priority of fault correction. Even if the operator is aware of the fault, having a fault tolerant system is likely to reduce the importance of repairing the fault. If the faults are not corrected, this will eventually lead to system failure, when the fault tolerant component fails completely or when all redundant components have also failed.
  • Test difficulty. For certain critical fault tolerant systems, such as a nuclear reactor, there is no easy way to verify that the backup components are functional. The most infamous example is Chernobyl, where operators ran a test of a backup power arrangement for the coolant pumps with the emergency core cooling system disabled. The test ended in the destruction of the reactor core and a massive release of radiation.
  • Cost. Both fault tolerant components and redundant components tend to increase cost. This can be a purely economic cost or can include other measures, such as weight. Manned spaceships, for example, have so many redundant and fault tolerant components that their weight is increased dramatically over unmanned systems, which don't require the same level of safety.
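The first two disadvantages can be reduced by pairing redundancy with automated fault detection, in the spirit of the air pressure monitor described above. The sketch below is illustrative only: `MonitoredPool`, its `notify` callback, and the tire functions are hypothetical names, and a fault is assumed to surface as an exception.

```python
from typing import Callable, Dict, List


class MonitoredPool:
    """Redundant components that mask faults but still report them."""

    def __init__(self, components: Dict[str, Callable[[], str]],
                 notify: Callable[[str], None]) -> None:
        self.components = components
        self.notify = notify          # plays the role of the pressure monitor
        self.failed: List[str] = []   # components known to be faulty

    def healthy_count(self) -> int:
        """Number of components not yet known to have failed."""
        return len(self.components) - len(self.failed)

    def call(self) -> str:
        for name, component in self.components.items():
            if name in self.failed:
                continue  # skip components already known to be broken
            try:
                return component()
            except Exception:
                # The fault is tolerated, but the operator is told about it
                # and about how much redundancy is left.
                self.failed.append(name)
                self.notify(f"{name} failed; {self.healthy_count()} component(s) left")
        raise RuntimeError("all redundant components have failed")


def flat_tire() -> str:
    raise RuntimeError("puncture")  # simulated fault


def good_tire() -> str:
    return "rolling"


alerts: List[str] = []
pool = MonitoredPool({"tire-1": flat_tire, "tire-2": good_tire}, alerts.append)
status = pool.call()
```

Reporting the remaining redundancy along with each fault is one way to keep repairs from being postponed until the last backup has also failed.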
=When to use fault tolerant design=

Providing fault tolerant design for every component is normally not an option. In such cases the following criteria may be used to determine which components should be fault tolerant:

  • How critical is the component? In a car, the radio isn't critical, so this component has less need for fault tolerance.
  • How likely is the component to fail? Some components, like the drive shaft in a car, are not likely to fail, so no fault tolerance is needed.
  • How expensive is it to make the component fault tolerant? Requiring a redundant car engine, for example, would likely be too expensive, both economically and in terms of weight and space, to be considered.

An example of a component that passes all the tests is a car's occupant restraint system. Although we do not normally think of it this way, the primary occupant restraint system is gravity. If the vehicle rolls over or undergoes severe g-forces, then this primary method of occupant restraint may fail. Restraining the occupants during such an accident is absolutely critical to safety, so we pass the first test. Accidents causing occupant ejection were quite common before seat belts, so we pass the second test. The cost of a redundant restraint method like seat belts is quite low, both economically and in terms of weight and space, so we pass the third test. Therefore, adding seat belts to all vehicles is an excellent idea. Other supplemental restraint systems, such as airbags, are more expensive, so they may not pass that test. This is why inexpensive vehicles typically have fewer airbags than expensive vehicles.
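The three tests can be written down as a simple checklist. This is only a sketch of the reasoning above: the `Component` class and its field names are hypothetical, and each example fills in only the judgment the article actually makes, with the remaining fields set to placeholder values marked as assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    name: str
    critical: bool        # test 1: is the component critical?
    likely_to_fail: bool  # test 2: is the component likely to fail?
    affordable: bool      # test 3: is fault tolerance affordable?


def needs_fault_tolerance(c: Component) -> bool:
    """A component qualifies only if it passes all three tests."""
    return c.critical and c.likely_to_fail and c.affordable


candidates: List[Component] = [
    # Radio: not critical (from the text); other fields are assumptions.
    Component("radio", critical=False, likely_to_fail=True, affordable=True),
    # Drive shaft: unlikely to fail (from the text); other fields are assumptions.
    Component("drive shaft", critical=True, likely_to_fail=False, affordable=True),
    # Engine: redundancy too expensive (from the text); other fields are assumptions.
    Component("engine", critical=True, likely_to_fail=True, affordable=False),
    # Occupant restraint: passes all three tests (from the text).
    Component("occupant restraint", critical=True, likely_to_fail=True, affordable=True),
]

selected = [c.name for c in candidates if needs_fault_tolerance(c)]
print(selected)
```

In practice each test is a matter of degree rather than a yes-or-no answer, as the airbag example shows, so a real assessment would weigh the three factors rather than combine booleans.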

=Examples=

Hardware fault tolerance sometimes requires that broken parts can be swapped out with new ones while the system is still operational. Such a system implemented with a single backup is known as single point tolerant, and represents the vast majority of fault tolerant systems. In such systems the mean time between failures (MTBF) should be long enough for the operators to have time to fix the broken devices before the backup also fails. It helps if the time between failures is as long as possible, but this is not specifically required in a fault tolerant system.
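The relationship between MTBF and repair time can be made concrete with a rough estimate. The sketch below assumes a constant failure rate (an exponential failure model), which is a common simplification and not something the article states. The function name and the example figures of a 10,000 hour MTBF and a 24 hour repair are hypothetical.

```python
import math


def backup_failure_probability(repair_hours: float, mtbf_hours: float) -> float:
    """Probability that the backup fails before the repair is finished.

    Assumes failures follow an exponential distribution with mean
    mtbf_hours, so P(failure within t) = 1 - exp(-t / MTBF).
    """
    return 1.0 - math.exp(-repair_hours / mtbf_hours)


# Hypothetical figures: backup MTBF of 10,000 hours, 24 hours to repair.
p_one_day = backup_failure_probability(24, 10_000)

# Leaving the fault unrepaired for 30 days raises the risk considerably.
p_one_month = backup_failure_probability(24 * 30, 10_000)

print(f"{p_one_day:.4%} vs {p_one_month:.4%}")
```

Under this model the risk depends only on the ratio of repair time to MTBF, which is why a single point tolerant system relies on prompt repair as much as on reliable parts.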

Fault tolerance is notably successful in computer applications. Tandem Computers built their entire business on such machines, which used single point tolerance to create their NonStop systems with uptimes measured in decades.

=Related terms=

Fault tolerance (a system keeps working even when a fault occurs) is distinct from simply having few faults. For instance, the Western Electric crossbar systems had downtime of about two hours per forty years, and were therefore highly fault resistant. But when a fault did occur they still stopped operating completely, and therefore were not truly fault tolerant.
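To put the crossbar figure in perspective, the following sketch converts it to an availability fraction. It assumes that "two hours per forty years" means total time out of service, which is an interpretation of the figure, and the function name `availability` is hypothetical.

```python
HOURS_PER_YEAR = 365.25 * 24  # average year length, including leap years


def availability(downtime_hours: float, period_years: float) -> float:
    """Fraction of the period during which the system was in service."""
    return 1.0 - downtime_hours / (period_years * HOURS_PER_YEAR)


crossbar = availability(downtime_hours=2, period_years=40)
print(f"{crossbar:.7f}")
```

A figure this high describes how rarely the system fails, not what happens when it does, which is the distinction between fault resistance and fault tolerance.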