2-8 SPARC Enterprise M4000/M5000 Servers Overview Guide • August 2009
2.4 Reliability, Availability, and Serviceability
Reliability, availability, and serviceability (RAS) are aspects of the system design that
affect the ability of the system to:
■ Operate without stopping
■ Remain accessible and usable
■ Minimize the time necessary to service the system
TABLE 2-2 defines each RAS feature.
2.4.1 Reliability
Reliability represents the length of time the midrange server can operate normally
without failure.
To improve quality, components are selected with the product service life and the
required response to a failure in mind. Evaluations such as stress tests that check
service life then determine whether components and products meet their target
reliability levels.
Reliability is equally important for hardware and software. Trouble-free software
is the goal, but eliminating every software problem is difficult. The following
functions improve reliability in the field:
■ Cooperates with XSCF firmware to periodically check whether the software,
including the domain OS, is running (host watchdog monitoring).
■ Periodically performs memory patrol to detect memory soft errors and
stuck-at faults, even in memory areas not normally used (memory patrol).
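The host watchdog idea in the first item can be illustrated with a small heartbeat sketch. This is not the XSCF implementation. The class name, the timeout value, and the method names are hypothetical, and the sketch assumes only that the monitored software reports in periodically and that a monitor flags it when a report is overdue.

```python
from dataclasses import dataclass


@dataclass
class HostWatchdog:
    """Hypothetical heartbeat watchdog; an illustration, not XSCF firmware logic."""
    timeout_s: float          # assumed maximum allowed gap between heartbeats
    last_beat_s: float = 0.0  # time of the most recent heartbeat

    def heartbeat(self, now_s: float) -> None:
        # Called by the monitored software to signal that it is still running.
        self.last_beat_s = now_s

    def expired(self, now_s: float) -> bool:
        # Called periodically by the monitor; True means the heartbeat is overdue.
        return now_s - self.last_beat_s > self.timeout_s


wd = HostWatchdog(timeout_s=30.0)
wd.heartbeat(now_s=10.0)
print(wd.expired(now_s=35.0))  # 25 s since the last heartbeat, within the limit
print(wd.expired(now_s=45.0))  # 35 s since the last heartbeat, over the limit
```

Passing the time in explicitly keeps the sketch deterministic. A real monitor would read a clock and decide what recovery action to take when the check fails.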
TABLE 2-2   RAS Definitions

RAS Feature       Description
Reliability       Length of time the midrange server can operate normally without
                  failure. The ability to detect failures with accuracy.
Availability      Ratio of time during which the system is accessible and usable.
Serviceability    Time required for the system to be recovered by specific
                  maintenance after a failure occurs.
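
TABLE 2-2 describes availability as a ratio of time. As a rough illustration, that ratio can be computed as uptime divided by total observed time. The guide does not give a formula, so the function name and the sample figures below are hypothetical.

```python
def availability(uptime_hours: float, downtime_hours: float) -> float:
    """Fraction of the observed time during which the system was accessible and usable."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total observed time must be positive")
    return uptime_hours / total


# Hypothetical figures: 8,750 hours up and 10 hours down in one observation period.
print(f"{availability(8750, 10):.4%}")
```

In this view, shorter service times (serviceability) reduce the downtime term and so raise the ratio.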