ECN No Name Newsletter: May, 1987

The ECN No Name Newsletter is no longer being published. This is an archived issue.


Why do Systems Crash?

Curt Freeland

Today's computer systems are complex beasts. Keeping them running is getting more difficult (and expensive) with each new model. Fortunately, most system failures can be loosely classified into five categories: cooling system failures, power system failures, system hardware failures, system software failures, and human error. The following paragraphs give a general explanation of each category and an approximation of the downtime you can expect from it.

Human error is usually the easiest failure to diagnose because the error is commonly a mistyped command at the system console terminal. These errors usually take the system out of use for 30 minutes to an hour, but human error has been known to shut down systems for a day or two.

Cooling system failures are fairly easy to diagnose, because the computer room is hot when you enter. As a safety precaution, the ECN computer rooms have protection circuits arranged to shut down the computer power when the room hits 80 degrees F. An alarm indicator informs us what the problem is, and we call Physical Plant to come repair it. Spare parts are usually available, and repairs can be accomplished within 24 hours during the work week.
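The cutoff rule described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the actual ECN protection circuit, which is hardware; the function name, the alarm text, and the choice to treat a reading of exactly 80 degrees F as a trip are all assumptions made for the example.

```python
# Illustrative sketch only: the real protection is a hardware circuit.
# The 80 degree F threshold comes from the article; everything else
# (names, alarm wording, ">=" comparison) is assumed for this example.
SHUTDOWN_THRESHOLD_F = 80.0


def check_room(temperature_f, threshold_f=SHUTDOWN_THRESHOLD_F):
    """Return (power_on, alarm) for one room temperature reading.

    power_on is False once the reading reaches the threshold;
    alarm is a message naming the problem, or None if all is well.
    """
    if temperature_f >= threshold_f:
        return False, "COOLING FAILURE: room at %.1f F" % temperature_f
    return True, None


if __name__ == "__main__":
    # Walk through a few hypothetical readings.
    for reading in (72.0, 79.9, 80.0, 85.5):
        print(reading, check_room(reading))
```

Returning the alarm message together with the power state mirrors the arrangement in the text: the shutdown protects the machines, and the indicator tells the staff why it happened.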

System software failures (also known as bugs) are usually hard to find. Some bugs show up as soon as you install new software, while others wait for months, and show up when the system hits a high load average. Once the bugs are tracked down, they are fairly easy to fix. The difficult part is finding the bug, and understanding what is being done incorrectly (as any EE263 student should know).

Power system failures can be obvious or obscure. If the campus power goes off, the problem is easy to diagnose (no juice) and downtime could be 30 minutes to several hours. However, if a power supply within some part of the computer fails, diagnosis can be difficult because the internal supplies often fail in a mode where they seem to be working. This makes it very difficult to track down the faulty element, but once it is located and replaced, the system should be back on-line in an hour or two.

System hardware failures are sometimes easy to diagnose and sometimes very difficult. For easy failures, the system prints out an error message on the console terminal. This message tells what kind of error the system is having, pinpointing a bad board, a disk drive failure, or some other faulty hardware. For more difficult failures, the system may just quit running. When no error message is printed, everything in the system is suspect. Most hardware failures can be found and fixed within a couple of hours. Sometimes a repair part has to be ordered or a disk system replaced, and then the downtime may amount to a day or two. If the problem is a system design bug, the repairs may take many months to implement: first the design flaw has to be determined, then a solution has to be invented, and in the meantime a temporary fix has to be developed for the interim.


Last modified: Saturday, 01-Nov-97 13:36:24 EST
