 (Nikkei BP Group)
 (No.1 High-Tech News Site in Japanese)
Japan Suffers Many Online System Outages in 1998
December 28, 1998 (TOKYO) -- Japanese online service operators suffered
many outages in 1998 and, as a result, are making major efforts to
reduce the number of outages and improve the overall quality and
reliability of their services.
The series of failures began with NTT Data Corp.'s ANSER financial
service network on Jan. 20. Failures later hit a train service management
system of East Japan Railway Co. (JR East), an online transaction system
of Asahi Bank Ltd. and various other online systems. (See table.)
Repeated failures also hit the trading system for futures options at the
Tokyo Stock Exchange (TSE). The system malfunctioned on six consecutive
business days from Nov. 24 to Dec. 1.
Frequent Troubles with Large-Scale Systems
An outage in a large-scale online system for financial services,
transportation or other major services affects society as a whole.
For this reason, mission-critical systems are built with preventive
measures to ensure high reliability.
Why, then, have system failures occurred, even though such large-scale
online systems have double or triple safety measures so as to enhance
their reliability?
Nikkei Computer magazine investigated these cases, focusing on the
sequence of events in each failure. The magazine identified two specific
patterns in the incidents.
In the first pattern, functions added or system upgrades made for the
launch of new businesses or services trigger a breakdown. In the second
pattern, several problems, including hardware glitches and software bugs,
combine to cause an outage.
New Functions Tend to Cause Outages
At the Tokyo Stock Exchange and KDD Co., Ltd., new functions and system
upgrades triggered outages. The TSE's trading system for futures options
was disrupted on Nov. 24, and KDD's credit card call system failed on
Nov. 4.
According to an analysis by Nikkei Computer, those failures were caused
partly by insufficient validation when introducing a new function and by
inadequate testing prior to implementation.
Specifically, the TSE failure on Nov. 24 was caused by a test operation
performed the previous weekend. The test was run with an increased number
of files on the system, in preparation for an expected increase in the
number of stocks to be handled. When live trading started, the mismatch
between the number of stocks actually handled and the number of
corresponding files had been left uncorrected, and the outage occurred.
A process trying to match the number of stocks with the number of files
caused a "time-out" in a communication server.
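The mismatch described above is the kind of condition a check before the
start of trading could catch. The following is only a minimal sketch, not
a description of the TSE system: the article does not say how stocks and
files were represented, so the function names and the use of plain sets of
issue codes are assumptions made for illustration.

```python
from __future__ import annotations


def find_mismatches(listed_issues: set[str], issue_files: set[str]) -> dict[str, list[str]]:
    """Compare the issues to be traded with the files prepared for them.

    Both arguments are hypothetical: sets of issue codes, one for the
    issues actually handled and one for the files present on the system.
    """
    return {
        # Issues that will trade but have no file prepared.
        "missing_files": sorted(listed_issues - issue_files),
        # Files left on the system (for example after a test) with no issue.
        "orphan_files": sorted(issue_files - listed_issues),
    }


def safe_to_open(listed_issues: set[str], issue_files: set[str]) -> bool:
    """Return True only when issues and files match one for one."""
    result = find_mismatches(listed_issues, issue_files)
    return not result["missing_files"] and not result["orphan_files"]


if __name__ == "__main__":
    listed = {"A001", "A002"}
    files = {"A001", "A002", "T901", "T902"}  # two files left over from a test
    print(find_mismatches(listed, files))
    print(safe_to_open(listed, files))
```

Running the check offline, before the system accepts orders, keeps the
matching work out of the time-critical communication path.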
The KDD incident was traced to the flawed introduction of a new
log-recording program, which the company had developed on top of a
commercially available voice response system.
The program contained a simple bug that caused log information to be
written continuously beyond the boundary of a specified area. The bug got
through because "acceptance inspection of the program was insufficient,"
according to KDD.
As events unfolded, the log overflowed its file size limit, which the
operating system judged to be abnormal. The operating system then shut
down and restarted the voice response software repeatedly and
intermittently, which caused the major problem.
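One common way to keep a logger inside a fixed area is to wrap around at
the boundary and overwrite the oldest data. The sketch below illustrates
that general technique only. It is not KDD's program, whose design the
article does not describe, and the class name and byte-oriented interface
are assumptions.

```python
class BoundedLog:
    """Fixed-capacity log area that wraps around instead of overflowing."""

    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.buffer = bytearray(capacity)
        self.write_pos = 0      # offset of the next byte to write
        self.total_written = 0  # bytes accepted so far

    def append(self, record: bytes) -> None:
        # If one record is larger than the whole area, keep only its tail.
        if len(record) > self.capacity:
            record = record[-self.capacity:]
        for byte in record:
            self.buffer[self.write_pos] = byte
            # Wrap at the boundary rather than writing past it.
            self.write_pos = (self.write_pos + 1) % self.capacity
        self.total_written += len(record)

    def contents(self) -> bytes:
        """Return the retained bytes, oldest first."""
        if self.total_written < self.capacity:
            return bytes(self.buffer[:self.write_pos])
        return bytes(self.buffer[self.write_pos:] + self.buffer[:self.write_pos])


if __name__ == "__main__":
    log = BoundedLog(8)
    log.append(b"abcde")
    log.append(b"fghij")
    print(log.contents())  # b'cdefghij'
```

The area never grows past its capacity, so the size limit that the
operating system enforces is never reached. The cost is that the oldest
log entries are lost.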
Loopholes in Safeguards
In some cases, several problems, including hardware faults and software
bugs, combine to bring a system down.
This pattern was seen in a failure of the bond purchasing and sales system
at the Tokyo Stock Exchange on April 30, and again in an outage of the
cooperative online system CAFIS, run by NTT Data Corp. for the credit
card business sector, on June 16.
Usually, a large-scale online system is built as a duplex system, with
redundancy among the processors of the host machine and in all other
hardware, including disk drives and communications devices. Many systems
adopt fault-tolerant machines that can keep operating, without affecting
the whole system, even if part of the hardware malfunctions.
However, outages still occur when there are loopholes in these
safeguards. Such loopholes arise from bugs and problems in programs and
the operating system as well as from hardware problems.
In the CAFIS failure on June 16, communication devices malfunctioned and
a program bug was also present. This happened even though all the hosts
and front-end processors (FEPs) in CAFIS are fault-tolerant machines, and
the networks linking the six FEP machines in a center have redundant
channels for all components.
Communications between the FEPs were down for a sustained period because
all the communications control programs in the FEPs were initialized
by mistake when trouble occurred within the network.
During this outage, another program that was accessing a file did not
take the steps needed to return to a normal state. As a result, the
trouble could not be fixed as designed, and the system was down for about
five hours.
These failures in large-scale online systems show how vulnerable such
systems can be. A growing number of systems, especially corporate
enterprise systems, are required to operate around the clock.
This is in part due to Internet businesses going into full operation
and communications with overseas offices expanding dramatically.
Operators of online systems must take care to ensure that those systems
operate properly and with the utmost reliability at all times.
Table: Accidental outages in large-scale systems in Japan during 1998
Date of Occurrence | System Brought Down
January 20 | Financial Service Network, Bank ANSER, Run by NTT Data
February 4 | Train Service Management System, ATOS, for Controlling the Chuo Line of East Japan Railway Co. (JR East)
April 30 | Bond (such as convertible bonds) Purchasing and Selling System Operated by the Tokyo Stock Exchange (TSE)
June 16 | Cooperative Online System CAFIS, Run by NTT Data Corp., for Credit Card Business Sector
July 7 | Cooperative Online System CAFIS, Run by NTT Data Corp., for Credit Card Business Sector
August 4 | Transaction Online System of Asahi Bank Ltd.
August 11 | Travelers Management System of All Nippon Airways Co., Ltd. (ANA)
August 25 | Dealing System for Over-The-Counter Stocks, JASDAQ System, Operated by the Securities Dealers Association of Japan
November 4 | Credit Card Call System of KDD Co., Ltd.
November 8 | Credit Control System of DC Card Co., Ltd.
November 24, 25, 26, 27, 30 and December 1 | Trading System for Futures Options Operated by the Tokyo Stock Exchange (TSE)
(Tomohiko Hoshino, Hidenori Kawamata; Staff Editors; Nikkei
Computer)