Welcome to AsiaBizTech Web Site

 
(Nikkei BP Group)



    Japan Suffers Many Online System Outages in 1998

    December 28, 1998 (TOKYO) -- Japanese online service operators suffered many outages in 1998, and they are now making major efforts to reduce the number of outages and improve the overall quality and reliability of their services.

    The series of troubles began with NTT Data Corp.'s ANSER financial service network on Jan. 20, and went on to affect a train service management system of East Japan Railway Co. (JR East), an online transaction system of Asahi Bank Ltd. and various other online systems. (See table.)

    The trading system for futures options at the Tokyo Stock Exchange was hit repeatedly: it malfunctioned on six consecutive business days, from Nov. 24 to Dec. 1.

    Frequent Troubles with Large-Scale Systems

    An outage in a large-scale online system for financial, transportation or other major services affects society as a whole. For this reason, mission-critical systems are built with safeguards intended to ensure high reliability.

    Why, then, do failures still occur, even though such large-scale online systems have double or triple safeguards to enhance their reliability?

    Nikkei Computer magazine investigated those cases, focusing on the sequence of events in each. The magazine identified two specific patterns in the incidents.

    In the first pattern, functions added for new services, or a system upgrade made to support a new business or service, trigger the breakdown. In the second pattern, several problems, such as hardware glitches and software bugs, combine to bring the system down.

    New Functions Tend to Cause Outages

    The Tokyo Stock Exchange (TSE) and KDD Co., Ltd. both had cases in which new functions or system upgrades triggered outages. At the TSE, the trouble hit the trading system for futures options on Nov. 24; at KDD, it hit a credit card call system on Nov. 4.

    According to Nikkei Computer's analysis, those troubles were caused partly by insufficient validation when the new functions were introduced and by inadequate testing prior to implementation.

    Specifically, the cause of the TSE accident on Nov. 24 was a test operation performed the previous weekend. The test was conducted with an increased number of files on the system, in preparation for an expected increase in the number of stocks to be handled. When real trading began, the mismatch between the number of stocks actually handled and the number of corresponding files had been left in place, and the system went down. The cause was a "time-out" in a communication server, which was affected by a process trying to match the number of stocks with the number of files.
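
    A mismatch of this kind can in principle be caught before trading starts, rather than being reconciled while the system is live. The following is a minimal sketch of such a pre-open check. It is not the TSE's actual procedure, which the article does not describe; the function names, the identifiers and the idea of comparing two simple lists are all assumptions made for illustration.

```python
def find_mismatches(listed_issues, provisioned_files):
    """Return (issues without a file, files without an issue), both sorted."""
    issues = set(listed_issues)
    files = set(provisioned_files)
    missing_files = sorted(issues - files)
    orphan_files = sorted(files - issues)
    return missing_files, orphan_files


def check_before_open(listed_issues, provisioned_files):
    """Raise before trading starts if the two sets disagree."""
    missing, orphans = find_mismatches(listed_issues, provisioned_files)
    if missing or orphans:
        raise RuntimeError(
            f"pre-open check failed: {len(missing)} issue(s) without a file, "
            f"{len(orphans)} file(s) without an issue"
        )


if __name__ == "__main__":
    # Hypothetical data: a weekend test left two extra files behind.
    issues = ["OPT-001", "OPT-002"]
    files = ["OPT-001", "OPT-002", "TEST-901", "TEST-902"]
    try:
        check_before_open(issues, files)
    except RuntimeError as err:
        print(err)
```

    Failing fast in this way turns a live time-out into a start-up error that operators can resolve before real transactions begin.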

    The KDD incident was traced to the flawed introduction of a new log-recording program, which the company had developed on top of a commercially available voice response system.

    The program contained a simple bug that caused log information to be written continuously beyond the boundary of its specified area. The bug went undetected because "acceptance inspection of the program was insufficient," according to KDD.

    As events unfolded, the log grew past the file size limit, which the operating system judged to be abnormal. The operating system then shut down and restarted the voice response software repeatedly and intermittently, which caused the outage.
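
    One common way to keep a logger inside a fixed area is to treat the area as a ring and overwrite the oldest record once it is full. The sketch below illustrates that general technique only. It is not KDD's program, whose design the article does not describe; the class name `BoundedLog` and its methods are hypothetical.

```python
class BoundedLog:
    """Fixed-capacity log area that wraps around instead of overflowing."""

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self._slots = [None] * capacity
        self._next = 0   # index of the slot the next record goes into
        self._count = 0  # number of valid records, never above capacity

    def append(self, record):
        # Overwrite the oldest slot once the area is full; the write index
        # always stays in the range 0 .. capacity - 1.
        self._slots[self._next] = record
        self._next = (self._next + 1) % len(self._slots)
        self._count = min(self._count + 1, len(self._slots))

    def records(self):
        """Return the stored records, oldest first."""
        size = len(self._slots)
        start = (self._next - self._count) % size
        return [self._slots[(start + i) % size] for i in range(self._count)]


if __name__ == "__main__":
    log = BoundedLog(3)
    for call_id in range(5):
        log.append(f"call {call_id}")
    print(log.records())
```

    The trade-off is that old records are lost once the area fills, so a real system would also need to archive or rotate the log if the full history matters.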

    Loopholes in Safeguards

    In the second pattern, several problems, such as hardware faults and software bugs, occur together and lead to a system outage.

    This pattern was seen in an outage of the bond purchasing and selling system at the Tokyo Stock Exchange on April 30, and again on June 16 in an outage of CAFIS, the cooperative online system that NTT Data Corp. runs for the credit card business sector.

    Usually, a large-scale online system is built as a duplex system, with redundancy among the processors of the host machine and across all other hardware, including disk drives and communications devices. Many systems adopt fault-tolerant machines, which keep operating without affecting the whole system if part of the hardware malfunctions.
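
    The idea behind a duplex configuration can be sketched in a few lines: a request goes to the primary unit and falls back to a standby if the primary fails. This is an illustration of the general technique only, not of any system named in this article; `call_with_failover`, `failing_primary` and `healthy_standby` are hypothetical names.

```python
def call_with_failover(units, request):
    """Try each redundant unit in order; return the first successful reply."""
    errors = []
    for unit in units:
        try:
            return unit(request)
        except Exception as err:  # a real system would catch narrower errors
            errors.append(err)
    raise RuntimeError(f"all {len(units)} units failed: {errors}")


def failing_primary(request):
    # Stand-in for a unit with a hardware fault.
    raise ConnectionError("primary unit is down")


def healthy_standby(request):
    # Stand-in for the redundant unit that takes over.
    return f"handled {request} on standby"


if __name__ == "__main__":
    print(call_with_failover([failing_primary, healthy_standby], "tx-1"))
```

    Redundancy of this kind covers a fault in one unit. It does not help when every unit runs the same faulty software, since each unit then fails in the same way.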

    However, outages still occur when there are loopholes in these safeguards. Such loopholes arise from bugs and other problems in programs and the operating system, in combination with hardware problems.

    In the CAFIS outage on June 16, communication devices malfunctioned and a program bug was also present. This happened even though all the hosts and front-end processors (FEPs) in CAFIS are fault-tolerant machines, and the networks linking the six FEPs in the center have redundant channels for all components.

    Communications between the FEPs were down for a sustained period because all the communications control programs in the FEPs were mistakenly initialized when trouble occurred within the network.

    During this outage, another program that was accessing a file did not take the steps needed to return to a normal state. As a result, the trouble could not be fixed as designed, and the system was down for about five hours.

    These examples show how vulnerable large-scale online systems are. A growing number of systems, especially corporate enterprise systems, must operate around the clock, in part because Internet businesses are going into full operation and communications with overseas offices are expanding dramatically.

    Operators of online systems must take care to ensure that those systems operate properly and with the utmost reliability at all times.

    Table: Accidental outages in large-scale systems in Japan during 1998

    Date of Occurrence                          System Brought Down
    ------------------------------------------  ------------------------------------------
    January 20                                  Financial Service Network, Bank ANSER, Run by NTT Data
    February 4                                  Train Service Management System, ATOS, for Controlling the Chuo Line of East Japan Railway Co. (JR East)
    April 30                                    Bond (such as convertible bonds) Purchasing and Selling System Operated by the Tokyo Stock Exchange (TSE)
    June 16                                     Cooperative Online System CAFIS, Run by NTT Data Corp., for Credit Card Business Sector
    July 7                                      Cooperative Online System CAFIS, Run by NTT Data Corp., for Credit Card Business Sector
    August 4                                    Transaction Online System of Asahi Bank Ltd.
    August 11                                   Travelers Management System of All Nippon Airways Co., Ltd. (ANA)
    August 25                                   Dealing System for Over-The-Counter Stocks, JASDAQ System, Operated by the Securities Dealers Association of Japan
    November 4                                  Credit Card Call System of KDD Co., Ltd.
    November 8                                  Credit Control System of DC Card Co., Ltd.
    November 24, 25, 26, 27, 30 and December 1  Trading System for Futures Options Operated by the Tokyo Stock Exchange (TSE)



    Related stories:
    Tokyo Train Chaos Caused by Failed Comm. Controller
    NTT Data's Online Credit Card System Fails Again

    (Tomohiko Hoshino, Hidenori Kawamata; Staff Editors; Nikkei Computer)






    Copyright © 1997-98
    Nikkei BP BizTech, Inc.
    All Rights Reserved.
    Updated: Thu Dec 24 17:37:32 1998 PDT