Disclaimer: I disclaim EVERYTHING disclaimable about this page.
Enjoy.
This week at the IEEE Real-Time Systems Symposium I heard a fascinating keynote address by David Wilner, Chief Technical Officer of Wind River Systems. Wind River makes VxWorks, the real-time embedded systems kernel that was used in the Mars Pathfinder mission. In his talk, he explained in detail the actual software problems that caused the total system resets of the Pathfinder spacecraft, how they were diagnosed, and how they were solved. I wanted to share his story with each of you.
VxWorks provides preemptive priority scheduling of threads. Tasks on the Pathfinder spacecraft were executed as threads with priorities that were assigned in the usual manner reflecting the relative urgency of these tasks.
Pathfinder contained an "information bus", which you can think of as a shared memory area used for passing information between different components of the spacecraft. A bus management task ran frequently with high priority to move certain kinds of data in and out of the information bus. Access to the bus was synchronized with mutual exclusion locks (mutexes). The meteorological data gathering task ran as an infrequent, low priority thread, and used the information bus to publish its data. When publishing its data, it would acquire a mutex, do writes to the bus, and release the mutex. If an interrupt caused the information bus thread to be scheduled while this mutex was held, and if the information bus thread then attempted to acquire this same mutex in order to retrieve published data, this would cause it to block on the mutex, waiting until the meteorological thread released the mutex before it could continue. The spacecraft also contained a communications task that ran with medium priority.
Most of the time this combination worked fine. However, very infrequently it was possible for an interrupt to occur that caused the (medium priority) communications task to be scheduled during the short interval while the (high priority) information bus thread was blocked waiting for the (low priority) meteorological data thread. In this case, the long-running communications task, having higher priority than the meteorological task, would prevent it from running, consequently preventing the blocked information bus task from running. After some time had passed, a watchdog timer would go off, notice that the data bus task had not been executed for some time, conclude that something had gone drastically wrong, and initiate a total system reset.
This scenario is a classic case of priority inversion.
HOW WAS THIS DEBUGGED?
VxWorks can be run in a mode where it records a total trace of all interesting system events, including context switches, uses of synchronization objects, and interrupts. After the failure, JPL engineers spent hours and hours running the system on the exact spacecraft replica in their lab with tracing turned on, attempting to replicate the precise conditions under which they believed that the reset occurred. Early in the morning, after all but one engineer had gone home, the engineer finally reproduced a system reset on the replica. Analysis of the trace revealed the priority inversion.
HOW WAS THE PROBLEM CORRECTED?
When created, a VxWorks mutex object accepts a boolean parameter that indicates whether priority inheritance should be performed by the mutex. The mutex in question had been initialized with the parameter off; had it been on, the low-priority meteorological thread would have inherited the priority of the high-priority data bus thread blocked on it while it held the mutex, causing it to be scheduled with higher priority than the medium-priority communications task, thus preventing the priority inversion. Once diagnosed, it was clear to the JPL engineers that using priority inheritance would prevent the resets they were seeing.
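The effect of that flag can be illustrated with a toy discrete-time scheduler. The sketch below is Python rather than VxWorks code; the task names, priorities, release times and run lengths are assumptions chosen only to reproduce the sequence described above, and the single `inherit` switch stands in for the mutex creation parameter.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    priority: int      # larger number = more urgent
    release: int       # tick at which the task becomes ready
    script: list       # steps: "lock", "run" or "unlock", one tick each
    pc: int = 0        # index of the next step to execute
    boost: int = 0     # priority inherited while holding the mutex

    @property
    def effective(self):
        return max(self.priority, self.boost)


def make_tasks():
    # Hypothetical timings, chosen only to expose the inversion.
    return [
        Task("meteo", 1, 0, ["lock", "run", "run", "unlock"]),
        Task("bus", 3, 2, ["lock", "run", "unlock"]),
        Task("comms", 2, 3, ["run"] * 10),
    ]


def simulate(tasks, inherit, ticks=40):
    """Return {task name: tick at which the task finished}."""
    owner = None       # task currently holding the single mutex
    blocked = set()    # names of tasks waiting for the mutex
    finished = {}
    for now in range(ticks):
        current = None
        while True:
            ready = [t for t in tasks
                     if t.release <= now and t.pc < len(t.script)
                     and t.name not in blocked]
            if not ready:
                break
            candidate = max(ready, key=lambda t: t.effective)
            wants_lock = candidate.script[candidate.pc] == "lock"
            if wants_lock and owner is not None and owner is not candidate:
                # The candidate blocks; optionally lend its priority
                # to the task that holds the mutex.
                blocked.add(candidate.name)
                if inherit:
                    owner.boost = max(owner.boost, candidate.effective)
                continue
            current = candidate
            break
        if current is None:
            continue   # nothing runnable this tick
        step = current.script[current.pc]
        if step == "lock":
            owner = current
        elif step == "unlock":
            owner = None
            current.boost = 0
            blocked.clear()   # waiters retry the lock on a later tick
        current.pc += 1
        if current.pc == len(current.script):
            finished[current.name] = now + 1
    return finished


without = simulate(make_tasks(), inherit=False)
with_pi = simulate(make_tasks(), inherit=True)
print(without["bus"], with_pi["bus"])   # prints "17 7"
```

With these made-up numbers the high-priority task is released at tick 2. Without inheritance it finishes at tick 17, after the medium-priority task has run to completion; with inheritance it finishes at tick 7. A watchdog that expected the bus task to complete within, say, 8 ticks of its release would fire only in the first case.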
VxWorks contains a C language interpreter intended to allow developers to type in C expressions and functions to be executed on the fly during system debugging. The JPL engineers fortuitously decided to launch the spacecraft with this feature still enabled. By coding convention, the initialization parameters for the mutex in question (and for two others that could have caused the same problem) were stored in global variables, whose addresses were in symbol tables also included in the launch software, and available to the C interpreter. A short C program was uploaded to the spacecraft, which, when interpreted, changed the values of these variables from FALSE to TRUE. No more system resets occurred.
ANALYSIS AND LESSONS
First and foremost, diagnosing this problem as a black box would have been impossible. Only detailed traces of actual system behavior enabled the faulty execution sequence to be captured and identified.
Secondly, leaving the "debugging" facilities in the system saved the day. Without the ability to modify the system in the field, the problem could not have been corrected.
Finally, the engineer's initial analysis that "the data bus task executes very frequently and is time-critical -- we shouldn't spend the extra time in it to perform priority inheritance" was exactly wrong. It is precisely in such time-critical and important situations that correctness is essential, even at some additional performance cost.
HUMAN NATURE, DEADLINE PRESSURES
David told us that the JPL engineers later confessed that one or two system resets had occurred in their months of pre-flight testing. They had never been reproducible or explainable, and so the engineers, in a very human-nature response of denial, decided that they probably weren't important, using the rationale "it was probably caused by a hardware glitch".
Part of it too was the engineers' focus. They were extremely focused on ensuring the quality and flawless operation of the landing software. Should it have failed, the mission would have been lost. It is entirely understandable for the engineers to discount occasional glitches in the less-critical land-mission software, particularly given that a spacecraft reset was a viable recovery strategy at that phase of the mission.
THE IMPORTANCE OF GOOD THEORY/ALGORITHMS
David also said that some of the real heroes of the situation were people from CMU who, in a paper he had heard presented many years earlier, first identified the priority inversion problem and proposed the solution. He apologized for not remembering the precise details of the paper or who wrote it. Bringing things full circle, it turns out that the three authors of this result were all in the room, and at the end of the talk were encouraged by the program chair to stand and be acknowledged. They were Lui Sha, John Lehoczky, and Raj Rajkumar. When was the last time you saw a room of people cheer a group of computer science theorists for their significant practical contribution to advancing human knowledge? :-)
It was quite a moment.
POSTLUDE
For the record, the paper was:
L. Sha, R. Rajkumar, and J. P. Lehoczky. Priority Inheritance Protocols: An Approach to Real-Time Synchronization. In IEEE Transactions on Computers, vol. 39, pp. 1175-1185, Sep. 1990.
Paris, 19 July 1996
ARIANE 5
Flight 501 Failure
Report by the Inquiry Board
The Chairman of the Board :
Prof. J. L. LIONS
On 4 June 1996, the maiden flight of the Ariane 5 launcher ended in a failure. Only about 40 seconds after initiation of the flight sequence, at an altitude of about 3700 m, the launcher veered off its flight path, broke up and exploded. Engineers from the Ariane 5 project teams of CNES and Industry immediately started to investigate the failure. Over the following days, the Director General of ESA and the Chairman of CNES set up an independent Inquiry Board and nominated the following members :
- Prof. Jacques-Louis Lions (Chairman), Academie des Sciences (France)
- Dr. Lennart Lübeck (Vice-Chairman), Swedish Space Corporation (Sweden)
- Mr. Jean-Luc Fauquembergue, Delegation Generale pour l'Armement (France)
- Mr. Gilles Kahn, Institut National de Recherche en Informatique et en Automatique (INRIA) (France)
- Prof. Dr. Ing. Wolfgang Kubbat, Technical University of Darmstadt (Germany)
- Dr. Ing. Stefan Levedag, Daimler Benz Aerospace (Germany)
- Dr. Ing. Leonardo Mazzini, Alenia Spazio (Italy)
- Mr. Didier Merle, Thomson CSF (France)
- Dr. Colin O'Halloran, Defence Evaluation and Research Agency (DERA) (U.K.)
The terms of reference assigned to the Board requested it
- to determine the causes of the launch failure,
- to investigate whether the qualification tests and acceptance tests were appropriate in relation to the problem encountered,
- to recommend corrective action to remove the causes of the anomaly and other possible weaknesses of the systems found to be at fault.
The Board started its work on 13 June 1996. It was assisted by a Technical Advisory Committee composed of :
- Dr Mauro Balduccini (BPD)
- Mr Yvan Choquer (Matra Marconi Space)
- Mr Remy Hergott (CNES)
- Mr Bernard Humbert (Aerospatiale)
- Mr Eric Lefort (ESA)
In accordance with its terms of reference, the Board concentrated its investigations on the causes of the failure, the systems supposed to be responsible, any failures of similar nature in similar systems, and events that could be linked to the accident. Consequently, the recommendations made by the Board are limited to the areas examined. The report contains the analysis of the failure, the Board's conclusions and its recommendations for corrective measures, most of which should be undertaken before the next flight of Ariane 5. There is in addition a report for restricted circulation in which the Board's findings are documented in greater technical detail. Although it consulted the telemetry data recorded during the flight, the Board has not undertaken an evaluation of those data. Nor has it made a complete review of the whole launcher and all its systems.
This report is the result of a collective effort by the Commission, assisted by the members of the Technical Advisory Committee.
We have all worked hard to present a very precise explanation of the reasons for the failure and to make a contribution towards the improvement of Ariane 5 software. This improvement is necessary to ensure the success of the programme.
The Board's findings are based on thorough and open presentations from the Ariane 5 project teams, and on documentation which has demonstrated the high quality of the Ariane 5 programme as regards engineering work in general and completeness and traceability of documents.
Chairman of the Board
On the basis of the documentation made available and the information presented to the Board, the following has been observed:
The weather at the launch site at Kourou on the morning of 4 June 1996 was acceptable for a launch that day, and presented no obstacle to the transfer of the launcher to the launch pad. In particular, there was no risk of lightning since the strength of the electric field measured at the launch site was negligible. The only uncertainty concerned fulfilment of the visibility criteria.
The countdown, which also comprises the filling of the core stage, went smoothly until H0-7 minutes when the launch was put on hold since the visibility criteria were not met at the opening of the launch window (08h35 local time). Visibility conditions improved as forecast and the launch was initiated at H0 = 09h 33mn 59s local time (=12h 33mn 59s UT). Ignition of the Vulcain engine and the two solid boosters was nominal, as was lift-off. The vehicle performed a nominal flight until approximately H0 + 37 seconds. Shortly after that time, it suddenly veered off its flight path, broke up, and exploded. A preliminary investigation of flight data showed:
The origin of the failure was thus rapidly narrowed down to the flight control system and more particularly to the Inertial Reference Systems, which obviously ceased to function almost simultaneously at around H0 + 36.7 seconds.
The information available on the launch includes:
- telemetry data received on the ground until H0 + 42 seconds
- trajectory data from radar stations
- optical observations (IR camera, films)
- inspection of recovered material.
The whole of the telemetry data received in Kourou was transferred to CNES/Toulouse where the data were converted into parameter over time plots. CNES provided a copy of the data to Aerospatiale, which carried out analyses concentrating mainly on the data concerning the electrical system.
The self-destruction of the launcher occurred near to the launch pad, at an altitude of approximately 4000 m. Therefore, all the launcher debris fell back onto the ground, scattered over an area of approximately 12 km² east of the launch pad. Recovery of material proved difficult, however, since this area is nearly all mangrove swamp or savanna.
Nevertheless, it was possible to retrieve from the debris the two Inertial Reference Systems. Of particular interest was the one which had worked in active mode and stopped functioning last, and for which, therefore, certain information was not available in the telemetry data (provision for transmission to ground of this information was confined to whichever of the two units might fail first). The results of the examination of this unit were very helpful to the analysis of the failure sequence.
Post-flight analysis of telemetry has shown a number of anomalies which have been reported to the Board. They are mostly of minor significance and such as to be expected on a demonstration flight.
One anomaly which was brought to the particular attention of the Board was the gradual development, starting at H0 + 22 seconds, of variations in the hydraulic pressure of the actuators of the main engine nozzle. These variations had a frequency of approximately 10 Hz.
There are some preliminary explanations as to the cause of these variations, which are now under investigation.
After consideration, the Board has formed the opinion that this anomaly, while significant, has no bearing on the failure of Ariane 501.
In general terms, the Flight Control System of the Ariane 5 is of a standard design. The attitude of the launcher and its movements in space are measured by an Inertial Reference System (SRI). It has its own internal computer, in which angles and velocities are calculated on the basis of information from a "strap-down" inertial platform, with laser gyros and accelerometers. The data from the SRI are transmitted through the databus to the On-Board Computer (OBC), which executes the flight program and controls the nozzles of the solid boosters and the Vulcain cryogenic engine, via servovalves and hydraulic actuators.
In order to improve reliability there is considerable redundancy at equipment level. There are two SRIs operating in parallel, with identical hardware and software. One SRI is active and one is in "hot" stand-by, and if the OBC detects that the active SRI has failed it immediately switches to the other one, provided that this unit is functioning properly. Likewise there are two OBCs, and a number of other units in the Flight Control System are also duplicated.
The design of the Ariane 5 SRI is practically the same as that of an SRI which is presently used on Ariane 4, particularly as regards the software.
Based on the extensive documentation and data on the Ariane 501 failure made available to the Board, the following chain of events, their inter-relations and causes have been established, starting with the destruction of the launcher and tracing back in time towards the primary cause.
The SRI internal events that led to the failure have been reproduced by simulation calculations. Furthermore, both SRIs were recovered during the Board's investigation and the failure context was precisely determined from memory readouts. In addition, the Board has examined the software code which was shown to be consistent with the failure scenario. The results of these examinations are documented in the Technical Report.
Therefore, it is established beyond reasonable doubt that the chain of events set out above reflects the technical causes of the failure of Ariane 501.
In the failure scenario, the primary technical causes are the Operand Error when converting the horizontal bias variable BH, and the lack of protection of this conversion which caused the SRI computer to stop.
It has been stated to the Board that not all the conversions were protected because a maximum workload target of 80% had been set for the SRI computer. To determine the vulnerability of unprotected code, an analysis was performed on every operation which could give rise to an exception, including an Operand Error. In particular, the conversion of floating point values to integers was analysed and operations involving seven variables were at risk of leading to an Operand Error. This led to protection being added to four of the variables, evidence of which appears in the Ada code. However, three of the variables were left unprotected. No reference to justification of this decision was found directly in the source code. Given the large amount of documentation associated with any industrial application, the assumption, although agreed, was essentially obscured, though not deliberately, from any external review.
The reason for the three remaining variables, including the one denoting horizontal bias, being unprotected was that further reasoning indicated that they were either physically limited or that there was a large margin of safety, a reasoning which in the case of the variable BH turned out to be faulty. It is important to note that the decision to protect certain variables but not others was taken jointly by project partners at several contractual levels.
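The difference between a protected and an unprotected conversion can be sketched as follows. This is Python, not the flight Ada code; the 16-bit signed target range, the function names and the `OperandError` class are assumptions introduced for illustration, and saturation is only one possible form of protection.

```python
INT16_MIN, INT16_MAX = -32768, 32767   # assumed width of the integer target


class OperandError(ArithmeticError):
    """Stand-in for the exception raised by an out-of-range conversion."""


def to_int16_unprotected(value: float) -> int:
    # Mimics a conversion that traps when the result does not fit.
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value} does not fit in a 16-bit integer")
    return result


def to_int16_protected(value: float) -> int:
    # One possible protection: clamp to the representable range
    # instead of raising.
    return max(INT16_MIN, min(INT16_MAX, int(value)))


print(to_int16_protected(40000.0))     # prints 32767
try:
    to_int16_unprotected(40000.0)
except OperandError as exc:
    print("unprotected conversion raised:", exc)
```

The protected version costs an extra comparison on every call, which is the kind of overhead that a processor workload target puts pressure on.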
There is no evidence that any trajectory data were used to analyse the behaviour of the unprotected variables, and it is even more important to note that it was jointly agreed not to include the Ariane 5 trajectory data in the SRI requirements and specification.
Although the source of the Operand Error has been identified, this in itself did not cause the mission to fail. The specification of the exception-handling mechanism also contributed to the failure. In the event of any kind of exception, the system specification stated that: the failure should be indicated on the databus, the failure context should be stored in an EEPROM memory (which was recovered and read out for Ariane 501), and finally, the SRI processor should be shut down.
It was the decision to cease the processor operation which finally proved fatal. Restart is not feasible since attitude is too difficult to re-calculate after a processor shutdown; therefore the Inertial Reference System becomes useless. The reason behind this drastic action lies in the culture within the Ariane programme of only addressing random hardware failures. From this point of view exception - or error - handling mechanisms are designed for a random hardware failure which can quite rationally be handled by a backup system.
Although the failure was due to a systematic software design error, mechanisms can be introduced to mitigate this type of problem. For example the computers within the SRIs could have continued to provide their best estimates of the required attitude information. There is reason for concern that a software exception should be allowed, or even required, to cause a processor to halt while handling mission-critical equipment. Indeed, the loss of a proper software function is hazardous because the same software runs in both SRI units. In the case of Ariane 501, this resulted in the switch-off of two still healthy critical units of equipment.
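The contrast between the two exception-handling policies can be sketched with two identical units fed the same input. The Python below is an illustration under stated assumptions: the `Unit` class, the policy names "shutdown" and "hold", the computation and the sample values are all hypothetical, and holding the last good value is only one crude form of "best estimate".

```python
INT16_MAX = 32767


def compute_attitude_word(raw: float) -> int:
    """Hypothetical computation that fails on out-of-range input."""
    if abs(raw) > INT16_MAX:
        raise ArithmeticError("operand error")
    return int(raw)


class Unit:
    def __init__(self, name, policy):
        self.name = name
        self.policy = policy      # "shutdown" or "hold"
        self.alive = True
        self.last_good = None

    def read(self, raw):
        """Return (value, status), or None once the unit has stopped."""
        if not self.alive:
            return None
        try:
            self.last_good = compute_attitude_word(raw)
            return self.last_good, "valid"
        except ArithmeticError:
            if self.policy == "shutdown":
                self.alive = False          # processor halted
                return None
            return self.last_good, "degraded"   # flagged best estimate


def fly(policy, samples):
    active, standby = Unit("active", policy), Unit("standby", policy)
    result = None
    for raw in samples:
        # Both units run the same software on the same input, so a
        # design error hits them together.
        a = active.read(raw)
        s = standby.read(raw)
        result = a if a is not None else s
    return result


samples = [100.0, 200.0, 40000.0]
print(fly("shutdown", samples))   # prints None
print(fly("hold", samples))       # prints (200, 'degraded')
```

Under the shutdown policy the stand-by unit offers no protection, because it fails on the same sample as the active unit. Under the hold policy the consumer still receives a value, explicitly marked as degraded, and can decide how far to trust it.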
The original requirement accounting for the continued operation of the alignment software after lift-off was brought forward more than 10 years ago for the earlier models of Ariane, in order to cope with the rather unlikely event of a hold in the count-down, e.g. between -9 seconds, when flight mode starts in the SRI of Ariane 4, and -5 seconds, when certain events are initiated in the launcher which take several hours to reset. The period selected for this continued alignment operation, 50 seconds after the start of flight mode, was based on the time needed for the ground equipment to resume full control of the launcher in the event of a hold.
This special feature made it possible, with the earlier versions of Ariane, to restart the count-down without waiting for normal alignment, which takes 45 minutes or more, so that a short launch window could still be used. In fact, this feature was used once, in 1989 on Flight 33.
The same requirement does not apply to Ariane 5, which has a different preparation sequence, but it was maintained for commonality reasons, presumably based on the view that, unless proven necessary, it was not wise to make changes in software which worked well on Ariane 4.
Even in those cases where the requirement is found to be still valid, it is questionable for the alignment function to be operating after the launcher has lifted off. Alignment of mechanical and laser strap-down platforms involves complex mathematical filter functions to properly align the x-axis to the gravity axis and to find north direction from Earth rotation sensing. The assumption of preflight alignment is that the launcher is positioned at a known and fixed position. Therefore, the alignment function is totally disrupted when performed during flight, because the measured movements of the launcher are interpreted as sensor offsets and other coefficients characterising sensor behaviour.
Returning to the software error, the Board wishes to point out that software is an expression of a highly detailed design and does not fail in the same sense as a mechanical system. Furthermore software is flexible and expressive and thus encourages highly demanding requirements, which in turn lead to complex implementations which are difficult to assess.
An underlying theme in the development of Ariane 5 is the bias towards the mitigation of random failure. The supplier of the SRI was only following the specification given to it, which stipulated that in the event of any detected exception the processor was to be stopped. The exception which occurred was not due to random failure but a design error. The exception was detected, but inappropriately handled because the view had been taken that software should be considered correct until it is shown to be at fault. The Board has reason to believe that this view is also accepted in other areas of Ariane 5 software design. The Board is in favour of the opposite view, that software should be assumed to be faulty until applying the currently accepted best practice methods can demonstrate that it is correct.
This means that critical software - in the sense that failure of the software puts the mission at risk - must be identified at a very detailed level, that exceptional behaviour must be confined, and that a reasonable back-up policy must take software failures into account.
2.3 THE TESTING AND QUALIFICATION PROCEDURES
The Flight Control System qualification for Ariane 5 follows a standard procedure and is performed at the following levels :
- Equipment qualification
- Software qualification (On-Board Computer software)
- Stage integration
- System validation tests.
The logic applied is to check at each level what could not be achieved at the previous level, thus eventually providing complete test coverage of each sub-system and of the integrated system.
Testing at equipment level was in the case of the SRI conducted rigorously with regard to all environmental factors and in fact beyond what was expected for Ariane 5. However, no test was performed to verify that the SRI would behave correctly when being subjected to the count-down and flight time sequence and the trajectory of Ariane 5.
It should be noted that for reasons of physical law, it is not feasible to test the SRI as a "black box" in the flight environment, unless one makes a completely realistic flight test, but it is possible to do ground testing by injecting simulated accelerometric signals in accordance with predicted flight parameters, while also using a turntable to simulate launcher angular movements. Had such a test been performed by the supplier or as part of the acceptance test, the failure mechanism would have been exposed.
The main explanation for the absence of this test has already been mentioned above, i.e. the SRI specification (which is supposed to be a requirements document for the SRI) does not contain the Ariane 5 trajectory data as a functional requirement.
The Board has also noted that the systems specification of the SRI does not indicate operational restrictions that emerge from the chosen implementation. Such a declaration of limitation, which should be mandatory for every mission-critical device, would have served to identify any non-compliance with the trajectory of Ariane 5.
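A declaration of limitation of this kind lends itself to a mechanical check against predicted trajectory data. The Python sketch below shows one way such a check might look; the `Limit` record, the variable name and every number in the two profiles are invented for illustration and are not taken from the report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limit:
    variable: str
    low: float
    high: float


def check_envelope(limits, envelope):
    """Compare declared limits with predicted (min, max) per variable.

    Returns a list of (variable, reason) pairs; empty means compliant.
    """
    problems = []
    for lim in limits:
        if lim.variable not in envelope:
            problems.append((lim.variable, "no trajectory data"))
            continue
        lo, hi = envelope[lim.variable]
        if lo < lim.low or hi > lim.high:
            problems.append((lim.variable, "outside declared range"))
    return problems


# Hypothetical declaration and two made-up trajectory envelopes.
declared = [Limit("horizontal_bias", -32768, 32767)]
slow_profile = {"horizontal_bias": (-12000.0, 12000.0)}
fast_profile = {"horizontal_bias": (-5000.0, 60000.0)}

print(check_envelope(declared, slow_profile))   # prints []
print(check_envelope(declared, fast_profile))
print(check_envelope(declared, {}))
```

The third call covers the case where no trajectory data are supplied at all: the variable is reported as unverified rather than silently accepted.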
The other principal opportunity to detect the failure mechanism beforehand was during the numerous tests and simulations carried out at the Functional Simulation Facility ISF, which is at the site of the Industrial Architect. The scope of the ISF testing is to qualify :
- the guidance, navigation and control performance in the whole flight envelope,
- the sensors redundancy operation,
- the dedicated functions of the stages,
- the flight software (On-Board Computer) compliance with all equipment of the Flight Control Electrical System.
A large number of closed-loop simulations of the complete flight simulating ground segment operation, telemetry flow and launcher dynamics were run in order to verify :
- the nominal trajectory
- trajectories degraded with respect to internal launcher parameters
- trajectories degraded with respect to atmospheric parameters
- equipment failures and the subsequent failure isolation and recovery
In these tests many equipment items were physically present and exercised but not the two SRIs, which were simulated by specifically developed software modules. Some open-loop tests, to verify compliance of the On-Board Computer and the SRI, were performed with the actual SRI. It is understood that these were just electrical integration tests and "low-level" (bus communication) compliance tests.
It is not mandatory, even if preferable, that all the parts of the subsystem are present in all the tests at a given level. Sometimes this is not physically possible or it is not possible to exercise them completely or in a representative way. In these cases it is logical to replace them with simulators but only after a careful check that the previous test levels have covered the scope completely.
This procedure is especially important for the final system test before the system is operationally used (the tests performed on the 501 launcher itself are not addressed here since they are not specific to the Flight Control Electrical System qualification).
In order to understand the explanations given for the decision not to have the SRIs in the closed-loop simulation, it is necessary to describe the test configurations that might have been used.
Because it is not possible to simulate the large linear accelerations of the launcher in all three axes on a test bench (as discussed above), there are two ways to put the SRI in the loop:
A) To put it on a three-axis dynamic table (to stimulate the Ring Laser Gyros) and to substitute the analog output of the accelerometers (which can not be stimulated mechanically) by simulation via a dedicated test input connector and an electronic board designed for this purpose. This is similar to the method mentioned in connection with possible testing at equipment level.
B) To substitute both the analog output of the accelerometers and the output of the Ring Laser Gyros, via a dedicated test input connector, with signals produced by simulation.
The first approach is likely to provide an accurate simulation (within the limits of the three-axis dynamic table bandwidth) and is quite expensive; the second is cheaper and its performance depends essentially on the accuracy of the simulation. In both cases a large part of the electronics and the complete software are tested in the real operating environment.
When the project test philosophy was defined, the importance of having the SRIs in the loop was recognized and a decision was taken to select method B above. At a later stage of the programme (in 1992), this decision was changed. It was decided not to have the actual SRIs in the loop for the following reasons :
The opinion of the Board is that these arguments were technically valid, but since the purpose of a system simulation test is not only to verify the interfaces but also to verify the system as a whole for the particular application, there was a definite risk in assuming that critical equipment such as the SRI had been validated by qualification on its own, or by previous use on Ariane 4.
While high accuracy of a simulation is desirable, in the ISF system tests it is clearly better to compromise on accuracy but achieve all other objectives, amongst them to prove the proper system integration of equipment such as the SRI. The precision of the guidance system can be effectively demonstrated by analysis and computer simulation.
Under this heading it should be noted finally that the overriding means of preventing failures are the reviews which are an integral part of the design and qualification process, and which are carried out at all levels and involve all major partners in the project (as well as external experts). In a programme of this size, literally thousands of problems and potential failures are successfully handled in the review process and it is obviously not easy to detect software design errors of the type which were the primary technical cause of the 501 failure. Nevertheless, it is evident that the limitations of the SRI software were not fully analysed in the reviews, and it was not realised that the test coverage was inadequate to expose such limitations. Nor were the possible implications of allowing the alignment software to operate during flight realised. In these respects, the review process was a contributory factor in the failure.
In accordance with its terms of reference, the Board has examined possible other weaknesses, primarily in the Flight Control System. No weaknesses were found which were related to the failure, but in spite of the short time available, the Board has conducted an extensive review of the Flight Control System based on experience gained during the failure analysis.
The review has covered the following areas :
- The design of the electrical system,
- Embedded on-board software in subsystems other than the Inertial Reference System,
- The On-Board Computer and the flight program software.
In addition, the Board has made an analysis of methods applied in the development programme, in particular as regards software development methodology.
The results of these efforts have been documented in the Technical Report and it is the hope of the Board that they will contribute to further improvement of the Ariane 5 Flight Control System and its software.
The Board reached the following findings:
a) During the launch preparation campaign and the count-down no events occurred which were related to the failure.
b) The meteorological conditions at the time of the launch were acceptable and did not play any part in the failure. No other external factors have been found to be of relevance.
c) Engine ignition and lift-off were essentially nominal and the environmental effects (noise and vibration) on the launcher and the payload were not found to be relevant to the failure. Propulsion performance was within specification.
d) 22 seconds after H0 (command for main cryogenic engine ignition), variations of 10 Hz frequency started to appear in the hydraulic pressure of the actuators which control the nozzle of the main engine. This phenomenon is significant and has not yet been fully explained, but after consideration it has not been found relevant to the failure.
e) At 36.7 seconds after H0 (approx. 30 seconds after lift-off) the computer within the back-up inertial reference system, which was working on stand-by for guidance and attitude control, became inoperative. This was caused by an internal variable related to the horizontal velocity of the launcher exceeding a limit which existed in the software of this computer.
f) Approx. 0.05 seconds later the active inertial reference system, identical to the back-up system in hardware and software, failed for the same reason. Since the back-up inertial system was already inoperative, correct guidance and attitude information could no longer be obtained and loss of the mission was inevitable.
g) As a result of its failure, the active inertial reference system transmitted essentially diagnostic information to the launcher's main computer, where it was interpreted as flight data and used for flight control calculations.
h) On the basis of those calculations the main computer commanded the booster nozzles, and somewhat later the main engine nozzle also, to make a large correction for an attitude deviation that had not occurred.
i) A rapid change of attitude occurred which caused the launcher to disintegrate at 39 seconds after H0 due to aerodynamic forces.
j) Destruction was automatically initiated upon disintegration, as designed, at an altitude of 4 km and a distance of 1 km from the launch pad.
k) The debris was spread over an area of 5 x 2.5 km². Amongst the equipment recovered were the two inertial reference systems. They have been used for analysis.
l) The post-flight analysis of telemetry data has listed a number of additional anomalies which are being investigated but are not considered significant to the failure.
m) The inertial reference system of Ariane 5 is essentially common to a system which is presently flying on Ariane 4. The part of the software which caused the interruption in the inertial system computers is used before launch to align the inertial reference system and, in Ariane 4, also to enable a rapid realignment of the system in case of a late hold in the countdown. This realignment function, which does not serve any purpose on Ariane 5, was nevertheless retained for commonality reasons and allowed, as in Ariane 4, to operate for approx. 40 seconds after lift-off.
n) During design of the software of the inertial reference system used for Ariane 4 and Ariane 5, a decision was taken that it was not necessary to protect the inertial system computer from being made inoperative by an excessive value of the variable related to the horizontal velocity, a protection which was provided for several other variables of the alignment software. When taking this design decision, it was not analysed or fully understood which values this particular variable might assume when the alignment software was allowed to operate after lift-off.
o) In Ariane 4 flights using the same type of inertial reference system there has been no such failure because the trajectory during the first 40 seconds of flight is such that the particular variable related to horizontal velocity cannot reach, with an adequate operational margin, a value beyond the limit present in the software.
p) Ariane 5 has a high initial acceleration and a trajectory which leads to a build-up of horizontal velocity which is five times more rapid than for Ariane 4. The higher horizontal velocity of Ariane 5 generated, within the 40-second timeframe, the excessive value which caused the inertial system computers to cease operation.
q) The purpose of the review process, which involves all major partners in the Ariane 5 programme, is to validate design decisions and to obtain flight qualification. In this process, the limitations of the alignment software were not fully analysed and the possible implications of allowing it to continue to function during flight were not realised.
r) The specification of the inertial reference system and the tests performed at equipment level did not specifically include the Ariane 5 trajectory data. Consequently the realignment function was not tested under simulated Ariane 5 flight conditions, and the design error was not discovered.
s) It would have been technically feasible to include almost the entire inertial reference system in the overall system simulations which were performed. For a number of reasons it was decided to use the simulated output of the inertial reference system, not the system itself or its detailed simulation. Had the system been included, the failure could have been detected.
t) Post-flight simulations have been carried out on a computer with software of the inertial reference system and with a simulated environment, including the actual trajectory data from the Ariane 501 flight. These simulations have faithfully reproduced the chain of events leading to the failure of the inertial reference systems.
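Findings e), n) and p) above describe a value that grew past a limit built into the software, with no protection on that particular variable. The full Inquiry Board report attributes this to a conversion from a 64-bit floating point value to a 16-bit signed integer. The following is a minimal sketch of the difference between an unprotected and a protected narrowing conversion. It is written in Python purely for illustration (the flight software was not); the function names, the exception class and the choice of saturation as the protective measure are all assumptions, not the SRI design.

```python
INT16_MIN, INT16_MAX = -32768, 32767


class OperandError(Exception):
    """Hypothetical error raised when a value does not fit in 16 signed bits."""


def to_int16_unprotected(value: float) -> int:
    """Narrow to 16 bits; an out-of-range value raises and nothing handles it."""
    result = int(value)  # truncates toward zero
    if not INT16_MIN <= result <= INT16_MAX:
        raise OperandError(f"{value} does not fit in 16 bits")
    return result


def to_int16_saturating(value: float) -> int:
    """Protected variant: clamp to the representable range instead of raising."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))
```

With an illustrative (made-up) input of 40000.0, the unprotected version raises, while the saturating version returns 32767 and lets the caller keep running with a degraded but bounded value.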
The failure of the Ariane 501 was caused by the complete loss of guidance and attitude information 37 seconds after start of the main engine ignition sequence (30 seconds after lift-off). This loss of information was due to specification and design errors in the software of the inertial reference system.
The extensive reviews and tests carried out during the Ariane 5 Development Programme did not include adequate analysis and testing of the inertial reference system or of the complete flight control system, which could have detected the potential failure.
On the basis of its analyses and conclusions, the Board makes the following recommendations.
R1 Switch off the alignment function of the inertial reference system immediately after lift-off. More generally, no software function should run during flight unless it is needed.
R2 Prepare a test facility including as much real equipment as technically feasible, inject realistic input data, and perform complete, closed-loop, system testing. Complete simulations must take place before any mission. A high test coverage has to be obtained.
R3 Do not allow any sensor, such as the inertial reference system, to stop sending best effort data.
R4 Organize, for each item of equipment incorporating software, a specific software qualification review. The Industrial Architect shall take part in these reviews and report on complete system testing performed with the equipment. All restrictions on use of the equipment shall be made explicit for the Review Board. Make all critical software a Configuration Controlled Item (CCI).
R5 Review all flight software (including embedded software), and in particular :
R6 Wherever technically feasible, consider confining exceptions to tasks and devise backup capabilities.
R7 Provide more data to the telemetry upon failure of any component, so that recovering equipment will be less essential.
R8 Reconsider the definition of critical components, taking failures of software origin into account (particularly single point failures).
R9 Include external (to the project) participants when reviewing specifications, code and justification documents. Make sure that these reviews consider the substance of arguments, rather than check that verifications have been made.
R10 Include trajectory data in specifications and test requirements.
R11 Review the test coverage of existing equipment and extend it where it is deemed necessary.
R12 Give the justification documents the same attention as code. Improve the technique for keeping code and its justifications consistent.
R13 Set up a team that will prepare the procedure for qualifying software, propose stringent rules for confirming such qualification, and ascertain that specification, verification and testing of software are of a consistently high quality in the Ariane 5 programme. Including external RAMS experts is to be considered.
R14 A more transparent organisation of the cooperation among the partners in the Ariane 5 programme must be considered. Close engineering cooperation, with clear cut authority and responsibility, is needed to achieve system coherence, with simple and clear interfaces between partners.
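Recommendation R3, read together with finding g), suggests that a sensor should keep delivering its best available estimate and should make it impossible to mistake diagnostic output for flight data. One way this might look is sketched below. The class names, the message fields and the hold-last-good-value policy are hypothetical choices made for illustration; the Board does not prescribe a design.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class AttitudeMessage:
    valid: bool                       # True only for a fresh measurement
    attitude_deg: float               # best available estimate, even when degraded
    diagnostic: Optional[str] = None  # kept separate from the flight data field


class BestEffortSensor:
    """Hypothetical sensor wrapper that never stops reporting an estimate."""

    def __init__(self, initial_attitude_deg: float = 0.0) -> None:
        self._last_good = initial_attitude_deg

    def read(self, measure: Callable[[], float]) -> AttitudeMessage:
        try:
            value = float(measure())
        except Exception as exc:
            # Confine the exception: report the last good value, flagged as stale.
            return AttitudeMessage(False, self._last_good, repr(exc))
        self._last_good = value
        return AttitudeMessage(True, value)
```

A consumer would check `valid` before using `attitude_deg` at full confidence, and would never read `diagnostic` as a measurement.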
- END -
Computer Math Proof Shows Reasoning Power
By GINA KOLATA
Computers are whizzes when it comes to the grunt work of mathematics. But for creative and elegant solutions to hard mathematical problems, nothing has been able to beat the human mind. That is, perhaps, until now.
A computer program written by researchers at Argonne National Laboratory in Illinois has come up with a major mathematical proof that would have been called creative if a human had thought of it. In doing so, the computer has, for the first time, got a toehold into pure mathematics, a field described by its practitioners as more of an art form than a science. And the implications, some say, are profound, showing just how powerful computers can be at reasoning itself, at mimicking the flashes of logical insight or even genius that have characterized the best human minds.
Computers have found proofs of mathematical conjectures before, of course, but those conjectures were easy to prove. The difference this time is that the computer has solved a conjecture that stumped some of the best mathematicians for 60 years. And it did so with a program that was designed to reason, not to solve a specific problem. In that sense, the program is very different from chess-playing computer programs, for example, which are intended to solve just one problem: the moves of a chess game.
"It's a sign of power, of reasoning power," said Dr. Larry Wos, the supervisor of the computer reasoning project at Argonne. And with this result, obtained by a colleague, Dr. William McCune, he said, "We've taken a quantum leap forward."
Wos predicts that the result may mark the beginning of the end for mathematics research as it is now practiced, eventually freeing mathematicians to focus on discovering new conjectures, and leaving the proof to computers.
But the result also may challenge the very notion of creative thinking, raising the possibility that computers could take a parallel path to reach the same conclusions as great human thinkers. Or it may be that since no one has any idea how humans think, the magnificent bursts of creativity that spring apparently full blown from the minds of geniuses are actually a result of hidden, computer-like drudge work in the unconscious recesses of the brain.
Dr. Stanley Burris, a mathematician at the University of Waterloo in Canada, said that the result was "the first sort of real breakthrough in automated theorem proving," and that it did seem to be different in kind from what went before. It shows, he said, that "it's a very thin line between the mechanical and the creative and it may disappear."
Dr. Robert Boyer, a computer scientist at the University of Texas in Austin, hedged. "I think it's the most remarkable result in automated theorem proving in 30 years," he said, and "clearly a form of computer thinking." But, he added, "I don't want to make too much of that." It's best, he said, to think of a computer as "just another colleague, one that is sometimes helpful, but often not."
McCune's proof concerns a conjecture that is the very epitome of pure mathematics. "It has no applications," McCune said. His computer program proved that a set of three equations is equivalent to a Boolean algebra, that set of rules, familiar to generations of high school students, that govern unions and complements and intersections among sets.
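The article does not list the three equations. In the standard statement of the Robbins problem they are commutativity and associativity of a binary operation +, plus the Robbins equation n(n(x + y) + n(x + n(y))) = x for a unary operation n. The sketch below checks only the easy direction, that the two-element Boolean algebra (with + read as "or" and n as "not") satisfies all three. The hard direction, that every algebra satisfying them is Boolean, is what the program proved. The function names are illustrative.

```python
from itertools import product


def join(x: bool, y: bool) -> bool:
    """The binary operation '+', interpreted as Boolean 'or'."""
    return x or y


def n(x: bool) -> bool:
    """The unary operation 'n', interpreted as Boolean 'not'."""
    return not x


def satisfies_robbins_axioms() -> bool:
    """Check the three equations for every assignment of truth values."""
    for x, y, z in product((False, True), repeat=3):
        commutative = join(x, y) == join(y, x)
        associative = join(join(x, y), z) == join(x, join(y, z))
        robbins = n(join(n(join(x, y)), n(join(x, n(y))))) == x
        if not (commutative and associative and robbins):
            return False
    return True
```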
The problem was first posed in the 1930s by Dr. Herbert Robbins, who is New Jersey Professor of Mathematics at Rutgers University in New Brunswick. Robbins said that he worked on the problem for some time, and then passed it on to one of the century's most famous logicians, Dr. Alfred Tarski of Stanford University. Tarski, who is now dead, worked on the problem, included it in a book, and handed it out to graduate students and visitors.
Burris, for example, said that Tarski suggested the problem to him in the early 1970s, while he was visiting Stanford for a couple of months. Tarski, he said, "liked to throw out challenging problems to people passing through."
While mathematicians were batting around Robbins's problem, computer scientists were striving to see if they could get computers to reason. Among them was Wos, who started working on automated reasoning in the 1960s. It was a time when computers were primitive, clunky and slow, and researchers were divided on how to proceed. Some believed the key was to figure out how humans reasoned and then to create computer programs that mimicked the process. Wos disagreed.
"Nobody knows how humans reason," he said. "When you talk to mathematicians and say, 'I understand you proved a great theorem. How did you do it?' They'll say, 'Well, I walked around my house a lot and I read some papers and I thought.' "
So he and his colleagues followed a different path. "We didn't ask ourselves what people do when they think," Wos said. "That was irrelevant." Instead, he said, "We asked how can you tell a computer what this problem is about? How can you get it to draw conclusions that follow inevitably and logically from hypotheses and thereby prove theorems?"
He and his colleagues began writing programs in which the computer would assume that the hypothesis in question was false and would then examine the consequences. If it found a contradiction, that would be proof that the hypothesis was true. The computer would also assume that the hypothesis was true and do the same thing, looking for contradictions that would show it was false.
To prevent the computer from getting lost in checking out lengthy chains of extended consequences, the investigators added strategies like ignoring any logical statements that contained more than 100 symbols.
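The two ideas just described, proof by contradiction and a cap on statement length, can be sketched with a toy prover. This is an assumption-laden illustration, not the Argonne programs: it uses propositional resolution, a far simpler setting than the equational reasoning EQP performs, and it counts literals per clause where the article speaks of symbols. All names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, Set

Clause = FrozenSet[int]  # a literal is a nonzero int; -k is the negation of k


def resolvents(a: Clause, b: Clause) -> Iterable[Clause]:
    """Yield every clause obtained by resolving a and b on one literal."""
    for lit in a:
        if -lit in b:
            yield frozenset((a - {lit}) | (b - {-lit}))


def refutes(clauses: Iterable[Clause], max_literals: int = 100) -> bool:
    """Return True if the empty clause (a contradiction) can be derived."""
    known: Set[Clause] = set(clauses)
    while True:
        new: Set[Clause] = set()
        for a, b in combinations(known, 2):
            for r in resolvents(a, b):
                if not r:
                    return True             # contradiction found
                if len(r) <= max_literals:  # strategy: drop oversized statements
                    new.add(r)
        if new <= known:
            return False                    # nothing new and no contradiction
        known |= new


def proves(axioms: Iterable[Clause], negated_goal: Iterable[Clause],
           max_literals: int = 100) -> bool:
    """Assume the goal is false; a contradiction means the goal follows."""
    return refutes(list(axioms) + list(negated_goal), max_literals)
```

For example, with 1 standing for P and 2 for Q, `proves([frozenset({1}), frozenset({-1, 2})], [frozenset({-2})])` derives a contradiction from P, "P implies Q" and "not Q", so Q is proved. Setting `max_literals=0` throws away every intermediate step and the same call fails, which shows the trade-off such a limit makes: less wandering, but some proofs are missed.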
"I got into strategy because I was a poker player and a bridge player," Wos said. "A good con man uses strategy. He figures out what your weakness is and plays to that weakness. He doesn't just randomly try to trick you."
Wos's computer programs were soon able to find proofs for basic mathematical problems. "We could do the problems sometimes better than the students and sometimes, once in a great while, better than the professors could," Wos said.
For more than 16 years, Wos and his colleagues stuck to problems from mathematics textbooks. Wos explained that when the computer tried to prove something for which a proof existed, and failed, the investigators knew that "there's a problem with our program." If they tried to solve an unsolved problem, and failed, they had no way of knowing whether they had missed something obvious.
"My own mathematician friends would say, 'Wos, why are you doing what we already know? Why don't you give us something new?' " Wos said. In the early 1970s, he said, he told one of his badgering friends that he thought it would be another 30, 40 or even 50 years before computers could solve major problems that had stumped mathematicians.
The first time they tried something new was in 1978, when, Wos said, "a little baby problem" came along. They solved it, and then solved five others like it. Wos was ecstatic. "We had done more than I thought I could do in my lifetime," he said.
The group kept adding strategies to its programs. It added one recently that said to try things that worked in previous problems. Wos said some of his colleagues scoffed at that and that he himself did not know if it would work. But, he said, it turned out to be surprisingly useful.
In 1979, Wos learned about Robbins's problem but, although he and his colleagues tried to solve it from time to time with ever more refined computer programs, they failed. McCune joined the group in 1984, with a new Ph.D. from Northwestern University and a thirst to see how far he could push computers to go.
The fact that the Robbins conjecture was certifiably hard -- it had, after all, stumped some of the best minds in mathematics and had gone unsolved for decades -- appealed to McCune. But the problem also thwarted his best computer reasoning programs.
Finally, on Oct. 2, McCune gave the Robbins conjecture to a new automated reasoning program that he had written called EQP, for equational prover. Eight days later, on Oct. 10, the computer spewed out a proof. McCune, a low-key researcher, said he was "amazed." Wos, his exuberant supervisor, said, "Bill was in heaven."
Encouraged, McCune tried to get the computer to refine the proof. He started his program searching for a better proof on Nov. 15. It found one on Nov. 25.
McCune said he had checked his proof by computer and by hand. Dr. Mark Stickel of the Stanford Research Institute in Palo Alto, Calif., independently checked it with his computer program. And Burris independently checked it by hand.
Burris said the proof, freshly out of the computer, was in the form of statements so long that he had to print them sideways across the page, and he had to specify that the printer use small type. "It was pretty unreadable," he said. "The machine says, 'I got that step from two steps before,' but it doesn't fill in all the details." He spent several days rewriting it and now has it down to three normal-looking pages of mathematical statements.
The proof will be published within a few months in The Journal of Automated Reasoning, Wos said. He will also publish it on the Internet and will put out a call for more problems as certifiably hard as Robbins's problem.
When he saw that he had actually proved Robbins's conjecture, McCune called the 81-year-old mathematician at his office at Rutgers, amazing him with the news. In a telephone interview, Robbins said he was delighted. "Isn't that marvelous," he said. "I'm glad I lived long enough to see it."
But McCune certainly had a different experience than a mathematician might have had if he or she had created the proof with pure thought. So was something lost in the process? Did McCune have anything resembling a Eureka moment?
Well, not quite, McCune said. "I have a good feeling," he said. "In a sense, I have a feeling that the computer has been creative."
McCune, however, said he had little interest in speculating on the philosophical implications of his work. "I just work on the problems and try to solve them," he said.
Not so Wos, who loves to ponder the meaning and future of human reasoning. After all, he said, he insisted on calling his computer project "automated reasoning," rather than "automated theorem proving," as some suggested, explaining that "it's not just mathematics that I care about."
Wos predicts that in a few decades computers might be as agile at reasoning as they now are at calculating.
Dr. Daniel Dennett, a philosopher who directs Tufts University's Center for Cognitive Studies, said, "Philosophers have often dreamt of such a device -- a thought processor instead of a word processor."
"If you start inferring invalidly because you left out a premise, it will say, 'Wait a minute,' " he explained.
"I want to tell people the real truth of it," Wos said. "If we succeed really big," he said, giving people hand held thought processors like they now have hand held numerical calculators, "what an incredible improvement we can make to their lives."