Introduction

Demonstrating safety for the application of increasingly complex technologies is always a formidable task. In the event of an accident, society wants someone to blame. How would your records stand up to legal scrutiny, and how would you demonstrate that you have taken reasonable care as a professional engineer or manager, should you be faced with a court appearance?

The problem is that many system engineers do not have the appropriate training in the required safety approaches, tools and techniques, and their managers do not know when and how these may be applied and appropriately resourced. A significant complicating factor is the confusion created when trying to combine goal-based safety criteria (e.g. FAR25.1309 or JSP553) for acceptable levels of technical failure probability with risk-based safety criteria (e.g. Def Stan 00-56, MIL-STD-882) for acceptable levels of accident probability.

This book aims to provide a basic skill-set for both safety practitioners and their managers by defining stakeholder (civil, military and legal) expectations and showing how these expectations can be efficiently accomplished and managed throughout the product lifecycle (i.e. from concept design to product disposal).

Layout

Demonstrating the accomplishment of safety requirements is likely to be a formidable task. The problem is that many system engineers:

  • do not have the appropriate training in the required safety approaches, tools and techniques; and
  • have managers who do not know when and how these may be applied.

A revised relationship between management and safety is the most important avenue to explore.

It is this relationship between complexity and control that lies at the heart of the problem of safety management and which is of both pragmatic and academic importance. We need some way of measuring safety and an ability to ensure that we arrive at the necessary safety parameters.

It is implicit, therefore, that all reasonably foreseeable hazards have to be identified systematically (throughout the product life-cycle, not only during development) and the risk assessed before a judgement can be made upon their acceptability.

In order to do this we have to understand the issues that influence safety and the means by which they are identified and managed. Only then can we judge the acceptability of any threats associated with the initial and continued use of a particular product.

This book attempts to address many of these issues:

Firstly, in Chapter 1, we consider the legal issues associated with system safety. The purpose of this chapter is to reinforce the liabilities assumed in the generation of safety-related documentation.

In Chapter 2 we attempt to put the term “safety” into perspective and outline the basic approaches used to achieve it. The next three chapters then explore three of these approaches:

The use of Regulatory Standards is explored in Chapter 3.

Chapter 4 considers the Risk-Based approach, which is widely adopted in the Military industry as well as by Health & Safety specialists.

Chapter 5 introduces the civil aeronautical approach to safety assessments, which (for the want of a better term) we shall call the “Goal-based” approach (in contrast to the Risk-based approach in Chapter 4), as it provides clear goals (i.e. failure probability targets) for system designers to achieve.

Appendix A supports Chapters 4 and 5 by summarising potential tools and techniques that can be used for cause and consequence assessments.

In Chapter 6 we consider the issues surrounding the application of the term “Hazard” and how the causes of hazards can be identified.

Chapter 7 provides an introduction into the fail-safe concept, which is needed to ensure the high levels of functional integrity needed from essential systems.

The next two chapters consider the generic approach to two frequently requested deliverables:

Chapter 8 considers the System Safety Assessment (SSA), which is usually required for the certification of a new/modified system. In the civil arena, the SSA is often based on the Goal-based approach.

In contrast, the Safety Case is considered in Chapter 9. The Safety Case is the document that manages (via the Risk-based approach) the major hazards that an operator/maintainer of a system/facility faces, as well as the means employed to control those hazards.

Probability assessment (either qualitative or quantitative) is an essential part of any safety validation (whether risk- or goal-based). Chapter 10 provides some guidance in this regard and should be read with an understanding of Chapter 7.

In Chapter 11 we continue the probability estimation theme of Chapter 10 by applying it to the Minimum Equipment List, which allows operation of a system despite deficiencies and equipment failures.

Chapter 12 explores how, via the Safety Management System, organisations manage safety as an integral part of their business management activities.
Appendix A supports Chapter 6 by summarising the advantages and limitations of some of the models used for causal or consequence analyses.

Appendix B supports Chapters 4, 5, 8 and 9 by summarising useful safety criteria that can be used in safety assessments.

Appendix C provides a brief introduction to Goal Structured Notation, which is useful for defining Safety Arguments as referenced in Chapters 8 and 9.

Men are only clever at shifting blame from their own shoulders to those of others
- Titus Livius (59BC-AD12)

Introduction
Most industrial activities are regulated, and this includes military and civil aviation safety management. Ethical considerations and an increasingly litigious society regarding product liability have become driving factors in changing the way we conduct the initial safety certification (which leads to the release of a system) and manage the continuing safety of the system (including operations and maintenance).


To understand what we think, we need to hear what we are saying
- An old saying

Introduction

We often hear pundits pontificate the catch-phrase 'safety at all costs!' But what do we understand by the term 'safety'? Safety can be defined as 'freedom from unacceptable risk of harm' (ISO/IEC Guide 2: 1986, Definition 2.5). But how do we determine an acceptable risk of harm?

The nice thing about standards is that you have so many to choose from; further, if you do not like any of them, you can just wait for next year's model.
- Andrew Tanenbaum

Introduction
A regulatory approach is the most common method employed to enforce a required safety standard or, in the case of many military regulations, to enforce a process. Most industrial activities are regulated, and this includes military and civil aviation safety management. Within the context of safety, this chapter explores some of the most widely used regulations in the western aviation industry and how they influence our approach to safety.

Useful links not included in the text

  • ICAO
  • FAA
  • JAA
  • EASA
  • US DoD
  • UK MoD
  • UK Health and Safety

Be wary of the man who urges an action in which he himself incurs no risk.
- Joaquin Setanti
Introduction
Risk management is a concept applied as part of a decision-making process, and can also be explained as follows:

A process … (Risk-based decision making involves a series of basic steps. It can add value to almost any situation, especially when the possibility exists for serious or catastrophic outcomes. The steps can be used at different levels of detail and with varying degrees of formality, depending on the situation.)

... that organizes information about the possibility for one or more unwanted outcomes … (This information about the possibility for one or more unwanted outcomes separates risk-based decision making from more traditional decision making. These unwanted outcomes can be project-, market-, mission- and/or safety-related.)

... into a broad, orderly structure … (Most decisions require information not only about risk, but about other things as well. This additional information can include such things as cost, schedule requirements, and public perception. In risk-based decision making, all of the identifiable factors that affect a decision must be considered. The factors may have different levels of importance in the final decision.)

... that helps decision makers … (The only purpose of risk-based decision making is to provide enough information to help someone make a more informed decision. The information must therefore be compiled and presented in a consistent (e.g. the safety criteria applied) and user-friendly (e.g. a Hazard Log) fashion to ensure that “apples are not compared with pears”.)

... make more informed management choices. (The objective of risk-based decision making is to help people make better, more logical choices without complicating their work or taking away their authority).


In absence of clearly defined goals, we become strangely loyal to performing daily acts of trivia.

Introduction
An acceptable level of safety for aviation is normally defined in terms of an acceptable aircraft accident rate. There are two primary causes of aircraft accidents:
  • operational (such as pilot error, weather and operating procedures); and
  • technical (such as design errors, manufacturing errors, maintenance errors and part failures).
Historical accident rates indicate that technical cause factors account for 40 to 50 per cent of total accidents. When certifying a new (or modified) system, designers concentrate on the technical integrity of the system, which has been designed around an operational requirement. For a number of years, aeroplane systems were evaluated against specific requirements, the "single fault" criterion, or the “fail-safe design” concept (see Chapter 7).
As later-generation aeroplanes were developed, more safety-critical functions were required to be performed. This generally resulted in an increase in the complexity of the systems designed to perform these functions. The likely hazards to the aeroplane and its occupants that could arise in the event of loss of one or more functions (provided by a system, or from that system's malfunction) had to be considered, as did the potential interaction between systems performing different functions.
The application of the fail-safe concept thus had to be supplemented by some sort of safety target (i.e. goal) against which the integrity of the system architecture could be evaluated.

 

Imagine a world with no hypothetical situations...

Introduction

Safety is the freedom from accidents.  Accidents are caused by hazards.  But what exactly do we understand the term “hazard” to mean?

The term “hazard” goes by many (often confusing) definitions.

Note that the presence of a hazard does not make an accident inevitable. From the discussions in this chapter, it is proposed that an all-encompassing definition might thus rather be "A hazard is a prerequisite condition that can develop into an accident through a sequence of failures, events and actions in the process of meeting an objective".

There is a causal chain from causes to hazards to accidents. Rhys (2002, page 4) defines an accident as: "an unintended event or sequence of events which causes death, injury, environmental damage or material damage".   The accident is the undesired outcome, rather than the initiating event or any intermediate state or hazard.

"The best car safety device is a rear-view mirror with a cop in it" - Dudley Moore (1935 - 2002)

Introduction

There are many reasons why systems may fail.

The first line of defence against hazardous failure conditions is avoidance, in which design and management techniques should be applied to minimise the likelihood of faults arising from random or systematic causes (see Chapter 6).

The second line of defence is based on the provision of fault tolerance as a means of dynamic protection during system operation.  Possible approaches include:

  • Fault masking, where the system or component is designed to survive potential failures with full functionality,
  • Graceful degradation (sometimes referred to as fail-soft), where the system or component is designed so that in the event of a failure its operation will be maintained but with some loss of functionality,
  • Fail-safe, where in the event of a failure, the system or component automatically reverts to one of a small set of states known to be safe, and thereafter operates in a highly restricted mode.  This may involve complete loss of functionality, or reverting to back-up/redundant features.
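The fail-safe approach in particular can be sketched in code. The device, state names and "safe state" below are hypothetical illustrations, not taken from the text:

```python
# Hypothetical illustration of the fail-safe approach: on any detected
# fault the component reverts to a member of a small set of states known
# to be safe, and thereafter operates in a highly restricted mode.
class HeaterController:
    SAFE_STATE = "off"  # assumed safe state for this hypothetical device

    def __init__(self):
        self.state = "regulating"  # normal, full-functionality mode

    def on_fault(self, fault: str) -> str:
        # Fail-safe: regardless of the nature of the fault, revert to the
        # known-safe state (here, complete loss of heating functionality).
        self.state = self.SAFE_STATE
        return self.state
```

Graceful degradation would differ only in that `on_fault` would select a reduced-functionality mode rather than a single restrictive safe state.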

“Life can only be understood backwards, but it must be lived forwards” - Soren Kierkegaard (1813 - 1855)

Introduction

Lloyd and Tye (1995) recall that the airworthiness requirements (e.g. BCAR and FAR) of the mid 20th century “were devised to suit the circumstances.  Separate sets of requirements were stated for each type of system and they dealt with the engineering detail intended to secure sufficient reliability”.  Where the system was such that its failure could result in serious hazard, the degree of redundancy (i.e. multiplication of the primary systems or provision of emergency systems) was stipulated.  Compliance was generally shown by some form of FMEA.

For simple, self-contained systems this approach had its merits.

However, systems rapidly became more complex. Complex systems have a considerable number of interfaces and cross-/inter-connections between the electrical, avionic, hydraulic and mechanical systems.  In addition, there are essential interfaces with the pilot, maintenance personnel and the flight performance of the aircraft.  The aircraft designer is thus faced not only with the analysis of each individual system independently, but also needs to consider how these systems act in concert with other systems.

Airworthiness Authorities could thus not continue to issue detailed engineering requirements for each new application. Firstly, this would lead to a mountain of regulatory requirements and, secondly, this approach would inhibit innovation by leading designers into sub-optimum solutions.

It therefore became necessary to have some basic objective requirement (see Chapter 5 paragraph 2) related to an acceptable level of safety, which could be applied to the Safety Certification and Release To Service (RTS) of any system or function.

This new approach required that, for Safety Certification, the designers conduct a thorough assessment of potential failures and evaluate the degree of hazard inherent in the effect of failures. With complex critical systems and functions the designer has not only to consider the effect of single failures, but also the effects of possible multiple failures – particularly if some of these failures are passive (see Chapter 6).  The designers need to show that there is an inverse relationship (see Chapter 5) between the probability of occurrence and the degree of hazard inherent in its effect.

The designers also need to consider whether the design is such that it can lead unnecessarily to errors during manufacture, maintenance or operation by the crew. Furthermore, the designer needs to consider the environment that the systems would be exposed to, which could involve large variations in atmospheric temperature, pressure, acceleration (e.g. due to gusts), vibration, and other hostile events such as lightning strikes and icing.

The vehicle to report this demonstration, for the purposes of Safety Certification and Release To Service (RTS), became known as the System Safety Assessment (SSA).


“The superior man, when resting in safety, does not forget that danger may come.  When in a state of security he does not forget the possibility of ruin. When all is orderly, he does not forget that disorder may come” - Confucius (551-479BC)

Introduction

The development of the Safety Case as a European approach to safety management can be traced through a series of major accidents (which are explored in this chapter).  Until quite recently only the people directly involved would have been held to blame for an accident.  Now it is recognised that safety is everybody’s concern. Key lessons learned from these disasters included:

  • Engineering: visibility is needed of decisions/assumptions that affect safety.  However, it is also recognised that engineering alone cannot guarantee safety.
  • Operations: systems evolve, as does their operational application.  Procedures and maintenance do affect safety.  Frequent training can improve effectiveness.
  • Management: managers are responsible for developing a safety culture in their organisations by defining safety policies and allocating resources to their implementation.

An approach was thus called for to supplement the regulatory shortcomings, and this was termed the Safety Case.   The major push in the development of the Safety Case concept was the tri-partite (i.e. Government, Industry and Unions) Advisory Committee on Major Hazards (ACMH), which was formed after the Flixborough disaster.  The most important and far reaching of their recommendations was that owners of major hazardous sites/facilities should develop a living Safety Case to identify and control hazards so as to prevent accidents.

"Do not expect to arrive at certainty in every subject which you pursue.  There are a hundred things wherein we mortals must be content with probability, where our best light and reasoning will reach no farther". - Isaac Watts

Introduction

Amongst various requirements, the certification of an aircraft requires proof that any single failure, or reasonable sequence of failures, likely to lead to a catastrophe has a sufficiently low probability of occurrence.  This has led (refer Chapter 4) to the general principle that an inverse relationship should exist between the probability of loss of function(s) or malfunction(s) (leading to a serious Failure Condition) and the degree of hazard to the aeroplane and its occupants arising therefrom.

It should go without saying that a low probability of occurrence equates with a high level of safety.

“Captain Lavendar of the Hussars, a balloon observer, unfortunately allowed the spike of his full-dress helmet to impinge against the envelope of his balloon. There was a violent explosion and the balloon carried out a series of fantastic and uncontrollable manoeuvres, whilst rapidly emptying itself of gas. The pilot was thrown clear and escaped injury as he was lucky enough to land on his helmet.
Remarks: This pilot was flying in full-dress uniform because he was the Officer of the Day. In consequence it has been recommended that pilots will not fly during periods of duty as Officer of the Day. Captain Lavendar has subsequently requested an exchange posting to the Patroville Alps, a well known mule unit of the Basques.” - No2 Brief from Daedalian Foundation Newsletter (Dec 1917)

Introduction
A number of factors and inherent dangers exist that may influence the achievement of an acceptable level of system safety:
  • aircraft are very complex and highly integrated, with a multitude of critical systems involving interfaces between hardware, software and operators; these configurations and interfaces are not static and continue to evolve, introducing new situations and conditions;
  • aircraft, especially military aircraft, are required to operate in very demanding environments, and actual testing under realistic environmental conditions is not possible in all cases;
  • weight restrictions require aircraft designs to be optimised with minimum margins of safety;
  • redundancy is often considered an unaffordable luxury, especially for military aircraft types;
  • design restrictions often place limitations on safety measures;
  • during service life, the operational usage might change beyond that assumed in the original design and definition of the maintenance schedule;
  • despite testing, unexpected hazardous conditions (such as flutter and stores-separation problems) may occur;
  • cost-cutting measures (e.g. extended maintenance intervals, less training) may be imposed; and
  • other imperatives, such as mission accomplishment, available financial resources and schedule constraints, may at times conflict with the technical airworthiness rules and standards.
As a result, personnel associated with the design, manufacture, maintenance and material support of aeronautical products may be exposed to an evolving, ever-changing level of risk.
Until quite recently only the people directly involved would have been held to blame for an accident. Now it is recognised that safety is everybody’s concern. However, whilst individuals are responsible for their own actions, only managers have the authority and resources to correct the attitudes and organisational deficiencies which commonly cause accidents. An accident is an indication of a failure on the part of management. What is required is an ordered approach to manage safety throughout the system’s lifecycle. This ordered approach is facilitated by the Safety Management System (SMS). This chapter provides some guidance on the philosophy and approach to a Safety Management System.

Accidents are not due to lack of knowledge, but failure to use the knowledge we have
- Trevor Kletz

Introduction
Aircraft flight has been transformed from an adventurous activity enjoyed by a select few into a stable mass-market service industry which is largely taken for granted … until things go wrong. The industry is then dominated by public perception of risk and the social amplification thereof. Accidents resulting in hull loss often result in fatalities and are almost always given extensive coverage in the national, if not world-wide, press. The aircraft industry is set to become more complex, the skies more crowded, and the budgetary pressure will increase. A new impetus must be found in pro-safety activity if the high confidence of the public is to be maintained, let alone improved, through the impending doubling of traffic by 2020 and beyond. It will not be sufficient to increase the reliability of technical systems alone.

Consider the failure mode: “Loss of CFDS function”:

Civil Regulatory Authorities would typically require (see Chapter 5) the application of FAR25.1309 (in the USA) or CS25.1309 (in Europe) when evaluating system safety.

AMC 25.1309 provides the following failure severity categories:

  • No Safety Effect: Failure Conditions that may not have an effect on safety, operational capability or crew workload; at most a nuisance.
  • Minor: Slight reduction in safety margins; slight increase in crew workload; some inconvenience to occupants; may require operating limitations or emergency procedures.
  • Major: Significant reduction in safety margins or functional capabilities; significant increase in crew workload impairing crew efficiency; some discomfort to occupants; requires operating limitations or emergency procedures.
  • Hazardous: Large reduction in safety margins or functional capabilities; higher workload or physical distress; adverse effects upon occupants.
  • Catastrophic: All conditions which prevent continued safe flight and landing.

So, what is the severity of the failure condition “Loss of CFDS function“?

  • Well, some might argue that the “Loss of CFDS Function” does not affect airworthiness (i.e. the ability of the aircraft to continue safe flight and landing) and is thus a “MINOR” or “NO SAFETY EFFECT” failure
    condition.
  • Others, such as the flight crew, would argue that the CFDS is there to protect them against missiles and should be a “CATASTROPHIC” failure condition.  This assumes that the CFDS is actually effective against the threat
    • e.g. many RPGs are simple “dumb” missiles with no guidance system.
    • e.g. many SPS’s are tested against simulated threats, as there are few volunteers to actually test the real thing.
  • Others might argue that the failure condition causes “Large reduction in safety margins or functional capabilities” and should thus be HAZARDOUS.

The severity classification results in a safety objective for the system (see table below), so close agreement with the regulatory authority will be required in this process.

Severity                Allowable Probability
No Safety Effect        Frequent
Minor                   Reasonably Probable
Major                   Remote
Hazardous               Extremely Remote
Catastrophic            Extremely Improbable
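The severity-to-objective mapping above is a simple lookup, and can be sketched as follows (illustrative only; the qualitative terms are exactly those of the table above):

```python
# Qualitative safety objectives per failure-condition severity, as
# tabulated above (goal-based criteria in the style of AMC 25.1309).
SAFETY_OBJECTIVE = {
    "No Safety Effect": "Frequent",
    "Minor": "Reasonably Probable",
    "Major": "Remote",
    "Hazardous": "Extremely Remote",
    "Catastrophic": "Extremely Improbable",
}

def safety_objective(severity: str) -> str:
    """Return the allowable qualitative probability for a severity class."""
    return SAFETY_OBJECTIVE[severity]
```

The severity agreed with the regulatory authority therefore fixes the probability target the designer must meet, which is why the classification debate above matters.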

Consider the failure mode: “Loss of CFDS function”:

Military Regulatory Authorities would typically require (see Chapter 4) the application of DEF STAN 00-56 (in the UK) or MIL-STD-882 (in the USA) when evaluating system safety.

DEF-STAN 00-56 provides the following accident severity categories:

  • Negligible: At most a single minor injury or minor occupational illness.
  • Marginal: A single severe injury or occupational illness; and/or multiple minor injuries or minor occupational illnesses.
  • Critical: A single death; and/or multiple severe injuries or severe occupational illnesses.
  • Catastrophic: Multiple deaths.

Def Stan 00-56 Issue 2 (Part 1 Para 7.3.2.c) states that “some systems have a defensive role whereby inaction under hostile circumstances may constitute a hazard. Safety targets for such systems shall address the requirements to reduce, to a tolerable level, the risk resulting from inaction under hostile circumstances”.

So, “Loss of CFDS function” could result in the aircraft being shot down, which is obviously a “CATASTROPHIC” accident.

We now need to determine the probability of the accident occurring and classify it according to the following table:

Accident probability (qualitative), likely occurrence (during operational life, considering all instances of the system) and quantitative probability (per operating hour):

  • Frequent: likely to be continually experienced (< 1E-2)
  • Probable: likely to occur often (< 1E-4)
  • Occasional: likely to occur several times (< 1E-6)
  • Remote: likely to occur some time (< 1E-8)
  • Improbable: unlikely, but may exceptionally occur (< 1E-10)
  • Incredible: extremely unlikely that the event will occur at all, given the assumptions recorded about the domain of the system (< 1E-12)
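Classifying a computed per-operating-hour probability into these qualitative bands can be mechanised as follows. This is an illustrative reading of the table, assuming each quantitative figure is the upper bound of its band:

```python
# Illustrative reading of the table above: each quantitative figure is
# taken as the upper bound (per operating hour) of its qualitative band.
BANDS = [
    (1e-12, "Incredible"),
    (1e-10, "Improbable"),
    (1e-8,  "Remote"),
    (1e-6,  "Occasional"),
    (1e-4,  "Probable"),
    (1e-2,  "Frequent"),
]

def qualitative_band(p_per_hour: float) -> str:
    """Classify a per-operating-hour accident probability."""
    for upper_bound, name in BANDS:
        if p_per_hour < upper_bound:
            return name
    return "Frequent"  # at or above 1E-2: treated as Frequent here
```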

The accident probability is obtained by considering all the events in the accident sequence. A simple accident sequence is illustrated below, where the probability of the accident is dependent on the probability of a projectile firing and the probability
of CFDS failure

i.e. P(projectile) x P(CFDS) = P(accident)

The designer might be able to predict P(CFDS), but has no control over P(projectile). The operator needs to provide P(projectile) (and Def Stan 00-56 Issue 2 (Part 1 Para 7.3.2.c) states that the threat condition can be assumed to be 1).
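Under those assumptions the simple accident sequence reduces to a one-line product. The CFDS failure probability below is purely illustrative, not a figure from the text:

```python
# The simple accident sequence reduces to a product of probabilities,
# assuming the two events are independent. Per the Def Stan guidance
# quoted above, the threat probability is conservatively taken as 1.
p_projectile = 1.0     # operator-supplied threat probability (assumed 1)
p_cfds_failure = 1e-4  # illustrative designer prediction only

p_accident = p_projectile * p_cfds_failure  # 1E-4 with these figures
```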

Once Paccident is calculated, the Risk can be determined via the following typical matrix:

             Catastrophic   Critical   Marginal   Negligible
Frequent          A            A          A           B
Probable          A            A          B           C
Occasional        A            B          C           C
Remote            B            C          C           D
Improbable        C            C          D           D
Incredible        C            D          D           D
  • Class A: These risks are deemed intolerable and shall be removed by the use of safety features.
  • Class B: These risks are considered undesirable, and shall only be accepted when risk reduction is impracticable.
  • Class C: These risks are deemed tolerable with the endorsement of the Project Safety Review Committee. It may be necessary to show that the risk is ALARP (see para 4).
  • Class D: These risks are accepted as tolerable with the endorsement of normal project reviews. No further action is needed.
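The matrix lookup itself is trivial to mechanise. A minimal sketch of the typical matrix above (entries copied directly from it):

```python
# A mechanised sketch of the typical risk classification matrix above:
# qualitative accident probability (rows) x severity (columns) -> class A-D.
RISK_MATRIX = {
    "Frequent":   {"Catastrophic": "A", "Critical": "A", "Marginal": "A", "Negligible": "B"},
    "Probable":   {"Catastrophic": "A", "Critical": "A", "Marginal": "B", "Negligible": "C"},
    "Occasional": {"Catastrophic": "A", "Critical": "B", "Marginal": "C", "Negligible": "C"},
    "Remote":     {"Catastrophic": "B", "Critical": "C", "Marginal": "C", "Negligible": "D"},
    "Improbable": {"Catastrophic": "C", "Critical": "C", "Marginal": "D", "Negligible": "D"},
    "Incredible": {"Catastrophic": "C", "Critical": "D", "Marginal": "D", "Negligible": "D"},
}

def risk_class(probability: str, severity: str) -> str:
    """Look up the risk class for an accident probability/severity pair."""
    return RISK_MATRIX[probability][severity]
```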

Note, however, that the accident sequence above is far too simplistic as it assumes that a functioning CFDS will prevent all Projectiles from causing an accident. See the following development of this accident sequence, which might result in a higher
priority required for another type of protective system (e.g. Explosion Suppressant Foam in the fuel tanks) to keep the risk of this type of accident tolerable:

Consider the failure mode: “SPS functions when not required”.

This failure mode could present the following hazardous conditions:

  • During formation flying (and/or air-to-air refueling) unexpected CFDS dispensing by the lead aircraft might present all sorts of hazards to other aircraft.
  • The aircraft might be in an emergency condition in a threat environment, which may require dumping of fuel. The pilots are faced with two choices:
    1. Retain the self protection system, but risk possible fuel ignition if dispensing occurs
    2. Disable the self-protection system in a high threat environment
  • Uncommanded (or uncontrolled) functioning of a DIRCM (Directional Infra Red Countermeasure) system might result in severe retina (eye) damage to a third party (e.g. other pilots, ground crew, civilian personnel).

The right tool applied to the right situation can contribute to the efficiency of an investigation or an analysis.  Tools can help to facilitate teamwork, enable a systematic and transparent approach, communicate findings, and manage complex investigations.   However, the benefits of using tools are far from automatic.

The table in this Annex has been compiled from a variety of sources (ranging from textbooks, publications, and the internet, to personal experience of friends, colleagues and acquaintances).

Each of these tools has its own advantages and disadvantages, and the extent to which each can be used during various phases of the product lifecycle, and the degree to which it can be applied to Safety Assessments, vary.  A toolkit (rather than “one-size-fits-all”) approach is advocated, as each tool has a distinctive function and range of application. Listed in alphabetical order, the tools/techniques most frequently used by the author have been shaded.

It is extremely important to note that as the complexity of the tool increases, so does the degree of training required for the user and/or the need for an experienced evaluation team to conduct the evaluation.  On the plus side, the data derived from the more complex methodologies may be more supportable.  Unfortunately, the primary disadvantage of such tools is that "trained subject matter experts" may have limited experience in the actual operational environment and, therefore, their evaluations may not be entirely applicable to the certification.

This table is intended to be thought provoking but has all the limitations of generic data.  In no circumstances should it be considered complete, applicable to all systems or wholly objective.  Many entries have no advantages/limitations listed, and space is provided for the reader to add data if desired.

Table in Annex A