Introduction to Reliability Analysis with Mathematica[Graphics:Images/index_gr_1.gif]

Dave Collins
dcollins@outbacksoftware.com

Copyright Outback Software, Ltd., 2002. This notebook may be freely copied and distributed, provided it is kept in its original form, including this notice. For other rights, contact the author. Mathematica is a registered trademark of Wolfram Research, Inc.

Acknowledgements
  
This notebook started as a project for a Mathematica course taught by
Ken Levasseur at the University of Massachusetts. Thanks also to Stan Wagon for many hints and tips given at Rocky Mountain Mathematica.

Table of Contents

Introduction

    Initialization

    Mathematical models

    Relibility modeling

Empirical models of reliability

    Non-Parametric estimation of f(t), F(t), R(t) and h(t)

    Example: The exponential distribution

    Maximum likelihood parameter estimation

    Example: The Weibull distribution

    Testing goodness of fit

Markov models

    Reliability modeling examples

Simulation or Monte Carlo methods

    Generating values of random variables

    Reliability modeling example

Summary
Bibliography

Introduction

Reliability analysis is an engineering discipline that applies various mathematical techniques to the measurement and prediction of the reliability of components and systems. The components under study may be mechanical, electronic, software, or other types. Systems could include anything from computers to rail transit. Measurements include failure rates, cumulative failures, and component lifetimes (time until failure). A variety of techniques are employed, drawn mainly from probability, statistics, and the theory of stochastic processes.

Mathematica is a software application for numeric and symbolic computation, developed by
Wolfram Research. This being an introduction, the assumption throughout is that the reader knows something about Mathematica, but is not an expert, either in Mathematica or reliability analysis.

The mathematics of reliability analysis, even in terms of specific models, has broad applicability. A reliability model with failures and repairs to failed units may have the same form as a population model in which "failures" become deaths, and "repairs" become births. A probability distribution of survival times may be used in reliability analysis to model component lifetimes, in medical research to model patient survival in a treatment group, and by actuaries to compute insurance premiums.

In this paper a context is established by discussing the general idea of using mathematics to model real-world processes, and outlining the specific methods used by reliability engineers. Following the introduction, several techniques of reliability analysis are discussed in more detail, with particular attention to the suitability of Mathematica as a tool for reliability analysis. Examples have been chosen that use mathematical models with broad applicability, and that provide good illustrations of the use of Mathematica.   

Initialization

This section consolidates data input and definition of common functions used in this notebook. Most of the cells in this section are Mathematica "initialization cells," meaning they will be automatically evaluated whenever any computational cell in the notebook is evaluated. If you are presented with a dialogue box asking if you want to perform this automatic initialization, always answer "Yes."

The following cell initializes data sets used in subsequent sections. observedPopulation is population data (in millions) for the United States from 1770-2000; ballBearingFailures is time to failure (measured in millions of revolutions) for a set of tested ball bearings; leukemiaRemission is time to spontaneous remission of leukemia in a group of control subjects for a drug trial.   

[Graphics:Images/index_gr_2.gif]

More typically, data sets like the above would be read in from external files. For example, suppose there is a text file, USpop.txt, wherein each line is a data pair like "1790    3.9", representing the U. S. population (in millions) for a given year. Then the variable observedPopulation could be created as follows:
  
f=OpenRead["USpop.txt"];
observedPopulation=ReadList["USpop.txt",{Number,Number}];
Close[f];

  
This code assumes that USpop.txt is in the default Mathematica folder, as given by Directory[ ]. Since the location of the default directory may vary depending on operating system type and installation options, this is error-prone. Thus the data is imbedded directly in the notebook, for portability. (This would be impractical, of course, for large data sets.)

Data can also be imported in popular spreadsheet formats using built-in Mathematica capabilities. Using MathLink, Mathematica's method of linking to programs written in other language such as C or Java, data could also be read in from external databases such as DB2 or Oracle.
  
Load needed packages:

[Graphics:Images/index_gr_3.gif]

Turn off unnecessary warning messages:

[Graphics:Images/index_gr_4.gif]

stepPlot[ ] plots a list of points as a step function (like Histogram[ ], but only showing the outline of the top of the bars). It is useful for plotting things like discrete probability distributions.

[Graphics:Images/index_gr_5.gif]

It takes one option, PlotWith->{[Graphics:Images/index_gr_6.gif][Graphics:Images/index_gr_7.gif], . . .}, where the fs are other functions to be plotted. The other functions are plotted in a different color. This is useful in comparing the step plot to a fitted function, for example:

[Graphics:Images/index_gr_8.gif]

[Graphics:Images/index_gr_9.gif]

(23/(1+ 50Exp[-0.3(#)])&  is the Mathematica "pure function" corresponding to [Graphics:Images/index_gr_10.gif].)

Mathematical models

A mathematical model of a process in the real world is a set of equations with independent variables taking values based on quantitative aspects of the process state at some time, say  [Graphics:Images/index_gr_11.gif], and dependent variables yielding quantitative information regarding the process state at some later time t:

    S(t) = ([Graphics:Images/index_gr_12.gif](t), [Graphics:Images/index_gr_13.gif](t), . . ., [Graphics:Images/index_gr_14.gif](t)) =
        f (S([Graphics:Images/index_gr_15.gif]), [Graphics:Images/index_gr_16.gif], t) = ([Graphics:Images/index_gr_17.gif]([Graphics:Images/index_gr_18.gif]([Graphics:Images/index_gr_19.gif]), [Graphics:Images/index_gr_20.gif]([Graphics:Images/index_gr_21.gif]), . . ., [Graphics:Images/index_gr_22.gif]([Graphics:Images/index_gr_23.gif])), . . ., [Graphics:Images/index_gr_24.gif]([Graphics:Images/index_gr_25.gif]([Graphics:Images/index_gr_26.gif]), [Graphics:Images/index_gr_27.gif]([Graphics:Images/index_gr_28.gif]), . . ., [Graphics:Images/index_gr_29.gif]([Graphics:Images/index_gr_30.gif])), [Graphics:Images/index_gr_31.gif], t)  

More generally, particularly if the modeling is done on a computer, the function f  may be implemented as an algorithmic procedure that is not necessarily expressible in closed form. A mathematical model may be descriptive (describing or explaining observed data), predictive (predicting data to be observed in the future), or both.

As an example, suppose we are interested in explaining or predicting population growth. One model of growth is that, in an environment with no constraints, future population may be predicted based on a knowledge of the initial population size, and the net growth rate (birth rate minus death rate). It is well known that this model implies exponential growth, as is shown in the function ExponentialGrowthModel below. P0 is the initial population size, and g is the growth rate. In reality, there are always limits to growth, and the logistic model takes this into account, adding a parameter C, the "carrying capacity" of a given environment, which is the maximum size that a population can achieve in that environment before resource limitations reduce the growth rate to zero. The result is a model that first grows exponentially, but then levels off and approaches the carrying capacity asymptotically. This is shown in the function LogisticGrowthModel below.

These two functions are used below to explain population growth in the United States over a two-hundred year period. The actual population figures, at ten-year intervals from census data, are in the ObservedPopulation list (in millions, plotted in black). The parameters used in the two models were estimated visually to produce a reasonable fit (more rigorous techniques of estimating and validating model parameters are described below in Empirical models of reliability). The plot shows that LogisticGrowthModel (plotted in blue) explains the observed data better than ExponentialGrowthModel (plotted in red). (More rigorous methods of measuring "better" are also described below.)

These models are described in
[23] and [8]. The 1790-1960 census data is from [9], the 1970-1990 data from [8], and the 2000 data from [24].

[Graphics:Images/index_gr_32.gif]

[Graphics:Images/index_gr_33.gif]

There are obvious deficiencies in this exercise. Both models assume a constant growth rate, whereas the actual growth rate over the period is known to have varied. The logistic model, which is a good fit, assumes a constant carrying capacity, and suggestes that the limiting population of the U. S. is about 350 million. In models of animal populations in a fixed territory, availability of food is the main limiting factor, and clearly this has changed over the history of the United States. Possibly there are other factors, such as crowding, that will indeed limit the population as the model suggests, but this remains to be shown. (Failure to recognize the elasticity of human resource limitations was the fallacy of the economist Thomas Malthus, who predicted the "crash" of human population at a level far below what has actually been achieved.)

The choice of a mathematical model involves a tension between several factors:

The model must capture essential aspects of the process under study.

There must be accurate, repeatable measurement procedures for the input and output variables, so that the model can be effectively compared to the real process.

The model must be tractable, both mathematically and economically--i.e., the cost of modelling to a given level of precision must be justified by the value of the result.

The standard advice for building a tractable model is summarized by Friedman ([6], p. 27): "Faced with a complicated problem, assume away any feature that is not essential to what you are trying to understand. When you are finished, you are left with the simplest problem whose solution will tell you what you want to know." In a certain sense this advice is circular--lacking a complete understanding of a problem, we generally can't determine what is essential and what is not. In practice, modeling proceeds iteratively through informed guesswork to trial models which are tested against empirical reality and used to produce more refined models. In spite of the apparently haphazard nature of this process, in the hands of creative scientists it has produced remarkable results. The physicist Eugene Wigner famously remarked on "the unreasonable effectiveness of mathematics" [25] for modeling processes in the real world, and even in disciplines less precise than physics, we often can assume away a surprising amount of detail and still obtain useful answers.

In the remainder of this paper, the mapping of real-world situations to mathematical models is dealt with only in a cursory way. The primary focus is on the purely mathematical aspect of the models, and how Mathematica can be used to solve them.

Relibility modeling

Reliability analysis is a apecific example of mathematical modeling, which mathematically has much in common with other disciplines. The exponential model above, for example, shows up in modified form (with a negative exponent) in reliability analysis:  f(t) = λ[Graphics:Images/index_gr_34.gif] , where λ is an empirically determined parameter, gives the probability density of failure at time 0 < t < ∞  for a certain class of components (those whose reliability does not change with age).

Reliability analysis starts with entities that can fail. These entities may or may not be repairable. The smallest entity under consideration is a component. Use of this term usually implies we are analyzing a system composed of components. Where a model deals with an entity as a whole, without regard for whether it is a system or component, it is called a unit or item. The reliability of an item is the probability that it will perform its specified function for a specified period of time under specified environmental conditions. For example, a light bulb may be specified to produce a certain output in lumens, and is considered failed if the output is reduced by more than 10%; the manufacturer indicates a 0.99 probability that it will operate for 1,000 hours, under specified conditions of ambient temperature and vibration. The remainder of this paper assumes that the various conditions are given, and emphasizes the mathematical properties of the reliability function (described below).

Most of reliability theory is based on the idea that quantities of interest are continuous random variables taking time values, with PDF, or probability density functions, p(t) satisfying p(t) >= 0, and [Graphics:Images/index_gr_35.gif] = 1. It is conventional to use uppercase letters for the general random variables (for example, T), and the corresponding lowercase letters for specific values of the random variable (for example, t). To be really precise, we consider a set E of events, with T being a measurable function T: E->÷µ, so that T(e ∈ E) = t;  in applied work, however, we just think of T as being somehow representative of a repeatable process, and t being the value of a particular observation of that process. The domain of t is normally (0, ∞ ). The CDF (cumulative distribution function) corresponding to p(t) is P(t) = Probability(T <= t) = [Graphics:Images/index_gr_36.gif]. Either the PDF or CDF characterizes the random variable's probability distribution. In applications, the CDF is actually more fundamental than the PDF, since P(T <= t) is observable, whereas "probability density at a point" is not, but rather is derived as [Graphics:Images/index_gr_37.gif]P(t).

The failure density function, f(t), is a PDF giving the density of the probability of failure of a unit. The corresponding CDF, F(t), gives the probability that failure occurs within a certain "lifetime": Probability(unit lifetime <= t) = F(t) =  [Graphics:Images/index_gr_38.gif]. For repairable units, "lifetime" is replaced by "time to failure". The mean lifetime, or mean time to failure (MTTF), or mean time between failures (MTBF) for repairable units, is given by  [Graphics:Images/index_gr_39.gif].  The negative exponential density above is one example of many that might be used as a failure density function. In principle, any probability distribution that is zero outside the interval (0, ∞ ) might be used for modeling failures. In practice, even distributions with a finite probability of negative values, such as the normal, are sometimes used, as long as the parameters of the distribution are such that the probability of  a negative value is sufficiently small.

More frequently used than the failure CDF is the reliability or survivor function:
    R(t) = (probability that the unit lifetime >= t) = 1 - F(t).
Because the total probability is 1,
    R(t) = 1 - F(t) =  1 - [Graphics:Images/index_gr_40.gif]f(t)dt =  [Graphics:Images/index_gr_41.gif]f(t)dt.

Taking as an example the negative exponential distribution with parameter 0.5, this compares f(t) (in blue), F(t) (in green), and R(t) (in red):

[Graphics:Images/index_gr_42.gif]

[Graphics:Images/index_gr_43.gif]


A more complex notion is the hazard function h(t), which is the failure rate, conditional on the unit having survived until a certain point in time. This is also called the "instantaneous age-specific failure rate" or simply the "failure rate".

This conditional rate is based on the general concept of conditional probability: the (unconditional) probability that events A and B both occur is the product of the probability that A occurs, by the probability that B occurs, given that A has occurred (the conditional probability of B given A). In symbols:
    P(A⋂B) = P(A) · P(B|A).
Thus the conditional probability is expressed as

    P(B|A) = [Graphics:Images/index_gr_44.gif].                    (1)

To derive the hazard function, consider the probability that failure occurs between times t and t + Δt:

    P(t <= T <= t + Δt) = [Graphics:Images/index_gr_45.gif]f(τ)dτ = F( t + Δt) - F(t) = (1 - F(t)) - (1 - F( t + Δt)) = R( t + Δt) - R(t).
    
Using (1), conditioning the probability on the fact that failure has not occurred before t, i.e., that the unit is still a survivor at time t,

     P(t <= T <= t + Δt | T > t) = [Graphics:Images/index_gr_46.gif]= [Graphics:Images/index_gr_47.gif]


Dividing by Δt gives the average rate of failure over the interval [t, Δt]. Finally, the hazard rate (instantaneous rate of failure) is defined as

            h(t) =  [Graphics:Images/index_gr_48.gif] [Graphics:Images/index_gr_49.gif]
              
           = [Graphics:Images/index_gr_50.gif]
           
           = [Graphics:Images/index_gr_51.gif] ,  since R'(t) = [Graphics:Images/index_gr_52.gif](1 - F(t)) = - f(t)            (2)

Note that the hazard function is not a PDF, since [Graphics:Images/index_gr_53.gif]~= 1; it can be shown that R(t)->0 fast enough as t->∞ so that [Graphics:Images/index_gr_54.gif]is unbounded

Empirical models of reliability

An empirical model is a distribution characterization, e.g., a failure CDF, based in whole or in part on measurement data. The distribution in turn is used to make predictions about units similar to the one(s) measured, or about the measured unit(s) outside the range of measurement.

It is possible to construct a "pure empirical" model, i.e., to construct a distribution completely from smoothed data. This is rarely done, since a known distribution provides more information, and is more convenient to work with. Typically, a known distribution type is chosen based on some combination of theoretical considerations, formal analysis of the measurement data, and visual inspection of the data. Having chosen a distribution type, its parameters are estimated based on the data, and a goodness of fit test is used to assess how closely it models the measurement data. This procedure may be iterated with different distibution types until one is found that offers a "sufficiently good" fit.  

Empirical models may be used alone to model simple situations. For example, suppose a manufacturer of ball bearings wishes to define a warranty period that will minimize warranty replacement costs. This might be taken to mean a period, starting when the bearing is new, during which less than 1% of the bearings will fail. Given an empirical model, say a failure CDF derived from testing a sample of bearings, the reliability function R(t) can be determined. The maximum warranty period is then the maximum value of t such that R(t) >= 0.99.

Empirical models may also be used as components of other models for more complex situations. For example, having derived empirical models for the failure of individual components comprising a system, a failure rate for the system may be derived--in some cases by a closed-form computation, but more often using a Monte Carlo simulation.

Non-Parametric estimation of f(t), F(t), R(t) and h(t)

A common starting point is the development of non-parametric estimates of the important reliability functions, i.e., estimates that are free of any assumption about the true distribution of the data. This is obviously useful in constructing a pure empirical model, but also helps to discriminate between different theoretical distributions. For example, the exponential distribution has h(t) = constant, and is the only distribution with this property.

Recall that F(t) is the probability that failure occurs in time <= t. All the CDFs commonly used in reliability theory are continuous, and strictly increasing in the interval [0,∞). Thus F(t) is invertible, and for 0 <= ρ <= 1, [Graphics:Images/index_gr_55.gif](ρ) = τ means that τ is the time such that P(t <= τ) = ρ.

Now, take a partition (0, [Graphics:Images/index_gr_56.gif], . . ., [Graphics:Images/index_gr_57.gif], . . ., [Graphics:Images/index_gr_58.gif], ∞) of the range of F such that each subinterval [Graphics:Images/index_gr_59.gif]has the same probability, i.e., F(t +  [Graphics:Images/index_gr_60.gif]) - F(t) is the same for all [Graphics:Images/index_gr_61.gif]. Then F maps this partition to a partition of 0 <= p <= 1 whose subintervals are equal. Conversely, [Graphics:Images/index_gr_62.gif] maps any partition of  0 <= p <= 1 with equal subintervals to an equiprobable partition of the range.

The mapping described above is the basis for constructing an empirical CDF from observed data points. Suppose we test a sample of n units until all have failed, giving n observed failure times 0 <= [Graphics:Images/index_gr_63.gif]<= . . .<= [Graphics:Images/index_gr_64.gif]<= . . .<= [Graphics:Images/index_gr_65.gif]. This can be taken as a mapping from the order statistics 1, . . ., i, . . ., n to t. Assuming that each [Graphics:Images/index_gr_66.gif] is drawn from an equally-likely subinterval of t, this induces a partition of p which we can take to be a normalization of the order statistics, 0 <= 1/n <= . . . i/n <= 1. This is shown in the figure below:

[Graphics:Images/index_gr_67.gif]


                        Figure 1.

The blue line is the empirical CDF, which has a step at each [Graphics:Images/index_gr_68.gif]; the step is from (i-1)/n to i/n. The red line is a continuous CDF fitted to the empirical CDF; ways of doing the fitting are described below.

The partition 0 <= 1/n <= . . . i/n <= 1 is biased, since it assumes that P(t < [Graphics:Images/index_gr_69.gif]) = 0, and that [Graphics:Images/index_gr_70.gif] is the maximum value for t. There are various methods of compensating for this bias; the preferred method for asymmetrical distributions (where the mean and median are not equal) is Benard's formula:  

    [Graphics:Images/index_gr_71.gif]  [Graphics:Images/index_gr_72.gif](t <= [Graphics:Images/index_gr_73.gif]) = [Graphics:Images/index_gr_74.gif]([Graphics:Images/index_gr_75.gif]) = [Graphics:Images/index_gr_76.gif].
    
This is the best approximation for distributions commonly encountered in reliability work. (For a comparison of Benard with other methods in the Weibull case, see [1].)

As an example, take the data set leukemiaRemission, which contains times (in months) to spontaneous remission of leukemia in a control group for a drug study (from [13]). Spontaneous remission of leukemia is apparently common; "failure" here is thus a failure of the carcinoma to continue its uncontrolled growth. The form of the dataset is {. . ., {i, [Graphics:Images/index_gr_77.gif]}, . . .}.

Before operating on a set of observation data, taking the data as a discrete function of t, it must be single-valued. The leukemiaRemission dataset is not--some of the remission times occur more than once:

[Graphics:Images/index_gr_78.gif]
[Graphics:Images/index_gr_79.gif]

This means that in the kind of step function shown in Figure 1, there is a step greater than one at certain values--e.g.,the step from t = 5 to t = 8 is from 9 to 13. The data is adjusted for this by aggregating the function pairs {10,8}, {11,8}, {12,8}, and {13,8}, leaving only {13,8}.  In general,the aggregation leaves only the observation with the largest i, for a given t. It also swaps the parameters so each pair is {t, i}:

[Graphics:Images/index_gr_80.gif]

So the dataset,when aggregated, represents a single-valued function from t to the order statistics 1, 2, . . . n:

[Graphics:Images/index_gr_81.gif]
[Graphics:Images/index_gr_82.gif]

The aggregated data is used in the following Mathematica function for [Graphics:Images/index_gr_83.gif](t), the empirical CDF, which takes as argument the list of failure time observations. (Note that n is still the number of unaggregated observations.)

[Graphics:Images/index_gr_84.gif]

The empirical CDF table is always returned. If Plot->True, it also plots the CDF table; PlotWith->{[Graphics:Images/index_gr_85.gif][Graphics:Images/index_gr_86.gif], . . .}, gives a list of other functions to be plotted, for example if a fitted function is to be compared with the empirical CDF (see Figure 3).

Having constructed the empirical CDF, it can be used directly to generate sample failure times; this is illustrated below in the section below on  Simulation or Monte Carlo Methods.  Here we look at deriving other functions from the empirical data--particularly the empirical hazard function,  [Graphics:Images/index_gr_87.gif](t), which is helpful in selecting a distribution type for fitting to the empirical data.

To obtain [Graphics:Images/index_gr_88.gif](t), we start with [Graphics:Images/index_gr_89.gif](t), the estimated PDF:

     [Graphics:Images/index_gr_90.gif]([Graphics:Images/index_gr_91.gif]) = [Graphics:Images/index_gr_92.gif][Graphics:Images/index_gr_93.gif] [Graphics:Images/index_gr_94.gif] = [Graphics:Images/index_gr_95.gif] = [Graphics:Images/index_gr_96.gif] = [Graphics:Images/index_gr_97.gif]

The reliability function is [Graphics:Images/index_gr_98.gif]([Graphics:Images/index_gr_99.gif]) = 1 - [Graphics:Images/index_gr_100.gif]([Graphics:Images/index_gr_101.gif]), so from (2),

    [Graphics:Images/index_gr_102.gif]([Graphics:Images/index_gr_103.gif]) = [Graphics:Images/index_gr_104.gif] = [Graphics:Images/index_gr_105.gif] = [Graphics:Images/index_gr_106.gif] = [Graphics:Images/index_gr_107.gif]
    
This last formula can be applied directly to the observation data to derive an empirical [Graphics:Images/index_gr_108.gif](t):

[Graphics:Images/index_gr_109.gif]

Applying this to leukemiaRemission:

[Graphics:Images/index_gr_110.gif]

[Graphics:Images/index_gr_111.gif]

[Graphics:Images/index_gr_112.gif]

                Figure 2.

The next step is to use this information to determine a distribution for fitting the data.

Example: The exponential distribution

If the hazard rate is found to be constant, h(t)  = [Graphics:Images/index_gr_113.gif]= λ , then R'(t) = - λR(t) , thus R(t) = [Graphics:Images/index_gr_114.gif]. Given the constraint R(0) = 1, this solution is unique, so the failure distribution is exponential. In Mathematica terms:

[Graphics:Images/index_gr_115.gif]
[Graphics:Images/index_gr_116.gif]

The theoretical basis for the exponential distribution is that the unit has no "memory" of its prior life; the conditional reliability function, given that a unit has survived until time τ, is the same as the reliability function starting at time 0:

    P(T > τ + t | T  > τ) = [Graphics:Images/index_gr_117.gif]= [Graphics:Images/index_gr_118.gif]=  [Graphics:Images/index_gr_119.gif] = R(t) = P(T > t).
    
Empirically, this applies to units such as electronic components that do not wear out. Does it also apply to the leukemia situation? The empirical hazard rate in Figure 2 does not make this clear. Given that it does not seem to be definitely increasing or decreasing, we can look at an average of the first half of the hazard table:

[Graphics:Images/index_gr_120.gif]
[Graphics:Images/index_gr_121.gif]

And the last half, with the last observation dropped because it seems like an outlier:

[Graphics:Images/index_gr_122.gif]
[Graphics:Images/index_gr_123.gif]

These are not very close, but perhaps (in practical work) close enough to warrant trying a fit to an exponential distribution. First, since [Graphics:Images/index_gr_124.gif]([Graphics:Images/index_gr_125.gif]) = 1 - [Graphics:Images/index_gr_126.gif]([Graphics:Images/index_gr_127.gif]),

[Graphics:Images/index_gr_128.gif]

Given R(t) = [Graphics:Images/index_gr_129.gif], log(R(t)) = - λt is a straight line. Mathematica's linear regression can be used to find the best fit to the linearized data (IncludeConstant->False is used because we know that there is no constant term):

[Graphics:Images/index_gr_130.gif]
[Graphics:Images/index_gr_131.gif]

The [Graphics:Images/index_gr_132.gif]value of .96 indicates a good fit to the data, which can be verified visually:

[Graphics:Images/index_gr_133.gif]

[Graphics:Images/index_gr_134.gif]

                    Figure 3.

This can now be used as a model for prediction. For example, suppose we are running a trial on a drug that is claimed to induce remission of leukemia. How long should the trial run? Given that spontaneous remissions occur, we may not want the trial to be so long that most patients undergo remission regardless of the drug regimen. Suppose we want a length such that no more than 50% of patients will undergo spontaneous remission. This can be determined by solving R(t) = 0.5. Using the fitted model yields a value of 5.83, so it would be appropriate to run the trial for six months.

[Graphics:Images/index_gr_135.gif]
[Graphics:Images/index_gr_136.gif]
Moment ratios

Given a continuous distribution with PDF f, or a discrete distribution with probabilities p, the rth moment about the mean μ is [Graphics:Images/index_gr_137.gif] or  [Graphics:Images/index_gr_138.gif]. The first moment is the mean, the second is the variance, the third is the skewness (a measure of the asymmetry of the distribution), and the fourth is the kurtosis (a measure of how sharply the distribution peaks). The entire set of moments completely characterizes the distribution, and ratios among them (particularly the first four) can be used to test an empirical distribution against a theoretical model.

For example, the coefficient of variation is the standard deviation σ (square root of the variance or second moment) divided by the mean μ; for an exponential distribution this ratio is always 1. Calculating the ratio using the discrete leukemiaRemission data gives

[Graphics:Images/index_gr_139.gif]
[Graphics:Images/index_gr_140.gif]

which suggests that an exponential distribution may not be the right model for this data. The coefficient of skewness, defined as the mean of [Graphics:Images/index_gr_141.gif], is always 2 for an exponential distribution, but is less than 1 for the  leukemiaRemission data:  

[Graphics:Images/index_gr_142.gif]
[Graphics:Images/index_gr_143.gif]

which again suggests that an exponential distribution may not be the right model for this data, even though the fit is visually good. A better decision could be made by obtaining more data, or by fitting to other types of distribution and picking the best.

Maximum likelihood parameter estimation

Given a set of observations, and an n-parameter distribution hypothesized to be the source of the observations, the maximum likelihood method attempts to find the set of parameter values that maximizes the probability of the given observations. This will be illustrated using the (one-parameter) exponential distribution.

Suppose [Graphics:Images/index_gr_144.gif],  [Graphics:Images/index_gr_145.gif], . . .,  [Graphics:Images/index_gr_146.gif] are believed to be observations from an exponential distribution with parameter λ. Then for each  [Graphics:Images/index_gr_147.gif], the probability density at that value is f([Graphics:Images/index_gr_148.gif]) = λ[Graphics:Images/index_gr_149.gif].  Since the observations are assumed to be independent, joint probability is the product of individual probabilities, and the joint density for all the t's is the likelihood function
    L(λ, [Graphics:Images/index_gr_150.gif],  [Graphics:Images/index_gr_151.gif], . . .,  [Graphics:Images/index_gr_152.gif])  =  [Graphics:Images/index_gr_153.gif] .
The value of λ which maximizes L for the given set of t's is the maximum likelihood (ML) estimate.

Since log is monotomically increasing, L has a maximum iff log(L) is maximized. The log likelihood is easier to work with for most distributions used in reliability, so we use
    log(L(λ, [Graphics:Images/index_gr_154.gif],  [Graphics:Images/index_gr_155.gif], . . .,  [Graphics:Images/index_gr_156.gif]))  =  [Graphics:Images/index_gr_157.gif] = n log(λ) - [Graphics:Images/index_gr_158.gif] .
    [Graphics:Images/index_gr_159.gif]log(L(λ, [Graphics:Images/index_gr_160.gif],  [Graphics:Images/index_gr_161.gif], . . .,  [Graphics:Images/index_gr_162.gif]))  = [Graphics:Images/index_gr_163.gif] -  [Graphics:Images/index_gr_164.gif]
                      =  0 when
                          λ  =  n / [Graphics:Images/index_gr_165.gif].
Note that
    [Graphics:Images/index_gr_166.gif]log(L(λ, [Graphics:Images/index_gr_167.gif],  [Graphics:Images/index_gr_168.gif], . . .,  [Graphics:Images/index_gr_169.gif]))  = -[Graphics:Images/index_gr_170.gif] <  0
so this is a maximum for the given fixed  [Graphics:Images/index_gr_171.gif],  [Graphics:Images/index_gr_172.gif], . . .,  [Graphics:Images/index_gr_173.gif].

Applying this to the example above, the maximum likelihood estimate of the exponential rate parameter for the leukemia remission data is:

[Graphics:Images/index_gr_174.gif]
[Graphics:Images/index_gr_175.gif]

This can be compared with the linear regression estimate using the Kolmogorov-Smirnov test for goodness of fit (described below):

[Graphics:Images/index_gr_176.gif]
[Graphics:Images/index_gr_177.gif]

The smaller K-S statistic for the maximum likelihood estimate indicates it is a better fit. This is in accord with the theory, which says that ML is the least biased estimate of distribution parameters. Nevertheless, in practical reliability work, linear fitting is used more often than ML. ML for the exponential distribution is exceptionally simple; for multiparameter distributions (such as the Weibull, described below) it is more difficult, and apparently practicioners feel that the somewhat better fit does not justify the effort.  In this example, either fit might appear "good enough" visually:   

[Graphics:Images/index_gr_178.gif]

[Graphics:Images/index_gr_179.gif]

(The lower continuous curve is the ML estimated fit.)

Example: The Weibull distribution

The Weibull distribution is a flexible two-parameter distribution that is widely used in reliability analysis. The theoretical justification is a theorem that says that in the limit, the smallest value in each of a large number of samples from any distribution will be Weibull-distributed. This weakest link property is valid for many types of units, particularly mechanical parts, where a failure in the weakest part of the unit propagates and causes the entire unit to fail.

The Weibull equations are:

[Graphics:Images/index_gr_180.gif]

(The Weibull, as well as the exponential distribution, is also available in the Mathematica package Statistics`ContinuousDistributions`.)

In these equations, a is the scale parameter and b is the shape parameter. Note that if b = 1, the result is an exponential distribution with λ = 1/a. Depending on the value of b, a Weibull distribution can have decreasing, constant (exponential), or  increasing hazard rate:

[Graphics:Images/index_gr_181.gif]

[Graphics:Images/index_gr_182.gif]

In looking at the hazard rate for a complex system, we can use Weibull distributions and a competing risk model to derive the hazard rate curve for the total system life. The idea is that there is a "burn-in" period with high, but decreasing, hazard rate, as weak components fail and are replaced. Then, during the operating life of the system, the hazard rate is constant; finally, as components mechanically wear out, the hazard rate increases, and presumably the system is retired. Here is a simple competing risk model:

[Graphics:Images/index_gr_183.gif]

[Graphics:Images/index_gr_184.gif]

This yields an example of the well-known "bathtub curve" of failure rate over the life of a system.

The ballBearingFailures dataset (from [14]) is failure times (measured in millions of revolutions) of a sample of 23 ball bearings, tested until all had failed. The empirical hazard rate looks like this:   

[Graphics:Images/index_gr_185.gif]

[Graphics:Images/index_gr_186.gif]

Smoothing the data makes it easier to interpret:

[Graphics:Images/index_gr_187.gif]

[Graphics:Images/index_gr_188.gif]

Aside from one outlying point, the hazard rate seems to increase and then decrease, suggesting more than one failure mode is at work. But nevertheless, we can see how good a fit we get to a single Weibull distribution. Since
    R(t) = exp[-[Graphics:Images/index_gr_189.gif]]
    log[R(t)] = -[Graphics:Images/index_gr_190.gif]
    log{-log[R(t)]} = b(log t) - b(log a),
letting x = log t, y = log{-log[R(t)]}, we have a linear equation with slope b and intercept -b(log a). After fitting an equation to the data, we can then solve for a and b.

First, we transform the data appropriately, then do the fit:

[Graphics:Images/index_gr_191.gif]
[Graphics:Images/index_gr_192.gif]

[Graphics:Images/index_gr_193.gif] of .97 indicates a good fit. Note, however, that most empirical datasets don't tell much about the tails of the distribution, very short and very long failure times; and it is the tails that often are needed to differentiate between distributions. E.g., in [26] it is shown that this particular dataset fits either Weibull or Lognormal distributions.

The regression equation yields b = 2.17951, and since intercept = -9.59963 = -b(log a), for a we have:

[Graphics:Images/index_gr_194.gif]
[Graphics:Images/index_gr_195.gif]

Visually, this gives a good fit to the data:

[Graphics:Images/index_gr_196.gif]

[Graphics:Images/index_gr_197.gif]


As an example of how this model might be used, suppose the manufacturer of the ball bearings in question wishes to determine an economic warranty period. A financial analysis might suggest that no more than 1% of the bearings should need replacement under warranty.  This means we need to find the value of t such that the Weibull reliability function yields R(t) >= 0.99.

[Graphics:Images/index_gr_198.gif]
[Graphics:Images/index_gr_199.gif]

This implies that the manufacturer should warrant the bearings for no more than about 10 million revolutions. Given that warranties are ordinarily stated in months, years or miles, some assumption about the duty cycle is needed. Suppose the bearings are used in the hubs of 26-inch radius bicycle wheels. Such a wheel makes about 388 revolutions per mile, so conservatively, the bearings could be warrantied for 25,000 miles. This would be a reasonable life for a bicycle bearing; it would not be very good if the bearings were used in automobile wheels.

Both the leukemia remission and ball bearing datasets were uncensored--that is, all items in the sample were tested to failure. Often, we have censored data of various kinds. In leukemia tests, study participants may drop out before undergoing remission, e.g., because they moved away or died from an unrelated cause. A ball bearing test would typically be run for a fixed length of time or a fixed number of failures, rather than until all items had failed. Thus we would end up with a set of failure times, and another set of times that simply said the items had not failed as of when the test was terminated. Methods exist to deal with censored datasets, but they are beyond the scope of this paper (see [13], [20], [26]).

Testing goodness of fit

The simplest formal test for goodness of fit between a data set and a hypothesized probability distribution is the Kolmogorov-Smirnov. The statistic is the maximum absolute difference between an empirical CDF probability, and the predicted probability based on the hypothesized distribution. The null hypothesis is that differences between the empirical CDF and the theoretical distribution are by chance. (Thus, in contrast to many practical uses of goodness-of-fit tests, here we typically do not want to reject the null hypothesis,)

[Graphics:Images/index_gr_200.gif]

For the rationale behind computing D+ and D-, see any detailed treatment of the K-S test, e.g., in [12]. The distribution (and thus the critical values) of the K-S statistic depend on the source of the theoretical distribution being tested. If it is fitted based on the empirical data, as is done above, the distribution of the K-S statistic is specific to the type of distibution, e.g., Weibull or exponential.

For the ball bearing data,

[Graphics:Images/index_gr_201.gif]
[Graphics:Images/index_gr_202.gif]

Using the K-S table for the Weibull distribution in [12], we find that the null hypothesis can be accepted at a significance level of 0.10; i.e., we don't reject the hypothesis that the data are sampled from the given Weibull distribution. Note that for acceptance of the null hypothesis, a higher significance level is better.

Now let us automate the entire procedure of fitting to a Weibull distribution and testing it for goodness of fit. First, the function fitWeibull fits (and returns) a Weibull CDF from its input, a set or points representing an empirical reliability function. Mathematica's Fit is used here instead of Regress, because we only need the fitted coefficients and not the correlation information. The function returns a list with the Weibull shape and scale parameters, and the CDF as a pure function:

[Graphics:Images/index_gr_203.gif]

Next, using a table of critical values for the K-S statistic for a fitted Weibull distribution from [12], the goodness of fit is tested directly from the failure dataset. The ksWeibullGof[] function returns the raw data in a list: sample size, Weibull shape and scale parameters, a boolean reject indicating whether the null hypothesis is to be rejected, the significance level alpha, the value of the K-S statistic, and the fitted Weibull CDF as a pure function:

ksWeibullGof[(data_)?ListQ] := 
Module[
{ksWeibullTable, sampleSize, sampleSizeList, fit, cdf, i,shapeParm, scaleParm, empCdf, ksStat, sigList,reject,alpha},
(* Table rows are sample size followed by pairs of 1-alpha, K-S value *)
ksWeibullTable := {
  {10, {0.9, 0.76}, {0.95, 0.819}, {0.99, 0.944}},
  {20, {0.9, 0.779}, {0.95, 0.843}, {0.99, 0.973}},
  {50, {0.9, 0.79}, {0.95, 0.856}, {0.99, 0.988}},
  {100, {0.9, 0.803}, {0.95, 0.874}, {0.99, 1.007}}};
  sampleSize := Length[data];
  sampleSizeList := ksWeibullTable[[Length[ksWeibullTable]]];
  empCdf := empiricalCDF[data, Plot -> False];
  empCdfSize := Length[empCdf];
  fit := fitWeibull[empiricalReliability[data]];
  shapeParm := fit[[1]];
  scaleParm := fit[[2]];
  cdf := fit[[3]];
  (* Select appropriate row from K-S table for sample size *)
  For[
    i = Length[ksWeibullTable], i > 1, i--,
    If[
      sampleSize < ksWeibullTable[[i]][[1]],
      sampleSizeList = ksWeibullTable[[i - 1]]
    ]
  ];
  ksStat := ksStatistic[empCdf, cdf];
  adjKsStat := Sqrt[sampleSize]*ksStat;
  sigList = Null;
  (* Select nearest higher K-S statistic value from table *)
  For[
    i = 2, i <= Length[sampleSizeList], i++,
    If[
      adjKsStat < sampleSizeList[[i]][[2]],
      sigList:= sampleSizeList[[i]]; Break[]
    ]
  ];
    reject:=sigList===Null;
    If[!reject, alpha:=1-sigList[[1]]];
    {sampleSize,shapeParm,scaleParm,reject,alpha,ksStat,cdf}
]

The printKsWeibullGof[] function prints the goodness of fit information,and returns the fitted Weibull function:

[Graphics:Images/index_gr_204.gif]


Using these on the ball bearing data:

[Graphics:Images/index_gr_205.gif]
[Graphics:Images/index_gr_206.gif]
[Graphics:Images/index_gr_207.gif]
[Graphics:Images/index_gr_208.gif]
[Graphics:Images/index_gr_209.gif]
[Graphics:Images/index_gr_210.gif]
[Graphics:Images/index_gr_211.gif]


An alternative for testing goodness of fit is the Pearson [Graphics:Images/index_gr_212.gif] test. The test operates on grouped frequency data, which makes it popular in certain fields, such as social sciences and medicine, where observations are often naturally grouped. It can be cumbersome for data that are not grouped naturally, and is also less satisfactory than Kolmogorov-Smirnov in some contexts for technical reasons (see [2]); in particular, it may not be accurate for distributions with long tails, such as the exponential and Weibull.

The idea is that for grouped data, we compute a statistic that compares [Graphics:Images/index_gr_213.gif], the number of observations found in each group, with [Graphics:Images/index_gr_214.gif], the number we would expect to find, under the hypothesis that the data come from a certain distribution. The statistic is

    [Graphics:Images/index_gr_215.gif]  =  [Graphics:Images/index_gr_216.gif][Graphics:Images/index_gr_217.gif] .

This statistic is approximately [Graphics:Images/index_gr_218.gif]-distributed, hence the name of the test. The smaller the value of the statistic, the better the fit. The [Graphics:Images/index_gr_219.gif] distribution has one parameter, the number of degrees of freedom. This is the number of independent observations, which for grouped data is the number of groups minus 1.

If the test is of observed data against a hypothesized distribution, [Graphics:Images/index_gr_220.gif] is calculated by looking at the largest and smallest observations in the group, say (if we are dealing with failure times) [Graphics:Images/index_gr_221.gif] and [Graphics:Images/index_gr_222.gif]. If the total number of observations in the data set is n, the expected number in [[Graphics:Images/index_gr_223.gif], [Graphics:Images/index_gr_224.gif]] is n |F([Graphics:Images/index_gr_225.gif]) - F([Graphics:Images/index_gr_226.gif])|, where F is the distribution CDF. The null hypothesis in this case is that the observed data come from the hypothesized distribution, and deviations are by chance. The null hypothesis is rejected at the k% level if the probability of the deviation (expressed by the [Graphics:Images/index_gr_227.gif] statistic) is less than k%.

Testing the sample failure data involves two steps--grouping the data, and then computing the [Graphics:Images/index_gr_228.gif] statistic based on the grouped data and the hypothetical distribution. To achieve acceptable accuracy, the consensus is that there should be at least three groups, with at least five observations per group. Here, any extra observations are added to the last group :

[Graphics:Images/index_gr_229.gif]

Here the leukemia data is tested against the fitted exponential distribution:

[Graphics:Images/index_gr_230.gif]
[Graphics:Images/index_gr_231.gif]

The probability can be evaluated using a table, or using Mathematica's function:

[Graphics:Images/index_gr_232.gif]
[Graphics:Images/index_gr_233.gif]

This means that the probability of a [Graphics:Images/index_gr_234.gif] value this large or larger is about 32%. Thus if we are (as is typically done) testing at the 5% significance level, we do not reject the null hypothesis. A similar procedure can be carried out on the ball bearing data:

[Graphics:Images/index_gr_235.gif]
[Graphics:Images/index_gr_236.gif]
[Graphics:Images/index_gr_237.gif]
[Graphics:Images/index_gr_238.gif]

Again, we do not reject the null hypothesis. Note that these are "p values"; the significance level is 1 - p. Note also that since the [Graphics:Images/index_gr_239.gif]distribution is known, an exact p-value or significance can be given. In the case of the Kolmogrov-Smirnov test where the hypothetical distribution is fitted from the data being tested, there is no closed form expression or other simple means of computing the distribution, so tables derived by Monte Carlo methods are used. Thus the result is given in terms of critical levels such as 0.10 (or the equivalent p-values such as 0.90) rather than exact values.

Markov models

There are many systems in the world that exhibit (at least approximately) the following characteristics:

1. At any given point in time, the system is in some defined state, which may be labeled with an integer (if the total number of states is finite or countable) or a real number (if the set of states is continuous).
2. The system periodically undergoes a transition from its current state to another state; in general, the state to which a transition is made is known only in a probabilistic sense. Transitions may occur at finite time intervals, or continuously.
3. The probability of making a transition from state i to state j at time t depends only on i and j; i.e., the system has no memory of its past states. (In the case of a process where transitions may occur continuously in time, we deal with the probability density or rate of transitions, rather than the transition probability.)

These conditions define a Markov process, named after the Russian mathematician who first studied them. Such processes are widely used in applications ranging from modeling traffic queueing, to studying phoneme sequences in natural language. Only a tiny fragment will be dealt with here.

In reliability work, we deal almost exclusively with discrete-state processes, where the states might be whether a unit is operational or not, how many units have failed, the number of repair actions, etc. Typically, the same system can be mapped to more than one state space, and we pick the one that is most useful for the problem at hand. Time is taken to be continuous, since failure can occur at any instant in time. Continuous-time systems are initially defined in terms of discrete time, where transitions occur at intervals Δt;  we can then examine how the process behaves as Δt->0.

Reliability modeling examples

To begin with, take a simple example using discrete time. Given a computer system whose availability is only measured to some finite time unit, say an hour, we can build a Markov model in which transitions occur hourly. State 1 is "the computer is up (operational)"; state 2 is "the computer is down (failed)". Suppose that from measurements, the probability that the computer will fail in any given hour is 0.01, and, given that it is down, the probability is .30 that it will be repaired in any given hour after the failure.

We first define the (square) transition probability matrix P, such that [Graphics:Images/index_gr_240.gif] is the probability of making a transition from state i to state j. For this case,

[Graphics:Images/index_gr_241.gif]
[Graphics:Images/index_gr_242.gif]

Note the use of the constraint that the sum of entries in each column vector must be 1, since the set of transitions from state i is mutually exclusive and exhaustive. Now, we add a state vector x giving the probabilities of being in the various states at a point in time (again, the entries must sum to 1); initially, we can assume that the system is up, so x(0) = (1, 0). Since [Graphics:Images/index_gr_243.gif]is the probability of being in state i before the transition, the probability of being in state j after the transition is  [Graphics:Images/index_gr_244.gif][Graphics:Images/index_gr_245.gif]. This is the jth entry in the vector Px. For example, after the first transition in the example, the state vector is

[Graphics:Images/index_gr_246.gif]
[Graphics:Images/index_gr_247.gif]

After two transitions, it is

[Graphics:Images/index_gr_248.gif]
[Graphics:Images/index_gr_249.gif]

In general, after n transitions the state vector is P(P(P...x)...) = [Graphics:Images/index_gr_250.gif]x. In situations such as this example, an obvious question is whether the process is stationary, i.e., whether the state vector converges to some value over time. Given a tool like Mathematica, the "quick and dirty" way to find out is to determine the stationary value empirically, by computing powers of P. If x is taken as a vector in which the ith entry is 1 and all others are 0,  [Graphics:Images/index_gr_251.gif]x will be the ith column of  [Graphics:Images/index_gr_252.gif]. In this case, P converges:

[Graphics:Images/index_gr_253.gif]
[Graphics:Images/index_gr_254.gif]

Thus over a long period of time, we expect the system to be available about 97% of the time. In "normal" systems encountered in reliability the process is always stationary, and it can be shown that, regardless of the initial state, the state vector converges to the (identical) column vectors of  [Graphics:Images/index_gr_255.gif], as n->∞. (See, e.g., [10], [19]).

A more formal method of determining the value of the column vectors is shown below, but first it is interesting to look at a closed-form solution for this example, using differential equations.

From the conditions of a Markov process, the probability (or probability density) of transitions does not change with time. Making the example more realistic, instead of saying that the probability of transition is, say, λ on the hour, we say the rate of transition is λ per hour. Since the rate is constant, for any interval Δt the probability of transition within the interval is λ Δt. Then the transition matrix (using λ for the rate of failure and μ for the rate of repair), for transitions every Δt, becomes   

    P = [Graphics:Images/index_gr_256.gif],  thus for one transition the equation for the state vector is
    
    x(t+Δt)  =  [Graphics:Images/index_gr_257.gif] =  Px(t)  = [Graphics:Images/index_gr_258.gif][Graphics:Images/index_gr_259.gif] .

Separating the two components of the state vector, and after some algebra, this reduces to

    [Graphics:Images/index_gr_260.gif] = - λ [Graphics:Images/index_gr_261.gif](t) + μ [Graphics:Images/index_gr_262.gif](t)     and    [Graphics:Images/index_gr_263.gif] = λ [Graphics:Images/index_gr_264.gif](t) - μ [Graphics:Images/index_gr_265.gif](t).

Taking limits as Δt -> 0,

     [Graphics:Images/index_gr_266.gif]' (t) = - λ [Graphics:Images/index_gr_267.gif](t) + μ [Graphics:Images/index_gr_268.gif](t),    [Graphics:Images/index_gr_269.gif]' (t) = λ [Graphics:Images/index_gr_270.gif](t) - μ [Graphics:Images/index_gr_271.gif](t).

Finally, taking advantage of the fact that [Graphics:Images/index_gr_272.gif](t) = 1 - [Graphics:Images/index_gr_273.gif](t), after some algebra,

     [Graphics:Images/index_gr_274.gif]' (t) =  -+ μ) [Graphics:Images/index_gr_275.gif](t) + μ ,         [Graphics:Images/index_gr_276.gif]' (t) =  -+ μ) [Graphics:Images/index_gr_277.gif](t) + λ .
     
These can be solved by Mathematica, using the example values λ = 0.01, μ = 0.30, and the initial conditions [Graphics:Images/index_gr_278.gif](0) = 1, [Graphics:Images/index_gr_279.gif](0) = 0:  

[Graphics:Images/index_gr_280.gif]
[Graphics:Images/index_gr_281.gif]
[Graphics:Images/index_gr_282.gif]
[Graphics:Images/index_gr_283.gif]
[Graphics:Images/index_gr_284.gif]
[Graphics:Images/index_gr_285.gif]

This is identical to the solution computed above using matrix powers.

Now consider a more complex process, shown in figure 4 below. There are two redundant systems, A and B; either can fail independently, and if down, can be repaired and brought back up. The figure shows the relevant transition rates (the probability of a direct transition from state 1 to state 4 is small enough to be neglected). The most important question is the probability of total system outage, represented by state 4.

[Graphics:Images/index_gr_286.gif]

                                Figure 4.

The transition matrix is

[Graphics:Images/index_gr_287.gif]
[Graphics:Images/index_gr_288.gif]

As in the previous example, we can find the stationary probabilities by brute force:

[Graphics:Images/index_gr_289.gif]
[Graphics:Images/index_gr_290.gif]

(In this case, Mathematica does not seem to be able to handle  Limit[MatrixPower[case2P,n],n->∞]--the result is a zero matrix. It does converge to the value above, however, as can be verified by computing a series of large powers of the matrix.)

More formally, in finding the stationary probabilities, we are looking for the state vector x such that Px = x. In other words, an eigenvector x corresponding to the eigenvalue 1, and satisfying  [Graphics:Images/index_gr_291.gif] = 1, since the [Graphics:Images/index_gr_292.gif] are probabilities of a set of mutually exclusive and exhaustive states. Starting with

[Graphics:Images/index_gr_293.gif]
[Graphics:Images/index_gr_294.gif]

The desired eigenvector is the first:

[Graphics:Images/index_gr_295.gif]
[Graphics:Images/index_gr_296.gif]

However, the components do not sum to one. Since an eigenvector is unuque only up to a scalar multiple, it can be normalized by the sum of its components to produce the desired result:

[Graphics:Images/index_gr_297.gif]
[Graphics:Images/index_gr_298.gif]

We can confirm that it is the desired eigenvector:

[Graphics:Images/index_gr_299.gif]
[Graphics:Images/index_gr_300.gif]

And also note that it is identical to the result produced above by matrix powers.

Markov models are limited by the fact that the "memoryless" property implies exponential transition rates; thus modeling of, say, Weibull failure rates is ruled out. But during the operating life of many systems (the flat part of the bathtub curve), exponential rates are a good enough approximation to make the simplicity of the Markov approach appealing.

Simulation or Monte Carlo methods

The Monte Carlo method is familiar to most mathematicians as a way to solve deterministic problems using probabilistic methods. For example, to find the area under a curve, within some larger bounding rectangle, we can select points within the rectangle at random, determine whether each point is under the curve, and take [Graphics:Images/index_gr_301.gif] as an approximation to the desired area.

Monte Carlo, or discrete event simulation, is more commonly used to solve non-deterministic problems, such as those encountered in reliability analysis. The general idea is  to describe a system as a network of "black boxes" whose inputs and behavior can be characterized by random variables; then use a random number generator to generate inputs, "run" the system over some time period, and measure the outputs. In most cases, the model is algorithmic, and is applied to problems for which no closed form solution or conventional numerical approximation is available.  

Generating values of random variables

One aspect of simulation modeling is defining the network model, which is specific to the situation under study. A more general problem, arising in all simulation models, is this: given a random variable X characterized by a probability distribution, to generate a time series of values [Graphics:Images/index_gr_302.gif], [Graphics:Images/index_gr_303.gif], [Graphics:Images/index_gr_304.gif], . . . from this distribution. All languages used for writing simulations, including Mathematica, supply methods for generating random numbers from a uniform distribution. By definition, if U is a uniform random variable on the interval [a, b], the probability of a value of U being found in the interval [u, u + Δu] is [Graphics:Images/index_gr_305.gif]. Recalling the discussion above related to figure 1, for any random variable X, the probability P that the random variable has a value <= x is uniformly distributed on the interval [0, 1]. So if the CDF of X has an inverse [Graphics:Images/index_gr_306.gif]: P -> X, we can use the inverse function to map the uniform variable U[0, 1] to X[0, ∞].

For example, take the CDF for the exponential distribution,  p = F(t) = 1 - [Graphics:Images/index_gr_307.gif].  The inverse function is t = [Graphics:Images/index_gr_308.gif], but note that p and 1-p have the same distribution.  So this function will return a random exponential variate for the parameter λ :

[Graphics:Images/index_gr_309.gif]

Here the function is used to generate a data set simulating the leukemia remission data, which is plotted against the continuous CDF:

[Graphics:Images/index_gr_310.gif]

[Graphics:Images/index_gr_311.gif]

[Graphics:Images/index_gr_312.gif]

(Compare this to the original data plot.)

This might be used to set up a simulation model that showed how natural remissions competed with the hypothesized effect of a new drug, in order to validate the design of a trial.

Since a simulation might need to generate thousands of values from a given distribution, computing the inverse function for each value may impact performance, particularly for CDFs (such as the normal) which do not have closed-form inverses. Given that the results only have a finite precision, standard practice is to initialize a table mapping p to t in increments, say of .01. This is similar to the empirical CDF, and in fact an empirical CDF could be used in a simulation model for this purpose.

Taking as an example the Weibull distribution for the ball bearing data, here is a table-driven generator of Weibull random variates. First a table is built. In order to insure coverage of the significant portion of the probability space, the function first computes the value of t with a cumulative probability of 0.999; then it iterates from 0 through this t, computing at least 100 values (the number of values could be increased for more accuracy). Then the table is used to generate the random variates, by mapping a value from U[0,1] to the closest table value.

[Graphics:Images/index_gr_313.gif]

Here a table is created to simulate the ball bearing failure data

[Graphics:Images/index_gr_314.gif]

and a simulated data set is created from the table and plotted against the CDF:

[Graphics:Images/index_gr_315.gif]

[Graphics:Images/index_gr_316.gif]

[Graphics:Images/index_gr_317.gif]

(Compare this to the original data plot.)

Reliabilty modeling example

To illustrate a simple, but more complete, simulation model, take the first example from the section on Markov models. As shown in the figure below, a system may fail, and having failed, it may be repaired. The failure time and repair time distributions are exponential, with the indicated rate parameters:

[Graphics:Images/index_gr_318.gif]

The general scheme of the model is, starting in state 1, pick a random failure time. All time prior to the failure was spent in state 1, so add this time to the cumulative time in state 1, and move to state 2. Then pick a random repair time, add it to the cumulative time in state 2, and move to state 1. This process is iterated some number of times, and then results are computed; for example,  [Graphics:Images/index_gr_319.gif]  gives the long-term probability that the system is operational. Here is the model function, which takes as parameters the failure and repair rates, number of iterations, and initial state:

[Graphics:Images/index_gr_320.gif]

And here it is run with the rate parameters from the example:

[Graphics:Images/index_gr_321.gif]
[Graphics:Images/index_gr_322.gif]
[Graphics:Images/index_gr_323.gif]
[Graphics:Images/index_gr_324.gif]

Comparing this to the analytic solution {0.967742, 0.0322581}, derived above,

[Graphics:Images/index_gr_325.gif]
[Graphics:Images/index_gr_326.gif]

shows it is accurate to within 1%. Convergence to the expected value is quite slow as the number of iterations is increased.

Methods exist for computing statistical confidence intervals on the outputs, but these become less tractable as the model becomes realistically large. Typically, engineers are content to perform a sensitivity analysis by varying parameters over a range they consider realistic, and observing how the variations affect the output.

Summary

Americans, and increasingly people all over the world, are dependent on complex technological systems for things they take for granted in daily life. Other things being equal, increased complexity leads to reduced reliability. In most cases, failure of technology (automobiles, kitchen appliances, etc.) is inconvenient but not dangerous. In other cases (aircraft, hospital equipment, etc.) it can be life-threatening. And in a few cases, such as nuclear reactors, failure can be catastrophic on a large scale. As we become more dependent on technology that is developed in shorter design and test cycles, high-quality reliability analysis beccomes more important to everyone.

This paper has introduced reliability analysis as one of many ways in which mathematics can be used to model, explain, and predict phenomena in the real world. Reliability engineers, starting from a theoretical understanding of failure modes and data on actual failures, attempt to predict failure probabilities in situations that have not yet been tested. Raliability analysis is applicable to every phase of product or system development, in order to set reasonable and attainable goals, and to take corrective action when needed.

In addition to its practical value, reliability analysis is mathematically interesting, because it uses theory that is applicable to a broad range of disciplines, from actuarial science to medical research. The theory and techniques are drawn mainly from probability, statistics, and stochastic process analysis. In this paper a number of particular techniques were selected, intended to illustrate the range of reliability analysis and the use of Mathematica.

Empirical models of reliability start with observed data, which is analyzed to determine a distribution type that fits the data. The theoretical distribution can then be used to make reliability predictions beyond the range of the observations. Developing such a model involves determining the distribution type, estimating parameters of the distribution, and testing the observed data against the model for goodness of fit.

Markov models are based on defining an exhaustive state space for the system or process of interest, and estimating probabilities or rates for transitions between states. If the system being modeled conforms to the mathematical requirements--principally that transition probabilities are time-invariant--a great deal of information can be derived from the model. Simple models may be solvable in closed form, but any properly formulated Markov model can be solved numerically.

The most realistic models are discrete event or Monte Carlo simulations, which use random inputs to drive models that incorporate the structure of the real system. This technique is normally employed only when no simpler approach is viable, since it is more expensive and requires more knowledge about the system being modeled.

Based on my own experience and studying a range of textbooks, the typical reliability engineer uses a hodge-podge of tools:  Statistical software, special-purpose software tools, probability-plotting paper, nomograms, tables, hand calculators, etc. Though many of these  are optimal for one facet of the analysis, a general-purpose tool with mathematical depth, such as Mathematica, may be a superior approach for developing an integrated approach to reliability analysis. This would be true all the more if Mathematica were supplemented with packages containing functions tailored for reliability analysis, along the lines of those developed here.

References

[1] Abernathy, Bob, and Wes Fulton. Weibull News 11, 1997. http://www.weibullnews.com/wnews11.html.

[2]  Bulmer, M. G. Principles of Statistics. New York, N. Y.: Dover Publications, 1979.

[3] Department of Defense. MIL-HDBK-217F: Reliability Prediction of Electronic Equipment. Washington, D. C.: Department of Defense, 1991. (Downloadable from http://astimage.daps.dla.mil/quicksearch/.)

[4] Department of Defense. MIL-HDBK-338B: Electronic Reliability Design Handbook. Washington, D. C.: Department of Defense, 1998. (Downloadable from http://astimage.daps.dla.mil/quicksearch/.)

[5]  Feller, William. An Introduction to Probability Theory and Its Applications, Volume I. New York, N. Y.: John Wiley & Sons, 1968.

[6]  Friedman, David. Hidden Order: The Economics of Everyday Life. New York, N. Y.: HarperCollins, 1997.

[7]  Gotelli, Nicholas J. A Primer of Ecology. Sunderland, Mass.: Sinauer Associates, Inc., 1998.

[8]  InfoPlease. "United States - U.S. Statistics - Population". Available online at http://www.infoplease.com/ipa/A0004997.html.

[9]  Inter-university Consortium for Political and Social Research. "Historical Demographic, Economic and Social Data: The United States, 1790-1970". Ann Arbor, Mich.: n.d. Data extract available online at http://fisher.lib.virginia.edu/census/.

[10]  Kleinrock, Leonard. Queueing Systems, Volume I:  Theory. New York, N. Y.: John Wiley & Sons, 1975.

[11]  Kolmogorov, A. N. "The Theory of Probability". Chapt. XI in A. D. Aleksandrov, et. al., eds., Mathematics: Its Content, Methods, and Meaning. Cambridge, Mass.: The M. I. T. Press, 1963.

[12]  Law, Averill M., and W. David Kelton. Simulation Modeling and Analysis. Boston, Mass.: McGraw-Hill, 2000.

[13]  Leemis, Lawrence M. Reliability: Probabalistic Models and Stastical Methods. Upper Saddle River, N. J.: Prentice-Hall, 1995.

[14]  Lieblein, J., and Zelen, M. "Statistical Investigation of the Fatigue Life of Deep-Groove Ball Bearings", Journal of Research, National Bereau of Standards, Vol. 57, p. 273, 1956. Quoted in [13].

[15]  McLinn, James A. Weibull Analysis Primer. American Society for Quality Control Reliability Division monograph, 1997.

[16]  Meridian Marketing Group. "Get top performance from your system with simulation" (Advertisement for GPSS/H).
http://www.meridian-marketing.com/GPSS_H/wo_p1.html.

[17]  Musa, John D., et al. Software Reliability: Measurement, Prediction, Application. New York, N. Y.: McGraw-Hill, 1987.

[18]  Nair, Vijayan, and Anne E. Freeny. Methods for Assessing Distributional Assumptions in One and Two Sample Problems. Murray Hill, N. J.: AT&T Bell Laboratories, n.d.

[19]  Norris, J. R. Markov Chains. Cambridge, UK: Cambridge University Press, 1997.

[20]  O'Connor, Patrick. Practical Reliability Engineering (third edition revised). Chichester, U.K.: John Wiley & Sons, 1995.

[21]  Pukite, Jan, and Paul Pukite. Modeling for Reliability Analysis: Markov Modeling for Reliability, Maintainability, Safety, and Supportability Analysis of Complex Computer Systems. New York, N. Y.: IEEE Press, 1998.

[22]  Sobol', Ilya M. A Primer for the Monte Carlo Method. Boca Raton, Fla.: CRC Press, 1994.

[23]  Thompson, D'Arcy. On Growth and Form. New York, N. Y.: Dover Publications, 1992.

[24]  U. S. Census Bureau. "First Census 2000 Results". Available online at http://www.census.gov/main/www/cen2000.html.

[25]  Wigner, Eugene. "The Unreasonable Effectiveness of Mathematics in the Natural Sciences." In Symmetries and Reflections: Scientific Essays. Woodbridge, Ct.: Ox Bow Press, 1979.

[26]  Wolstenholme, Linda C. Reliability Modelling: A Statistical Approach. Boca Raton, Fla.: Chapman & Hall/CRC, 1999.


Converted by Mathematica      January 1, 2003