The P-Value Concept as a New Approach to Statistical Process Control

A Look at Statistical Process Control through the P-Values

Yoav Benjamini and Yechezkel Kling

Tel Aviv University, Israel

Abstract

Statistical process control (SPC) involves repeated testing at each time point, or at each batch, of the same null hypothesis: that the process is under control. The results of these tests are usually reported at a fixed level, by observing whether the test statistic is in the rejection region — outside the control lines. In this work we address the use in SPC of the p-values, or their monotone transformation to the observed average run length value. We argue that the use of p-values in SPC carries with it major benefits: (a) It offers better graphical displays of the performance of the process, which are also easier to interpret; (b) it allows incorporating more complex control procedures into existing charts; and (c) it facilitates the incorporation of the effect of multiple testing into SPC.

We demonstrate the above by offering modifications to the P-charts, X̄-S charts, and CUSUM, the latter requiring special approximations to the p-value. Adjusted p-values for multiplicity control in SPC are used for examining ten control charts for five quality attributes running in parallel.

Key words: adjusted p-values, CUSUM, average run length, control charts, multiple comparisons.

Introduction

The use of control charts for Statistical Process Control (SPC) can be viewed as repeated hypothesis testing, where the tested null hypothesis is that the process examined is under control at the time the sample was drawn (e.g. Sarkadi and Vincze (1974, p. 231), Alt (1985)). Woodall and Montgomery (1999) mention the relationship between hypothesis testing and control charts when listing major disputes in the research in SPC. Even when not specifically mentioned, it is a common practice to refer to the related type I and type II errors, taking the aforementioned hypothesis for granted (e.g. Grant and Leavenworth 1980, p. 109, and Bissell 1994, p. 103). Assuming that the process is under control, and that the test statistics are statistically independent, the number of time points until the first false alarm is geometrically distributed. Therefore, the type I error, which is the probability of a false alarm at a certain time point, is the reciprocal of the Average Run Length until false alarm (ARL₀) (for instance see Bissell 1994, p. 104).

In many fields of research it is customary to report the results of statistical testing in terms of the p-value, rather than in the fixed level form where the pre-specified significance level is used and the results are reported in terms of "reject" (out-of-control) or "do not reject" (in-control). The p-value is the smallest significance level at which the relevant hypotheses would be rejected given the observed realized value of the test statistic (see discussion in Gibbons 1985). As such, the p-value carries the information of how strong the evidence against the null hypothesis is, being a measure of the extremity of the sample result in view of the null hypothesis. Almost all statistical packages and applications supplying statistical test procedures report the results in terms of the p-values. Thus later analysis can be performed at any desired significance level α, by comparing the resulting p-value to α. While discussions about the advantages, limitations, and misinterpretations of the p-values are abundant (e.g. Schervish, 1996, Casella and Berger, 1987, Berger and Sellke, 1987), it remains the most common way of using and reporting the results of statistical testing.

The analog of the observed p-value for SPC is derived by placing the Action Line at the height of the observation and calculating the corresponding observed ARL₀. That is, for an observation of magnitude M, calculate the ARL₀ as if the future observations would be controlled using the action line fixed at height M. Then the p-value is the reciprocal of the observed ARL₀.

In this work we argue that the use of p-values in SPC carries with it major benefits. First, it offers better graphical displays of the performance of the process, which are also easier to interpret. Second, it allows incorporating more complex control procedures into existing charts. Finally it facilitates the incorporation of the effect of multiple testing into SPC. We shall explain each of these points below.

(a) Better graphical display

In situations where sample size varies, most SPC charts have changing Action Lines. These plots tend to look messy and are confusing. For the sake of demonstration we have generated a P-chart with the above mentioned properties. The upper panel of Figure 1, displaying this P-chart, is cluttered and the eye is drawn to the motion of the action lines rather than to the actual measured proportion of defects. Moreover, points of the same magnitude may incorporate different risks; for instance the pair of runs 9 and 10 and the pair of runs 4 and 6. Notice that Run 4 is out of control while Run 6, which is of the same magnitude but is based on a different sample size, is within the control lines. Thus the user of the chart has to compare relative lengths (between the measurement and the corresponding action line) that are scattered about. This perceptual task seems to be more difficult than all graphical elementary perceptual tasks that were discussed by Cleveland and McGill (1984).

The lower panel of Figure 1 presents how the incorporation of the p-values simplifies the above P-chart. It emphasizes the out of control signal, drawing the eye to the relevant information without diverting the attention from the original measurements. Note that the calculation of the p-value takes into account the sample size, simplifying the decision rule as to whether the process is out of control (the p-value is always compared to the same value). Though the construction of this plot, and its fine details, are yet to be explained (in section 2), the interpretation of the figure is quite intuitive. Charts using the p-value in the above way offer simple presentations avoiding unnecessary chart-junk (Tufte 1983) and enabling the design of visually effective displays.
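
To illustrate why a p-value absorbs the sample size, the following minimal sketch computes an upper-tail p-value for the number of defects in a run. It assumes an exact binomial tail and a hypothetical in-control proportion `p0`; the function name and the numbers are illustrative only, and the calculation behind Figure 1 may differ (for example, it could be two-sided or based on a normal approximation).

```python
from math import comb


def binomial_upper_pvalue(defects: int, n: int, p0: float) -> float:
    """P(X >= defects) for X ~ Binomial(n, p0): an upper-tail p-value."""
    return sum(comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(defects, n + 1))


# Hypothetical runs with the same observed proportion (10%) but different sample sizes.
p0 = 0.05  # assumed in-control proportion of defects
small = binomial_upper_pvalue(5, 50, p0)
large = binomial_upper_pvalue(20, 200, p0)
print(f"n=50:  p-value={small:.4f}, ARL-value={1 / small:.0f}")
print(f"n=200: p-value={large:.4f}, ARL-value={1 / large:.0f}")
```

The two runs plot at the same height on a P-chart, yet their p-values differ, and both are compared with the same fixed level.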

(b) Incorporating more complex control procedures

One of the drawbacks of advanced control charts for SPC is that they use statistics that have no natural interpretation within the context of the process examined, so the measure plotted has no intuitive reading. Examples are the Cumulative Sum (CUSUM) chart and a multivariate control chart using the Hotelling T² statistic. This is a painful problem since it inhibits their use in the manufacturing environment. There have been several attempts to amend this shortcoming, for example by Fuchs and Benjamini (1994), who point out that it is desirable to plot the observations in their original measurement scale. Superimposing the p-values that correspond to the above mentioned complicated statistics on a simple plot of the original measurements produces a simple intuitive chart. As a result, the chart's appearance and interpretation do not change when the underlying statistical calculation is modified (e.g. using the t-distribution instead of the Gaussian distribution, or using the Exponentially Weighted Moving Average (EWMA) scheme instead of the CUSUM).

Figure 1: A P-chart with changing sample sizes. Upper panel displays the standard control chart. The lower panel displays the same data, incorporating the p-values information. The grayed area increases as the p-value decreases. A black dot represents an out-of-control observation (in the rejection region).

(c) Incorporating the effect of multiple testing

Our own interest in utilizing p-values in SPC arose from our interest in multiplicity problems in SPC. Many aspects of commonly used SPC schemes are situations of multiple hypothesis testing, for instance looking at different warning signals in the same chart, or looking at multiple quality characteristics of the same process in parallel charts. If unattended, the effect of multiplicity is to increase the type I error, thereby shortening the overall ARL₀ and inflating the number of false alarms. Many of the newly developed procedures that deal with the multiplicity situation need as their input only the information incorporated in the p-values (see reviews in Hsu, 1996, Westfall et al., 1999). Thus obtaining the p-values for the SPC enables the use of advanced research results on the multiplicity problem within SPC. Moreover, one way of reporting the results of these multiplicity controlling procedures is via adjusted p-values (Westfall and Young, 1992 and Westfall et al., 1999), which can be interpreted and plotted as p-values. Therefore, the very same modifications to the SPC charts, which enable the portraying of p-value information, offer solutions by displaying the adjusted p-values.

The concept of p-value is not new, so before any further discussion we should answer the most natural question a reader may pose: if the approach is so helpful, why is it not seen around more often in SPC? It will become clear that the proposed charts are prohibitively complex for charting by hand. The time now is ripe for the change, as most SPC charts are constructed these days by computerized systems. It is thus feasible, and more appropriate, to emphasize readability and ease of interpretation over ease of preparation.

The next three sections are devoted to substantiating, demonstrating and expanding each one of the above points. In particular, in the next section, a situation of controlling a dry etch process is used to demonstrate several possible ways of presenting graphically the information about p-values on commonly used SPC charts, and a preferable way emerges. We then devote Section 3 to demonstrating how the use of p-values in SPC charts enables the incorporation of more complex procedures, by redesigning an X̄ chart to show the information from a CUSUM procedure. A Markov chain based approximation is used to compute the p-value for the CUSUM procedure, and it is detailed in the Appendix. In Section 4 we demonstrate a simultaneous consideration of both the X̄ part and the S part, the results of which are displayed in a new variant of the combined X̄-S chart. Then we discuss an example of another common situation where ten control charts are plotted simultaneously, a pair on each of five different quality characteristics, during the calibration of a Fourier Transform Infrared (FTIR) spectrometer used for measuring carbon contamination in silicon wafers.

Incorporating the p-values within SPC charts

In order to demonstrate possible ways to incorporate the p-value within SPC charts, we will use the data presented by Lynch and Markle (1997) for a Dry Etch process. Six wafers were drawn at each run, and nine measurements were taken at fixed locations on the wafers. There is a high positive correlation amongst the measurements per wafer. The analysis presented here is done on the six averages of the measurements per wafer, and the distribution of the sample mean is assumed to be Gaussian. Figure 2 presents a simple X̄ chart similar to the chart used by Lynch and Markle (some of the points on the original plot do not correspond to the data set provided by the authors). The control lines on the figure were set for an ARL₀ of 500, which is equivalent to a type I error rate of 0.002. There are four points signaling out-of-control (Runs 2, 3, 9, and 18) and there appears to be a change in the spread towards the end of the chart. Lynch and Markle know of two major changes in the process that occurred at runs 9 and 18.

Figure 2: Shewhart control chart for the average Etch Rate for 27 runs as presented by Lynch and Markle. Control lines set at an ARL₀ of 500.

Since the process clearly changed at Run 18 we will use the first 17 runs for the estimation of the process mean and the within run variance (runs 4 and 9 are excluded from the analysis due to their extreme variability - see Figure 9). At each run a two-sided hypothesis is tested where the null hypothesis is that the sample drawn at that run comes from a Gaussian distribution with mean of 552.3 and standard deviation 9.42. Thus the Z Score at each run is given by equation (1):

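(1) z_obs = (x̄_obs − μ₀) / σ_x̄ , where x̄_obs is the observed run average, μ₀ = 552.3 is the null mean, and σ_x̄ is the standard deviation of the run average under the null hypothesis,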

and the two-sided p-value is:

(2) p_obs = 2 (1 − Φ(|z_obs|))

Remark: It might be argued that a t distribution should be used for the p-value computations in (2) since the variance is estimated from 15 observations only. One of the merits of using p-values is that nothing is changed in the interpretation for the user of the charts that follow. Thus the lay-operator of the process need not be instructed each time the underlying statistical calculations (carried out by the computer) are modified.

Fuchs and Benjamini (1994) list three important principles for good SPC plots in step with Tufte (1983): ink should be proportional to the size of the warning signal; the original measurement scale should be maintained; and their interpretation should be relatively intuitive. If we are to comply with the requirement that the amount of ink on paper will be proportional to the strength of the signal we cannot directly use the p-value. Moreover, it is desirable to use a measure that is related to the commonly used ARL. As we mentioned above, the ARL₀ is the reciprocal of the significance level of the statistical test. Since the p-value is regarded as the observed (effective) significance level we refer to its reciprocal as the observed (effective) ARL₀, or the ARL-value,

(3) ARL_obs = 1 / p_obs .

The ARL-value reflects the magnitude of the expected average run length until the next false alarm, if the alarm were to be sounded at the level currently observed. To ease the readability of the charts displaying the ARL-value, the retinal variables we shall display are proportional to the logarithm of the ARL-value.
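
The chain from Z score to p-value to ARL-value can be sketched in a few lines. This is only an illustration under the Gaussian assumption stated above; the function names and the example Z scores are hypothetical and are not taken from the Etch Rate data.

```python
from math import erfc, log10, sqrt


def two_sided_pvalue(z: float) -> float:
    """Two-sided Gaussian p-value, 2 * (1 - Phi(|z|)), via the complementary error function."""
    return erfc(abs(z) / sqrt(2.0))


def arl_value(p: float) -> float:
    """Observed ARL (the ARL-value): the reciprocal of the observed p-value."""
    return 1.0 / p


for z in (1.0, 2.0, 3.09):  # illustrative Z scores only
    p = two_sided_pvalue(z)
    arl = arl_value(p)
    print(f"z={z:.2f}  p-value={p:.4f}  ARL-value={arl:.0f}  log10(ARL-value)={log10(arl):.2f}")
```

A Z score of about 3.09 corresponds to a two-sided p-value near 0.002, that is, an ARL-value near 500, the level used for the control lines in Figure 2.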

A naive implementation of a joint display may run as in Figure 3: the ARL-value is plotted as bars in the background of the original X̄ chart. The bars are color-coded. Light gray corresponds to a small ARL-value (large p-value) and as the ARL-value grows (the p-value decreases) the gray gets darker. The original lines (Figure 2) are plotted as dashed lines and the full line presents the control line for the ARL-value. Thus instead of comparing the data points to the control lines one can (and should) refer to the distance of the ARL-values from their control line (set at -log₁₀(0.002), ARL-value = 500, for this example). Were the process in control, the figure would have been relatively clear. However, Figure 3 is dominated by dark gray and black, and it can be seen at a glance that most of the time the process hovers far from the center (from the mean): it is mostly ‘in the gray area’. In this type of chart the attention is drawn to the ARL-value when it is relatively high and to the data points when the process is under control. This advantage is also a disadvantage when the chart is out of control, since the information about the original variables is less visible because of the dark background.

Figure 3: A combination of the original control chart for the average Etch Rate (right pane) and a bar chart of the log ARL-value, which gives the magnitude of the expected ARL₀ (left pane). As the p-value gets smaller (the ARL-value bigger) the bars get darker. Thus the Out of Control (OOC) signal is completely black. The original action lines for the X̄ chart are plotted as dashed lines and the full line presents the control line for the log of the ARL-value.

Following the guidelines set by Cleveland and McGill (1984) and Fuchs and Benjamini (1994), Figure 4 presents an attempt to draw the eye equally to the value of the measurement and the p-value. Again the original X̄ chart is used as the basis for the plotting. The equal-sized dots are replaced with bars that are filled proportionally to the ARL-value. Thus an empty bar corresponds to a high p-value and a full bar to a small p-value. A black filling marks an out of control signal. Since the ARL-value goes to infinity as the p-value nears zero, we have set, arbitrarily, the frame of the bar at 3.5 standard deviations (-log₁₀(0.0009); ARL-value = 1111). In this figure it is still clear that most of the measurements are relatively far from the center. However, the differences among the p-values (observed ARLs) are less distinct than in Figure 3, since color-coding was not used and the bars are much shorter. On the other hand, Cleveland and McGill (1984) point out that this type of display is superior not only because it utilizes perception skills that are high in the elementary task scale but also because the empty spaces in the bars help the comparison.

Figure 5 displays a variation of Figure 4. The bars are replaced with circles maintaining the principle that the area is proportional to the ARL-value. This type of chart is as good as the previous two and it has a few advantages:

The measurement value is plotted as a circle. This symbol is preferable since the eye is drawn naturally to its center. Note that it is more difficult to decide which of the points 22 or 27 is lower in Figure 4 than in Figure 5.

The comparison of areas per se is inferior to the comparison of the boxes, which involves comparison of locations on different scales. Nevertheless, in plotting the outer circle we manage to keep much of the advantage of the boxes. This kind of chart involves the same perceptual tasks that are used when the symbol is a box (such as in Figure 4). For instance, lengths are compared along the radius as part of the comparison of the areas.

Though the symbol is relatively small the differences among the p-values are quite distinguishable since areas are compared together with the comparison of length (along the radius).

The attention is equally drawn to the measured value and the corresponding p-value.

Therefore this type of chart will be used in the rest of the discussion.

Remark 2.1. When color displays are available, the chart could be given a 3-dimensional look following Carr and Sun (1999). Using shades of gray lined with white and gray lines "light" the surfaces from the upper left corner of the figure thus creating the visual effect of raised rejection regions and observation points. Carr and Sun (1999) point out that coloring the dots red makes them appear closer thanks to the focal length of this color. Therefore, the filled area, which is proportional to the ARL-value, is painted bright red when the signal is out-of-control, and otherwise the red is mixed with a little blue (purple). Hence they remain legible for color-blind people (and also when printed on black and white printers). For a color version of Figure 5, see

www.math.tau.ac.il/~kling/MCP_II_Poster_Multiplicity_in_SPC.htm.

Remark 2.2 The charts presented above, as well as all other charts in this paper, were created using the SAS/Graph annotation facility. The code used to create them can be downloaded from www.math.tau.ac.il/~kling/html/pvaluecode.html.

Combining the CUSUM and X̄ chart via the p-values

For Figures 3 to 5 the p-values were calculated to correspond with the simple test of location underlying the X̄ chart. However, as we previously pointed out, the p-values (and the corresponding ARL-values) may be used to simplify the interpretation of an SPC chart based on a less-than-intuitive statistic. The ARL-values in Figures 6 and 7 are for a positive shift CUSUM and a negative shift CUSUM respectively, where the p-values for the CUSUM were obtained using a Markov chain representation of the process (see Appendix A). This combination presents at a glance two control schemes: the original X̄-chart and the CUSUM. In these two figures, the p-values on the chart add significant new information, yet enable the display of the original measurements on the original scale.
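
One way such a Markov chain approximation could look is sketched below. It discretizes the range of a one-sided CUSUM of standardized observations into a finite number of states, in the manner of Brook and Evans, and solves for the in-control ARL; the ARL-value of an observed CUSUM is then taken as the ARL₀ that would result if the action line were placed at the observed value. It is an assumption that the approximation of Appendix A is comparable; the function names, the number of states, and the Gaussian in-control model are illustrative choices.

```python
from math import erf, sqrt


def phi(x: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def solve(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def cusum_arl0(h: float, k: float = 0.5, states: int = 50) -> float:
    """In-control ARL of S_t = max(0, S_{t-1} + z_t - k), signalling when S_t >= h.

    The interval [0, h) is cut into `states` cells of width w; cell 0 stands
    for S = 0 and cell i is represented by its midpoint i * w.
    """
    w = 2.0 * h / (2.0 * states - 1.0)
    i_minus_r = []  # (I - R), with R the transition matrix among non-signalling cells
    for i in range(states):
        row = []
        for j in range(states):
            upper = phi((j - i) * w + w / 2.0 + k)
            lower = 0.0 if j == 0 else phi((j - i) * w - w / 2.0 + k)
            row.append((1.0 if i == j else 0.0) - (upper - lower))
        i_minus_r.append(row)
    # Expected run lengths from every cell solve (I - R) a = 1; the chart starts at S = 0.
    return solve(i_minus_r, [1.0] * states)[0]


def cusum_arl_value(s_obs: float, k: float = 0.5) -> float:
    """ARL-value of an observed CUSUM s_obs > 0: the ARL0 with the action line at s_obs."""
    return cusum_arl0(s_obs, k)


print(f"ARL0 for h=4, k=0.5: {cusum_arl0(4.0):.0f}")
print(f"ARL-value of an observed CUSUM of 2.5: {cusum_arl_value(2.5):.0f}")
```

The corresponding p-value is the reciprocal of the returned ARL-value, so it can be plotted in exactly the same way as the p-values of the X̄ chart.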

Figure 4: The control chart for the average Etch Rate as in Figure 3, but the uniform sized dots are replaced with bars. Each bar is partially filled with a darker bar. The height of the darker bar is proportional to the log(ARL-value). A black filling marks an out of control signal.

Figure 5: The control chart for the average Etch Rate as in Figure 4, but the bars are replaced with circles. Each circle is partially filled with a darker area. The area of the darker dot is proportional to the log(ARL-value). A black filling marks an out of control signal.

Figure 6: A combination of an X̄ chart and a ‘positive shift CUSUM’, with K=0.5 standard deviation. The control chart for the average Etch Rate is the same as in Figure 4, but the area of the filling of the circles is proportional to the log(ARL-value) for the positive CUSUM. A black filling marks an out of control signal according to the CUSUM (p-value<0.003, ARL-value > 333.33).

There are several runs that were found to be out of control by the CUSUM but not by the X̄ chart (runs 15, 19, 22, 25, and 27 on Figure 7). Also note runs 10 through 14 on Figure 9 that are very close to the mean and yet have relatively small p-values (large ARL-values). On the other hand, Runs 3 and 9 on Figure 6 and Run 18 on Figure 7 have measurements that are outside the action lines of the X̄ chart even though they do not give out of control signals according to the CUSUM.

The two separate positive shift and negative shift CUSUM charts can also be combined into a single two-sided CUSUM chart. The graphical solution is presented in Figure 8. The circle is divided into two, where the upper half is dedicated to the positive shift CUSUM and the lower half to the negative shift CUSUM. The horizontal line, halving the circle, helps the reading of the chart by separating clearly the information in the two parts of each circle.

Multiplicity adjustment

There are several situations of SPC that require attention to the multiplicity problem. The simplest is controlling several aspects of a process, for example simultaneously inspecting the X̄-chart and the S-chart for the same quality attribute. The most common one is using multiple criteria on the same chart, such as various action line rules combined with some run-length rules (see Grant and Leavenworth 1980, p. 282-284 for an approximate calculation of the type I error rate for this situation). The combination of a CUSUM and an X̄-chart, suggested in the previous section, is another example of such a situation. Controlling multiple attributes of a product (e.g. screw length, screw diameter, etc.) is also becoming a more common problem, and controlling the quality of the final product which is manufactured in several steps is yet a different multiplicity problem. These situations are rarely separated. In a typical example we have encountered at a paper mill, five quality characteristics and four additional control variables are simultaneously displayed on a large plot, the workers being instructed to monitor simultaneously four traditional warning signals on each.

The probability of a false alarm increases when several tests of the same hypothesis are conducted jointly, or when several hypotheses are simultaneously tested. This was already noted by Hilliard and Lasater (1966), who estimated through simulation the overall type-I error-rate when three criteria are applied simultaneously to an X̄-chart, and found it to be almost 0.27, although for each single criterion the type I error rate was set at 0.05. Remedies for the increased error-rate exist, but usually have not been used by practitioners. Actually, such remedies were not even recommended. Hilliard and Lasater (1966) themselves argued that "All three tests have been used many times in control charts situations over the past years with good results. What better recommendation could there be?" Notable exceptions are Jackson (1959), and Montgomery and Klatt (1972), who explicitly list the multiplicity problem as a reason for using Hotelling’s T², and Alt (1985), who suggested using the Bonferroni procedure to investigate which quality characteristics are responsible for an out of control signal.

Figure 7: A combination of an X̄ chart and a ‘negative shift CUSUM’ (K=0.5 standard deviation). The control chart for the average Etch Rate is the same as in Figure 4, but the area of the filling of the circles is proportional to the log(ARL-value) for the negative CUSUM. A black filling marks an out of control signal according to the CUSUM (p-value<0.003, ARL-value > 333.33).

Figure 8: A combination of an X̄ chart, a ‘positive shift CUSUM’, and a ‘negative shift CUSUM’ (both with K=0.5 standard deviation). The control chart for the average Etch Rate is the same as in Figure 4. The area of the filling of the upper half of the circles is proportional to the log(ARL-value) for the positive shift CUSUM. The area of the filling of the lower half of the circles is proportional to the log(ARL-value) for the negative shift CUSUM. A black filling marks an out of control signal according to the CUSUM (p-value<0.003, ARL-value > 333.33).

As research of SPC expands into less traditional uses in fields that are not purely industrial, concern about multiplicity should rise. Svolba (1999) uses SPC to monitor clinical trials. Addressing explicitly the issue of multiplicity is not only a standard practice in the analysis of clinical trials, but regulating authorities such as the FDA also require it.

The approach for dealing with each of the multiplicity problems discussed above is not straightforward, and sometimes debatable. Should we always control the probability of making even one error? Only when the process is under control? May the control of the false discovery rate be enough? We shall touch in passing on these issues, which warrant a more thorough discussion, as our emphasis is on the p-value concept and charts. The p-value is instrumental to many multiplicity correction techniques such as Bonferroni, Holm's, Hochberg's, and others (see Hochberg and Tamhane 1987, Westfall and Young 1992, Hsu 1996 and Benjamini and Hochberg 1995 and 1997). Furthermore, it is beneficial to use the multiplicity-adjusted p-value in the SPC charts, thereby avoiding the necessity of specifying the significance level prior to the calculations when a multiplicity correction procedure is implemented. As a simple example, suppose that the Bonferroni procedure is used to control for multiplicity, so each p-value p_i is compared to the significance level α divided by the number of hypotheses (α/n). Instead, the adjusted p-value in this case is p_i* = n p_i, which should now be compared to α. For more details on the adjusted p-value see Westfall and Young (1992) and Westfall et al. (1999).
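
The Bonferroni example can be written out as a short sketch. The p-values below are hypothetical, and capping the adjusted value at 1 is a common convention added here as an assumption; it is not part of the formula in the text.

```python
def bonferroni_adjusted(pvalues: list[float]) -> list[float]:
    """Bonferroni-adjusted p-values n * p_i, capped at 1 (the cap is an added assumption)."""
    n = len(pvalues)
    return [min(1.0, n * p) for p in pvalues]


alpha = 0.002  # overall significance level, i.e. an overall ARL0 of 500
raw = [0.0004, 0.0015, 0.03, 0.6]  # hypothetical p-values from four charts at one time point
adjusted = bonferroni_adjusted(raw)
signals = [p_adj <= alpha for p_adj in adjusted]
print(adjusted)
print(signals)
```

Comparing each adjusted p-value with the overall level gives the same decisions as comparing each raw p-value with the level divided by the number of hypotheses, but the adjusted values can be reported and plotted before any level is chosen.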

We illustrate the benefits of attaining and displaying the p-values where the problem of multiplicity is of concern in two examples. In the first example we demonstrate how to combine an X̄-chart and an S-chart (see Figures 2, 9, and 11), a task whose need Alt (1985) had already noted. Lynch and Markle present the X̄-chart and the S-chart for the Dry Etch data discussed above in the usual way, with no correction for the fact that the combined ARL₀ is shorter.

For the first example we take the approach that controls the traditional probability of making even one error, and therefore assures that the combined ARL₀ is at the pre-set level. For that purpose we shall use the procedure of Hochberg (1988) for independent test statistics. The procedure makes use of the two individual p-values for each batch, one from the analysis of X̄ and the other from the analysis of S, and combines them to report two new multiplicity-adjusted p-values. Computationally, if p₍₁₎ ≤ p₍₂₎ are the two sorted individual p-values, the adjusted p-values are p₍₂₎* = p₍₂₎, and p₍₁₎* = min(2p₍₁₎, p₍₂₎*). These two can now be displayed and compared, each to the same desired level.
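
The two-value rule above is the special case of Hochberg's step-up adjustment for any number of p-values. A minimal sketch, with a hypothetical function name and hypothetical p-values for a single run, might read:

```python
def hochberg_adjusted(pvalues: list[float]) -> list[float]:
    """Hochberg (1988) step-up adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)  # largest first
    adjusted = [0.0] * m
    running_min = 1.0
    for rank, idx in enumerate(order):  # rank 0 is the largest p-value
        running_min = min(running_min, (rank + 1) * pvalues[idx])
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values for one run: the X-bar test first, the S test second.
p_xbar, p_s = 0.0008, 0.04
print(hochberg_adjusted([p_xbar, p_s]))
```

With two p-values the largest is left unchanged and the smallest is doubled, unless doubling would push it above the largest.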

Figure 9: S-chart for the standard deviation of the Etch Rate for 27 runs, as presented by Lynch and Markle.

In Figure 10 the X̄-chart and the S-chart are displayed separately in the traditional way. The filled areas of the circles are proportional to -log₁₀(adjusted p-value), which henceforth will be called the adjusted ARL-value obtained for the X̄- and the S-chart. In Figure 11 one chart incorporates both statistics in a similar way, displaying the adjusted ARL-values in the lower and upper halves. (The action lines on these figures are not multiplicity adjusted, although they can be incorporated as well.) Figure 11 has the advantage that the location and the dispersion measures are viewed compactly on the same chart. The location and the dispersion measures are plotted as the height of the dot (location) and the length of the horizontal bar (dispersion).

Figure 10: Control charts for the average (left) and the standard deviation (right) of the Etch Rate for 27 runs presented by Lynch and Markle. The area of the filling in the left figure is proportional to the log(adjusted ARL-value) for the X̄ chart and the area of the filling in the right figure is proportional to the log(adjusted ARL-value) for the S-chart. The multiplicity adjustment is done according to Hochberg's procedure. A black filling marks an out of control signal. (The overall type I error rate was set to 0.2%.)

Figure 11: A combined X̄-S control chart for the Etch Rate in Figure 10. Each circle is divided into two: the area of the filling of the upper half is proportional to the log(adjusted ARL-value) for the X̄ chart and the area of the filling of the lower half is proportional to the log(adjusted ARL-value) for the S control chart. The multiplicity adjustment is done according to Hochberg's procedure. A black filling marks an out of control signal. (The overall type I error rate was set to 0.2%.)

While presenting both measures and their corresponding p-values, this chart emphasizes visually the location information over the spread information. It is straightforward to create the dual chart that emphasizes spread over location. Choosing the more appropriate one as the default for a specific process, and being able to toggle to the other (or to the separate charts) when needed, could give the best practical tool.

Figures 10 and 11 suggest that the dispersion of the process is small for most of the time and usually is not affected by shifts in the mean of the process. Runs 4, 9, and 27 have relatively strong signals for both shift and spread, but they are not statistically significant after correcting for multiplicity. Though Runs 18 and 4 are outside the control lines, signaling a shift in the mean or variance respectively, their signal is not strong enough to be considered extreme once multiplicity is corrected for. This example demonstrates how the effect of multiplicity, if unattended, is to increase the type I error, thereby shortening the overall ARL₀ and inflating the number of false alarms.

As an example of a more complex situation we use Pankratz (1997), who describes an experiment to check a new FTIR measuring machine. Five specimens with known contents of carbon (standards) were measured repeatedly for ten days. For each specimen both the location and spread were analyzed using control charts. The purpose was to identify extreme measurements so that the clean data could be used to calculate a calibration curve. If each control chart were designed with an ARL₀ of 500 (α = 0.002), then the overall ARL₀ for all ten charts together, assuming all statistics are independent, is about 50 (α = 0.02). On the other hand, if for some of the attributes the process is not under control, the control of the overall type I error rate (FWE) might be too conservative to identify the multiple sources of out-of-control data. We therefore use the approach of Benjamini and Hochberg (1995), who offer a procedure that controls the expected proportion of false alarms among all alarms (False Discovery Rate, or FDR). If the process is under control, the procedure controls the ARL₀ at the desired level. But if the process is out of control, the procedure is much more powerful.
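
The overall figure for ten independent charts follows from the probability of at least one false alarm at a given time point. A small sketch of that arithmetic, with a hypothetical function name:

```python
def overall_false_alarm_rate(alpha: float, n_charts: int) -> float:
    """Probability of at least one false alarm among independent charts at one time point."""
    return 1.0 - (1.0 - alpha) ** n_charts


alpha_overall = overall_false_alarm_rate(0.002, 10)
print(f"overall type I error = {alpha_overall:.4f}, overall ARL0 = {1 / alpha_overall:.1f}")
```

The result is slightly below ten times the per-chart rate, which is why the overall ARL₀ is only about a tenth of the per-chart ARL₀.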

The results of the procedure in Benjamini and Hochberg (1995) can also be described by introducing FDR-adjusted p-values (Yekutieli and Benjamini, 1999, and Troendle, 2000), and they, in turn, can be transformed to FDR-adjusted ARL-values as before. Figure 12 presents the control charts for the measurements presented in Pankratz. The area of the filled circles is proportional to the appropriate FDR-adjusted ARL-value. The FDR adjustment again uses the ten sorted p-values for the ten hypotheses tested at each day, p₍₁₎ ≤ p₍₂₎ ≤ … ≤ p₍₁₀₎, and is given by

(4) p*_(i) = min { m p_(j) / j : j = i, i+1, …, m }.

This procedure is also known to be less sensitive to the size of the problem, which means that the number of charts looked at simultaneously does not severely affect the decision as to which attributes are out of control.
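As a minimal sketch of the adjustment, the code below computes FDR-adjusted p-values in the standard step-up form of the Benjamini and Hochberg procedure (a running minimum of m·p₍ⱼ₎/j over j ≥ i, capped at 1) and converts them to ARL-values by taking reciprocals. The function names and the ten input p-values are hypothetical; they are not the Pankratz data.

```python
def fdr_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0                                   # cap adjusted values at 1
    # Walk from the largest p-value (rank m) down to the smallest (rank 1).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, m * pvalues[idx] / rank)
        adjusted[idx] = running_min
    return adjusted


def arl_values(pvalues):
    """Observed ARL-value: the reciprocal of the (adjusted) p-value."""
    return [float("inf") if p == 0 else 1.0 / p for p in pvalues]


# Hypothetical p-values for ten charts on one day.
p_day = [0.0005, 0.004, 0.03, 0.2, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
p_adj = fdr_adjust(p_day)
arl_adj = arl_values(p_adj)
```

In this illustration the smallest p-value, 0.0005, is adjusted to 0.005, so its FDR-adjusted ARL-value is about 200 rather than 2000.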

The lines of the charts in Figure 12 are based on the estimation of the means and standard deviations. When basing the calculations on the specimens' known contents of carbon as reported by Pankratz, it is apparent that the measurements for Specimens D and F, though quite constant, are way off. This could mean that the carbon contents of these specimens were not as supposed (possibly due to long shelf life). The out-of-control signals (identified by the black dots) are noticed immediately, even though there is a lot of information on the page. In the original presentation one had to examine each chart, looking for the points that lay outside the control lines, to find these signals.

[Figure 12 panels: one row per specimen (A, B, C, D, E); left column X̄ charts, right column S charts]

Figure 12: X̄ charts and S control charts for FTIR measurements of five standard specimens presented in Pankratz (1997). The area of the filling of the circle is proportional to -log(adjusted p-value) where the adjustment is according to the FDR controlling procedure.

Discussion

The versions of SPC charts that we presented incorporate information about p-values, in the form of observed ARL-values. The charts remain intuitive to the end users since the observations are also plotted in the original measurement scale no matter what statistics are used. This enables the use of more appropriate, even though more complex, tests. Among the complexities that can be handled this way are multiple testing criteria as well as multiple charts. Only when the signal is strong enough, in terms of the ARL-value, is the eye drawn to it.

Costa (1999b) discusses the properties of an X̄ chart with variable parameters, and Costa (1999a) discusses a joint scheme of an X̄ chart and an R chart with variable sample sizes and sampling intervals. These are situations which call for the use of complicated charts: variable action lines, complicated statistics, and the multiplicity problem. Costa plots the standardized mean and standardized range in order to obtain stable warning and action lines. Changing the above-suggested charts so that p-values are displayed will enable the plotting of the observations in the original measurement scale. On this scale, regions where sampling should be conducted more frequently, and with larger sample size, can be easily identified. Thus the approach we suggest in this work can most naturally find its use in such problems.

Woodall and Montgomery (1999) point out that "Given the difficulties associated with interpreting signals from multivariate control charts, more work is needed on graphical methods for data visualization." The definition of the p-value and adjusted p-value for SPC, and its graphical implementation, enable the study of the multiplicity issues in SPC, including those related to multivariate control charts. These issues will become more important as SPC finds its way into areas other than manufacturing.

While we have demonstrated in this paper that multiplicity considerations make a difference, in terms of the different conclusions they may lead to, we have not addressed in this work the fundamental questions that should be associated with the introduction of multiplicity considerations into SPC. Our ongoing research strives to identify where the problem occurs, where the potential harm is greatest, what the appropriate error measure for each situation is, and which available procedures, or newly designed ones, should be used to control it.

Appendix - The calculation of the p-value for the CUSUM chart

With no loss of generality, the discussion will be restricted to a simple one-sided CUSUM control chart, assuming that when the process is in control the observations are drawn from a standard Gaussian distribution. Thus, at each time point a sample of fixed size n is drawn, and the sample mean calculated. For the ‘Positive Shift CUSUM’ the test statistic is S_t = max{0, S_{t−1} + x̄_t − k}, with S_0 = 0, where x̄_t is the sample mean at time t and k is a constant set at the desired shift in mean to be detected (in terms of standard deviation).
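For illustration, the recursion can be sketched as follows, assuming the usual one-sided form S_t = max(0, S_{t−1} + x̄_t − k) started at S_0 = 0. The function name and the sample means are hypothetical.

```python
def positive_cusum(sample_means, k):
    """Return the path of the one-sided (positive shift) CUSUM statistic.

    Assumes the standard recursion S_t = max(0, S_{t-1} + xbar_t - k), S_0 = 0.
    """
    s = 0.0
    path = []
    for xbar in sample_means:
        s = max(0.0, s + xbar - k)  # negative accumulations are reset to zero
        path.append(s)
    return path


# Hypothetical standardized sample means and reference value k.
means = [0.2, -0.5, 1.0, 1.5, 0.3]
cusum_path = positive_cusum(means, k=0.5)
```

The statistic stays at zero while the sample means are below k and accumulates only once they exceed it.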

The distribution of S_t is not easily obtained, nor is the ARL₀. Brook and Evans (1972) approximate the ARL₀ using a Markov chain representation. In order to use their method the possible values for S_t should be grouped into intervals, or ‘states’ in the terminology of Markov chain analysis. For a given magnitude of shift in the mean to be detected (K) and an action line (AL), the distance between the action line and zero is divided into N_s − 2 sections. Denote by IL the interval length (IL = AL/(N_s − 2)), and let n_k be the integer part of K/IL. The state above the action line is defined as absorbing (State N_s) and all negative values are grouped as ‘Zero’ (State 1). Thus there are N_s states the process can be in (see Figure 13). The transition probability from state i at time t to state j at time t+1 is approximated using a discretization of the Gaussian distribution:

(1)

Note that the transition probability matrix P = (P_ij) is a function of the specified action line (AL) and the magnitude of the shift in mean to be detected (K).

Brook and Evans (1972) show that the ARL₀ can be approximated by the first element of (I − R)⁻¹ 1, where R is obtained by omitting the last row and column from the transition probability matrix P, 1 is a vector of length N_s − 1 whose elements are all 1, and I is the (N_s − 1) × (N_s − 1) identity matrix.

For a process in control, the number of samples taken until the first false alarm, assuming independence, is geometrically distributed. Therefore the probability of a false alarm is the reciprocal of the ARL until false alarm (p₀ = 1/ARL₀). We can still define the p-value as the reciprocal of the observed ARL-value even in our dependent case. Now p₀ is the significance level of the single hypothesis test, in a series of independent tests, which is equivalent (in terms of ARL-value) to the observed result of the test performed by the SPC procedure. For a given realization of the test statistic at a certain time point, the p-value is in fact p₀ based on P_ij calculated as if the action line (AL) were set at the value of the realization of the test statistic S_t.

The approximations of the ARL-value (and p-value) for the CUSUM charts were carried out in S-Plus, but could be carried out in any statistical, mathematical, or general-purpose software that can calculate the Gaussian distribution and the inverse of a matrix.
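A minimal Python sketch of such a calculation is given below. It is not the S-Plus code used for the paper, and its discretization is an assumption: the zero state is represented by the value 0, each of the N_s − 2 intervals between zero and the action line is represented by its midpoint, and the increments x̄_t − k are taken to be Gaussian with unit variance. All function and parameter names are illustrative. A linear system is solved instead of forming the inverse explicitly, which gives the same first element of (I − R)⁻¹ 1.

```python
from statistics import NormalDist

PHI = NormalDist().cdf  # standard Gaussian distribution function


def transient_matrix(action_line, k, n_states=100):
    """R: transition probabilities among the n_states - 1 non-absorbing states.

    Assumed discretization: state 0 is S = 0; the other states split
    (0, action_line] into n_states - 2 equal intervals, each represented
    by its midpoint.
    """
    n_int = n_states - 2
    il = action_line / n_int                       # interval length
    centers = [0.0] + [(i - 0.5) * il for i in range(1, n_int + 1)]
    rows = []
    for c in centers:
        row = [PHI(k - c)]                         # move to (or stay at) zero
        for j in range(1, n_int + 1):
            row.append(PHI(j * il + k - c) - PHI((j - 1) * il + k - c))
        rows.append(row)
    return rows


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            if f != 0.0:
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def arl0(action_line, k, n_states=100):
    """First element of (I - R)^{-1} 1: the in-control ARL starting from zero."""
    r = transient_matrix(action_line, k, n_states)
    n = len(r)
    i_minus_r = [[(1.0 if i == j else 0.0) - r[i][j] for j in range(n)]
                 for i in range(n)]
    return solve(i_minus_r, [1.0] * n)[0]


def cusum_p_value(s_obs, k, n_states=100):
    """p-value as 1 / ARL0 with the action line placed at the observed S_t.

    Convention assumed here: an observed value at zero gets p-value 1.
    """
    if s_obs <= 0.0:
        return 1.0
    return 1.0 / arl0(s_obs, k, n_states)
```

For example, `cusum_p_value(4.0, 0.5)` treats an observed S_t of 4 as if the action line had been drawn there, and reports the reciprocal of the resulting in-control ARL. The accuracy of the approximation depends on the number of states.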

Figure 13: A schematic representation of the transition of the process between states

References

Alt, F. B. (1985), "Multivariate quality control," in Encyclopedia of Statistical Sciences, eds. Kotz, S., and Johnson, N. L., 6, 110-122.

Bissell, D. (1994), Statistical Methods for SPC and TQM, Chapman & Hall, London.

Benjamini, Y., Hochberg, Y. (1995), "Controlling the false discovery rate: a practical and powerful approach to multiple testing," Journal of the Royal Statistical Society B, 57, 289-300.

Benjamini, Y., Hochberg, Y. (1997), "Multiple hypotheses testing with weights," Scandinavian Journal of Statistics, 24, 3, 407-418.

Berger, J. O., Sellke, T. (1987), "Testing a point null hypothesis: The irreconcilability of p-values and evidence," Journal of the American Statistical Association, 82, 112-122.

Brook, D., Evans, D. A. (1972), "An Approach to the probability distribution of CUSUM run length," Biometrika, 59, 3, 539-549.

Carr, D., Sun, R. (1999), "Using Layering and Perceptual Grouping In Statistical Graphics," Statistical Computing & Statistical Graphics Newsletter, Vol. 10, No. 1, pp. 25-31.

Casella, G., Berger, R. L. (1987), "Reconciling Bayesian and frequentist evidence in the one-sided testing problem," Journal of the American Statistical Association, 82, 106-111.

Cleveland, W. S., McGill, R. (1984), "Graphical Perception: Theory, experimentation, and application to the development of graphical methods," Journal of the American Statistical Association, 79, 387, 531-553.

Costa, A. F. B. (1999a), "Joint X̄ and R Charts with Variable Sample Sizes and Sampling Intervals," Journal of Quality Technology, 31, 4, 387-397.

Costa, A. F. B. (1999b), "X̄ Charts with Variable Parameters," Journal of Quality Technology, 31, 4, 408-416.

Fuchs, C., Benjamini, Y. (1994), "Multivariate profile charts for statistical process control," Technometrics, 36, 182-195.

Gibbons, J. D. (1985), "P-values," in Encyclopedia of Statistical Sciences eds. Kotz, S., Johnson, N. L., 6, 366-368.

Grant, E. L., Leavenworth, R. S. (1980), Statistical Quality Control (5th edition), McGraw-Hill Book Co., New York, NY.

Hilliar, J. E., Lasater, H. A. (1966), "Type I Risks When Several Tests Are Used Together on Control Charts for Means and Ranges, No Standard Given," Industrial Quality Control, 56-61.

Hochberg, Y., Tamhane, A. (1987), Multiple Comparison Procedures, Wiley & Sons, N. Y.

Hochberg, Y. (1988), "A sharper Bonferroni procedure for multiple tests of significance," Biometrika, 75, 800-803.

Hsu, J. C. (1996), Multiple Comparisons: Theory and Methods, Chapman and Hall, London.

Jackson, J. E. (1959), "Quality control methods for several related variables," Technometrics, 1, 359-377.

Lynch, R. O., Markle, R. J (1997), "Understanding the nature of variability in a Dry Etch process", in Statistical Case Studies for industrial process improvement, (eds), Czitrom, V., and Spagon, P. D. ASA-SIAM series on statistical and applied probability, Ch. 7, pp. 71-86.

Montgomery, D. C., Klatt, P. J. (1972), "Economic design of T² control charts to maintain current control of a process," Management Science, 19, 77-89.

Pankratz, P. C. (1997), "Calibration of an FTIR Spectrometer for Measuring Carbon," in Statistical Case Studies for industrial process improvement, (eds), Czitrom, V., and Spagon, P. D., ASA-SIAM series on statistical and applied probability, Ch. 3, 19-38.

Sarkadi, K., Vincze, I. (1974), Mathematical methods of statistical quality control, Academic press, New York.

Schervish, M. J. (1996), "P values: what they are and what they are not," The American Statistician, 50, 3, 203-206.

Svolba, G. (1999), "Statistical quality control in clinical trials," Dissertation, WUV-Universitatsverlag, Vienna, Austria.

Troendle, J. F. (2000), "Stepwise normal theory multiple test procedures controlling the false discovery rate," Journal of Statistical Planning and Inference, 84 (1-2), 139-158.

Tufte, E. R. (1983), The visual display of quantitative information, Graphics Press, Cheshire, Connecticut.

Westfall, P. H., and Young, S. S. (1992), Resampling-based multiple testing, John Wiley & Sons, Inc., New York, NY.

Westfall, P. H., Tobias, R. D., Rom, D., Wolfinger, R. D., Hochberg, Y. (1999), Multiple Comparisons and Multiple tests using the SAS Systems, SAS Institute, Cary, North Carolina.

Woodall, W. H., and Montgomery, D. C. (1999), "Research Issues and Ideas in Statistical Process Control," Journal of Quality Technology, 31, 4, 376-386.

Yekutieli, D., and Benjamini, Y. (1999), "Resampling-based false discovery rate controlling multiple test procedures for correlated test statistics," Journal of Statistical Planning and Inference, 82 (1-2), 171-196.