Robust Statistics: Theory and Methods. Ricardo A. Maronna (Universidad Nacional de La Plata, Argentina) and R. Douglas Martin (University of Washington, Seattle). Wiley Series in Probability and Statistics.




He was a consultant at Bell Laboratories for many years, and author of numerous research articles on robust methods for time series. He is the author of a large number of important research articles on robust statistics, in particular on regression and time series.


The concept of continuity requires the definition of a measure of distance d(F, G) between distributions. A general definition of the BP can be given in this framework. This means that in a neighborhood of z0, h can be approximated by a linear function. The converse is not true: in some cases, the IF may be viewed as a derivative in the stronger sense of 3.

In fact, the i. For further work in this area, see Fernholz and Clarke. A rigorous proof may be found in Huber. The same approach serves to prove 3.

A similar approach proves 3. The details are left to the reader. We shall now prove the opposite inequality. This concludes the proof. Huber calculated the finite BP for this situation. For the sake of simplicity we treat the asymptotic case with point mass contamination. It will be assumed that g is increasing.

This completes the proof. Then differentiating 3. And its integral equals one, since by 3. Condition 3. Theorem 3. Proof of Theorem 3. Let now satisfy 3. Call 0c the ML score function centered by r: It follows from 3. Recalling 3. The dual problem is treated likewise. The former theorem proves optimality for a certain class of bounds.

The following theorem shows that actually any feasible bounds can be considered. Then the solutions to both the direct and the dual Hampel problems have the form 3. We treat the dual problem; the direct one is treated likewise. The proof is involved and can be found in Hampel et al. In view of 3. Verify 3. Prove 3. M-estimates for regression are developed in the same way as for location. In this chapter we deal with fixed (nonrandom) predictors.

Recall that our estimates of choice for location were redescending M-estimates using the median as starting point and the MAD as dispersion. Redescending estimates will also be our choice for regression. When the predictors are fixed and fulfill certain conditions that are satisfied in particular for analysis of variance models, monotone M-estimates—which are easy to compute—are robust, and can be used as starting points to compute a redescending estimate.
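The location recipe just recalled (median as starting point, normalized MAD as dispersion, redescending bisquare weights) can be sketched numerically. The function names and sample data below are illustrative; 4.685 is the usual bisquare tuning constant for 95% normal efficiency.

```python
import numpy as np

def bisquare_weights(u, c=4.685):
    """Tukey bisquare weights: w(u) = (1 - (u/c)^2)^2 for |u| <= c, else 0."""
    w = np.zeros_like(u, dtype=float)
    inside = np.abs(u) <= c
    w[inside] = (1 - (u[inside] / c) ** 2) ** 2
    return w

def m_location(x, c=4.685, tol=1e-8, max_iter=100):
    """Redescending M-estimate of location via IRWLS,
    started at the median, with the normalized MAD as fixed scale."""
    mu = np.median(x)
    sigma = np.median(np.abs(x - mu)) / 0.6745  # normalized MAD
    for _ in range(max_iter):
        w = bisquare_weights((x - mu) / sigma, c)
        mu_new = np.sum(w * x) / np.sum(w)
        if abs(mu_new - mu) < tol * sigma:
            return mu_new
        mu = mu_new
    return mu

# A sample with two gross outliers: the estimate stays near the bulk (~10).
x = np.array([9.8, 10.1, 10.3, 9.9, 10.0, 10.2, 50.0, 60.0])
print(m_location(x))
```

Because the bisquare weights vanish beyond c, the two outliers receive weight exactly zero and the estimate is a weighted mean of the bulk.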

This problem is treated in the next chapter. We start with an example that shows the weakness of the least-squares estimate. Example 4. Times were recorded for a rat to go through a shuttlebox in successive attempts. If the time exceeded 5 seconds, the rat received an electric shock for the duration of the next attempt. The data are the number of shocks received and the average time for all attempts between shocks.

Figure 4. The relationship between the variables is seen to be roughly linear except for the three upper left points. The LS line does not fit the bulk of the data, being a compromise between the bulk and the three atypical points. The figure also shows the LS fit computed without using the three points. It gives a better representation of the majority of the data, while pointing out the exceptional character of points 1, 2 and 4. Code shock is used for this data set. We aim at developing procedures that give a good fit to the bulk of the data without being perturbed by a small proportion of outliers, and that do not require deciding in advance which observations are outliers. (Figure 4 plots average time against number of shocks, showing both LS fits.)

(Table 4 reports the slope estimates.) Since the median satisfies 2. When the data are observational the xi j are random variables. We sometimes have mixed situations with both fixed and random predictors. Then the linear model 4. Thus minimizing 4. In a designed experiment, the predictors xi j are fixed. An important special case of fixed predictors is when they represent categorical predictors with values of either 0 or 1.

Call 1m the column vector of m ones. The next level of model complexity is a factorial design with two factors which are represented by two categorical variables, usually called a two-way analysis of variance. The main reason for its immediate and lasting success was that it was the only method of estimation that could be effectively computed before the advent of electronic computers.

We shall review the main properties of LS for multiple regression. See any standard text on regression analysis, e. If the model contains a constant term, it follows from 4. The matrix of predictors X is said to have full rank if its columns are linearly independent. If X has full rank then the solution of 4.
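A minimal illustration of the full-rank condition and the unique LS solution it guarantees (the small design and data below are made up for the example):

```python
import numpy as np

# Hypothetical design: intercept column plus one predictor.
X = np.column_stack([np.ones(5), np.array([0.0, 1.0, 2.0, 3.0, 4.0])])
y = np.array([1.0, 3.1, 4.9, 7.2, 9.0])

# Full rank <=> the columns of X are linearly independent.
assert np.linalg.matrix_rank(X) == X.shape[1]

# Unique LS solution beta = (X'X)^{-1} X'y; lstsq is the numerically
# safer way to compute it than inverting X'X explicitly.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
residuals = y - X @ beta

# With a constant term in the model, the residuals sum to zero.
print(np.round(beta, 3), residuals.sum())
```

The zero residual sum is the sample version of the property noted above for models containing a constant term.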

If X is not of full rank, e. These are desirable properties, since they allow us to know how the estimate changes under these transformations of the data. A more precise justification is given in Section 4. Under the linear model 4. Under model 4. However, if the model contains an intercept, the bias will only affect the intercept and not the slopes.

More precisely, under 4. For the large-sample theory of the LS estimate see Stapleton and Huber. After they are identified, some decision must be taken, such as modifying or deleting them and applying LS to the modified data.

They include the familiar Q-Q plots of residuals and plots of residuals vs. fitted values. See Weisberg, Belsley, Kuh and Welsch, or Chatterjee and Hadi for further details on these methods, as well as for proofs of the statements in this section. In this case h i is a measure of how far xi is from the average value x. Calculating h i does not always require the explicit computation of H. For example, in the case of the two-way design 4.
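The leverages h_i, the diagonal of the hat matrix H = X(X'X)^{-1}X', can indeed be obtained without forming H. One standard route (my choice here, not necessarily the book's) uses the thin QR factorization of X:

```python
import numpy as np

def leverages(X):
    """Diagonal h_i of the hat matrix H = X (X'X)^{-1} X', computed
    without forming H: with the thin QR factorization X = QR,
    h_i is the squared norm of the i-th row of Q."""
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

# Hypothetical straight-line design with one x-outlier at 10.
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
h = leverages(X)
# The h_i sum to p (here 2), and the x-outlier has leverage close to 1.
print(np.round(h, 3))
```

The observation far from the average x-value gets leverage near 1, matching the interpretation of h_i given above.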

The reason is that ri, h i and s may be largely influenced by the outlier. A graphical analysis is provided by the normal Q—Q plot of t i. Fitting 4. To show the effect of outliers on the classical procedure, we have modified five data values.

Again, nothing suspicious appears.

But the p-values of the F-tests are now 0. The diagnostics have thus failed to point out a departure from the model, with serious consequences. All these procedures are fast, and are much better than naively fitting LS without further care.

But they are inferior to robust methods in several senses: Assume model 4. For the linear model 4. We shall deal with estimates defined by 4. It is remarkable that this estimate was studied before LS by Boscovich and later by Laplace. If the model contains an intercept term 4.

Unlike LS, there are in general no explicit expressions for an L1 estimate. However, there exist very fast algorithms to compute it (Barrodale and Roberts; Portnoy and Koenker). Differentiating 4. The last equation need not be the estimating equation of an MLE. The matrix X will be assumed to have full rank. Solutions to 4. The main advantage of monotone estimates is that all solutions of 4. The example in Section 2. This cannot happen with monotone estimates.
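As a sketch only, an L1 (least absolute deviations) fit can be approximated by iteratively reweighted least squares with weights 1/|r_i|; the fast algorithms cited above are linear-programming based and are what one would use in practice. All names and data here are illustrative.

```python
import numpy as np

def l1_regression(X, y, eps=1e-6, max_iter=200, tol=1e-10):
    """Approximate L1 fit by iteratively reweighted LS. The weights
    1/max(|r_i|, eps) smooth the non-differentiable L1 objective."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # LS starting values
    for _ in range(max_iter):
        r = y - X @ beta
        w = 1.0 / np.maximum(np.abs(r), eps)
        sw = np.sqrt(w)
        beta_new, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Line y = 2 + 3x with one gross y-outlier: L1 essentially ignores it,
# while LS would tilt toward the outlier.
x = np.arange(10.0)
y = 2.0 + 3.0 * x
y[5] += 100.0
X = np.column_stack([np.ones_like(x), x])
beta_hat = l1_regression(X, y)
print(np.round(beta_hat, 3))
```

Note that this illustrates the robustness of L1 against y-outliers only; its failure under x-outliers (leverage points) is the subject of the next chapter.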

On the other hand, we have seen in Section 3. Computing redescending estimates requires a starting point, and this will be the main role of monotone estimates. This matter is pursued further in Section 4. Then if 4. Thus the approximate covariance matrix of an M- estimate differs only by a constant factor from that of the LS estimate. It is important to note that if we have a model with intercept 4.

Here the equivalent procedure is first to compute the L1 fit and from it obtain the analog of the normalized MAD by taking the median of the nonnull absolute residuals: Recall that the L1 estimate does not require estimating a scale.
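The scale estimate just described can be sketched as follows; the exclusion threshold `tol` for "nonnull" residuals is an implementation detail I am assuming, and 0.6745 is the usual normal consistency constant for the MAD.

```python
import numpy as np

def l1_scale(residuals, tol=1e-9):
    """Analog of the normalized MAD from an L1 fit: the L1 estimate
    makes several residuals exactly zero, so those are excluded before
    taking the median of the absolute residuals."""
    r = np.abs(np.asarray(residuals, dtype=float))
    nonnull = r[r > tol]
    return np.median(nonnull) / 0.6745

# Two exactly-null residuals (as produced by an L1 fit) are dropped.
print(l1_scale([0.0, 0.0, 0.6745, -0.6745, 0.6745]))
```

Dropping the null residuals avoids the downward bias the fitted-exactly observations would otherwise induce in the scale.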

We then obtain a regression M-estimate by solving 4. Then 4. Under 4. Thus the efficiency of the estimate does not depend on X. If the model contains an intercept the approximate distribution result holds for the slopes without any requirement on u i.

More precisely, assume model 4. We can estimate v in 4. As we have seen in the location case, one important advantage of redescending estimates is that they give null weight to large residuals, which implies the possibility of a high efficiency for both normal and heavy-tailed data. This is valid also for regression since the efficiency depends only on v which is the same as for location.

The corresponding fitted lines are shown in Figure 4. The results are very similar to the LS estimate computed without the three atypical points. The estimated standard deviations of the slope are 0. It is seen that the outliers inflate the confidence interval based on the LS estimate relative to that based on the bisquare M-estimate.

We see that the M-estimate results for the altered data are quite close to those for the original data. Furthermore, for the altered data the robust test again gives strong evidence of row and column effects. Note that differentiating 4. Therefore this class of estimates includes the MLE. There are, however, some cases in which this estimate can be computed explicitly. These equations suggest an iterative procedure due to J.

Repeat steps 2—3 until no more changes take place. It can be shown Problem 4. The result frequently coincides with an L1 estimate, and is otherwise generally close to it. Sposito gives conditions under which the median polish coincides with the L1 estimate. Define W as in 2. Assume X is of full rank so that the estimates are well defined.
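The iteration sketched in steps 2-3 above is Tukey-style median polish: alternately sweep row and column medians (and the medians of the effects into the common term) until no more changes take place. A compact implementation, with names of my own choosing, might look like this:

```python
import numpy as np

def median_polish(table, max_iter=20, tol=1e-8):
    """Median polish for a two-way table: returns the common value,
    row effects, column effects, and the table of residuals."""
    resid = np.array(table, dtype=float)
    overall = 0.0
    rows = np.zeros(resid.shape[0])
    cols = np.zeros(resid.shape[1])
    for _ in range(max_iter):
        rmed = np.median(resid, axis=1)       # sweep row medians
        rows += rmed
        resid -= rmed[:, None]
        m = np.median(rows)                   # recenter row effects
        overall += m
        rows -= m
        cmed = np.median(resid, axis=0)       # sweep column medians
        cols += cmed
        resid -= cmed[None, :]
        m = np.median(cols)                   # recenter column effects
        overall += m
        cols -= m
        if max(np.max(np.abs(rmed)), np.max(np.abs(cmed))) < tol:
            break
    return overall, rows, cols, resid

# Additive table (common 10, rows [0,1,-1], columns [-1,0,1])
# with one gross outlier in cell (0, 2): the fit is unaffected
# and the outlier is isolated in the residual table.
table = [[9.0, 10.0, 111.0],
         [10.0, 11.0, 12.0],
         [8.0, 9.0, 10.0]]
overall, rows, cols, resid = median_polish(table)
print(overall, rows, cols)
```

On this example the additive structure is recovered exactly, and the residual in cell (0, 2) is the full size of the contamination.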

Since X is fixed only y can be changed, and this requires a modification of the definition of the breakdown point BP. In the case of a model with intercept 4. It is shown in Section 4.

For the one-way design 4. It is natural to conjecture that the FBP of monotone M-estimates attains the maximum 4. This may happen even when there are no leverage points. The situation is worse for fitting a polynomial Problem 4. It is even worse when there are leverage points. The intuitive reason for this fact is that here the estimate is determined almost solely by y. As a consequence, monotone M-estimates can be recommended as initial estimates for zero-one designs, and perhaps also for uniform designs, but not for designs where X has leverage points.

The case of random X will be treated in the next chapter. The techniques discussed there will also be applicable to fixed designs with leverage points. We consider testing the linear hypothesis H0: We can also write the test statistic T 4. But this test would not be robust, since outliers in the observations yi would result in corresponding residual outliers and hence an undue influence on the test statistic.

This makes LRTTs preferable. The influence of high leverage points on inference is discussed further in Section 5. Regression quantiles are especially useful with heteroskedastic data. Assume the usual situation when the model contains a constant term. In this section we want to explain why equivariance is a desirable property for a regression estimate. Let y verify the model 4. Scale equivariance 4. It must be noted that although equivariance is desirable, it must sometimes be sacrificed for other properties such as a lower prediction error.

In particular, the estimates resulting from a procedure for variable selection treated in Section 5. The same thing happens in general with procedures for dealing with a large number of variables like ridge regression or leastangle regression Hastie, Tibshirani and Friedman, Hence the equivariance 4.

The proof of 4. Write computer code for the median polish algorithm and apply it to the original and modified oats data of Example 4. Show that for large n the FBP given by 4. The columns classify the data in seven occupational groups: Compare the resulting estimates. These data have also been analyzed by Daniel. In that case a monotone M-estimate is a reliable starting point for computing a robust scale estimate and a redescending M-estimate. But when X is random, outliers in X operate as leverage points, and may completely distort the value of a monotone M-estimate when some pairs (xi, yi) are atypical.

This chapter will deal with the case of random predictors and one of its main issues is how to obtain good initial values for redescending M-estimates. The following example shows the failure of a monotone M-estimate when X is random and there is a single atypical observation.

Example 5. The data are given in Table 5. Figure 5. Observation 15 stands out as clearly atypical. The LS fit is seen to be influenced more by this observation than by the rest. But the L1 fit shows the same drawback! By contrast, the LS fit omitting observation 15 gives a good fit to the rest of the data. Figures 5. Neither figure reveals the existence of an outlier as indicated by an exceptionally large residual. However, the second figure shows an approximate linear relationship between residuals and fitted values—excepting the point with largest fitted value—and this indicates that the fit is not correct.

(Table 5 lists the Cu and Zn observations. Figure 5 shows the normal Q-Q plot of the LS residuals and the LS residuals vs. fitted values.) The intuitive reason for the failure of the L1 estimate and of monotone M-estimates in general in this situation is that the xi outlier dominates the solution to 4.

This does not happen with the redescending M-estimate. We now briefly discuss the properties of a linear model with random X. Our observations are now the i. See Section In the case 4. Consequently the estimating equation 5. In Section 5. The results, displayed in Figure 5. The MM-estimate intercept and slope parameters are now The former now lacks the suspicious structure of Figure 5.

And compared to Figure 5. It is seen that most points lie below the identity diagonal, showing that except for the outlier the sorted absolute MM-residuals are smaller than those from the LS estimate, and hence the MM-estimate fits the data better. We now discuss the breakdown point, influence function and asymptotic normality of such an estimate. (Figure 5: LS residuals with point 15 omitted.) Then instead of 4.

Note that since not only y but also X are variable here, the FBP given by 5. Then the scale used in Section 4. On the other hand, it can be shown that the maximum FBP of any regression equivariant estimate is again the one given in Section 4.

In the present setting, calculating the MLE again yields 4. Thus no new class of estimates emerges from the ML approach. The proof is similar to that of Section 3.

It follows that the IF is unbounded. On the other hand, in Section 5. These facts indicate that the IF need not yield a reliable approximation to the bias.

We have seen in the previous chapter that a leverage point forces the fit of a monotone M-estimate to pass near the point, and this has a double-edged effect: The implications of these facts for the case of random x are as follows. Suppose that x is heavy tailed so that its variances do not exist. If the model 5. This local minimum will be obtained by starting from a reliable starting point and applying the IRWLS algorithm of Section 4.
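The IRWLS step from a reliable starting point can be sketched as follows, holding the robust scale fixed during the iteration. This is a simplified sketch of the algorithm referred to above; the function names, tuning constant, and toy data are my own.

```python
import numpy as np

def bisquare_w(u, c=4.685):
    """Bisquare weights, zero beyond c (redescending)."""
    w = np.zeros_like(u, dtype=float)
    m = np.abs(u) <= c
    w[m] = (1 - (u[m] / c) ** 2) ** 2
    return w

def irwls_m_regression(X, y, beta0, sigma, c=4.685, max_iter=100, tol=1e-9):
    """Redescending regression M-estimate by IRWLS: starting from a
    robust beta0, with a fixed robust scale sigma, iterate weighted LS
    with bisquare weights. Converges to a local minimum near beta0."""
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        r = y - X @ beta
        sw = np.sqrt(bisquare_w(r / sigma, c))
        beta_new, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Line y = 2 + 3x with one gross outlier; a starting point near the
# bulk (e.g., from a robust initial estimate) leads to the clean fit,
# since the outlier receives weight exactly zero.
x = np.arange(10.0)
y = 2.0 + 3.0 * x
y[5] += 100.0
X = np.column_stack([np.ones_like(x), x])
beta = irwls_m_regression(X, y, beta0=[2.1, 2.9], sigma=1.0)
print(np.round(beta, 3))
```

The dependence on the starting point is exactly the issue discussed above: started instead from a fit dragged by a leverage point, the same iteration could settle in a bad local minimum.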

The L1 estimate does not require a scale, but we have already seen that it is not a convenient estimate when X is random. Hence we need an initial estimate that is robust toward any kind of outliers and that does not require a previously computed scale. Such a class of estimates will be defined in Section 5. Now we state the details of the above steps. As was seen at the end of Section 2.

For the bisquare scale given by 2. The key result is given by Yohai , who called these estimates MM-estimates. It can also be shown in the same way as the similar result for location in Section 3. Thus it is not necessary to find the absolute minimum of 5. The numerical computation of the estimate follows the approach in Section 4.

It is shown in Section 9. The values of c1 for prescribed efficiencies are the values of k in the table 2. It is therefore important to choose the efficiency so as to maintain reasonable bias control. The results in Section 5. No outliers are apparent. The plot shows that the MM-residuals are in general smaller than the LS residuals, and hence MM gives a better fit to the bulk of the data.

The LS and the L1 estimates minimize the averages of the squared and of the absolute residuals respectively, and therefore they minimize measures of residual largeness that can be seriously influenced by even a single residual outlier. A more robust alternative is to minimize a scale measure of residuals that is insensitive to large values, and one such possibility is the median of the absolute residuals.

(The figure plots MM-residuals vs. LS residuals.) In the location case the LMS estimate is equivalent to the Shorth estimate, defined as the midpoint of the shortest half of the data see Problem 2. For fitting a linear model the LMS estimate has the intuitive property of generating a strip of minimum width that contains half of the observations Problem 5.
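The Shorth estimate mentioned above is simple enough to compute directly: slide a window covering half of the sorted sample, find the narrowest one, and take its midpoint.

```python
import numpy as np

def shorth(x):
    """Shorth estimate: midpoint of the shortest half of the data
    (the location version of the LMS estimate)."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    h = n // 2 + 1                      # size of a "half" of the sample
    widths = x[h - 1:] - x[:n - h + 1]  # width of each candidate half
    i = int(np.argmin(widths))
    return 0.5 * (x[i] + x[i + h - 1])

# The shortest half is {0, 0.1, 0.2, 0.3}, so the estimate is 0.15,
# unaffected by the three large values.
print(shorth(np.array([0.0, 0.1, 0.2, 0.3, 10.0, 20.0, 30.0])))
```

The minimum-width strip this produces in the location case is the one-dimensional analog of the LMS strip described above for linear models.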

The resulting estimate 5. We now consider the BP of S-estimates. Proofs of all results on the BP are given in Section 5. See Davies and Kim and Pollard for a rigorous proof. Thus the LMS estimate is very inefficient for large n.

Numerical computation yields that the normal distribution efficiency of the S-estimate based on the bisquare scale is 0. See, however, the comments on page A precise statement is given in Problem 5.

Numerical computation of S-estimates is discussed in Section 5. The asymptotic behavior of the LTS estimate is more complicated than that of smooth S-estimates. As we have already discussed, one can attain a desired normal efficiency by using an S-estimate as the starting point for an iterative procedure leading to an MM-estimate. A better approach for increasing the efficiency is described in Section 5. We now discuss an adaptive one-step estimation method due to Gervini and Yohai that attains full asymptotic efficiency at the normal error distribution and at the same time has a high BP and small maximum bias.

We now justify this procedure. The intuitive idea is to consider as potential outliers only those observations whose t i are not only greater than a given value, but also sufficiently larger than the corresponding order statistic of a sample from G. Note that if the data contain one or more outliers, then in a normal Q—Q plot of the t i against the respective quantiles of G, some large t i will appear well above the identity line, and we would delete it and all larger ones. For example, consider fitting a straight line through the origin, i.

N(0, 1), and three outliers at 10. The global minima of the loss functions for these four estimates are attained at the values of 0. The loss functions for the LMS and LTS estimates are not differentiable, and hence gradient methods cannot be applied to them.

Section 5. Since this method is computer intensive, care is needed to reduce the computing time and strategies for this goal are presented in Section 5. The LTS estimate A local minimum of 5. It is proved in Section 9. If a subsample is collinear, it is replaced by another. Therefore N must grow exponentially with p. Table 5. A shortcut saves much computing time. The reason is as follows. Although the N given by 5. Furthermore, because of the randomness of the subsampling procedure the resulting estimate is stochastic, i.
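The exponential growth of N with p can be made concrete. A standard way to choose the number of random size-p subsamples (which I assume is the form taken by the N discussed above) requires that at least one subsample be outlier-free with high probability when a fraction eps of the data is contaminated:

```python
import math

def n_subsamples(p, eps, prob=0.999):
    """Number of random size-p subsamples needed so that, with
    probability 'prob', at least one is outlier-free when a fraction
    eps of the data is contaminated:
        N >= log(1 - prob) / log(1 - (1 - eps)^p).
    Since (1 - eps)^p shrinks geometrically, N grows exponentially
    with p, which is why computational shortcuts matter."""
    clean = (1.0 - eps) ** p
    return math.ceil(math.log(1.0 - prob) / math.log(1.0 - clean))

# Required N for 20% contamination, at several dimensions.
for p in (2, 5, 10):
    print(p, n_subsamples(p, eps=0.2))
```

Even at modest dimensions the requirement grows quickly, which motivates the K_iter/K_keep-style strategies described below for spending the computational budget on the most promising candidates.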

In our experience a carefully designed algorithm usually gives good results, and the above infrequent but unpleasant effects can be mitigated by increasing N as much as the available computing power will allow. The subsampling procedure may be used to compute an approximate LMS estimate. Since total lack of smoothness precludes any kind of iterative improvement, the result is usually followed by one-step reweighting Section 5.

It must be recalled, however, that the resulting estimate is not asymptotically normal, and hence it is not possible to use it as a basis for approximate inference on the parameters. Since the resulting estimate is a weighted LS estimate, it would be intuitively attractive to apply classical LS inference as if these weights were constant, but this procedure is not valid.

Consider the following two extreme strategies for combining the subsampling and the iterative parts of the minimization: Clearly strategy B would yield a better approximation of the absolute minimum than A, but is also much more expensive.

An intermediate strategy, which depends on two parameters K iter and K keep , consists of the following steps: A theoretical study of the properties of this procedure seems impossible, but their simulations show that it is not worthwhile to increase K iter and K keep beyond the values 1 and 10, respectively. Ruppert proposes a more complex random search method. Recall that according to 5. If observation i has high leverage i.

The following example shows how different the inference can be when using an MM-estimate compared to using the LS estimate. For the straight-line regression of the mineral data in Example 5. This occurs especially when the error distribution is very heavy tailed or asymmetric (see the end of Section ). See, for example, Efron and Tibshirani and Davison and Hinkley. While the bootstrap approach has proved successful in many situations, its application to robust estimates presents special problems.

One is that in principle the estimate should be recomputed for each bootstrap sample, which may require impractical computing times.

Another is that the proportion of outliers in some of the bootstrap samples might be much higher than in the original one, leading to quite incorrect values of the recomputed estimate. Salibian-Barrera and Zamar proposed a method which avoids both pitfalls, and consequently is faster and more robust than the naive application of the bootstrap approach. We shall use an approach based on prediction.

Consider an observation x, y from the model 5. Now consider a model with intercept, i. A frequently used benchmark for comparing estimates is to assume that the joint distribution of x, y belongs to a contamination neighborhood of a multivariate normal.

In this case it can be shown that the maximum biases of M-estimates do not depend on p. A proof is outlined in Section 5. The same is true of the other estimates treated in this chapter except GM-estimates in Section 5.

One MM-estimate is given by the global minimum of 5. The other two MM-estimates correspond to local minima of 5. This shows the importance of a good starting point. It is also curious that the two local MM-estimates with different efficiencies have the same maximum biases. To understand this phenomenon, we show in Figure 5. It is seen that the bias of each estimate is worse than that of the LS estimate up to a certain value of K and then drops to zero.

But the range of values where the MM-estimate with efficiency 0. This is the price paid for a higher normal efficiency. The MM-estimate with efficiency 0. For these reasons we recommend an MM-estimate with bisquare function and efficiency 0.

One could also compute such MM-estimates with several different efficiencies, say between 0. Yohai et al. The former results on MM-estimates suggest a general approach for the choice of robust estimates. An instance of this approach in multivariate analysis will be presented in Section 6. This approach cannot be taken with regression M-estimates with random predictors since 5.

However, it can be suitably modified, as we now show. Therefore the biases of these estimates are continuous at zero, which means that a small amount of contamination produces only a small change in the estimate. Because of this the approach in Section 3. The latter is a region in which it is most difficult to tell whether a data point is an outlier or not, while outside that transition region outliers are clearly identified and rejected, and inside the region data values are left essentially unaltered.

As a minor point, the reader should note that 5. We prove in Section 5. For example, in the location case if more than half the sample points are concentrated at x0 , then the median coincides with x0. The exact fit property implies that if a dataset is composed of two linear substructures, an estimate with a high BP will choose to fit one of them, and this will allow the other to be discovered through the analysis of the residuals.
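The location case mentioned above is easy to check numerically: when more than half of the sample is concentrated at a single value, the median reproduces that value exactly, regardless of the remaining observations.

```python
import numpy as np

# Three of five points sit exactly at x0 = 3.0 (more than half),
# so the median equals x0 no matter how wild the other two are.
x = np.array([3.0, 3.0, 3.0, 15.2, -40.0])
print(np.median(x))  # -> 3.0
```

This is the exact fit property for the simplest possible "model", a single location parameter.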

A nonrobust estimate such as LS will instead try to make a compromise fit, with the undesirable result that the existence of two structures passes unnoticed. (The figure plots heat against Krafft point.) It is seen that there are two linear structures, and that the LS estimate fits neither of them, while MM fits the majority of the observations.

The points in the smaller group correspond to compounds called sulfonates (code krafft). In order to bound the effect of influential points, W must be such that W(t)t is bounded. The first is the estimate 5. The GM-estimate with the Schweppe function 5. See Section 5. Note that the function d(x) in 5. Hence the IF is the same as would be obtained from 3.

However, GM-estimates also have several drawbacks: It will be seen in Chapter 6 that computing robust affine equivariant multivariate estimates presents the same computational difficulties we have seen for redescending M-estimates, and hence combining robustness and equivariance entails losing computational simplicity, which is one important feature of GM-estimates.

For these reasons GM-estimates, although much treated in the literature, are not a good choice except perhaps for small p. However, their computational simplicity is attractive, and they are much used in power systems. If the number of predictor variables is large and the number of observations relatively small, fitting the model using all the predictors will yield poorly estimated coefficients, especially when predictors are highly correlated.

More precisely, the variances of the estimated coefficients will be high and therefore the forecasts made with the estimated model will have a large variance too. A common practice to overcome this difficulty is to fit a model using only a subset of variables selected according to some statistical criterion.

Consider evaluating a model using the mean squared error MSE of the forecast. This MSE is composed of the variance plus the squared bias. Deleting some predictors may cause an increase in the bias and a reduction of the variance. Hence the problem of finding the best subset of predictors can be viewed as that of finding the best tradeoff between bias and variance. There is a very large literature on the subset selection problem, when the LS estimate is used as an estimation procedure.

See for example Miller , Seber and Hastie et al. The predictors are assumed to be random but the case of fixed predictors is treated in a similar manner. The expectation on the right hand side of 5. To robustify FPE we must note that not only must the regression estimate be robust, but also the value of the criterion should not be sensitive to a few residuals. In addition we shall bound the influence of large residuals by replacing the square in 5.

The criterion 5. Two problems arise: A simple but frequently effective suboptimal strategy is stepwise regression: Various simulation studies indicate that the backward procedure is better. The second problem above arises because robust estimates are computationally intensive, the more so when there are a large number of predictors. These assumptions do not always hold in practice. Actually the assumptions of independent and homoskedastic errors are not necessary for the consistency and asymptotic normality of M-estimates.

In fact, it can be shown that these properties hold under much weaker conditions. Nevertheless we can mention two problems: We deal with these problems in the next two subsections.

For instance, we can replace model 5. Robust methods for heteroskedastic regression have been proposed by Carroll and Ruppert who used monotone M-estimates; by Giltinan, Carroll and Ruppert who employed GM-estimates, and by Bianco, Boente and Di Rienzo and Bianco and Boente who defined estimates with high BP and bounded influence starting with an initial MM-estimate followed by one Newton—Raphson step of a GM-estimate.

Croux, Dhaene and Hoorelbeke proposed a method to estimate the asymptotic covariance matrix of a regression M-estimate which requires neither homoskedasticity nor symmetry. This method can also be applied to simultaneous M-estimates of regression of scale which includes S-estimates and to MM-estimates. We shall give some details of the method for the case of MM-estimates.

Croux et al. They have a high BP and a controllable normal distribution efficiency, but unlike MM-estimates they do not require a preliminary scale estimate.

Its asymptotic efficiency at the normal distribution can be adjusted to be arbitrarily close to one, just as in the case of MM-estimates.

See Yohai and Zamar. In fact the normal equations 4. But in general it is not possible to obtain equality in 5. Maronna and Yohai studied P-estimates with b given by 5. Maronna and Yohai show that if the xi are multivariate normal, then the maximum asymptotic bias of P-estimates does not depend on p, and is not larger than twice the minimax asymptotic bias for all regression equivariant estimates.

Numerical computation of P-estimates is difficult because of the nested optimization in 5. Mendes and Tyler also show that for a continuous distribution the solution asymptotically attains the bound 5. Since only the direction matters, the infimum in 5.

Like the P-estimates of Section 5. Adrover, Maronna and Yohai discuss the relationships between maximum depth and P-estimates. They derive the asymptotic bias of the former and compare it to that of the MP-estimates defined in Section 5.

Both biases turn out to be similar for moderate contamination in particular, the GESs are equal , while the MP-estimate is better for large contamination.

They define an approximate algorithm for computing the maximum depth estimate, based on an analogous idea already studied for the MP-estimate. In the model 5. See the number of subsamples required as a function of the number of parameters in Section 5. Besides, in an unbalanced structured design there is a high probability that a subsampling algorithm yields collinear samples.

For example, if there are five independent explanatory dummy variables that take the value 1 with probability 0. Let M(X, y) be a monotone M-estimate such as L1. This estimate is regression and affine equivariant. For example, if 5. An MS-estimate was used in the example of Section 1. Rousseeuw and Wagner, and Hubert and Rousseeuw, have proposed other approaches to this problem. There are 11 predictors. The first three are categorical; the other eight are the concentrations of several chemical substances.
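A quick simulation (ours; the success probability of the dummies is truncated in the text above, so q = 0.1 is an assumption) shows how often a minimal subsample with an intercept and five rare dummy variables is collinear:

```python
import numpy as np

def singular_subsample_rate(n_dummies=5, q=0.1, n_trials=2000, seed=0):
    """Fraction of random minimal subsamples (intercept plus dummies) that
    are collinear, when each dummy equals 1 with probability q (assumed)."""
    rng = np.random.default_rng(seed)
    p = n_dummies + 1                  # subsample size = number of columns
    singular = 0
    for _ in range(n_trials):
        dummies = rng.random((p, n_dummies)) < q
        X = np.column_stack([np.ones(p), dummies.astype(float)])
        if np.linalg.matrix_rank(X) < p:
            singular += 1
    return singular / n_trials
```

With q = 0.1 nearly every minimal subsample is collinear (a given dummy column is all zeros in the subsample with probability 0.9^6, about 0.53, and there are five chances), which motivates treating the dummy part separately, as the MS-estimate does.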

The response is the logarithm of the abundance of a certain class of algae. The first gives the impression of short-tailed residuals, while the residuals from the robust fit indicate the existence of at least two outliers.

The last two are categorical variables with 22 and 2 parameters respectively, while the other predictors are continuous variables. The estimator used for that example was the MS-estimate. Figures 1. In this example three of the LS and MS-estimate t-statistics and p-values give opposite results at the chosen significance level.

The opposite is the case for the Period2 level of the Period categorical variable. This shows that outliers can have a large influence on the classical test statistics of an LS fit. But the result can be shown to hold under more general assumptions. It has been proved by Rousseeuw and Leroy, and by Mili and Coakley, under slightly more restrictive conditions.

The main result of this section is the following. Theorem 5. To prove the theorem we first need an auxiliary result. Proof of lemma: the first sum tends to zero. Now 4. The asymptotic BP: a proof similar to, but much simpler than, that of Theorem 5. applies. We shall first show that the asymptotic bias under point-mass contamination of M-estimates, and of estimates which minimize a robust scale, does not depend on the dimension p.

Call (x0, y0) the contamination location. It is easy to show that g is an increasing function. Let F correspond to the model 5. Proceeding as in Section 3. But the present reasoning gives some justification for the use of 5. Note that y is the LS estimate of the regression coefficients under model 4. Show that S-estimates are regression, affine and scale equivariant.

The stack loss dataset (Brownlee) has three predictors: X1, X2 and X3 are respectively the air flow, the cooling water inlet temperature, and the acid concentration, and the response Y is the stack loss. Fit a linear model to these data using the LS estimate and MM-estimates with different efficiencies. Plot the residuals vs. the fit: is there a pattern? The dataset alcohol (Romanelli, Martino and Castro) gives for 44 aliphatic alcohols the logarithm of their solubility together with six physicochemical characteristics.
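A robust fit of the kind requested can be sketched with a Huber M-estimate computed by iteratively reweighted least squares; this is a stand-in for the MM-estimate, not the book's code, and the tuning constant c = 1.345 and the synthetic data are our choices:

```python
import numpy as np

def huber_irls(X, y, c=1.345, n_iter=100):
    """Huber M-estimate of regression via iteratively reweighted least
    squares, re-estimating the residual scale by the normalized MAD."""
    Xd = np.column_stack([np.ones(len(y)), X])        # add an intercept
    beta = np.linalg.lstsq(Xd, y, rcond=None)[0]      # start from LS
    for _ in range(n_iter):
        r = y - Xd @ beta
        s = 1.4826 * np.median(np.abs(r - np.median(r)))  # MAD residual scale
        a = np.abs(r / s)
        w = np.where(a <= c, 1.0, c / a)              # Huber weights
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(sw[:, None] * Xd, sw * y, rcond=None)[0]
    return beta

# synthetic line y = 1 + 2x contaminated with three gross outliers
rng = np.random.default_rng(0)
x = rng.uniform(0.0, 10.0, 50)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=50)
y[:3] += 40.0
beta_ls = np.linalg.lstsq(np.column_stack([np.ones(50), x]), y, rcond=None)[0]
beta_rob = huber_irls(x, y)
```

The LS coefficients are dragged toward the outliers, while the reweighted fit recovers intercept and slope close to the true values; the same contrast appears in the residual plots the exercise asks for.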

The interest is in predicting the solubility. Compare the results of using the LS and MM-estimates to fit the log-solubility as a function of the characteristics.

The dataset waste (Chatterjee and Hadi) contains, for 40 regions, the solid waste and five variables on land use. Draw the respective Q–Q plots of residuals, and the plots of residuals vs. fit. Find its asymptotic breakdown point. In most cases of interest it is known or assumed that some form of relationship exists among the variables, and hence that considering each of them separately would entail a loss of information.

Some possible goals of the analysis are described below; the reader is referred to Seber, and to Johnson and Wichern, for further details. Multivariate normality also implies that, since the conditional expectation of one coordinate with respect to any group of coordinates is a linear function of the latter, the type of dependence among variables is linear.

Thus methods based on multivariate normality will yield information only about linear relationships among coordinates. As in the univariate case, the main reason for assuming normality is simplicity. It is known that under the normal distribution 6. The sample mean and sample covariance matrix share the behavior of the distribution mean and covariance matrix under affine transformations, namely 6. This property is known as the affine equivariance of the sample mean and covariances.
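Affine equivariance of the sample mean and covariance matrix is an exact algebraic identity and can be checked numerically; the matrix A and vector b below are arbitrary choices of ours:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
A = np.array([[2.0, 1.0, 0.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 1.0]])    # an arbitrary nonsingular matrix
b = np.array([1.0, -2.0, 0.5])
Y = X @ A.T + b                    # y_i = A x_i + b

# equivariance: mean(Y) = A mean(X) + b and Cov(Y) = A Cov(X) A'
lhs_mean, rhs_mean = Y.mean(axis=0), A @ X.mean(axis=0) + b
lhs_cov = np.cov(Y, rowvar=False)
rhs_cov = A @ np.cov(X, rowvar=False) @ A.T
```

Both identities hold exactly (up to floating-point error), for any data set, since the mean is linear and the sample covariance is a quadratic form in the centered data.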

Worse still, a multivariate outlier need not be an outlier in any of the coordinates considered separately. Example 6. The data are plotted in Figure 6. We see in Figure 6. However, Figure 6. Thus the atypical character of observation 3 is visible only when considering both variables simultaneously. The table below shows that omitting this observation has no important effect on the means or variances, but the correlation almost doubles in magnitude.
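A small simulation (ours, not the book's data) reproduces this phenomenon: a point that is moderate in each coordinate but inconsistent with the correlation structure has by far the largest Mahalanobis distance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x = rng.normal(size=n)
y = 0.9 * x + 0.3 * rng.normal(size=n)   # strongly correlated pair
X = np.column_stack([x, y])
X[0] = [2.0, -2.0]   # unremarkable marginally, but against the correlation

mu = X.mean(axis=0)
C = np.cov(X, rowvar=False)
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C), diff)  # squared distances
z = np.abs(diff) / X.std(axis=0)         # coordinatewise z-scores
```

The planted point has coordinatewise z-scores of only about 2, yet its squared Mahalanobis distance dwarfs all others, exactly as with observation 3 above.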

[Table: means, variances and correlation for the complete data and with observation 3 omitted.] This example shows the need for robust substitutes for the mean vector and covariance matrix, which will be the main theme of this chapter. To some extent such nonequivariant methods have a certain built-in robustness.

The reasons are given in Section 6. This is, however, not a mandatory property, and may in some cases be sacrificed for other properties such as computational speed; an instance of this trade-off is given in Section 6. As in the univariate case, one may consider the approach of outlier detection. Thus, assuming the estimates x and C are close to their true values, we may examine the Q–Q plot of the Di vs. the respective theoretical quantiles. This approach may be effective when there is a single outlier, but as in the case of location it can be useless when n is small (recall Section 1.).
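The failure with several outliers ("masking") can even be forced deterministically: the squared Mahalanobis distances always sum to p(n - 1), so three identical outliers among n = 12 bivariate points cannot all exceed the 0.975 quantile of the chi-squared distribution with 2 degrees of freedom (about 7.38). A sketch (ours):

```python
import math
import numpy as np

rng = np.random.default_rng(3)
clean = rng.normal(size=(9, 2))
outliers = np.full((3, 2), 10.0)     # a tight cluster of gross outliers
X = np.vstack([clean, outliers])     # n = 12 points in dimension p = 2

mu = X.mean(axis=0)
C = np.cov(X, rowvar=False)
diff = X - mu
d2 = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(C), diff)

# chi-squared with 2 df is exponential with mean 2,
# so its 0.975 quantile is -2 log(0.025), about 7.38
cutoff = -2.0 * math.log(0.025)
```

Since the twelve squared distances sum to exactly 22 and the three outliers share equal distances, each is below 22/3 < 7.38: none is flagged, however extreme the cluster.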

It contains, for each of 59 wines grown in the same region in Italy, the quantities of 13 constituents.

The original purpose of the analysis (de Vel, Aeberhard and Coomans) was to classify wines from different cultivars by means of these measurements. In this example we treat cultivar 1. In the classical plot no clear outliers stand out, while in the plot of robust Mahalanobis distances at least seven points stand out clearly. The failure of the classical analysis is seen in the upper row of Figure 6. These seven outliers have a strong influence on the results of the analysis. Simple robust estimates of multivariate location can be obtained by applying a robust univariate location estimate to each coordinate, but this lacks affine equivariance.
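The lack of equivariance is easy to exhibit (sketch ours): for a rotation A, the coordinatewise median of the rotated data generally differs from the rotated coordinatewise median, whereas the mean is exactly equivariant.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.exponential(size=(1001, 2))   # skewed data, independent coordinates
theta = np.pi / 4
A = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a 45-degree rotation

med_of_rotated = np.median(X @ A.T, axis=0)   # median after rotating
rotated_median = A @ np.median(X, axis=0)     # rotating the median
gap = np.abs(med_of_rotated - rotated_median).max()

# the mean, by contrast, commutes with the rotation exactly
mean_of_rotated = (X @ A.T).mean(axis=0)
rotated_mean = A @ X.mean(axis=0)
```

For skewed data the two medians differ systematically (here the median of a sum is not the sum of the medians), while the two means agree to floating-point precision.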

Apart from not being equivariant, the resulting matrix may not be positive semidefinite. See, however, Section 6. Nonequivariant procedures may also lack robustness when the data are very collinear (Section 6.).

In subsequent sections we shall discuss a number of equivariant estimates that are robust analogs of the mean and covariance matrix. They will generally be called location vectors and dispersion matrices; the latter are also called robust covariance matrices in the literature. However, the dispersion matrix has a more complex parameter space, consisting of the set of symmetric nonnegative definite matrices.

For theoretical purposes it may be simpler to work with the asymptotic BP. Applying Definition 3. This result will be seen to hold for the larger family of elliptical distributions, to be defined later. Most estimates defined in this chapter are also asymptotically normal; see Section 6. With one exception, treated in Section 6. Recall that in the univariate case it was possible to define separate robust equivariant estimates of location and of dispersion.

This is more complicated to do in the multivariate case, and if we want equivariant estimates it is better to estimate location and dispersion simultaneously. We shall develop the multivariate analog of the simultaneous M-estimates 2. We note that the level sets of f are ellipsoidal surfaces. If the mean resp. More details on elliptical distributions are given in Section 6. Let x1, . . . , xn be the data. Note that by 6. This is similar to 2. Existence and uniqueness of solutions were treated by Maronna, and more generally by Tatsuoka and Tyler. Uniqueness of solutions of 6.
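A fixed-point (reweighting) iteration for such estimates can be sketched as follows; the hard-rejection weights, tuning constant and stopping rule are our simplifications, not the book's weight functions W1 and W2:

```python
import numpy as np

def m_location_dispersion(X, c=2.5, tol=1e-9, max_iter=500):
    """Simultaneous estimate of location and dispersion by reweighting:
    points whose Mahalanobis distance exceeds c get weight zero (a crude
    redescending scheme), and the weighted mean and scatter matrix are
    recomputed until a fixed point is reached. No consistency correction
    is applied, so the scatter is somewhat biased at the normal model."""
    mu = np.median(X, axis=0)          # robust starting point
    S = np.cov(X, rowvar=False)
    for _ in range(max_iter):
        diff = X - mu
        d = np.sqrt(np.einsum('ij,jk,ik->i', diff, np.linalg.inv(S), diff))
        w = (d <= c).astype(float)     # hard-rejection weights
        mu_new = (w[:, None] * X).sum(axis=0) / w.sum()
        diff = X - mu_new
        S_new = (w[:, None] * diff).T @ diff / w.sum()
        if np.abs(mu_new - mu).max() < tol and np.abs(S_new - S).max() < tol:
            return mu_new, S_new
        mu, S = mu_new, S_new
    return mu, S

# standard normal data with 10% of the points shifted far away
rng = np.random.default_rng(4)
X = rng.normal(size=(300, 2))
X[:30] += 8.0
mu_rob, S_rob = m_location_dispersion(X)
```

On this contaminated sample the robust location stays near the true center while the sample mean is pulled noticeably toward the outliers; the actual M-estimating equations use smooth weight functions rather than this all-or-nothing rule.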

We shall call an M-estimate of location and dispersion monotone if d W2(d) is nondecreasing, and redescending otherwise. Monotone M-estimates are defined as solutions to the estimating equations 6. Huber treats a slightly more general definition of monotone M-estimates. For practical purposes monotone estimates are essentially unique, in the sense that all solutions to the M-estimating equations are consistent estimates.

It is proved in Chapter 8 of Huber that if the xi are i.i.d. the solutions are consistent. It is easy to show that M-estimates are affine equivariant (Problem 2), and so if x has an elliptical distribution 6. It follows from 6. that, being a limit of elements of H, it lies in H.

Furthermore 6. To make the estimate well defined in all cases, it suffices to extend the definition 6. Since b1, . . .

Note that di enters 6. Several important features of the distribution, such as correlations, principal components and linear discriminant functions, depend only on shape.
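This can be checked directly: multiplying a dispersion matrix by a positive constant changes its "size" but leaves correlations and principal directions unchanged (sketch ours, with an arbitrary 2x2 matrix):

```python
import numpy as np

Sigma = np.array([[4.0, 1.2],
                  [1.2, 1.0]])   # an arbitrary dispersion matrix

def corr_from_cov(S):
    """Correlation matrix associated with a dispersion matrix S."""
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

# scaling Sigma by any c > 0 preserves its "shape":
R1 = corr_from_cov(Sigma)
R2 = corr_from_cov(3.7 * Sigma)

# the principal axes (eigenvectors) are likewise unchanged by scaling
U1 = np.linalg.eigh(Sigma)[1]
U2 = np.linalg.eigh(3.7 * Sigma)[1]
```

Hence any dispersion matrix that is a positive multiple of the covariance matrix yields the same correlations, principal components and linear discriminant directions.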