A Suggested Method of Detecting Multicollinearity in Multiple Regression Models

Several methods have been suggested in the literature for detecting multicollinearity in multiple regression models, and one common remedy for the multicollinearity problem is to omit the explanatory variables in the model that cause it. In this paper, we concentrate on the extra sum of squares method as a suggested method for detecting multicollinearity. The method is applied to real data from the annual surveys on smoking conducted by the American Federal Trade Commission (FTC). We detected multicollinearity in these data, then solved the problem by using ridge regression and obtained new estimates for the model without omitting any of the explanatory variables.


Introduction
Data with multicollinearity arise frequently and cause problems in many applications of linear regression, such as econometrics, oceanography, geophysics, and other fields that rely on non-experimental data. Multicollinearity is a natural flaw in the data set, due to the uncontrollable operation of the data-generating mechanism. In multiple linear regression, where two or more independent variables are used in the model, the term multicollinearity has been used to represent a near-exact relationship between two or more of these variables (Thomas P. Ryan, 2009). In estimating the parameters of the regression model, it is often stated that multicollinearity can cause the signs of the parameter estimates to be wrong. The presence of multicollinearity also misleads the significance tests, suggesting that some important variables are not needed in the model; multicollinearity thus reduces the statistical power of these tests. Neter (1989) noted that, in fitting a regression model, when one independent variable is nearly a linear combination of the other independent variables, this combination affects the parameter estimates. Multicollinearity is a severe problem for regression models because it violates the model assumption that the explanatory variables should be independent. Belsley (1980) stated that, when multicollinearity exists, it becomes difficult to infer the separate influence of such explanatory variables on the response variable. Weismann, Helge & Shalabh (2007) noted that various diagnostic tools, such as the condition number, the singular value decomposition method, Belsley condition indices, the variance decomposition method, variance inflation factors, and Belsley's perturbation analysis, have been suggested in the literature for detecting multicollinearity and identifying the variables causing the linear relationships. Therefore, detecting multicollinearity is very important in regression analysis.
The paper is organized as follows. Section 2 recalls the technical background of multicollinearity. Section 3 presents the extra sum of squares method. Section 4 contains the data analysis. Section 5 concludes.

Multicollinearity
Multicollinearity is defined as the existence of a nearly linear dependency among the independent variables. The presence of serious multicollinearity reduces the accuracy of the parameter estimates in a linear regression model and undermines the assumed independence of the regression model's independent variables. Multicollinearity can cause serious problems in estimation and prediction, inflating the variance of the least squares estimates of the regression coefficients and tending to produce least squares estimates that are too large in absolute value. Theoretically, there are two types of multicollinearity: partial multicollinearity and perfect (or full) multicollinearity. In addition, multicollinearity can be presented in two cases: the scalar case and the matrix case.

Scalar case
Suppose, for simplicity, a model with two predictors, Y-hat = b0 + b1 X1 + b2 X2, in which X2 = d X1 exactly. The regression would need to find the coefficient estimates that produce the best Y-hat. Substituting X2 by d X1 gives Y-hat = b0 + (b1 + d b2) X1. Hence an infinite number of coefficient pairs (b1, b2) yield the same value of b1 + d b2, and all of these pairs produce the same value of Y-hat. Any small change in b1 from one possible value to another can be offset exactly by a corresponding change in b2. Consequently, the standard error Sj of an individual coefficient (j = 1, ..., n, where n is the number of predictors) is undefined, and no unique linear regression fit exists.

Matrix case
In the matrix case, the model is written as Y = Xb + e, where X is the matrix of explanatory variables. The coefficient vector b is calculated as b = (X'X)^(-1) X'Y. Under perfect multicollinearity, the columns of X are linearly dependent, so the determinant of X'X is equal to zero and therefore the inverse (X'X)^(-1) does not exist.
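As a small numerical illustration of the matrix case (using made-up numbers, not the paper's data), the following Python sketch builds a design matrix whose third column is an exact multiple of the second and verifies that X'X is singular:

```python
import numpy as np

# Hypothetical illustration: the second predictor is an exact multiple
# of the first (perfect multicollinearity), so X'X is singular.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones(5), x1, 2 * x1])  # columns: intercept, x1, 2*x1

XtX = X.T @ X
# With linearly dependent columns, det(X'X) = 0 (up to rounding error),
# so (X'X)^(-1) does not exist and b = (X'X)^(-1) X'Y is undefined.
print(np.linalg.matrix_rank(XtX))  # 2, not 3
```

In practice, near-exact (partial) multicollinearity makes X'X nearly singular rather than exactly singular, so the inverse exists but is numerically unstable.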

Detection of multicollinearity
Multicollinearity can be detected by examining one of two quantities: the variance inflation factor (VIF) and the tolerance. The VIF of the j-th predictor is VIF_j = 1 / (1 - Rj^2), where Rj^2 is the coefficient of determination from regressing the j-th predictor on the remaining predictors; the tolerance is its reciprocal, 1 - Rj^2. A large VIF (commonly, greater than ten) or, equivalently, a small tolerance indicates multicollinearity. Also, we can detect multicollinearity by using the extra sum of squares method.
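For readers who want to reproduce a VIF check outside SPSS, here is a minimal Python sketch on simulated data (not the paper's data); it computes VIF_j = 1 / (1 - Rj^2) by regressing each predictor on the others:

```python
import numpy as np

def vif(X):
    """Variance inflation factors: VIF_j = 1 / (1 - Rj^2), where Rj^2 is
    the R^2 from regressing column j on the remaining columns plus intercept."""
    n, p = X.shape
    factors = []
    for j in range(p):
        yj = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, yj, rcond=None)
        resid = yj - others @ beta
        tss = np.sum((yj - yj.mean()) ** 2)
        r2 = 1.0 - np.sum(resid ** 2) / tss
        factors.append(1.0 / (1.0 - r2))
    return factors

# Simulated example: x2 is nearly a copy of x1, while x3 is independent.
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + 0.01 * rng.normal(size=50)
x3 = rng.normal(size=50)
v = vif(np.column_stack([x1, x2, x3]))
print(v)  # the first two VIFs far exceed 10; the third is close to 1
```

The tolerance of each predictor is simply the reciprocal of its VIF.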

Extra Sum of squares method for detecting multicollinearity
We can define the extra sums of squares as the marginal increase in the regression sum of squares when one or more independent variables are added to a regression model. In general, we use extra sums of squares to determine whether specific variables make substantial contributions to the model. Extra sums of squares provide a means of formally testing whether one set of predictors is necessary given that another set is already in the model. In regression analysis, a hypothesis test checks the significance of the fitted model; the analysis of variance gives the regression sum of squares (SSR), the residual sum of squares (SSE), the total sum of squares (SST), and the F value for this test. The regression sum of squares accounts for the variation in y that is explained by the variation of the x_i. When a new independent variable is added to the model, the regression sum of squares always increases while the residual sum of squares decreases, because the total sum of squares is unchanged.

Decomposition of SSR into extra sums of squares
In many applications, such as stepwise regression, additional sums of squares are needed to measure the variation of y on some independent variables when a certain set of independent variables is already in the model. Here, we use SSR(x_i | x_j, x_k) to represent the additional sum of squares accounting for the variation in y when x_i is added to a model that already contains the independent variables x_j and x_k. It can be calculated as SSR(x_i | x_j, x_k) = SSE(x_j, x_k) - SSE(x_i, x_j, x_k). If a linear regression model is constructed based on p independent variables x_1, x_2, ..., x_p, then SSR(x_i | x_1, ..., x_(i-1), x_(i+1), ..., x_p) = SSE(x_1, ..., x_(i-1), x_(i+1), ..., x_p) - SSE(x_1, ..., x_p). Here, the already-included independent variables number p - 1, and this quantity represents the additional sum of squares accounting for the variation in y when x_i is added to a model that already contains the other p - 1 independent variables.
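The decomposition above can be checked numerically. The following Python sketch (simulated data, not the paper's) computes the extra sum of squares SSR(x2 | x1) = SSE(x1) - SSE(x1, x2) from two nested OLS fits:

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares of the OLS fit of y on X (intercept included)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

rng = np.random.default_rng(1)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2 + 3 * x1 + 1.5 * x2 + rng.normal(size=n)

sse_x1 = sse(x1[:, None], y)                    # reduced model: x1 only
sse_x1x2 = sse(np.column_stack([x1, x2]), y)    # full model: x1 and x2
ssr_x2_given_x1 = sse_x1 - sse_x1x2             # extra sum of squares SSR(x2 | x1)
print(ssr_x2_given_x1)  # positive: adding x2 reduces the residual variation
```

Because SST is fixed, the same quantity equals SSR(x1, x2) - SSR(x1), i.e., the marginal increase in the regression sum of squares.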

Uses of extra sum of squares
One of the major uses of the extra sum of squares is conducting tests concerning regression coefficients without fitting the full and reduced models separately. For example, if we have an MLR model with two independent variables and we want to test whether or not β2 = 0, we do not actually need to fit the reduced model, since the partial F test statistic can be calculated immediately from the relation F* = [SSR(X2 | X1) / 1] / [SSE(X1, X2) / (n - 3)]. Also, we can use the extra sum of squares to measure the coefficient of partial determination between y and any independent variable in the MLR model. For example, in a model with two independent variables, SSE(X2) measures the variation remaining in Y when X2 is included in the model, and SSE(X1, X2) measures the variation remaining in Y when both X1 and X2 are included. The relative marginal reduction in the variation in Y associated with X1 when X2 is already in the model is SSR(X1 | X2) / SSE(X2) = [SSE(X2) - SSE(X1, X2)] / SSE(X2).
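As a hedged illustration of the partial F test on simulated data (the sample size, coefficients, and noise level below are invented for the example), the statistic F* = [SSR(X2 | X1) / 1] / [SSE(X1, X2) / (n - 3)] can be computed as:

```python
import numpy as np
from scipy import stats  # for the F distribution's upper tail probability

def sse(X, y):
    """Residual sum of squares of the OLS fit of y on X (intercept included)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    return float(resid @ resid)

rng = np.random.default_rng(2)
n = 40
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1 + 2 * x1 + 1.5 * x2 + rng.normal(size=n)  # beta2 != 0 by construction

sse_red = sse(x1[:, None], y)                   # reduced model: x1 only
sse_full = sse(np.column_stack([x1, x2]), y)    # full model: x1 and x2
F = ((sse_red - sse_full) / 1) / (sse_full / (n - 3))
p_value = stats.f.sf(F, 1, n - 3)
print(F, p_value)  # large F and small p-value: reject H0: beta2 = 0
```

The numerator is exactly the extra sum of squares SSR(X2 | X1), so only the two nested SSE values are needed, as stated above.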

Data analysis
In this section, we conduct an application using real data and try to detect multicollinearity by using the extra sum of squares method. The data are from the American Federal Trade Commission (FTC), which annually ranks varieties of domestic cigarettes according to their tar, nicotine, and carbon monoxide contents. The U.S. Surgeon General considers each of these three substances hazardous to a smoker's health. Past studies have shown that increases in the tar and nicotine contents of a cigarette are accompanied by an increase in the carbon monoxide emitted in the cigarette smoke. Table (4.1) lists the tar, nicotine, and carbon monoxide contents (in milligrams) and weight (in grams) for a sample of 25 (filter) brands tested in recent years. Here, we model the carbon monoxide content, y, as a function of tar content, x1, nicotine content, x2, and weight, x3, using the linear model y = β0 + β1 x1 + β2 x2 + β3 x3 + ε. We used the SPSS (Statistical Package for the Social Sciences) program to analyze the data, as follows.

Table (4.2) Correlations
From table (4.2), we can conclude that all the correlation coefficients are significant at the 0.05 significance level. The regression equation appears to be very useful for making predictions, since the value of R^2 is close to 1; however, this may also be a sign that a multicollinearity problem exists.

Table (4.4) ANOVA table (b)
Predictors: (Constant), Tar x1 (milligrams), Weight x3 (grams), Nicotine x2 (milligrams); b Dependent variable: carbon monoxide y (milligrams). Since the p-value < 0.01, at the α = 0.05 level of significance there is enough evidence to conclude that at least one of the predictors is useful for predicting carbon monoxide; therefore the model is useful. From the coefficients table, the fitted model is: Carbon monoxide (y) = 3.2 + 0.963 tar (x1) - 2.632 nicotine (x2) - 0.128 weight (x3). We can conclude that the slope of the tar variable is not zero, since its p-value < 0.001, and hence that tar is useful (together with nicotine and weight) as a predictor of carbon monoxide. However, the slopes of both weight and nicotine are not significantly different from zero. This suggests that something is not normal in our analysis, so we should test for multicollinearity. From table (4.6), since two of the predictors, nicotine and tar, have a variance inflation factor (VIF) greater than ten, there are apparent multicollinearity problems in the model.

Fig(4.1) Relation between tar and nicotine
Also, from figure (4.1), we can conclude that there is an approximately quadratic relation between tar and nicotine, which indicates a multicollinearity problem (r = 0.500). In fact, all three sample correlations are significantly different from zero, based on the small p-values shown in table (4.7). Now, we rerun the analysis of the model systematically to obtain the extra sums of squares SSR(X2 | X1) and SSR(X1 | X2, X3) by relying on SSE.
We ran the analysis with one independent variable at a time. By the extra sum of squares, we can conclude from table (4.11) that there is a severe multicollinearity problem in the model, and we cannot omit any independent variable from the model, because logically there is a very strong relation between the three independent variables.
So the question that appears here is: how can we solve this problem? One of the remedial methods for multicollinearity is ridge regression, first introduced by Hoerl and Kennard (1970); it is one of the most popular methods suggested for the multicollinearity problem. Ridge regression enables us to make inferences at values of the predictor variables that follow the same pattern of multicollinearity, and this aspect is very important.
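As a hedged sketch of how ridge regression stabilizes the estimates (on simulated collinear data, not the paper's FTC data), the ridge estimator b(k) = (X'X + kI)^(-1) X'y can be computed as follows:

```python
import numpy as np

def ridge_coefs(X, y, k):
    """Ridge estimator b(k) = (X'X + k I)^(-1) X'y on standardized predictors.
    Even a small k > 0 makes X'X + kI invertible under near-exact collinearity."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = Xs.shape[1]
    return np.linalg.solve(Xs.T @ Xs + k * np.eye(p), Xs.T @ yc)

rng = np.random.default_rng(3)
n = 25
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly collinear with x1
X = np.column_stack([x1, x2])
y = 4 + 2 * x1 + 2 * x2 + rng.normal(size=n)

b_small = ridge_coefs(X, y, 0.01)
b_ridge = ridge_coefs(X, y, 1.0)
# Shrinkage: the length of the coefficient vector decreases as k grows.
print(np.linalg.norm(b_ridge) < np.linalg.norm(b_small))  # True
```

In practice, k is often chosen from the ridge trace, i.e., the smallest value at which the coefficient paths stabilize, which is the idea behind the plot in figure (4.2).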

Fig (4.2) coefficients plot
We see from figure (4.2) that the selected value of the ridge parameter was approximately 0.07. Therefore, by using ridge regression, we obtained the best model for the data without omitting any of the explanatory variables.

Conclusion
In this paper, we concentrated on the extra sum of squares method as a suggested method for detecting multicollinearity. The method was applied to real data from the annual surveys on smoking conducted by the American Federal Trade Commission (FTC). We detected multicollinearity in these data, then solved the problem by using ridge regression and obtained new estimates for the model without omitting any of the explanatory variables.