We use regression analysis to explore the relationships between one or more input variables, or factors, and a response. A manufacturer might use it to look at how baking time and temperature relate to the hardness of a piece of plastic. Social scientists might use it to see how educational levels and birthplace relate to annual income. In theory, the number of factors you could include in a regression model is limited only by your imagination.
But before we throw data about every potential predictor under the sun into a regression model, we need to remember a thing called multicollinearity. In regression, as in so many things, there comes a point where more is not better. Sometimes adding more factors to a regression model not only fails to make relationships clearer, it actually makes them harder to understand.
Why Should You Care about Multicollinearity?
Multicollinearity refers to predictors that are correlated with other predictors in the model. It occurs when your model includes several factors that are correlated not just with your response variable, but also with each other. In other words, it arises when you have redundant factors.
Imagine a football game: If one player tackles the quarterback, it’s easy to give credit for the sack where credit’s due. But if three players tackle the quarterback simultaneously, it’s much more difficult to determine who was most responsible.
Or imagine seeing a rock and roll band with two great guitar players. You want to see which one plays best. But on stage, they’re both playing loud and fast. What they’re doing is so similar it’s difficult to tell one from the other. So how can you tell which guitarist has the biggest effect on the sound?
That’s the problem with multicollinearity.
Statistically speaking, multicollinearity inflates the standard errors of the regression coefficients. Those inflated standard errors can make some variables appear statistically insignificant when, without the multicollinearity, they would be significant.
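To see that inflation directly, here is a minimal simulation sketch, assuming Python with numpy and statsmodels; the sample size, coefficient values, and correlation level are arbitrary choices for illustration, not numbers from this article.

```python
# Illustration only: the same model is fit with uncorrelated and with highly
# correlated predictors, and the slope standard errors are compared.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200

def coef_std_errors(correlation):
    """Fit y = 1 + 2*x1 + 2*x2 + noise and return the two slope standard errors."""
    cov = [[1.0, correlation], [correlation, 1.0]]
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = 1 + 2 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=1.0, size=n)
    results = sm.OLS(y, sm.add_constant(X)).fit()
    return results.bse[1:]          # standard errors of the two slope coefficients

print("uncorrelated predictors :", coef_std_errors(0.0))
print("correlated (r = 0.95)   :", coef_std_errors(0.95))
```

With the correlated predictors, the standard errors come out noticeably larger even though the underlying relationship to the response is the same.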
Warning Signs of Multicollinearity
A little bit of multicollinearity isn’t necessarily a problem: to extend the rock band analogy, if one guitarist is louder than the other, you can tell them apart. But severe multicollinearity is a major problem, because it makes it more difficult to interpret your regression coefficients.
Do you need to be concerned about multicollinearity? Watch for these symptoms:
- A regression coefficient is not significant even though, theoretically, that variable should be highly correlated with the response.
- Adding or deleting a variable changes the regression coefficients dramatically.
- A regression coefficient is negative, but your response should increase along with that factor.
- A regression coefficient is positive when the response should decrease along with that factor.
- Your X variables have high pairwise correlations (a quick correlation matrix, as sketched below, will show this).
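Checking that last symptom takes only a line or two. Here is a quick sketch, assuming Python with pandas; the file name and column names are hypothetical placeholders for your own data.

```python
# Quick look at pairwise correlations among candidate predictors;
# values near +1 or -1 are a warning sign of redundancy.
import pandas as pd

df = pd.read_csv("your_data.csv")        # hypothetical data file
predictors = ["x1", "x2", "x3"]          # hypothetical predictor columns
print(df[predictors].corr().round(2))
```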
We can use the variance inflation factor (VIF) to assess how much the variance of an estimated regression coefficient increases when predictors are correlated. The VIF for a given predictor is found by regressing that predictor on all of the other predictors: VIF = 1 / (1 − R²), where R² comes from that auxiliary regression. If the VIF is equal to one, there is no multicollinearity among the factors, but if the VIF is greater than one, the predictors may be moderately correlated.
| VIF | Predictors are... |
|---|---|
| VIF = 1 | Not correlated |
| 1 < VIF < 5 | Moderately correlated |
| VIF > 5 | Highly correlated |
VIF values greater than 10 may indicate multicollinearity is unduly influencing your regression results. In this case, you may want to reduce multicollinearity by removing unimportant predictors from your model.
Statistical software packages may not display the VIF by default, but it should be available as an option when you perform a regression analysis. Typically, the VIF will be displayed along with other important statistics in the table of coefficients.
The output in Figure 1 is from an analysis of the relationship between researchers’ salary, the number of publications they have, and their length of employment:
The VIFs for the Publication and Years factors are about 1.5, indicating some correlation, but not enough to be overly concerned about. A VIF between 5 and 10 indicates high correlation that may be problematic. If the VIF goes above 10, you can assume that the coefficients are poorly estimated due to multicollinearity.
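If you work in Python rather than a point-and-click package, a minimal sketch of the same check, assuming pandas and statsmodels, might look like this; the file name and column names are hypothetical stand-ins for the data behind Figure 1.

```python
# Compute the VIF for each predictor in the salary model.
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

df = pd.read_csv("researcher_salaries.csv")        # hypothetical data file
X = add_constant(df[["Publications", "Years"]])    # predictors plus intercept

vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif.drop("const"))    # the intercept's VIF is not meaningful
```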
Dealing With Multicollinearity
If multicollinearity is a problem (the VIF for a factor is near or above 5), the solution may be relatively simple. Try one of these:
Remove highly correlated predictors from the model. If two or more factors have a high VIF, remove one of them from the model. Because they supply redundant information, removing one of these factors usually won't drastically reduce the R-squared (as the sketch below illustrates). Consider using stepwise regression, best subsets regression, or specialized knowledge of the data set to decide which variables to remove, and select the model with the highest R-squared value.
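As a rough illustration of that check, here is a sketch using the same hypothetical file and column names as above, again assuming pandas and statsmodels (in the Figure 1 data the VIFs were low, so you would not actually need to drop anything):

```python
# Compare R-squared before and after dropping one of two correlated predictors.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("researcher_salaries.csv")    # hypothetical data file

full    = smf.ols("Salary ~ Publications + Years", data=df).fit()
reduced = smf.ols("Salary ~ Publications", data=df).fit()

print("R-squared with both predictors:", round(full.rsquared, 3))
print("R-squared after dropping Years:", round(reduced.rsquared, 3))
```

If the two R-squared values are close, the dropped predictor was mostly supplying information the remaining one already carried.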
Use Partial Least Squares Regression (PLS) or Principal Components Analysis. These methods cut the number of predictors to a smaller set of uncorrelated components.
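For example, a principal-components version of the salary model might look like the sketch below, assuming scikit-learn and the same hypothetical data file; a PLS variant would swap PCA and LinearRegression for sklearn's PLSRegression.

```python
# Replace correlated predictors with uncorrelated principal components,
# then regress the response on those components.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

df = pd.read_csv("researcher_salaries.csv")    # hypothetical data file
X = df[["Publications", "Years"]]              # correlated predictors
y = df["Salary"]

# Standardize, keep only the first principal component (which captures the
# variation the two predictors share), then fit an ordinary regression on it.
model = make_pipeline(StandardScaler(), PCA(n_components=1), LinearRegression())
model.fit(X, y)
print("R-squared on the component scores:", round(model.score(X, y), 3))
```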