Everyone knows that correlation does not imply causation, yet everyone who does regression analysis thinks about the “independent” predictor variables in causal ways. This series of posts explains how an under-acknowledged property of simple linear correlation coefficients has implications for causal construals of regression equations of any type.
Let me start with a story. As a young researcher in an agricultural research project I sought data on crop responses to irrigation water and fertilizer levels. I found one data set that combined crop yield with both variables, and I derived the best linear regression equation to predict crop yield from the other two, that is, crop yield = a times irrigation water used plus b times fertilizer level. The value of b was negative and my boss said “That can’t be right; we know that fertilizer increases crop yield.” His approach (which he used when deciding which equations to insert into a large model of the agricultural sector of the whole economy) was to reject the equation if it didn’t make causal sense to him. My approach was simply to look for the best predictor based on the data available. But even then I had a causal interpretation of a kind: I pointed out that the data fitted by the regression included points in which the water used was so high that plants could not make use of the fertilizer; a non-linear regression might have captured that causal relationship. (The debate became moot when, for lack of published data, we decided to consult with the agricultural extension officers in the region about expected crop yields under various combinations of irrigation water, fertilizer level, soil type, etc.)
Contra my boss’s thinking, the models fitted to data in statistical analysis are not causal. Data are given and the models re-describe those data in some economical way. That is, we use them to extract a pattern and discount (as “error” or “noise”) what doesn’t fit the pattern. Now, that pattern may be one in which every increase in variable A of 1 unit is accompanied by an increase in variable B of 2 units, but this does not mean that we can actually manipulate variable A and get the associated change in variable B, or vice versa. Only if we could do that would the equation be causal. (Later, we’ll address the possibility that there is a variable C, or variables C, D, etc., that we can manipulate and that leads to changes in variables A and B in the ratio 1:2.)
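To make the distinction between pattern and manipulation concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the variable C, the noise levels, and the helper names `ols_slope` and `simulate` are hypothetical and are not taken from any data set discussed here. By construction, C drives A and B in the ratio 1:2 and A has no effect on B, so the fitted slope shows the 1:2 pattern while a direct push on A leaves B where it was.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (one predictor plus intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


def simulate(shift_a=0.0, n=10_000, seed=1):
    """Hypothetical world: C drives both A and B; A never feeds into B."""
    rng = random.Random(seed)
    a, b = [], []
    for _ in range(n):
        c = rng.gauss(0, 1)                         # assumed common cause
        a.append(c + rng.gauss(0, 0.1) + shift_a)   # A tracks C; shift_a is our manipulation of A
        b.append(2 * c + rng.gauss(0, 0.1))         # B tracks 2*C and does not read A
    return a, b


# The pattern in the data as given: B rises about 2 units per unit of A.
a_obs, b_obs = simulate()
pattern_slope = ols_slope(a_obs, b_obs)

# Manipulate A directly (same seed, so C and the noise draws are identical).
a_new, b_new = simulate(shift_a=1.0)
change_in_a = mean(a_new) - mean(a_obs)
change_in_b = mean(b_new) - mean(b_obs)

print(f"fitted slope of B on A: {pattern_slope:.2f}")
print(f"after pushing A up by {change_in_a:.2f}, B changed by {change_in_b:.2f}")
```

The outcome is built into the simulated world rather than discovered from the data, which is the point: the data and the fitted slope would look the same whether or not manipulating A moved B.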
At the risk of belaboring this introductory point, consider a variable A that is 1 if the person measured is an adult male in the USA and 0 if the person is an adult female. If variable B is each person’s height, there is an obvious pattern: an increase in variable A is associated with a 14 cm (5.5″) increase in variable B. And an increase in variable B of 14 cm is associated with an increase in variable A of 1. It is not the case, of course, that we can increase the height of an adult, nor that their sex would change if we could. And, in the rare cases where people do change their sex, their height does not change! (Perhaps variable A could be sex before any sex change, in which case we don’t have to consider the possibility of sex changes [not] leading to height change.)
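As a small sketch of what “pattern” means here: when the predictor is a 0/1 variable, the least-squares slope is simply the difference between the two group means. The 14 cm gap is the figure used above; the baseline mean of 162 cm, the 7 cm spread, and the sample itself are invented for illustration, and `ols_slope` is a hypothetical helper.

```python
import random
from statistics import mean


def ols_slope(x, y):
    """Least-squares slope of y regressed on x (one predictor plus intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(0)
n = 5000

# Variable A: 1 for an adult male, 0 for an adult female (simulated, not real data).
sex = [rng.randint(0, 1) for _ in range(n)]

# Variable B: height in cm. Assumed baseline 162 cm and spread 7 cm; gap of 14 cm from the text.
height = [rng.gauss(162 + 14 * a, 7) for a in sex]

slope = ols_slope(sex, height)

male_heights = [h for h, a in zip(height, sex) if a == 1]
female_heights = [h for h, a in zip(height, sex) if a == 0]
mean_gap = mean(male_heights) - mean(female_heights)

print(f"slope of height on sex: {slope:.2f} cm")
print(f"difference in group means: {mean_gap:.2f} cm")
```

The two printed numbers agree up to rounding error. The slope re-describes the gap between the groups; nothing in the calculation refers to changing anyone’s sex or height.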
Now, there are patterns in data that have causal significance or a causal interpretation, but we have to bring in additional knowledge not contained in the data. For example, if variable A is the height of wives and variable B is the height of their husbands, there is a pattern that very few wives are taller than their husbands, even though quite a few women are taller than quite a few men. The data don’t tell us whether this pattern derives from men choosing female spouses who are shorter than them, women choosing male spouses who are taller, a combination of the two causes, or something else. Yet the pattern has causal significance in that tall women can expect to end up with husbands only from the small subset of men taller than them. Tall men, by contrast, can choose, or expect to be chosen as husbands by, short women, which results in a difference in physical strength, which in turn has implications for the power relations within the couple.
An under-recognized property of the simple correlation provides a valuable perspective on the distinction between patterns and causes. (To be continued in the next post.)