An under-recognized property of the simple correlation provides a valuable perspective on the distinction between patterns and causes (continuing from previous post). The property is that the simple (linear) correlation measures how tightly the data are packed around the line of best fit through those data. Let’s spell that out. Suppose two variables are measured for a set of items, and the items are plotted against two axes so that the horizontal position of an item is its value for the first variable and the vertical position is its value for the second. The first pattern that most people see when viewing such data is a cloud of points (i.e., the items). The line of best fit is the long axis that runs through that cloud.

Is this property really under-recognized? In one sense, no. All statistics students know that if the correlation is high, the cloud is a narrow cigar; if it’s low (i.e., close to 0), it’s hard to specify where the axis of the cloud lies. But, in another sense, yes. Statistics students also learn that the correlation coefficient is related to the slope of the regression line, where that line is not the line of best fit, but the line which, on average, best predicts the vertical variable for an item on the basis of the horizontal variable for that item. For example, if the items were mother-daughter pairs, the horizontal variable could be the heights of the mothers and the vertical variable the heights of the daughters. The regression line would let us predict a daughter’s height from her mother’s (excluding adopted daughters). Of course, the predictions would not turn out to be perfect, but the regression line is chosen so as to minimize the average of the vertical deviations from the predictions (actually, the squared deviations, so that the above-the-line and below-the-line deviations do not cancel each other out). No other line through the cloud, including the line of best fit, can do better in the prediction department.
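To make the contrast between the two lines concrete, here is a minimal sketch using simulated data. The heights are invented for illustration (not real measurements), and it assumes that "line of best fit" means the long axis of the cloud, computed here as the principal axis of the covariance matrix. The names `mothers`, `daughters`, and `mean_sq_vertical_error` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated (not real) heights in cm for 500 mother-daughter pairs.
mothers = rng.normal(163.0, 6.0, size=500)
daughters = 163.0 + 0.5 * (mothers - 163.0) + rng.normal(0.0, 5.0, size=500)


def mean_sq_vertical_error(slope, intercept):
    """Average squared vertical deviation of daughters' heights from a line."""
    predicted = slope * mothers + intercept
    return np.mean((daughters - predicted) ** 2)


# Regression line: chosen to minimize squared vertical deviations.
reg_slope, reg_intercept = np.polyfit(mothers, daughters, 1)

# Long axis of the cloud (assumed reading of "line of best fit"):
# the eigenvector of the covariance matrix with the largest eigenvalue.
cov = np.cov(mothers, daughters)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
major = eigvecs[:, -1]
axis_slope = major[1] / major[0]
axis_intercept = daughters.mean() - axis_slope * mothers.mean()

reg_error = mean_sq_vertical_error(reg_slope, reg_intercept)
axis_error = mean_sq_vertical_error(axis_slope, axis_intercept)

print(f"regression line: slope {reg_slope:.3f}, mean squared error {reg_error:.2f}")
print(f"long axis:       slope {axis_slope:.3f}, mean squared error {axis_error:.2f}")
```

On data like these the long axis is steeper than the regression line and its vertical prediction error is larger, which is the sense in which the regression line wins "in the prediction department."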

Restating this slightly technically: If the horizontal and vertical variables are standardized so that each has a mean of 0 and a standard deviation of 1, then the slope of the regression line is exactly the correlation coefficient. The greater the slope, the greater the correlation coefficient (up to a maximum of 1). The under-recognized property is that the average squared perpendicular distance of the points from the diagonal line with a slope of 1 through the intersection of the axes (i.e., the (0,0) point) is 1 minus the correlation. That diagonal is the line of best fit for standardized variables when the correlation is positive. The larger the correlation, the tighter the packing of points. And this remains the case even if the vertical and horizontal variables are switched in the plot.
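Both claims can be checked numerically. The sketch below uses simulated data; the variable names and the helper functions `standardize` and `mean_sq_perp_distance` are illustrative assumptions. Standardizing uses the population standard deviation (NumPy's default), which is what makes the identities hold exactly rather than approximately.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data with a positive linear association.
x = rng.normal(size=1000)
y = 0.6 * x + rng.normal(scale=0.8, size=1000)


def standardize(v):
    """Rescale to mean 0 and (population) standard deviation 1."""
    return (v - v.mean()) / v.std()


def mean_sq_perp_distance(horizontal, vertical):
    """Average squared perpendicular distance from the slope-1 diagonal.

    The perpendicular distance from a point (h, v) to the line v = h
    is |v - h| / sqrt(2), so its square is (v - h)**2 / 2.
    """
    return np.mean((vertical - horizontal) ** 2 / 2)


zx, zy = standardize(x), standardize(y)
r = np.corrcoef(x, y)[0, 1]

# Slope of the regression of standardized y on standardized x.
slope = np.polyfit(zx, zy, 1)[0]

packing = mean_sq_perp_distance(zx, zy)
packing_switched = mean_sq_perp_distance(zy, zx)  # axes swapped

print(f"correlation r          = {r:.4f}")
print(f"regression slope       = {slope:.4f}")
print(f"mean squared perp dist = {packing:.4f}")
print(f"1 - r                  = {1 - r:.4f}")
print(f"axes switched          = {packing_switched:.4f}")
```

The algebra behind the check is short: expanding (zy - zx)² / 2 and averaging gives (1 + 1 - 2r) / 2 = 1 - r, because each standardized variable has a mean square of 1 and the average of their product is r.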

Notice a shift of interpretation here. In one view, the correlation coefficient as the slope of the regression line helps us predict the vertical variable from the horizontal: knowing mothers’ heights gives a way to predict daughters’ heights. In the other view, the correlation coefficient just tells us how closely associated the values are, without the horizontal variable having predictive priority over the vertical. Now you might say, mothers come first, since daughters are born of their mothers. Notice that you had to bring in knowledge not contained in the data to say that. In any case, we could imagine using daughters’ heights to predict mothers’. Suppose that, after an earthquake hit an area, young women who had been away, say, at college, returned to search for their mothers among the shrouded victims. Each could be directed first to the subsection of victims whose heights were close to the height predicted from her own. A gruesome example, but there is nothing in the fact that mothers are born first that says we must predict the daughters’ heights from the mothers’, and not vice versa. This non-directionality of prediction makes sense because the same correlation coefficient is involved in the slope of the regression as in the tightness of the packing. No additional information is contained in one versus the other, so no additional interpretation is warranted on the basis of the data alone. *To be continued.*
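The non-directionality can be seen by running the regression both ways. This is a minimal sketch on simulated heights (invented for illustration, not real measurements); the names `mothers`, `daughters`, and `standardize` are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated (not real) heights in cm for 400 mother-daughter pairs.
mothers = rng.normal(163.0, 6.0, size=400)
daughters = 163.0 + 0.5 * (mothers - 163.0) + rng.normal(0.0, 5.0, size=400)


def standardize(v):
    """Rescale to mean 0 and (population) standard deviation 1."""
    return (v - v.mean()) / v.std()


zm, zd = standardize(mothers), standardize(daughters)
r = np.corrcoef(mothers, daughters)[0, 1]

# Standardized variables: the slope is r whichever way we predict.
slope_daughter_from_mother = np.polyfit(zm, zd, 1)[0]
slope_mother_from_daughter = np.polyfit(zd, zm, 1)[0]

# Original units: the two slopes differ, but their product is r squared.
raw_daughter_from_mother = np.polyfit(mothers, daughters, 1)[0]
raw_mother_from_daughter = np.polyfit(daughters, mothers, 1)[0]

print(f"r                               = {r:.4f}")
print(f"standardized, daughter ~ mother = {slope_daughter_from_mother:.4f}")
print(f"standardized, mother ~ daughter = {slope_mother_from_daughter:.4f}")
print(f"product of raw slopes           = {raw_daughter_from_mother * raw_mother_from_daughter:.4f}")
print(f"r squared                       = {r ** 2:.4f}")
```

Nothing in the computation distinguishes which variable "comes first"; swapping the arguments to `np.polyfit` is the only difference between the two predictions.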
