“We could, however, just think directly about the relationships among the variables as seen in the patterns…” is where the last post left off. What other knowledge can we bring in to help us hypothesize about why the variables are associated? “Associated” in the two-variable case means that the variables go up together or, for a negative correlation, that one goes up when the other goes down. For cases where the items are measured on more than two variables, the lines of best fit through the multi-dimensional cloud of items, i.e., the principal components, have coefficients of various sizes and signs for the different variables. Why these combinations of coefficients?
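To make "coefficients of various sizes and signs" concrete, here is a minimal sketch in Python with NumPy. Everything in it is an assumption for illustration: the three variable names, the correlations among them, and the simulated data are invented, and the helper `principal_components` is hypothetical rather than anything used in these posts. It takes the principal components to be the eigenvectors of the correlation matrix, which is one common convention.

```python
import numpy as np


def principal_components(data):
    """Return (share_of_variance, coefficients) from the correlation matrix.

    Column k of `coefficients` holds the coefficients of component k,
    ordered from the largest share of variance to the smallest.
    """
    corr = np.corrcoef(data, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)  # ascending order
    order = np.argsort(eigenvalues)[::-1]             # largest first
    eigenvalues = eigenvalues[order]
    eigenvectors = eigenvectors[:, order]
    return eigenvalues / eigenvalues.sum(), eigenvectors


# Invented correlations among three hypothetical standardized heights.
names = ["mother", "daughter", "father"]
true_corr = np.array([[1.0, 0.5, 0.2],
                      [0.5, 1.0, 0.3],
                      [0.2, 0.3, 1.0]])

rng = np.random.default_rng(0)
data = rng.multivariate_normal(mean=np.zeros(3), cov=true_corr, size=500)

share, coefficients = principal_components(data)
for k in range(3):
    terms = ", ".join(f"{name}: {c:+.2f}"
                      for name, c in zip(names, coefficients[:, k]))
    print(f"component {k + 1} ({share[k]:.0%} of variance): {terms}")
```

The overall sign of each component is arbitrary (multiplying every coefficient by -1 describes the same line), so what invites a hypothesis is the relative sizes and the pattern of signs within a component, not the sign of any single coefficient.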

Hypothesizing about the relationships behind the observed associations includes cases where the linear correlation is low, such as the husband and wife height data mentioned earlier in this series of posts. It also includes the higher-correlation case of the mother-daughter association. Could the two heights be associated because mothers feed their daughters what they were fed growing up? Or, because this cross-generational diet, in particular, influences the onset of menarche, after which growth soon stops? Or, because the fathers of the daughters have heights associated with the mothers (i.e., tall mothers, tall fathers, and so on) and the fathers’ genetic contribution determines their daughters’ height?

Personally, I doubt each of these hypotheses about the mother-daughter height association, but my doubt draws not from the pattern in the data, but from additional knowledge and assumptions about the world. For example, I assume that, in the U.S.A., there are enough cross-generational shifts in diet (and not only increases in availability of what mothers were fed) that diet is an unlikely cause of height associations. And I am pretty sure that height variation among my siblings was not related to differential access to food.

The same hypothesizing about relationships behind the association, drawing on additional knowledge, can apply to multi-variable associations. The point is not that exploration of possible relationships should be the focus for everyone. Translation to the forms, i.e., regression lines, that best predict a given variable from the others can be useful, as the (invented) case of identification of injured mothers (in a previous post) illustrated. The point is to counteract (or at least complement) the narrowing of exploration of possible relationships that occurs when regression coefficients are viewed as causes or proxies for them. Regression equations make each variable contribute its own effect the way that carriages of a coal train drop their loads into the bunker when they arrive at their destination.
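The coal-train picture can be sketched in a few lines of Python with NumPy. The data and the coefficients used to generate them are invented, and the variable names are hypothetical. All the sketch shows is the additive form of a fitted regression equation: each predictor delivers its own separate load, and the prediction is the intercept plus the sum of those loads.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Invented data: two hypothetical predictors and one outcome.
x = rng.normal(size=(n, 2))
y = 1.0 + 0.6 * x[:, 0] - 0.3 * x[:, 1] + rng.normal(scale=0.5, size=n)

# Ordinary least squares with an intercept column.
design = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, slopes = coef[0], coef[1:]

# One "carriage load" per variable for every item: coefficient times value.
contributions = x * slopes

# The prediction is just the intercept plus the separate loads added up.
predicted = intercept + contributions.sum(axis=1)

print("intercept:", round(float(intercept), 2))
print("slopes:", np.round(slopes, 2))
print("loads for the first item:", np.round(contributions[0], 2))
```

Nothing in this additive form says whether the variables influence one another or share causes; that is the kind of relationship the regression equation leaves unexplored.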

Statisticians say it is too difficult to interpret the combinations of coefficients that make up a principal component, so, in a sense, I’m saying try harder: stretch yourselves to explore hypotheses about relationships behind associations. In another sense, however, I agree with the statisticians. That is because the relationships we would be looking for to account for the patterns discount the heterogeneity of relationships underlying those patterns. This will be the topic of a future series of posts.
