
Heterogeneity and Data Analysis: Heterogeneity #4, Deviation from the type or essential trajectory

Heterogeneity #4, Deviation from the type or essential trajectory

Of course, statistical analysis involves more than t-tests and their generalizations.  Correlation and regression are another mainstay.  Here, however, the emphasis lies more on prediction than variation, as if, as a generalization of the emphasis in t-tests on types, the line or curve of prediction captured the essential trajectory of the data (McLaughlin 1989).  (Of course, everyone knows that correlation is not causation, but most of us interpret regressions in a causal spirit.)  The following excerpt from Taylor (2008; see http://bit.ly/osTjQ3) highlights an alternative view of correlation and regression that keeps our attention on the variation (also discussed in a series of posts):

Consider the concept of a regression line as a best predictor line.  To predict one measurement from another is to hint at, or to invite, causal interpretation.  Granted, if we have the additional information that the second measurement follows the first in time—as is the case for offspring and parental traits—a causal interpretation in the opposite direction is ruled out.  But there is nothing about the association between correlated variables, whether temporally ordered or not, that requires it to be assessed in terms of how well the first predicts the second (let alone whether the predictions provide insight about the causal process).  After all—although this is rarely made clear to statistics students—the correlation is not only the slope of the regression line when the two measurements are scaled to have equal spread, but it also measures how tightly the cloud of points is packed around the line of slope 1 (or slope -1 for a negative correlation).  Technically, when both measurements are scaled to have a standard deviation of 1, the average of the squared perpendicular distance from the points to the line of slope 1 or -1 is equal to 1 minus the absolute value of the correlation (Weldon 2000).  This means that the larger the correlation, the tighter the packing.  This tightness-of-packing view of correlation affords no priority to one measurement over the other.  Whereas the typical emphasis in statistical analysis on prediction often fosters causal thinking, a non-directional view of correlation reminds us that additional knowledge always has to be brought in if the patterns in data are used to support causal claims or hypotheses.
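The tightness-of-packing property quoted above can be checked numerically. The following is a minimal sketch using made-up data (the simulated values, the seed, and the function names are illustrative assumptions, not part of Taylor 2008 or Weldon 2000). It standardizes both measurements with the population standard deviation and compares the mean squared perpendicular distance to the line of slope 1 or -1 with 1 minus the absolute value of the correlation.

```python
import math
import random

def standardize(values):
    """Rescale to mean 0 and (population) standard deviation 1."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]

def correlation(xs, ys):
    """Pearson correlation as the mean product of standardized values."""
    zx, zy = standardize(xs), standardize(ys)
    return sum(a * b for a, b in zip(zx, zy)) / len(xs)

def mean_sq_perp_distance(xs, ys):
    """Mean squared perpendicular distance from the standardized points to
    the line of slope +1 (positive correlation) or -1 (negative correlation)."""
    zx, zy = standardize(xs), standardize(ys)
    sign = 1.0 if correlation(xs, ys) >= 0 else -1.0
    # Distance from (x, y) to the line y = sign * x is |y - sign * x| / sqrt(2).
    return sum((b - sign * a) ** 2 / 2 for a, b in zip(zx, zy)) / len(xs)

rng = random.Random(0)  # made-up data, for illustration only
xs = [rng.gauss(0, 1) for _ in range(500)]
ys_pos = [0.7 * x + rng.gauss(0, 1) for x in xs]
ys_neg = [-0.7 * x + rng.gauss(0, 1) for x in xs]

for ys in (ys_pos, ys_neg):
    r = correlation(xs, ys)
    d = mean_sq_perp_distance(xs, ys)
    print(f"r = {r:+.3f}   1 - |r| = {1 - abs(r):.3f}   mean sq. perp. distance = {d:.3f}")
```

The two printed quantities agree for both the positive and the negative case, and neither function treats one measurement as the predictor of the other.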

[Postscript: The tightness-of-packing view of correlation for continuous variables can be extended to multivariate associations through Principal Component Analysis, factor analysis, etc.  The well-known difficulty of interpreting principal components or factors can be flipped on its head: What causal assumptions about independent variables (i.e., independently modifiable variables) enter into interpretations of conventional regression analysis?]

(continuing a series of posts—see first post; see next post)

Heterogeneity and Data Analysis: Heterogeneity #4, Deviation from the type

Heterogeneity #4, Deviation from the type

Statistical analysis rests on the simplest heterogeneity, namely, variation around a mean.  In this light, I tell education students who will not be taking a statistics course that they should:

Understand the simple chain of thinking below, then enlist or hire a statistician who will use the appropriate recipe for the data at hand.

1. There is a population of individuals. (Population = individuals subject to the same causes of interest.  In addition to these foreground causes, there may also be background, non-manipulatable causes that vary among these individuals.)

2. Variation: For some measurable attribute, the individuals have varying responses to these causes (possibly because of the background causes).

3. You have observations of the measurable attribute for two or more subsets (samples) of the population.

4. Central question of statistical analysis: Are the subsets sufficiently different in their varying responses that you doubt that they are from the one population (i.e., you doubt that they are subject to all the same foreground causes)? Statisticians answer this question with recipes that are variants of a comparison between the subset averages in relation to the spread around the averages. For the figure below, the statisticians’ comparison means that you are more likely to doubt that subsets A and B are from the same population in the left hand situation than in the right hand one.

5. If you doubt that the subsets are from the same population, investigate further, drawing on other knowledge about the subsets. You hope to expose the causes involved and then take action informed by that knowledge about the cause.
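To make step 4 concrete, here is a minimal sketch of one such recipe: a Welch-style t statistic, which divides the difference between the subset averages by a measure of the spread around them. The measurements are hypothetical and chosen so that both situations have the same averages (10 and 12) but different spreads. Turning the statistic into a probability statement would need a further step that is left to the statistician.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def welch_t(a, b):
    """Difference between subset averages relative to the spread around them."""
    standard_error = math.sqrt(sample_variance(a) / len(a) + sample_variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error

# Hypothetical measurements: same averages in both situations, different spread.
tight_a = [9.5, 10.0, 10.5, 9.8, 10.2]
tight_b = [11.5, 12.0, 12.5, 11.8, 12.2]
wide_a = [6.0, 10.0, 14.0, 8.0, 12.0]
wide_b = [8.0, 12.0, 16.0, 10.0, 14.0]

print("small spread:", round(welch_t(tight_a, tight_b), 2))
print("large spread:", round(welch_t(wide_a, wide_b), 2))
```

The statistic is much larger when the spread is small, which corresponds to having more reason to doubt that subsets A and B come from the same population.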

Variation around a mean is not a strong sense of heterogeneity.  The emphasis above is on the means (the circles) more than the variation (the dashed curves).  Statistical analysis distinguishes types (or decides they are not distinguishable) more than it explores the variation (or error, i.e., deviation from type).  Data amenable to a t-test are, however, open to alternative explorations, as illustrated by the final vignette in this series of posts.

(continuing a series of posts—see first post; see next post)

Creative Thinking in Epidemiology: 5. Alternatives to some statistical conventions & 6. Agent-oriented epidemiology

5.  Alternatives to some statistical conventions: As I have developed my ability to read the epidemiological literature and explain the methods and controversies over methods to others, I have taken note of approaches or perspectives that depart from statistical conventions.  The third Appendix includes some items from my mixed grab bag of alternatives.  There is no grand theory linking them.  Readers might have objections to some of the alternatives and the thinking behind them, but they might also be stimulated to explore their implications further.

Underrecognized property of correlation coefficient opens up an alternative direction, II

An under-recognized property of the simple correlation provides a valuable perspective on the distinction between patterns and causes (continuing from previous post).  This property is that the simple (linear) correlation is a measure of the tightness of packing of the data around the line of best fit through that data.  Let’s spell that out.  Suppose two variables are measured for a set of items and these items are plotted against two axes so that the horizontal position of an item is its value for the first variable and the vertical position is its value for the second variable.  The first pattern that most people see when viewing such data is a cloud of points (i.e., the items).  The line of best fit is an axis that runs through that cloud.

Is this property really under-recognized?  In one sense, no.  All statistics students know that if the correlation is high, the cloud is a narrow cigar; if it’s low (i.e., close to 0), it’s hard to specify where the axis is for the cloud.  But, in another sense, yes.  Statistics students also learn that the correlation coefficient is related to the slope of the regression line, where that line is not the line of best fit, but the line which, on average, best predicts the vertical variable for an item on the basis of the horizontal variable for that item.  For example, if the items were mother-daughter pairs, the horizontal variable could be the height of mothers and the vertical variable the height of a daughter.  The regression line would let us predict the daughter’s height from the mother’s (excluding adopted daughters).  Of course, the predictions would not turn out to be perfect, but the regression line is chosen so as to minimize the average of the vertical deviations from predictions (actually, the squared deviations, so that the above-the-line and below-the-line deviations do not cancel each other out).  No other line through the cloud—including the line of best fit—can do better in the prediction department.
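The difference between the two lines can be seen in a small simulation. This is only a sketch: the mother and daughter heights are invented, and the variables are standardized (mean 0, standard deviation 1) as a simplifying assumption so that the regression line has slope r and the line of best fit has slope 1.

```python
import math
import random

def standardize(values):
    """Rescale values to mean 0 and standard deviation 1."""
    n = len(values)
    m = sum(values) / n
    sd = math.sqrt(sum((v - m) ** 2 for v in values) / n)
    return [(v - m) / sd for v in values]

def mean_sq_vertical(zx, zy, slope):
    """Average squared vertical deviation from the line y = slope * x."""
    return sum((y - slope * x) ** 2 for x, y in zip(zx, zy)) / len(zx)

def mean_sq_perpendicular(zx, zy, slope):
    """Average squared perpendicular distance to the line y = slope * x."""
    return mean_sq_vertical(zx, zy, slope) / (1 + slope ** 2)

rng = random.Random(1)  # invented heights in inches, for illustration only
mothers = [rng.gauss(64, 2.5) for _ in range(400)]
daughters = [64 + 0.5 * (m - 64) + rng.gauss(0, 2.2) for m in mothers]

zx, zy = standardize(mothers), standardize(daughters)
r = sum(x * y for x, y in zip(zx, zy)) / len(zx)

for label, slope in (("regression line (slope r)", r), ("line of best fit (slope 1)", 1.0)):
    v = mean_sq_vertical(zx, zy, slope)
    p = mean_sq_perpendicular(zx, zy, slope)
    print(f"{label}: vertical = {v:.3f}, perpendicular = {p:.3f}")
```

The regression line has the smaller vertical (prediction) error, while the line of best fit has the smaller perpendicular distance, so each line is best only by its own criterion.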

Restating this slightly technically:  If the horizontal and vertical variables are standardized so that they now have a mean of 0 and a standard deviation of 1, then the slope of the regression line is exactly the correlation coefficient.  The greater the slope, the greater the correlation coefficient (up to a maximum of 1).  The under-recognized property is that the average squared perpendicular distance of the points from the diagonal line that passes through the intersection of the axes (i.e., the (0,0) point) with a slope of 1 (which is the line of best fit in this case) is 1 minus the correlation (for a positive correlation; for a negative one, the line has slope -1 and the distance is 1 minus the absolute value of the correlation).  The larger the correlation, the tighter the packing of points.  And this remains the case even if the vertical and horizontal variables are switched in the plot.
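For readers who want the algebra, here is a short derivation for a positive correlation. The notation is introduced here for illustration: x_i and y_i are the standardized values for item i, n is the number of items, r is the correlation, and d_i is the perpendicular distance of item i from the line y = x. It uses only the facts that the standardized variables have mean square 1 and mean product r.

```latex
\begin{aligned}
d_i &= \frac{|y_i - x_i|}{\sqrt{2}}, \\
\frac{1}{n}\sum_{i=1}^{n} d_i^2
  &= \frac{1}{2n}\sum_{i=1}^{n}\left(y_i^2 - 2x_i y_i + x_i^2\right)
   = \frac{1}{2}\left(1 - 2r + 1\right) = 1 - r .
\end{aligned}
```

The expression is unchanged if x and y are swapped, which is the algebraic counterpart of the last sentence above.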

Notice a shift of interpretation here.  In one view, the correlation coefficient as slope of regression line helps us predict the vertical variable from the horizontal—knowing mothers’ heights gives a way to predict daughters’ heights.  In the other view, the correlation coefficient just tells us how closely associated the values are, without the horizontal having predictive priority over the vertical variable.  Now you might say, mothers come first—daughters are born of their mothers.  Notice that you had to bring in knowledge not contained in the data to say that.  In any case, we could imagine using daughters’ heights to predict mothers’.  After an earthquake hit an area, young women who had been away, say, at college, returning to search for their mothers among the shrouded victims could be directed on the basis of their height first to a subsection of the victims who had heights similar to that predicted on the basis of the young women’s own height.  A gruesome example, but there is nothing in the fact that mothers are born first that says we must predict the daughters’ heights from the mothers, and not vice versa.  This non-directionality of prediction makes sense because the same correlation coefficient is involved in the slope of the regression as in the tightness of the packing—no additional information is contained in one versus the other, so no additional interpretation is warranted on the basis of the data alone.   To be continued.
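The non-directionality can be illustrated by running the regression both ways. In this sketch the heights are invented and the helper names are mine. Once each slope is rescaled to standardized units, predicting daughters from mothers and predicting mothers from daughters give the same number, the correlation coefficient.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def sd(values):
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def ols_slope(predictor, outcome):
    """Least-squares slope for predicting outcome from predictor."""
    mp, mo = mean(predictor), mean(outcome)
    sxy = sum((p - mp) * (o - mo) for p, o in zip(predictor, outcome))
    sxx = sum((p - mp) ** 2 for p in predictor)
    return sxy / sxx

rng = random.Random(2)  # invented heights in inches, for illustration only
mothers = [rng.gauss(64, 2.5) for _ in range(400)]
daughters = [64 + 0.5 * (m - 64) + rng.gauss(0, 2.2) for m in mothers]

b_daughter_from_mother = ols_slope(mothers, daughters)
b_mother_from_daughter = ols_slope(daughters, mothers)

# Rescale each slope to standardized units: multiply by SD(predictor) / SD(outcome).
std_daughter_from_mother = b_daughter_from_mother * sd(mothers) / sd(daughters)
std_mother_from_daughter = b_mother_from_daughter * sd(daughters) / sd(mothers)

print(f"standardized slope, daughters from mothers: {std_daughter_from_mother:.4f}")
print(f"standardized slope, mothers from daughters: {std_mother_from_daughter:.4f}")
```

Nothing in the data picks out one direction of prediction; the choice comes from knowledge brought in from outside.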

A non-technical introduction to path analysis and structural equation modeling

Path analysis is a data analysis technique that quantifies the relative contributions of variables (“path coefficients”) to the variation in a focal variable once a certain network of interrelated variables has been specified (Lynch & Walsh 1998, 823). Some of these contributions are direct and some mediated through other variables, i.e., indirect. Although some researchers interpret “contribution” in causal terms (e.g., Pearl 2000, 135 & 344-5), others criticize such an interpretation (e.g., Freedman 2005). Here, contribution refers neutrally to the term of an additive model fitted to data.

This post is a first attempt at a non-technical introduction to path analysis and structural equation modeling (alternative expositions welcome).

The conceptual starting point for path analysis is an additive regression model that associates the focal (“dependent”) variable with several other measured (“independent” or “exogenous”) variables. (The vertical lines in these figures indicate that the separate horizontal lines are combined together.)

X1 ---|
X2 ---|---> Y
X3 ---|

Technically, the additive model is transformed by subtracting the mean from every term, squaring the expression (so it is an equation for the variance), and dividing by the variance of the focal (“dependent”) variable. The result is the “equation of complete determination,” in which each regression coefficient is multiplied by the SD of its “independent” variable and divided by the SD of the focal variable to arrive at the path coefficient.
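The rescaling just described can be sketched in a few lines. Everything here is an illustrative assumption: the simulated variables, the coefficients used to generate them, and the small hand-written least-squares solver (real analyses would use statistical software). The final check uses the form of the equation of complete determination as I understand it, in which the products of path coefficients and correlations among the independent variables, plus the residual share of the variance, sum to 1.

```python
import math
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def sd(a):
    return math.sqrt(cov(a, a))

def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols_coefficients(predictors, y):
    """Least-squares slopes of y on the predictors (intercept absorbed by centering)."""
    s = [[cov(p, q) for q in predictors] for p in predictors]
    return solve(s, [cov(p, y) for p in predictors])

rng = random.Random(3)  # simulated data, for illustration only
n = 1000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]
x3 = [rng.gauss(0, 2) for _ in range(n)]
y = [2 * a - b + 0.5 * c + rng.gauss(0, 1.5) for a, b, c in zip(x1, x2, x3)]
xs = [x1, x2, x3]

b = ols_coefficients(xs, y)
# Path coefficient = regression coefficient * SD(independent) / SD(focal).
path = [bk * sd(xk) / sd(y) for bk, xk in zip(b, xs)]

# Share of the focal variance associated with the independent variables.
explained = sum(path[i] * path[j] * cov(xs[i], xs[j]) / (sd(xs[i]) * sd(xs[j]))
                for i in range(3) for j in range(3))
fitted = [sum(bk * (xk[i] - mean(xk)) for bk, xk in zip(b, xs)) for i in range(n)]
residual_share = sum((y[i] - mean(y) - fitted[i]) ** 2 for i in range(n)) / n / cov(y, y)

print("path coefficients:", [round(p, 3) for p in path])
print("explained + residual share:", round(explained + residual_share, 6))
```

Because the path coefficients are in standard-deviation units, their sizes can be compared across variables measured on different scales, which is what "relative contributions" refers to.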

The next step is to consider more than one focal, “endogenous” variable and networks of exogenous and endogenous variables that you have reason to think are associated with one another. Indeed, the focal variable of one regression may be among the variables associated with a second focal variable and so on. In the figure below X3 has a direct link with Y2 and an indirect one through Y1.

X1 ---|
X2 ---|---> Y1 --|---> Y2
X3 ---|----------|
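A minimal sketch of the network in this figure, using simulated data, is given next. The generating coefficients, the seed, and the hand-written solver are assumptions for illustration and say nothing about how LISREL or other software works internally. Two linked regressions are fitted: Y1 on X1, X2 and X3, and Y2 on Y1 and X3. The direct link from X3 to Y2 is its coefficient in the second regression; the indirect link is the product of the X3 coefficient in the first regression and the Y1 coefficient in the second. The coefficients are left unstandardized to keep the sketch short.

```python
import random

def mean(values):
    return sum(values) / len(values)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def solve(matrix, rhs):
    """Solve a small linear system by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for k in range(n):
            if k != col:
                factor = aug[k][col] / aug[col][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def ols_coefficients(predictors, y):
    """Least-squares slopes of y on the predictors (intercept absorbed by centering)."""
    s = [[cov(p, q) for q in predictors] for p in predictors]
    return solve(s, [cov(p, y) for p in predictors])

rng = random.Random(4)  # simulated data, for illustration only
n = 2000
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [rng.gauss(0, 1) for _ in range(n)]
x3 = [rng.gauss(0, 1) for _ in range(n)]
y1 = [1.0 * a + 0.5 * b + 0.8 * c + rng.gauss(0, 1) for a, b, c in zip(x1, x2, x3)]
y2 = [0.6 * m + 0.4 * c + rng.gauss(0, 1) for m, c in zip(y1, x3)]

# First focal variable: Y1 regressed on X1, X2, X3.
a1, a2, a3 = ols_coefficients([x1, x2, x3], y1)
# Second focal variable: Y2 regressed on Y1 and X3.
b_y1, b_x3 = ols_coefficients([y1, x3], y2)

direct = b_x3          # X3 -> Y2
indirect = a3 * b_y1   # X3 -> Y1 -> Y2
print(f"direct: {direct:.3f}  indirect: {indirect:.3f}  total: {direct + indirect:.3f}")
```

The split into direct and indirect parts depends entirely on the network that was specified. Drawing the arrows differently would give different numbers from the same data.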

The software (e.g., LISREL) can solve these linked regression equations, but it is up to you to compare the results using the network you specify with plausible (theoretically-justified) alternatives that may link exogenous, independent variables and endogenous variables differently. Unlike multiple regression, we do not arrive at our idea of what should be in the regression by adding or subtracting variables in some stepwise procedure.

Structural equation modeling extends path analysis to include latent (a.k.a. unmeasured) variables or “constructs.” These latent variables are sometimes the presumed real underlying variable of which the measured one is an imperfect marker. For example, birth weight at full term and the neonate APGAR scores might be the measured variables but the model might include degree of fetal under-nutrition as a latent variable. Latent variables can also be constructed by the software in the same way that they are in factor analyses, namely, as economical (dimension-reducing) linear combinations of measured variables. Calling the networks of linked variables “structural” is meant to suggest that we can give the pathways causal interpretations, but SEM and path analysis have no trick that overcomes the problems that regression and factor analyses have in exposing causes.

(This post is a supplement to a series laying out a sequence of basic ideas in thinking like epidemiologists, especially epidemiologists who pay attention to possible social influences on the development and unequal distribution of diseases and behaviors in populations [see first post in series and contribute to open-source curriculum http://bit.ly/EpiContribute].)


Freedman, D. A. (2005). Linear statistical models for causation: A critical review. Encyclopedia of Statistics in the Behavioral Sciences. B. Everitt and D. Howell. Chichester, Wiley.
Lynch, M. and B. Walsh (1998). Genetics and Analysis of Quantitative Traits. Sunderland, MA, Sinauer.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge, Cambridge University Press.

Patterns among relatives: A classroom activity III

The simple classroom activity presented in the previous two posts allows us to unpack the simple picture of science as empirical observation and rational interpretation (i.e., identifying patterns and trying to explain them).

These are only two of the many steps in scientific inquiry (figure 5).  At each step decisions are made that depend on knowledge (perhaps assumed knowledge) in addition to what can be drawn from any data collected.  Scientific inquiry cannot proceed without decisions that take into account diverse additional considerations, such as, in this classroom activity: technical constraints of plotting in three dimensions; theories about the mechanisms of heredity; temporal ordering (parents grow before their offspring are born and grow); whether to collect data about the diet of parents and offspring when they were growing up; and conventions about designation of outlier status to extreme points.  Each step becomes a site where decisions made can be shaped by convention, ongoing negotiation, and wider influences.  These “sites of sociality” invite critical scrutiny (Taylor 2005). We can, for example, consider the ways that preconceptions or preferences about the outcomes at the later steps feed forward to earlier ones (as depicted by the dashed lines in figure 5) so that the inquiry tends to reinforce that outcome.  As will be shown in the discussion of Galton’s work, such feed-forward loops can involve the social actions or organization supported or desired by scientists, that is, what they think we as a society can or should do.

Figure 5.  A chain of steps in scientific inquiry in which each step (indicated by an arrow ->) involves assumptions and is open for negotiation and wider influences.  The dashed lines depict the possibility that desired outcomes for the later stages influence decisions made at earlier steps.  See text for discussion.


Through this classroom activity two themes have emerged:

  • There are many sites in scientific inquiry at which decisions are made based on knowledge drawn from outside the observations to be explained.
  • The negotiation, assumptions about social possibilities and constraints, and wider influences that shape decisions made at these open sites invite critical scrutiny.

These themes extend some more basic themes about interpreting science in its social context:

  • It can be illuminating to ask what the authors (including ourselves) state or imply about what we can do.  (This deliberately broad formulation encompasses views about the social actions and organization they support as well as their views about the capabilities of different people growing up in our society and how difficult these are to change.)
  • Close examination of concepts and methods within any given natural or social science can stimulate our inquiries into the diverse social influences shaping that science, and reciprocally.

For more discussion of these themes, see Taylor, P. “Why was Galton so concerned about ‘regression to the mean’? -A contribution to interpreting and changing science and society” DataCritica, 2(2): 3-22, 2008, http://www.datacritica.info/ojs/index.php/datacritica/article/view/23/29, from which this post has been adapted, and Taylor (2005, chapter 2).


Taylor, P.J. (2005) Unruly Complexity: Ecology, Interpretation, Engagement (Chicago: U. Chicago Press)

Patterns among relatives: A classroom activity II

OK readers.  Keep in mind your answers to the questions raised in the previous post about patterns in data that link parents and offspring.  In this post I describe what usually emerges when I ask these questions in my classes on biology and society.

Students identify patterns in many ways.  They draw boxes, ellipses, or convoluted shapes around the data points, mark high and low values for each of the variables, note how many offspring are taller than their parents, separate the main cloud of points from outliers, draw trend lines through the cloud, and so on.  Many students note that in the first three plots an increase in one variable tends to be associated with an increase in the other (albeit with considerable spread around any trend line).  No trend, however, is seen in figure 4, which depicts the heights of each pair of parents.  Indeed, often students will say there is no pattern in that plot.  Some students notice the outlier halfway up on the right in which the mother, at 72”, is 3” taller than the father. They do not notice the pattern that the father is taller than the mother in almost every pair, but see it once I point it out.

When it comes to explanation, the first three plots are typically seen as indications of the hereditary relation between parents and offspring.  Because there is no hereditary relation between any mother and father, students conclude at first that no causality can be drawn from figure 4.  However, once I have drawn attention to the strong father-taller-than-mother pattern, lively discussion about the causes ensues for this plot too: Does this pattern correspond to men choosing female partners shorter than them or to women choosing male partners taller than them?  Or both?

A range of questions or reservations are expressed about the process of this scientific inquiry, including the reliability of the data (how accurate are the data, which presumably came from students’ recall or phone calls to their parents); criteria for inclusion (could adoptive or step-parents have been included); whether the students have stopped growing (perhaps heights should have been collected for parents when at the same age as their child is now); and whether outliers warrant special explanation (or can they be viewed as points at the end of a spectrum).

As the teacher I inject further issues of critical thinking into the discussion: What additional knowledge leads the students to invoke heredity?  (Couldn’t height trends result from parents feeding their children the way they were fed?)  Why plot same sex pairs and exclude the opposite sex parent?  (Is this a choice dictated only by the difficulties of plotting in three dimensions?)  Why plot offspring height against the average of the parents?  (Does this presume that height is a blending of contributions—hereditary or otherwise—from parents?)  Most importantly, what could anyone do (or be constrained from doing) on the basis of the patterns or explanations?

On this last issue of “what can we do?”, I note that the mother-father height pattern, originally overlooked by students, is of great significance to taller heterosexual women because it corresponds to a smaller selection of men available to them as potential partners.  If the height norm were contested, these women would have new options opened up.  It would also reduce the frequency of couples in which the man is very much taller and stronger than the woman.  In contrast, the hereditary explanation of the trend in the first three plots does not suggest any action other than inaction—parents cannot do anything to change the outcomes for their offspring once these offspring have been conceived.  This inaction conclusion about height might not trouble us, at least not enough to make us delve into possible relationships between growth trajectories and, say, maternal nutrition before and during pregnancy, childhood diet, exercise, and so on.  However, I ask my students, if the data were of IQ test scores, not heights, would inaction be an acceptable conclusion?  Or would they pursue the process of identifying patterns, proposing explanations, exploring reservations (including raising alternatives) differently?

In the concluding post I show that this simple classroom activity allows us to unpack the simple picture of science as empirical observation and rational interpretation.


Extracted from Taylor, P. “Why was Galton so concerned about ‘regression to the mean’? -A contribution to interpreting and changing science and society” DataCritica, 2(2): 3-22, 2008, http://www.datacritica.info/ojs/index.php/datacritica/article/view/23/29.