Path analysis is a data analysis technique that quantifies the relative contributions (“path coefficients”) of variables to the variation in a focal variable, once a network of interrelated variables has been specified (Lynch & Walsh 1998, 823). Some of these contributions are direct; others are indirect, i.e., mediated through other variables. Although some researchers interpret “contribution” in causal terms (e.g., Pearl 2000, 135 & 344-5), others criticize such an interpretation (e.g., Freedman 2005). Here, “contribution” refers neutrally to a term in an additive model fitted to data.
This post is a first attempt at a non-technical introduction to path analysis and structural equation modeling (alternative expositions welcome).
The conceptual starting point for path analysis is an additive regression model that associates the focal (“dependent”) variable with several other measured (“independent” or “exogenous”) variables. (The vertical lines in these figures indicate that the separate horizontal lines are combined together.)
X2 ---|---> Y
Technically, the additive model is transformed by subtracting the mean from every term, squaring the expression (so that it becomes an equation for the variance), and dividing by the variance of the focal (“dependent”) variable. The result is the “equation of complete determination.” In it, each regression coefficient is multiplied by the SD (standard deviation) of its “independent” variable and divided by the SD of the focal variable to give the path coefficient.
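The transformation just described can be checked numerically. The following is a minimal sketch, not a definitive implementation: the data are simulated, and the variable names, sample size, and coefficients are illustrative assumptions rather than anything from a real study. It fits an additive model by least squares, rescales the regression coefficients into path coefficients, and confirms that the squared path coefficients, the term for the correlation between the two independent variables, and the residual term add up to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Hypothetical data: two correlated "independent" variables and a focal variable.
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)

# Fit the additive model y = b0 + b1*x1 + b2*x2 + residual by least squares.
X = np.column_stack([np.ones(n), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
residual = y - X @ b

# Path coefficient = regression coefficient * SD(independent) / SD(focal).
sd_y = y.std()
p1 = b[1] * x1.std() / sd_y
p2 = b[2] * x2.std() / sd_y
p_resid = residual.std() / sd_y  # path from the unexplained residual

# Correlation between the two independent variables.
r12 = np.corrcoef(x1, x2)[0, 1]

# Equation of complete determination: the pieces sum to 1.
total = p1**2 + p2**2 + 2 * p1 * p2 * r12 + p_resid**2
print(f"p1 = {p1:.3f}, p2 = {p2:.3f}, residual path = {p_resid:.3f}")
print(f"sum of terms = {total:.6f}")
```

The sum equals 1 (up to rounding) for any data set fitted this way, because least-squares residuals are uncorrelated with the fitted values. The identity is therefore an accounting of the variance, not evidence that the model is right.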
The next step is to consider more than one focal, “endogenous” variable and networks of exogenous and endogenous variables that you have reason to think are associated with one another. Indeed, the focal variable of one regression may be among the variables associated with a second focal variable and so on. In the figure below X3 has a direct link with Y2 and an indirect one through Y1.
X2 ---|---> Y1 --|---> Y2
The software (e.g., LISREL) can solve these linked regression equations, but it is up to you to compare the results from the network you specify with those from plausible (theoretically justified) alternatives that may link the exogenous (independent) variables and the endogenous variables differently. Unlike in multiple regression, we do not decide what belongs in the model by adding or dropping variables in some stepwise procedure.
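To make the idea of linked regressions concrete, here is a minimal sketch under stated assumptions. The data are simulated, the network (X2 and X3 pointing to Y1; Y1 and X3 pointing to Y2) is one reading of the description above, and the helper names `standardize` and `ols` are hypothetical. Plain least squares stands in for what dedicated software such as LISREL does; this is not that software's method or interface.

```python
import numpy as np

def standardize(v):
    """Rescale a variable to mean 0 and SD 1."""
    return (v - v.mean()) / v.std()

def ols(y, *xs):
    """Least-squares coefficients of y on the given (standardized) variables."""
    coef, *_ = np.linalg.lstsq(np.column_stack(xs), y, rcond=None)
    return coef

rng = np.random.default_rng(1)
n = 2000

# Hypothetical network: X2 and X3 feed Y1; Y1 and X3 feed Y2.
x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
y1 = 0.6 * x2 + 0.5 * x3 + rng.normal(size=n)
y2 = 0.7 * y1 + 0.3 * x3 + rng.normal(size=n)
x2, x3, y1, y2 = map(standardize, (x2, x3, y1, y2))

# One regression per endogenous variable; with standardized variables the
# coefficients are the path coefficients and no intercept is needed.
p_y1_x2, p_y1_x3 = ols(y1, x2, x3)
p_y2_y1, p_y2_x3 = ols(y2, y1, x3)

direct = p_y2_x3               # X3 -> Y2
indirect = p_y1_x3 * p_y2_y1   # X3 -> Y1 -> Y2

# The observed correlation of X3 with Y2 decomposes into the direct path,
# the indirect path, and a part due to X3's correlation with X2.
r_x3_y2 = np.corrcoef(x3, y2)[0, 1]
r_x2_x3 = np.corrcoef(x2, x3)[0, 1]
via_x2 = r_x2_x3 * p_y1_x2 * p_y2_y1
print(f"direct = {direct:.3f}, indirect = {indirect:.3f}, via X2 = {via_x2:.3f}")
print(f"sum = {direct + indirect + via_x2:.3f}, correlation = {r_x3_y2:.3f}")
```

A different network fitted to the same data (for example, one with no direct link from X3 to Y2) would give a different decomposition. That is why the comparison with plausible alternatives matters.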
Structural equation modeling (SEM) extends path analysis to include latent (a.k.a. unmeasured) variables or “constructs.” A latent variable is sometimes the presumed real underlying variable of which a measured variable is an imperfect marker. For example, birth weight at full term and neonatal APGAR scores might be the measured variables, but the model might include degree of fetal under-nutrition as a latent variable. Latent variables can also be constructed by the software in the same way as in factor analysis, namely, as economical (dimension-reducing) linear combinations of measured variables. Calling the networks of linked variables “structural” is meant to suggest that we can give the pathways causal interpretations, but SEM and path analysis have no trick that overcomes the problems that regression and factor analysis have in exposing causes.
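The sense in which measured variables are imperfect markers of a latent variable can be illustrated with a small simulation. This sketch does not fit a structural equation model. It only generates data from an assumed measurement model, and the loadings (0.8 and 0.6), sample size, and variable names are illustrative assumptions. Under that model, two markers of the same latent variable correlate with each other only as strongly as the product of their loadings.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000

# Hypothetical latent variable (never observed directly in a real study).
latent = rng.normal(size=n)

# Two measured markers, each the latent variable plus independent noise,
# scaled so that each marker has variance close to 1.
load_a, load_b = 0.8, 0.6
marker_a = load_a * latent + np.sqrt(1 - load_a**2) * rng.normal(size=n)
marker_b = load_b * latent + np.sqrt(1 - load_b**2) * rng.normal(size=n)

r_ab = np.corrcoef(marker_a, marker_b)[0, 1]
r_a_latent = np.corrcoef(marker_a, latent)[0, 1]
print(f"correlation of marker A with the latent variable: {r_a_latent:.2f}")
print(f"correlation between the two markers: {r_ab:.2f}")
print(f"product of the loadings: {load_a * load_b:.2f}")
```

In a real analysis only the markers are available, so the software has to work backwards from correlations like the one between the two markers to estimates of the loadings. That step depends on the structure the analyst has assumed.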
(This post is a supplement to a series laying out a sequence of basic ideas in thinking like epidemiologists, especially epidemiologists who pay attention to possible social influences on the development and unequal distribution of diseases and behaviors in populations [see first post in series and contribute to open-source curriculum http://bit.ly/EpiContribute].)
Freedman, D. A. (2005). Linear statistical models for causation: A critical review. In B. Everitt and D. Howell (Eds.), Encyclopedia of Statistics in Behavioral Science. Chichester, Wiley.
Lynch, M. and B. Walsh (1998). Genetics and Analysis of Quantitative Traits. Sunderland, MA, Sinauer.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. Cambridge, Cambridge University Press.