A recent post finished by wondering how to reconcile the figures for variation within versus among populations with the two-dimensional PCA (principal component analysis) plot of Li et al. (2008), http://www.sciencemag.org/content/319/5866/1100.full. In Li et al.'s analysis, 89% of variation is within populations, 2% is among populations within groups, and 9% is among groups. What such variation means is that it is difficult, on the basis of a random selection of genetic loci, to assign an individual to one branch of the tree of human ancestry or another. However, the PCA plots (such as the one below) make it look as if within-group variation is somewhat less than among-group variation.
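One way to probe how a small among-group fraction can coexist with visibly separated clusters is a toy simulation. The sketch below is not Li et al.'s method. It assumes a Balding-Nichols model with three groups and an among-group fraction of 0.09 (chosen to echo the 9% figure, ignoring the populations-within-groups level), and every name in it (`simulate_group_freqs`, `assignment_accuracy`, `FST`, and so on) is made up for illustration. It compares how often an individual is assigned to the correct group from a single locus versus from many loci combined.

```python
import math
import random


def simulate_group_freqs(n_loci, n_groups, fst, rng):
    """Balding-Nichols model (an assumption, not the paper's method): each
    group's allele frequency at a locus is a Beta draw centred on a shared
    ancestral frequency, with spread controlled by fst."""
    scale = (1.0 - fst) / fst
    freqs = []
    for _ in range(n_loci):
        p = rng.uniform(0.1, 0.9)  # ancestral frequency; range is arbitrary
        row = []
        for _ in range(n_groups):
            f = rng.betavariate(p * scale, (1.0 - p) * scale)
            row.append(min(max(f, 1e-3), 1.0 - 1e-3))  # keep log() finite
        freqs.append(row)
    return freqs  # freqs[locus][group]


def assignment_accuracy(freqs, loci, n_individuals, rng):
    """Draw diploid genotypes for random individuals and assign each one to
    the group with the highest likelihood over the chosen loci."""
    n_groups = len(freqs[0])
    correct = 0
    for _ in range(n_individuals):
        true_group = rng.randrange(n_groups)
        loglik = [0.0] * n_groups
        for locus in loci:
            p_true = freqs[locus][true_group]
            # Genotype = number of copies of the allele (0, 1, or 2).
            g = (rng.random() < p_true) + (rng.random() < p_true)
            for k in range(n_groups):
                p = freqs[locus][k]
                loglik[k] += g * math.log(p) + (2 - g) * math.log(1.0 - p)
        best = max(range(n_groups), key=loglik.__getitem__)
        if best == true_group:
            correct += 1
    return correct / n_individuals


rng = random.Random(0)
FST = 0.09      # illustrative among-group fraction
N_GROUPS = 3    # illustrative number of groups
N_LOCI = 500    # illustrative number of loci

freqs = simulate_group_freqs(N_LOCI, N_GROUPS, FST, rng)

# Average accuracy when only one locus is used, over the first 50 loci.
acc_single = sum(
    assignment_accuracy(freqs, [locus], 200, rng) for locus in range(50)
) / 50

# Accuracy when all loci are combined.
acc_many = assignment_accuracy(freqs, range(N_LOCI), 200, rng)

print(f"single locus: {acc_single:.2f}  all loci: {acc_many:.2f}")
```

Under these assumptions, a single locus should assign individuals only somewhat better than chance (1/3 for three groups), while a few hundred loci combined should assign them almost always. A PCA also pools information across many loci, which is one reason its plot can show separated clusters even when each locus varies mostly within groups.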
Here are my current explorations:
1. Scale the diagram so that variation on the plot is proportional to variation accounted for by each of the first two principal components. (I also rotated it because I like the convention of the major axis of variation being left to right.)
2. Consider, as a thought experiment, groups separated on the first axis. That is, three subdivisions of human genetic diversity: African (red); Europe + Middle East + most Central/South Asia + Oceania (all in one group: brown, green, light blue, and deep blue); and East Asia + America + some Central/South Asia (gold, purple). Then choose some place on the second PC as if the variation in that direction were all the variation not accounted for by the grouping (rather than actually only 3/5 of it). There’s a lot of overlap among the three groups for any position with PC2 > 0.2. This shows how a randomly selected gene (or combination of genes) not captured by the first axis won’t be a reliable basis for separating groups; members of two or more groups will share that gene.
3. Now consider groups separated by using both PC axes and imagine choosing a gene (or combination of genes) along the direction of the remaining variation (20% of the total). Again, a randomly selected gene (or combination of genes) not captured by the first two axes won’t be a reliable basis for separating groups; members of two or more groups will share that gene.
4. Granted, a randomly selected gene (or combination of genes) captured by the first two axes will do OK in separating groups. However, if the trait we are concerned with involves many genes (not to mention environmental factors interacting over a developmental sequence), we will expect it to be difficult to link differences between any two individuals in different groups to genetic differences.
5. Of course, IF the genes that do allow us to separate groups (either in an ancestry method or using PCA) had been the focus of natural selection in divergent environments, then the separation on the ancestry tree or PCA plot would mean something. Is there any evidence for that? Indeed, what would be required to establish evidence for that?
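Returning to step 1 above, the rescaling can be sketched in a few lines. This is only one reading of that step: it assumes the plotted PC scores have been normalized per axis, and it stretches each axis by the square root of its variance fraction so that the variance (not the width) along each axis is proportional to the variation that component accounts for. The function name `rescale_scores`, the example scores, and the fractions 0.6 and 0.4 are placeholders, not values taken from Li et al.

```python
import math
import statistics


def rescale_scores(scores, variance_fractions):
    """Standardise each PC column, then stretch it by sqrt(variance fraction)
    so that the plotted variance along each axis is proportional to the
    fraction of variation that component accounts for."""
    columns = []
    for j, fraction in enumerate(variance_fractions):
        col = [row[j] for row in scores]
        mean = statistics.mean(col)
        sd = statistics.pstdev(col)
        stretch = math.sqrt(fraction)
        columns.append([(x - mean) / sd * stretch for x in col])
    # Reassemble the columns into one (PC1, PC2, ...) tuple per individual.
    return [tuple(col[i] for col in columns) for i in range(len(scores))]


# Hypothetical PC scores for four individuals (not real data).
example_scores = [(-1.2, 0.4), (0.3, -0.9), (1.1, 0.8), (-0.2, -0.3)]
# Placeholder variance fractions for PC1 and PC2 (NOT figures from the paper).
example_fractions = (0.6, 0.4)

scaled = rescale_scores(example_scores, example_fractions)
var_pc1 = statistics.pvariance([pt[0] for pt in scaled])
var_pc2 = statistics.pvariance([pt[1] for pt in scaled])
print(var_pc1 / var_pc2)  # ratio of plotted variances, PC1 to PC2
```

If the scores come straight from a PCA without per-axis normalization, their variances are already proportional to the variation each component accounts for, and it is enough to draw both axes at the same scale (in matplotlib, for example, `ax.set_aspect("equal")`).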
Li, J. et al. (2008). Worldwide Human Relationships Inferred from Genome-Wide Patterns of Variation. Science 319: 1100-1104.