A NYTimes article on predicting effective tweets provides some nice examples of abstract points I often make about statistical analysis. Patterns found in data, which are often formulated in predictive regression equations that look like and are talked about like they are about causes (even though “we all know that correlation is not causation”), suggest policies or practices that might get enacted. When patterns are seen or used that way, what caused the patterns can be misunderstood, or the underlying dynamics that generated the original data, and thus the patterns, can be altered.

# Tag Archives: data analysis

# Heterogeneity and Data Analysis (Days 11 and 12 of Learning road trip)

On days 11 & 12 we were hosted by Prof. Di Cook and her Research Group at Iowa State University in Ames, IA. The reason I wanted to learn more about the group’s dynamic graphical approaches to data analysis is conveyed in the start of the discussion paper I prepared for them (previously posted on this blog).

My interest in visual exploration of large data sets traces back to my studies and first research job in multivariate “pattern analysis” in ecology and agriculture (in Australia in the mid-1970s). Conversations among the plant breeders I worked with were lively when they saw the plots I generated for them, much less so when I showed them analyses of variance and other numerical output. Although I have since strayed from my quantitative roots (I am now more of a sociologist and philosopher of science than a data analyst), I remain very interested in ways that people push the limits of conventional quantitative methods. The theme of people addressing or suppressing heterogeneity runs through my studies these days of what researchers do (or don’t do) in social epidemiology, population health, and quantitative genetics. In this vein, I see the various tools of interactive and dynamic graphics for data analysis as ways to address heterogeneity, in the sense of teasing apart homogeneous components of a (heterogeneous) mixture so that separate kinds of explanations can be formulated for the separate components. Traditional statistical analysis allows itself to be confounded by the mixture of patterns or structure in a given data set. In this spirit, Cook and Swayne (2007, 13) quote Buja (1996) approvingly: “Non-discovery is the failure to identify meaningful structure… [T]he fear of non-discovery should be at least as great as the fear of false discovery.”

During the two days we sat in on two classes, led a discussion of the paper, and met one-on-one with many members of the research group. The bottom line is that I would say that my hypothesis about “the various tools of interactive and dynamic graphics for data analysis [being] ways to address heterogeneity” was confirmed by these interactions. Except my host disagrees with that conclusion. Obviously more to think and talk about.

Some other things I took away:

A statistician, Di Cook, co-teaching with an English department member, Charles Kostelnick (http://engl.iastate.edu/directory/chkostel), on interpreting and designing visual communication.

Visual hypothesis testing: Suppose that the treatment is not having a systematic effect. If you scramble the allocation of treatment labels and cannot see a difference between the scrambled plots and the plot of the original data (that is, in a “line up”), then the patterns shown in the original data are just something you’d see by scrambling. But if you can pick out the original, then there is a systematic basis to the patterns in the original data (see http://jonathanstray.com/papers/wickham.pdf for an extended discussion).
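The line-up idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the group's software: the function name `make_lineup`, the 20-panel default, and the data are all hypothetical. The sketch builds the data for each panel (one panel with the observed labels hidden among panels with scrambled labels) and leaves the plotting to whatever graphics tool is at hand.

```python
import random

def make_lineup(values, labels, n_panels=20, seed=0):
    """Build the data for a line-up: one panel keeps the observed labels,
    every other panel gets a random scrambling of those labels.
    Returns (panels, true_position)."""
    rng = random.Random(seed)
    true_position = rng.randrange(n_panels)  # where the real data is hidden
    panels = []
    for i in range(n_panels):
        panel_labels = list(labels)
        if i != true_position:
            rng.shuffle(panel_labels)  # null panel: labels carry no information
        panels.append(list(zip(values, panel_labels)))
    return panels, true_position

# Hypothetical measurements for six treated and six control units.
values = [5.1, 6.3, 5.8, 6.9, 6.1, 5.7, 4.2, 4.9, 4.4, 5.0, 4.6, 4.1]
labels = ["treated"] * 6 + ["control"] * 6

panels, true_position = make_lineup(values, labels)
# Plot each panel (not shown) and ask a viewer who has not seen the data
# to pick the odd one out before revealing true_position.
```

With 20 panels, a viewer who is merely guessing picks the real panel with probability 1/20, which is what gives the visual judgment the flavor of a test.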

*Beautiful Visualization*, ed. J. Steele et al., especially the chapter on the process of developing a visualization of a Wikipedia entry.

Information Visualization (InfoVis) is a large, well-rewarded field now. Practitioners combine stats, computing, and visualization (better than they did 20 years ago).

The discussion mode worked well for the large group of participants (primarily graduate students): *I will give a brief introduction, then participants take turns, say 5 minutes each, to relate how the paper intersects with or stimulates their own thinking (while the author stays quiet, listening). I join in at the end. This approach means that the emphasis is on participants teasing out their own thinking more than on digging into what the author thinks.* The participants all found some piece of their work to air in the session. (Given that I didn’t appreciate all the points or their significance, it would have been better if this was the beginning of extended interactions with these people, but I hope some of them follow up amongst themselves. And, maybe, some of them will try the format for discussions that they lead one day.)

(back to Start of road trip; forward to Days 5-15)

# Heterogeneity and Data Analysis: Coda: Heterogeneity and Control

Several of the vignettes speak to a broad contention I would make about heterogeneity and control: In relation to modern understandings of heredity and development over the life course, research and application of resulting knowledge are untroubled by heterogeneity to the extent that populations are well controlled. Such *control* can be established and maintained, however, only with considerable effort or *social infrastructure*, which invites more attention to possibilities for *participation* instead of control of human subjects. On the control side, people can be made to fit types in many ways: through stereotyping, screening and surveillance, population health measures, diagnostic manuals in psychology, reassignment surgery, ignoring non-conformers, and so on. On the participation side, Taylor (2005) describes diagramming of intersecting processes to expose multiple points of engagement, “mapping” by researchers of the complex situations they study and their own complex situatedness, and well-facilitated participatory processes.

Does the contention about heterogeneity and control make sense in data analysis? Does it have relevance beyond heredity and life course development?

—

*(completing a series of posts—see first post)*

# Heterogeneity and Data Analysis: Heterogeneity #4, Deviation from the type -> #10, Participatory restructuring of the dynamics that generated the data

(From an unfinished 2008 thought-piece)

While preparing to teach a course on epidemiology for non-specialists I did a web search for a simple teaching example on the t-test for comparing the means (averages) of two groups for some measurement. The first example I found compared the mean productivity for two groups of workers, one group of 40 workers averaging 4.8 (in some unspecified units) with a standard deviation of 1.2 and the other group of 45 averaging 5.2 with a standard deviation of 2.4. Thinking about this example led me to articulate the sequence of thoughts and questions that follow about the foundations of statistical analysis. In particular, my inquiry explores contrasts between: the statistical emphasis on averages or types around which there is variation or noise; variation as a mixture of types; the dynamics (or heterogeneous mix of dynamics) that generated the data analyzed; and participatory restructuring of these dynamics in the future. A key issue is who is assumed to be able to take action (who are the “agents”) and who are the subjects that follow directions given by others.
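For readers who want to see the arithmetic behind such an example, here is a minimal sketch that computes a two-group t statistic from the summary figures just quoted. It assumes the unequal-variance (Welch) form of the test; the original teaching example may well have used a different variant, and the function name `welch_t` is mine.

```python
import math

def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic and approximate degrees of freedom for two groups,
    computed from summary statistics (Welch's unequal-variance form)."""
    se2_1 = sd1 ** 2 / n1  # squared standard error of the first mean
    se2_2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (mean2 - mean1) / math.sqrt(se2_1 + se2_2)
    df = (se2_1 + se2_2) ** 2 / (se2_1 ** 2 / (n1 - 1) + se2_2 ** 2 / (n2 - 1))
    return t, df

# Summary figures from the worker-productivity example.
t, df = welch_t(4.8, 1.2, 40, 5.2, 2.4, 45)
print(f"t = {t:.2f} on about {df:.0f} degrees of freedom")
```

The statistic is simply the difference between the two averages divided by a measure of the spread around them, which is the comparison the rest of this piece keeps returning to.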

…[Basic sections on t-test omitted here]

3. There is something else I didn’t yet mention: in the original example there was actually only one workplace—the first group in the example is made up of workers measured on one day; the second group is made up of workers measured on a later day when the music was playing. The different size of the groups is simply related to different numbers of missing measurements on the two days. We could, therefore, look at the change in productivity for individual workers who were measured on both days. Suppose that we go back to the first example and find that this change averaged 0.5 with a standard deviation of 1.3 for the 36 workers measured on both days (Figure 2). The chance of a mean difference of this size if the workers actually came from the same population—that is, if music playing had no systematic effect on individuals’ productivity, whether good or bad—is 0.01… Given that the mean difference is positive, again the obvious thing to do is for the employer to play the music.

4. Yet, given that the mean difference is 0.5 and the standard deviation is 1.3, there must be many individuals who show a negative difference, that is, whose productivity declined when music was playing. In fact, this was the case for 12 of the 36 (see Figure 2). Should they oppose the playing of music, even though they are in the minority? If they do, should the employer ignore their opposition given that the firm’s average individual productivity increases? Does the employer have the power to ignore any opposition? If so, the employer’s power to switch on the music comes at the expense of one third of the workforce. In effect, the employer treats them as part of a music-enhances-productivity population, even though they don’t fit this type.

5. The employer, faced with competition from other firms and cognizant of obligations to shareholders, might justify playing music by pointing to the increase in average productivity of the workers, which translates into an increase in overall productivity of the firm. There are, however, other paths to higher overall productivity that the employer could consider. The employer might start by asking individuals in the minority why their productivity decreased when the music played. Suppose it turned out that the tasks of those whose productivity decreased required greater concentration than the tasks of their fellow workers, or that the music chosen is not to their liking. The employer might then rearrange the workplace so that music was not played in areas where workers had to concentrate hard. Or, using headphones linked to airplane-style audio-systems, individual workers might choose from a selection of musical styles. Once the employer starts consulting individual workers, the employer might go on to ask individuals whose productivity increase was well above the mean increase to explain why. It might turn out, for example, that the music countered the tedium of their work and made them less likely to take extended bathroom breaks. By learning about the different individuals, the employer is able, in effect, to divide the range of individuals into a set of types in relation to working when music is playing. Actions taken by the employer can then be customized accordingly. Such actions might even lead to a higher overall productivity for the firm than switching on music for all. Of course, switching on music for all is simpler and probably less expensive, but it is a matter of empirical investigation whether the firm’s net profit would increase more through the customized changes or the simpler one-size-for-all action.

6. There are other things to consider about the one-size-for-all action by the employer. It keeps our focus on productivity in relation to playing music or not, and thereby keeps attention away from the dynamics (or mechanisms or causal connections) through which factors in addition to music influence productivity. We are left to hope that whatever the dynamics are, the addition of music does not lead to any long-term shifts in them. In other words, whatever dynamics generated the data we analyze, we assume that these same dynamics continue into the future even after playing music is added to them. Perhaps, however, a number of workers, including even some who like music, react negatively to the employer exerting the power to pipe in music, worrying, say, that this opens the door to advertising, anti-union messages, and so on. Moreover, to some extent, a similar assumption about the continuation of past dynamics underlies the customized actions. For example, if headphones were used so as to allow choice of music, would the quality of intra-office communication continue as before? However, there is one difference between the one-size-for-all and customized actions. The latter, by acknowledging the range of circumstances underlying the increases and decreases in individuals’ productivity, opens the door to further attention to the dynamics through which factors in addition to music influence productivity. Of course, much more data is needed to investigate these dynamics and the employer might judge as unwarranted the cost of collecting and analysing the data and acting on any results.

7. Imagine, however, an employer who consults workers, acknowledges the range of circumstances influencing productivity, and worries about whether past dynamics continue even after an intervention (here: switching on music) into them. These steps open the door to the employer mobilizing the workers in a participatory planning process. Skilful facilitators can lead participants through processes that elicit diverse items of knowledge about the current circumstances, generate novel proposals for improvement, and ensure that the participants are invested in collaborating to bring the resulting plans to fruition (Stanfield 2002). If this collaborative change happens, it would matter less whether the past dynamics continued as before because the workers would have become agents in the ongoing assessment and reorganization of their work lives. Moreover, improvement in productivity could result from plans unrelated to the initial issue about having music played. Of course, this scenario assumes that the employer and workers can all be brought together and kept interacting despite differences and tensions until plans are developed in which all are invested…
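The arithmetic behind items 3 and 4 can be sketched as follows, using the figures given in item 3 (mean change 0.5, standard deviation 1.3, 36 workers). The sketch assumes a paired t statistic and, for the share of workers whose productivity declined, that the individual changes are roughly normally distributed. Both are my assumptions for illustration; the function names are hypothetical.

```python
import math

def paired_t(mean_diff, sd_diff, n):
    """t statistic for the mean of n within-worker changes."""
    return mean_diff / (sd_diff / math.sqrt(n))

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

mean_diff, sd_diff, n = 0.5, 1.3, 36

t = paired_t(mean_diff, sd_diff, n)

# Share of workers expected to show a negative change if the changes
# were normally distributed with this mean and standard deviation.
share_negative = normal_cdf((0.0 - mean_diff) / sd_diff)
expected_decliners = n * share_negative

print(f"t = {t:.2f}; expected decliners = {expected_decliners:.1f} of {n}")
```

The point of the second calculation is that an average gain that passes a test on the mean can sit alongside a sizable minority moving in the opposite direction, which is in line with the 12 of 36 reported in item 4.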

*(continuing a series of posts—see first post; see next post)*

# Heterogeneity and Data Analysis: Heterogeneity #4, 6, 7, 9

Heterogeneity #4, Deviation from the type or essential trajectory -> Heterogeneity #6: Variation, not types -> Heterogeneity #9. Heterogeneity in pathways of development

see post on fluoridation

Heterogeneity #7, Possibility of “underlying heterogeneity”

Different kinds or combinations of factors are involved in what is deemed the same response. The challenge is to expose the factors and the ways they contribute to the response in question, if that is possible.

• Consider the height a high jumper jumps. The athlete may use the classical approach to the jump and movements in the air or those of the Fosbury flop.

see post on twin studies

Heterogeneity #9, Heterogeneity in pathways of development -> potential for #11, Participatory restructuring through multiple points of engagement

see post on PKU: Responding to genetic conditions requires social infrastructure

*(continuing a series of posts—see first post; see next post)*

# Heterogeneity and Data Analysis: Heterogeneity #4, Deviation from the type or essential trajectory

Of course, statistical analysis involves more than t-tests and their generalizations. Correlation and regression are another mainstay. Here, however, the emphasis lies more on prediction than variation, as if, as a generalization of the emphasis in t-tests on types, the line or curve of prediction captured the *essential trajectory* of the data (McLaughlin 1989). (Of course, everyone knows that correlation is not causation, but most of us interpret regressions in a causal spirit.) The following excerpt from Taylor (2008; see http://bit.ly/osTjQ3) highlights an alternative view of correlation and regression that keeps our attention on the variation (also discussed in a series of posts):

Consider the concept of a regression line as a best predictor line. To predict one measurement from another is to hint at, or to invite, causal interpretation. Granted, if we have the additional information that the second measurement follows the first in time (as is the case for offspring and parental traits), a causal interpretation in the opposite direction is ruled out. But there is nothing about the association between correlated variables, whether temporally ordered or not, that requires it to be assessed in terms of how well the first *predicts* the second (let alone whether the predictions provide insight about the causal process). After all, although this is rarely made clear to statistics students, the correlation is not only the slope of the regression line when the two measurements are scaled to have equal spread, but it also measures how tightly the cloud of points is packed around the line of slope 1 (or slope -1 for a negative correlation). Technically, when both measurements are scaled to have a standard deviation of 1, the average of the squared perpendicular distance from the points to the line of slope 1 or -1 is equal to 1 minus the absolute value of the correlation (Weldon 2000). This means that the larger the correlation, the tighter the packing. This tightness-of-packing view of correlation affords no priority to one measurement over the other. Whereas the typical emphasis in statistical analysis on prediction often fosters causal thinking, a non-directional view of correlation reminds us that additional knowledge always has to be brought in if the patterns in data are used to support causal claims or hypotheses.
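The tightness-of-packing identity can be checked numerically. The sketch below is my own illustration, not taken from Taylor (2008) or Weldon (2000): the data are simulated, the function names are hypothetical, and standard deviations are computed with divisor n so that the identity holds exactly rather than approximately.

```python
import math
import random

def standardize(xs):
    """Rescale to mean 0 and standard deviation 1 (divisor n)."""
    n = len(xs)
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / sd for x in xs]

def correlation(zx, zy):
    """Correlation of two already-standardized measurements."""
    return sum(a * b for a, b in zip(zx, zy)) / len(zx)

def mean_sq_perp_distance(zx, zy, slope):
    """Average squared perpendicular distance from the points to the line
    through the origin with slope +1 or -1."""
    return sum((b - slope * a) ** 2 / 2 for a, b in zip(zx, zy)) / len(zx)

# Simulated, positively correlated measurements (hypothetical data).
rng = random.Random(1)
x = [rng.gauss(0, 1) for _ in range(500)]
y = [0.6 * a + rng.gauss(0, 1) for a in x]

zx, zy = standardize(x), standardize(y)
r = correlation(zx, zy)
slope = 1 if r >= 0 else -1
packing = mean_sq_perp_distance(zx, zy, slope)

print(f"r = {r:.3f}, mean squared distance = {packing:.3f}, 1 - |r| = {1 - abs(r):.3f}")
```

Swapping `x` and `y` leaves both `r` and `packing` unchanged, which is the sense in which this view gives no priority to one measurement over the other.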

[Postscript: The tightness of packing view of regression for continuous variables can be extended to multivariate associations through Principal Component Analysis, factor analysis, etc. The well-known difficulty of interpreting principal components or the factors can be flipped on its head: What causal assumptions about *independent* variables (i.e., independently modifiable variables) enter into interpretations of conventional regression analysis?]

*(continuing a series of posts—see first post; see next post)*

# Heterogeneity and Data Analysis: Heterogeneity #4, Deviation from the type

Statistical analysis rests on the simplest heterogeneity, namely, variation around a mean. In this light, I tell education students who will not be taking a statistics course that they should:

Understand the simple chain of thinking below, then enlist or hire a statistician who will use the appropriate recipe for the data at hand.

1. There is a population of individuals. (Population = individuals subject to the same causes of interest. In addition to these foreground causes, there may also be background, non-manipulatable causes that vary among these individuals.)

2. Variation: For some measurable attribute, the individuals have varying responses to these causes (possibly because of the background causes).

3. You have observations of the measurable attribute for two or more subsets (samples) of the population.

4. Central question of statistical analysis: Are the subsets sufficiently different in their varying responses that you doubt that they are from the one population (i.e., you doubt that they are subject to all the same foreground causes)? Statisticians answer this question with recipes that are variants of a comparison between the subset averages in relation to the spread around the averages. For the figure below, the statisticians’ comparison means that you are more likely to doubt that subsets A and B are from the same population in the left hand situation than in the right hand one.

5. If you doubt that the subsets are from the same population, investigate further, drawing on other knowledge about the subsets. You hope to expose the causes involved and then take action informed by that knowledge about the cause.

Variation around a mean is not a strong sense of heterogeneity. The emphasis above is on the means (the circles) more than the variation (the dashed curves). Statistical analysis distinguishes types (or decides they are not distinguishable) more than it explores the variation (or error, i.e., deviation from type). Data amenable to a t-test are, however, open to alternative explorations, as illustrated by the final vignette in this series of posts.
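The chain of thinking above, in which subset averages are compared in relation to the spread around them, can be made concrete with a small sketch. The measurements are invented for illustration, and the unequal-variance form of the t statistic is my choice rather than a recipe prescribed in the text.

```python
import math
import statistics

def t_statistic(a, b):
    """Difference between two subset averages, scaled by the spread
    around those averages (unequal-variance form)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    return (statistics.mean(b) - statistics.mean(a)) / se

# Hypothetical measurements: the subset averages (10 and 12) are the same
# in both situations; only the spread around them differs.
tight_a = [9.8, 10.0, 10.2, 9.9, 10.1]
tight_b = [11.8, 12.0, 12.2, 11.9, 12.1]
wide_a = [6.0, 14.0, 10.0, 8.0, 12.0]
wide_b = [8.0, 16.0, 12.0, 10.0, 14.0]

t_tight = t_statistic(tight_a, tight_b)
t_wide = t_statistic(wide_a, wide_b)

print(f"tight spread: t = {t_tight:.1f}; wide spread: t = {t_wide:.1f}")
```

The same difference in averages yields a much larger statistic when the spread is tight, so you would be more inclined to doubt that the two subsets come from one population in that situation than in the wide-spread one.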

*(continuing a series of posts—see first post; see next post)*