On days 11 & 12 we were hosted by Prof. Di Cook and her research group at Iowa State University in Ames, IA. The reason I wanted to learn more about the group’s dynamic graphical approaches to data analysis is conveyed at the start of the discussion paper I prepared for them (previously posted on this blog).
My interest in visual exploration of large data sets traces back to my studies and first research job in multivariate “pattern analysis” in ecology and agriculture (in Australia in the mid-1970s). Conversations among the plant breeders I worked with were lively when they saw the plots I generated for them. Much less so when I showed them analyses of variance and other numerical output. I have since strayed from my quantitative roots—I am now more of a sociologist and philosopher of science than a data analyst—but I remain very interested in ways that people push the limits of conventional quantitative methods. The theme of people addressing or suppressing heterogeneity runs through my current studies of what researchers do (or don’t do) in social epidemiology, population health, and quantitative genetics. In this vein, I see the various tools of interactive and dynamic graphics for data analysis as ways to address heterogeneity, in the sense of teasing apart homogeneous components of a (heterogeneous) mixture so that separate kinds of explanations can be formulated for the separate components. Traditional statistical analysis allows itself to be confounded by the mixture of patterns or structure in a given data set. In this spirit, Cook and Swayne (2007, 13) quote Buja (1996) approvingly: “Non-discovery is the failure to identify meaningful structure… [T]he fear of non-discovery should be at least as great as the fear of false discovery.”
During the two days we sat in on two classes, led a discussion of the paper, and met one-on-one with many members of the research group. The bottom line is that I wouldn’t say that my hypothesis about “the various tools of interactive and dynamic graphics for data analysis [being] ways to address heterogeneity” was confirmed by these interactions. My host, however, disagrees with that conclusion. Obviously there is more to think and talk about.
Some other things I took away:
A statistician, Di Cook, co-teaching a course with an English department member, Charles Kostelnick (http://engl.iastate.edu/directory/chkostel), on interpreting and designing visual communication.
Visual hypothesis testing: suppose the treatment has no systematic effect. Then, if you scramble the allocation of treatment labels and cannot tell the scrambled versions apart from the original data (that is, in a “line-up”), the patterns shown in the original data are just something you’d see by scrambling. But if you can pick out the original, then there is a systematic basis to the original data (see http://jonathanstray.com/papers/wickham.pdf for an extended discussion).
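The line-up idea above can be sketched in a few lines of code. This is only an illustration under assumptions of my own: the data are simulated, and I use a numerical stand-in (the gap between group means) for what a viewer’s eye would judge from the actual plots in a real line-up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated experiment: 50 observations per group,
# with the treatment shifting the mean upward
control = rng.normal(0.0, 1.0, 50)
treated = rng.normal(2.0, 1.0, 50)
values = np.concatenate([control, treated])
labels = np.array(["control"] * 50 + ["treated"] * 50)

def group_gap(values, labels):
    # Stand-in for the pattern a viewer would see in a panel:
    # the absolute difference between the two group means
    return abs(values[labels == "treated"].mean() -
               values[labels == "control"].mean())

# One panel of the original data plus 19 "null" panels in which the
# treatment labels have been scrambled
gaps = [group_gap(values, labels)]  # panel 0: the original data
gaps += [group_gap(values, rng.permutation(labels)) for _ in range(19)]

# If the original panel shows the strongest pattern, a viewer who picks
# it out of the 20-panel line-up is detecting structure that scrambling
# destroys (roughly a 1-in-20 chance under the null)
print(np.argmax(gaps))
```

With a genuine treatment effect, scrambling the labels flattens the group difference, so the original panel stands out; with no effect, the original is indistinguishable from the nulls.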
Beautiful Visualization, ed. J. Steele et al., especially the chapter on the process of developing a visualization of a Wikipedia entry.
Information Visualization (InfoVis) is a large, well-rewarded field now. Practitioners combine stats, computing, and visualization (better than they did 20 years ago).
The discussion mode worked well for the large group of participants (primarily graduate students): I give a brief introduction, then participants take turns, say five minutes each, relating how the paper intersects with or stimulates their own thinking (while the author stays quiet, listening). I join in at the end. This approach means that the emphasis is on participants teasing out their own thinking more than on digging into what the author thinks. The participants all found some piece of their work to air in the session. (Given that I didn’t appreciate all the points or their significance, it would have been better if this were the beginning of extended interactions with these people, but I hope some of them follow up amongst themselves. And, maybe, some of them will try the format for discussions that they lead one day.)