At the CHI conference last week, Max Wilson organized the RepliCHI workshop (I helped a little, but really it was Max). There were a bunch of presentations reflecting on the challenges of doing replication work, and lots of discussion about the roles and value of replication in HCI research.
Previously, including in the call for position papers for the workshop, we had offered a taxonomy that included direct replication, replicate+extend, conceptual replication, and applied case studies. Aside from everyone having difficulty assigning particular studies to them, these categories fail to highlight the purposes of replication studies, which I think are:
- testing the validity or generality of previous findings;
- calibrating or testing a measurement instrument;
- teaching an investigator methods or findings in a way that they’ll really remember.
All of the distinctions we seem to want to make in a taxonomy of replication types revolve around what differs between the replicating study and the original, and what inferences the investigator will make. So let’s tackle those separately.
What Changes and What Stays the Same
The ideal of a “direct” replication can never be realized. Some follow-on studies are closer to the original than others, but there is always some difference on at least one of the following dimensions:
- People whose behavior is observed and/or manipulated. Examples would be switching from Korean to U.S. subjects, or college students to Mechanical Turkers.
- Time when the behavior occurs. In HCI, the phenomena we are studying are changing frequently as technologies change and social norms evolve.
- Context in which the behavior occurs, including experimental procedures for experiments or variations in physical and social environment or tasks for naturally occurring behavior. For example, at the workshop, a couple of the presentations involved experiments with either a search task or reading news articles, where it did not make sense to reuse the original tasks because of the passage of time. Another example would be moving from lab to field setting.
Inferences from Confirmatory Results
If the replicating study results match the original, we infer that the original result is reliable and generalizes to the new context. Generally, the more that has changed between original and replicating study, the greater the evidence the new study provides for robustness of the original finding (provided they match). There is always the possibility, however, that multiple differences had effects that canceled each other out, rather than none of them changing the results. Some people at the workshop, notably Noam Tractinsky, argued strongly for changing things in small increments to avoid that possibility.
If there was some reason to doubt the robustness (reliability or generality) of the original findings, and those findings have important implications for theory or practice, then a confirmatory result may be a significant contribution to the field. No further “extension” or follow-on experiment is necessary. Sometimes, however, the test of robustness will be more impressive if it is conducted several times with somewhat differing populations or conditions. Several workshop participants were concerned that some CHI reviewers and ACs do not value studies whose main contribution is to check the robustness of a previous result. For example, Bonnie John was concerned that reviewers for a paper this year (not hers) had understood and appreciated the contribution of a submission very clearly but had rejected it because it only confirmed a previous finding, even though the reviewers acknowledged and agreed with the authors’ argument that there were reasons to doubt the previous finding. Perhaps the underlying problem is that the CHI community is insufficiently skeptical of results that have been published at the conference. I think it’s fine that we publish studies, ethnographic and experimental, with small samples, but we should be far more skeptical than we are of the robustness of the findings from those studies.
If the measurement instruments (survey items; eye tracking device; counts of Facebook likes; qualitative coding categories) have changed, then from matching results in the replication we also infer that the instruments are valid. This is the main reason to replicate first in the replicate+extend paradigm. In this case, the results of a follow-up study using the calibrated instruments are usually the main contribution of the research. In some situations, however, simply testing the robustness of a measurement instrument (e.g., a shorter version of a questionnaire scale trying to measure the same construct previously measured with a larger number of questions) may be a valuable methodological contribution.
Most workshop participants felt that research contributions following the replicate+extend model are well appreciated by CHI reviewers. One of the presentations at the workshop described a project that followed this model, and it had been accepted for publication/presentation at the conference. I wish, however, that reviewers would be more critical of papers that omit the instrument calibration phase. The problem here is not rejection of studies that include replication but acceptance of papers that do not.
Inferences from Contradictory Results
If the replicating study results contradict the original results, we do not know exactly why; any of the things that differ between original and replicating studies could be the cause, as could errors in conducting the original study or reporting its results. The more that has changed between the original and replicating study, the less we learn from contradictory results. This is another reason for some investigators to prefer changing things in small increments. Generally, contradictory results require follow-on studies to try to refine the original results by better characterizing the conditions under which they apply. That’s the situation we find ourselves in with the NewsCube replication study that we presented at the workshop and as a work-in-progress paper at the main conference.