Reasons for Replication in HCI Research

At the CHI conference last week, Max Wilson organized the RepliCHI workshop (I helped a little, but really it was Max). There were a bunch of presentations reflecting on the challenges of doing replication work, and lots of discussion about the roles and value of replication in HCI research.

Previously, including in the call for position papers for the workshop, we had offered a taxonomy that included direct replication, replicate+extend, conceptual replication, and applied case studies. Aside from the difficulty everyone had in assigning particular studies to these categories, the categories fail to highlight the purposes of replication studies, which I think are:

  1. testing the validity or generality of previous findings;
  2. calibrating or testing a measurement instrument;
  3. teaching an investigator methods or findings in a way that they’ll really remember.

All of the distinctions we seem to want to make in a taxonomy of replication types revolve around what differs between the replicating study and the original, and what inferences the investigator will make. So let’s tackle those separately.

What Changes and What Stays the Same

The ideal of a “direct” replication can never be realized. Some follow-on studies are closer to the original than others, but there is always some difference on at least one of the following dimensions:

  • Investigator.
  • People whose behavior is observed and/or manipulated. Examples would be switching from Korean to U.S. subjects, or college students to Mechanical Turkers.
  • Time when the behavior occurs. In HCI, the phenomena we are studying are changing frequently as technologies change and social norms evolve.
  • Context in which the behavior occurs, including experimental procedures for experiments or variations in physical and social environment or tasks for naturally occurring behavior. For example, at the workshop, a couple of the presentations involved experiments with either a search task or reading news articles, where it did not make sense to reuse the original tasks because of the passage of time. Another example would be moving from lab to field setting.

Inferences From Confirmatory Results

If the replicating study’s results match the original’s, we infer that the original result is reliable and generalizes to the new context. Generally, the more that has changed between the original and replicating study, the stronger the evidence a matching result provides for the robustness of the original finding. There is always the possibility, though, that multiple differences had effects that canceled each other out, rather than none of them changing the results. Some people at the workshop, notably Noam Tractinsky, argued strongly for changing things in small increments, to avoid that possibility.

If there was some reason to doubt the robustness (reliability or generality) of the original findings, and those findings have important implications for theory or practice, then a confirmatory result may be a significant contribution to the field. No further “extension” or follow-on experiment is necessary. Sometimes, however, the test of robustness will be more impressive if it is conducted several times with somewhat differing populations or conditions. Several workshop participants were concerned that some CHI reviewers and ACs do not value studies whose main contribution is to check the robustness of a previous result. For example, Bonnie John was concerned that reviewers this year had clearly understood and appreciated the contribution of a submission (not hers) but had rejected it because it only confirmed a previous finding, even though the reviewers acknowledged and agreed with the authors’ argument that there were reasons to doubt that finding. Perhaps the underlying problem is that the CHI community is insufficiently skeptical of results that have been published at the conference. I think it’s fine that we publish studies, ethnographic and experimental, with small samples, but we should be far more skeptical than we are of the robustness of the findings from those studies.

If the measurement instruments (survey items; eye-tracking device; counts of FB likes; qualitative coding categories) have changed, then from matching results in the replication we also infer that the instruments are valid. This is the main reason to replicate first in the replicate+extend paradigm. In this case, the results of a follow-up study using the calibrated instruments are usually the main contribution of the research. In some situations, however, simply testing the robustness of a measurement instrument (e.g., a shorter version of a questionnaire scale that tries to measure the same construct previously measured with a larger number of questions) may be a valuable methodological contribution.

Most workshop participants felt that research contributions following the replicate+extend model are well appreciated by CHI reviewers. One of the presentations at the workshop described a project that followed this model, and it had been accepted for publication/presentation at the conference.  I wish, however, that reviewers would be more critical of papers that omit the instrument calibration phase. The problem here is not rejection of studies that include replication but acceptance of papers that do not.

Inferences from Contradictory Results

If the replicating study’s results contradict the original’s, we do not know exactly why; any of the things that differ between the original and replicating studies could be the cause, as could errors in conducting or reporting the initial study. The more that has changed between the original and replicating study, the less we learn from contradictory results. This is another reason some investigators prefer changing things in small increments. Generally, contradictory results require follow-on studies to refine the original results by better characterizing the conditions under which they apply. That’s the situation we find ourselves in with the NewsCube replication study that we presented at the workshop and as a work-in-progress paper at the main conference.

Posted in Uncategorized | Leave a comment

Personalized Filters Yes; Bubbles No

On Thursday, I gave the closing keynote for the UMAP conference (User Modeling and Personalization) in Girona, Spain (Catalunya). I had planned to talk about my work directed towards creating balanced news aggregators that people will prefer to use over unbalanced ones (see project site). But then Eli Pariser’s TED Talk and book on “Filter Bubbles” started getting a lot of attention. He’s started a trend of a little parlor game in which a group of friends all try the same search on Google and look on in horror when they see that they get different results. So I decided to broaden my focus a little beyond news aggregators. I titled the talk, “Personalized Filters Yes; Bubbles No.”

As you can perhaps guess from my title, I agree with some of Pariser’s concerns about bubbles. But I think he’s on the wrong track in attributing those concerns to personalization. Most of his concerns, I argue, come from badly implemented personalization, not from personalization itself. I’ve posted a copy of my slides and notes. For anyone who wants a summary of my arguments, here goes.

His first concern I summarize as “Trapped in the Old You”. I argued that personalization systems that try to maximize your long-term clickthrough rates will naturally explore to see if you like things, not just give you more of the same. This is the whole point of the algorithmic work on multi-armed bandit models, for example. Moreover, good personalization systems will take into account that our interests and tastes may change over time, and that there is eventually declining marginal utility for more of the same (consider the 7,000th episode of Star Trek; even I stopped watching). Personalization systems that are designed to optimize some measure of user satisfaction (such as click-throughs or purchases or dwell time or ratings) are going to be designed to give you serendipitous experiences, introducing you to things you like that you didn’t know you would like. Even today’s systems often do that pretty well, in part because when they optimize on matching on one dimension (say, topic) they end up giving us some diversity in another dimension that matters to us (say, political ideology). From introspection, I think most people can recall times when automated personalization systems did introduce them to something new that became a favorite.
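The explore/exploit idea behind that bandit literature can be sketched in a few lines. This is a minimal epsilon-greedy strategy, a much simpler cousin of the algorithms actually used; the category names and clickthrough estimates are purely illustrative:

```python
import random

def epsilon_greedy_pick(estimated_ctr, epsilon=0.1):
    """Pick an item category to show next: usually the one with the
    highest estimated clickthrough rate (exploit), but with
    probability epsilon a uniformly random category (explore)."""
    if random.random() < epsilon:
        return random.choice(list(estimated_ctr))      # explore
    return max(estimated_ctr, key=estimated_ctr.get)   # exploit
```

Even this crude version never gives you only more of the same: a fraction epsilon of recommendations probe categories whose appeal the system is still unsure about.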

His second concern I summarize as, “Reinforcing Your Baser Instincts”. Here, too, good personalization systems should take into account the difference between short-term and long-term preferences (entertainment vs. education, for example). We will need delayed indicators of long-term value, such as measuring which words from articles we end up using in our own messages and blog posts, or explicit user feedback after some delay (the next day or week). It may also be helpful to offer features that people can opt in to that nudge them toward their better selves (their long-term preferences). Here, I gave examples of nudges toward balanced news reading that you can see if you look at the slides.

His third concern I summarize as, “Fragmenting Society”, but there are two somewhat separable sub-elements of this. One is the need for common reference points, so that we can have something to discuss with colleagues at the water cooler or strangers on the subway. Here, I think if individuals value having these common reference points, then it will get baked into the personalized information streams they consume in a natural way. That is, they’ll click on some popular things that aren’t inherently interesting to them, and the automated personalization algorithms will infer that they have some interest in those things. Perhaps better would be for personalization algorithms to try to learn a model that assumes individual utility is a combination of personal match and wanting what everyone else is getting, with the systems learning the right mix of the two for the individual, or the individual actually getting a slider bar to control the mix.
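That last idea, scoring items as a learned or user-controlled mix of personal match and common reference points, amounts to something like the following. This is a hypothetical scoring sketch of my suggestion, not any deployed system’s formula:

```python
def blended_score(personal_match, popularity, mix):
    """Score an item as a convex combination of how well it matches the
    individual (personal_match) and how widely seen it is (popularity).
    mix = 0.0 -> purely personal; mix = 1.0 -> purely what everyone
    else is getting. Both inputs are assumed normalized to [0, 1]."""
    return (1.0 - mix) * personal_match + mix * popularity
```

The system could learn the right `mix` for each person from behavior, or simply expose it as the slider bar mentioned above.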

The second sub-concern is fragmenting of the global village into polarized tribes. Here it’s an open question whether personalization will lead to such polarization. It hinges on whether the network fractures into cliques with very little overlap or permeability. But the small-world properties of random graphs suggest that even a few brokers, or a little brokering by a lot of people, may be enough to keep the average shortest path short. Individual preferences would have to be strongly in favor of insularity within groups in order to get an outcome of real fragmentation. It turns out that people’s preferences with respect to political information, as my former doctoral student Kelly Garrett concluded, are that they like confirmatory information but have at best a mild aversion to challenge. Moreover, some people prefer a mix of challenging and confirmatory information, and everyone wants challenging information sometimes (like when they know they’re going to have to defend their position at an upcoming family gathering). Thus, it’s not clear that personalization is going to lead us to political fragmentation, or any other kind. Other forces in society may or may not be doing that, but probably not personalization. Despite that, I do think it’s a good idea to include perspective-taking features in our personalization interfaces, features that make it easy to see what other people are seeing. My slides include a nice example of this from the ConsiderIt work of Travis Kriplean, a PhD student at the University of Washington.

The final point I’d like to bring up is that personalization broadens the set of things that are seen by *someone*. That means that more things have a chance to get spread virally, and eventually reach a broader audience than would be possible if everyone saw the same set of things. Instead of being horrified by the parlor game showing that we get different search results than our friends do, we should delight in the possibility that our friends will be able to tell us something different.

Overall, we should be pushing for better personalization, and transparent personalization, not concluding that personalization per se is a bad thing.

At the conference banquet the night before my talk, attendees from different countries were invited to find their compatriots and choose a song to sing for everyone else. (The five Americans sang, “This Land is Your Land”). Inspired by that, I decided to compose a song we could all sing together to close the talk and the conference, and which would reinforce some themes of my talk.  The conference venue was a converted church, and the acoustics were great. Many people sang along. The melody is “Twinkle, Twinkle, Little Star”.

(Update: someone captured it on video: )

The Better Personalization Anthem

User models set me free
                as you build the Daily Me

Yes exploit, but please explore
                could just be that I’ll want more

Broaden what my models know
                UMAP scholars make it so

Words: Paul Resnick and Joe Konstan
Melody: Wolfgang Amadeus Mozart


Yelp gets more reviews per reviewer than CitySearch or Yahoo Local

The author attributes it to the fact that reviewers are anonymous at CitySearch and Yahoo Local, but build up reputations on Yelp. Of course, there are also other differences between the sites.

Zhongmin Wang (2010), “Anonymity, Social Image, and the Competition for Volunteers: A Case Study of the Online Market for Reviews,” The B.E. Journal of Economic Analysis & Policy: Vol. 10: Iss. 1 (Contributions), Article 44. Available at: http://www.bepress.com/bejeap/vol10/iss1/art44

Abstract:
This paper takes a first step toward understanding the working of the online market for reviews. Most online review firms rely on unpaid volunteers to write reviews. Can a for-profit online review firm attract productive volunteer reviewers, limit the number of ranting or raving reviewers, and marginalize fake reviewers? This paper sheds light on this issue by studying reviewer productivity and restaurant ratings at Yelp, where reviewers are encouraged to establish a social image, and two competing websites, where reviewers are completely anonymous. Using a dataset of nearly half a million reviewer accounts, we find that the number (proportion) of prolific reviewers on Yelp is an order of magnitude larger than that on either competing site, more productive reviewers on all three websites are less likely to give an extreme rating, and restaurant ratings on Yelp tend to be much less extreme than those on either competing site.


Need recommender systems contest ideas

Do you have an idea or plan for a future challenge/contest that you think could move the field of Recommender Systems forward? I’d love to hear about your idea or plan, even if only in sketch form, and even if you’re not in a position to carry it out yourself. At this year’s RecSys conference in Barcelona, I’ll be moderating a panel titled, “Contests: Way Forward or Detour?” As part of that panel, I’d like to present brief sketches of several contest ideas for the panelists to respond to.

Please send me your ideas!

———————-Abstract of the Session
Panelists:
Joseph A. Konstan, University of Minnesota, USA
Andreas Hotho, University of Würzburg, Germany
Jesus Pindado, Strands, Inc., USA

Contests and challenges have energized researchers and focused attention in many fields recently, including recommender systems. At the 2008 RecSys conference, winners were announced for a contest proposing new startup companies. The 2009 conference featured a panel reflecting on the then recently completed Netflix challenge.

Would additional contests help move the field of recommender systems forward? Or would they just draw attention from the most important problems to problems that are most easily formulated as contests? If contests would be useful, what should the tasks be and how should performance be evaluated? The panel will begin with short presentations by the panelists. Following that, the panelists will respond to brief sketches of possible new contests. In addition to prediction and ranking tasks, tasks might include making creative use of the outputs of a fixed recommender engine, or eliciting inputs for a recommender engine.


Gerhard Fischer paper at C&T

“Towards an Analytic Framework for Understanding and Supporting Peer-Support Communities in Using and Evolving Software Products” at C&T

Participation in SAP’s online community.

Before/After point system introduced in SAP:
Mean response time decreased (51 min vs. 34 min.)
Mean helper count increased (1.89 vs. 2.02)
Percentage answered increased (12% vs. 30%)

Some evidence of gaming of the system—people just ask questions to gain points.


Karim Lakhani at C&T

Karim Lakhani is giving a great keynote at C&T about tracking innovation. He has worked with MATLAB programming contests that have a fascinating format: there’s a clear performance outcome, the source code of all entries is available to other participants, and leaders are tracked. Researchers can track which lines of code get reused.

What leads to displacing the currently leading entry?
–novel code
–novel combos of others’ code
–NOT borrowed code
–complexity
–NOT conformance

What leads code to get reused in future (leading) entries?
–novel code
–novel combos of others’ code
–borrowed code
–complexity
–conformance

Also did experiments with TopCoder.
One experiment with computational biology contest problem.
Three conditions (random assignment?):
Fully collaborative vs. fully competitive vs. mixed (competitive first week, then all code shared)
Fully collaborative got the best performance
Best performing entries did better than state-of-the-art in computational biology


Universally Utility-Maximizing Privacy Mechanisms

Interesting paper presentation by Tim Roughgarden.

He gave a nice introduction to the recent literature on provably privacy-preserving mechanisms for publishing statistical summaries, such as counts of rows from a database satisfying some property (e.g., income > 100,000). Suppose a mechanism computes the actual count, and then reports something possibly different (e.g., by adding noise). A mechanism satisfies p-privacy if, for every possible output (count) and any person (row), the ratio of the probability of that output with the row present to the probability of that output with the row omitted is always in the range [p, 1/p]. Intuitively, whatever the actual count, not much is revealed about whether any particular person has high income.

One technique that works for counts, Laplace-p, is to add noise z to the correct count, where the probability density of z is (1/2)(−ln p)·e^(|z|·ln p) (that is, a Laplace distribution with scale 1/(−ln p)). For any reported count, there’s some confidence interval around it, and the size of that confidence interval is independent of the count. Thus, for reported count 1, you can’t really tell whether the correct count is 1 or 0, and thus you can’t really tell whether a particular person has high income, *even if you have great side information about everyone else in the database*. On the other hand, if the reported count is 27,000, you still can’t tell much about any one person, but you can be pretty sure that the correct count is somewhere around 27,000.
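A minimal sketch of that Laplace-p idea in code, assuming p in (0, 1). The sampling trick here (difference of two exponentials yields a Laplace distribution) is my implementation choice, not anything from the talk:

```python
import math
import random

def laplace_p_count(true_count, p):
    """Report true_count plus Laplace noise calibrated so that, for any
    single row, the ratio of output densities with vs. without that row
    stays within [p, 1/p] (i.e., epsilon = -ln(p) differential privacy)."""
    epsilon = -math.log(p)  # p in (0, 1), so epsilon > 0
    # The difference of two iid Exponential(epsilon) draws is
    # distributed Laplace(0, 1/epsilon).
    noise = random.expovariate(epsilon) - random.expovariate(epsilon)
    return true_count + noise
```

Smaller p (stronger privacy) means larger epsilon is not available; the noise scale 1/(−ln p) grows as p approaches 1, which is exactly the count-independent confidence interval described above.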

Roughgarden’s paper is about how much value you can get from the best count-reporting function (in terms of some loss function comparing the true result to the reported result) while still preserving the p-privacy requirement. It turns out that a mechanism very much like Laplace-p, but discretized, minimizes the expected loss no matter what priors the user of the count has about the distribution of correct counts. It is in this sense that it is universal. This requires a little user-specific post-processing of the algorithm’s output, based on the user’s priors about the correct counts. For example, if the reported count is -1, we know that’s not the correct count; it must really be 0 or something positive, and you can back out from the report and the user’s prior beliefs to infer a belief distribution over correct counts.
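A sketch of such a discretized mechanism, which adds integer noise with probability proportional to p^|z|. The sampling construction (difference of two geometric draws) is my own implementation assumption, not taken from the paper:

```python
import random

def geometric_mechanism(true_count, p):
    """Report true_count plus two-sided geometric (discretized Laplace)
    noise: P(noise = z) is proportional to p ** abs(z), for p in (0, 1)."""
    def geometric(alpha):
        # Count successes of probability alpha before the first failure:
        # P(k) = (1 - alpha) * alpha ** k for k = 0, 1, 2, ...
        k = 0
        while random.random() < alpha:
            k += 1
        return k
    # The difference of two iid one-sided geometric draws is the
    # two-sided geometric distribution.
    return true_count + geometric(p) - geometric(p)
```

Because the noise is integer-valued, the report is itself a (possibly negative) count, which is what makes the user-specific post-processing described above well defined.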
