Thursday, 24 April 2014

What to do when you get results that don't make sense

A few recent conversations, along with this blog post on list experiments by Andrew Gelman, have made me think about the nature of the file drawer problem.

Gelman quotes Brendan Nyhan:
I suspect there’s a significant file drawer problem on list experiments. I have an unpublished one too! They have low power and are highly sensitive to design quirks and respondent compliance as others mentioned. Another problem we found is interpretive. They work best when the social desirability effect is unidirectional. In our case, however, we realized that there was a plausible case that some respondents were overreporting misperceptions as a form of partisan cheerleading and others were underreporting due to social desirability concerns, which could create offsetting effects.

and Lynn Vavreck:
Like the others, we got some strange results that prevented us from writing up the results. Ultimately, I think we both concluded that this was not a method we would use again in the future.
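The two complaints running through these quotes, low power and offsetting misreporting, are easy to see in a quick simulation. The sketch below is illustrative only: the prevalence, misreporting rates, and sample sizes are invented for the example, not taken from any of the studies mentioned. It simulates the standard difference-in-means list-experiment estimator against a direct question, then shows how opposing over- and underreporting can leave the point estimate looking deceptively normal.

```python
import numpy as np

rng = np.random.default_rng(0)

def list_experiment(n_per_arm, prevalence, n_baseline=4,
                    p_overreport=0.0, p_underreport=0.0):
    """One difference-in-means list-experiment estimate.

    Each respondent counts n_baseline innocuous yes/no items (here iid
    Bernoulli(0.5)); the treatment arm's count also includes the sensitive
    item.  p_underreport lets true holders hide the item; p_overreport lets
    non-holders claim it (the "partisan cheerleading" direction).
    """
    base_c = rng.binomial(n_baseline, 0.5, n_per_arm)
    base_t = rng.binomial(n_baseline, 0.5, n_per_arm)
    holds = rng.random(n_per_arm) < prevalence
    reported = np.where(
        holds,
        rng.random(n_per_arm) >= p_underreport,   # holders who admit it
        rng.random(n_per_arm) < p_overreport,      # non-holders who claim it
    )
    return (base_t + reported).mean() - base_c.mean()

# Low power: sampling spread vs a direct question, 500 per arm
est = [list_experiment(500, 0.30) for _ in range(2000)]
direct = rng.binomial(1000, 0.30, 2000) / 1000
print(np.std(est), np.std(direct))   # list SE is typically ~4-5x larger

# Offsetting misreporting: the estimate can still land near the truth
est_off = [list_experiment(500, 0.30, p_overreport=0.17, p_underreport=0.40)
           for _ in range(2000)]
print(np.mean(est_off))   # near 0.30 even though 40% of holders hid the item
```

With these made-up parameters the two biases roughly cancel, so the headline number looks fine while the measure is badly broken underneath, which is exactly the interpretive problem Nyhan describes.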
Many of the commenters on the blog said that the failure to publish these results reflected badly on the researchers, and that they should publish the quirky results to complete the scientific record.

Both of these examples, as well as many other stories I've heard, make me think that the major cause of the file drawer effect in social science is not null results but inconclusive, messy, and questionable ones. The key problem arises when an analysis produces a result that makes you reassess the measurement assumptions you were working with. For instance, a secondary correlation with a demographic variable comes out in an unexpected direction, or the distribution of responses bunches up in three places on a 10-point scale.

The problem comes down to this: if I design a survey or other study to test an empirical proposition, that study is unlikely to be well designed to test the validity of the measures involved, or to show how design effects are influencing them.

The results you get from a study designed to test an effect are often enough to cast doubt on validity, but rarely enough to demonstrate a lack of validity convincingly (i.e. to publishable quality). The outcome is that the paper can be written up either as a poor substantive article (because the validity of the measures is in doubt) or as a poor methodological article (because the evidence about the measures' validity is weak either way, the study not having been designed to test it).

One answer to this is to do more pre-testing. This can help establish the validity of measures before working with them and can certainly identify the most obvious problems. However, unless the pre-test is nearly as large as the actual sample, correlations with other variables won't be particularly clear in advance. In addition, pre-testing won't illuminate design effects unless it tests different design combinations.
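To put a rough number on why small pre-tests can't pin down secondary correlations in advance, here is a quick sketch using the standard Fisher z approximation for the sampling error of a correlation (the target correlation of 0.2 and the sample sizes are illustrative, not from any study discussed here):

```python
import numpy as np

def corr_ci(r, n):
    """Approximate 95% CI for a sample correlation r at sample size n,
    via the Fisher z transform: se(arctanh(r)) ~ 1/sqrt(n - 3)."""
    z, se = np.arctanh(r), 1.96 / np.sqrt(n - 3)
    return np.tanh(z - se), np.tanh(z + se)

for n in (50, 200, 1000):
    lo, hi = corr_ci(0.2, n)
    print(f"n={n:4d}: 95% CI around r=0.2 is ({lo:+.2f}, {hi:+.2f})")
```

A pre-test of 50 respondents can't even settle the sign of a modest correlation like 0.2; you need a sample close to the size of the real study before the interval becomes informative.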

What is really needed, however, are whole studies devoted to examining design effects experimentally and establishing the measurement properties of these instruments. Until that happens for methods such as list experiments, researchers will be stuck with results of questionable validity that are hard to publish as either good empirical or good methodological pieces.

A more radical approach would be to encourage journals for ideas that didn't quite work out: short research articles that explain why the idea should have worked nicely but ended up being a damp squib. These would be useful for meta-analysis of why certain techniques are problematic in practice, without demanding the same write-up time as a full methodological piece.