“Challenge everything” – A challenge prize centre embraces experimentation

By Eszter Czibor, Teo Firpo on Wednesday, 8 December 2021.

IGL’s researchers teamed up with Nesta Challenges to find out how randomised experiments can improve the way we identify the best solutions to tough societal challenges.

Our colleagues at Nesta Challenges have run countless successful challenge prizes – competitions that offer a reward for the first or best solution to a problem – spanning a wide range of topics and locations. From Nepal to Canada, from antibiotic resistance to agricultural productivity, several examples have demonstrated the power of this innovation method.

But could prizes work even better? Besides running specific prizes, Nesta Challenges are also passionate about advancing the methodology of challenge prizes. Given our own interest in the science of innovation funding, we were keen to team up with them to see how randomised experiments can help improve the design and delivery of prizes.

What did we set out to test?

There are numerous ways to embed experiments within a challenge prize, from randomising the messages sent out to potential innovators to varying the prize structure. The pilot experiments we cover in this blog post focus on the evaluation of proposed solutions: they test whether there is room to improve the way we gauge and score applications’ quality.

Why this focus? There is mounting evidence for biases in assessment, with well-documented examples from hiring decisions and science funding. In particular, we were worried about in-group bias, whereby evaluators are more favourable towards proposals submitted by individuals or teams with whom they share certain characteristics. This bias can perpetuate existing inequalities in participation in innovation if members of marginalised groups are underrepresented among evaluators and, as applicants, receive lower scores from evaluators who belong to the majority group. Such bias would call into question one of the main advantages of challenge prizes: their ability to attract and promote innovators from a wide range of backgrounds.

How did we go about it?

We identified two social innovation challenge prizes run by Nesta Challenges in 2020 and 2021 that attracted a large number of proposals and many assessors and judges (resulting in hundreds of proposal-evaluator pairs). Both prizes involved an initial assessment stage (assessors with relevant knowledge reviewing all submitted proposals, creating a longlist of proposals to advance) and a subsequent first judging stage (judges – high-profile topic experts – reviewing the longlist and narrowing it down to a shortlist of top proposals). [1] Evaluators scored proposals on various pre-specified criteria, including team capability, innovation and feasibility.

To test for in-group bias in the evaluation, we randomly assigned proposals to evaluators. Chance alone thus determined whether the resulting applicant-evaluator pairs shared a characteristic of interest (“match”) or not (“no match”). [2] We focussed on matches by region of origin (in the first prize) and gender (in both prizes). For instance, if in-group bias by gender were at play, we would expect that a woman applicant whose proposal was, by the luck of the draw, assigned to a woman evaluator (“match”) would receive a higher score than another woman with a proposal of the same quality who was assigned to be evaluated by a man (“no match”).
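
To make the logic concrete, here is a minimal sketch in Python. It is not the code used in the pilots: the function names, the greedy balancing rule and the data are all illustrative assumptions. It assigns evaluators at random while avoiding conflicted pairs and keeping workloads even, then compares the average score of “match” and “no match” pairs.

```python
import random
from statistics import mean


def assign_evaluators(proposals, evaluators, per_proposal=2,
                      conflicts=frozenset(), seed=0):
    """Randomly give each proposal `per_proposal` evaluators.

    Assumed constraints: conflicted (proposal, evaluator) pairs are never
    used, and workloads are kept even by preferring the least-loaded
    evaluators, with ties broken at random.
    """
    rng = random.Random(seed)
    load = {e: 0 for e in evaluators}
    assignment = {}
    for p in proposals:
        eligible = [e for e in evaluators if (p, e) not in conflicts]
        if len(eligible) < per_proposal:
            raise ValueError(f"not enough eligible evaluators for {p!r}")
        rng.shuffle(eligible)                 # random order first...
        eligible.sort(key=lambda e: load[e])  # ...then a stable sort by load
        assignment[p] = eligible[:per_proposal]
        for e in assignment[p]:
            load[e] += 1
    return assignment


def match_gap(rows):
    """Mean score of matched pairs minus mean score of unmatched pairs.

    `rows` holds (applicant_group, evaluator_group, score) tuples.
    """
    matched = [s for a, e, s in rows if a == e]
    unmatched = [s for a, e, s in rows if a != e]
    return mean(matched) - mean(unmatched)


# Made-up data, for illustration only.
assignment = assign_evaluators(
    proposals=["P1", "P2", "P3", "P4", "P5", "P6"],
    evaluators=["E1", "E2", "E3"],
    conflicts={("P1", "E1")},
)
rows = [("woman", "woman", 7), ("woman", "man", 5),
        ("man", "man", 6), ("man", "woman", 6)]
gap = match_gap(rows)
```

A positive gap would point towards in-group favouritism, but a raw difference in means is only a starting point: a fuller analysis would also account for proposal quality and for the way the assignment was generated.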

In addition, we embedded a non-randomised “shadow experiment” in the assessment stage of the second prize in order to test whether sectoral (private vs. non-profit) matching between evaluators and applicants systematically affected scores. We invited additional, “shadow” assessors to review a subset of the applications: while all “real” assessors were private sector employees, all shadow assessors worked for Nesta Challenges. Shadow assessors’ scores were only used for the purpose of the experiment, not for determining the longlist. Our hunch was that shadow assessors (by virtue of working for a non-profit organisation) might evaluate applications from non-profit-led teams more favourably than the “real” assessors working in the private sector.

What have we learned?

These experiments have delivered a crucial insight: Even if bias is not an issue, noise is!

Across the two prizes, we didn’t find evidence of in-group bias in proposal evaluation: applicants did not receive higher scores when their proposal was assigned to an evaluator with whom they shared a characteristic of interest. This was true for all three characteristics (gender, region of origin and sector) that we studied. If anything, as the figure below shows, men evaluated applications submitted by women somewhat more favourably than those submitted by men (though the difference was not statistically significant). Directionally, we found the same result for the other two characteristics as well.

Even though we didn’t detect systematic differences in the way evaluators scored proposals submitted by in- and out-group members, our results formally demonstrated what anyone running funding calls has long suspected: there is a lot of noise in assessment! The same proposal often received widely different scores from different evaluators, and a proposal’s initial assessor scores were only weakly predictive of later judge scores. Evaluators differed a lot in their assessment style (both in their overall tendency to be lenient and in the spread of the scores they gave). The team managing the prize had anticipated this issue and had built in a process to (among other things) review borderline cases or applications where there was a significant difference between the two assessors’ scores; the judges also had the opportunity to revise their scoring in light of their discussions at the final judging panel meeting. While these efforts certainly helped to reduce noise, we still found that having one’s proposal randomly assigned to a harsher vs. a more lenient judge had a detectable impact on a proposal’s final rank.
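
For readers who want to look for this kind of noise in their own data, the sketch below shows two simple diagnostics. The measures and names are our illustrative assumptions, not the analysis behind the results above: the gap between the highest and lowest score each proposal received, and each evaluator’s average deviation from the mean score of the proposals they read.

```python
from collections import defaultdict
from statistics import mean


def disagreement_by_proposal(scores):
    """Highest minus lowest score that each proposal received."""
    by_proposal = defaultdict(list)
    for _, proposal, score in scores:
        by_proposal[proposal].append(score)
    return {p: max(v) - min(v) for p, v in by_proposal.items()}


def leniency_by_evaluator(scores):
    """Average of (evaluator's score - mean score of that proposal).

    Positive values suggest a lenient evaluator, negative a harsh one.
    """
    by_proposal = defaultdict(list)
    for _, proposal, score in scores:
        by_proposal[proposal].append(score)
    proposal_mean = {p: mean(v) for p, v in by_proposal.items()}

    deviations = defaultdict(list)
    for evaluator, proposal, score in scores:
        deviations[evaluator].append(score - proposal_mean[proposal])
    return {e: mean(v) for e, v in deviations.items()}


# Made-up (evaluator, proposal, score) records, for illustration only.
scores = [
    ("A", "P1", 8), ("B", "P1", 4),
    ("A", "P2", 9), ("C", "P2", 7),
    ("B", "P3", 3), ("C", "P3", 5),
]
disagreement = disagreement_by_proposal(scores)
leniency = leniency_by_evaluator(scores)
```

In this toy data, evaluator A scores 1.5 points above the proposal average and evaluator B 1.5 points below it, so a proposal’s score depends heavily on who happens to read it.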

What changed as a result?

These findings inspired Nesta Challenges to re-design the evaluation process of a new prize they were preparing. First, they decided to keep the random assignment of proposals to evaluators. This mode of allocation ensures fairness, as every proposal has the same chance of being assigned to a strict or a lenient judge. It also allows programme managers to continue to monitor the scores for signs of in-group bias (potentially along other characteristics).

Second, they introduced score normalisation in the initial assessment phase, essentially correcting for any assessor-specific component in a proposal’s score. In the subsequent judging panel, they presented judges’ normalised scores alongside their initial scores to help facilitate the discussion. Prior to the panel, they also provided judges with private feedback on their individual assessment style.
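
There are several ways to normalise scores. One common option, shown here purely as an illustration (we are not claiming it is the exact formula the team adopted), is to convert each raw score into a z-score within assessor: subtract that assessor’s average score and divide by the spread of their scores. All names and numbers below are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalise_scores(scores):
    """Z-score each raw score within its assessor.

    `scores` holds (assessor, proposal, raw_score) tuples. Returns a dict
    keyed by (assessor, proposal). An assessor who gave every proposal the
    same score has zero spread, so their scores are all mapped to 0.0.
    """
    by_assessor = defaultdict(list)
    for assessor, _, raw in scores:
        by_assessor[assessor].append(raw)
    stats = {a: (mean(v), pstdev(v)) for a, v in by_assessor.items()}

    normalised = {}
    for assessor, proposal, raw in scores:
        mu, sd = stats[assessor]
        normalised[(assessor, proposal)] = 0.0 if sd == 0 else (raw - mu) / sd
    return normalised


def proposal_means(normalised):
    """Average the normalised scores each proposal received."""
    by_proposal = defaultdict(list)
    for (_, proposal), z in normalised.items():
        by_proposal[proposal].append(z)
    return {p: mean(v) for p, v in by_proposal.items()}


# Made-up data: assessor A is lenient, assessor B is harsh,
# but both rank the three proposals in the same order.
scores = [
    ("A", "P1", 8), ("A", "P2", 9), ("A", "P3", 10),
    ("B", "P1", 2), ("B", "P2", 4), ("B", "P3", 6),
]
z = normalise_scores(scores)
ranking_input = proposal_means(z)
```

After normalisation, the lenient and the harsh assessor give each proposal the same standardised score, so only their relative judgements remain. This approach assumes every assessor reads a broadly comparable mix of proposals, which is one more reason to keep assigning proposals at random.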

The team also found the “shadow experiment” approach helpful for sense-checking their assessment. A programme manager involved in our pilots urged his colleagues to continue collecting scores from shadow assessors in future prizes as a low-risk way of interrogating their evaluation process.

Caveats

We consider the experiments covered in this blog post to be pilots whose main aim was to introduce the Nesta Challenges team to experimentation. Going forward, we hope to bring even more rigour to the design and analysis of our joint experiments. [3] While we are confident in our conclusion that evaluators in these prizes are not biased towards applicants of the same gender, region of origin or sector, we wouldn’t attach too much weight to the specific point estimates and p-values obtained in the analysis (which is why we chose not to report them in this blog post).

It is important to recognise that our conclusions are specific to our context: they do not imply that there is no in-group bias along other characteristics (e.g. ethnicity) in these prizes, and they should certainly not be interpreted to mean that in-group bias is not an issue in science and innovation funding in general. Nesta Challenges place great value on equity and diversity and have substantial experience in delivering challenge prizes in an inclusive way – repeating these pilots with other funders and organisations may well return different results.

Experimentation is valuable and need not be disruptive

Our collaboration with Nesta Challenges successfully demonstrated that randomised experiments can be embedded in the evaluation of challenge prizes and can deliver insights that lead to practical improvements in the way future prizes are delivered. It has also inspired even more creativity and rigour in the team, and empowered those involved in the design and delivery of challenge prizes to grab every opportunity for learning, questioning and improving their methodology. In the words of one of our programme manager partners: “experimentation has taught me to challenge everything”.

Are you a funding organisation interested in using experiments to improve your funding processes? Get in touch.

1. In each phase, at least two evaluators read each proposal, and each evaluator read multiple proposals.
2. The randomisation process proved to be quite complicated in practice as we had to observe constraints regarding the number of proposals each evaluator read and had to avoid certain conflicts of interest.
3. When designing the pilots, unfortunately we weren’t able to conduct proper power calculations due to a combination of time pressure and uncertainty around how much of the variance in scores would be explained by observables. Note that the confidence intervals reported in Figure 1 do not account for the complicated randomisation process we used to generate our data.