Exploring policy solutions through synthetic data: Our Datathon experience

By David Ampudia on Wednesday, 23 October 2024.

Data held by innovation agencies and funders has the potential to offer valuable strategic insights for both internal programmes and the wider policy and research communities that support science and innovation. By working with researchers and other organisations, agencies can experiment with novel approaches to analysis and new research questions to help unlock these insights. However, internal data on historical programmes and outcomes is often too sensitive to share externally, which makes outside analysis a challenge.

Our design for a synthetic datathon provides a way to overcome these barriers. This approach was recently implemented in collaboration with the European Innovation Council (EIC). By allowing researchers to work with anonymised, simulated data, we helped them both address real policy challenges for the EIC and advance their research agendas, all while keeping sensitive information secure.

Synthetic data. Real insights.

As part of a larger initiative, the EIC aimed to better leverage its underutilised data by inviting researchers to explore it, exposing the organisation to fresh research questions and novel methodologies that could enhance its learning. However, strict data-sharing restrictions posed a challenge. We needed to design a solution that would overcome these barriers while still encouraging collaboration. The synthetic datathon was that solution, designed to allow researchers to safely explore policy solutions using a simulated version of the EIC’s data. 

At the Innovation Growth Lab, we advocate for experimentation to turn data into practical insights. This openness to trying new approaches led us to organise the synthetic datathon, not only to help the EIC find new ways to use its data, but also to encourage collaboration and bring in fresh ideas from the research community. Through this initiative, researchers tackled real policy challenges, producing solutions that could inform future discussions around policy and data use, while also advancing their own research agendas.

How the datathon unfolded

The EIC faced a significant challenge: while its data held great potential for policy insights, limited capacity and strict data-sharing restrictions made it difficult to exploit fully. Sharing sensitive data with researchers could compromise privacy, yet not doing so would limit creative approaches to existing challenges. The synthetic datathon we developed provided a creative solution, enabling researchers to explore policy questions while safeguarding sensitive information.

We launched the initiative with an open call for researchers and public sector analysts to propose solutions to key policy challenges faced by the EIC. Drawing on our understanding of the EIC’s policy needs and the research landscape, we were able to guide the selection of proposals that aligned with both EIC objectives and researchers’ areas of expertise. Over 50 individual researchers and 27 research teams expressed interest, showcasing a diverse range of approaches. Seven proposals were selected, based on how well they aligned with the EIC’s policy needs and their innovative methods.

We produced a synthetic version of the EIC’s data, with anonymised text, noise added to numerical data, and permuted categorical variables. Researchers developed and refined their code in a secure environment, ensuring their work was robust and adaptable. Once their models were ready, IGL ran the code on the real data. While researchers never accessed the real data directly, their final findings were grounded in it, making the results relevant and actionable for the EIC’s policy needs.
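To make the three transformations concrete, here is a minimal Python sketch of how a synthetic table of this kind might be produced. It is an illustration only, not the procedure IGL used: the function name `synthesise`, the column names, the noise scale and the `[REDACTED]` placeholder are all hypothetical, and a real pipeline would need a proper privacy review on top of this.

```python
import random
import statistics


def synthesise(records, numeric_cols, categorical_cols, text_cols,
               noise_scale=0.1, seed=0):
    """Return a noisy, permuted, redacted copy of `records` (a list of dicts).

    Assumption: Gaussian noise proportional to each column's standard
    deviation, an independent shuffle per categorical column, and a fixed
    placeholder for free text. None of these choices come from the post.
    """
    rng = random.Random(seed)
    synthetic = [dict(row) for row in records]  # leave the input untouched

    # 1. Add noise to numerical columns.
    for col in numeric_cols:
        spread = statistics.pstdev(row[col] for row in records)
        for row in synthetic:
            row[col] = row[col] + rng.gauss(0, noise_scale * spread)

    # 2. Permute categorical columns: the overall counts stay the same,
    #    but each value is detached from its original row.
    for col in categorical_cols:
        shuffled = [row[col] for row in records]
        rng.shuffle(shuffled)
        for row, value in zip(synthetic, shuffled):
            row[col] = value

    # 3. Replace free text with a placeholder.
    for col in text_cols:
        for row in synthetic:
            row[col] = "[REDACTED]"

    return synthetic


# Hypothetical example records, invented for illustration.
records = [
    {"grant": 120.0, "country": "ES", "abstract": "Battery recycling pilot"},
    {"grant": 340.0, "country": "FR", "abstract": "Quantum sensing platform"},
    {"grant": 90.0, "country": "DE", "abstract": "Medical imaging tool"},
    {"grant": 510.0, "country": "ES", "abstract": "Offshore wind sensors"},
]
synthetic = synthesise(records, ["grant"], ["country"], ["abstract"])
```

Researchers would write and debug their analysis against something like `synthetic`, and the same code would later be run unchanged on the real table by whoever holds it.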

Over four months, the selected teams developed their models using the synthetic data. The teams worked asynchronously, communicating via a Slack workspace. We also organised joint meetings and code-sharing sessions to keep the collaboration interactive and dynamic. This format promoted flexibility, enabling participation without geographical constraints. We provided technical support, helping the teams troubleshoot code and refine their approaches. 

The datathon culminated in a forum that brought together both the researchers and innovation agencies who had participated in the Innovation Data Dialogues. This event gave researchers the chance to present their work, engage with innovation policymakers, and discuss how their insights could support decision-making. It was a crucial moment where research met policy, bridging the gap between the two communities.

What we learned

The datathon showed that agencies can generate valuable insights without compromising privacy. Because we ran the researchers’ code on the real data, their results were grounded in real-world evidence rather than in the simulation. This approach reinforced the value of a test-and-learn model, allowing agencies to explore data-driven insights even when direct data sharing is not possible.

The asynchronous format gave researchers the flexibility to contribute at their own pace, which led to broad participation from nine countries and eleven institutions. This diversity of involvement also encouraged a wide range of innovative proposals, many of which demonstrated strong policy relevance and showed significant potential for continued collaboration and further development.

Improving the datathon

Using synthetic data worked well but introduced new challenges. Although the data was accurate at aggregate levels, this did not always hold true for individual-level details. This underlined the need to communicate clearly to participants the potential impact of artefacts and imperfections in the data.
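One way such artefacts can arise is sketched below in Python, under assumptions of our own: the data, the column names and the helper `group_means` are invented for illustration and do not describe the EIC dataset. Shuffling a categorical column on its own keeps the overall counts and the overall average intact, but it detaches each category from the values it originally sat beside, so any breakdown by that category stops reflecting the real pattern.

```python
import random
from collections import Counter, defaultdict


def group_means(rows, group_col, value_col):
    """Average of `value_col` within each level of `group_col`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_col]].append(row[value_col])
    return {level: sum(vals) / len(vals) for level, vals in groups.items()}


# Hypothetical "real" data: energy grants are much larger than health grants.
real = (
    [{"sector": "health", "grant": 100.0 + i} for i in range(50)]
    + [{"sector": "energy", "grant": 500.0 + i} for i in range(50)]
)

# Synthetic copy: permute the sector column, keep the grant column as it is.
rng = random.Random(1)
sectors = [row["sector"] for row in real]
rng.shuffle(sectors)
synthetic = [{"sector": s, "grant": row["grant"]} for s, row in zip(sectors, real)]

# Aggregate properties survive the permutation.
same_counts = Counter(r["sector"] for r in real) == Counter(r["sector"] for r in synthetic)
real_overall = sum(r["grant"] for r in real) / len(real)
synthetic_overall = sum(r["grant"] for r in synthetic) / len(synthetic)

# The relationship between sector and grant size does not.
real_means = group_means(real, "sector", "grant")
synthetic_means = group_means(synthetic, "sector", "grant")
real_gap = real_means["energy"] - real_means["health"]
synthetic_gap = synthetic_means["energy"] - synthetic_means["health"]

print("same sector counts:", same_counts)
print("overall mean (real, synthetic):", real_overall, synthetic_overall)
print("energy minus health gap (real, synthetic):", real_gap, synthetic_gap)
```

A participant who only sees the synthetic table has no way of telling that the shrunken gap between sectors is an artefact, which is why these limitations need to be spelled out up front.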

A key challenge was balancing creativity and policy relevance. While the teams had clear research agendas, as with any hackathon-style event, some ideas evolved in unexpected directions. Regular feedback sessions, supported by our understanding of both the EIC’s policy objectives and the research process, helped balance the EIC’s priorities with researchers’ explorations. 

Managing researcher availability in an asynchronous format required adaptability, as productivity spiked unevenly. This, combined with the remote nature of the event, led to coordination challenges and some duplication of effort. However, by using our experience as intermediaries and setting up frequent touchpoints, we kept teams engaged. 

These lessons will be crucial for improving future iterations of the format and ensuring its applicability to other innovation agencies.

What’s next

The datathon laid the groundwork for further collaboration between the EIC and some of the participating researchers. Building on these connections will help turn early findings into practical policy insights. The EIC plans to work closely with key research teams, offering data access and support to refine their analysis.

The event also attracted interest from other innovation agencies keen to explore both the research outputs and the synthetic data model used. This ties in with IGL’s wider mission to promote data-driven decision-making. We will continue supporting initiatives like this across the policy landscape, bringing innovation funding agencies and the academic research community together to make more insightful use of data.

If you are an innovation agency, funder or public sector organisation interested in running a similar event, contact us at [email protected]