Evidence & Evaluation
Roundtable: Evidence and Evaluation
Summary of roundtable discussion, 4th February 2026
Setting and purpose
The roundtable brought together practitioners, evaluators, academics and policy officials to explore whether current evidence standards and evaluation methods narrow government's field of view, causing us to undervalue promising alternatives. Two provocations were shared with participants in advance:
A larger reality, Sophia Parker
Three speakers gave opening contributions before the discussion was opened out to the wider group.
Opening contributions
Sophia Parker (JRF) chaired. She set the scene by describing the messy middle she inhabits between the edges and the centre, and the caricature each holds of the other: to the centre, the edges can appear marginal and unserious; meanwhile, to people working around the edges, the centre can look like a slow-moving dinosaur. She framed the core question as how to work more generatively with the tension between the stabilising forces of the centre and the energy at the edges, emphasising that the stuckness many of us feel is systemic, and not a failure of individuals.
James Plunkett (Kinship Works) offered three reflections. First, the risk of becoming stuck in a local maximum is an inherent feature of applying the scientific method to policy - we advance by narrowing and focusing - so it is natural that we occasionally need to zoom out and broaden our view. Second, it is valuable to visit work at the edges, to see how it feels to interface with the centre's evidence standards. Projects that are visibly transforming people's lives often struggle to evidence their impact in ways the system understands, being asked for proof in forms that don't fit the nature of the work, while less promising interventions attract funding not because they are more impactful but because they lend themselves more neatly to our favoured evaluation methods. Third, these challenges go beyond evaluation methods - they arise partly from politics, accountability, incentives and institutional design - but he proposed that this conversation stay focused on the evidence and evaluation dimensions.
Moira Wallace (LSE / former director of the Social Exclusion Unit) offered a historical perspective, describing the SEU's methods in the late 1990s as revolutionary at the time. Central to the unit's design was a founding principle of crossing departmental boundaries - the SEU worked on problems that spanned more than one department, which gave it licence to operate and a defence against the charge of interfering. Beyond that structural feature, the SEU's methods were distinctive in several ways: starting by defining and quantifying the problem, using data and social research extensively (both unusual at the time), running calls for evidence that gathered hundreds of responses, going out to look at complex problems directly, and using a "walk-through" method - a systematic attempt to describe what it is actually like to be on the receiving end of a policy or service - to ensure the work was grounded in lived experience. Reports were written in plain language, accessible to people experiencing the problems in question. Moira noted that almost every metric of adolescent disadvantage improved during that period, and subsequently worsened after 2010. Her four observations for today: (1) government structures should match the structure of reality - someone must be accountable for cross-cutting problems; (2) data should be used to inform decisions, not managed defensively as something to be controlled; (3) an inquiry method in policymaking is vital - asking why a problem exists before reaching for solutions; and (4) RCTs are valuable but limited, and evidence about problems deserves as much attention as evidence about solutions.
Felix Anselm van Leer (Oxford Government Outcomes Lab / Test, Learn & Grow evaluator) described the tendency towards the local maximum as something he observes directly in evaluation work - even in programmes designed to stay open, organisations often narrow down. Felix introduced the "doughnut" metaphor: the jam is the intervention, the dough is the scaffolding of processes, relationships, leadership and culture that holds it together and determines how outcomes play out. The evaluation of the Life Chances Fund illustrates the distinction: surface-level data suggested outcomes-based contracting was working; a quasi-experimental study confirmed these contracts outperformed other contract types; but only deep longitudinal qualitative work revealed that it wasn't really the contract mechanism driving the results - it was transformed leadership, culture, and the adoption of more relational practices. Had the evaluation stopped at the quantitative stage, it would have produced conclusions that looked right on the surface but missed what mattered.
His conclusions: be question-led rather than methods-led; build genuinely multidisciplinary teams; allow more time for evaluation to capture lasting effects. Felix also flagged the potential of machine learning to analyse large volumes of unstructured qualitative data at scale - a way to gain some of the depth of qualitative insight without being confined to small samples, though one that needs to be approached with care.
Wider discussion
A number of themes emerged:
On the limits of current evaluation practice. Several contributors noted the tendency to rush to an RCT before undertaking the foundational implementation science that should precede it. One contributor who runs RCTs noted that implementation research should inform an RCT, not run alongside it; yet pressure from funders to get definitive answers quickly means both often happen at the same time. Evidence-based programmes imported from the US have repeatedly failed to replicate in the UK partly because the contextual "dough" (from the doughnut metaphor) was never understood or adapted - we ran straight to an RCT without developing a recipe that worked in this country. Funders were identified as a significant part of the problem: too prescriptive about methodology, too focused on definitive answers, and insufficiently attentive to delivery organisations who understand what's actually feasible on the ground.
On the scale of the problem. One participant offered a corrective, noting that genuine innovation in evaluation methods already happens inside government - for example, the theory-based methods used in the government's electoral integrity work, the use of wellbeing measures in both appraisal and evaluation, and the theory-based and systems evaluation methods used to understand homelessness. This is worth acknowledging to avoid a one-sided picture, even if it represents the cutting edge rather than the norm.
On the weighting of different forms of evidence. A recurring theme was that qualitative, relational, and practitioner knowledge is systematically underweighted relative to quantitative and academic evidence. One participant gave the example of applying the academic literature on children's services interventions in Aboriginal communities in Australia: the work would have failed had the evidence base been applied without deep engagement with cultural knowledge that cannot be learned from a paper. The hierarchy of evidence, as taught in many universities, was also criticised as an unhelpful paradigm for complex social problems.
On evidence translation. One participant underscored an important and often overlooked area: taking what is known from an evaluation and applying it to a new context, which is itself a complex question. Original evaluations often don't help - they don't capture enough about how an intervention was delivered, what was context-specific, or what the "secret ingredient" was - making it hard to know what can travel and what needs to be adapted. The What Works Centres have begun to move into this space, but there are significant areas of social policy they don't cover, leaving it unclear where this translation work is supposed to happen in those areas.
On co-design of evaluations. Several contributors pointed to co-designing evaluation frameworks and metrics with the people being served - not just co-designing interventions, but their evaluations too - as an underexplored and promising approach. One participant described involving young people in selecting evaluators and designing research questions, noting that young people are reliably good at identifying where adults have got things wrong. Another illustrated how co-design can change what gets measured: a community safety project co-designed with residents surfaced a metric that would never have emerged from the centre - not crime reduction, but whether people felt safe enough to leave a ground floor window open. The argument throughout was that evaluators have blind spots, and those with lived experience can surface different and more meaningful measures.
On data and insight. One participant distinguished between the wealth of data now available to policy-makers and a poverty of insight - having data is not the same as turning it into actionable understanding. This calls for analytical tools, not just data tools: for example, ways to identify people at risk before problems escalate (noting that the capability to do this work varies enormously across public institutions). There is also a problem further upstream: decisions are often made before any appraisal has happened, making subsequent evaluation beside the point - HS2 was cited as an example of a commitment made without sufficient appraisal.
On the role of universities. One participant argued that universities are pushing in the wrong direction - towards certainty as the primary virtue, when much of what we need is to act well under uncertainty. Realist evaluation methods, for example, are rarely taught in universities; much more common is the traditional 'hierarchy of evidence'. There is also an issue of trust: quantitative and qualitative researchers often actively undervalue each other in the academy, speak different languages, and hold incompatible methodological ideologies - which makes building the mixed teams described earlier in the discussion genuinely difficult in practice. The participant also noted co-production and participatory governance as promising approaches: building a policy system that leans less heavily on narrow metrics and substitutes trust and democratic accountability instead - although this requires substantial devolution.
On the role of finance and risk. One participant raised the important role of financial instruments and risk-management frameworks - for example, the financial transactions control framework for institutions such as the National Housing Bank requires financial rather than economic returns, which may systematically exclude investments that are economically justified but not commercially viable. Is this another example of a 'narrow field of view'?
On scaling the intangible. A participant argued that we too often try to scale frameworks and structures when what actually works is relational - a conversation style, a way of asking questions, a culture of trust. The participant cited the example of a building society that, rather than rolling out an all-singing, all-dancing branch transformation, identified that what made the real difference was the way customer service representatives asked questions - so they scaled that instead. There is a parallel for public services: sometimes the thing worth scaling is the intangible relational ingredient, not the policy framework around it. Healthy systems indicators - developed with people experiencing multiple disadvantages - were offered as an example of how we can make these invisible conditions of a well-functioning system visible.
On the importance of a permissive environment. A participant described an initiative in community mental health as a positive case study; the policy intent was genuinely permissive and drew in learning from the edges, producing outcomes frameworks built around what people actually said they wanted: more connection, hope for the future, choice and control. The downside, however, was that the intent dissolved over time because the surrounding architecture - performance metrics, regulation, planning frameworks - didn’t change to accommodate the work. The pull back to old ways of working was relentless. The participant argued for a new relationship between centre and place: fewer centrally mandated metrics, more latitude for co-design and local definition of what matters, combined with greater clarity about the small number of things that genuinely do need to be measured centrally.
Closing reflection
Sophia Parker closed by offering a question for the participants to sit with: does it change how we think about evidence if we understand ourselves to be gathering evidence not to achieve certainty, but to navigate uncertainty?
