“The ability of an evaluation to detect a meaningful impact of a program is determined by the evaluation’s sample size and statistical power. This is a tool for policymakers and practitioners that describes some of the factors that affect statistical power and sample size. Further information on the dangers of running an evaluation with inadequate power can be found in a companion resource….
The statistical power, or power, of an evaluation reflects the likelihood of detecting any meaningful changes in an outcome of interest brought about by a successful program. In the process of designing a randomized evaluation, researchers conduct power analyses to inform decisions such as:
- Whether to conduct the evaluation
- At which unit to randomize (e.g., individual, household, or group)
- How many units to randomize
- How many units or individuals to survey
- How many times to survey each unit or individual over the course of the evaluation
- How many different program alternatives to test
- How much baseline information to collect
- Which outcomes to measure
- How to measure the outcomes of interest
It is important to understand how the factors above are interrelated and affect the overall power and sample size needed for a randomized evaluation. The rules of thumb outline the key relationships between the determinants of statistical power and sample size, and demonstrate how to design a high-powered randomized evaluation” (p.2).
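The interplay between sample size, effect size, significance level, and power described above can be made concrete with a standard calculation. The sketch below is not from the source; it uses the common normal-approximation formula for the sample size per arm of a two-arm, two-sided comparison, n = 2(z₁₋α/₂ + z₁₋β)² / d², where d is the standardized effect size. Only the Python standard library is assumed.

```python
# Hedged sketch (not from the source): normal-approximation sample size
# per arm for a two-arm randomized evaluation with a two-sided test.
from statistics import NormalDist

def n_per_arm(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> float:
    """Approximate sample size per arm to detect a standardized effect."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_beta = z.inv_cdf(power)           # quantile corresponding to desired power
    return 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2

# A small effect (d = 0.2) needs roughly six times the sample of a
# moderate effect (d = 0.5) at 80% power and a 5% significance level.
print(round(n_per_arm(0.2)), round(n_per_arm(0.5)))
```

Plugging in different values shows the trade-offs the list above describes: halving the detectable effect size quadruples the required sample, and demanding higher power at the same effect size also raises it.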
Major Findings & Recommendations
The tool outlines the following six rules of thumb:
- “Rule of thumb #1: a larger sample increases the statistical power of the evaluation… Larger samples are more likely to be representative of the original population…and are more likely to capture impacts that would occur in the population. Additionally, larger samples increase the precision of impact estimates and the statistical power of the evaluation” (p.4).
- “Rule of thumb #2: if the effect size of a program is small, the evaluation needs a larger sample to achieve a given level of power. The effect size of an intervention is the magnitude of the impact of the intervention on a particular outcome of interest. When designing an evaluation, the research team wants to ensure that they are able to identify the effect of the program with precision” (p.4).
- “Rule of thumb #3: an evaluation of a program with low take-up needs a larger sample. Randomized evaluations are designed to detect the average effect of a program over the entire sample that is assigned to the treatment group. Therefore, lower take-up decreases the magnitude of the average effect of the program” (p.5).
- “Rule of thumb #4: if the underlying population has high variation in outcomes, the evaluation needs a larger sample… Especially when running an evaluation on a population with high variance, selecting a larger sample increases the likelihood that you will be able to distinguish the impact of the program from the impact of naturally occurring variation in key outcome measures” (p.6).
- “Rule of thumb #5: for a given sample size, power is maximized when the sample is equally split between the treatment and control group. To achieve maximum power for a given sample size, the sample should be evenly divided between the treatment group and control group. If you have the opportunity to add study participants, regardless of whether you add them to the treatment or control group, power will increase because the overall sample size is increasing” (p.7).
- “Rule of thumb #6: for a given sample size, randomizing at the cluster level as opposed to the individual level reduces the power of the evaluation. The more similar the outcomes of individuals within clusters are, the larger the sample needs to be” (p.8).

(Abstractor: Author and Website Staff)
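Rules of thumb #3, #5, and #6 can be folded into a single back-of-the-envelope calculation. The sketch below is an illustration, not the source's own method: it applies three standard adjustments to the total sample size of a two-arm evaluation, diluting the effect by the take-up rate (rule #3), scaling by the allocation factor 1 / (p(1−p)), which is smallest at a 50/50 split (rule #5), and multiplying by the design effect 1 + (m−1)·ICC for clusters of size m with intra-cluster correlation ICC (rule #6). All parameter names are illustrative.

```python
# Hedged sketch (standard adjustments, not quoted from the source):
# total sample size for a two-arm evaluation after accounting for
# take-up, allocation ratio, and clustering.
from statistics import NormalDist

def adjusted_total_n(effect_size: float, take_up: float = 1.0,
                     treat_share: float = 0.5, cluster_size: int = 1,
                     icc: float = 0.0, alpha: float = 0.05,
                     power: float = 0.8) -> float:
    z = NormalDist()
    z_total = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    diluted = effect_size * take_up                 # rule #3: low take-up shrinks the average effect
    alloc = 1 / (treat_share * (1 - treat_share))   # rule #5: minimized (= 4) at a 50/50 split
    deff = 1 + (cluster_size - 1) * icc             # rule #6: clustering inflates the required sample
    return z_total ** 2 * alloc * deff / diluted ** 2

# Halving take-up quadruples the required total sample; an uneven split
# or clustered randomization raises it further.
print(round(adjusted_total_n(0.2)), round(adjusted_total_n(0.2, take_up=0.5)))
```

With full take-up, a 50/50 split, and individual-level randomization, this reduces to twice the per-arm formula, so the adjustments isolate exactly how much each design choice costs in sample size.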