Effect Size in Educational Research

Jessica R. Toste, PhD | The University of Texas at Austin


Researchers rely on significance testing to determine whether their findings provide evidence to reject the null hypothesis (e.g., that there is no relationship between the variables or groups of interest). The lower the reported significance level or p-value, the more confidence one has that the null hypothesis should be rejected. In addition to reporting the statistical significance of results, educational researchers also want to make claims about the practical significance of their findings. This article describes an essential companion to significance testing: effect size.

Why report effect size? Effect size quantifies the difference between two groups and has many advantages over using tests of statistical significance alone. Effect size indicates the magnitude of the difference. When examining a teaching practice, we often compare one group of students who receive an intervention to another group of students who do not (i.e., control group). The p-value generated by significance testing indicates how likely it is that a difference at least as large as the one observed would occur if the null hypothesis were true (e.g., if the intervention did not have an effect on students’ outcomes), but it doesn’t tell us the size of this effect, or how much more effective the novel teaching practice was compared to typical instruction. Just because a finding is statistically significant does not necessarily mean that it is practically meaningful, so it is important to also consider the effect size.

How is effect size reported? Common types of effect size (e.g., Cohen’s d or Hedges’ g) represent the standardized difference between two groups. Interpretation of effect sizes depends on the context within which the research is conducted and can vary according to many factors (e.g., outcome area, type of assessment). Keeping that in mind, 0.20 is generally considered a small effect, 0.50 a moderate effect, and 0.80 or above a large effect.
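
To make "standardized difference" concrete, here is a minimal Python sketch of one common way to compute Cohen's d (the mean difference divided by the pooled standard deviation) and Hedges' g (d multiplied by an approximate small-sample correction). The function names and the scores are hypothetical and serve only as illustration; they do not come from any study described in this article.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(treatment, control):
    """Mean difference divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    # Pooled variance weights each group's sample variance by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(treatment) + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(treatment) - mean(control)) / sqrt(pooled_var)


def hedges_g(treatment, control):
    """Cohen's d with the usual approximate small-sample correction."""
    n1, n2 = len(treatment), len(control)
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return cohens_d(treatment, control) * correction


# Hypothetical post-test scores, invented for illustration only.
treatment_scores = [12, 14, 16, 18, 20]
control_scores = [10, 12, 14, 16, 18]

print(round(cohens_d(treatment_scores, control_scores), 2))
print(round(hedges_g(treatment_scores, control_scores), 2))
```

With small groups, g comes out slightly smaller than d; as group sizes grow, the correction approaches 1 and the two values converge.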

How is effect size interpreted? What does it mean if we have a significant p-value and a small effect size? This depends on what we consider to be practically important. For example, it can be very difficult to see change when intervening with certain skills and/or populations of students (e.g., increasing reading performance for high school students in high-poverty schools). So a smaller effect size may actually be practically important in such cases. On the other hand, sometimes we are confronted with a nonsignificant finding (p > .05), but a large effect size. Concluding that findings are not meaningful based solely on a lack of statistical significance could be a big mistake. Let’s take a look at an example.


Sunshine School has invested in a Tier 2 reading program and tests its efficacy with second graders. Forty students are randomly assigned to one of two groups: reading intervention or business-as-usual control. Students in the reading intervention group receive individual tutoring 4 days each week for 20 weeks, while students in the control group continue to receive their standard classroom instruction. After the intervention, the school research team compares the groups’ scores on reading comprehension. The team notes that the groups do not differ significantly from each other (p > .05) and decides that this is probably not the best intervention for their second graders. But wait! The effect size is 0.70. What does this mean?


The p-value is highly influenced by sample size (e.g., number of students in the study). Small differences between groups can be statistically significant in studies with very large samples and, as with the study of reading instruction at Sunshine School, meaningful differences between groups may be nonsignificant due to small sample size. The effect size provides us with a value that can be used to interpret the magnitude of the effect of an intervention regardless of sample size. An effect size of 0.70 means that the mean score of the treatment group is 0.70 standard deviations above the mean of the control group; or, stated differently, that about 76% of the treatment group scored above the average score for the control group (assuming roughly normally distributed scores). After looking at this effect size, the school team decides that these differences are important and concludes that the reading program should be continued.
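
The two ideas in the paragraph above can be sketched in a few lines of Python. The sketch assumes normally distributed scores with equal spread in both groups, and two groups of equal size; the function names and the sample sizes in the second part are hypothetical and are not the Sunshine School figures. The first function converts an effect size into the share of the treatment group expected to score above the control group mean. The second shows why sample size matters so much for significance: for a fixed effect size d and n students per group, the two-sample t statistic equals d times the square root of n/2.

```python
from math import sqrt
from statistics import NormalDist


def share_above_control_mean(d):
    """Share of the treatment group expected to score above the control mean.

    Assumes normally distributed scores with equal variance in both groups.
    """
    return NormalDist().cdf(d)


def t_statistic(d, n_per_group):
    """Two-sample t statistic implied by effect size d with equal group sizes."""
    return d * sqrt(n_per_group / 2)


# An effect size of 0.70 corresponds to about 0.76, the 76% cited in the text.
print(round(share_above_control_mean(0.70), 2))

# Hypothetical scenarios: a tiny effect in a very large study versus
# a sizeable effect in a very small one.
print(round(t_statistic(0.10, 2000), 2))  # large sample, small effect
print(round(t_statistic(0.70, 8), 2))     # small sample, larger effect
```

As a rough guide, a t statistic needs to exceed about 2 to reach significance at the .05 level (the exact cutoff depends on the degrees of freedom). Under these hypothetical numbers the tiny effect clears that bar while the much larger effect does not, which is why the p-value alone cannot tell us how big an effect is.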


While no test is perfect, considering both significance testing and effect size in research is important for fully understanding the impact of an educational practice on learner outcomes.