Cohen's d and Hedges' g: what are they and why do I care?
Traditionally, researchers have used probability values (p-values) to interpret statistical analyses conducted in experiments and other studies. Although p-values provide important information, such as how likely the observed results would be if the null hypothesis were true (see Travers, Cook, & Cook, 2017), they also have important limitations. For example, p-values are influenced by sample size. Therefore, a study with hundreds of participants can yield statistically significant findings, which is traditionally indicated by a p-value of < .05, even if the difference in performance between the treatment and control groups is small and not practically important. Conversely, large and potentially important differences between groups can result in nonsignificant findings (p > .05) when studies involve a small number of participants. Therefore, researchers now commonly provide effect sizes in addition to p-values when reporting group experimental studies.
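The influence of sample size on p-values can be illustrated with a short sketch. The numbers below are hypothetical, and a normal-approximation z-test is used to keep the example self-contained (a real analysis would typically use a t-test); the point is only that the same mean difference yields very different p-values at different sample sizes.

```python
import math

def two_sample_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a two-sample z-test (normal approximation).

    Illustrative only: assumes equal group sizes and a known,
    common standard deviation.
    """
    # Standard error of the difference between two independent means
    se = sd * math.sqrt(2.0 / n_per_group)
    z = mean_diff / se
    # Two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# The same 2-point mean difference (SD = 10) at two sample sizes:
small_n_p = two_sample_p_value(2.0, 10.0, 20)    # 20 per group
large_n_p = two_sample_p_value(2.0, 10.0, 500)   # 500 per group
```

With 20 participants per group this difference is nonsignificant (p > .05), while with 500 per group the identical difference is highly significant (p < .05), even though the practical importance of a 2-point difference is unchanged.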
Cohen’s d (named after Jacob Cohen) and Hedges’ g (named after Larry Hedges) are commonly used effect sizes in group-experimental research that indicate the magnitude of the effect of an intervention or treatment. Both d and g represent the difference in performance between the treatment group, which receives the intervention, and the control group, which experiences “business as usual” conditions. As such, a d or g of 0 means that, on average, study participants in the treatment and control groups performed the same. A positive d or g means that participants in the treatment group outperformed those in the control group, whereas a negative d or g indicates that the control group outperformed the treatment group. The larger the value of d or g, the greater the difference between the groups in terms of performance.
In addition to providing a metric that is not influenced by sample size (as p-values are), effect sizes are also standardized, which helps research consumers interpret study findings and compare results across studies. For example, a 5-point mean difference between groups would indicate a large and meaningful effect if everyone in the control group scored between 10 and 12 on a test, and everyone in the treatment group scored between 15 and 17 on the same test (i.e., little variance in student performance). However, a 5-point mean difference would represent a small and trivial effect if test scores for the control group ranged from 0 to 95, and scores for the treatment group ranged from 5 to 98. To control for variability in participants’ performance on the outcome measure, group-difference effect sizes are standardized by dividing the difference between the groups by the standard deviation of the dependent variable. Cohen’s d and Hedges’ g are very similar and are interpreted using the same guidelines, though Hedges’ g is calculated to correct for a slight tendency of Cohen’s d to overestimate effects.
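The standardization described above can be sketched in a few lines. The scores below are hypothetical (they mirror the 15–17 versus 10–12 example), and the small-sample correction uses a common approximation of Hedges' correction factor:

```python
import math

def cohens_d(treatment, control):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n1, n2 = len(treatment), len(control)
    m1 = sum(treatment) / n1
    m2 = sum(control) / n2
    # Sample variances (n - 1 denominators), then the pooled SD
    var1 = sum((x - m1) ** 2 for x in treatment) / (n1 - 1)
    var2 = sum((x - m2) ** 2 for x in control) / (n2 - 1)
    sd_pooled = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (m1 - m2) / sd_pooled

def hedges_g(treatment, control):
    """Hedges' g: Cohen's d shrunk by a small-sample correction factor."""
    n1, n2 = len(treatment), len(control)
    # Common approximation of the correction factor
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return cohens_d(treatment, control) * correction

treatment = [15, 16, 17, 15, 16, 17]   # little variance, higher mean
control   = [10, 11, 12, 10, 11, 12]

d = cohens_d(treatment, control)   # large positive effect
g = hedges_g(treatment, control)   # slightly smaller than d
```

Because scores within each group barely vary, the 5-point mean difference translates into a very large standardized effect, and g comes out slightly smaller than d, reflecting the correction for d's tendency to overestimate.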
Cohen (1988) provided loose guidelines for interpreting d, which can also be applied to Hedges’ g. Cohen suggested that d should be at least 0.2 to be considered a small effect, at least 0.5 to be considered a medium effect, and 0.8 or greater to be considered a large effect. However, he cautioned that these values should not be used as hard-and-fast rules, because effect size is affected by many factors related to the context of the study. For example, students with disabilities and older students tend not to make as large improvements in response to a new practice as students without disabilities and younger students. Additionally, using a researcher-created assessment that is closely tied to the intervention tends to produce larger effects than using a standardized assessment. The conditions of the control group influence the effect size as well. If “business as usual” in the control group consists of instruction using a highly effective practice, effects will be much smaller than if the control group receives ineffective or no instruction. Accordingly, although a d (or g) of 0.45 is small according to Cohen’s guidelines, it might be considered a medium or even large effect in a study of high-school students with disabilities using a standardized assessment. Thus, Cook and colleagues’ (2018) take-home message is that group-difference effect sizes “enable research consumers to evaluate the practical importance of study findings when considered appropriately in the context of study characteristics such as participants, dependent variables, and comparison conditions” (p. 56).
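Cohen's benchmarks above amount to a simple mapping, sketched below with a hypothetical helper; as the paragraph stresses, the label is a starting point, not a verdict:

```python
def cohen_label(d):
    """Map an effect size onto Cohen's (1988) rough benchmarks.

    Cohen cautioned against treating these cutoffs as hard-and-fast
    rules; study context (participants, measures, comparison
    conditions) should always inform interpretation.
    """
    size = abs(d)   # sign only indicates which group outperformed
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"

label = cohen_label(0.45)   # "small" by the benchmarks, though possibly
                            # meaningful given the study's context
```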
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cook, B. G., Cook, L., & Therrien, W. J. (2018). Group-difference effect sizes: Gauging the practical importance of findings from group-experimental research. Learning Disabilities Research & Practice, 33, 56-63. doi:
Travers, J. C., Cook, B. G., & Cook, L. (2017). Null hypothesis significance testing and p-values. Learning Disabilities Research & Practice, 32, 208-215. doi: 10.1111/ldrp.12147