Decide whether to normalise ChatGPT scores for abstract length in addition to field and year

Ascertain whether residual length-related bias exists in ChatGPT 4o-mini REF-style research quality scores after accounting for second-order effects (such as journal abstract policies and short-form article types), and determine whether normalising scores for abstract length, in addition to field and year, is warranted.

Background

The analyses showed that longer abstracts often received higher ChatGPT scores, with evidence suggesting this association may partly reflect journal-level policies and the inclusion of short-form articles rather than a direct model bias. Nevertheless, the possibility of a residual length-induced bias cannot be ruled out.
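One way to probe this open question is a regression that asks whether abstract length still predicts the ChatGPT score once the suspected second-order factors are held fixed. The sketch below is not from the paper: it assumes a table with hypothetical columns score, abstract_words, journal and article_type, and uses Python with pandas and statsmodels.

# Check for residual length bias: regress ChatGPT scores on abstract length
# while controlling for the suspected second-order effects (journal-level
# abstract policy, short-form article types).
# Column names and the input file are assumptions, not the authors' data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("chatgpt_scores.csv")  # hypothetical input: one row per scored abstract

model = smf.ols(
    "score ~ abstract_words + C(journal) + C(article_type)",
    data=df,
).fit()

# If the abstract_words coefficient stays positive and significant after these
# controls, a residual length-related bias remains plausible.
print(model.params["abstract_words"], model.pvalues["abstract_words"])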

Because the authors recommend field- and year-normalisation of scores, they explicitly raise the unresolved question of whether an additional normalisation for abstract length is appropriate, pending further investigation into the extent of any remaining length-related bias.
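For the normalisation decision itself, a minimal sketch of the two options follows, again under assumed column names (score, field, year, abstract_words): the field- and year-normalisation the authors recommend, implemented here as a z-score within each field-year cell, plus an optional further step that residualises the normalised score on abstract length. The second step would be warranted only if a genuine residual length bias were confirmed.

# Field- and year-normalisation, with an optional length adjustment.
# Column names and the input file are assumptions, not the authors' data.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("chatgpt_scores.csv")  # hypothetical input: one row per scored abstract

# Field- and year-normalised score: z-score within each (field, year) cell.
grp = df.groupby(["field", "year"])["score"]
df["score_fy"] = (df["score"] - grp.transform("mean")) / grp.transform("std")

# Optional further adjustment: keep only the part of the field/year-normalised
# score that abstract length does not explain.
length_model = smf.ols("score_fy ~ abstract_words", data=df).fit()
df["score_fy_len_adjusted"] = length_model.resid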

References

Finally, the abstract length factor found here potentially indicates another ChatGPT bias, such as against articles in journals with stricter abstract length restrictions. From the discussion, however, it seems more likely that weaker articles tend to appear in journals that allow short abstracts, or to be shorter contributions, an acceptable second-order effect. Again, more research is needed to investigate this and to decide whether it would ever be appropriate to normalise ChatGPT scores for abstract length in addition to field and year.

Research evaluation with ChatGPT: Is it age, country, length, or field biased? (arXiv:2411.09768, Thelwall et al., 2024), in Conclusions