
[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

Published 9 Apr 2024 in cs.CL | (2404.06214v2)

Abstract: After last year's successful BabyLM Challenge, the competition will be hosted again in 2024/2025. The overarching goals of the challenge remain the same; however, some of the competition rules will be different. The big changes for this year's competition are as follows: First, we replace the loose track with a paper track, which allows (for example) non-model-based submissions, novel cognitively-inspired benchmarks, or analysis techniques. Second, we are relaxing the rules around pretraining data, and will now allow participants to construct their own datasets provided they stay within the 100M-word or 10M-word budget. Third, we introduce a multimodal vision-and-language track, and will release a corpus of 50% text-only and 50% image-text multimodal data as a starting point for LM model training. The purpose of this CfP is to provide rules for this year's challenge, explain these rule changes and their rationale in greater detail, give a timeline of this year's competition, and provide answers to frequently asked questions from last year's challenge.

References (17)
  1. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736.
  2. Datasheets for datasets. Communications of the ACM, 64(12):86–92.
  3. Martin Gerlach and Francesc Font-Clos. 2018. A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics. Computing Research Repository, arXiv:1812.08092.
  4. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision, 128(7):1956–1981.
  5. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.
  6. Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 923–929, Portorož, Slovenia. European Language Resources Association (ELRA).
  7. Brian MacWhinney. 2000. The CHILDES project: The database, volume 2. Psychology Press.
  8. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
  9. Connecting vision and language with localized narratives. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pages 647–664. Springer.
  10. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  11. Trained on 100 million words and still in shape: BERT meets British National Corpus. arXiv preprint arXiv:2303.09859.
  12. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.
  13. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–371.
  14. GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100.
  15. Call for papers – the BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus. arXiv preprint arXiv:2301.11796.
  16. Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 1–34.
  17. Towards more human-like language models based on contextualizer pretraining strategy. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 317–326.

Summary

  • The call for papers outlines the second BabyLM Challenge, which targets sample-efficient pretraining inspired by human language acquisition under strict data budgets.
  • It introduces a new multimodal track combining vision and text alongside the existing language-only tracks.
  • The challenge retains last year's evaluation framework with updates, and releases baseline models to drive innovative, cognitively inspired research.

Summary of the "2nd BabyLM Challenge: Sample-efficient Pretraining on a Developmentally Plausible Corpus"

The paper "The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus" (2404.06214) outlines the specifications, rules, and goals for the second iteration of the BabyLM Challenge. The competition focuses on optimizing language model pretraining under strict data limits, pushing researchers to innovate under constraints comparable to those of human cognitive and linguistic development.

Revised Challenge Framework

The BabyLM Challenge continues to prioritize the efficient use of training data inspired by human development. This year's iteration introduces several significant modifications to its structure:

  • Paper Track Introduction: In addition to model-based submissions, researchers can now submit papers describing cognitively inspired benchmarks or analytical methods pertinent to the challenge.
  • Custom Dataset Construction: Participants may now construct their own datasets under a fixed budget of either 100M or 10M words, giving them the flexibility to study how data quality affects pretraining.
  • Multimodal Track: A new vision-and-language track provides a corpus of 50% text-only and 50% image-text data to promote the development of models that understand and generate language in multimodal settings.

Participation Tracks and Dataset Specifications

The challenge features three distinct tracks:

  • Strict and Strict-small Tracks: These use language-only datasets, with training capped at 100M and 10M words, respectively. Participants may use either the provided datasets or construct new ones; self-made data must be accompanied by a datasheet.
  • Vision Track: This track allows models trained on paired image-text data. Participants can utilize the provided multimodal dataset or their own, as long as they adhere to the 100M-word constraint.
  • Paper Track: Participants may submit papers detailing novel metrics, analyses of BabyLM models, or innovative approaches to cognitive modeling, irrespective of evaluation task scores.
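The word budgets that govern the tracks above can be checked mechanically. The sketch below is a hypothetical helper (the track names, budget values, and whitespace-token counting rule are illustrative assumptions, not the challenge's official accounting method):

```python
# Hypothetical sketch: verifying that a self-built corpus stays within a
# BabyLM track's word budget. Budgets mirror the tracks described above;
# the counting rule (whitespace-delimited tokens) is an assumption.

BUDGETS = {
    "strict": 100_000_000,
    "strict-small": 10_000_000,
    "vision": 100_000_000,  # words counted across text-only and caption data
}

def count_words(lines):
    """Count whitespace-delimited tokens across an iterable of text lines."""
    return sum(len(line.split()) for line in lines)

def within_budget(track, lines):
    """Return (word_count, ok) for a corpus, where ok means under budget."""
    n = count_words(lines)
    return n, n <= BUDGETS[track]

# Toy usage with a two-line corpus:
corpus = ["the cat sat on the mat", "a short caption"]
n, ok = within_budget("strict-small", corpus)
```

In practice a submission would stream files from disk rather than hold lines in memory, but the budget check itself reduces to this comparison.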

Evaluation and Baselines

The evaluation framework replicates last year's with enhancements based on participant feedback. It will include both existing linguistic tasks and new multimodal tasks for Vision track submissions. Evaluation remains accessible through a shared Google Colab environment, supplemented by public evaluation code for personal execution.

The baseline models released are informed by the previous year's winning entries, comprising GPT-2, LTG-BERT, and Contextualizer models for the language-only tracks, and GIT and Flamingo models for the Vision track. These serve as reference points that participants are encouraged to surpass.

Implications and Future Directions

The BabyLM Challenge democratizes research on sample-efficient pretraining by constraining data resources to a scale comparable to human linguistic experience. The addition of multimodality opens the door to more versatile language models that better reflect real-world language use, which is grounded in perception as well as text. Future iterations of the challenge may further explore cognitive modeling and continue fostering advances in sample-efficient techniques across varied learning paradigms.

Conclusion

The 2nd BabyLM Challenge extends its predecessor's foundational goals while introducing new tracks and rules that broaden the exploration of pretraining methods under realistic data limits. It fosters interdisciplinary contributions merging cognitive science with AI research, encouraging innovations that meet the stringent demands of sample-efficient pretraining. This paradigm serves both theoretical inquiry into human-like language acquisition and practical progress toward data-efficient AI systems.
