
[Call for Papers] The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus

Published 9 Apr 2024 in cs.CL | (2404.06214v2)

Abstract: After last year's successful BabyLM Challenge, the competition will be hosted again in 2024/2025. The overarching goals of the challenge remain the same; however, some of the competition rules will be different. The big changes for this year's competition are as follows: First, we replace the loose track with a paper track, which allows (for example) non-model-based submissions, novel cognitively-inspired benchmarks, or analysis techniques. Second, we are relaxing the rules around pretraining data, and will now allow participants to construct their own datasets provided they stay within the 100M-word or 10M-word budget. Third, we introduce a multimodal vision-and-language track, and will release a corpus of 50% text-only and 50% image-text multimodal data as a starting point for LM model training. The purpose of this CfP is to provide rules for this year's challenge, explain these rule changes and their rationale in greater detail, give a timeline of this year's competition, and provide answers to frequently asked questions from last year's challenge.

References (17)
  1. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736.
  2. Datasheets for datasets. Communications of the ACM, 64(12):86–92.
  3. Martin Gerlach and Francesc Font-Clos. 2018. A standardized Project Gutenberg corpus for statistical analysis of natural language and quantitative linguistics. Computing Research Repository, arXiv:1812.08092.
  4. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision, 128(7):1956–1981.
  5. Microsoft COCO: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.
  6. Pierre Lison and Jörg Tiedemann. 2016. OpenSubtitles2016: Extracting large parallel corpora from movie and TV subtitles. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 923–929, Portorož, Slovenia. European Language Resources Association (ELRA).
  7. Brian MacWhinney. 2000. The CHILDES project: The database, volume 2. Psychology Press.
  8. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.
  9. Connecting vision and language with localized narratives. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part V 16, pages 647–664. Springer.
  10. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9.
  11. Trained on 100 million words and still in shape: BERT meets British National Corpus. arXiv preprint arXiv:2303.09859.
  12. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.
  13. Dialogue act modeling for automatic tagging and recognition of conversational speech. Computational Linguistics, 26(3):339–371.
  14. GIT: A generative image-to-text transformer for vision and language. arXiv preprint arXiv:2205.14100.
  15. Call for papers – the BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus. arXiv preprint arXiv:2301.11796.
  16. Findings of the BabyLM Challenge: Sample-efficient pretraining on developmentally plausible corpora. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 1–34.
  17. Towards more human-like language models based on contextualizer pretraining strategy. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 317–326.

Summary

  • The call for papers outlines the second BabyLM Challenge, which targets sample-efficient pretraining inspired by human language acquisition under strict data budgets.
  • It introduces a new multimodal track combining vision and text alongside the existing language-only tracks.
  • The challenge retains last year's evaluation framework with updates, and releases baseline models to drive innovative, cognitively inspired research.

Summary of the "2nd BabyLM Challenge: Sample-efficient Pretraining on a Developmentally Plausible Corpus"

The paper "The 2nd BabyLM Challenge: Sample-efficient pretraining on a developmentally plausible corpus" (2404.06214) outlines the specifications, rules, and goals for the second iteration of the BabyLM Challenge. The competition focuses on optimizing language model pretraining under strict data limits, pushing researchers to innovate under constraints comparable to those of human cognitive and linguistic development.

Revised Challenge Framework

The BabyLM Challenge continues to prioritize the efficient use of training data inspired by human development. This year's iteration introduces several significant modifications to its structure:

  • Paper Track Introduction: In addition to model-based submissions, researchers can now submit papers describing cognitively inspired benchmarks or analytical methods pertinent to the challenge.
  • Custom Dataset Construction: Participants may now construct their own datasets under a fixed budget of either 100M or 10M words, giving them the flexibility to study how data quality affects pretraining.
  • Multimodal Track: A new vision-and-language track provides a corpus of 50% text-only and 50% image-text data to promote the development of models that understand and generate language in multimodal settings.

Participation Tracks and Dataset Specifications

The challenge features three distinct tracks:

  • Strict and Strict-small Tracks: These use language-only datasets, with training capped at 100M and 10M words, respectively. Participants may use either the provided datasets or construct new ones; self-made data must be accompanied by a datasheet.
  • Vision Track: This track allows models trained on paired image-text data. Participants can utilize the provided multimodal dataset or their own, as long as they adhere to the 100M-word constraint.
  • Paper Track: Participants may submit papers detailing novel metrics, analyses of BabyLM models, or innovative approaches to cognitive modeling, irrespective of evaluation task scores.
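The word budgets that govern the tracks above can be checked mechanically. The sketch below is a hypothetical helper (the track names, budget values, and whitespace-token counting rule are illustrative assumptions, not the challenge's official accounting method):

```python
# Hypothetical sketch: verifying that a self-built corpus stays within a
# BabyLM track's word budget. Budgets mirror the tracks described above;
# the counting rule (whitespace-delimited tokens) is an assumption.

BUDGETS = {
    "strict": 100_000_000,
    "strict-small": 10_000_000,
    "vision": 100_000_000,  # words counted across text-only and caption data
}

def count_words(lines):
    """Count whitespace-delimited tokens across an iterable of text lines."""
    return sum(len(line.split()) for line in lines)

def within_budget(track, lines):
    """Return (word_count, ok) for a corpus, where ok means under budget."""
    n = count_words(lines)
    return n, n <= BUDGETS[track]

# Toy usage with a two-line corpus:
corpus = ["the cat sat on the mat", "a short caption"]
n, ok = within_budget("strict-small", corpus)
```

In practice a submission would stream files from disk rather than hold lines in memory, but the budget check itself reduces to this comparison.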

Evaluation and Baselines

The evaluation framework replicates last year's with enhancements based on participant feedback. It will include both existing linguistic tasks and new multimodal tasks for Vision track submissions. Evaluation remains accessible through a shared Google Colab environment, supplemented by public evaluation code for personal execution.

The baseline models released are informed by the previous year's winning entries, comprising GPT-2, LTG-BERT, and Contextualizer models for the language-only tracks, and GIT and Flamingo models for the Vision track. These serve as reference points that participants are encouraged to surpass.

Implications and Future Directions

The BabyLM Challenge democratizes research on sample-efficient pretraining by constraining data resources to a scale comparable to human linguistic experience. The addition of multimodality opens the door to more versatile language models that better reflect real-world language use, which is grounded in perception as well as text. Future iterations of the challenge may further explore cognitive modeling and continue fostering advances in sample-efficient techniques across varied learning paradigms.

Conclusion

The 2nd BabyLM Challenge extends its predecessor's foundational goals while introducing new tracks and rules that broaden the exploration of pretraining methods under realistic data limits. It fosters interdisciplinary contributions merging cognitive science with AI research, encouraging innovations that meet the stringent demands of sample-efficient pretraining. This paradigm serves both theoretical inquiry into human-like language acquisition and practical progress toward data-efficient AI systems.
