JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources
Abstract: Neural Information Retrieval has advanced rapidly in high-resource languages, but progress in lower-resource ones such as Japanese has been hindered by data scarcity, among other challenges. Consequently, multilingual models have dominated Japanese retrieval, despite their computational inefficiencies and inability to capture linguistic nuances. While recent multi-vector monolingual models like JaColBERT have narrowed this gap, they still lag behind multilingual methods in large-scale evaluations. This work addresses the suboptimal training methods of multi-vector retrievers in lower-resource settings, focusing on Japanese. We systematically evaluate and improve key aspects of the inference and training settings of JaColBERT, and more broadly, multi-vector models. We further enhance performance through a novel checkpoint merging step, showcasing it to be an effective way of combining the benefits of fine-tuning with the generalization capabilities of the original checkpoint. Building on our analysis, we introduce a novel training recipe, resulting in the JaColBERTv2.5 model. JaColBERTv2.5, with only 110 million parameters and trained in under 15 hours on 4 A100 GPUs, significantly outperforms all existing methods across all common benchmarks, reaching an average score of 0.754, significantly above the previous best of 0.720. To support future research, we make our final models, intermediate checkpoints and all data used publicly available.
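The checkpoint-merging step described above can be illustrated with a generic weight-interpolation sketch. This is a minimal illustration of the general idea, not the paper's exact recipe: the function name, the plain-dict checkpoint representation, and the single uniform interpolation weight `alpha` are all illustrative assumptions.

```python
def merge_checkpoints(base, finetuned, alpha=0.5):
    """Linearly interpolate two checkpoints, given as dicts mapping
    parameter names to flat lists of float weights.

    alpha = 0.0 returns the original (base) checkpoint,
    alpha = 1.0 returns the fine-tuned checkpoint, and values in
    between blend fine-tuning gains with the base model's
    generalization behaviour.
    """
    assert base.keys() == finetuned.keys(), "checkpoints must share parameters"
    return {
        name: [(1.0 - alpha) * b + alpha * f
               for b, f in zip(base[name], finetuned[name])]
        for name in base
    }

# Toy example with a single two-weight parameter tensor.
base = {"w": [0.0, 1.0]}
tuned = {"w": [1.0, 3.0]}
merged = merge_checkpoints(base, tuned, alpha=0.5)
print(merged["w"])  # averages each weight pair
```

In practice the same interpolation would be applied tensor-by-tensor over a real model's state dict rather than over plain Python lists.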