
JaColBERTv2.5: Optimising Multi-Vector Retrievers to Create State-of-the-Art Japanese Retrievers with Constrained Resources

Published 30 Jul 2024 in cs.IR, cs.AI, and cs.CL (arXiv:2407.20750v1)

Abstract: Neural Information Retrieval has advanced rapidly in high-resource languages, but progress in lower-resource ones such as Japanese has been hindered by data scarcity, among other challenges. Consequently, multilingual models have dominated Japanese retrieval, despite their computational inefficiencies and inability to capture linguistic nuances. While recent multi-vector monolingual models like JaColBERT have narrowed this gap, they still lag behind multilingual methods in large-scale evaluations. This work addresses the suboptimal training methods of multi-vector retrievers in lower-resource settings, focusing on Japanese. We systematically evaluate and improve key aspects of the inference and training settings of JaColBERT, and more broadly, multi-vector models. We further enhance performance through a novel checkpoint merging step, showcasing it to be an effective way of combining the benefits of fine-tuning with the generalization capabilities of the original checkpoint. Building on our analysis, we introduce a novel training recipe, resulting in the JaColBERTv2.5 model. JaColBERTv2.5, with only 110 million parameters and trained in under 15 hours on 4 A100 GPUs, significantly outperforms all existing methods across all common benchmarks, reaching an average score of 0.754, significantly above the previous best of 0.720. To support future research, we make our final models, intermediate checkpoints and all data used publicly available.
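The abstract highlights a checkpoint merging step that combines the benefits of fine-tuning with the generalization of the original checkpoint. The paper's exact merging recipe is not given here; the sketch below shows one common form of such merging, plain linear interpolation (weight averaging) between a fine-tuned model and its starting checkpoint, assuming PyTorch state dicts and a hypothetical blending weight `alpha` (none of these names or values come from the paper).

```python
import torch

def merge_checkpoints(base_path: str, finetuned_path: str, alpha: float = 0.5) -> dict:
    """Linearly interpolate two compatible checkpoints (simple weight averaging).

    alpha = 0.0 keeps the original (base) weights, alpha = 1.0 keeps the
    fine-tuned weights; values in between blend the two. Paths and alpha
    are illustrative placeholders, not values taken from the paper.
    """
    base = torch.load(base_path, map_location="cpu")
    finetuned = torch.load(finetuned_path, map_location="cpu")

    merged = {}
    for name, base_param in base.items():
        ft_param = finetuned[name]
        if torch.is_floating_point(base_param):
            # Element-wise interpolation of each weight tensor.
            merged[name] = (1.0 - alpha) * base_param + alpha * ft_param
        else:
            # Non-float buffers (e.g. integer position ids) are copied as-is.
            merged[name] = ft_param.clone()
    return merged

# Hypothetical usage:
# merged_state = merge_checkpoints("jacolbert_base.pt", "jacolbert_ft.pt", alpha=0.5)
# torch.save(merged_state, "jacolbert_merged.pt")
```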
