SaulLM-7B: A pioneering Large Language Model for Law
Abstract: In this paper, we introduce SaulLM-7B, a large language model (LLM) tailored for the legal domain. With 7 billion parameters, SaulLM-7B is the first LLM designed explicitly for legal text comprehension and generation. Leveraging the Mistral 7B architecture as its foundation, SaulLM-7B is trained on an English legal corpus of over 30 billion tokens. SaulLM-7B exhibits state-of-the-art proficiency in understanding and processing legal documents. Additionally, we present a novel instructional fine-tuning method that leverages legal datasets to further enhance SaulLM-7B's performance in legal tasks. SaulLM-7B is released under the MIT License.
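The abstract describes a two-stage recipe: continued pretraining of a Mistral 7B base on a large legal corpus, followed by instruction fine-tuning on legal instruction data. The sketch below is a minimal illustration of that fine-tuning stage using the Hugging Face transformers and datasets libraries; it is not the paper's actual training pipeline. The model id mistralai/Mistral-7B-v0.1, the toy instruction/response example, and all hyperparameters are assumptions chosen for brevity.

```python
# Minimal sketch: causal-LM fine-tuning of a Mistral 7B base on legal
# instruction data. Illustrative only; not SaulLM-7B's actual recipe.
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
                          TrainingArguments, DataCollatorForLanguageModeling)
from datasets import Dataset

# Hypothetical placeholder examples standing in for a legal instruction dataset.
examples = [
    {"text": "### Instruction:\nSummarize the holding of the cited case.\n"
             "### Response:\nThe court held that ..."},
]

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default
model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

def tokenize(batch):
    # Truncate long legal documents to a fixed context length for the sketch.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

dataset = Dataset.from_list(examples).map(tokenize, batched=True,
                                          remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="legal-sft-sketch",
                           per_device_train_batch_size=1,
                           num_train_epochs=1,
                           bf16=True),
    train_dataset=dataset,
    # mlm=False gives standard next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The continued-pretraining stage on raw legal text follows the same pattern, with plain documents in place of instruction/response pairs.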