What has LeBenchmark Learnt about French Syntax?

Published 4 Mar 2024 in cs.CL (arXiv:2403.02173v1)

Abstract: This paper reports on a series of experiments that probe LeBenchmark, a pretrained acoustic model trained on 7k hours of spoken French, for syntactic information. Pretrained acoustic models are increasingly used for downstream speech tasks such as automatic speech recognition, speech translation, spoken language understanding, and speech parsing. They are trained on very low-level information (the raw speech signal) and have no explicit lexical knowledge. Despite that, they achieve reasonable results on tasks that require higher-level linguistic knowledge. An emerging question is therefore whether these models encode syntactic information. We probe each representation layer of LeBenchmark for syntax using the Orféo treebank and observe that the model has learnt some syntactic information. Our results show that syntactic information is most easily extractable from the middle layers of the network, after which a very sharp decrease is observed.
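The layer-wise probing the abstract describes can be sketched in miniature: fit one simple linear classifier per layer on frozen representations, and compare held-out accuracy across layers. The sketch below is illustrative only; the ridge-regression probe, the function name `probe_layers`, and the synthetic data are assumptions, not the paper's actual setup, which uses LeBenchmark representations and Orféo annotations.

```python
import numpy as np

def probe_layers(layer_reps, labels, train_frac=0.8, seed=0):
    """Fit one linear probe per layer; return held-out accuracy per layer.

    layer_reps: list of (n_samples, dim) arrays, one per model layer
    labels:     (n_samples,) integer labels (e.g. syntactic tags)
    """
    rng = np.random.default_rng(seed)
    n = len(labels)
    idx = rng.permutation(n)
    cut = int(train_frac * n)
    tr, te = idx[:cut], idx[cut:]
    n_classes = int(labels.max()) + 1
    Y = np.eye(n_classes)[labels]  # one-hot targets
    accs = []
    for X in layer_reps:
        Xb = np.hstack([X, np.ones((n, 1))])  # append a bias column
        # Ridge-regularised least squares: the probe is kept linear so
        # accuracy reflects what the layer encodes, not probe capacity.
        W = np.linalg.solve(
            Xb[tr].T @ Xb[tr] + 1e-3 * np.eye(Xb.shape[1]),
            Xb[tr].T @ Y[tr],
        )
        pred = (Xb[te] @ W).argmax(axis=1)
        accs.append(float((pred == labels[te]).mean()))
    return accs
```

Plotting the returned accuracies against layer index is one common way to visualise the pattern the paper reports: a rise toward the middle layers followed by a sharp drop.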

