Open-Source Web Service with Morphological Dictionary-Supplemented Deep Learning for Morphosyntactic Analysis of Czech

Published 18 Jun 2024 in cs.CL | (2406.12422v2)

Abstract: We present an open-source web service for Czech morphosyntactic analysis. The system combines a deep learning model with rescoring by a high-precision morphological dictionary at inference time. We show that our hybrid method surpasses two competitive baselines: While the deep learning model ensures generalization for out-of-vocabulary words and better disambiguation, an improvement over an existing morphological analyser MorphoDiTa, at the same time, the deep learning model benefits from inference-time guidance of a manually curated morphological dictionary. We achieve 50% error reduction in lemmatization and 58% error reduction in POS tagging over MorphoDiTa, while also offering dependency parsing. The model is trained on one of the currently largest Czech morphosyntactic corpora, the PDT-C 1.0, with the trained models available at https://hdl.handle.net/11234/1-5293. We provide the tool as a web service deployed at https://lindat.mff.cuni.cz/services/udpipe/. The source code is available at GitHub (https://github.com/ufal/udpipe/tree/udpipe-2), along with a Python client for a simple use. The documentation for the models can be found at https://ufal.mff.cuni.cz/udpipe/2/models#czech_pdtc1.0_model.


Summary

The paper proposes a hybrid approach that integrates deep learning with a morphological dictionary to improve Czech morphosyntactic analysis and reduce errors.
It reduces errors by 50% in lemmatization and by 58% in POS tagging relative to MorphoDiTa, and it also improves on the plain UDPipe 2 baseline.
The open-source web service leverages extensive training data from the Prague Dependency Treebank to deliver high-precision tokenization, lemmatization, and parsing.


Overview

The paper "Open-Source Web Service with Morphological Dictionary-Supplemented Deep Learning for Morphosyntactic Analysis of Czech" presents a morphosyntactic analysis system that combines a deep learning model with a manually curated morphological dictionary. Authored by Milan Straka and Jana Straková from the Institute of Formal and Applied Linguistics at Charles University, the work pairs a modern neural architecture with a traditional linguistic resource to address limitations of existing tools.

Methodology

The proposed system adopts a hybrid approach that leverages deep learning models alongside a morphological dictionary for enhanced morphosyntactic analysis. Specifically, the architecture builds upon UDPipe 2, a deep learning-based tool, which has been supplemented with the MorfFlex morphological dictionary to refine the inference process.

The architecture uses RobeCzech, a monolingual Czech pre-trained language model, at its core. At inference time, the system rescores the outputs predicted by the deep learning model using the morphological dictionary. The combination serves two primary purposes:

Generalization: The deep learning model handles out-of-vocabulary (OOV) words and improves disambiguation.
High Precision: The morphological dictionary enforces consistency and correctness by ruling out model predictions that it does not license for a given word form.
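
The rescoring step described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `rescore`, the candidate scores, the simplified tags, and the toy dictionary are all hypothetical, and the fallback to the model's top prediction when the dictionary offers no usable analysis is an assumption made for this sketch.

```python
from typing import Dict, Set, Tuple

Analysis = Tuple[str, str]  # (lemma, tag); tags are simplified for illustration


def rescore(
    form: str,
    candidates: Dict[Analysis, float],
    dictionary: Dict[str, Set[Analysis]],
) -> Analysis:
    """Return the best-scoring candidate that the dictionary allows.

    Assumption of this sketch: if the form is unknown to the dictionary,
    or none of the model's candidates is licensed by it, the model's own
    top prediction is kept unchanged.
    """
    model_best = max(candidates, key=candidates.get)
    allowed = dictionary.get(form)
    if not allowed:
        return model_best  # out-of-vocabulary form: rely on the model
    valid = {a: s for a, s in candidates.items() if a in allowed}
    if not valid:
        return model_best  # no overlap between model and dictionary
    return max(valid, key=valid.get)


# Hypothetical toy data: "ženu" can be a form of the noun "žena" or of the verb "hnát".
toy_dictionary = {"ženu": {("žena", "NOUN"), ("hnát", "VERB")}}
toy_scores = {("ženu", "NOUN"): 0.5, ("žena", "NOUN"): 0.3, ("hnát", "VERB"): 0.2}

in_vocabulary = rescore("ženu", toy_scores, toy_dictionary)
out_of_vocabulary = rescore("blogísek", {("blogísek", "NOUN"): 0.9}, toy_dictionary)
```

In this toy case the model's top candidate uses the invalid lemma "ženu", so the dictionary removes it and the next-best licensed analysis is chosen; the unknown form "blogísek" is left to the model.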

Data

The training dataset is the Prague Dependency Treebank – Consolidated 1.0 (PDT-C 1.0), which is currently one of the most extensive morphosyntactic corpora for the Czech language. This corpus encompasses a wide array of sources including written texts, spoken data, and user-generated content. By employing such a rich dataset, the model attains a robust representation of Czech morphosyntax.

Results

The system demonstrates significant improvements over the existing UDPipe 2 and MorphoDiTa baselines, achieving:

Lemmatization Accuracy: A 50% error reduction compared to MorphoDiTa and a 35% error reduction compared to UDPipe 2.
POS Tagging Accuracy: A 58% error reduction compared to MorphoDiTa and a 16% error reduction compared to UDPipe 2.
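
For readers unfamiliar with the metric, relative error reduction compares error rates rather than accuracies. The sketch below shows the arithmetic; the function name and the two accuracy values are illustrative assumptions and are not figures from the paper.

```python
def relative_error_reduction(baseline_accuracy: float, system_accuracy: float) -> float:
    """Relative error reduction in percent, given two accuracies in percent."""
    baseline_error = 100.0 - baseline_accuracy
    system_error = 100.0 - system_accuracy
    if baseline_error <= 0.0:
        raise ValueError("baseline has no errors to reduce")
    return 100.0 * (baseline_error - system_error) / baseline_error


# Illustrative numbers only: going from 98.0% to 99.0% accuracy
# halves the error rate (2.0% -> 1.0%).
example_reduction = relative_error_reduction(98.0, 99.0)
```

A small absolute gain in accuracy can therefore correspond to a large relative error reduction when the baseline is already strong.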

Detailed Analysis

Error Analysis

The error analysis indicates that most corrections made by the hybrid system fall into two groups: lemmas generated by UDPipe 2 that are not valid for the given word form, and incorrect choices among lemma senses. Rescoring with the morphological dictionary prunes the invalid lemma candidates, which yields correct outputs even for highly ambiguous word forms.

Parsing Results

The parsing capabilities, evaluated in terms of Unlabeled Attachment Score (UAS) and Labeled Attachment Score (LAS), corroborated the benefits of the joint training approach. The model achieved UAS of 94.41% and LAS of 91.48% on the manually annotated section of the PDT corpus.
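
As a reminder of how the two scores are defined, the sketch below computes UAS and LAS over a list of tokens. The function name, the data layout (one `(head, label)` pair per token), and the four-token example are assumptions for illustration; this is not the evaluation script used in the paper.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (index of the head token, dependency label)


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (UAS, LAS) in percent for token-aligned gold and predicted arcs."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and aligned")
    # UAS counts tokens whose head is correct.
    head_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    # LAS additionally requires the dependency label to match.
    labeled_hits = sum(g == p for g, p in zip(gold, predicted))
    total = len(gold)
    return 100.0 * head_hits / total, 100.0 * labeled_hits / total


# Hypothetical four-token sentence with PDT-style labels.
gold_arcs = [(2, "Sb"), (0, "Pred"), (2, "Obj"), (3, "Atr")]
predicted_arcs = [(2, "Sb"), (0, "Pred"), (2, "Adv"), (2, "Atr")]
uas, las = attachment_scores(gold_arcs, predicted_arcs)
```

Here three of four heads are correct, and only two tokens have both the correct head and the correct label, so LAS can never exceed UAS.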

Practical and Theoretical Implications

Practically, this tool provides a high-precision segmentation, tokenization, morphological analysis, lemmatization, POS tagging, and dependency parsing system for the Czech language. Its deployment as an open-source web service ensures accessibility for a broad audience, including NLP researchers and developers focusing on Czech language processing.
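
A minimal sketch of calling the web service from Python, using only the standard library, might look as follows. It assumes the service's REST endpoint `api/process` with the `data`, `model`, `tokenizer`, `tagger`, and `parser` parameters and a JSON response whose `result` field holds CoNLL-U text. The model identifier `czech-pdtc1.0` and the helper names are assumptions; the exact model name should be checked against the model documentation linked in the abstract. The official Python client in the repository is the supported route.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict, List

API_URL = "https://lindat.mff.cuni.cz/services/udpipe/api/process"
ASSUMED_MODEL = "czech-pdtc1.0"  # assumption: verify the exact identifier


def build_request(text: str, model: str = ASSUMED_MODEL) -> urllib.request.Request:
    """Build a POST request asking for tokenization, tagging, and parsing."""
    params = {"data": text, "model": model, "tokenizer": "", "tagger": "", "parser": ""}
    body = urllib.parse.urlencode(params).encode("utf-8")
    return urllib.request.Request(API_URL, data=body)


def parse_conllu(conllu: str) -> List[Dict[str, object]]:
    """Extract form, lemma, tags, head, and relation from CoNLL-U text."""
    tokens = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip sentence breaks and comment lines
        columns = line.split("\t")
        if len(columns) != 10 or not columns[0].isdigit():
            continue  # skip malformed lines, multiword ranges, empty nodes
        tokens.append({
            "form": columns[1],
            "lemma": columns[2],
            "upos": columns[3],
            "xpos": columns[4],
            "head": int(columns[6]) if columns[6].isdigit() else None,
            "deprel": columns[7],
        })
    return tokens


def analyze(text: str, model: str = ASSUMED_MODEL) -> List[Dict[str, object]]:
    """Send text to the service and return the parsed tokens (needs network)."""
    with urllib.request.urlopen(build_request(text, model)) as response:
        payload = json.load(response)
    return parse_conllu(payload["result"])
```

Calling `analyze("Děti spí.")` would then return one dictionary per token; the network call is kept in its own function so that request building and CoNLL-U parsing can be tested offline.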

Theoretically, this research underscores the potential advantages of hybrid models that incorporate the structured knowledge of traditional linguistic resources into modern neural architectures. The same approach could be applied to other morphologically rich languages and to related NLP tasks, encouraging closer cooperation between computational and traditional linguistics.

Speculative Outlook

Future directions might explore further enhancements in model efficiency and scalability. Integrating optimization techniques to balance between computational load and analytical precision could render the tool more viable for real-time applications. Additionally, the framework's adaptability to other languages warrants further exploration, potentially extending its applicability beyond Czech.

Conclusion

This paper introduces a robust, open-source tool for Czech morphosyntactic analysis that combines deep learning with high-precision morphological dictionaries, yielding substantial improvements in accuracy over baseline methodologies. The hybrid approach of combining deep learning models with conventional linguistic resources exemplifies an effective strategy for tackling complex NLP challenges.

The authors have provided extensive resources, including the web service deployment, source code, and trained models, fostering further advances in morphosyntactic analysis for Czech and potentially other languages.
