
Universal Dependencies Bootstrapping

Updated 12 February 2026
  • The paper presents a deterministic conversion pipeline and synthetic data generation method that effectively bootstraps Universal Dependencies, achieving notable LAS improvements.
  • It leverages linguistic heuristics, supervised parsing, delexicalized transfer, and BERT-based approaches to enhance the annotation quality of both high- and low-resource languages.
  • Empirical evaluations show improvements such as an 85.2 LAS on PerDT-UD and a +2.32pp gain using synthetic languages, underscoring the practical impact of these bootstrapping techniques.

Universal Dependencies (UD) bootstrapping is the process of constructing, expanding, or adapting syntactic treebanks or parsers within the Universal Dependencies schema when direct UD resources are limited. This process relies on systematic algorithms, linguistic mapping strategies, supervised and unsupervised methods, and synthetic data generation to build cross-lingual or monolingual resources exhibiting UD structural consistency. The aim is to facilitate robust syntactic parsing, multitask transfer, and evaluation in both high-resource and low-resource settings.

1. Bootstrapping Workflows: Deterministic Conversion and Synthetic Data Generation

Bootstrapping UD annotations from non-UD treebanks or from raw data involves several classes of methodologies:

Deterministic Conversion. For existing language-specific treebanks not annotated in UD, a conversion pipeline is constructed. The method introduced by Rasooli et al. for the 29,107-sentence Persian Dependency Treebank (PerDT) exemplifies this approach (Rasooli et al., 2020). The conversion workflow splits into:

  • A. Unified Tokenization: Clitics and multiword verbs are split according to UD’s one-token-one-word principle. Morphological analyzers identify main verbs and reassign auxiliaries as dependents.
  • B. POS Tag Mapping: Treebank-specific part-of-speech tags are mapped to UD’s UPOS using lookup tables and surface/NLP heuristics (e.g., NER for PROPN).
  • C. Systematic Corrections: Preprocessing repairs known annotation idiosyncrasies such as chain-conjunctions, numeral misclassification, and passive lemma errors.
  • D. Dependency Relation Mapping: A mapping function f_dep rewrites dependency labels and head-dependent pairs, applying structural pre-actions (chain rotation, head/dependent flipping) before final label assignment from lookup tables.
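The four-stage workflow above can be sketched as a small pipeline. The token fields, the lookup tables (`UPOS_TABLE`, `DEPREL_TABLE`), and the NER flag below are illustrative stand-ins, not the actual PerDT schema or the paper's mapping resources:

```python
from dataclasses import dataclass, replace

# Toy token record; field names are illustrative, not the PerDT schema.
@dataclass
class Token:
    form: str
    pos: str      # treebank-specific tag
    head: int     # 0 = root
    deprel: str

# Hypothetical lookup tables standing in for the converter's mapping resources.
UPOS_TABLE = {"N": "NOUN", "V": "VERB", "PREP": "ADP", "AJ": "ADJ"}
DEPREL_TABLE = {"SBJ": "nsubj", "OBJ": "obj", "ROOT": "root", "NPP": "case"}

def map_pos(tok, is_named_entity=False):
    """Stage B: UPOS via lookup, overridden by an NER heuristic for PROPN."""
    if tok.pos in {"N", "AJ"} and is_named_entity:
        return replace(tok, pos="PROPN")
    return replace(tok, pos=UPOS_TABLE.get(tok.pos, "X"))

def map_deprel(tok):
    """Stage D: label assignment from a table, defaulting to generic `dep`."""
    return replace(tok, deprel=DEPREL_TABLE.get(tok.deprel, "dep"))

def convert(sentence, ner_flags):
    # Stages A (tokenization) and C (systematic corrections) are omitted here;
    # they depend on language-specific morphology and known treebank errors.
    return [map_deprel(map_pos(t, f)) for t, f in zip(sentence, ner_flags)]

sent = [Token("Tehran", "N", 2, "SBJ"), Token("grew", "V", 0, "ROOT")]
out = convert(sent, ner_flags=[True, False])
print([(t.form, t.pos, t.deprel) for t in out])
# → [('Tehran', 'PROPN', 'nsubj'), ('grew', 'VERB', 'root')]
```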

Synthetic Language Generation. The Galactic Dependencies (GD) framework produces large numbers of synthetic UD-conformant treebanks by permuting the dependents of real treebanks to emulate the word order statistics of other real languages (Wang et al., 2017). The core stochastic algorithm samples permutations π of dependents D_x of each head x (verb, noun, etc.) using a log-linear model:

$$p_\theta(\pi \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{1 \le i < j \le m} \theta \cdot f(\pi, i, j)\right)$$

Parameters θ are trained to match word order in a superstrate language and interpolated with substrate language settings. This resource covers 53,428 synthetic treebanks spanning all pairwise combinations of 37 real languages.
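A minimal sketch of the log-linear permutation sampler: for the small dependent sets the enumeration handles (up to 7 dependents), the normalizer Z(x) can be computed exactly over all orderings. The pairwise feature function here is a toy preference over dependency-label order, an assumption for illustration rather than the actual GD feature set:

```python
import itertools
import math
import random

def score(perm, theta):
    """Sum of pairwise scores θ·f(π, i, j); here f is a toy indicator that
    fires when two dependency labels appear in a given relative order."""
    s = 0.0
    for i in range(len(perm)):
        for j in range(i + 1, len(perm)):
            s += theta.get((perm[i], perm[j]), 0.0)
    return s

def sample_permutation(deps, theta, rng):
    """Enumerate all orderings (feasible for up to ~7 dependents) and sample
    one from the normalized log-linear distribution."""
    perms = list(itertools.permutations(deps))
    weights = [math.exp(score(p, theta)) for p in perms]
    z = sum(weights)
    return rng.choices(perms, weights=[w / z for w in weights])[0]

# Toy parameters: prefer subjects before objects before obliques.
theta = {("nsubj", "obj"): 2.0, ("obj", "obl"): 1.0}
print(sample_permutation(["obl", "nsubj", "obj"], theta, random.Random(0)))
```

The highest-scoring ordering under these toy weights is ("nsubj", "obj", "obl"), but any permutation can be sampled with probability proportional to its exponentiated score.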

2. Formalization of Key Transformations

Deterministic conversion methods are explicitly specified by mapping functions and transformation steps:

  • POS Mapping:

$$f_{\text{pos}}(p) = \begin{cases} \text{PROPN} & \text{if } p \in \{\text{N}, \text{ADJ}\} \text{ and } \text{NER}(p) \in \{\text{PERSON}, \text{LOC}\} \\ \text{UD\_LOOKUP}(p) & \text{otherwise} \end{cases}$$

  • Head/Dependent Flip:

$$R_{\text{flip}}(h, d) = (d, h)$$

Applied for specific dependency types (e.g., OBJ2, PROG, MOS) requiring content-head reattachment.

  • Case-mark Reattachment (CMR): Dependents tagged ADP or PART are mapped to UD case; other cases follow numerals or specific modifier patterns.
  • Conjunction Rotation: Chain-style conjunctions ("A and B and C") are rotated into head-first UD chains (head=A, conj=B, conj=C).
  • Dependency Mapping: Dependency labels are assigned through precondition checks, pre-actions (rotation, head flip), and lookup table mappings, with exceptions defaulting to the generic dep relation.
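As an illustration of one structural pre-action, conjunction rotation can be sketched as below. The head-map representation and label names are assumptions for the sketch, not the converter's actual data structures:

```python
def rotate_conjunction_chain(heads, labels):
    """Rewrite chain-style conjunctions (each conjunct headed by the previous
    one) into UD's head-first convention: every later conjunct attaches to
    the first conjunct. `heads` maps token index -> head index."""
    new_heads = dict(heads)
    for tok, head in heads.items():
        if labels.get(tok) == "conj":
            # Walk up through chained conj arcs to the first conjunct.
            top = head
            while labels.get(top) == "conj":
                top = new_heads[top]
            new_heads[tok] = top
    return new_heads

# "A and B and C" annotated as a chain: C -> B -> A (token 1 = A)
heads = {1: 0, 2: 1, 3: 2}
labels = {2: "conj", 3: "conj"}
print(rotate_conjunction_chain(heads, labels))  # → {1: 0, 2: 1, 3: 1}
```

After rotation, B and C both attach to A with the conj relation, matching the head-first structure described above.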

GD’s permutation algorithm is fully enumerative for up to 7 dependents and specifies the scoring of permutations by features on syntactic and positional attributes.

3. Machine Learning and Heuristic Components

Bootstrapping relies on both rule-based and statistical/ML methods:

  • Linguistic Heuristics: Named-entity recognition (e.g. via BERT-based NER) to distinguish PROPN, detection of ezāfe pronouns for case/advmod/amod labeling, verbal modality detection for aux/aux:pass, and light-verb markers for compound:lvc.
  • Supervised Parsers: UDPipe 2.0 with fastText embeddings is used for in-domain supervised parsing; transition-based neural scoring incorporates context windows of word, lemma, UPOS, and morphological features (Rasooli et al., 2020).
  • Delexicalized Transfer Models: Cross-lingual transfer employs averaged-perceptron arc-eager parsers (Yara), relying only on non-lexicalized features—critical for evaluating UD-compatibility across unrelated scripts/languages.
  • BERT-based Bootstrapping: UD-conformant trees can be bootstrapped from BERT or mBERT self-attentions, using greedy selection and ensembling of attention heads best aligning with UD labeled arcs, requiring only a handful of annotated sentences for minimal supervision (Limisiewicz et al., 2020).
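The greedy head-selection idea can be schematized as follows, using synthetic matrices in place of real (m)BERT self-attentions. The argmax arc-extraction rule, the ensembling by simple averaging, and the toy data are illustrative assumptions, not the exact procedure of Limisiewicz et al.:

```python
import numpy as np

def arcs_from_attention(attn):
    # One predicted head per token: the argmax of its attention row.
    # Real systems additionally impose tree constraints.
    return attn.argmax(axis=1)

def uas(pred, gold):
    return float(np.mean(pred == gold))

def greedy_select(heads, gold, k=2):
    """Greedily ensemble the k attention heads whose averaged matrix best
    recovers gold UD arcs on a handful of annotated sentences."""
    chosen = []
    for _ in range(k):
        scored = []
        for h in range(len(heads)):
            if h in chosen:
                continue
            avg = np.mean([heads[i] for i in chosen + [h]], axis=0)
            scored.append((uas(arcs_from_attention(avg), gold), h))
        _, best_h = max(scored)
        chosen.append(best_h)
    return chosen

rng = np.random.default_rng(0)
gold = np.array([1, 0, 1, 2])              # toy gold head indices
heads = rng.random((6, 4, 4))              # 6 synthetic "attention heads"
heads[3] = np.eye(4)[gold] + 0.1 * rng.random((4, 4))  # plant a good head
print(greedy_select(heads, gold))
```

The planted head (index 3) perfectly matches the gold arcs, so greedy selection picks it first; further heads are added only by their contribution to the ensemble's UAS.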

4. Evaluation and Empirical Findings

Comprehensive evaluation is essential to validate UD bootstrapping:

  • Dataset Statistics: Converted PerDT-UD offers 26,196 training sentences versus 4,798 for the previous UDT treebank, with significantly expanded vocabulary and verb type counts (Rasooli et al., 2020).
  • Supervised Parsing: PerDT-UD achieves 85.2 LAS (labeled attachment score) versus 79.4 on UDT. Cross-treebank tagging/parsing shows large LAS/UAS drops, driven by annotation mismatches.
  • Delexicalized Transfer: Subsampled and delexicalized PerDT-UD, when transferred to English Web Treebank, yields 47.31 UAS and 38.59 LAS, with a +2 pp LAS gain over UDT—evidence of superior UD compatibility.
  • Synthetic Data Transfer: GD synthetic languages, when included in source pools, lift average dev UAS by +2.32 pp under optimal source selection in single-source transfer parsing, reaching 65.13 UAS versus 62.81 with real sources alone (Wang et al., 2017).
  • Cross-lingual Task Performance: Consistent application of UD bootstrapping enables successful downstream transfer in paraphrase identification and semantic relation extraction. Kernel-based classifiers trained on English UD parses transfer to Farsi and Arabic, e.g., 58.5% accuracy on Farsi paraphrase identification with tree kernels vs. collapse to chance level with non-UD parses (Taghizadeh et al., 2020).
  • Annotation Coverage: Expanded PROPN, csubj, iobj, compound:lvc, and obl:arg annotations, and fixes for cop, xcomp, nmod, and light-verb relations, are obtained through conversion frameworks (Rasooli et al., 2020).
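For reference, the UAS/LAS metrics reported above reduce to per-token comparisons of predicted (head, label) pairs against gold annotations; a minimal sketch:

```python
def attachment_scores(pred, gold):
    """UAS = fraction of tokens with the correct head; LAS additionally
    requires the correct dependency label. Each token is a (head, deprel)
    pair. Returns percentages."""
    n = len(gold)
    uas = sum(p[0] == g[0] for p, g in zip(pred, gold)) / n
    las = sum(p == g for p, g in zip(pred, gold)) / n
    return uas * 100, las * 100

gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "obl")]  # one mislabeled arc
u, l = attachment_scores(pred, gold)
print(round(u, 2), round(l, 2))  # → 100.0 66.67
```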

5. Comparison of Bootstrapping Approaches

Distinct bootstrapping paradigms target different data scenarios:

| Approach | Input | Output | Evaluation Gains |
| --- | --- | --- | --- |
| Deterministic Conversion | Language-specific treebank | UD treebank, parser | +5.8 pp LAS (in-domain) |
| Synthetic Generation (GD) | Any real UD treebank | 53,428 synthetic UD treebanks | +2.32 pp UAS (transfer) |
| Cross-lingual ML sharing | Multilingual UD treebanks | Multilingual neural parser | Up to +2.45 LAS (shared) |
| BERT-based Bootstrapping | Raw data, small UD dev set | Direct UD parse via attention heads | ~8–10 pp UAS lift |

Deterministic conversion methods (e.g., PerDT-UD) maximize compatibility and coverage for languages where a non-UD treebank already exists, whereas synthetic permutation (GD) creates typologically diverse data resources to bridge gaps in typological space. Multilingual neural models leverage parameter sharing to enhance low-resource language performance (Zapotoczny et al., 2017), and BERT-based methods exploit pretrained representations with minimal annotation (Limisiewicz et al., 2020).

6. Limitations and Future Research Directions

Universal Dependencies bootstrapping faces several limitations:

  • Annotation Divergences: Language-specific syntactic constructions (e.g., SOV order, MWEs, light-verb predicates) can challenge mapping consistency and downstream transfer, despite design efforts to minimize structural inconsistencies (Taghizadeh et al., 2020).
  • Synthetic Resource Constraints: Permutations in GD currently yield only projective trees; token-level permutations ignore prosody, long-distance dependencies ("heaviness"), and do not model morphological inflection or case marking (Wang et al., 2017).
  • Domain Generality: Conversion pipelines and synthetic generators may require additional language engineering for idiomatic constructions or to scale to domain-specific datasets.
  • Resource Bottlenecks: BERT-based methods are sensitive to the typological and script representation in pretraining corpora. SOV languages show reduced accuracy gains (Limisiewicz et al., 2020). Cross-lingual neural models still require at least a few hundred UD-annotated sentences to succeed (Zapotoczny et al., 2017).
  • Heuristic Fragility: Named-entity and MWE heuristics may suffer from recall/precision errors, especially under domain or language shift.
  • Open Directions: Planned advances include integrating non-projective reordering in GD, simulating novel vocabularies via phonological mapping, and joint neural training with multitask learning frameworks to bridge parsing and end-task objectives (Wang et al., 2017, Taghizadeh et al., 2020).

7. Best Practices for Universal Dependencies Bootstrapping

Based on large-scale empirical studies and conversion experience, the following protocol ensures robust UD bootstrapping (Rasooli et al., 2020):

  1. Unify tokenization according to UD principles by detaching clitics and segmenting multiword verbs.
  2. Repair or map POS tags via explicit mapping tables, supplemented with NER and structural heuristics.
  3. Correct treebank-specific syntactic artifacts (e.g., chain conjunctions, numeral mislabels).
  4. Specify dependency conversion as a two-stage process: (i) structural pre-actions (rotate, flip, reattach), then (ii) label assignment via mapping tables and conditional logic.
  5. Rigorously evaluate conversions via both in-domain parsing and cross-lingual delexicalized transfer to test UD compatibility and minimize typological annotation drift.

Public codebases and converted resources are available; for example, the UD_Persian-PerDT treebank (https://github.com/UniversalDependencies/UD_Persian-PerDT/tree/dev) supports community verification and adaptation to other languages.


Universal Dependencies bootstrapping synthesizes deterministic mapping, probabilistic transformation, and ML-based transfer in service of universal, scalable syntactic annotation and parsing. It is foundational for cross-linguistic NLP research, low-resource language technology, and downstream modeling reliant on syntactic representations.
