
Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation

Published 6 Nov 2020 in cs.CL | (2011.03286v2)

Abstract: In its daily use, the Indonesian language is riddled with informality, that is, deviations from the standard in terms of vocabulary, spelling, and word order. On the other hand, current available Indonesian NLP models are typically developed with the standard Indonesian in mind. In this work, we address a style-transfer from informal to formal Indonesian as a low-resource machine translation problem. We build a new dataset of parallel sentences of informal Indonesian and its formal counterpart. We benchmark several strategies to perform style transfer from informal to formal Indonesian. We also explore augmenting the training set with artificial forward-translated data. Since we are dealing with an extremely low-resource setting, we find that a phrase-based machine translation approach outperforms the Transformer-based approach. Alternatively, a pre-trained GPT-2 fine-tuned to this task performed equally well but costs more computational resources. Our findings show a promising step towards leveraging machine translation models for style transfer. Our code and data are available at https://github.com/haryoa/stif-indonesia

Citations (18)

Summary

  • The paper introduces iterative forward-translation to synthesize training data and enhance low-resource style transfer performance.
  • The study compares dictionary-based methods, PBSMT, Transformers, and GPT-2, identifying PBSMT and fine-tuned GPT-2 as highly effective.
  • Empirical results demonstrate BLEU scores near 49, underscoring the promise of leveraging pre-trained models in low-resource settings.

Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language

The paper "Semi-Supervised Low-Resource Style Transfer of Indonesian Informal to Formal Language with Iterative Forward-Translation" presents an exploration into the stylistic transformation of informal Indonesian texts into their formal counterparts using several machine translation methodologies. Focusing on the conversational and social media context, where informal language is pervasive due to colloquial expressions and code-mixing, the authors address the limitations posed by the scarcity of annotated datasets for informal Indonesian—a problem that significantly hinders existing NLP models developed primarily for formal Indonesian.

Methodology and Approaches

The researchers investigate this task as a sequence-to-sequence problem, comparing various translation strategies such as dictionary-based translation, Phrase-Based Statistical Machine Translation (PBSMT), Neural Machine Translation using Transformer models, and pre-trained LLMs, specifically GPT-2. Each method has its distinctive features and resource requirements, with PBSMT and GPT-2 showing effective results in this context.

  1. Dictionary-Based Translation: Serving as a baseline, this method relies on a pre-existing word-level formal-informal dictionary. However, it only translates words directly available in the dictionary and often fails with contextually flexible informal expressions.
  2. Phrase-Based Statistical Machine Translation (PBSMT): Given the low-resource setting, PBSMT tends to outperform neural approaches due to its efficacy with limited data, aligning sequences based on phrase-level correspondences.
  3. Neural Machine Translation (Transformer): Although the Transformer generally advances the state of the art in many machine translation tasks, its performance degrades sharply under extreme low-resource conditions: in the reported results, its outputs scored lower than simply leaving the informal input unmodified.
  4. Pre-trained Language Modeling (GPT-2): Fine-tuning a GPT-2 language model, pre-trained on the Indonesian portion of the OSCAR corpus, demonstrates competitive translation capability, highlighting the potential of leveraging large-scale pre-trained models even in low-resource tasks.
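As a concrete sketch of how the dictionary-based baseline operates, the snippet below performs word-level substitution and passes unknown words through unchanged. The dictionary entries are common informal/formal Indonesian pairs chosen for illustration, not the authors' actual lexicon:

```python
# Minimal sketch of the dictionary-based baseline: each token is replaced
# by its formal form if it appears in an informal->formal dictionary, and
# left unchanged otherwise. Entries below are illustrative examples only.
INFORMAL_TO_FORMAL = {
    "gak": "tidak",   # "not"
    "udah": "sudah",  # "already"
    "aku": "saya",    # "I"
}

def dictionary_translate(sentence: str) -> str:
    """Word-level substitution; out-of-dictionary words pass through as-is."""
    return " ".join(INFORMAL_TO_FORMAL.get(tok, tok) for tok in sentence.split())

print(dictionary_translate("aku gak tau"))  # -> saya tidak tau
```

Note how "tau" (informal for "tahu") survives untranslated because it is missing from the dictionary, which is exactly the coverage failure the paper attributes to this baseline.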

To enhance the training resources, the authors introduce the use of forward-translated synthetic datasets. Unlike back-translation, which requires substantial high-quality in-domain formal data, they use iterative forward-translation: at each round, the current model translates monolingual informal sentences, and the resulting synthetic pairs are added to the training data, increasing its variability and utility.
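The iterative loop can be sketched schematically as follows; `train` and `translate` here are toy stand-ins (a lookup-table "model"), not the paper's actual PBSMT or GPT-2 training code:

```python
def train(pairs):
    # Toy "model": a lookup table built from (informal, formal) pairs,
    # standing in for PBSMT / GPT-2 training.
    return dict(pairs)

def translate(model, src):
    # Toy decoder: return the known formal form, or echo the input.
    return model.get(src, src)

def iterative_forward_translation(parallel, mono_informal, rounds=2):
    """Grow the training set by forward-translating monolingual informal
    text with the current model and adding the pairs as synthetic data."""
    data = list(parallel)
    for _ in range(rounds):
        model = train(data)
        synthetic = [(s, translate(model, s)) for s in mono_informal]
        data = list(parallel) + synthetic  # keep real pairs, refresh synthetic
    return train(data)

model = iterative_forward_translation([("gak", "tidak")], ["gak", "udah"])
print(translate(model, "gak"))  # -> tidak
```

The design point the loop illustrates is that only informal-side monolingual data is needed, whereas back-translation would require a formal-side corpus.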

Experimental Results and Findings

The empirical results emphasize the comparative advantage of PBSMT and the fine-tuned GPT-2 model, both achieving a BLEU score near 49, with PBSMT slightly ahead. Incorporating synthetic data via iterative forward-translation modestly improved performance, indicating that semi-supervised approaches can incrementally contribute to the efficacy of style transfer under resource constraints.
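For reference, BLEU, the metric behind these scores, combines modified n-gram precision with a brevity penalty. The following self-contained sketch illustrates the computation; actual evaluations would normally use a standard implementation such as sacreBLEU:

```python
# Self-contained sketch of corpus-level BLEU (up to 4-grams, with brevity
# penalty). Illustrative only; use a standard library for real evaluation.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-gram counts per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            hc, rc = ngrams(h, n), ngrams(r, n)
            matches[n - 1] += sum(min(c, rc[g]) for g, c in hc.items())
            totals[n - 1] += max(len(h) - n + 1, 0)
    if 0 in matches or 0 in totals:
        return 0.0
    log_p = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_p)

# A hypothesis identical to its reference scores 100.
print(round(bleu(["saya tidak tahu apa itu"], ["saya tidak tahu apa itu"]), 1))  # -> 100.0
```

A score near 49, as reported for PBSMT and fine-tuned GPT-2, thus indicates substantial but far from exact n-gram overlap with the formal references.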

Implications and Future Directions

The findings underscore the potential for machine translation models to extend beyond domain-specific tasks, providing pathways for preprocessing tools that enhance the adaptability of NLP systems to varying formality levels in language. Practically, these methodologies can be leveraged as preprocessing modules to augment downstream tasks without extensive reconfiguration.

Future directions are expected to concentrate on refining the generation of synthetic data to further bolster performance gains, exploring additional data augmentation strategies, and potentially adapting the findings to other low-resource language pairs or domains where informal language prevails. Furthermore, the research could expand into evaluating the transferability of these models in broader multilingual contexts, possibly through cross-lingual training paradigms.

Overall, the paper provides notable insights into overcoming the limitations of low-resource style transfer, establishing a valuable reference for researchers working on similar linguistic transitions and multilingual NLP challenges. The availability of their code and datasets contributes to reproducibility and further innovation in the field.
