WAFFLE: Finetuning Multi-Modal Model for Automated Front-End Development
Abstract: Web development involves turning UI designs into functional webpages, which can be difficult for beginners and experienced developers alike due to the complexity of HTML's hierarchical structures and styles. While LLMs have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML's hierarchical structure for LLMs, and (2) bridging the gap between the visual nature of UI designs and the text-based format of HTML code. To tackle these challenges, we introduce Waffle, a new fine-tuning strategy that uses a structure-aware attention mechanism to improve LLMs' understanding of HTML's structure and a contrastive fine-tuning approach to align LLMs' understanding of UI images and HTML code. Models fine-tuned with Waffle achieve up to 9.00 pp (percentage points) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP score, and 27.12 pp higher LLEM on our new benchmark WebSight-Test and the existing benchmark Design2Code, outperforming current fine-tuning methods.
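To make the first idea concrete, below is a minimal, illustrative sketch of how an attention mask could be derived from HTML's hierarchy: each token is tagged with its path of enclosing elements, and attention is permitted only toward tokens in the same element or an ancestor. This is not Waffle's actual implementation; the naive tokenizer, the prefix-based masking rule, and all names here are simplifying assumptions.

```python
# Illustrative sketch only -- NOT Waffle's actual structure-aware attention.
# Assumes well-formed, paired tags (void tags like <br> are not handled).
import re
import torch

def html_tokens_with_paths(html: str):
    """Split HTML into tag/text tokens and record, for each token,
    the path of enclosing element ids from the root."""
    tokens, paths = [], []
    stack, next_id = [], 0
    for piece in re.findall(r"<[^>]+>|[^<]+", html):
        piece = piece.strip()
        if not piece:
            continue
        if piece.startswith("</"):      # closing tag: belongs to its element, then pop
            tokens.append(piece); paths.append(tuple(stack))
            if stack:
                stack.pop()
        elif piece.startswith("<"):     # opening tag: start a new element
            next_id += 1
            stack.append(next_id)
            tokens.append(piece); paths.append(tuple(stack))
        else:                           # text node inside the current element
            tokens.append(piece); paths.append(tuple(stack))
    return tokens, paths

def structure_mask(paths) -> torch.Tensor:
    """Boolean (n, n) mask: token i may attend to token j only when j's
    element path is a prefix of i's, i.e. j sits in the same element or
    an ancestor. False entries would become -inf in the attention scores."""
    n = len(paths)
    mask = torch.zeros(n, n, dtype=torch.bool)
    for i in range(n):
        for j in range(n):
            mask[i, j] = paths[j] == paths[i][:len(paths[j])]
    return mask

tokens, paths = html_tokens_with_paths("<div><h1>Title</h1><p>Hello</p></div>")
mask = structure_mask(paths)  # "Hello" sees its <p> and <div> context, not <h1>
```

The point of such a mask is to bias attention along the DOM tree rather than treating HTML as flat text; how Waffle actually constructs and applies its mask is specified in the paper itself.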
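The second idea, aligning UI screenshots with their HTML, can be sketched with a standard CLIP/SimCSE-style symmetric contrastive (InfoNCE) loss. The function below is a generic placeholder, not the paper's exact objective; the encoders producing the embeddings, the batch size, and the temperature are all assumptions.

```python
# Generic CLIP/SimCSE-style contrastive loss, shown only to illustrate the
# image-code alignment idea; the exact objective in Waffle may differ.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_emb: torch.Tensor,
                               code_emb: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """image_emb, code_emb: (batch, dim), where row i of each side is a
    matched (UI screenshot, HTML) pair. Matched pairs are pulled together;
    all other in-batch pairings are pushed apart."""
    image_emb = F.normalize(image_emb, dim=-1)
    code_emb = F.normalize(code_emb, dim=-1)
    logits = image_emb @ code_emb.t() / temperature  # scaled cosine similarity
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: match each image to its code and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: random tensors stand in for vision- and code-encoder outputs.
loss = contrastive_alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```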
References
- Flamingo: A visual language model for few-shot learning. Preprint, arXiv:2204.14198.
- Qwen-VL: A frontier large vision-language model with versatile abilities. Preprint, arXiv:2308.12966.
- Nougat: Neural optical understanding for academic documents. Preprint, arXiv:2308.13418.
- Lei Chai and Ming Li. 2022. Pyramid attention for source code summarization. Advances in Neural Information Processing Systems, 35:20421–20433.
- ShareGPT4V: Improving large multi-modal models with better captions. Preprint, arXiv:2311.12793.
- InstructBLIP: Towards general-purpose vision-language models with instruction tuning. Preprint, arXiv:2305.06500.
- InCoder: A generative model for code infilling and synthesis. Preprint, arXiv:2204.05999.
- SimCSE: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6894–6910, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Vision2UI: A real-world dataset with layout for code generation from UI designs. Preprint, arXiv:2404.06369.
- LongCoder: A long-range pre-trained language model for code completion. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org.
- DeepSeek-Coder: When the large language model meets programming – the rise of code intelligence. Preprint, arXiv:2401.14196.
- CogAgent: A visual language model for GUI agents. Preprint, arXiv:2312.08914.
- LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
- Mistral 7B. Preprint, arXiv:2310.06825.
- OBELICS: An open web-scale filtered dataset of interleaved image-text documents. Preprint, arXiv:2306.16527.
- Unlocking the conversion of web screenshots into HTML code with the WebSight dataset. Preprint, arXiv:2403.09029.
- LAVIS: A one-stop library for language-vision intelligence. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), pages 31–41, Toronto, Canada. Association for Computational Linguistics.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org.
- BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 12888–12900. PMLR.
- StarCoder: May the source be with you! Preprint, arXiv:2305.06161.
- Improved baselines with visual instruction tuning. Preprint, arXiv:2310.03744.
- Visual instruction tuning. Preprint, arXiv:2304.08485.
- DoRA: Weight-decomposed low-rank adaptation. Preprint, arXiv:2402.09353.
- Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. Preprint, arXiv:1711.05101.
- StarCoder 2 and The Stack v2: The next generation. Preprint, arXiv:2402.19173.
- MM1: Methods, analysis & insights from multimodal LLM pre-training. Preprint, arXiv:2403.09611.
- HierarchyNet: Learning to summarize source code with heterogeneous representations. Preprint, arXiv:2205.15479.
- A conversational paradigm for program synthesis. Preprint, arXiv:2203.13474.
- Learning transferable visual models from natural language supervision. In International Conference on Machine Learning.
- Hierarchical text-conditional image generation with CLIP latents. Preprint, arXiv:2204.06125.
- Alex Robinson. 2019. Sketch2code: Generating a website from a paper mockup. Preprint, arXiv:1905.13750.
- High-resolution image synthesis with latent diffusion models. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10674–10685.
- Code Llama: Open foundation models for code. Preprint, arXiv:2308.12950.
- Complex wavelet structural similarity: A new image similarity index. IEEE Transactions on Image Processing, 18(11):2385–2401.
- CAST: Enhancing code summarization with hierarchical splitting and reconstruction of abstract syntax trees. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4053–4062, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Design2Code: How far are we from automating front-end engineering? Preprint, arXiv:2403.03163.
- CodeArt: Better code models by attention regularization when symbols are lacking. Proc. ACM Softw. Eng., 1(FSE).
- LLaMA: Open and efficient foundation language models. Preprint, arXiv:2302.13971.
- Attention is all you need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
- vikhyat. 2024. Moondream: tiny vision language model.
- Vary: Scaling up the vision vocabulary for large vision-language models. Preprint, arXiv:2312.06109.
- Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489, San Diego, California. Association for Computational Linguistics.
- Sigmoid loss for language image pre-training. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 11941–11952, Los Alamitos, CA, USA. IEEE Computer Society.