Papers
Topics
Authors
Recent
Search
2000 character limit reached

Generative adversarial networks vs large language models: a comparative study on synthetic tabular data generation

Published 20 Feb 2025 in cs.LG and cs.CL | (2502.14523v1)

Abstract: We propose a new framework for zero-shot generation of synthetic tabular data. Using the LLM GPT-4o and plain-language prompting, we demonstrate the ability to generate high-fidelity tabular data without task-specific fine-tuning or access to real-world data (RWD) for pre-training. To benchmark GPT-4o, we compared the fidelity and privacy of LLM-generated synthetic data against data generated with the conditional tabular generative adversarial network (CTGAN), across three open-access datasets: Iris, Fish Measurements, and Real Estate Valuation. Despite the zero-shot approach, GPT-4o outperformed CTGAN in preserving means, 95% confidence intervals, bivariate correlations, and data privacy of RWD, even at amplified sample sizes. Notably, correlations between parameters were consistently preserved with appropriate direction and strength. However, refinement is necessary to better retain distributional characteristics. These findings highlight the potential of LLMs in tabular data synthesis, offering an accessible alternative to generative adversarial networks and variational autoencoders.

Summary

Paper to Video (Beta)

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Continue Learning

We haven't generated follow-up questions for this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 14 likes about this paper.