Generative Adversarial Networks (GANs), synthetic data, quality assessment, privacy, classic data generation models. The concepts are individually abstract, but in a recently published article, INESC TEC researcher Álvaro Figueira – together with Bruno Vaz – established a relationship between them. The article “Survey on Synthetic Data Generation, Evaluation Methods and GANs” and its innovative character earned the authors the Mathematics Best Paper Award, an annual international recognition that acknowledges high-quality publications, scientific relevance and concrete influence.
Let’s take it one step at a time. Data is extremely valuable, particularly high-quality data that also preserves privacy. Achieving both at once has become a challenge, so companies and researchers increasingly turn to synthetic data, i.e. data that is generated artificially rather than collected. Such data can, for instance, improve the performance of machine learning models. GANs are state-of-the-art deep generative models that produce new synthetic samples following the underlying distribution of the original dataset.
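To make the idea of "following the underlying data distribution" more concrete, here is a minimal Python sketch of one of the simplest classical approaches. It is not a GAN, and it is not presented as a method taken from the article: it fits a Gaussian to a single numeric column and draws new values from it. The column `real_ages`, the function names and the fixed seed are all illustrative assumptions.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of one numeric column."""
    return statistics.mean(values), statistics.stdev(values)


def sample_synthetic(values, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the real column."""
    mu, sigma = fit_gaussian(values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" column (made-up ages, not data from the article).
real_ages = [23, 35, 41, 29, 52, 47, 38, 31, 44, 36]
synthetic_ages = sample_synthetic(real_ages, n=1000)

# Crude quality check: the synthetic column should have similar summary statistics.
real_mu, real_sigma = fit_gaussian(real_ages)
synth_mu, synth_sigma = fit_gaussian(synthetic_ages)
print(f"real:      mean={real_mu:.1f}, stdev={real_sigma:.1f}")
print(f"synthetic: mean={synth_mu:.1f}, stdev={synth_sigma:.1f}")
```

No synthetic value corresponds to a real record, yet the column keeps roughly the same mean and spread. A GAN pursues the same goal with a learned generator instead of a fixed Gaussian, which is what lets it capture more complex distributions, and comparing statistics of real and synthetic data is a very simple instance of the quality evaluation discussed further on.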
Álvaro Figueira explained that, until recently, studies and reviews of synthetic data generation covered classical methods and GANs separately. “This article is innovative since it combines both areas in a single comprehensive study, covering everything from classical synthetic data generation methods to generative adversarial networks (GANs), with a special focus on GANs for the creation of tabular data. In addition, the article provides a review of the main methods for evaluating the quality of the data generated, something that other studies do not cover – at least in such a comprehensive way.”
Several sectors already use synthetic data, including healthcare, finance, technology and mobility. In healthcare, the researcher explained, synthetic data “allowed the creation of data to carry out studies without violating the privacy of patients. In finance, synthetic data could contribute to simulate risk scenarios. Regarding the technology sector, synthetic data is vital to train AI algorithms when real data is scarce, or when there is a class imbalance problem.”
All these applications make it possible to improve “the efficiency, innovation and safety of processes, favouring the development and testing of solutions without restrictions in terms of sensitive or insufficient data”.
In this sense, both the research and the article “can act as a vital reference for new researchers, providing a solid basis for synthetic data generation methods and main evaluation techniques”, said Álvaro Figueira, adding that “by highlighting existing gaps, like the need to focus on tabular data, they can also encourage future research and allow advances in this specific area. This will lead to improved algorithms and practices for the creation and development of applications in critical sectors.”
The article is a starting point for studies in this area, offering a comprehensive and structured view of the literature on synthetic data generation methods and GANs. “Besides compiling and analysing the most significant methods, the article introduces a future research proposal that includes the evaluation of the quality of the data generated by different GAN architectures for tabular data. This is especially relevant for applications that deal with unbalanced data and can bring improvements in the performance of machine learning models in minority classes,” concluded the researcher.
Recently published in the journal Mathematics and acknowledged as one of the best works in this area, the article “Survey on Synthetic Data Generation, Evaluation Methods and GANs” is available here.
The researcher mentioned in this news piece is associated with INESC TEC and UP-FCUP.