INESC TEC is developing new AI tools to generate synthetic data for lung cancer diagnosis

The Institute’s researchers involved in the European project Phase IV AI have developed four technologies that pave the way for better diagnosis systems without compromising patient privacy. These Artificial Intelligence (AI)-based tools can generate high-quality synthetic medical images.

Training AI algorithms to detect lung cancer is a challenge that goes beyond technology: it requires large volumes of annotated medical data that are difficult and costly to obtain, and which raise serious privacy concerns. This is precisely the problem that the INESC TEC team working on Phase IV AI is tackling, by developing tools that generate synthetic pulmonary CT (computed tomography) images – realistic enough to pass for real ones, but with no personal data attached.

The four technologies have been validated in different contexts and cover different stages of the data generation and improvement process for computer-aided diagnosis (CAD) systems. Whilst Phase IV AI has several use cases, INESC TEC is participating specifically in the lung cancer one.

From 3D generation to layer-by-layer synthesis – four complementary approaches

The first technology is a tool that can “create” lung CT scans that look real, despite not being so. The system generates medical images from scratch, with no real patient required. Training AI algorithms to detect lung cancer demands thousands of annotated CT scans – something that is notoriously hard to obtain, since patient data is protected, annotating medical images requires radiologists (which is expensive and time-consuming), and some clinical conditions are rare, meaning few examples exist. This technology addresses all those challenges.

Technically, it works in two steps: first, a diffusion model generates a low-resolution “sketch” version of the CT (essentially a rough draft in which the overall structure of the lung is already present, but without fine detail); then an SRGAN (super-resolution generative adversarial network) model takes that draft and upscales it fourfold, adding the fine details needed to make it realistic. The result is a 512×512×60 image – standard clinical resolution. Before this technology can be deployed more widely, clinical validation by radiologists is still required.
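The two-step, coarse-to-fine data flow described above can be sketched as follows. This is a minimal illustration only: the real system uses a trained diffusion model and an SRGAN, whereas here both stages are replaced by trivial stand-in functions (`generate_draft`, `upscale_4x` are hypothetical names) purely to show the shapes involved.

```python
import numpy as np

def generate_draft(shape=(128, 128, 15), seed=0):
    """Stand-in for the diffusion model: a low-resolution 'sketch' volume."""
    rng = np.random.default_rng(seed)
    return rng.normal(size=shape).astype(np.float32)

def upscale_4x(draft):
    """Stand-in for the SRGAN: 4x nearest-neighbour upsampling on every axis.
    The real model would add fine anatomical detail, not just repeat voxels."""
    return np.kron(draft, np.ones((4, 4, 4), dtype=draft.dtype))

draft = generate_draft()   # rough structure, no fine detail
ct = upscale_4x(draft)     # fourfold upscaling
print(ct.shape)            # (512, 512, 60) -- standard clinical resolution
```

Note how the fourfold upscaling implies the diffusion stage only needs to model a 128×128×15 volume, which is what makes the draft step computationally cheap.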

The second tool generates CT scans on demand, based on a map provided by the researchers. A researcher might, for instance, prompt the system with: “I want a CT scan in which the lung is in this location, with a nodule here.” This approach aims to address the shortage of specific, annotated data for training algorithms to identify nodules. Diffusion models are guided by anatomical segmentation masks, e.g., the position of the lung or nodules, enabling the generation of images that respect specific structural constraints. This feature is particularly useful when training data is scarce or imbalanced. The tool is aimed primarily at companies developing diagnosis software. Compared to the first tool, it offers greater control and is more suited to those requiring highly specific data, whilst the first is better for increasing data volume in a more general way.
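The idea of mask-guided generation can be illustrated with a toy example. A label map (0 = background, 1 = lung, 2 = nodule) constrains where each structure appears; the “generator” below is a trivial stand-in that fills a plausible Hounsfield Unit range per label, whereas the actual tool uses a conditioned diffusion model. All names and the HU ranges are illustrative assumptions, not the project's implementation.

```python
import numpy as np

BACKGROUND, LUNG, NODULE = 0, 1, 2

def make_mask(shape=(64, 64)):
    """Hypothetical mask: a rectangular 'lung' with a small 'nodule' inside."""
    mask = np.zeros(shape, dtype=np.uint8)
    mask[8:56, 8:56] = LUNG
    mask[30:34, 30:34] = NODULE
    return mask

def conditional_generate(mask, seed=0):
    """Stand-in generator: samples intensities from a per-label HU range,
    so the output image respects the structural constraints in the mask."""
    rng = np.random.default_rng(seed)
    hu_ranges = {BACKGROUND: (-1024, -1000), LUNG: (-900, -500), NODULE: (-100, 100)}
    img = np.empty(mask.shape, dtype=np.float32)
    for label, (lo, hi) in hu_ranges.items():
        sel = mask == label
        img[sel] = rng.uniform(lo, hi, size=sel.sum())
    return img

mask = make_mask()
ct_slice = conditional_generate(mask)
# The nodule region is denser (higher HU) than the surrounding lung:
print(bool(ct_slice[mask == NODULE].mean() > ct_slice[mask == LUNG].mean()))  # True
```

Because the mask fully determines where structures may appear, every generated image comes with its annotation “for free” – which is precisely why this setup helps when annotated data is scarce.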

HUYDRA (the third technology) represents an innovative approach: rather than generating the full CT image in one go, it breaks it down by Hounsfield Unit (HU) intervals – the scale used to characterise different tissue types – generating each interval separately before reconstructing the complete image. The multi-head VQVAE (vector-quantised variational autoencoder) architecture used outperforms baseline models by around 6.2% on the FID metric, with lower computational complexity. The main advantage of this tool lies in its efficiency combined with improved image quality. It is also the only one of the four technologies currently being tested by clinical staff (rather than purely through mathematical metrics), which signals a greater level of maturity for real-world use.
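The split-and-recombine principle behind this approach can be shown in a few lines. The sketch below splits a volume into disjoint HU bands (one per “head”), each of which could then be modelled separately, and recovers the full image by recombining the bands; the VQVAE itself is omitted, and the interval edges are illustrative, not HUYDRA's actual configuration.

```python
import numpy as np

def split_by_hu(img, edges=(-1024, -500, 0, 500, 3072)):
    """One channel per HU interval; channels are disjoint and cover the image."""
    channels = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        inside = (img >= lo) & (img < hi)
        channels.append(np.where(inside, img, 0).astype(img.dtype))
    return channels

def recombine(channels):
    """Reconstruct the image by summing the disjoint interval channels."""
    return np.sum(channels, axis=0)

rng = np.random.default_rng(0)
ct = rng.integers(-1024, 3071, size=(32, 32)).astype(np.int32)  # toy 'CT' slice
channels = split_by_hu(ct)
print(np.array_equal(recombine(channels), ct))  # True: lossless decomposition
```

Since each voxel falls in exactly one interval, the decomposition is lossless – each band is simpler than the full dynamic range of a CT, which is where the efficiency gain comes from.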

Unlike the others, the fourth and final technology does not generate CT scans. Instead, it generates the maps that describe the contents of a CT; in other words, rather than producing the medical image itself, it generates the anatomical “blueprint” showing where the lung is and where any nodules are located. This “blueprint” can then be used as input for the other technologies, particularly the second, the on-demand CT generator. Using that second tool requires segmentation masks to guide the generation process, but such masks are scarce and carry associated patient data. This final technology resolves that issue by generating realistic synthetic masks from scratch, without needing any real patient as a starting point. The results achieved with this technology are available in a public repository.
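A toy version of the “blueprint from scratch” idea: below, a random ellipse stands in for the lung and a small disc for a nodule placed inside it, yielding a labelled mask with no patient involved. The real tool learns realistic anatomical shapes with a generative model; every shape, label and function name here is a made-up illustration.

```python
import numpy as np

def synth_mask(shape=(64, 64), seed=0):
    """Hypothetical synthetic 'blueprint': 0 = background, 1 = lung, 2 = nodule."""
    rng = np.random.default_rng(seed)
    yy, xx = np.mgrid[:shape[0], :shape[1]]
    # Random 'lung' ellipse
    cy, cx = rng.integers(24, 40, size=2)
    ry, rx = rng.integers(14, 20, size=2)
    lung = ((yy - cy) / ry) ** 2 + ((xx - cx) / rx) ** 2 <= 1.0
    mask = lung.astype(np.uint8)
    # Random 'nodule' disc, kept inside the lung region
    ny, nx = cy + rng.integers(-5, 6), cx + rng.integers(-5, 6)
    nodule = ((yy - ny) ** 2 + (xx - nx) ** 2) <= 9
    mask[nodule & lung] = 2
    return mask

mask = synth_mask()
print(np.unique(mask).tolist())  # [0, 1, 2]
```

A mask like this could then feed the second tool's mask-conditioned generator, closing the loop between the two technologies.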

The practical applications of these tools are wide-ranging: from expanding datasets for training AI models, to sharing synthetic data between institutions without legal risk, to radiology training. The research team also plans to scale the solutions to supercomputers and explore conditional generation based on specific clinical conditions.

The results have already been submitted to or published in leading scientific forums, including IEEE CBMS 2026, IEEE EMBC 2025, and the journal IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).

Phase IV AI is a European project applying AI to concrete clinical use cases, focusing on lung cancer. At INESC TEC, the work is led by Hélder Oliveira and Tânia Pereira, with the participation of researchers Daniela Ferreira Santos, Pedro Sousa, António Cardoso, Vitória Cruz, Diogo Azevedo, Diogo Martins, and others. The project’s final conference will take place in Turku, Finland, in June 2025.
