“What would this image sound like?” INESC TEC technology translates visual emotion into music 

The ability to automatically generate original soundtracks could have a significant impact on digital content production, advertising and multimedia. A solution developed by an INESC TEC researcher outperforms other state-of-the-art models, and the plan is to make it publicly available in the future. 

Every day, the world’s most popular video library, YouTube, receives around 20 million new uploads: countless hours of footage, most accompanied by music. Yet finding the right soundtrack is not always straightforward: it can be time-consuming (requiring searches through numerous free music libraries) or costly and technically demanding. This challenge could be solved by a tool capable of generating royalty-free music directly from video footage. 

That is precisely the solution INESC TEC is developing; researcher Serkan Sulun is creating an Artificial Intelligence (AI) system capable of automatically generating original music that adapts not only to a video’s emotional tone, but also to rhythm and visual structure. 

Published in the journal IEEE Transactions on Multimedia, the research introduces EMSYNC – a model that composes music in symbolic (MIDI) format from any video and ensures two essential elements of a soundtrack: emotional coherence and temporal synchronisation with scene cuts or shot changes. 

Unlike many systems that generate final audio directly, EMSYNC produces music in MIDI format: a set of digital instructions containing information on pitch, velocity and tempo. This means compositions can later be edited and refined, allowing creators greater creative control and flexibility. 
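To illustrate why a symbolic format stays editable, a note in a MIDI-style representation is just a handful of numbers. The sketch below is not from the paper (the names and structure are hypothetical); it only shows how a composition stored as note events can be edited after generation, e.g. transposed, in a way that would be far harder on rendered audio:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class NoteEvent:
    pitch: int       # MIDI note number, 0-127 (60 = middle C)
    velocity: int    # loudness, 0-127
    start: float     # onset time in beats
    duration: float  # length in beats

def transpose(notes, semitones):
    """Shift every pitch by a number of semitones -- a trivial edit
    on symbolic data, but a hard problem on a finished audio mix."""
    return [replace(n, pitch=n.pitch + semitones) for n in notes]

melody = [NoteEvent(60, 100, 0.0, 1.0), NoteEvent(64, 100, 1.0, 1.0)]
up_a_major_third = transpose(melody, 4)  # C, E becomes E, G#
```

The same event list could later be rendered with any instrument sounds, which is the flexibility the article attributes to MIDI output.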

Serkan’s model has been developed over more than five years and grew out of his PhD proposal, Video-Based Music Generation. “Personally, I have always been passionate about music, but I lacked the talent to play. My master’s research focused on AI and videos, so this idea seemed like the only way I could contribute to some form of musical creation”, explained the researcher. 

“Academically, current methods handle video-based music generation using a segment-based approach, such as matching video frames to short musical sections. I find this approach, as well as the results, unrealistic, and I wanted to take a more holistic approach, like how musicians compose: considering the entirety of the video and music, interpreting the emotions, and matching their rhythms to create synchronicity”, he added. 

The model follows a two-stage approach: first, it analyses the video’s emotional content using a classifier that combines image, audio, text and facial expressions. Then, it maps the detected emotions onto a dimensional model with two key axes: valence (positive or negative) and arousal (level of energy). 
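A common way to realise such a dimensional model is to place each categorical emotion at a point on the valence–arousal plane. The coordinates and labels below are purely illustrative assumptions, not the mapping used in the paper:

```python
# Illustrative valence/arousal coordinates in [-1, 1].
# These values are assumptions for demonstration; the paper's
# classifier produces its own dimensional representation.
EMOTION_MAP = {
    "happy": (0.8, 0.6),    # positive valence, high arousal
    "calm":  (0.5, -0.6),   # positive valence, low arousal
    "angry": (-0.7, 0.8),   # negative valence, high arousal
    "sad":   (-0.6, -0.5),  # negative valence, low arousal
}

def to_dimensional(label):
    """Map a detected categorical emotion to (valence, arousal)."""
    return EMOTION_MAP[label]

valence, arousal = to_dimensional("sad")  # negative, low-energy
```

Working in this continuous space lets the music generator condition on degrees of positivity and energy rather than on a fixed set of emotion classes.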

The perfect marriage

Based on this classification, music generation begins. The result is a composition that reinforces the emotional state conveyed by the images and is suitable for use in digital content, advertising and multimedia – with lower costs and without copyright concerns. 

But EMSYNC demonstrates that music must do more than simply match emotion. Emotional association alone does not necessarily produce a convincing soundtrack; the music must also follow the rhythm of the video. 

This is where the concept of boundary offsets comes into play. The model anticipates scene cuts, associates musical chords with these transitions and calculates the temporal distance to the next cut. According to the paper, this focus on key narrative moments helps maintain rhythmic stability and produce more perceptible synchronisation – naturally “marrying” image and sound. 

“Current research on music generation focuses on audio-only methods. Hybrid methods, using MIDI first and audio next, would better resemble how humans produce music: composing first, performing next. Additionally, current methods treat MIDI as a uniform sequence, although musicians compose using distinct sections like intro, verse, and chorus. MIDI generation models using section information would produce much more human-like music. Of course, this requires MIDI datasets with section labels, and here we can again use AI to automatically label existing MIDI datasets”, explained Serkan. 

The system was compared with benchmark models in automatic music generation for video through both objective and subjective evaluations. According to the publication, the proposed structure “outperforms state-of-the-art models in all subjective metrics and in most objective metrics across all datasets”.  

In an evaluation involving 153 participants, EMSYNC was consistently rated higher in musical quality, richness, emotional correspondence, rhythmic synchronisation and overall suitability to the video. Serkan plans to make the model publicly available in the future. 
