HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization

Question

Does the background music fit this video?
Evaluate the video-music relevance across rhythm, theme, emotion, and culture.

Answer 1

VideoLLaMA2: The music is upbeat and catchy, with a strong rhythm that matches the lively movements. The emotion conveyed is one of joy and freedom.

HarmonySet:
Rhythmic Synchronization: There is a distinct shift in the music at the 17-second mark, coinciding with a transition in the video from close-up shots of individuals to a shot of a large crowd and stage with pyrotechnics. This synchronization accentuates the shift in energy. The rapid transitions and dynamic visuals align well with the music's driving rhythm.

Thematic Coherence: The content of the video is a celebration similar to a music festival. The explosive nature of the music and contemporary instruments used like electronic guitar indicate a modern celebration theme, suggesting an ongoing carnival that matches the main visual content.

Emotional Alignment: The fast-paced music conveys exciting and joyful emotion similar to the visuals, enhancing the atmosphere of the celebration and exhilaration. In the latter part of the video, the music and visuals together reach the climax of emotion.

Cultural Relevance: The video is primarily related to party culture. While the music does not contain specific cultural elements, it suggests the excitement of the party atmosphere.

Abstract

HarmonySet provides comprehensive video-music content and stands out among existing video-music datasets by offering both semantic matching and temporal synchronization annotations.

While VideoLLaMA2 tuned on HarmonySet surpasses Gemini-1.5 Pro in certain aspects, it still falls short of human performance, highlighting both the challenging nature of our task and the limitations of current models.

Using 64 frames yields the lowest scores, indicating potential redundancy or even negative effects from excessive visual input within short (< 1 minute) videos.
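
To make the frame-count comparison concrete, here is a minimal sketch of how a fixed frame budget could be spread over a short clip. The page does not say how frames are selected, so uniform midpoint sampling is an assumption, and `uniform_frame_indices` is a hypothetical helper rather than part of HarmonySet or VideoLLaMA2.

```python
def uniform_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick num_samples frame indices spread evenly across a clip.

    Assumption: uniform sampling. The source does not describe the
    sampling strategy actually used in its experiments.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    # A clip shorter than the budget cannot supply more distinct frames.
    num_samples = min(num_samples, total_frames)
    # Split the clip into equal-width segments and take each midpoint.
    step = total_frames / num_samples
    return [int(step * i + step / 2) for i in range(num_samples)]


if __name__ == "__main__":
    # Hypothetical 30-second clip at 30 fps (900 frames).
    for budget in (8, 16, 64):
        indices = uniform_frame_indices(900, budget)
        print(budget, indices[:4])
```

Under these assumed numbers, a 64-frame budget samples the clip about every 14 frames, so neighbouring samples are likely to be near-duplicates, which is one way to read the redundancy noted above.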
