HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization

¹WeChat, Tencent Inc.    ²Beijing University of Posts and Telecommunications    ³Zhejiang University

*Indicates Equal Contribution


CVPR 2025

HarmonySet is the first instruction-tuning dataset that teaches MLLMs to understand the alignment between video and music, providing rich annotations on rhythmic synchronization, emotional alignment, thematic coherence, and cultural relevance.

Abstract

We introduce HarmonySet, a comprehensive dataset designed to advance video-music understanding. HarmonySet consists of 48,328 diverse video-music pairs, annotated with detailed information on rhythmic synchronization, emotional alignment, thematic coherence, and cultural relevance. We propose a multi-step human-machine collaborative framework for efficient annotation, combining human insights with machine-generated descriptions to identify key transitions and assess alignment across multiple dimensions. Additionally, we introduce a novel evaluation framework with tasks and metrics to assess the multi-dimensional alignment of video and music, including rhythm, emotion, theme, and cultural context. Our extensive experiments demonstrate that HarmonySet, along with the proposed evaluation framework, significantly improves the ability of multimodal models to capture and analyze the intricate relationships between video and music.

Effectiveness on Assessing AI-Generated Music
Video with Music Uploaded by Users
VideoLLaMA2
The video and music match in terms of rhythm and synchronization, as the music plays throughout the video, and the camera pans and focuses on different parts of the city. The emotional impact of the music is upbeat and energetic. The cultural relevance of the video is related to Northern Ireland and its tourism industry.

VideoLLaMA2 with HarmonySet
Rhythm and Synchronization: The tempo is moderate, allowing the viewer to take in the cityscape. There are moments of synchronization, particularly when the camera focuses on the hotel.

Theme and Content: The music's theme appears to be centered around a sense of place and belonging, matching the visuals of the city and the doorman welcoming guests.

Emotion: The music evokes a feeling of optimism and pride; its uplifting nature complements the video's aim to present Belfast in a positive light.

Culture: The music mainly serves as background music without specific cultural references.

Video with Music Generated by M2UGen
VideoLLaMA2
The music in the video is slow and moody. The video and music match in terms of rhythm and synchronization, as well as thematic content, emotion, and culture (vague interpretation). The video captures the essence of Belfast's historic and cultural significance, while the music reflects the city's past and present struggles and resilience (wrong justification).

VideoLLaMA2 with HarmonySet
Rhythm and Synchronization: The overall rhythm of the music is the same, but it does not show good synchronization, especially when the video switches from the city to the hotel close-up.

Theme and Content: The music does not represent a specific theme, while the video mainly focuses on city scenery.

Emotion: The music maintains a repetitive and cyclic character, evoking a feeling of calm and contemplation. The warmth and hospitality displayed by the video may be more in line with positive and energetic music.

Culture: The music itself has no obvious cultural markers.

The vanilla model often gives a wrong justification when judging how well the generated music harmonizes with the visual content (flagged in the parenthetical notes above). The model trained on HarmonySet offers more insightful analysis and can effectively assess both human-composed and AI-generated music. Our dataset facilitates a deeper understanding of both synchronization and semantic alignment.

More Data Cases of HarmonySet
Rhythmic Synchronization: The fast-paced cuts and transitions of the video align perfectly with the energetic tempo of the song. The edits often occur on the beat, creating a dynamic and engaging visual experience that complements the music's rhythm.

Thematic Coherence: The visuals showcase shared moments of fun, friendship, and perhaps even a hint of romantic interest, resonating with the song's desire-centric theme. The activities shown, like eating, shopping, and sightseeing, suggest a sense of adventure and pursuit.

Emotional Alignment: Both the song and the video evoke feelings of joy, excitement, and youthful exuberance. The upbeat music and the smiles, laughter, and playful interactions in the video create a positive and energetic atmosphere.

Cultural Relevance: The video captures elements of contemporary youth culture, including fashion, social activities, and popular brands. This aligns with the modern pop sound of the song, making the overall presentation feel relevant and relatable to a younger audience.

Rhythmic Synchronization: The music's playful and slightly quirky rhythm complements the video's action. The comical sound effect at precisely 0 seconds, coinciding with the initial fall, is a perfect example of synchronization, emphasizing the humorous nature of the wipeout. The shift to more upbeat music around the 5-second mark fits the plot transition.

Thematic Coherence: The lighthearted and almost whimsical tone of the music matches the video's theme of a minor snowboarding accident. It doesn't take itself too seriously, which is reflected in the music choice. The music supports the narrative of a small setback turned into a funny story.

Emotional Alignment: The music helps convey a sense of amusement and lightheartedness, even though the snowboarder fell. The comedic sound effect at the moment of impact immediately sets a humorous tone.

Cultural Relevance: The music has a universal appeal due to its playful nature and doesn't necessarily tie into a specific culture. The video itself touches upon the culture of snowboarding and the lighthearted attitude often associated with extreme sports enthusiasts.
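
The data cases above all follow the same four-label layout, so one annotation string can be split into its dimensions mechanically. The following is a minimal sketch under that assumption only: the function names `split_annotation` and `missing_dimensions` are hypothetical, and the released dataset may store annotations in a different format.

```python
import re

# The four dimension labels as they appear in the data cases above.
DIMENSIONS = (
    "Rhythmic Synchronization",
    "Thematic Coherence",
    "Emotional Alignment",
    "Cultural Relevance",
)

# Matches any known label followed by a colon and optional whitespace.
_LABEL_RE = re.compile(r"(%s):\s*" % "|".join(re.escape(d) for d in DIMENSIONS))


def split_annotation(text: str) -> dict:
    """Split one annotation string into {dimension label: description}.

    Hypothetical helper: assumes each description is introduced by one of
    the labels in DIMENSIONS, as in the examples on this page.
    """
    # With one capture group, re.split returns
    # [preamble, label, body, label, body, ...].
    parts = _LABEL_RE.split(text)
    labels, bodies = parts[1::2], parts[2::2]
    # Collapse runs of whitespace left over from page layout.
    return {label: " ".join(body.split()) for label, body in zip(labels, bodies)}


def missing_dimensions(annotation: dict) -> list:
    """Return the dimension labels that the annotation does not cover."""
    return [d for d in DIMENSIONS if d not in annotation]


if __name__ == "__main__":
    sample = (
        "Rhythmic Synchronization:     The edits often occur on the beat.\n\n"
        "Emotional Alignment:     Both the song and the video evoke joy."
    )
    parsed = split_annotation(sample)
    for label, description in parsed.items():
        print(f"{label}: {description}")
    print("Missing:", missing_dimensions(parsed))
```

A check like `missing_dimensions` can be useful when comparing model outputs, since a response that skips a dimension is easy to overlook in free text.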
