Video Echoed in Music:
Semantic, Temporal, and Rhythmic Alignment for Video-to-Music Generation

Welcome to the video demo page for our paper. This page showcases a subset of the generated music samples. All textual descriptions presented on this page are detailed in Appendix D.3, and the corresponding videos are included in the Supplementary Material.


1. Comparison on TB-Match Video Test

We provide 5 video samples, each with 8 soundtrack variations: 1 groundtruth, 2 from the proposed VeM, and 5 from baseline methods (GVMGen, VidMuse, M2UGen, Diff-BGM, CMT). To showcase the superior temporal and rhythmic alignment of our method, we include a specialized demo (VeM_click) for each test video. The audible "clicks" mark the music beat timestamps that coincide with video transitions (interval error < 0.5 s). These markers show that transitions fall on beat boundaries rather than at intermediate beat phases, making the tighter beat-transition synchronization of our method directly audible.
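The click-marker selection described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `beats_near_transitions` and the nearest-beat matching rule are assumptions, and only the 0.5 s threshold comes from the text. Beat tracking and transition detection are taken as given inputs (lists of timestamps in seconds).

```python
from bisect import bisect_left


def beats_near_transitions(beat_times, transition_times, tolerance=0.5):
    """Pair each video transition with its nearest music beat.

    Hypothetical helper: a pair is kept only when the absolute interval
    error |beat - transition| is strictly below `tolerance` seconds.
    Returns a list of (beat, transition, error) tuples.
    """
    beats = sorted(beat_times)
    matches = []
    if not beats:
        return matches
    for t in sorted(transition_times):
        i = bisect_left(beats, t)
        # Only the beat just before and the beat just after can be nearest.
        candidates = beats[max(i - 1, 0):i + 1]
        nearest = min(candidates, key=lambda b: abs(b - t))
        error = abs(nearest - t)
        if error < tolerance:
            matches.append((nearest, t, error))
    return matches


# Illustrative timestamps (made up for this sketch, in seconds).
beats = [0.0, 2.0, 4.0, 6.0]
transitions = [2.1, 5.0, 6.4]
click_times = [beat for beat, _, _ in beats_near_transitions(beats, transitions)]
print(click_times)  # [2.0, 6.0]
```

In this toy input, the transition at 5.0 s is a full second from the closest beat, so no click is placed for it; the other two transitions each receive a click on their nearest beat.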

Each of the 5 videos is shown with the same 8 soundtracks, grouped as follows:

Groundtruth: GroundTruth
Ours: VeM, VeM_click
Comparison: GVMGen, VidMuse, M2UGen, Diff-BGM, CMT

2. Cross-Domain Video Test

We present 13 test video demos from various external domains: 10 randomly selected online samples and 3 SymMV videos. As in the previous section, audible click markers highlight beat-transition alignment. These demos illustrate the robustness of our method in complex and diverse scenarios, particularly in terms of semantic relevance and rhythmic consistency.

Randomly Selected Online Samples:

Randomly Selected SymMV Samples:

3. Sora-Generated Video Test

We display 10 demos (2 variations per video) for 5 silent videos generated by Sora. Because Sora-generated videos prioritize temporal continuity, they contain fewer transitions, so click markers are omitted for this set. Each video is tested twice, yielding two distinct music outputs that nonetheless maintain a consistent style, ambiance, mood, and temporal structure.