PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Tianxin Xie1,2,†, Wentao Lei1,†, Guanjie Huang1,†, Pengfei Zhang1,†, Kai Jiang1,†, Chunhui Zhang1,3, Fengji Ma1, Haoyu He1, Han Zhang1, Jiangshan He1, Jinting Wang1, Linghan Fang4, Lufei Gao1, Orkesh Ablet1, Peihua Zhang2, Ruolin Hu1, Shengyu Li1, Weilin Lin1,2, Xiaoyang Feng1, Xinyue Yang1, Yan Rong1, Yanyun Wang1, Zihang Shao1, Zelin Zhao1, Chenxing Li2, Shan Yang2, Wenfu Wang2, Meng Yu2, Dong Yu2, Li Liu1,*
1HKUST(GZ), 2Tencent, 3Shanghai Jiao Tong University, 4Technical University of Munich

†Core contributors.

*Corresponding author: Li LIU, avrillliu@hkust-gz.edu.cn

Abstract

Text-to-audio-video (T2AV) generation underpins a wide range of applications demanding realistic audio-visual content, including virtual reality, world modeling, gaming, and filmmaking. However, existing T2AV models remain incapable of generating physically plausible sounds, primarily due to their limited understanding of physical principles. To situate current research progress, we present PhyAVBench, a challenging audio physics-sensitivity benchmark designed to systematically evaluate the audio physics grounding capabilities of existing T2AV models. PhyAVBench comprises 1,000 groups of paired text prompts with controlled physical variables that implicitly induce sound variations, enabling a fine-grained assessment of models' sensitivity to changes in underlying acoustic conditions. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST). Unlike prior benchmarks that primarily focus on audio-video synchronization, PhyAVBench explicitly evaluates models' understanding of the physical mechanisms underlying sound generation, covering 6 major audio physics dimensions, 4 daily scenarios (music, sound effects, speech, and their mix), and 50 fine-grained test points, ranging from fundamental aspects such as sound diffraction to more complex phenomena, e.g., Helmholtz resonance. Each test point consists of multiple groups of paired prompts, where each prompt is grounded by at least 20 newly recorded or collected real-world videos, thereby minimizing the risk of data leakage during model pre-training. Both prompts and videos are iteratively refined through rigorous human-involved error correction and quality control to ensure high quality. We argue that only models with a genuine grasp of audio-related physical principles can generate physically consistent audio-visual content. We hope PhyAVBench will stimulate future progress in this critical yet largely unexplored domain.

Data Distribution

Fig. 1: The data distribution of PhyAVBench.

Audio-Physics Sensitivity Test

Fig. 2: Overview of the PhyAVBench evaluation framework. The Audio-Physics Sensitivity Test (APST) uses paired prompts that differ by a single physical variable (e.g., material). By comparing the directional trends of generated audio features against ground-truth physical laws, we calculate the Contrastive Physical Response Score (CPRS) to assess the model's understanding of real-world physics.
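As a rough illustration of the directional comparison described in the caption, the sketch below scores one prompt pair by the cosine similarity between the change in generated audio features and the change in ground-truth audio features. This is not the definition of CPRS, which this page does not give. The function name `directional_agreement`, the choice of feature vectors, and the use of cosine similarity are all assumptions made for illustration.

```python
import math
from typing import Sequence


def directional_agreement(
    gen_a: Sequence[float],
    gen_b: Sequence[float],
    gt_a: Sequence[float],
    gt_b: Sequence[float],
) -> float:
    """Cosine similarity between generated and ground-truth feature changes.

    Hypothetical stand-in for a physics-sensitivity score, NOT the paper's CPRS.
    Each argument is an audio feature vector (for example, band energies) for
    one prompt of a pair; prompts a and b differ by a single physical variable.
    Returns a value in [-1, 1], or 0.0 if either change is (near) zero.
    """
    if not (len(gen_a) == len(gen_b) == len(gt_a) == len(gt_b)):
        raise ValueError("all feature vectors must have the same length")

    # Feature change induced by switching from prompt a to prompt b.
    gen_delta = [b - a for a, b in zip(gen_a, gen_b)]
    gt_delta = [b - a for a, b in zip(gt_a, gt_b)]

    gen_norm = math.sqrt(sum(x * x for x in gen_delta))
    gt_norm = math.sqrt(sum(x * x for x in gt_delta))
    if gen_norm < 1e-12 or gt_norm < 1e-12:
        # A model that ignores the physical variable produces no change.
        return 0.0

    dot = sum(x * y for x, y in zip(gen_delta, gt_delta))
    return dot / (gen_norm * gt_norm)
```

Under these assumptions, a score near 1 means the generated audio changes in the same direction as the real recordings, a score near -1 means it changes in the opposite direction, and a score of 0 means the model did not respond to the changed variable at all.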

Comparison with Existing Benchmarks

TABLE I: Comparison of unified audio-video generation benchmarks across audio-physics coverage, controlled setting, acoustic scenario coverage, data origin, ground-truth video numbers, and evaluation metrics.

SAVGBench evaluates unconditional audio-video generation. VABench contains only text prompts and relies on a multimodal large language model (MLLM) for evaluation.

Benchmark | Audio-Physics Coverage | Controlled Setting with Paired Samples | Acoustic Scenario Coverage (Music, SFX, Speech, Mix) | Newly Collected | #GT Videos per Prompt | Evaluation Metric
TAVGBench 1 AV-Align
SAVGBench 1 Test Point - AV&Spatial-Align
Verse-Bench 1 AV-Align
JavisBench ✓ (partial) 1 AV-Align
VABench 4 Test Points - 0 AV&Stereo Align
PhyAVBench (Ours) 6 Dimensions & 50 Test Points ≥ 20 AV-Align & Physics Sensitivity Test

Data Curation Pipeline

Fig. 3: The data curation pipeline of PhyAVBench.

Sample Video Pairs in PhyAVBench

Each prompt is grounded by at least 20 newly recorded or collected real-world videos, thereby minimizing the risk of data leakage during model pre-training. The following sample video pairs from PhyAVBench show the diversity of the data.

Prompt GT Sora2 Veo3.1 OVI

Close-up, static camera. An index finger slowly and repeatedly presses the spacebar of a mechanical keyboard multiple times. Other keys remain still. Indoor.

m01_c03_t08_s02_g011_a01

Close-up, static camera. An index finger quickly and repeatedly presses the spacebar of a mechanical keyboard multiple times. Other keys remain still. Indoor.

m01_c03_t08_s02_g011_b01

Close-up, static camera. Water flows into a cup at a slow, gentle rate. Indoor.

m02_c05_t14_s02_g004_a01

Close-up, static camera. Water flows into a cup at a fast, strong rate. Indoor.

m02_c05_t14_s02_g004_b01

A close-up, static shot of a transparent plastic bottle. The bottle contains no water. In a quiet environment, a person holds the bottle and continuously blows air into the bottle opening for about 1 second, repeated three times. The close-up frame includes the person's face from below the nose, hand and the plastic bottle.

m02_c06_t16_s02_g001_a01

A close-up, static shot of a transparent plastic bottle. The bottle is filled with water to about four-fifths of its capacity. In a quiet environment, a person holds the bottle and continuously blows air into the bottle opening for about 1 second, repeated three times. The close-up frame includes the person's face from below the nose, hand and the plastic bottle.

m02_c06_t16_s02_g001_b01

Static medium shot of an empty plastic bottle being dropped onto an indoor floor. The bottle falls and hits the ground in a quiet room.

m02_c07_t18_s02_g001_a01

Static medium shot of a plastic bottle filled with water to about four-fifths of its capacity being dropped onto an indoor floor. The bottle falls and hits the ground in a quiet room.

m02_c07_t18_s02_g001_b01

A static, medium shot recorded in a corridor. A person holds a badminton racket and swings it rapidly through the air multiple times in succession. The environment is quiet. The camera remains still, clearly capturing the person's upper body and the full racket.

m02_c08_t20_s02_g001_a01

A static, medium shot recorded in a corridor. A person holds a badminton racket and swings it slowly through the air multiple times in succession. The environment is quiet. The camera remains still, clearly capturing the person's upper body and the full racket.

m02_c08_t20_s02_g001_b01

Static medium close-up in a small tiled bathroom. A man blow-dries hair with a hair dryer on low airflow with no nozzle attachment, about 10 cm from the hair, moving it slowly back and forth for 8 seconds. Mirror and sink visible, door closed, ceiling light on.

m02_c08_t20_s02_g018_a01

Static medium close-up in a small tiled bathroom. A man blow-dries hair with a hair dryer on high airflow with a narrow concentrator nozzle attached, about 10 cm from the hair, moving it slowly back and forth for 8 seconds. Mirror and sink visible, door closed, ceiling light on.

m02_c08_t20_s02_g018_b01

A static, medium close-up shot facing the doorway. The door is open. In a quiet indoor environment, music is playing from inside the room. The camera remains still throughout the shot, capturing the doorway area and surrounding wall.

m02_c12_t33_s01_g001_a01

A static, medium close-up shot facing the doorway. The door is closed. In a quiet indoor environment, the same music is playing from inside the room. The camera remains still throughout the shot, capturing the doorway area and surrounding wall.

m02_c12_t33_s01_g001_b01

A static, medium close-up shot of a smartphone placed on a table. The phone is not covered. In a quiet environment, an arguing conversation is playing from the phone speaker. With the raindrops pitter-patter, one woman's voice says, "How could you do it?" and the other man responds, "Because I believed your sister indifferent to him." The camera remains still, clearly capturing the phone and the surrounding tabletop.

m03_c12_t33_s03_g001_a01

A static, medium close-up shot of a smartphone placed on a table. The phone is completely covered by an upside-down transparent plastic box. In a quiet environment, an arguing conversation is playing from the phone speaker. With the raindrops pitter-patter, one woman's voice says, "How could you do it?" and the other man responds, "Because I believed your sister indifferent to him." The camera remains still, clearly capturing the phone, the transparent plastic box, and the surrounding tabletop.

m03_c12_t33_s03_g001_b01

Close-up shot, static camera focused on a retractable ballpoint pen held in one hand; the thumb presses and releases the top button with aperiodic, irregular timing; indoor, quiet.

m05_c18_t45_s02_g006_a01

Close-up shot, static camera focused on a retractable ballpoint pen held in one hand; the thumb presses and releases the top button with periodic, regular timing; indoor, quiet.

m05_c18_t45_s02_g006_b01

A static, medium shot of a metal pot placed on a stove. The pot is filled with water that is boiling vigorously, producing continuous bubbling and steam. The environment is otherwise quiet. The camera remains still, clearly capturing the pot and the active boiling water.

m06_c19_t46_s02_g001_a01

A static, medium shot of a metal pot placed on a stove. The pot is filled with water that is gently simmering, with occasional small bubbles forming and minimal steam. The environment is otherwise quiet. The camera remains still, clearly capturing the pot and the lightly boiling water.

m06_c19_t46_s02_g001_b01
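
The sample identifiers listed with each prompt follow a regular pattern, with each pair sharing every field except a trailing `a`/`b` marker. The sketch below shows one way such identifiers might be parsed and matched into pairs. The regular expression is inferred only from the IDs shown on this page, and the meaning of the `m`, `c`, `t`, `s`, and `g` fields is not documented here, so the code treats them as opaque labelled integers. The helper names `parse_sample_id` and `pair_up` are hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample IDs on this page (an assumption, not an
# official specification). Field meanings are left uninterpreted.
SAMPLE_ID = re.compile(
    r"m(?P<m>\d+)_c(?P<c>\d+)_t(?P<t>\d+)_s(?P<s>\d+)_g(?P<g>\d+)"
    r"_(?P<side>[ab])(?P<idx>\d+)"
)


def parse_sample_id(sample_id: str) -> dict:
    """Split an ID such as 'm02_c05_t14_s02_g004_a01' into its fields."""
    match = SAMPLE_ID.fullmatch(sample_id)
    if match is None:
        raise ValueError(f"unrecognised sample ID: {sample_id!r}")
    fields = {k: int(v) for k, v in match.groupdict().items() if k != "side"}
    fields["side"] = match.group("side")  # 'a' or 'b' prompt of the pair
    return fields


def pair_up(sample_ids):
    """Group IDs that agree on every field except the a/b side."""
    pairs = defaultdict(dict)
    for sample_id in sample_ids:
        f = parse_sample_id(sample_id)
        key = (f["m"], f["c"], f["t"], f["s"], f["g"], f["idx"])
        pairs[key][f["side"]] = sample_id
    return dict(pairs)
```

Calling `pair_up` on the identifiers above would yield one entry per prompt pair, each holding an `a` and a `b` identifier, which is the unit that a paired-prompt evaluation would iterate over.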