PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Tianxin Xie^1,2,†, Wentao Lei^1,†, Guanjie Huang^1,†, Pengfei Zhang^1,†, Kai Jiang^1,†, Chunhui Zhang^1,3, Fengji Ma¹, Haoyu He¹, Han Zhang¹, Jiangshan He¹, Jinting Wang¹, Linghan Fang⁴, Lufei Gao¹, Orkesh Ablet¹, Peihua Zhang², Ruolin Hu¹, Shengyu Li¹, Weilin Lin^1,2, Xiaoyang Feng¹, Xinyue Yang¹, Yan Rong¹, Yanyun Wang¹, Zihang Shao¹, Zelin Zhao¹, Chenxing Li², Shan Yang², Wenfu Wang², Meng Yu², Dong Yu², Li Liu^1,*

¹HKUST(GZ), ²Tencent, ³Shanghai Jiao Tong University, ⁴Technical University of Munich

^†Core contributors.

^*Corresponding author: Li LIU, avrillliu@hkust-gz.edu.cn

Abstract

Text-to-audio-video (T2AV) generation underpins a wide range of applications demanding realistic audio-visual content, including virtual reality, world modeling, gaming, and filmmaking. However, existing T2AV models remain incapable of generating physically plausible sounds, primarily due to their limited understanding of physical principles. To situate current research progress, we present PhyAVBench, a challenging audio physics-sensitivity benchmark designed to systematically evaluate the audio physics grounding capabilities of existing T2AV models. PhyAVBench comprises 1,000 groups of paired text prompts with controlled physical variables that implicitly induce sound variations, enabling a fine-grained assessment of models' sensitivity to changes in underlying acoustic conditions. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST). Unlike prior benchmarks that primarily focus on audio-video synchronization, PhyAVBench explicitly evaluates models' understanding of the physical mechanisms underlying sound generation, covering 6 major audio physics dimensions, 4 daily scenarios (music, sound effects, speech, and their mix), and 50 fine-grained test points, ranging from fundamental aspects such as sound diffraction to more complex phenomena, e.g., Helmholtz resonance. Each test point consists of multiple groups of paired prompts, where each prompt is grounded by at least 20 newly recorded or collected real-world videos, thereby minimizing the risk of data leakage during model pre-training. Both prompts and videos are iteratively refined through rigorous human-involved error correction and quality control to ensure high quality. We argue that only models with a genuine grasp of audio-related physical principles can generate physically consistent audio-visual content. We hope PhyAVBench will stimulate future progress in this critical yet largely unexplored domain.

Data Distribution

Audio-Physics Sensitivity Test
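
The paired-prompt design described in the abstract suggests one natural check: when a controlled physical variable changes from prompt A to prompt B, does the generated audio shift in the same direction as the real recordings? The following is a minimal sketch of such a check, not the benchmark's official metric. The function name `direction_agreement` and the choice of features (for example loudness or spectral centroid) are assumptions made for illustration, and the feature vectors are assumed to be already extracted and normalized to comparable scales.

```python
import math


def direction_agreement(real_a, real_b, gen_a, gen_b):
    """Cosine similarity between the real and generated A->B feature shifts.

    Hypothetical helper: each argument is a list of audio features
    (same length, same order) for one side of a prompt pair.
    """
    real_shift = [b - a for a, b in zip(real_a, real_b)]
    gen_shift = [b - a for a, b in zip(gen_a, gen_b)]

    dot = sum(r * g for r, g in zip(real_shift, gen_shift))
    norm_real = math.sqrt(sum(r * r for r in real_shift))
    norm_gen = math.sqrt(sum(g * g for g in gen_shift))

    # If either side shows no shift at all, report zero agreement
    # instead of dividing by zero.
    if norm_real == 0.0 or norm_gen == 0.0:
        return 0.0
    return dot / (norm_real * norm_gen)
```

Under these assumptions, a value near 1 means the generated pair changes the way the real pair does, a value near 0 means the model ignores the controlled variable, and a negative value means it changes in the opposite direction.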

Comparison with Existing Benchmarks

TABLE I: Comparison of unified audio-video generation benchmarks in terms of audio-physics coverage, controlled settings, acoustic scenario coverage, data origin, number of ground-truth videos per prompt, and evaluation metrics.

SAVGBench evaluates unconditional audio-video generation. VABench contains only text prompts and relies on a multimodal LLM (MLLM) for evaluation.

Benchmark	Audio-Physics Coverage	Controlled Setting with Paired Samples	Acoustic Scenario: Music	Acoustic Scenario: SFX	Acoustic Scenario: Speech	Acoustic Scenario: Mix	Newly Collected	#GT Videos per Prompt	Evaluation Metric
TAVGBench	✗	✗	✓	✓	✓	✓	✗	1	AV-Align
SAVGBench	1 Test Point	✗	✓	✗	✓	✗	✗	-	AV&Spatial-Align
Verse-Bench	✗	✗	✓	✓	✓	✓	✓	1	AV-Align
JavisBench	✗	✗	✓	✓	✓	✓	✓ (partial)	1	AV-Align
VABench	4 Test Points	✗	✓	✓	✓	✓	-	0	AV&Stereo Align
PhyAVBench (Ours)	6 Dimensions & 50 Test Points	✓	✓	✓	✓	✓	✓	≥ 20	AV-Align & Physics Sensitivity Test

Data Curation Pipeline

Sample Video Pairs in PhyAVBench

Each prompt is grounded by at least 20 newly recorded or collected real-world videos, thereby minimizing the risk of data leakage during model pre-training. The following sample video pairs from PhyAVBench illustrate the diversity of the data.

Prompt	Sora2	Veo3.1	OVI

Close-up, static camera. An index finger slowly and repeatedly presses the spacebar of a mechanical keyboard multiple times. Other keys remain still. Indoor.

m01_c03_t08_s02_g011_a01

Close-up, static camera. An index finger quickly and repeatedly presses the spacebar of a mechanical keyboard multiple times. Other keys remain still. Indoor.

m01_c03_t08_s02_g011_b01

Close-up, static camera. Water flows into a cup at a slow, gentle rate. Indoor.

m02_c05_t14_s02_g004_a01

Close-up, static camera. Water flows into a cup at a fast, strong rate. Indoor.

m02_c05_t14_s02_g004_b01

A close-up, static shot of a transparent plastic bottle. The bottle contains no water. In a quiet environment, a person holds the bottle and continuously blows air into the bottle opening for about 1 second, repeated three times. The close-up frame includes the person's face from below the nose, hand and the plastic bottle.

m02_c06_t16_s02_g001_a01

A close-up, static shot of a transparent plastic bottle. The bottle is filled with water to about four-fifths of its capacity. In a quiet environment, a person holds the bottle and continuously blows air into the bottle opening for about 1 second, repeated three times. The close-up frame includes the person's face from below the nose, hand and the plastic bottle.

m02_c06_t16_s02_g001_b01

Static medium shot of an empty plastic bottle being dropped onto an indoor floor. The bottle falls and hits the ground in a quiet room.

m02_c07_t18_s02_g001_a01

Static medium shot of a plastic bottle filled with water to about four-fifths of its capacity being dropped onto an indoor floor. The bottle falls and hits the ground in a quiet room.

m02_c07_t18_s02_g001_b01

A static, medium shot recorded in a corridor. A person holds a badminton racket and swings it rapidly through the air multiple times in succession. The environment is quiet. The camera remains still, clearly capturing the person's upper body and the full racket.

m02_c08_t20_s02_g001_a01

A static, medium shot recorded in a corridor. A person holds a badminton racket and swings it slowly through the air multiple times in succession. The environment is quiet. The camera remains still, clearly capturing the person's upper body and the full racket.

m02_c08_t20_s02_g001_b01

Static medium close-up in a small tiled bathroom. A man blow-dries hair with a hair dryer on low airflow, no nozzle attachment about 10 cm from the hair, moving it slowly back and forth for 8 seconds. Mirror and sink visible, door closed, ceiling light on.

m02_c08_t20_s02_g018_a01

Static medium close-up in a small tiled bathroom. A man blow-dries hair with a hair dryer on high airflow with a narrow concentrator nozzle attached about 10 cm from the hair, moving it slowly back and forth for 8 seconds. Mirror and sink visible, door closed, ceiling light on.

m02_c08_t20_s02_g018_b01

A static, medium close-up shot facing the doorway. The door is open. In a quiet indoor environment, music is playing from inside the room. The camera remains still throughout the shot, capturing the doorway area and surrounding wall.

m02_c12_t33_s01_g001_a01

A static, medium close-up shot facing the doorway. The door is closed. In a quiet indoor environment, the same music is playing from inside the room. The camera remains still throughout the shot, capturing the doorway area and surrounding wall.

m02_c12_t33_s01_g001_b01

A static, medium close-up shot of a smartphone placed on a table. The phone is not covered. In a quiet environment, an arguing conversation is playing from the phone speaker. With the raindrops pitter-patter, one woman's voice says, "How could you do it?" and the other man responds, "Because I believed your sister indifferent to him." The camera remains still, clearly capturing the phone and the surrounding tabletop.

m03_c12_t33_s03_g001_a01

A static, medium close-up shot of a smartphone placed on a table. The phone is completely covered by an upside-down transparent plastic box. In a quiet environment, an arguing conversation is playing from the phone speaker. With the raindrops pitter-patter, one woman's voice says, "How could you do it?" and the other man responds, "Because I believed your sister indifferent to him." The camera remains still, clearly capturing the phone, the transparent plastic box, and the surrounding tabletop.

m03_c12_t33_s03_g001_b01

Close-up shot, static camera focused on a retractable ballpoint pen held in one hand; the thumb presses and releases the top button with aperiodic, irregular timing; indoor, quiet.

m05_c18_t45_s02_g006_a01

Close-up shot, static camera focused on a retractable ballpoint pen held in one hand; the thumb presses and releases the top button with periodic, regular timing; indoor, quiet.

m05_c18_t45_s02_g006_b01

A static, medium shot of a metal pot placed on a stove. The pot is filled with water that is boiling vigorously, producing continuous bubbling and steam. The environment is otherwise quiet. The camera remains still, clearly capturing the pot and the active boiling water.

m06_c19_t46_s02_g001_a01

A static, medium shot of a metal pot placed on a stove. The pot is filled with water that is gently simmering, with occasional small bubbles forming and minimal steam. The environment is otherwise quiet. The camera remains still, clearly capturing the pot and the lightly boiling water.

m06_c19_t46_s02_g001_b01
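
The sample IDs listed above follow a regular pattern such as `m01_c03_t08_s02_g011_a01`, with the two prompts of a pair differing only in the final `a`/`b` segment. A minimal sketch of how such IDs could be parsed and grouped into pairs is shown here. The pattern is inferred from the examples on this page only. The meaning of each prefix (`m`, `c`, `t`, `s`, `g`) is not stated in the text, so the fields keep their raw letters, and the helper names `parse_sample_id` and `group_pairs` are hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the IDs shown on this page (an assumption,
# not a documented specification of the benchmark's naming scheme).
SAMPLE_ID = re.compile(
    r"^m(?P<m>\d+)_c(?P<c>\d+)_t(?P<t>\d+)_s(?P<s>\d+)"
    r"_g(?P<g>\d+)_(?P<side>[ab])(?P<idx>\d+)$"
)


def parse_sample_id(sample_id):
    """Split an ID into its numeric fields plus the pair side ('a' or 'b')."""
    match = SAMPLE_ID.match(sample_id)
    if match is None:
        raise ValueError(f"unrecognized sample ID: {sample_id!r}")
    fields = {k: int(v) for k, v in match.groupdict().items() if k != "side"}
    fields["side"] = match.group("side")
    return fields


def group_pairs(sample_ids):
    """Group IDs that share everything before the final segment."""
    pairs = defaultdict(dict)
    for sample_id in sample_ids:
        side = parse_sample_id(sample_id)["side"]
        key = sample_id.rsplit("_", 1)[0]  # drop the trailing a01/b01 part
        pairs[key].setdefault(side, []).append(sample_id)
    return dict(pairs)
```

Grouping on the shared prefix keeps the two sides of each controlled comparison together, which is the unit on which any pairwise sensitivity analysis would operate.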