Text-to-audio-video (T2AV) generation underpins a wide range of applications demanding realistic audio-visual content, including virtual reality, world modeling, gaming, and filmmaking. However, existing T2AV models remain incapable of generating physically plausible sounds, primarily due to their limited understanding of physical principles. To situate current research progress, we present PhyAVBench, a challenging audio physics-sensitivity benchmark designed to systematically evaluate the audio physics grounding capabilities of existing T2AV models. PhyAVBench comprises 1,000 groups of paired text prompts with controlled physical variables that implicitly induce sound variations, enabling a fine-grained assessment of models' sensitivity to changes in underlying acoustic conditions. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST). Unlike prior benchmarks that primarily focus on audio-video synchronization, PhyAVBench explicitly evaluates models' understanding of the physical mechanisms underlying sound generation, covering 6 major audio physics dimensions, 4 daily scenarios (music, sound effects, speech, and their mix), and 50 fine-grained test points, ranging from fundamental aspects such as sound diffraction to more complex phenomena, e.g., Helmholtz resonance. Each test point consists of multiple groups of paired prompts, where each prompt is grounded by at least 20 newly recorded or collected real-world videos, thereby minimizing the risk of data leakage during model pre-training. Both prompts and videos are iteratively refined through rigorous human-involved error correction and quality control to ensure high quality. We argue that only models with a genuine grasp of audio-related physical principles can generate physically consistent audio-visual content. We hope PhyAVBench will stimulate future progress in this critical yet largely unexplored domain.
TABLE I: Comparison of unified audio-video generation benchmarks across audio-physics coverage, controlled setting, acoustic scenario coverage, data origin, ground-truth video numbers, and evaluation metrics.
SAVGBench evaluates unconditioned audio-video generation. VABench contains only text prompts and conducts its evaluation using an MLLM.
| Benchmark | Audio-Physics Coverage | Controlled Setting with Paired Samples | Scenario: Music | Scenario: SFX | Scenario: Speech | Scenario: Mix | Newly Collected | #GT Videos per Prompt | Evaluation Metric |
|---|---|---|---|---|---|---|---|---|---|
| TAVGBench | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | 1 | AV-Align |
| SAVGBench | 1 Test Point | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | - | AV&Spatial-Align |
| Verse-Bench | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | 1 | AV-Align |
| JavisBench | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ (partial) | 1 | AV-Align |
| VABench | 4 Test Points | ✗ | ✓ | ✓ | ✓ | ✓ | - | 0 | AV&Stereo Align |
| PhyAVBench (Ours) | 6 Dimensions & 50 Test Points | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ≥ 20 | AV-Align & Physics Sensitivity Test |
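
To make the idea of a paired sensitivity test concrete, the following is a minimal sketch of how one prompt pair might be checked. It is not the scoring procedure used by PhyAVBench, which is not specified here. The sketch assumes that RMS loudness is a suitable proxy feature (for example, for the door-open versus door-closed pair) and asks only whether the generated audio changes in the same direction as the ground-truth audio. All names (`rms`, `direction`, `pair_is_consistent`, `tone`) and the tolerance `tol` are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a mono waveform given as a list of floats."""
    return math.sqrt(sum(x * x for x in samples) / len(samples))


def direction(feature_a, feature_b, tol=0.05):
    """Return +1, -1, or 0 for the relative change from prompt A to prompt B.

    Changes smaller than `tol` (as a fraction of the larger value) count as 0.
    The threshold is an arbitrary illustrative choice.
    """
    scale = max(abs(feature_a), abs(feature_b), 1e-12)
    delta = (feature_b - feature_a) / scale
    if abs(delta) < tol:
        return 0
    return 1 if delta > 0 else -1


def pair_is_consistent(gt_a, gt_b, gen_a, gen_b, feature=rms):
    """Hypothetical check: the generated pair must change in the same
    direction as the ground-truth pair, and the ground truth must change."""
    gt_dir = direction(feature(gt_a), feature(gt_b))
    gen_dir = direction(feature(gen_a), feature(gen_b))
    return gt_dir != 0 and gt_dir == gen_dir


def tone(amplitude, n=800, cycles=5):
    """Toy stand-in for real audio: a sine wave with the given amplitude."""
    return [amplitude * math.sin(2 * math.pi * cycles * i / n) for i in range(n)]
```

With these toy signals, `pair_is_consistent(tone(1.0), tone(0.3), tone(0.9), tone(0.4))` holds because both pairs get quieter, whereas a model that outputs the same level for both prompts, as in `pair_is_consistent(tone(1.0), tone(0.3), tone(0.8), tone(0.8))`, fails the check. A real evaluation would need features matched to each test point (pitch, onset rate, spectral shape) rather than loudness alone.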
Each prompt is grounded by at least 20 newly recorded or collected real-world videos, thereby minimizing the risk of data leakage during model pre-training. The following sample video pairs from PhyAVBench illustrate the diversity of the data.
| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| Close-up, static camera. An index finger slowly and repeatedly presses the spacebar of a mechanical keyboard multiple times. Other keys remain still. Indoor. m01_c03_t08_s02_g011_a01 | | | | |
| Close-up, static camera. An index finger quickly and repeatedly presses the spacebar of a mechanical keyboard multiple times. Other keys remain still. Indoor. m01_c03_t08_s02_g011_b01 | | | | |
| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| Close-up, static camera. Water flows into a cup at a slow, gentle rate. Indoor. m02_c05_t14_s02_g004_a01 | | | | |
| Close-up, static camera. Water flows into a cup at a fast, strong rate. Indoor. m02_c05_t14_s02_g004_b01 | | | | |
| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| A close-up, static shot of a transparent plastic bottle. The bottle contains no water. In a quiet environment, a person holds the bottle and continuously blows air into the bottle opening for about 1 second, repeated three times. The close-up frame includes the person's face from below the nose, hand and the plastic bottle. m02_c06_t16_s02_g001_a01 | | | | |
| A close-up, static shot of a transparent plastic bottle. The bottle is filled with water to about four-fifths of its capacity. In a quiet environment, a person holds the bottle and continuously blows air into the bottle opening for about 1 second, repeated three times. The close-up frame includes the person's face from below the nose, hand and the plastic bottle. m02_c06_t16_s02_g001_b01 | | | | |
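
As a rough illustration of why this pair should sound different, the bottle can be treated as an idealized Helmholtz resonator. This is a textbook approximation and an assumption about the pair; the benchmark does not state which model it uses. Here $c$ is the speed of sound, $A$ the cross-sectional area of the neck, $L_{\mathrm{eff}}$ the effective neck length, and $V$ the volume of air in the cavity:

$$
f = \frac{c}{2\pi}\sqrt{\frac{A}{V\,L_{\mathrm{eff}}}}
$$

Filling the bottle to about four-fifths of its capacity leaves roughly one-fifth of the air volume, $V' \approx V/5$, while $A$ and $L_{\mathrm{eff}}$ stay the same, so

$$
\frac{f'}{f} = \sqrt{\frac{V}{V'}} \approx \sqrt{5} \approx 2.24 .
$$

Under these assumptions the tone from the partly filled bottle should be a little more than an octave higher than the tone from the empty one. A model that produces the same pitch for both prompts would be insensitive to the controlled variable.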
| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| Static medium shot of an empty plastic bottle being dropped onto an indoor floor. The bottle falls and hits the ground in a quiet room. m02_c07_t18_s02_g001_a01 | | | | |
| Static medium shot of a plastic bottle filled with water to about four-fifths of its capacity being dropped onto an indoor floor. The bottle falls and hits the ground in a quiet room. m02_c07_t18_s02_g001_b01 | | | | |
| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| A static, medium shot recorded in a corridor. A person holds a badminton racket and swings it rapidly through the air multiple times in succession. The environment is quiet. The camera remains still, clearly capturing the person's upper body and the full racket. m02_c08_t20_s02_g001_a01 | | | | |
| A static, medium shot recorded in a corridor. A person holds a badminton racket and swings it slowly through the air multiple times in succession. The environment is quiet. The camera remains still, clearly capturing the person's upper body and the full racket. m02_c08_t20_s02_g001_b01 | | | | |
| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| Static medium close-up in a small tiled bathroom. A man blow-dries hair with a hair dryer on low airflow, no nozzle attachment about 10 cm from the hair, moving it slowly back and forth for 8 seconds. Mirror and sink visible, door closed, ceiling light on. m02_c08_t20_s02_g018_a01 | | | | |
| Static medium close-up in a small tiled bathroom. A man blow-dries hair with a hair dryer on high airflow with a narrow concentrator nozzle attached about 10 cm from the hair, moving it slowly back and forth for 8 seconds. Mirror and sink visible, door closed, ceiling light on. m02_c08_t20_s02_g018_b01 | | | | |
| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| A static, medium close-up shot facing the doorway. The door is open. In a quiet indoor environment, music is playing from inside the room. The camera remains still throughout the shot, capturing the doorway area and surrounding wall. m02_c12_t33_s01_g001_a01 | | | | |
| A static, medium close-up shot facing the doorway. The door is closed. In a quiet indoor environment, the same music is playing from inside the room. The camera remains still throughout the shot, capturing the doorway area and surrounding wall. m02_c12_t33_s01_g001_b01 | | | | |
| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| A static, medium close-up shot of a smartphone placed on a table. The phone is not covered. In a quiet environment, an arguing conversation is playing from the phone speaker. With the raindrops pitter-patter, one woman's voice says, "How could you do it?" and the other man responds, "Because I believed your sister indifferent to him." The camera remains still, clearly capturing the phone and the surrounding tabletop. m03_c12_t33_s03_g001_a01 | | | | |
| A static, medium close-up shot of a smartphone placed on a table. The phone is completely covered by an upside-down transparent plastic box. In a quiet environment, an arguing conversation is playing from the phone speaker. With the raindrops pitter-patter, one woman's voice says, "How could you do it?" and the other man responds, "Because I believed your sister indifferent to him." The camera remains still, clearly capturing the phone, the transparent plastic box, and the surrounding tabletop. m03_c12_t33_s03_g001_b01 | | | | |
| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| Close-up shot, static camera focused on a retractable ballpoint pen held in one hand; the thumb presses and releases the top button with aperiodic, irregular timing; indoor, quiet. m05_c18_t45_s02_g006_a01 | | | | |
| Close-up shot, static camera focused on a retractable ballpoint pen held in one hand; the thumb presses and releases the top button with periodic, regular timing; indoor, quiet. m05_c18_t45_s02_g006_b01 | | | | |
| Prompt | GT | Sora2 | Veo3.1 | OVI |
|---|---|---|---|---|
| A static, medium shot of a metal pot placed on a stove. The pot is filled with water that is boiling vigorously, producing continuous bubbling and steam. The environment is otherwise quiet. The camera remains still, clearly capturing the pot and the active boiling water. m06_c19_t46_s02_g001_a01 | | | | |
| A static, medium shot of a metal pot placed on a stove. The pot is filled with water that is gently simmering, with occasional small bubbles forming and minimal steam. The environment is otherwise quiet. The camera remains still, clearly capturing the pot and the lightly boiling water. m06_c19_t46_s02_g001_b01 | | | | |
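
The sample identifiers shown with each prompt (for example `m01_c03_t08_s02_g011_a01`) follow a regular pattern, so the two members of a pair can be matched programmatically. The sketch below is one possible way to do that. The meaning of the individual fields is not documented here, so the code keeps the neutral letter names from the identifier; treating the trailing `a`/`b` as the pair member and the remaining fields as the group key is an assumption based on the samples above. The function names `parse_sample_id` and `group_pairs` are hypothetical.

```python
import re
from collections import defaultdict

# Pattern inferred from the sample IDs above; the field semantics are assumed.
ID_PATTERN = re.compile(r"^m(\d+)_c(\d+)_t(\d+)_s(\d+)_g(\d+)_([ab])(\d+)$")


def parse_sample_id(sample_id):
    """Split an ID such as 'm01_c03_t08_s02_g011_a01' into its fields."""
    match = ID_PATTERN.match(sample_id)
    if match is None:
        raise ValueError(f"unrecognized sample ID: {sample_id!r}")
    m, c, t, s, g, variant, index = match.groups()
    return {
        "m": int(m), "c": int(c), "t": int(t), "s": int(s), "g": int(g),
        "variant": variant,   # 'a' or 'b': assumed to mark the pair member
        "index": int(index),  # trailing number, meaning not documented
    }


def group_pairs(sample_ids):
    """Group IDs that share every field except the variant and index."""
    groups = defaultdict(dict)
    for sample_id in sample_ids:
        fields = parse_sample_id(sample_id)
        key = (fields["m"], fields["c"], fields["t"], fields["s"], fields["g"])
        groups[key].setdefault(fields["variant"], []).append(sample_id)
    return dict(groups)
```

For instance, passing the two keyboard IDs `m01_c03_t08_s02_g011_a01` and `m01_c03_t08_s02_g011_b01` to `group_pairs` places them under the single key `(1, 3, 8, 2, 11)`, with one entry for variant `a` and one for variant `b`.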