Automatic Speech Recognition (ASR) systems such as Whisper have recently achieved remarkable accuracy, yet they remain highly sensitive to unseen real-world data with large distribution shifts, including noisy environments and diverse accents. Test-time adaptation (TTA) has shown great potential for improving model adaptability at inference time without ground-truth labels, and existing TTA methods typically rely on pseudo-labeling or entropy minimization. However, by treating model confidence as the learning signal, these methods can reinforce high-confidence errors, leading to confirmation bias that undermines adaptation. To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. Specifically, our method introduces a learnable decoder prompt and uses temperature-controlled stochastic decoding to generate diverse transcription candidates. The candidates are scored by a reward model that measures audio-text semantic alignment, and the resulting feedback updates both the model and the prompt parameters via reinforcement learning. Comprehensive experiments on LibriSpeech with synthetic noise and on the L2-ARCTIC accented-English dataset demonstrate that our method significantly outperforms state-of-the-art (SOTA) approaches, including SUTA and SGEM, in both accuracy and inference speed. Ablation studies further confirm the effectiveness of combining audio- and language-based rewards, highlighting our method's enhanced stability and interpretability. Overall, our approach provides a practical and robust solution for deploying ASR systems in challenging real-world conditions.
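The adaptation loop the abstract describes can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the `asr_model.sample` and `reward_model.score` interfaces, the candidate count, the temperature, and the mean-reward baseline are all hypothetical stand-ins.

```python
import torch

def tta_reinforce_step(asr_model, reward_model, audio, prompt, optimizer,
                       num_candidates=8, temperature=1.2):
    """One test-time reinforcement step on a single unlabeled utterance.

    Sketch of the loop described in the abstract: sample diverse transcripts
    via temperature-controlled stochastic decoding, score them with an
    audio-text semantic-alignment reward, and update the learnable decoder
    prompt (and selected model parameters) with a REINFORCE-style gradient.
    """
    # 1) Temperature-controlled stochastic decoding (assumed interface):
    #    returns candidate transcripts and the summed log-probability of
    #    each sampled transcript, with gradients flowing to prompt/model.
    transcripts, log_probs = asr_model.sample(
        audio, prompt=prompt, n=num_candidates, temperature=temperature
    )  # log_probs: tensor of shape (num_candidates,), requires_grad=True

    # 2) Reward: semantic alignment between the audio and each candidate
    #    (assumed interface; e.g., a CLAP-style similarity score).
    with torch.no_grad():
        rewards = reward_model.score(audio, transcripts)   # (num_candidates,)
        advantages = rewards - rewards.mean()              # baseline reduces variance

    # 3) REINFORCE objective: raise the likelihood of high-reward transcripts.
    loss = -(advantages * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Emit the highest-reward candidate as the adapted transcription.
    return transcripts[int(rewards.argmax())]
```

The mean-reward baseline is a common variance-reduction choice for single-sample policy gradients; the paper's exact objective and which parameters it updates besides the decoder prompt may differ.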
@article{fang2026boosting,
  title   = {Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards},
  author  = {Fang, Linghan and Xie, Tianxin and Liu, Li},
  journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume  = {40},
  number  = {36},
  pages   = {30673--30681},
  year    = {2026},
  month   = mar,
  doi     = {10.1609/aaai.v40i36.40323},
  url     = {https://ojs.aaai.org/index.php/AAAI/article/view/40323}
}
2025
arXiv
PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
Tianxin Xie, Wentao Lei, Kai Jiang, and 8 more authors
@article{xie2025phyavbench,
  title   = {PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation},
  author  = {Xie, Tianxin and Lei, Wentao and Jiang, Kai and Huang, Guanjie and Zhang, Pengfei and Zhang, Chunhui and Ma, Fengji and He, Haoyu and Zhang, Han and He, Jiangshan and others},
  journal = {arXiv preprint arXiv:2512.23994},
  year    = {2025}
}
EMNLP 2025
Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey
Tianxin Xie, Yan Rong, Pengfei Zhang, and 2 more authors
In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025
Text-to-speech (TTS) has advanced from generating natural-sounding speech to enabling fine-grained control over attributes like emotion, timbre, and style. Driven by rising industrial demand and breakthroughs in deep learning, e.g., diffusion and large language models (LLMs), controllable TTS has become a rapidly growing research area. This survey provides **the first** comprehensive review of controllable TTS methods, from traditional control techniques to emerging approaches using natural language prompts. We categorize model architectures, control strategies, and feature representations, while also summarizing challenges, datasets, and evaluations in controllable TTS. This survey aims to guide researchers and practitioners by offering a clear taxonomy and highlighting future directions in this fast-evolving field. One can visit https://github.com/imxtx/awesome-controllabe-speech-synthesis for a comprehensive paper list and updates.
@inproceedings{xie-etal-2025-towards,
  title     = {Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey},
  author    = {Xie, Tianxin and Rong, Yan and Zhang, Pengfei and Wang, Wenwu and Liu, Li},
  editor    = {Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
  address   = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.emnlp-main.40/},
  doi       = {10.18653/v1/2025.emnlp-main.40},
  pages     = {764--791},
  isbn      = {979-8-89176-332-6}
}
arXiv
EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering
Tianxin Xie, Shan Yang, Chenxing Li, and 2 more authors
@article{xie2025emosteer,
  title   = {EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering},
  author  = {Xie, Tianxin and Yang, Shan and Li, Chenxing and Yu, Dong and Liu, Li},
  journal = {arXiv preprint arXiv:2508.03543},
  year    = {2025}
}
ICASSP 2025
Inter- and Intra-Sentence Cuer-Invariant Representation Learning for Generalizable Cued Speech Recognition
Tianxin Xie and Li Liu
In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025
@article{xie2025tpami,
  author   = {Xie, Tianxin and Han, Hu and Shan, Shiguang and Chen, Xilin},
  title    = {Natural Adversarial Mask for Face Identity Protection in Physical World},
  journal  = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year     = {2025},
  volume   = {47},
  number   = {3},
  pages    = {2089--2106},
  keywords = {Face recognition; Faces; Protection; Closed box; Privacy; Glass box; Perturbation methods; Three-dimensional printing; Target recognition; Face masks; Deep learning; adversarial example; face identity protection; physical world; natural face mask},
  doi      = {10.1109/TPAMI.2024.3522994}
}