Automatic Speech Recognition (ASR) systems such as Whisper have recently achieved remarkable accuracy, yet they remain highly sensitive to unseen real-world data with large distribution shifts, including noisy environments and diverse accents. Test-time adaptation (TTA) has shown great potential for improving model adaptability at inference time without ground-truth labels, and existing TTA methods typically rely on pseudo-labeling or entropy minimization. However, by treating model confidence as the learning signal, these methods can reinforce high-confidence errors, leading to confirmation bias that undermines adaptation. To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. Specifically, our method introduces a learnable decoder prompt and uses temperature-controlled stochastic decoding to generate diverse transcription candidates. The candidates are scored by a reward model that measures audio-text semantic alignment, and the resulting feedback updates both the model and the prompt parameters via reinforcement learning. Comprehensive experiments on LibriSpeech with synthetic noise and on the L2-ARCTIC accented-English dataset demonstrate that our method significantly outperforms state-of-the-art (SOTA) approaches, including SUTA and SGEM, in both accuracy and inference speed. Ablation studies further confirm the effectiveness of combining audio- and language-based rewards, highlighting our method's enhanced stability and interpretability. Overall, our approach provides a practical and robust solution for deploying ASR systems in challenging real-world conditions.
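The adaptation loop the abstract describes can be summarized in a short sketch. This is a minimal illustration under stated assumptions, not the paper's implementation: the `asr_model.sample` and `reward_model.score` interfaces, the candidate count, the temperature, and the mean-reward baseline are all hypothetical stand-ins.

```python
import torch

def tta_reinforce_step(asr_model, reward_model, audio, prompt, optimizer,
                       num_candidates=8, temperature=1.2):
    """One test-time reinforcement step on a single unlabeled utterance.

    Sketch of the loop described in the abstract: sample diverse transcripts
    via temperature-controlled stochastic decoding, score them with an
    audio-text semantic-alignment reward, and update the learnable decoder
    prompt (and selected model parameters) with a REINFORCE-style gradient.
    """
    # 1) Temperature-controlled stochastic decoding (assumed interface):
    #    returns candidate transcripts and the summed log-probability of
    #    each sampled transcript, with gradients flowing to prompt/model.
    transcripts, log_probs = asr_model.sample(
        audio, prompt=prompt, n=num_candidates, temperature=temperature
    )  # log_probs: tensor of shape (num_candidates,), requires_grad=True

    # 2) Reward: semantic alignment between the audio and each candidate
    #    (assumed interface; e.g., a CLAP-style similarity score).
    with torch.no_grad():
        rewards = reward_model.score(audio, transcripts)   # (num_candidates,)
        advantages = rewards - rewards.mean()              # baseline reduces variance

    # 3) REINFORCE objective: raise the likelihood of high-reward transcripts.
    loss = -(advantages * log_probs).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Emit the highest-reward candidate as the adapted transcription.
    return transcripts[int(rewards.argmax())]
```

The mean-reward baseline is a common variance-reduction choice for single-sample policy gradients; the paper's exact objective and which parameters it updates besides the decoder prompt may differ.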
@article{fang2026boosting,
  title   = {Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards},
  author  = {Fang, Linghan and Xie, Tianxin and Liu, Li},
  journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume  = {40},
  number  = {36},
  pages   = {30673--30681},
  year    = {2026},
  month   = mar,
  doi     = {10.1609/aaai.v40i36.40323},
  url     = {https://ojs.aaai.org/index.php/AAAI/article/view/40323}
}
2025
arXiv
PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation
Tianxin Xie, Wentao Lei, Kai Jiang, and 8 more authors
@article{xie2025phyavbench,
  title   = {PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation},
  author  = {Xie, Tianxin and Lei, Wentao and Jiang, Kai and Huang, Guanjie and Zhang, Pengfei and Zhang, Chunhui and Ma, Fengji and He, Haoyu and Zhang, Han and He, Jiangshan and others},
  journal = {arXiv preprint arXiv:2512.23994},
  year    = {2025}
}
EMNLP 2025
Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey
Tianxin Xie, Yan Rong, Pengfei Zhang, and 2 more authors
In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025
Text-to-speech (TTS) has advanced from generating natural-sounding speech to enabling fine-grained control over attributes like emotion, timbre, and style. Driven by rising industrial demand and breakthroughs in deep learning, e.g., diffusion and large language models (LLMs), controllable TTS has become a rapidly growing research area. This survey provides **the first** comprehensive review of controllable TTS methods, from traditional control techniques to emerging approaches using natural language prompts. We categorize model architectures, control strategies, and feature representations, while also summarizing challenges, datasets, and evaluations in controllable TTS. This survey aims to guide researchers and practitioners by offering a clear taxonomy and highlighting future directions in this fast-evolving field. One can visit https://github.com/imxtx/awesome-controllabe-speech-synthesis for a comprehensive paper list and updates.
@inproceedings{xie-etal-2025-towards,
  title     = {Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey},
  author    = {Xie, Tianxin and Rong, Yan and Zhang, Pengfei and Wang, Wenwu and Liu, Li},
  editor    = {Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year      = {2025},
  address   = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/2025.emnlp-main.40/},
  doi       = {10.18653/v1/2025.emnlp-main.40},
  pages     = {764--791},
  isbn      = {979-8-89176-332-6}
}
arXiv
EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering
Tianxin Xie, Shan Yang, Chenxing Li, and 2 more authors
@article{xie2025emosteer,
  title   = {EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering},
  author  = {Xie, Tianxin and Yang, Shan and Li, Chenxing and Yu, Dong and Liu, Li},
  journal = {arXiv preprint arXiv:2508.03543},
  year    = {2025}
}
ICASSP 2025
Inter- and Intra-Sentence Cuer-Invariant Representation Learning for Generalizable Cued Speech Recognition
Tianxin Xie and Li Liu
In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025
@article{xie2025tpami,
  author   = {Xie, Tianxin and Han, Hu and Shan, Shiguang and Chen, Xilin},
  title    = {Natural Adversarial Mask for Face Identity Protection in Physical World},
  journal  = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year     = {2025},
  volume   = {47},
  number   = {3},
  pages    = {2089--2106},
  keywords = {Face recognition; Faces; Protection; Closed box; Privacy; Glass box; Perturbation methods; Three-dimensional printing; Target recognition; Face masks; Deep learning; adversarial example; face identity protection; physical world; natural face mask},
  doi      = {10.1109/TPAMI.2024.3522994}
}