publications | Tianxin Xie

2026

ICML 2026

AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

Pengfei ZHANG, Tianxin Xie, Minghao Yang, and Li Liu

In The Forty-Third International Conference on Machine Learning, 2026

@inproceedings{zhang2026agrepa,
  title = {AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching},
  author = {ZHANG, Pengfei and Xie, Tianxin and Yang, Minghao and Liu, Li},
  booktitle = {The Forty-Third International Conference on Machine Learning},
  year = {2026},
}

ICLR 2026

Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis

Pengfei ZHANG, Tianxin Xie, Minghao Yang, and Li Liu

In The Fourteenth International Conference on Learning Representations, 2026

@inproceedings{zhang2026respagent,
  title = {Resp-Agent: An Agent-Based System for Multimodal Respiratory Sound Generation and Disease Diagnosis},
  author = {ZHANG, Pengfei and Xie, Tianxin and Yang, Minghao and Liu, Li},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year = {2026},
  url = {https://openreview.net/forum?id=ZkoojtEm3W},
}

AAAI 2026

Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards

Linghan Fang, Tianxin Xie, and Li Liu

Proceedings of the AAAI Conference on Artificial Intelligence, Mar 2026

Recently, Automatic Speech Recognition (ASR) systems (e.g., Whisper) have achieved remarkable accuracy improvements but remain highly sensitive to real-world unseen data (data with large distribution shifts), including noisy environments and diverse accents. To address this issue, test-time adaptation (TTA) has shown great potential in improving the model adaptability at inference time without ground-truth labels, and existing TTA methods often rely on pseudo-labeling or entropy minimization. However, by treating model confidence as a learning signal, these methods may reinforce high-confidence errors, leading to confirmation bias that undermines adaptation. To overcome these limitations, we present ASR-TRA, a novel Test-time Reinforcement Adaptation framework inspired by causal intervention. More precisely, our method introduces a learnable decoder prompt and utilizes temperature-controlled stochastic decoding to generate diverse transcription candidates. These are scored by a reward model that measures audio-text semantic alignment, and the resulting feedback is used to update both model and prompt parameters via reinforcement learning. Comprehensive experiments on LibriSpeech with synthetic noise and L2 Arctic accented English datasets demonstrate that our method significantly outperforms existing state-of-the-art (SOTA), including SUTA and SGEM, in both accuracy and inference speed. Ablation studies further confirm the effectiveness of combining audio and language-based rewards, highlighting our method’s enhanced stability and interpretability. Overall, our approach provides a practical and robust solution for deploying ASR systems in challenging real-world conditions.

@article{fang2026boosting,
  title = {Boosting ASR Robustness via Test-Time Reinforcement Learning with Audio-Text Semantic Rewards},
  author = {Fang, Linghan and Xie, Tianxin and Liu, Li},
  journal = {Proceedings of the AAAI Conference on Artificial Intelligence},
  volume = {40},
  number = {36},
  pages = {30673--30681},
  year = {2026},
  month = mar,
  doi = {10.1609/aaai.v40i36.40323},
  url = {https://ojs.aaai.org/index.php/AAAI/article/view/40323},
}

2025

arXiv

PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, and Li Liu

arXiv preprint arXiv:2512.23994, 2025

@article{xie2025phyavbench,
  title = {PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation},
  author = {Xie, Tianxin and Lei, Wentao and Jiang, Kai and Huang, Guanjie and Zhang, Pengfei and Zhang, Chunhui and Ma, Fengji and He, Haoyu and Zhang, Han and He, Jiangshan and Wang, Jinting and Fang, Linghan and Gao, Lufei and Ablet, Orkesh and Zhang, Peihua and Hu, Ruolin and Li, Shengyu and Lin, Weilin and Feng, Xiaoyang and Yang, Xinyue and Rong, Yan and Wang, Yanyun and Shao, Zihang and Zhao, Zelin and Li, Chenxing and Yang, Shan and Wang, Wenfu and Yu, Meng and Yu, Dong and Liu, Li},
  journal = {arXiv preprint arXiv:2512.23994},
  year = {2025},
}

EMNLP 2025

Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey

Tianxin Xie, Yan Rong, Pengfei Zhang, Wenwu Wang, and Li Liu

In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025

Text-to-speech (TTS) has advanced from generating natural-sounding speech to enabling fine-grained control over attributes like emotion, timbre, and style. Driven by rising industrial demand and breakthroughs in deep learning, e.g., diffusion and large language models (LLMs), controllable TTS has become a rapidly growing research area. This survey provides **the first** comprehensive review of controllable TTS methods, from traditional control techniques to emerging approaches using natural language prompts. We categorize model architectures, control strategies, and feature representations, while also summarizing challenges, datasets, and evaluations in controllable TTS. This survey aims to guide researchers and practitioners by offering a clear taxonomy and highlighting future directions in this fast-evolving field. One can visit https://github.com/imxtx/awesome-controllabe-speech-synthesis for a comprehensive paper list and updates.

@inproceedings{xie-etal-2025-towards,
  title = {Towards Controllable Speech Synthesis in the Era of Large Language Models: A Systematic Survey},
  author = {Xie, Tianxin and Rong, Yan and Zhang, Pengfei and Wang, Wenwu and Liu, Li},
  editor = {Christodoulopoulos, Christos and Chakraborty, Tanmoy and Rose, Carolyn and Peng, Violet},
  booktitle = {Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing},
  year = {2025},
  address = {Suzhou, China},
  publisher = {Association for Computational Linguistics},
  url = {https://aclanthology.org/2025.emnlp-main.40/},
  doi = {10.18653/v1/2025.emnlp-main.40},
  pages = {764--791},
  isbn = {979-8-89176-332-6},
}

arXiv

EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering

Tianxin Xie, Shan Yang, Chenxing Li, Dong Yu, and Li Liu

arXiv preprint arXiv:2508.03543, 2025

@article{xie2025emosteer,
  title = {EmoSteer-TTS: Fine-Grained and Training-Free Emotion-Controllable Text-to-Speech via Activation Steering},
  author = {Xie, Tianxin and Yang, Shan and Li, Chenxing and Yu, Dong and Liu, Li},
  journal = {arXiv preprint arXiv:2508.03543},
  year = {2025},
}

ICASSP 2025

Inter- and Intra-Sentence Cuer-Invariant Representation Learning for Generalizable Cued Speech Recognition

Tianxin Xie and Li Liu

In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025

@inproceedings{xie2025icasspcued,
  title = {Inter- and Intra-Sentence Cuer-Invariant Representation Learning for Generalizable Cued Speech Recognition},
  author = {Xie, Tianxin and Liu, Li},
  booktitle = {ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year = {2025},
  pages = {1--5},
  keywords = {Hands; Representation learning; Degradation; Visualization; Lips; Speech recognition; Contrastive learning; Feature extraction; Synchronization; Speech processing; Cued Speech Recognition; Multi-modal Learning; Cuer Generalization; Contrastive Learning},
  doi = {10.1109/ICASSP49660.2025.10888246},
  url = {https://ieeexplore.ieee.org/document/10888246}
}

TPAMI

Natural Adversarial Mask for Face Identity Protection in Physical World

Tianxin Xie, Hu Han, Shiguang Shan, and Xilin Chen

IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

@article{xie2025tpami,
  author = {Xie, Tianxin and Han, Hu and Shan, Shiguang and Chen, Xilin},
  journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence},
  title = {Natural Adversarial Mask for Face Identity Protection in Physical World},
  year = {2025},
  volume = {47},
  number = {3},
  pages = {2089--2106},
  keywords = {Face recognition;Faces;Protection;Closed box;Privacy;Glass box;Perturbation methods;Three-dimensional printing;Target recognition;Face masks;Deep learning;adversarial example;face identity protection;physical world;natural face mask},
  doi = {10.1109/TPAMI.2024.3522994},
  url = {https://ieeexplore.ieee.org/document/10816466}
}