English | 中文

I’m Haoyu Li (李浩宇), an incoming Ph.D. student at Nanjing University (starting September 2026), where I will be supervised by Prof. Shuai Wang, a Tenure-Track Associate Professor. I received my M.S. in Computer Science and Technology from Shanghai Jiao Tong University under the supervision of Prof. Kai Yu.

My research focuses on Target Speaker Extraction (TSE), Automatic Speech Recognition (ASR), and Speech Large Language Models (Speech LLMs). I aim to develop robust speech interaction systems capable of operating in noisy, multi-talker real-world environments.


Research Interests

My work centers on TSE front-ends and multi-talker ASR, spanning front-end signal processing and speech understanding:

  • Speech Separation, including TSE and Blind Source Separation (BSS)
  • Speaker-Attributed ASR (SA-ASR)
  • Keyword Spotting (KWS) for resource-constrained edge devices

Research Experience

My recent work spans both academic labs and industry research:

  • Text-Guided Speech Separation and Robust Keyword Spotting (AISpeech, Suzhou)
    I developed a text-guided speech separation system that reduced the false rejection rate to 4.3% and cut the false wake-up rate to 20% of the baseline in real-world multi-talker scenarios. Concurrently, I engineered an end-to-end keyword spotting algorithm for low-SNR environments, integrating robust streaming decoding with WFST optimization; this work yielded two papers accepted at ICASSP 2025.

  • Speaker-Adaptive Alignment for Flow-Matching TTS (Alibaba, Beijing)
    I proposed a dual temporal and hierarchical adaptive scheme that dynamically modulates supervision strength during denoising and assigns layer-specific alignment objectives, significantly enhancing timbre consistency in zero-shot voice cloning. This work has been submitted to Interspeech 2026.

  • A Novel Paradigm for Keyword-Guided Target Speaker Extraction (Nanjing University / Collaborative Research)
    I proposed a three-stage Detect-Attend-Extract framework that, using only partial text cues, achieves extraction performance superior to conventional speech-enrollment baselines. This work has been submitted to IJCAI 2026.


Publications (Selected)

The full list is available on Google Scholar.

* indicates equal contribution.

  • Text-aware Speech Separation for Multi-talker Keyword Spotting
    Haoyu Li, Baochen Yang, Yu Xi, Linfeng Yu, Tian Tan, Hao Li, Kai Yu
    Interspeech 2024.
    paper link

  • Detect, Attend and Extract: Keyword Guided Target Speaker Extraction
    Haoyu Li*, Yu Xi*, Yidi Jiang, Shuai Wang, Kate Knill, Mark Gales, Haizhou Li, Kai Yu
    arXiv:2602.07977. Submitted to IJCAI-ECAI 2026.
    paper link

  • Time-Layer Adaptive Alignment for Speaker Similarity in Flow-Matching Based Zero-Shot TTS
    Haoyu Li*, Mingyang Han*, Yu Xi, Dongxiao Wang, Hankun Wang, Haoxiang Shi, Boyu Li, Jun Song, Bo Zheng, Shuai Wang, Kai Yu
    arXiv:2511.09995. Submitted to Interspeech 2026.
    paper link

  • Streaming Keyword Spotting Boosted by Cross-layer Discrimination Consistency
    Yu Xi*, Haoyu Li*, Xiaoyu Gu, Hao Li, Yidi Jiang, Kai Yu
    ICASSP 2025.
    paper link

  • NTC-KWS: Noise-aware CTC for Robust Keyword Spotting
    Yu Xi, Haoyu Li, Hao Li, Jiaqi Guo, Xu Li, Wen Ding, Kai Yu
    ICASSP 2025.
    paper link

  • MFA-KWS: Effective Keyword Spotting with Multi-head Frame-asynchronous Decoding
    Yu Xi, Haoyu Li, Xiaoyu Gu, Yidi Jiang, Kai Yu
    TASLP.
    paper link

  • G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition
    Jing Peng*, Ziyi Chen*, Haoyu Li*, Yucheng Wang, Duo Ma, Mengtian Li, Yunfan Du, Dezhu Xu, Kai Yu, Shuai Wang
    arXiv:2603.10468. Submitted to Interspeech 2026.
    paper link


Contact Information

I am happy to chat and collaborate on the topics above. You can reach me via: