Basic Information
- Title:
CosyVoice2: Scalable Streaming Speech Synthesis with Large Language Models
- Authors:
- 01 Zhihao Du
- 02 Yuxuan Wang
- 03 Qian Chen
- 04 Xian Shi
- 05 Xiang Lv
- 06 Tianyu Zhao
- 07 Zhifu Gao
- 08 Yexin Yang
- 09 Changfeng Gao
- 10 Hui Wang
- 11 Fan Yu
- 12 Huadai Liu
- 13 Zhengyan Sheng
- 14 Yue Gu
- 15 Chong Deng
- 16 Wen Wang
- 17 Shiliang Zhang
- 18 Zhijie Yan
- 19 Jingren Zhou
- Links:
- Files:
- ArXiv
- [Publication] #TODO
In our previous work, we introduced CosyVoice, a multilingual speech synthesis model based on supervised discrete speech tokens. By employing progressive semantic decoding with two popular generative models, language models (LMs) and Flow Matching, CosyVoice demonstrated high prosody naturalness, content consistency, and speaker similarity in speech in-context learning. Recently, significant progress has been made in multi-modal large language models (LLMs), where the response latency and real-time factor of speech synthesis play a crucial role in the interactive experience. Therefore, in this report, we present an improved streaming speech synthesis model, CosyVoice2, which incorporates comprehensive and systematic optimizations. Specifically, we introduce finite-scalar quantization to improve the codebook utilization of speech tokens. For the text-speech LM, we streamline the model architecture to allow direct use of a pre-trained LLM as the backbone. In addition, we develop a chunk-aware causal flow matching model to support various synthesis scenarios, enabling both streaming and non-streaming synthesis within a single model. By training on a large-scale multilingual dataset, CosyVoice2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in the streaming mode. We invite readers to listen to the demos at this https URL.
In recent years, neural text-to-speech (TTS) synthesis models have garnered significant attention for surpassing traditional concatenative and statistical parametric methods ([Tacotron [1]]; [Tacotron2 [2]]; [DeepVoice3 [3]]; Clarinet [4]; [FastSpeech [5]]; [Transformer-TTS [6]]; [FastSpeech2 [7]]). These models have achieved high fidelity and naturalness on pre-defined specific speakers. Recent studies show that zero-shot TTS models are able to synthesize speech for any speaker by imitating the timbre, prosody and style of a reference speech (VALL-E [8]). Beyond their in-context learning (ICL) capability, zero-shot TTS models benefit from large-scale training data, achieving synthesis quality and naturalness nearly indistinguishable from human speech.
Recent zero-shot TTS models can be broadly divided into three categories: codec language models, feature diffusion models and their hybrid systems. Codec language models utilize a speech codec model to extract discrete speech representation (SoundStream [9]; EnCodec [10]; FunCodec [11]) and employ an autoregressive (VALL-E [8]; SPEAR-TTS [12]; ELLA-V [13]; VALL-T [14]; RALL-E [15]; VALL-E 2 [16]; VALL-E R [17]) or masked (MaskGCT [18]) language model to predict the speech tokens, which are then synthesized to waveforms via codec vocoders (WaveNeXt [19]; Vocos [20]). Continuous speech representations are also explored in MELLE [21]. Language model-based TTS can generate varied and prosody-consistent speech via autoregressive sampling.
Inspired by advances in image generation, denoising diffusion (DDPM [22]; Song et al. [23]) and flow matching models [24] have been introduced into non-autoregressive (NAR) speech synthesis. Early diffusion-based TTS models required duration prediction for each text (phone) to address the length disparity between text and speech features (Voicebox [25]; NaturalSpeech3 [26]; VoiceFlow [27]; Matcha-TTS [28]). However, this rigid alignment can affect naturalness, resulting in flat prosody. To mitigate this issue, cross-attention and Diffusion Transformers (DiT) have been introduced into NAR TTS models (E3 TTS [29]; DiTTo-TTS [30]). Recent research explores simpler approaches for text-speech alignment in NAR TTS models, such as E2 TTS [31], F5-TTS [32] and Seed-TTS [33]. In these models, the input text is padded with special tokens to match the total speech length, which is either automatically predicted by an utterance duration prediction module or specified by the user in advance. Since NAR TTS models are not constrained by codec vocoders, they can achieve superior speech quality.
Hybrid systems combine the text-to-codec language model and codec-to-feature diffusion model (Seed-TTS [33]; CosyVoice [34]; FireRedTTS [35]). The language model addresses the alignment between text and speech as well as the utterance duration prediction, while the codec-to-feature diffusion model synthesizes speech features (Mel spectrum) based on the generated codec and other conditions. By leveraging the strengths of both generative models, hybrid systems achieve high diversity, prosody consistency and speech quality.
Despite the success of recent zero-shot TTS models, they generally operate in non-streaming (offline) mode, which involves complete input text and requires synthesizing the entire utterance before returning the waveform. This results in high latency, negatively impacting user experience in applications like voice chat (GPT-4o [36]; SpeechGPT [37]). To address this issue, streaming synthesis has been explored for language model-based zero-shot TTS models (LiveSpeech [38]; LiveSpeech2 [39]; BASE TTS [40]; LLM2Speech [41]), but diffusion-based TTS models and hybrid systems lack well-established streaming solutions.
Building on the success of CosyVoice [34], we introduce CosyVoice2, a streaming zero-shot TTS model with improved prosody naturalness, content consistency, and speaker similarity. Our contributions include:
- Unifying streaming and non-streaming synthesis in a single framework and proposing the unified text-speech language model and chunk-aware causal flow matching model, leading to lossless streaming synthesis compared to offline mode.
- Simplifying the LM architecture by removing the text encoder and speaker embedding, allowing pre-trained textual large language models (LLMs) to serve as the backbone, enhancing context understanding.
- Replacing vector quantization (VQ) in the speech tokenizer with finite scalar quantization (FSQ), improving codebook utilization and capturing more speech information.
- Upgrading the instructed TTS capacity to support more instructions, including emotion, accent, role style, and fine-grained control. In CosyVoice2, the instruction and zero-shot capacity are integrated into a single model, enabling more versatile and vivid synthesis.
Through the above systematic modifications and optimizations, CosyVoice2 achieves human-parity synthesis quality and is nearly lossless in streaming mode. The unified framework loosens deployment requirements, enabling a single model to support both streaming and non-streaming synthesis. The upgraded instructed TTS capacity provides a more powerful and easier way for users to generate diverse speech. In addition, the chunk-aware flow matching design can also be applied to NAR TTS models, which suggests the potential for streaming NAR models.
CosyVoice2 builds on the same design philosophy as its predecessor (CosyVoice [34]), separating the semantic and acoustic information of speech signals and modeling them independently. The speech generation process is redefined as a gradual semantic decoding procedure, where conditional information is progressively incorporated. Specifically, the text-speech language model (LM) focuses solely on semantic information, decoding high-level text tokens into supervised semantic speech tokens. In the flow matching model, acoustic details, such as timbre, are introduced through speaker embeddings and reference speech, converting speech tokens into the Mel spectrum of a given speaker. Finally, a pre-trained vocoder model reinstates the phases, transforming the Mel spectrum back into the original audio signal. The following sections introduce the details of CosyVoice2 and the modifications for streaming synthesis from four aspects: the text tokenizer, the supervised semantic speech tokenizer, the unified text-speech LM for streaming/non-streaming synthesis, and the chunk-aware flow matching model. Figure \ref{fig:overall} provides an overview of CosyVoice2.
CosyVoice2 uses the raw text as input directly, which is tokenized using a BPE-based text tokenizer. This eliminates the need for a frontend model that obtains phonemes via the grapheme-to-phoneme (g2p) transformation. This approach not only simplifies the data preprocessing workflow but also enables the model to learn the pronunciations of words within various contexts in an end-to-end manner. Unlike the tokenizers commonly used in textual LLMs, CosyVoice2 masks out the one-to-many tokens. This prevents the pronunciation of a token from becoming excessively long and reduces corner cases caused by data sparsity. Specifically, if a BPE token encodes more than one Chinese character, it will be masked out, and each character will be encoded separately during the tokenization process. Other languages, such as English, Japanese, and Korean, are not subject to special handling.
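To make the masking rule concrete, here is a minimal Python sketch of character-level fallback for multi-character Chinese BPE tokens; the `bpe_encode` callable, the CJK regular expression and the overall structure are illustrative assumptions rather than the actual CosyVoice2 tokenizer code.

```python
import re

CJK = re.compile(r"[\u4e00-\u9fff]")  # basic CJK unified ideographs

def tokenize_with_char_fallback(text: str, bpe_encode) -> list:
    """Tokenize `text` with a BPE encoder, but mask out any BPE token that covers more
    than one Chinese character and re-encode those characters one by one.
    `bpe_encode` is a hypothetical callable: str -> list of (token_id, token_str)."""
    out = []
    for tok_id, tok_str in bpe_encode(text):
        if len(CJK.findall(tok_str)) > 1:          # one-to-many Chinese token: split it
            for ch in tok_str:
                out.extend(tid for tid, _ in bpe_encode(ch))
        else:
            out.append(tok_id)
    return out
```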
As shown in Figure \ref{fig:overall} (a), we insert the finite scalar quantization (FSQ) module [42] into the encoder of the SenseVoice-Large ASR model (FunAudioLLM [43]). At the training stage, the input speech $X$ goes through $\text{Encoder}_1$ to obtain intermediate representations $H$, where $\text{Encoder}_1$ consists of six Transformer blocks with rotary positional embeddings.
In the FSQ module, the intermediate representations $H$ are first projected into a $D$-dimensional low-rank space, and the values of each dimension are quantized into $[-K, K]$ with a bounded round operation ROUND. The quantized low-rank representations $\bar{H}$ are then projected back into the original dimension $\hat{H}$ for the following modules:
$$ \begin{aligned} \bar{H} &= \text{ROUND}(\text{Proj}_{down}(H)) \\ \hat{H} &= \text{Proj}_{up}(\bar{H}) \end{aligned} $$
At the training stage, straight-through estimation is used to approximate the gradients of the FSQ module and $\text{Encoder}_1$. The speech token $\mu_i$ is obtained by computing the index of the quantized low-rank representation $\bar{h}_i$ in a $(2K+1)$-ary system:
$$ \mu_i = \sum_{j=0}^{D-1}{\bar{h}_{i,j}(2K+1)^{j}} $$
$\text{Encoder}_1$, the low-rank projector of the FSQ module, the bounded round operation and the index calculation together form the speech tokenizer of CosyVoice2. Our speech tokenizer works at a token rate of 25 Hz, i.e., 25 speech tokens per second.
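The following PyTorch sketch illustrates the bounded rounding, the straight-through estimator and the $(2K+1)$-ary index calculation described above; the tanh bounding and the shift of quantized values into $[0, 2K]$ before indexing are illustrative assumptions, not the released implementation.

```python
import torch

def fsq_quantize(h_low: torch.Tensor, K: int) -> torch.Tensor:
    """Bounded rounding of the low-rank representation (..., D) into (2K+1) levels per
    dimension, with a straight-through estimator so gradients flow to the encoder."""
    h_bounded = K * torch.tanh(h_low)                    # one common way to bound values to [-K, K]
    h_rounded = torch.round(h_bounded)
    return h_bounded + (h_rounded - h_bounded).detach()  # forward: rounded, backward: smooth

def fsq_index(h_bar: torch.Tensor, K: int) -> torch.Tensor:
    """Map quantized values in [-K, K] to a single token index, mirroring
    mu_i = sum_j h_bar_{i,j} * (2K+1)^j after shifting each digit into [0, 2K]."""
    D = h_bar.shape[-1]
    digits = torch.round(h_bar + K).long()               # digits in [0, 2K]
    base = (2 * K + 1) ** torch.arange(D, device=h_bar.device)
    return (digits * base).sum(dim=-1)
```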
In CosyVoice2, the pre-trained textual LLM, Qwen2.5-0.5B [45], is used as the text-speech language model to generate the speech tokens autoregressively with the input text as a prompt. Similar to other LMs, the text-speech LM is also trained with a next-token-prediction scheme, as shown in Figure \ref{fig:overall} (b). Different from the previous CosyVoice, we remove the speaker embedding to avoid information leakage. More importantly, we find that such an utterance-level vector contains not only speaker identity but also language and paralinguistic information, which harms the prosody naturalness and cross-lingual capability of the text-speech LM. Besides, we also abandon the text encoder of the previous CosyVoice, since we find that the Qwen2.5-0.5B model is powerful enough to align the text and speech tokens, so the text encoder is no longer needed.
Benefiting from the simplicity of text-speech LM, we can build a unified model for both streaming and non-streaming synthesis.
Here, streaming mode means that the input text is received in a continuous flow rather than being known as a complete sentence in advance. In CosyVoice2, the difference between the streaming and non-streaming modes lies only in how the sequence is constructed for the LM:
- For the Non-Streaming mode, the start-of-sequence token \circled{S}, all text tokens, the turn-of-speech token \circled{T}, all speech tokens and the end-of-sequence token \circled{E} are concatenated sequentially, as shown at the bottom of Figure \ref{fig:ULM}. Ignore tokens are those whose losses are ignored while minimizing the cross-entropy objective function.
- For the Streaming mode, we mix up the text and speech tokens in a pre-defined ratio of $N$:$M$, i.e., every $N$ text tokens are followed by $M$ speech tokens, as shown at the top of Figure \ref{fig:ULM}. If the next token is a text token, the model is expected to predict a filling token (rather than the text token), which indicates that the next $N$ text tokens should be concatenated at the inference stage. Once the text tokens run out, the turn-of-speech token \circled{T} and the remaining speech tokens are concatenated sequentially, forming the hybrid text-speech token sequence of the streaming mode (a minimal sketch of this interleaving is given after this list). In our experiments, $N$ and $M$ are set to 5 and 15, respectively.
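A minimal sketch of the streaming sequence construction, assuming plain token lists and placeholder special tokens; it only builds the interleaved input sequence and ignores loss masking and filling-token targets.

```python
def build_streaming_sequence(text_tokens, speech_tokens, N=5, M=15,
                             sos="<S>", turn="<T>", eos="<E>"):
    """Build the hybrid streaming training sequence: every N text tokens are followed by
    M speech tokens; once the text runs out, the turn-of-speech token and the remaining
    speech tokens are appended."""
    seq, t, s = [sos], 0, 0
    while t < len(text_tokens):
        seq.extend(text_tokens[t:t + N])
        t += N
        if t < len(text_tokens):          # more text remains, so insert a speech chunk
            seq.extend(speech_tokens[s:s + M])
            s += M
    seq.append(turn)
    seq.extend(speech_tokens[s:])         # remaining speech tokens
    seq.append(eos)
    return seq
```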
By training the text-speech LM on the above two sequences simultaneously, we can perform streaming and non-streaming speech generation within a single unified model. In real-life scenarios, such as speaker fine-tuning (SFT) and in-context learning (ICL), the inference sequence differs as follows:
- ICL, Non-Streaming: In ICL, the LM requires prompt text and speech tokens from the reference audio to imitate the accent, prosody, emotion and style. In the non-streaming mode, the prompt and to-synthesize text tokens are concatenated as a whole entity, and the prompt speech tokens are treated as pre-generated results and are fixed: \circled{S}, prompt\_text, text, \circled{T}, prompt\_speech. The autoregressive generation of the LM starts from this sequence and continues until the end-of-sequence token \circled{E} is detected.
- ICL, Streaming: In this scenario, we assume the to-generate text is already known and the speech tokens should be generated in a streaming manner. Similarly, we treat the prompt and to-generate text as a whole entity and mix it with the prompt speech tokens at the ratio of $N$:$M$: \circled{S}, mixed\_text\_speech, \circled{T}, remaining\_speech. If the length of the text is greater than that of the prompt speech tokens, the LM will generate a filling token; in this situation, we manually pad the next $N$ text tokens. When the text tokens run out, the turn-of-speech token \circled{T} is added. In the streaming mode, we return generation results every $M$ tokens until \circled{E} is detected.
- SFT, Non-Streaming: In the SFT scenario, the LM is fine-tuned on a specific speaker, and the prompt text and speech are no longer needed. Thus, the initial sequence is very simple: \circled{S}, text, \circled{T}. Starting from it, the text-speech LM generates speech tokens autoregressively until \circled{E} is detected.
- SFT, Streaming: In the streaming mode of SFT, we start the speech generation from the following sequence: \circled{S}, first\_N\_text. Then, the LM generates $M$ speech tokens, and we manually pad the next $N$ text tokens. We repeat this process until all text tokens run out, and then \circled{T} is added. Note that this mode can also be adopted by speech-to-speech multi-modal large language models to achieve extremely low latency.
In CosyVoice2, we employ the Mel spectrogram as the acoustic feature, with a frame rate of 50 Hz and a sampling rate of 24,000 Hz. Due to the frame-rate mismatch between the speech tokens and the Mel features, we up-sample the speech tokens by a factor of two to match the frame rate of the Mel spectrogram.
Before the up-sampling operation, we add an additional look-ahead convolution layer to provide the future information for the following causal modules.
The look-ahead layer is implemented by a right-padded 1-D convolution with a pad size of $P$ and a kernel size of $P+1$.
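As an illustration of the look-ahead convolution and the two-times up-sampling, here is a PyTorch sketch; the pad size, the nearest-neighbour up-sampling and the module name are assumptions, not the released architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LookaheadUpsampler(nn.Module):
    """Sketch: a right-padded 1-D convolution that peeks P future frames, followed by
    2x nearest up-sampling so 25 Hz speech-token features match the 50 Hz Mel frame rate."""
    def __init__(self, dim: int, pad: int = 3):   # pad size P is a hypothetical value
        super().__init__()
        self.pad = pad
        self.conv = nn.Conv1d(dim, dim, kernel_size=pad + 1)  # sees current frame + P future frames

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: (B, T, dim)
        x = x.transpose(1, 2)                                  # (B, dim, T)
        x = F.pad(x, (0, self.pad))                            # right padding only (look-ahead)
        x = self.conv(x)                                       # output length stays T
        x = x.repeat_interleave(2, dim=-1)                     # 2x up-sampling to 50 Hz
        return x.transpose(1, 2)                               # (B, 2T, dim)
```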
Subsequently, our goal is to further decode the speech tokens into the Mel spectrogram specified by the speaker embedding and reference speech.
To achieve this, we employ a conditional flow matching (CFM) model to sample the Mel spectrogram, given speech tokens, reference speech and speaker embedding as conditions.
In the CFM model, the distribution of the target Mel spectrogram is described by a probability density path from a prior distribution $p_0(X)$ to the data distribution $q(X)$. The probability density path is defined by a time-dependent vector field, and we adopt the optimal-transport (OT) flow as the learning target:
$$ \begin{aligned} \omega_t(\phi^{OT}_t(X_0,X_1)|X_1)&=X_1-X_0 \\ \phi^{OT}_t(X_0,X_1)&=(1-t)X_0+tX_1 \\ X_0&\sim p_0(X)=\mathcal{N}(0,I) \\ X_1&\sim q(X) \end{aligned} $$
A causal convolutional Transformer UNet is employed to learn the above ODE, with the up-sampled tokens $\mu_{1:L}$, the masked Mel spectrogram $\tilde{X}_1$ and the speaker embedding $\mathbf{v}$ as conditions:
$$ \nu_t(\phi^{OT}_t(X_0,X_1)|\theta) = \text{UNet}_\theta\left(\phi^{OT}_t(X_0,X_1),t;\mathbf{v},\mu_{1:L},\tilde{X}_1\right) $$
At the training stage, the masked Mel spectrogram is obtained by randomly masking out 70% to 100% of the final frames in $X_1$; at the inference stage, it comes from the Mel features of the reference speech. The UNet parameters $\theta$ are optimized by minimizing the L1 distance between the predicted and ground-truth vector fields:
$$ \theta = \arg\min_{\theta}\ \mathbb{E}_{p_0(X),q(X),t}\left\|\omega_t(\phi^{OT}_t(X_0,X_1))-\nu_t(\phi^{OT}_t(X_0,X_1)|\theta;\mu_{1:L},\tilde{X}_1,\mathbf{v})\right\|_1 $$
At the training stage, the timestep $t$ follows a uniform distribution $U[0,1]$. At the inference stage, however, we warp the timestep with a cosine scheduler, which allocates more generation steps to the early phase:
$$ t:=1-\cos\left(\frac{1}{2}t\pi\right) $$
Besides, we also train the model on both conditional and non-conditional situations to enable the classifier-free guidance (CFG) (CFG [46]; iDDPM [47]; Voicebox [48]) at the inference stage:
$$ \begin{aligned} &\tilde{\nu}_t(\phi^{OT}_t(X_0,X_1)|\theta;\Psi)=(1+\beta)\cdot\nu_t(\phi^{OT}_t(X_0,X_1)|\theta;\Psi)-\beta \cdot \nu_t(\phi^{OT}_t(X_0,X_1)|\theta) \end{aligned} $$
where $\Psi$ denotes the set of conditions $\{\mu_{1:L}, \tilde{X}_1, \mathbf{v}\}$ and $\beta$ is the guidance strength.
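The training procedure above can be summarized by the following sketch of one OT flow matching step with optional condition dropping for CFG; the `unet` callable, its call signature and the dropout probability are placeholders, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def cfm_training_step(unet, x1, cond, cfg_drop_prob=0.2):
    """One OT conditional flow matching step (sketch): sample x0 ~ N(0, I) and t ~ U[0, 1],
    form phi_t = (1 - t) * x0 + t * x1, and regress the vector field x1 - x0 with an L1 loss.
    `cond` stands for the conditions (up-sampled tokens, masked Mel, speaker embedding)."""
    b = x1.shape[0]
    x0 = torch.randn_like(x1)
    t = torch.rand(b, device=x1.device).view(b, *([1] * (x1.dim() - 1)))
    phi_t = (1 - t) * x0 + t * x1
    target = x1 - x0
    if torch.rand(()) < cfg_drop_prob:      # drop conditions to enable classifier-free guidance
        cond = None
    pred = unet(phi_t, t.view(b), cond)     # assumed signature: (x_t, t, conditions)
    return F.l1_loss(pred, target)
```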
Current flow matching models typically work in an offline mode, i.e., the Mel spectrogram can only be sampled after all speech tokens have been generated, which is not friendly to streaming synthesis. To overcome this issue, we treat the multi-step flow estimation as a stacked, deeper neural network that repeats the UNet ten times. By making this unfolded neural network causal, we can apply it to streaming synthesis. We construct four masks to cover different application scenarios:
- Non-causal Mask is used for the offline mode and achieves the best performance by attending to all frames of the conditions. It is suitable for latency-insensitive scenarios.
- Full-causal Mask is designed for scenarios requiring extremely low latency, in which only past frames can be attended to.
- Chunk-$M$ Mask is a trade-off between latency and performance, which can leverage the information of all past frames and $M$ future frames. This mask is more suitable for the first chunk of generation, which requires low latency.
- Chunk-$2M$ Mask approaches the performance of the offline mode at the cost of more latency, and can be used for subsequent chunks in cascaded generation for better performance.
For each training case in a mini-batch, we randomly sample one of the above four masks under a uniform distribution. In this manner, a single flow matching model is compatible with different scenarios, lowering deployment complexity. Another advantage of this chunk-aware training is that the masks with more context serve as teachers for those with less context, benefiting from an implicit self-distillation scheme.
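Below is one possible construction of these masks as boolean attention matrices, assuming the chunk-$M$ mask lets each frame attend to all past frames plus the frames up to the end of its own $M$-frame chunk; this is an illustrative reading of the masks, not the exact released implementation.

```python
from typing import Optional

import torch

def chunk_aware_mask(T: int, chunk: Optional[int]) -> torch.Tensor:
    """Boolean (T, T) attention mask; True means the key frame can be attended.
    chunk=None -> non-causal (offline) mask
    chunk=0    -> fully causal mask
    chunk=M    -> chunk-M mask; chunk=2*M -> chunk-2M mask."""
    if chunk is None:
        return torch.ones(T, T, dtype=torch.bool)
    query = torch.arange(T)
    if chunk == 0:
        return query[None, :] <= query[:, None]        # keys up to the current frame only
    chunk_end = (query // chunk + 1) * chunk - 1       # last frame of each query's chunk
    return query[None, :] <= chunk_end[:, None].clamp(max=T - 1)

# During training, one of the four masks would be sampled per case, e.g.:
# masks = [chunk_aware_mask(T, c) for c in (None, 0, M, 2 * M)]
```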
The first-package latency is an important metric for streaming synthesis models, which significantly affects the user experience especially in LLM-based voice chat applications, such as GPT-4o [36].
In the context of TTS, the to-synthesize text is known in advance, and the latency comes from the aspects of speech token generation, Mel spectrogram reconstruction and waveform synthesis.
Thus, the first-package latency $L_{TTS}$ of CosyVoice2 can be estimated as follows:
$$ L_{TTS} = M\cdot d_{lm} + M \cdot d_{fm} + M \cdot d_{voc} $$
where $d_{lm}$ denotes the computation time for the LM to generate one speech token, $d_{fm}$ the computation time for the flow matching model to generate the Mel frames of one speech token, and $d_{voc}$ the computation time for the vocoder to synthesize the waveform of one speech token. In the context of an LLM-based voice chat, the to-synthesize text is itself generated token by token by a textual LLM, so the first-package latency must also account for the time needed to produce the first $N$ text tokens:
$$ L_{Chat} \leq N\cdot d_{llm} + L_{TTS} $$
where $d_{llm}$ denotes the computation time for the textual LLM to generate one text token.
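To make the latency formulas concrete, here is a small worked example with purely hypothetical per-token timings (the report does not publish these numbers):

```python
# Hypothetical per-token timings in seconds; illustrative only.
d_lm, d_fm, d_voc = 0.004, 0.002, 0.001   # LM, flow matching, vocoder per speech token
d_llm = 0.020                              # textual LLM per text token (voice-chat case)
M, N = 15, 5                               # first speech chunk size and first text chunk size

L_tts = M * (d_lm + d_fm + d_voc)          # first-package latency of TTS alone
L_chat = N * d_llm + L_tts                 # upper bound when the text itself streams from an LLM
print(f"L_TTS ~ {L_tts * 1000:.0f} ms, L_Chat <= {L_chat * 1000:.0f} ms")
```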
To enhance the controllability of CosyVoice2, we integrated the instructed dataset into the base training set.
We have collected 1,500 hours of instructed training data, which includes both natural language instructions and fine-grained instructions, as outlined in Table \ref{tab:example_instruct}. For natural language instructions, we prepend a natural language description and a special end token, <|endofprompt|>, before the to-synthesize input text. These descriptions cover aspects such as emotion, speaking rate, role-playing, and dialects. For fine-grained instructions, we insert vocal bursts between text tokens, using markers like [laughter] and [breath]. Additionally, we apply vocal feature tags to phrases; for instance, <strong>XXX</strong> indicates emphasis on certain words, while <laughter>XXX</laughter> signifies speaking with laughter.
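As a concrete illustration of how such inputs can be assembled, here is a small sketch; the helper name and the example sentences are made up, while the <|endofprompt|> token and the fine-grained markers follow the description above.

```python
def build_instructed_input(text: str, natural_instruction: str = "") -> str:
    """Prepend a natural-language instruction and the <|endofprompt|> end token to the
    to-synthesize text; fine-grained markers stay embedded inside the text itself."""
    if natural_instruction:
        return f"{natural_instruction}<|endofprompt|>{text}"
    return text

# Natural-language instruction (emotion / dialect / role-play description); example text is made up.
print(build_instructed_input("今天的演出太精彩了!", "请用开心的情绪说这句话"))
# Fine-grained control via inline markers such as [laughter], [breath], <strong>, <laughter>.
print(build_instructed_input("他刚说到一半[laughter]就忍不住笑了, 然后<strong>非常认真</strong>地继续讲。"))
```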
Fine-tuning the pre-trained model on specific speakers (SFT) can further improve the generation quality and speaker similarity.
In this report, we introduce multi-speaker fine-tuning (mSFT), in which the pretrained model is fine-tuned on multiple speakers simultaneously rather than on a single speaker.
This approach ensures comprehensive prosody and pronunciation coverage across multiple speakers and mitigates potential catastrophic forgetting from the pretrained models.
To avoid timbre confusion between different speakers, we prepend a speaker-prompt tag, Speaker A<|endofprompt|>, to the input text of a specific speaker. If a training sample is not labeled with a speaker, the special tag unknown<|endofprompt|> is utilized.
The learning rate is set to 1e-5 during the whole multi-speaker fine-tuning process.
Reinforcement learning is a commonly used method in the training of large language models, which can make the LM output align with human preference.
In CosyVoice2, we employ speaker similarity (SS) and recognition word error rate (WER) from the ASR system as the reward function to improve speaker similarity and pronunciation accuracy in the fine-tuning stage.
We use WER and SS to distinguish the preferred sample $\mu^w$ from the rejected sample $\mu^l$, and optimize the text-speech LM with direct preference optimization (DPO):
$$ L_{DPO}(\pi_\theta; \pi_{\text{ref}}) = -\log \sigma(\beta \log \frac{\pi_\theta(\mu^w | y)}{\pi_{\text{ref}}(\mu^w | y)} - \beta \log \frac{\pi_\theta(\mu^l | y)}{\pi_{\text{ref}}(\mu^l | y)}) $$
where $\mu^w$ and $\mu^l$ denote the speech-token sequences of the preferred and rejected samples for the same input $y$, $\pi_\theta$ and $\pi_{\text{ref}}$ denote the current and reference (initial) LMs, and $\beta$ is a hyper-parameter controlling the deviation from the reference model.
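The loss can be written compactly as below, assuming the per-sample sequence log-probabilities have already been computed; the beta value is a placeholder, not the one used in the report.

```python
import torch.nn.functional as F

def dpo_loss(logp_w_policy, logp_w_ref, logp_l_policy, logp_l_ref, beta=0.1):
    """DPO objective over batches of sequence log-probabilities: logp_*_policy come from
    the LM being trained, logp_*_ref from the frozen reference LM; w = preferred sample,
    l = rejected sample."""
    margin = beta * (logp_w_policy - logp_w_ref) - beta * (logp_l_policy - logp_l_ref)
    return -F.logsigmoid(margin).mean()
```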
However, this method is time- and computation-consuming, as it has to synthesize audio through the TTS system repeatedly to obtain distinguishable preferred and rejected samples. During training, four forward passes are needed for one training step.
To simplify the process, we instead employ a differentiable ASR reward: the LM-predicted token $\mu_i$ is recovered back into the quantized low-rank representation $\bar{h}_i$, so that the ASR posterior can be used directly as a training signal:
$$ \bar{h}_{i,j} = \left\lfloor \frac{\mu_i}{(2K+1)^j} \right\rfloor \mod (2K+1) $$
$$ \begin{aligned} \hat{H} &= \text{Proj}_{up}(\bar{H}) \\ L_{ASR} &= -\log P(Y | \hat{H}; \theta_{ASR}) \end{aligned} $$
where $Y$ is the target text and $\theta_{ASR}$ denotes the parameters of the ASR modules. By minimizing $L_{ASR}$, the text-speech LM is optimized directly towards higher recognition accuracy.
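The index-to-representation recovery in the equations above amounts to a base-$(2K+1)$ digit decomposition; a sketch is given below, assuming the indices were computed from quantized values shifted into $[0, 2K]$.

```python
import torch

def token_to_fsq(mu: torch.Tensor, D: int, K: int) -> torch.Tensor:
    """Recover the quantized low-rank representation h_bar (..., D) from LM-predicted
    token indices mu (...), inverting mu = sum_j h_bar_j * (2K+1)^j by repeated
    floor-division and modulo.  The recovered h_bar can then be projected up and fed
    to the ASR modules to compute L_ASR."""
    base = 2 * K + 1
    digits = [(mu // base**j) % base for j in range(D)]   # j-th digit in the (2K+1)-ary system
    h_bar = torch.stack(digits, dim=-1).float()
    return h_bar - K                                      # shift back to [-K, K]
```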
A 200,000-hour dataset is used to train the speech tokenizer, with normalized transcriptions as labels. Detailed data information is listed in Table~\ref{tab:fsqdata}. The training data comes from three different resources: open-source ASR datasets, internal industrial datasets and TTS generation datasets. Although we only used Chinese and English data when training the speech tokenizer, as shown in Table~\ref{tab:fsqdata}, subsequent experiments revealed that the speech tokenizer has zero-shot capability for other languages. It can also be used for speech synthesis in languages such as Japanese and Korean.
CosyVoice2 shares the same training data as its previous version (CosyVoice [34]). We first collect speech-only data with internal speech processing tools. Subsequently, Paraformer [50] and SenseVoice (FunAudioLLM [43]) are employed to generate pseudo text labels for Chinese and the other languages, respectively. We also employ an internal force-alignment model to filter out low-quality data and enhance the accuracy of punctuation. Data details are provided in Table \ref{tab:cv_data}.
We evaluate CosyVoice2 on two test sets. The first one is constructed from the test-clean set of the Librispeech corpus (UniCATS [51]), denoted as test-clean, and is used to evaluate CosyVoice2 on a limited English domain. Whisper-large V3 is used as the ASR model to evaluate content consistency. As for speaker similarity (SS), we employ the ERes2Net model [52] to extract speaker embeddings of the prompt and generated utterances, and their raw cosine similarity is treated as the speaker similarity. The NMOS score (DNSMOS P.835 [53]) is used to evaluate objective speech quality.
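For reference, the SS metric reduces to a raw cosine similarity between two speaker embeddings; a minimal sketch, assuming 1-D embedding vectors produced by an SV model such as ERes2Net:

```python
import torch.nn.functional as F
from torch import Tensor

def speaker_similarity(prompt_emb: Tensor, generated_emb: Tensor) -> float:
    """Raw cosine similarity between the speaker embeddings of the prompt and the
    generated utterance."""
    return F.cosine_similarity(prompt_emb, generated_emb, dim=-1).item()
```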
The second evaluation is conducted on the SEED test sets (Seed-TTS [33]), which is widely used to evaluate recent TTS models, covering various text domains and reference speeches. In this evaluation, about 2,000 Chinese and 1,000 English samples are selected from CommonVoice datasets, denoted as test-zh and test-en, respectively. In addition, about 400 hard test cases are also included to evaluate the robustness of TTS models on text repetition, tongue twister and other challenging synthesis cases, denoted as test-hard in this report. The Paraformer is employed to recognize the synthesis results of test-zh and test-hard, while the Whisper-large V3 is adopted for test-en to evaluate the content consistency. We adopt two speaker verification (SV) models to evaluate speaker similarity: WavLM-finetuned SV model and ERes2Net.
We prepare two test sets, denoted as test-ja and test-ko, for the evaluation of Japanese and Korean speech synthesis. The test-ja set consists of 1,000 samples extracted from the CommonVoice dataset, which are used to measure the model's performance on metrics such as WER, SS and MOS. Specifically, we randomly shuffle and pair the entire CommonVoice JA-test set into reference and target utterances. Considering the wide range of text lengths in the JA-test set, we randomly selected 1,000 reference-target pairs whose texts range from 8 to 32 characters as our final test set. For test-ko, we selected 1,000 speech samples with a WER of less than 5% and no deletion or insertion errors, using Whisper-Large V3 [54] as the ASR model. These samples were used as reference utterances for Korean speech synthesis. For the input text, we randomly selected 1,000 text samples from the remaining data. We have released the lists of prompt speeches, prompt transcriptions and input texts for these two test sets to facilitate result reproduction. By providing this open-source data, we aim to establish a benchmark for evaluating Japanese and Korean TTS models. Whisper-large V3 is used as the ASR model for the Japanese and Korean evaluations.
An ideal speech tokenizer should effectively utilize the codebook, preserve information at high fidelity, and demonstrate speaker independence. In this part, we evaluate our supervised speech tokenizer from four aspects:
- Codebook utilization rate;
- ASR error rate within the entire encoder;
- Token visualization of different speakers;
- Speaker identification training.
Table~\ref{fsqres} shows the codebook utilization and ASR error rate. The FSQ-based tokenizer fully utilizes the codebook and retains more effective information from the ASR perspective, indicating that FSQ preserves more semantic information.
We further analyze the characteristics of FSQ through t-SNE visualization. As an upstream model for TTS tasks, the tokenizer should minimize the entanglement of speaker identity information with the speech signal. We selected 100 speech samples from each of three speakers in the VoxCeleb1 dataset and visualized the corresponding tokens. As illustrated in Figures~\ref{fsqvis}(a) and (b), before quantization, Encoder$_1$'s outputs exhibit clearly different distributions for different speakers, whereas the distributions of the quantized representations are nearly indistinguishable. In addition, Figure~\ref{fsqvis}(c) shows that the tokenizer fully utilizes the codebook. Subsequently, the S3PRL toolkit [55] is employed to further evaluate speaker entanglement by performing the speaker identification (SID) task. We use the SenseVoice-Large encoder with FSQ as an upstream feature extractor and train the SID task on representations taken before or after quantization. Figure~\ref{fsqtrain} shows the accuracy curves during training. The SID model trained on quantized tokens does not converge, which demonstrates that the tokenizer decouples speaker information from the speech tokens.
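The codebook utilization rate reported above can be measured with a simple count over a dump of tokenized speech; a sketch:

```python
import torch

def codebook_utilization(token_ids: torch.Tensor, codebook_size: int) -> float:
    """Fraction of codebook entries that appear at least once in a dump of speech tokens."""
    return torch.unique(token_ids).numel() / codebook_size
```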
We first evaluated our CosyVoice2 models on a limited English text domain and compared them with several open-source models, such as ChatTTS [Github] [56], GPT-SoVITS [Github] [57], OpenVoice [58], ParlerTTS [59], EmotiVoice [Github] [60], and its predecessor CosyVoice [34]. The objective results are presented in Table \ref{tab:res-librispeech}, including content consistency (WER), speech quality (NMOS) and speaker similarity (SS). From the table, we can see that CosyVoice2 achieves state-of-the-art performance on the Librispeech test-clean set, surpassing all baseline models across all evaluation metrics. Notably, CosyVoice2 even demonstrates higher content consistency, speech quality, and speaker similarity than human utterances, indicating its human-parity synthesis quality.
We also evaluated CosyVoice2 on the commonly-used SEED test sets: test-zh, test-en and test-hard, which include diverse input texts and reference speeches from various domains. The experimental results for CosyVoice2 and the baseline models are presented in Table \ref{tab:comapre}. On the test-zh set, CosyVoice2 surpasses all open-source models in terms of CER and SS, falling short of the commercial model SEED-TTS by only a small margin. On the test-en set, CosyVoice2 ranks fourth and third in terms of WER and SS, respectively. This may result from the imbalance in the volume of training data between Chinese and English; we plan to explore data scaling in future work to enhance content consistency in English. On the test-hard set, the offline CosyVoice2 model achieves state-of-the-art performance across all compared baselines, demonstrating its robustness in challenging synthesis scenarios. Compared with human-generated speech, CosyVoice2 shows comparable content consistency and superior speaker similarity. Considering that recognition errors can also stem from the ASR model, it is reasonable to conclude that CosyVoice2 achieves human-parity synthesis capability.
We also evaluated the streaming mode, denoted as ``CosyVoice2-S'' in Tables \ref{tab:res-librispeech} and \ref{tab:comapre}. For both evaluation settings, the streaming mode's performance is nearly lossless in typical test cases; only in challenging cases is there a slight degradation in content consistency, highlighting the strength of our unified streaming/non-streaming framework.
We found that the speaker-similarity results are not consistent across different SV models. This may point to a new research topic on how to automatically evaluate speaker similarity for TTS models. Since different TTS models may use different SV models to extract speaker information, evaluating speaker similarity with the same SV model allows a more accurate assessment of how speaker information is utilized. Therefore, we employ ERes2Net [Github] for evaluating speaker similarity in subsequent experiments.
We conducted a modular ablation study on the text-speech language model to assess the impact of our modifications, including LLM initialization, removing the speaker embedding, and utilizing FSQ. Table \ref{tab:modular} illustrates the step-by-step development of CosyVoice2 from its predecessor. By replacing the randomly initialized language model with a pretrained LLM, we achieved relative improvements in content consistency of 18.46% and 15.40% on the test-zh and test-hard sets, respectively. Next, we removed the speaker embedding from the text-speech language model, which helps prevent information leakage and disturbances in in-context learning. This change resulted in a significant reduction in content errors while maintaining speaker similarity, indicating that content information is primarily modeled by the LM, while speaker information is mainly recovered by the flow matching model. Finally, by replacing VQ with FSQ, we arrived at the CosyVoice2 model, which shows much higher content consistency and unchanged speaker similarity. By fully utilizing the codebook, FSQ captures more content information and context variation, leading to better alignment between text and speech tokens. Furthermore, we conducted a comparative experiment by incorporating pitch loss as a constraint during the training of the FSQ-based speech tokenizer, and found that this led to improved performance in downstream TTS tasks, as indicated in the last row of Table \ref{tab:modular}. In future versions of CosyVoice, we plan to carry out more detailed experiments and analyses.
We also conducted another modular analysis to evaluate the impact of the streaming modules on synthesis performance. Table \ref{tab:res-streaming} shows the results for content consistency and speaker similarity. We found that the streaming LM has a minimal impact on typical cases from the test-zh and test-en sets, indicating the effectiveness of our unified training framework. The primary impact of the streaming LM is observed on challenging cases from the test-hard set, likely due to the loss of contextual information in streaming mode. Interestingly, the streaming flow matching model results in slightly higher speaker similarity compared to the offline mode. This may be due to the higher prompt-to-generation ratio of initial chunks in streaming mode, whereas the prompt-to-generation ratio in offline mode can be very low, with many padding tokens. The negative effect of the streaming flow matching model on content consistency is much less pronounced than that of the streaming LM, thanks to the semantic-acoustic decoupled modeling in CosyVoice2.
In addition to Chinese and English, CosyVoice2 also supports Japanese and Korean. We evaluated the content consistency, speaker similarity and speech quality on our constructed Japanese and Korean test sets. As shown in Table \ref{tab:ja-ko}, CosyVoice2 performs significantly better on Korean than on Japanese across all evaluation metrics. This discrepancy is primarily due to the overlap in the character set between Japanese and Chinese, which leads to Chinese pronunciations in Japanese contexts. In the future work, we plan to explore ways to enhance linguistic context for multilingual synthesis. Since Korean does not have character overlap with other languages, its speech synthesis achieves much better performance. Another issue is data imbalance. We believe that increasing the volume of training data could further improve synthesis performance for both Japanese and Korean.
To evaluate the performance of instructed generation, we created a Chinese test set comprising 290 samples. This set includes 29 types of instructions, shown in Table \ref{tab:example_instruct}, each with 10 different input texts. We utilize five audio prompts and speaker embeddings from five speakers (three female and two male) as conditions for the flow matching model. Our testing is conducted in offline mode. We objectively evaluate content consistency (CER), speaker similarity (SS), and speech quality (NMOS). Subjectively, we assess the accuracy and naturalness of instruction following using the Mean Opinion Score for Instruction (MOS-I), which ranges from 1 to 5. Each sample is assessed by 10 native Chinese speakers, with scores assigned in increments of 0.5. The evaluation criteria focus on whether the speech adheres to all specified instructions, such as emotional expression, speech rate adjustment, dialect usage, and role-playing. Fine-grained controls, including the insertion of laughter, speaking with laughter, breath control, and emphasis, are evaluated for naturalness and accuracy. As illustrated in Table \ref{tab:res-instruct}, CosyVoice2 exhibits superior content consistency (CER), speaker similarity (SS), and accuracy and naturalness of instruction control (MOS-I), while maintaining speech quality comparable to CosyVoice-Instruct. When input instructions are removed from CosyVoice2, there is a notable decline in MOS-I, while content consistency (CER), speaker similarity (SS), and speech quality (NMOS) improve; this indicates that instruction controllability can hardly emerge implicitly from the content text alone.
During the fine-tuning phase, we employ unsupervised clustering on the speaker embeddings of the same speaker to ensure the stability of the speaker's timbre. We have demonstrated that a target speaker with as few as 400 audio recordings can achieve reasonably good speech synthesis performance, with only slight variations in objective metrics observed among different speakers, as shown in Figure \ref{fig:sft-fm}. Our experiments indicate that most speakers can inherit the zero-shot TTS model's robust contextual understanding and perception, thereby naturally expressing various moods and emotions in response to the input text.
Although SFT can improve the performance for most speakers, the results of Spk E are still worse than the base model, especially on English, because Spk E has a more complex voice and a faster speaking rate, and only Chinese recordings are available for this speaker. We therefore apply reinforcement learning on Spk E for further improvement. For DPO, we synthesize ten thousand sample pairs with the SFT models and use the ASR and SS rewards to bias the preference of the LM. We also use the differentiable ASR reward to optimize the LM parameters. After RL, we evaluate the model with content consistency (WER), speaker similarity (SS) and speech quality (NMOS) on the test set of Spk E, and further evaluate the WER on the SEED test sets to explore whether the model remains robust to out-of-domain or cross-lingual input text. Results are shown in Table \ref{tab:res-sft}.
Compared to the pre-trained base model, the SFT model shows higher speaker similarity and speech quality; however, its WER can be worse than the base model. We find that the audio synthesized by the base model is generally slower than that of the SFT model and the ground truth, which is more friendly to ASR systems. On the target-speaker dataset, both preference biasing and the differentiable reward reduce the WER with little harm to the other two metrics. On the SEED test sets, however, the DPO-based reinforcement only benefits the Chinese and English subsets, while performance on the hard samples degrades. The reason could be that the hard samples contain many repeated words or phrases, which may be regarded as rejected samples during DPO training. The differentiable ASR reward does not suffer from this problem, as it directly optimizes the TTS system with the ASR posterior; this suggests that the differentiable ASR reward generalizes better to out-of-domain situations. Finally, the two approaches can be combined for further improvements.
Building on the success of CosyVoice, this report presents CosyVoice2, an improved streaming speech synthesis model that leverages large language models. By unifying streaming and non-streaming synthesis within a single framework, CosyVoice2 achieves human-parity naturalness, minimal response latency, and virtually lossless synthesis quality in streaming mode. Key innovations include finite scalar quantization for full codebook utilization, a simplified text-to-speech language model architecture that incorporates pre-trained textual LLMs, and the development of a chunk-aware causal flow matching model to support diverse synthesis scenarios. Additionally, improvements in instructed TTS capacity allow for versatile and vivid speech generation with fine-grained control over emotion, accent, role style, and vocal bursts. Through systematic modifications and optimizations, CosyVoice2 not only delivers superior synthesis quality but also loosens deployment requirements, making it suitable for both streaming and non-streaming applications. We believe that CosyVoice2 represents a significant advancement in scalable, high-quality, and interactive text-to-speech synthesis.
CosyVoice2 has several limitations that need to be addressed. First, it supports only a limited number of languages. For languages with overlapping character sets, synthesis performance may degrade, presenting an open challenge for future research. Second, CosyVoice2 cannot control acoustic characteristics, such as timbre, through textual instructions, which could be a fascinating area of exploration for role-playing applications. Additionally, CosyVoice2 does not perform well when tasked with singing.