[Bug] Fine tuned XTTS v2 produces strange sounds for short text #3516

Open
ukemamaster opened this issue Jan 15, 2024 · 24 comments
Labels
bug (Something isn't working); wontfix (This will not be worked on but feel free to help)

Comments

@ukemamaster

Describe the bug

I have fine-tuned the XTTS v2 model on my own data, which contains both long and short audio clips. The following histogram shows clip duration in seconds on the x-axis; the labels 'old' and 'new' mark the two datasets, with long and short audio respectively.

[Image: data_es_mix_hist]

However, the model produces strange sounds for text of only 1-2 words, as in the following two examples for text='hola':

2.mp4
1.mp4

It seems that the model tries to produce at least 3 seconds of audio even when the text is very short, so it adds meaningless sounds to the spoken word.

@erogol Is there any way to avoid this behavior, or any parameter (maybe in the model args) to control it?
There are gpt_start_audio_token and gpt_stop_audio_token parameters in the TTS.tts.models.xtts.XttsArgs() class, but I am not sure what their impact is.
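
For context on what a stop token generally does in an autoregressive decoder, here is a minimal sketch. It is not XTTS code: `generate`, `next_token_fn`, the two toy "models", and the token ids are hypothetical names and values chosen for illustration. The loop ends either when the stop id is produced or when the length cap is hit, so a model that rarely predicts the stop id after short text keeps running, and the extra tokens would become extra audio.

```python
from typing import Callable, List


def generate(next_token_fn: Callable[[List[int]], int],
             start_token: int,
             stop_token: int,
             max_new_tokens: int) -> List[int]:
    """Decode until stop_token appears or max_new_tokens is reached."""
    sequence = [start_token]
    generated: List[int] = []
    for _ in range(max_new_tokens):
        token = next_token_fn(sequence)
        if token == stop_token:
            break  # the model signalled the end of the audio sequence
        sequence.append(token)
        generated.append(token)
    return generated


# Illustrative ids only; they are assumptions, not values read from XttsArgs.
START, STOP = 1024, 1025


def stops_early(seq: List[int]) -> int:
    # Toy model: emits the stop id once three audio tokens follow the start token.
    return STOP if len(seq) > 3 else 7


def never_stops(seq: List[int]) -> int:
    # Toy model: never emits the stop id, so only the cap ends generation.
    return 7


short = generate(stops_early, START, STOP, max_new_tokens=50)
runaway = generate(never_stops, START, STOP, max_new_tokens=50)
print(len(short), len(runaway))
```

In this toy setup the second call runs all the way to the cap, which is the failure mode described above in miniature.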

To Reproduce

N/A

Expected behavior

Should produce short audio for short text.

Logs

No response

Environment

{
    "CUDA": {
        "GPU": [
            "NVIDIA A30",
            "NVIDIA A30",
            "NVIDIA A30",
            "NVIDIA A30"
        ],
        "available": true,
        "version": "12.1"
    },
    "Packages": {
        "PyTorch_debug": false,
        "PyTorch_version": "2.1.0+cu121",
        "TTS": "0.22.0",
        "numpy": "1.23.0"
    },
    "System": {
        "OS": "Linux",
        "architecture": [
            "64bit",
            "ELF"
        ],
        "processor": "x86_64",
        "python": "3.10.12",
        "version": "#64-Ubuntu SMP Thu Jan 5 11:43:13 UTC 2023"
    }
}

Additional context

No response

@ukemamaster ukemamaster added the bug Something isn't working label Jan 15, 2024
@ukemamaster
Author

ukemamaster commented Jan 16, 2024

I tried several times to re-cut the data into clips ranging from 0.5 s to 20 s, making sure each clip stays aligned with its text, but nothing improves. There might be a difference between the model args in the training recipe and those of the pretrained model that is provided.
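
A minimal sketch of the kind of duration check described here, assuming the dataset is available as a list of (path, duration in seconds) pairs. The function names and the sample clips are hypothetical; this is not part of the TTS training recipe.

```python
from collections import Counter
from typing import Dict, List, Tuple

Clip = Tuple[str, float]  # (audio path, duration in seconds)


def filter_by_duration(clips: List[Clip],
                       min_s: float = 0.5,
                       max_s: float = 20.0) -> List[Clip]:
    """Keep only clips whose duration lies in [min_s, max_s]."""
    return [(path, dur) for path, dur in clips if min_s <= dur <= max_s]


def duration_histogram(clips: List[Clip],
                       bin_width_s: float = 1.0) -> Dict[int, int]:
    """Count clips per bin; key k covers [k * bin_width_s, (k + 1) * bin_width_s)."""
    counts = Counter(int(dur // bin_width_s) for _, dur in clips)
    return dict(sorted(counts.items()))


# Made-up example data, for illustration only.
clips = [("a.wav", 0.3), ("b.wav", 0.8), ("c.wav", 1.2),
         ("d.wav", 1.9), ("e.wav", 12.0), ("f.wav", 25.0)]

kept = filter_by_duration(clips)
hist = duration_histogram(kept)
print(len(kept), hist)
```

A histogram like this makes it easy to confirm that the sub-second and 1-2 second bins are actually populated after re-cutting.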

@erogol Can you please confirm that the model args provided in the training recipe are the same as those used for your own trained model?

@bensonbs

Same issue.

@ukemamaster
Author

@bensonbs
Have you fine tuned the xtts-v2 model on your own dataset?
Can you share a histogram of the audio lengths of your dataset?
Have you tried to modify the training code or model args to avoid this?

@insomnia777

Same issue.

@kaveenkumar

Same issue. The pre-trained XTTSv2 produces extra speech after the intended text 10-20% of the time.

@peterliu2023

Same issue. The pretrained XTTS v2 generates extra speech at random.

@bensonbs

bensonbs commented Apr 12, 2024

I have implemented the Diversified Perturbation Optimized (DPO) loss in TTS/tts/layers/xtts/gpt.py to improve the model's generalization ability and robustness. This implementation aims to address the issue of strange sounds occurring for short text inputs. By introducing the DPO loss, the model is expected to generate more consistent and natural-sounding audio output, even for shorter text sequences.

Code Snippet:
TTS/tts/layers/xtts/gpt.py

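# F is torch.nn.functional. First forward pass: these logits serve as the reference.
text_logits, mel_logits = self.get_logits(
    text_emb,
    self.text_head,
    mel_emb,
    self.mel_head,
    prompt=cond_latents,
    get_attns=return_attentions,
    return_latent=return_latent,
    attn_mask_cond=attn_mask_cond,
    attn_mask_text=attn_mask_text,
    attn_mask_mel=attn_mask_mel,
)

# Second forward pass with identical inputs. Its outputs can differ from the first
# pass only through stochastic layers such as dropout in training mode.
reject_text_logits, reject_mel_logits = self.get_logits(
    text_emb,
    self.text_head,
    mel_emb,
    self.mel_head,
    prompt=cond_latents,
    get_attns=return_attentions,
    return_latent=return_latent,
    attn_mask_cond=attn_mask_cond,
    attn_mask_text=attn_mask_text,
    attn_mask_mel=attn_mask_mel,
)
# Turn the first-pass logits into soft targets.
# NOTE: F.cross_entropy treats dim 1 as the class dimension. If get_logits returns
# (batch, vocab, seq), this softmax would need dim=1 instead of dim=-1.
text_probs = F.softmax(text_logits, dim=-1)
mel_probs = F.softmax(mel_logits, dim=-1)

# Cross-entropy of the second pass against the first-pass soft targets.
loss_text_dpo = F.cross_entropy(reject_text_logits, text_probs)
loss_mel_dpo = F.cross_entropy(reject_mel_logits, mel_probs)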

TTS/tts/layers/xtts/trainer/gpt_trainer.py

# The DPO terms reuse the existing cross-entropy weights.
loss_dict["loss_text_ce"] = loss_text * self.args.gpt_loss_text_ce_weight
loss_dict["loss_mel_ce"] = loss_mel * self.args.gpt_loss_mel_ce_weight
loss_dict["loss_text_dpo"] = loss_text_dpo * self.args.gpt_loss_text_ce_weight
loss_dict["loss_mel_dpo"] = loss_mel_dpo * self.args.gpt_loss_mel_ce_weight
loss_dict["loss"] = (
    loss_dict["loss_text_ce"]
    + loss_dict["loss_mel_ce"]
    + loss_dict["loss_text_dpo"]
    + loss_dict["loss_mel_dpo"]
)

  • VRAM Usage and Training Time Comparison:
    • Without DPO loss:
      VRAM usage: X GB
      Training time per epoch: Y minutes
    • With DPO loss:
      VRAM usage: 2X GB
      Training time per epoch: 2Y minutes

@insomnia777

Can you give me an explanation, and how can I try it?

@bensonbs

bensonbs commented Apr 15, 2024

@insomnia777 When the GPT-2 model generates short sentences, it sometimes fails to produce the [STOP] token at the right point, so peculiar sounds end up in the generated content. Because nothing explicitly guides these sounds, they are inconsistent and can differ from one generation to the next.
To address this, during training I compare the outputs of two generations produced under the same conditions to detect any peculiar sounds. Whether both generations contain strange sounds or only one of them does, the model receives a penalty. This encourages it to avoid generating incoherent random content.

For the method, refer to the modifications in TTS/tts/layers/xtts/gpt.py and TTS/tts/layers/xtts/trainer/gpt_trainer.py.
I am currently testing which loss function is more stable. Compared to cross-entropy, MSE eliminates abnormal sounds more accurately, but I am not sure whether it is theoretically correct.

This method can only be used during fine-tuning, and it requires that your fine-tuning dataset include enough short audio files.
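
To make the cross-entropy versus MSE comparison concrete, here is a small standard-library sketch of both penalties applied to two sets of logits for a single position. It is an illustration under stated assumptions, not the implementation from this thread: the function names and the example logits are hypothetical, and real training code would operate on batched tensors.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a single position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def soft_cross_entropy(target_probs: List[float], logits: List[float]) -> float:
    """H(p, q) = -sum_i p_i * log q_i, where q = softmax(logits)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(p * (x - log_z) for p, x in zip(target_probs, logits))


def mse(p: List[float], q: List[float]) -> float:
    """Mean squared error between two probability vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) / len(p)


# Made-up logits from two passes over the same input.
pass_a = [2.0, 0.5, -1.0]
pass_b_close = [2.1, 0.4, -1.1]  # second pass nearly agrees with the first
pass_b_far = [-1.0, 0.5, 2.0]    # second pass disagrees strongly

target = softmax(pass_a)
ce_close = soft_cross_entropy(target, pass_b_close)
ce_far = soft_cross_entropy(target, pass_b_far)
ce_same = soft_cross_entropy(target, pass_a)
mse_far = mse(target, softmax(pass_b_far))
mse_same = mse(target, softmax(pass_a))
print(ce_close, ce_far, ce_same, mse_far, mse_same)
```

One difference worth noting: when the two passes agree exactly, the MSE term is zero, while the soft-target cross-entropy equals the entropy of the target distribution and therefore stays positive.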

@insomnia777

@bensonbs Wouldn't it be easier to impose a penalty on the length of the generated sequence, based on the median characters-per-second rate of the data?
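
One way to read this suggestion, sketched with the standard library only. The names, the `slack` factor, and the sample data are hypothetical, and the thread does not say how such a penalty would be wired into training or inference; this only shows the arithmetic.

```python
from statistics import median
from typing import List, Tuple


def median_chars_per_second(dataset: List[Tuple[str, float]]) -> float:
    """Median speaking rate over (text, duration in seconds) pairs."""
    return median(len(text) / dur for text, dur in dataset if dur > 0)


def expected_max_duration(text: str, cps: float, slack: float = 1.5) -> float:
    """Duration budget in seconds: text length over speaking rate, times a slack factor."""
    return slack * len(text) / cps


def length_penalty(text: str, generated_s: float, cps: float,
                   slack: float = 1.5) -> float:
    """Seconds by which the generated audio exceeds the budget (0.0 if within it)."""
    return max(0.0, generated_s - expected_max_duration(text, cps, slack))


# Made-up (text, duration) pairs, for illustration only.
dataset = [("hola", 0.5), ("buenos dias", 1.0), ("como estas hoy", 2.0)]

cps = median_chars_per_second(dataset)
too_long = length_penalty("hola", 3.0, cps)
within_budget = length_penalty("hola", 0.6, cps)
print(cps, too_long, within_budget)
```

With these numbers, a 3-second output for 'hola' exceeds its budget by a wide margin, while a 0.6-second output is not penalized.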

@tuanh123789

@bensonbs Can you share some samples generated with the DPO loss?

@saiful9379

saiful9379 commented Jul 5, 2024

@bensonbs Thank you for your clear explanation. Could you please share some samples generated after applying DPO, and comment on the audio quality?

@anhnh2002

Same issue.

@tuanh123789

Hi everybody, I found the optimal way to fix this issue: just fine-tune the DVAE on your data :D

@nvtinh368

nvtinh368 commented Aug 17, 2024

.

@nvtinh368

@tuanh123789 Hello, can you be more specific?

@sushant-samespace

Hello @tuanh123789, do you have any source for fine-tuning the DVAE? Thanks

@kerlynla

@tuanh123789 Do you have any Vietnamese model that has already been fine-tuned?

@anhnh2002

@sushant-samespace https://github.com/nguyenhoanganh2002/XTTSv2-Finetuning-for-New-Languages

@rose07

rose07 commented Oct 15, 2024

@JohannPie

So we cannot use the pretrained xttsv2 model? We have to finetune our own with dvae?


stale bot commented Dec 8, 2024

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels.

@stale stale bot added the wontfix This will not be worked on but feel free to help. label Dec 8, 2024
@eschmidbauer

@bensonbs I trained with the DPO loss settings from your Apr 12, 2024 comment and found that avg_loss_text_ce does not seem to improve with them. The light blue line is the run with those settings.
[Image: training loss curves]

@stale stale bot removed the wontfix This will not be worked on but feel free to help. label Dec 26, 2024

stale bot commented Jan 26, 2025

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. You might also look our discussion channels.

@stale stale bot added the wontfix This will not be worked on but feel free to help. label Jan 26, 2025