
Attaining arbitrarily long audio generation using chunked generation and latent space interpolation #101

Open · wants to merge 5 commits into main
Conversation

InconsolableCellist


Overview

This PR introduces chunked generation with latent space interpolation, intended for use with voice cloning and the transformer model variant (not the hybrid). The implementation uses overlapping windows in the latent space to maintain coherence across chunk boundaries.

Important Usage Notes

  • Best Used With:
    • Voice cloning (speaker audio provided)
    • Transformer model variant only
    • Longer multi-sentence texts
  • Not Recommended For:
    • Hybrid model variant
    • Generation without voice cloning

Key Changes

Core Generation

  • Added sentence-based chunking for long-form text processing
  • Implemented NLTK-based sentence tokenization
  • Added a cosine-based crossfade between chunks (a minimal sketch follows this list)
  • Raised the maximum generation length from 30 seconds to 120 seconds (the chunked latent approach itself has no upper bound)
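The crossfade can be illustrated with a short sketch. This is an assumption about how such a blend typically looks, not the PR's exact code; the function name and overlap handling are illustrative only.

```python
import numpy as np

def cosine_crossfade(chunk_a: np.ndarray, chunk_b: np.ndarray, overlap: int) -> np.ndarray:
    """Blend the tail of chunk_a into the head of chunk_b over `overlap` samples."""
    t = np.linspace(0.0, np.pi, overlap)
    fade_out = 0.5 * (1.0 + np.cos(t))  # smoothly ramps 1 -> 0
    fade_in = 1.0 - fade_out            # smoothly ramps 0 -> 1
    blended = chunk_a[-overlap:] * fade_out + chunk_b[:overlap] * fade_in
    return np.concatenate([chunk_a[:-overlap], blended, chunk_b[overlap:]])
```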

Gradio Interface

  • New toggle under Advanced Parameters to enable chunked generation

Technical Implementation

  1. Text is split into sentences using NLTK
  2. Each sentence is processed independently with the same seed
  3. Overlap regions are analyzed for best transition points
  4. Cosine crossfade is applied at chunk boundaries
  5. Results are concatenated with smooth transitions
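A rough end-to-end sketch of these five steps, assuming NLTK sentence splitting, per-sentence generation with a fixed seed, and the cosine crossfade shown earlier. `generate_audio_for_sentence` is a hypothetical stand-in for the model's actual generation call, and the search for the best transition point within the overlap is omitted.

```python
import nltk
import numpy as np

nltk.download("punkt", quiet=True)

def generate_long_form(text: str, seed: int, overlap: int = 2048) -> np.ndarray:
    sentences = nltk.sent_tokenize(text)                  # 1. split into sentences
    chunks = [generate_audio_for_sentence(s, seed=seed)   # 2. same seed per sentence
              for s in sentences]
    audio = chunks[0]
    for nxt in chunks[1:]:
        audio = cosine_crossfade(audio, nxt, overlap)     # 4-5. blend boundaries and concatenate
    return audio
```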

Misc.

  • Added a cache directory mounted via the Docker Compose file, so the models don't have to be re-downloaded constantly during development

Limitations/Improvements Needed

  • Audio Artifacting
    • Chunking introduces discontinuities in the latent space, which cause small audio artifacts; these sometimes sound like a microphone being adjusted and at other times aren't perceptible
    • Sentences are sometimes separated by a pause of about a second

Examples

With Latent Windowing (123 seconds)

latent.windowing.4.mp4

Without Latent Windowing (46 seconds)

regular_3.mp4

@FurkanGozukara

amazing improvement

@darkacorn
Contributor

Interesting approach with the latents. I spoke to the team 1-2 days ago about this.

Internally they just split the text, generate the chunks individually, and stitch them together on the production API (that was the information I got).

One option they recommended to make transitions smooth would be to use the last 2-3 words as prefix audio for the next chunk and cut that prefix out of gen 2 (the text has to be prefixed too); in theory that could allow infinite length.

But you would eventually need some ASR to prefix the text chunks too.

The ideal solution is probably somewhere in the middle. Thanks for this approach.
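Purely as an illustration of the prefix-audio idea described above (not code from the PR), a minimal sketch in which every helper (`generate`, `cut_prefix`, `take_last_words`, `concatenate`) is a hypothetical placeholder:

```python
# Hypothetical sketch of prefix-audio chaining; all helpers are placeholders.
def generate_with_prefix_audio(sentences, words_to_carry=3):
    out_chunks = []
    prefix_audio, prefix_text = None, ""
    for sentence in sentences:
        text = (prefix_text + " " + sentence).strip()      # the text has to be prefixed too
        chunk = generate(text, prefix_audio=prefix_audio)   # condition on the carried audio
        chunk = cut_prefix(chunk, prefix_audio)              # cut the prefix back out of the new chunk
        out_chunks.append(chunk)
        # carry the last few words forward; ASR (e.g. Whisper) would locate the word boundary
        prefix_audio, prefix_text = take_last_words(chunk, sentence, n=words_to_carry)
    return concatenate(out_chunks)
```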

@InconsolableCellist
Author

I haven't had a chance to try the playground version. Does it have better performance doing it that way, and is it consistent?

What's ASR in that context?

@darkacorn
Contributor

ASR - Whisper - pretty much STT, as otherwise it would be hard to know when to cut off and what to feed back in. The text has to be prefixed, just the way prefixes work. The playground has a few differences from what we have in OSS - namely, they seem to use different samplers (internally), although the model being inferenced is the transformer.

@InconsolableCellist
Author

Ah whisper, gotcha.

Do you think the performance of my solution won't be enough to make it into upstream? Or do you want to take that other approach eventually and not use this?

@darkacorn
Contributor

darkacorn commented Feb 16, 2025

No man, I think your approach is super interesting, and something I would not have thought of. I was merely relaying the conversations I had with the team to find out how they do it and what ideas they have.

Ideally someone would find something that works for arbitrary length and for Mamba too, but this is a very cool approach already!

@Ph0rk0z

Ph0rk0z commented Feb 17, 2025

This is quite important and basically has to be done for every TTS. Otherwise we have a hard limit on length.

@InconsolableCellist
Author

I merged in the upstream changes for the sampler, but it creates dramatically worse results for me now; not sure why yet.

@@ -10,7 +10,10 @@ services:
     network_mode: "host"
     stdin_open: true
     tty: true
-    command: ["python3", "gradio_interface.py"]
+    command: ["bash", "-c", "pip install nltk && python3 -c 'import nltk; nltk.download(\"punkt\"); nltk.download(\"punkt_tab\")' && python3 gradio_interface.py"]

Why not install nltk with the rest of the Python packages? Furthermore, don't you already download punkt on line 99 in gradio_interface.py?
