
Attaining arbitrarily long audio generation using chunked generation and latent space interpolation #101

Open · wants to merge 5 commits into main
Conversation

InconsolableCellist


Overview

This PR introduces chunked generation with latent space interpolation, intended for use with voice cloning and the transformer model variant (not the hybrid). The implementation uses overlapping windows in the latent space to maintain coherence across chunk boundaries.

Important Usage Notes

  • Best Used With:
    • Voice cloning (speaker audio provided)
    • Transformer model variant only
    • Longer multi-sentence texts
  • Not Recommended For:
    • Hybrid model variant
    • Generation without voice cloning

Key Changes

Core Generation

  • Added sentence-based chunking for long-form text processing
  • Implemented NLTK-based sentence tokenization
  • Added a cosine-based crossfade between chunks (a minimal sketch follows this list)
  • Raised the maximum generation length from 30 seconds to 120 seconds (the chunked latent approach itself has no upper bound)
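The crossfade can be illustrated with a short sketch. This is an assumption about how such a blend typically looks, not the PR's exact code; the function name and overlap handling are illustrative only.

```python
import numpy as np

def cosine_crossfade(chunk_a: np.ndarray, chunk_b: np.ndarray, overlap: int) -> np.ndarray:
    """Blend the tail of chunk_a into the head of chunk_b over `overlap` samples."""
    t = np.linspace(0.0, np.pi, overlap)
    fade_out = 0.5 * (1.0 + np.cos(t))  # smoothly ramps 1 -> 0
    fade_in = 1.0 - fade_out            # smoothly ramps 0 -> 1
    blended = chunk_a[-overlap:] * fade_out + chunk_b[:overlap] * fade_in
    return np.concatenate([chunk_a[:-overlap], blended, chunk_b[overlap:]])
```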

Gradio Interface

  • New toggle under Advanced Parameters to enable chunked generation

Technical Implementation

  1. Text is split into sentences using NLTK
  2. Each sentence is processed independently with the same seed
  3. Overlap regions are analyzed for best transition points
  4. Cosine crossfade is applied at chunk boundaries
  5. Results are concatenated with smooth transitions
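A rough end-to-end sketch of these five steps, assuming NLTK sentence splitting, per-sentence generation with a fixed seed, and the cosine crossfade shown earlier. `generate_audio_for_sentence` is a hypothetical stand-in for the model's actual generation call, and the search for the best transition point within the overlap is omitted.

```python
import nltk
import numpy as np

nltk.download("punkt", quiet=True)

def generate_long_form(text: str, seed: int, overlap: int = 2048) -> np.ndarray:
    sentences = nltk.sent_tokenize(text)                  # 1. split into sentences
    chunks = [generate_audio_for_sentence(s, seed=seed)   # 2. same seed per sentence
              for s in sentences]
    audio = chunks[0]
    for nxt in chunks[1:]:
        audio = cosine_crossfade(audio, nxt, overlap)     # 4-5. blend boundaries and concatenate
    return audio
```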

Misc.

  • Added a cache directory mounted via the Docker Compose file, so the models don't have to be re-downloaded constantly during development

Limitations/Improvements Needed

  • Audio Artifacting
    • Chunking introduces discontinuities in the latent space, which cause small audio artifacts; these sometimes sound like a microphone being adjusted and at other times aren't perceptible
    • Sentences are sometimes separated by a pause of about a second

Examples

With Latent Windowing (123 seconds)

latent.windowing.4.mp4

Without Latent Windowing (46 seconds)

regular_3.mp4

@FurkanGozukara

amazing improvement

@darkacorn
Contributor

Interesting approach with the latents. I spoke to the team 1-2 days ago about this.

Internally they just split the text, generate the chunks individually, and stitch them together on the production API (that was the information I got).

One option they recommended to make transitions smooth would be to use the last 2-3 words as prefix audio for the next chunk and cut that prefix out of gen 2 (the text has to be prefixed too); in theory that could allow infinite length.

But you would eventually need some ASR to prefix the text chunks too.

The ideal solution is probably somewhere in the middle. Thanks for this approach.
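Purely as an illustration of the prefix-audio idea described above (not code from the PR), a minimal sketch in which every helper (`generate`, `cut_prefix`, `take_last_words`, `concatenate`) is a hypothetical placeholder:

```python
# Hypothetical sketch of prefix-audio chaining; all helpers are placeholders.
def generate_with_prefix_audio(sentences, words_to_carry=3):
    out_chunks = []
    prefix_audio, prefix_text = None, ""
    for sentence in sentences:
        text = (prefix_text + " " + sentence).strip()      # the text has to be prefixed too
        chunk = generate(text, prefix_audio=prefix_audio)   # condition on the carried audio
        chunk = cut_prefix(chunk, prefix_audio)              # cut the prefix back out of the new chunk
        out_chunks.append(chunk)
        # carry the last few words forward; ASR (e.g. Whisper) would locate the word boundary
        prefix_audio, prefix_text = take_last_words(chunk, sentence, n=words_to_carry)
    return concatenate(out_chunks)
```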

@InconsolableCellist
Author

I haven't had a chance to try the playground version. Does it have better performance doing it that way, and is it consistent?

What's ASR in that context?

@darkacorn
Contributor

ASR - Whisper - pretty much STT, as otherwise it would be hard to know when to cut off and what to feed back in. The text has to be prefixed, just the way prefixes work. The playground has a few differences from what we have in OSS - namely, they seem to use different samplers (internally), although the model being inferenced is the transformer.

@InconsolableCellist
Author

Ah whisper, gotcha.

Do you think the performance of my solution won't be enough to make it into upstream? Or do you want to take that other approach eventually and not use this?

@darkacorn
Contributor

darkacorn commented Feb 16, 2025

No man, I think your approach is super interesting, and something I would not have thought of. I was merely relaying the conversations I had with the team to find out how they do it and what ideas they have.

Ideally someone would find something that works for arbitrary length and for Mamba too, but this is a very cool approach already!

@Ph0rk0z

Ph0rk0z commented Feb 17, 2025

This is quite important and basically has to be done for every TTS. Otherwise we have a hard limit on length.

@InconsolableCellist
Author

I merged in the upstream changes for the sampler, but it creates dramatically worse results for me now; not sure why yet.

@@ -10,7 +10,10 @@ services:
     network_mode: "host"
     stdin_open: true
     tty: true
-    command: ["python3", "gradio_interface.py"]
+    command: ["bash", "-c", "pip install nltk && python3 -c 'import nltk; nltk.download(\"punkt\"); nltk.download(\"punkt_tab\")' && python3 gradio_interface.py"]

Why not install nltk with the rest of the Python packages? Furthermore, don't you already download punkt on line 99 in gradio_interface.py?
