Major update, UI, Controls, Bug Fix, Speed up computation...

numz · Aug 13, 2023 · 7a817b3
1 parent f2bfff2
commit 7a817b3
Showing 13 changed files with 329 additions and 281 deletions.
diff --git a/.gitignore b/.gitignore
@@ -14,3 +14,5 @@ scripts/wav2lip/output/masks/*.png
 scripts/wav2lip/output/*.mp4
 scripts/wav2lip/output/*.aac
 scripts/wav2lip/results/result_voice.mp4
+scripts/wav2lip/temp/*.avi
+scripts/wav2lip/temp/*.wav
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -24,6 +24,6 @@ Before submitting a pull request, please make sure your code adheres to the proj
 
 ## Contact
 
-If you have any questions or need help, please ping the developer via email at [email protected] to make sure your addition will fit well into such a large project and to get help if needed.
+If you have any questions or need help, please ping the developer on Discord (NumZ#7184) to make sure your addition will fit well into such a large project and to get help if needed.
 
 Thank you again for your contribution!
diff --git a/README.md b/README.md
@@ -1,27 +1,44 @@
-# Wav2Lip UHQ extension for Stable diffusion webui Automatic1111
+# 🔉👄 Wav2Lip UHQ extension for Stable Diffusion WebUI Automatic1111
 
 
 ![Illustration](https://user-images.githubusercontent.com/800903/258130805-26d9732f-4d33-4c7e-974e-7af2f1261768.gif)
 
-Result video can be find here : https://www.youtube.com/watch?v=-3WLUxz6XKM
-
 https://user-images.githubusercontent.com/800903/258139382-6594f243-b43d-46b9-89f1-7a9f8f47b764.mp4
 
-## Description
+## 💡 Description
 This repository contains a Wav2Lip UHQ extension for Automatic1111. 
 
-It's an all-in-one solution: just choose a video and a speech file (wav or mp3), and it will generate a lip-sync video. It improves the quality of the lip-sync videos generated by the [Wav2Lip tool](https://github.com/Rudrabha/Wav2Lip) by applying specific post-processing techniques with Stable diffusion.
+It's an all-in-one solution: just choose a video and a speech file (wav or mp3), and the extension will generate a lip-sync video. It improves the quality of the lip-sync videos generated by the [Wav2Lip tool](https://github.com/Rudrabha/Wav2Lip) by applying specific post-processing techniques with Stable Diffusion tools.
+
+![Illustration](https://user-images.githubusercontent.com/800903/260311451-75d9ebeb-796b-489b-9192-65570ea0d83d.png)
+
+## 📖 Quick Index
+* [🚀 Updates](#-Updates)
+* [🔗 Requirements](#-Requirements)
+* [💻 Installation](#-Installation)
+* [🐍 Usage](#-Usage)
+* [📖 Behind the scenes](#-Behind-the-scenes)
+* [💪 Quality tips](#-Quality-tips)
+* [📝 TODO](#-TODO)
+* [😎 Contributing](#-Contributing)
+* [🙏 Appreciation](#-Appreciation)
+* [📝 Citation](#-Citation)
+* [📜 License](#-License)
 
-![Illustration](https://user-images.githubusercontent.com/800903/258130901-cd4403cd-f146-4e69-8a30-8ee4c51beb7f.png)
+## 🚀 Updates
 
-## Requirements
-- latest version of Stable diffusion webui automatic1111
-- FFmpeg
+**2023.08.13**
+- ⚡ Speed-up computation 
+- 🚢 Change user interface: add controls for hidden parameters
+- 👄 Only track mouth if needed
+- 📰 Control debug
+- 🐛 Fix resize factor bug
 
-1. Install Stable Diffusion WebUI by following the instructions on the [Stable Diffusion Webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) repository.
-2. Download FFmpeg from the [official FFmpeg site](https://ffmpeg.org/download.html). Follow the instructions appropriate for your operating system. Note that FFmpeg should be accessible from the command line.
 
-## Installation
+## 🔗 Requirements
+- The latest version of Stable Diffusion WebUI Automatic1111, installed by following the instructions in the [Stable Diffusion Webui](https://github.com/AUTOMATIC1111/stable-diffusion-webui) repository.
+
+## 💻 Installation
 
 1. Launch Automatic1111
 2. In the extensions tab, enter the following URL in the "Install from URL" field and click "Install":
@@ -32,48 +49,76 @@ It's an all-in-one solution: just choose a video and a speech file (wav or mp3),
 
 ![Illustration](https://user-images.githubusercontent.com/800903/258115651-196a07bd-ee4b-4aaf-b11e-8e2d1ffaa42f.png)
 
-5. if you don't see the "Wav2lip Uhq tab" restart automatic1111.
+4. If you don't see the "Wav2Lip UHQ" tab, restart Automatic1111.
 
-6. 🔥 Important: Get the weights. Download the model weights from the following locations and place them in the corresponding directories:
+5. 🔥 Important: Get the weights. Download the model weights from the following locations and place them in the corresponding directories (pay attention to the filenames, especially for s3fd).
 
 |        Model        |                                    Description                                     |                                                                        Link to the model                                                                         |                                       install folder                                       |
 |:-------------------:|:----------------------------------------------------------------------------------:|:----------------------------------------------------------------------------------------------------------------------------------------------------------------:|:------------------------------------------------------------------------------------------:|
 |       Wav2Lip       |                              Highly accurate lip-sync                              |        [Link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/Eb3LEzbfuKlJiR600lQWRxgBIY27JZg80f7V9jtMfbNDaQ?e=TBFBVW)         |                   extensions\sd-wav2lip-uhq\scripts\wav2lip\checkpoints\                   |
 |    Wav2Lip + GAN    |               Slightly inferior lip-sync, but better visual quality                |        [Link](https://iiitaphyd-my.sharepoint.com/:u:/g/personal/radrabha_m_research_iiit_ac_in/EdjI7bZlgApMqsVoEUUXpLsBxqXbn5z8VTmoxp55YNDcIA?e=n9ljGW)         |                   extensions\sd-wav2lip-uhq\scripts\wav2lip\checkpoints\                   |
 |        s3fd         |                          Face Detection pre-trained model                          |                                           [Link](https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth)                                           |      extensions\sd-wav2lip-uhq\scripts\wav2lip\face_detection\detection\sfd\s3fd.pth       |
-|        s3fd         |                 Face Detection pre trained model (alternate link)                  |         [Link](https://www.adrianbulat.com/downloads/python-fan/s3fd-619a316812.pth)         |      extensions\sd-wav2lip-uhq\scripts\wav2lip\face_detection\detection\sfd\s3fd.pth       |
 | landmark predicator |        Dlib 68 point face landmark prediction (click on the download icon)         |                              [Link](https://github.com/numz/wav2lip_uhq/blob/main/predicator/shape_predictor_68_face_landmarks.dat)                              | extensions\sd-wav2lip-uhq\scripts\wav2lip\predicator\shape_predictor_68_face_landmarks.dat |
 | landmark predicator |              Dlib 68 point face landmark prediction (alternate link)               | [Link](https://huggingface.co/spaces/asdasdasdasd/Face-forgery-detection/resolve/ccfc24642e0210d4d885bc7b3dbc9a68ed948ad6/shape_predictor_68_face_landmarks.dat) | extensions\sd-wav2lip-uhq\scripts\wav2lip\predicator\shape_predictor_68_face_landmarks.dat |
 | landmark predicator | Dlib 68 point face landmark prediction (alternate link click on the download icon) |                        [Link](https://github.com/italojs/facial-landmarks-recognition/blob/master/shape_predictor_68_face_landmarks.dat)                         | extensions\sd-wav2lip-uhq\scripts\wav2lip\predicator\shape_predictor_68_face_landmarks.dat |
 
 
-## Usage
+## 🐍 Usage
 1. Choose a video or an image.
 2. Choose an audio file with speech.
 3. Choose a checkpoint (see table above).
-4. **Padding**:  Wav2Lip uses this to add a black border around the mouth, which is useful to prevent the mouth from being cropped by the face detection. You can change the padding value to suit your needs, but the default value gives good results.
-5. **No Smooth**: If checked, the mouth will not be smoothed. This can be useful if you want to keep the original mouth shape.
+4. **Padding**: Wav2Lip uses this to add a black border around the mouth, which is useful to prevent the mouth from being cropped by the face detection. You can change the padding value to suit your needs, but the default value gives good results.
+5. **No Smooth**: When checked, this option retains the original mouth shape without smoothing.
 6. **Resize Factor**: This is a resize factor for the video. The default value is 1.0, but you can change it to suit your needs. This is useful if the video size is too large.
-7. Choose a good Stable diffusion checkpoint, like [delibarate_v2](https://civitai.com/models/4823/deliberate) or [revAnimated_v122](https://civitai.com/models/7371) (SDXL models don't seem to work, but you can generate a SDXL image and change model for wav2lip process).
-8. Click on the "Generate" button.
+7. **Only Mouth**: This option tracks only the mouth, removing other facial motions like those of the cheeks and chin.
+8. **Mouth Mask Dilate**: This will dilate the mouth mask to cover more area around the mouth. The right value depends on the mouth size.
+9. **Face Mask Erode**: This will erode the face mask to remove some area around the face. The right value depends on the face size.
+10. **Mask Blur**: This will blur the mask to make it smoother. Try to keep it less than or equal to **Mouth Mask Dilate**.
+11. **Active Debug**: This will create step-by-step images in the debug folder.
+12. Click on the "Generate" button.
 
-## Behind the scenes
+## 📖 Behind the scenes
 
 This extension operates in several stages to improve the quality of Wav2Lip-generated videos:
 
 1. **Generate a Wav2lip video**: The script first generates a low-quality Wav2Lip video using the input video and audio.
-2. **Mask Creation**: The script creates a mask around the mouth and try to keep other face motion like cheeks and chin.
-3. **Video Quality Enhancement**: It takes the low-quality Wav2Lip video and overlays the low-quality mouth onto the high-quality original video. 
-4. **Img2Img**: The script then sends the original image with the low-quality mouth and the mouth mask into Img2Img. 
+2. **Mask Creation**: The script creates a mask around the mouth and tries to keep other facial motions like those of the cheeks and chin.
+3. **Video Quality Enhancement**: It takes the low-quality Wav2Lip video and overlays the low-quality mouth onto the high-quality original video guided by the mouth mask. 
+4. **Face Enhancer**: The script then sends the original image with the low-quality mouth to the face_enhancer tool of Stable Diffusion to generate a high-quality mouth image.
+5. **Video Generation**: The script then takes the high-quality mouth image and overlays it onto the original image guided by the mouth mask.
+6. **Video Post Processing**: The script then uses the ffmpeg tool to generate the final video.
 
-## Quality tips
+## 💪 Quality tips
 - Use a high quality image/video as input
-- Try to minimize the grain on the face on the input as much as possible, for example you can try to use "Restore faces" in img2img before use an image as wav2lip input.
-- Use a high quality model in stable diffusion webui like [delibarate_v2](https://civitai.com/models/4823/deliberate) or [revAnimated_v122](https://civitai.com/models/7371)
+- Try to minimize the grain on the face on the input as much as possible. For example, you can use the "Restore faces" feature in img2img before using an image as input for Wav2Lip.
+- Dilate the mouth mask. This will help the model retain some facial motion and hide the original mouth.
+- Keep Mask Blur less than or equal to Mouth Mask Dilate.
+
+## 📝 TODO
+- [ ] Add Suno/Bark to generate high quality text to speech audio as wav file input (see [bark](https://github.com/suno-ai/bark/))
+
+## 😎 Contributing
+
+We welcome contributions to this project. When submitting pull requests, please provide a detailed description of the changes. See [CONTRIBUTING](CONTRIBUTING.md) for more information.
+
+## 🙏 Appreciation 
+- [Wav2Lip](https://github.com/Rudrabha/Wav2Lip)
+
+## 📝 Citation
+If you use this project in your own work, in articles, tutorials, or presentations, we encourage you to cite this project to acknowledge the efforts put into it.
 
-## Contributing
+To cite this project, please use the following BibTeX format:
 
-Contributions to this project are welcome. Please ensure any pull requests are accompanied by a detailed description of the changes made.
+```
+@misc{wav2lip_uhq,
+  author = {numz},
+  title = {Wav2Lip UHQ},
+  year = {2023},
+  howpublished = {GitHub repository},
+  publisher = {numz},
+  url = {https://github.com/numz/sd-wav2lip-uhq}
+}
+``` 
 
-## License
+## 📜 License
 * The code in this repository is released under the MIT license as found in the [LICENSE file](LICENSE).
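
The README's tip to keep Mask Blur less than or equal to Mouth Mask Dilate can be illustrated with a small one-dimensional sketch. This is not the extension's code: the helper names (`dilate`, `box_blur`, `mouth_coverage`) are hypothetical, a box filter stands in for the Gaussian blur, and both settings are treated as radii in cells, which is an assumption made only for illustration.

```python
# Hypothetical 1-D illustration; the extension itself works on 2-D image masks.

def dilate(mask, radius):
    """Grow every 1 in a binary mask by `radius` cells on each side."""
    n = len(mask)
    return [
        1 if any(mask[max(0, i - radius):min(n, i + radius + 1)]) else 0
        for i in range(n)
    ]


def box_blur(mask, radius):
    """Average each cell with its neighbours within `radius` cells."""
    n = len(mask)
    blurred = []
    for i in range(n):
        window = mask[max(0, i - radius):min(n, i + radius + 1)]
        blurred.append(sum(window) / len(window))
    return blurred


def mouth_coverage(dilate_radius, blur_radius):
    """Return the weakest mask value found over the original mouth cells."""
    mouth = [0] * 8 + [1] * 4 + [0] * 8  # the mouth occupies cells 8..11
    soft_mask = box_blur(dilate(mouth, dilate_radius), blur_radius)
    return min(soft_mask[8:12])


if __name__ == "__main__":
    for blur in (2, 5):
        print(f"dilate=2 blur={blur} -> weakest mouth coverage {mouth_coverage(2, blur):.2f}")
```

In this sketch, a blur no wider than the dilation leaves the original mouth cells fully covered, while a wider blur lets the mask fade over the mouth itself, so part of the original mouth would show through.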
diff --git a/scripts/ui.py b/scripts/ui.py
@@ -2,55 +2,79 @@
 import gradio as gr
 from scripts.wav2lip.w2l import W2l
 from scripts.wav2lip.wav2lip_uhq import Wav2LipUHQ
-from modules.shared import opts, state
-from pathlib import Path
+from modules.shared import state
+
+
 def on_ui_tabs():
     wav2lip_uhq_sys_extend()
 
     with gr.Blocks(analytics_enabled=False) as wav2lip_uhq_interface:
-        gr.Markdown("<div align='center'> <h3> Follow installation instructions <a href='https://github.com/numz/sd-wav2lip-uhq'> here </a> </h3> </div>")
-        with gr.Row():
-            video = gr.File(label="Video or Image", info="Filepath of video/image that contains faces to use")
-            audio = gr.File(label="Audio", info="Filepath of video/audio file to use as raw audio source")
-            with gr.Column():
-                checkpoint = gr.Radio(["wav2lip", "wav2lip_gan"], value="wav2lip_gan",  label="Checkpoint", info="Name of saved checkpoint to load weights from")
-                no_smooth = gr.Checkbox(label="No Smooth", info="Prevent smoothing face detections over a short temporal window")
-                resize_factor = gr.Slider(minimum=1, maximum=4, step=1, label="Resize Factor", info="Reduce the resolution by this factor. Sometimes, best results are obtained at 480p or 720p")
-                generate_btn = gr.Button("Generate")
-                interrupt_btn = gr.Button('Interrupt', elem_id=f"interrupt", visible=True)
-
+        gr.Markdown(
+            "<div align='center'> <h3><a href='https://github.com/numz/sd-wav2lip-uhq'> Follow installation instructions here </a> </h3> </div>")
         with gr.Row():
             with gr.Column():
-                pad_top = gr.Slider(minimum=0, maximum=50, step=1, value=0, label="Pad Top", info="Padding above lips")
-                pad_bottom = gr.Slider(minimum=0, maximum=50, step=1, value=0, label="Pad Bottom", info="Padding below lips")
-                pad_left = gr.Slider(minimum=0, maximum=50, step=1, value=0, label="Pad Left", info="Padding to the left of lips")
-                pad_right = gr.Slider(minimum=0, maximum=50, step=1, value=0, label="Pad Right", info="Padding to the right of lips")
+                with gr.Row():
+                    video = gr.File(label="Video or Image", info="Filepath of video/image that contains faces to use",
+                                    file_types=["mp4", "png", "jpg", "jpeg", "avi"])
+                    audio = gr.File(label="Audio", info="Filepath of video/audio file to use as raw audio source",
+                                    file_types=["mp3", "wav"])
+                with gr.Row():
+                    checkpoint = gr.Radio(["wav2lip", "wav2lip_gan"], value="wav2lip_gan", label="Checkpoint",
+                                          info="Name of saved checkpoint to load weights from")
+                    no_smooth = gr.Checkbox(label="No Smooth", info="Prevent smoothing face detections")
+                    only_mouth = gr.Checkbox(label="Only Mouth", info="Only track the mouth")
+                    active_debug = gr.Checkbox(label="Active Debug", info="Active Debug")
+                with gr.Row():
+                    with gr.Column():
+                        resize_factor = gr.Slider(minimum=1, maximum=4, step=1, label="Resize Factor",
+                                                  info="Reduce the resolution by this factor.")
+                        mouth_mask_dilatation = gr.Slider(minimum=0, maximum=64, step=1, value=15,
+                                                          label="Mouth Mask Dilate",
+                                                          info="Dilatation of the mask around the mouth (in pixels)")
+                        erode_face_mask = gr.Slider(minimum=0, maximum=64, step=1, value=15, label="Face Mask Erode",
+                                                    info="Erode the mask around the face (in pixels)")
+                        mask_blur = gr.Slider(minimum=0, maximum=64, step=1, value=15, label="Mask Blur",
+                                              info="Kernel size of Gaussian blur for masking")
+                    with gr.Column():
+                        pad_top = gr.Slider(minimum=0, maximum=50, step=1, value=0, label="Pad Top",
+                                            info="Padding above lips")
+                        pad_bottom = gr.Slider(minimum=0, maximum=50, step=1, value=0, label="Pad Bottom",
+                                               info="Padding below lips")
+                        pad_left = gr.Slider(minimum=0, maximum=50, step=1, value=0, label="Pad Left",
+                                             info="Padding to the left of lips")
+                        pad_right = gr.Slider(minimum=0, maximum=50, step=1, value=0, label="Pad Right",
+                                              info="Padding to the right of lips")
 
             with gr.Column():
                 with gr.Tabs(elem_id="wav2lip_generated"):
-                    result = gr.Video(label="Generated video", format="mp4").style(width=256)
-
+                    result = gr.Video(label="Generated video", format="mp4").style(width=512)
+                generate_btn = gr.Button("Generate")
+                interrupt_btn = gr.Button('Interrupt', elem_id=f"interrupt", visible=True)
 
         def on_interrupt():
             state.interrupt()
             return "Interrupted"
 
-        def generate(video, audio, checkpoint, no_smooth, resize_factor, pad_top, pad_bottom, pad_left, pad_right):
+        def generate(video, audio, checkpoint, no_smooth, only_mouth, resize_factor, mouth_mask_dilatation,
+                     erode_face_mask, mask_blur, pad_top, pad_bottom, pad_left, pad_right, active_debug):
             state.begin()
             if video is None or audio is None or checkpoint is None:
                 return
-            w2l = W2l(video.name, audio.name, checkpoint, no_smooth, resize_factor, pad_top, pad_bottom, pad_left, pad_right)
+            w2l = W2l(video.name, audio.name, checkpoint, no_smooth, resize_factor, pad_top, pad_bottom, pad_left,
+                      pad_right)
             w2l.execute()
 
-            w2luhq = Wav2LipUHQ(video.name, audio.name)
-            w2luhq.execute()
-            return str(Path("extensions/sd-wav2lip-uhq/scripts/wav2lip/output/output_video.mp4"))
+            w2luhq = Wav2LipUHQ(video.name, audio.name, mouth_mask_dilatation, erode_face_mask, mask_blur, only_mouth,
+                                resize_factor, active_debug)
+
+            return w2luhq.execute()
 
         generate_btn.click(
-            generate, 
-            [video, audio, checkpoint, no_smooth, resize_factor, pad_top, pad_bottom, pad_left, pad_right], 
+            generate,
+            [video, audio, checkpoint, no_smooth, only_mouth, resize_factor, mouth_mask_dilatation, erode_face_mask,
+             mask_blur, pad_top, pad_bottom, pad_left, pad_right, active_debug],
             result)
 
         interrupt_btn.click(on_interrupt)
-        
+
     return [(wav2lip_uhq_interface, "Wav2lip Uhq", "wav2lip_uhq_interface")]
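
Because Gradio passes the values of the `inputs` list to the handler positionally, the list given to `generate_btn.click` has to follow the same order as the parameters of `generate`. A minimal sketch of a consistency check, using only the standard library: the `generate` stub below copies the parameter order from the diff but has no body, and `order_mismatches` is a hypothetical helper, not part of the extension.

```python
import inspect
from itertools import zip_longest


def generate(video, audio, checkpoint, no_smooth, only_mouth, resize_factor, mouth_mask_dilatation,
             erode_face_mask, mask_blur, pad_top, pad_bottom, pad_left, pad_right, active_debug):
    """Stand-in with the same parameter order as the handler in the diff."""
    return None


# Component names in the order they appear in the click() inputs list of the diff.
CLICK_INPUTS = [
    "video", "audio", "checkpoint", "no_smooth", "only_mouth", "resize_factor",
    "mouth_mask_dilatation", "erode_face_mask", "mask_blur",
    "pad_top", "pad_bottom", "pad_left", "pad_right", "active_debug",
]


def order_mismatches(func, input_names):
    """List (position, parameter, input) triples where the two orders disagree."""
    params = list(inspect.signature(func).parameters)
    return [
        (index, param, name)
        for index, (param, name) in enumerate(zip_longest(params, input_names))
        if param != name
    ]


if __name__ == "__main__":
    print(order_mismatches(generate, CLICK_INPUTS))
```

A check like this only compares names, so it relies on the UI variables being named after the handler parameters, as they are in this file.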
diff --git a/scripts/wav2lip/audio.py b/scripts/wav2lip/audio.py
@@ -1,7 +1,6 @@
 import librosa
 import librosa.filters
 import numpy as np
-# import tensorflow as tf
 from scipy import signal
 from scipy.io import wavfile
 from scripts.wav2lip.hparams import hparams as hp
@@ -13,7 +12,6 @@ def load_wav(path, sr):
 
 def save_wav(wav, path, sr):
     wav *= 32767 / max(0.01, np.max(np.abs(wav)))
-    # proposed by @dsmiller
     wavfile.write(path, sr, wav.astype(np.int16))
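
The scaling line in `save_wav` normalizes the signal so that its peak maps to the 16-bit maximum, with a floor of 0.01 on the peak so that near-silent audio is not amplified without bound. A pure-Python sketch of the same arithmetic follows; `scale_to_int16` is a hypothetical name, plain lists replace NumPy arrays, and a non-empty input is assumed.

```python
def scale_to_int16(samples):
    """Scale float samples so the peak maps to 32767, as save_wav does.

    The 0.01 floor on the peak caps the gain for very quiet input.
    int() truncates toward zero, like the astype(np.int16) cast in the diff.
    """
    peak = max(0.01, max(abs(s) for s in samples))
    gain = 32767 / peak
    return [int(s * gain) for s in samples]


if __name__ == "__main__":
    print(scale_to_int16([0.5, -0.25, 0.1]))  # peak 0.5 is scaled to full range
    print(scale_to_int16([0.001, -0.002]))    # quiet input: gain capped by the floor
```

Unlike the original, this sketch returns a new list instead of modifying its input. The in-place `wav *= ...` in `save_wav` changes the caller's array.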