20230821

PlayVoice · Aug 21, 2023 · 2aa62f3
1 parent fffb5c7
commit 2aa62f3

Showing 61 changed files with 6,777 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -0,0 +1,198 @@
+<div align="center">
+<h1> Grad-SVC based on Grad-TTS from HUAWEI Noah's Ark Lab </h1>
+
+This project is named Grad-SVC, or GVC for short. Its core technology is diffusion, but it differs considerably from other diffusion-based SVC models. The code is adapted from Grad-TTS and so-vits-svc-5.0, so the features from so-vits-svc-5.0 will be used in this project.
+
+The project will be completed in the coming months.
+</div>
+
+## Setup Environment
+1. Install project dependencies
+
+    ```shell
+    pip install -r requirements.txt
+    ```
+
+2. Download the Timbre Encoder: [Speaker-Encoder by @mueller91](https://drive.google.com/drive/folders/15oeBYf6Qn1edONkVLXe82MzdIi3O_9m3), and put `best_model.pth.tar` into `speaker_pretrain/`.
+
+3. Download the [hubert_soft model](https://github.com/bshall/hubert/releases/tag/v0.1), and put `hubert-soft-0d54a1f4.pt` into `hubert_pretrain/`.
+
+4. Download the pretrained [nsf_bigvgan_pretrain_32K.pth](https://github.com/PlayVoice/NSF-BigVGAN/releases/augment), and put it into `bigvgan_pretrain/`.
+
+5. Download the pretrained model [gvc.pretrain.pth](), put it into `grad_pretrain/`, and run a test inference.
+    ```shell
+    python gvc_inference.py --config configs/base.yaml --model ./grad_pretrain/gvc.pretrain.pth --spk ./configs/singers/singer0001.npy --wave test.wav
+    ```
+
+## Dataset preparation
+Put the dataset into the `data_raw` directory following the structure below.
+```
+data_raw
+├───speaker0
+│   ├───000001.wav
+│   ├───...
+│   └───000xxx.wav
+└───speaker1
+    ├───000001.wav
+    ├───...
+    └───000xxx.wav
+```
+
+## Data preprocessing
+After preprocessing, you will get output with the following structure.
+```
+data_gvc/
+├── waves-16k
+│    ├── speaker0
+│    │      ├── 000001.wav
+│    │      └── 000xxx.wav
+│    └── speaker1
+│           ├── 000001.wav
+│           └── 000xxx.wav
+├── waves-32k
+│    ├── speaker0
+│    │      ├── 000001.wav
+│    │      └── 000xxx.wav
+│    └── speaker1
+│           ├── 000001.wav
+│           └── 000xxx.wav
+├── mel
+│    ├── speaker0
+│    │      ├── 000001.mel.pt
+│    │      └── 000xxx.mel.pt
+│    └── speaker1
+│           ├── 000001.mel.pt
+│           └── 000xxx.mel.pt
+├── pitch
+│    ├── speaker0
+│    │      ├── 000001.pit.npy
+│    │      └── 000xxx.pit.npy
+│    └── speaker1
+│           ├── 000001.pit.npy
+│           └── 000xxx.pit.npy
+├── hubert
+│    ├── speaker0
+│    │      ├── 000001.vec.npy
+│    │      └── 000xxx.vec.npy
+│    └── speaker1
+│           ├── 000001.vec.npy
+│           └── 000xxx.vec.npy
+├── speaker
+│    ├── speaker0
+│    │      ├── 000001.spk.npy
+│    │      └── 000xxx.spk.npy
+│    └── speaker1
+│           ├── 000001.spk.npy
+│           └── 000xxx.spk.npy
+└── singer
+    ├── speaker0.spk.npy
+    └── speaker1.spk.npy
+```
+
+1. Re-sampling
+    - Generate audio with a sampling rate of 16000 Hz in `./data_gvc/waves-16k`
+    ```
+    python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-16k -s 16000
+    ```
+    
+    - Generate audio with a sampling rate of 32000 Hz in `./data_gvc/waves-32k`
+    ```
+    python prepare/preprocess_a.py -w ./data_raw -o ./data_gvc/waves-32k -s 32000
+    ```
+2. Use 16k audio to extract pitch
+    ```
+    python prepare/preprocess_f0.py -w data_gvc/waves-16k/ -p data_gvc/pitch
+    ```
+3. Use 32k audio to extract mel
+    ```
+    python prepare/preprocess_spec.py -w data_gvc/waves-32k/ -s data_gvc/mel
+    ``` 
+4. Use 16k audio to extract hubert
+    ```
+    python prepare/preprocess_hubert.py -w data_gvc/waves-16k/ -v data_gvc/hubert
+    ```
+5. Use 16k audio to extract timbre code
+    ```
+    python prepare/preprocess_speaker.py data_gvc/waves-16k/ data_gvc/speaker
+    ```
+6. Extract the average value of the timbre code for inference
+    ```
+    python prepare/preprocess_speaker_ave.py data_gvc/speaker/ data_gvc/singer
+    ``` 
+7. Use 32k audio to generate the training index
+    ```
+    python prepare/preprocess_train.py
+    ```
+8. Debug the training files
+    ```
+    python prepare/preprocess_zzz.py
+    ```
+
+## Train
+1. Start training
+   ```
+   python gvc_trainer.py
+   ``` 
+2. Resume training
+   ```
+   python gvc_trainer.py -p logs/grad_svc/grad_svc_***.pth
+   ```
+3. Log visualization
+   ```
+   tensorboard --logdir logs/
+   ```
+
+## Loss
+![grad_svc_loss](./assets/grad_svc_loss.jpg)
+
+![grad_svc_mel](./assets/grad_svc_mel.jpg)
+
+## Inference
+
+1. Export inference model
+   ```
+   python gvc_export.py --checkpoint_path logs/grad_svc/grad_svc_***.pt
+   ```
+
+2. Inference
+    - If there is no need to adjust `f0`, just run the following command.
+        ```
+        python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --shift 0
+        ```
+    - If `f0` needs to be adjusted manually, follow these steps:
+
+        1. Use hubert to extract the content vector
+            ```
+            python hubert/inference.py -w test.wav -v test.vec.npy
+            ```
+        2. Extract the F0 parameter to CSV text format
+            ```
+            python pitch/inference.py -w test.wav -p test.csv
+            ```
+        3. Run the final inference
+            ```
+            python gvc_inference.py --model gvc.pth --spk ./data_gvc/singer/your_singer.spk.npy --wave test.wav --vec test.vec.npy --pit test.csv --shift 0
+            ```
+
+3. Convert mel to wave
+    ```
+    python gvc_inference_wave.py --mel gvc_out.mel.pt --pit gvc_tmp.pit.csv
+    ```
+
+## Code sources and references
+
+https://github.com/huawei-noah/Speech-Backbones/blob/main/Grad-TTS
+
+https://github.com/facebookresearch/speech-resynthesis [paper](https://arxiv.org/abs/2104.00355)
+
+https://github.com/jaywalnut310/vits [paper](https://arxiv.org/abs/2106.06103)
+
+https://github.com/NVIDIA/BigVGAN [paper](https://arxiv.org/abs/2206.04658)
+
+https://github.com/mindslab-ai/univnet [paper](https://arxiv.org/abs/2106.07889)
+
+https://github.com/mozilla/TTS
+
+https://github.com/bshall/soft-vc
+
+https://github.com/maxrmorrison/torchcrepe
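
Note on the README.md hunk above: step 6 of "Data preprocessing" averages the per-utterance timbre codes in `data_gvc/speaker/` into one vector per speaker in `data_gvc/singer/`. The commit does not show the body of `prepare/preprocess_speaker_ave.py` in this section, so the following is only a minimal sketch of that idea, assuming each `*.spk.npy` file holds a fixed-size NumPy vector. The function name `average_speaker_embeddings` is hypothetical and is not taken from the repository.

```python
from pathlib import Path

import numpy as np


def average_speaker_embeddings(speaker_dir, singer_dir):
    """Average all *.spk.npy vectors of each speaker into one vector.

    Hypothetical helper: it assumes the layout shown in the README
    (speaker_dir/<speaker>/<utterance>.spk.npy) and writes
    singer_dir/<speaker>.spk.npy.
    """
    speaker_dir, singer_dir = Path(speaker_dir), Path(singer_dir)
    singer_dir.mkdir(parents=True, exist_ok=True)

    averages = {}
    for spk in sorted(p for p in speaker_dir.iterdir() if p.is_dir()):
        files = sorted(spk.glob("*.spk.npy"))
        if not files:
            continue  # skip speakers without any extracted timbre code
        # Stack to shape [num_utterances, dim], then take the mean over utterances.
        mean = np.stack([np.load(f) for f in files]).mean(axis=0)
        np.save(singer_dir / f"{spk.name}.spk.npy", mean)
        averages[spk.name] = mean
    return averages
```

Under these assumptions, calling `average_speaker_embeddings("data_gvc/speaker", "data_gvc/singer")` would produce files of the kind passed to `--spk` in the inference commands.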
diff --git a/assets/grad_svc_loss.jpg b/assets/grad_svc_loss.jpg
diff --git a/assets/grad_svc_mel.jpg b/assets/grad_svc_mel.jpg
diff --git a/bigvgan/LICENSE b/bigvgan/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2022 PlayVoice
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/bigvgan/README.md b/bigvgan/README.md
@@ -0,0 +1,138 @@
+<div align="center">
+<h1> Neural Source-Filter BigVGAN </h1>
+    Just For Fun
+</div>
+
+![nsf_bigvgan_mel](https://github.com/PlayVoice/NSF-BigVGAN/assets/16432329/eebb8dca-a8d3-4e69-b02c-632a3a1cdd6a)
+
+## Dataset preparation
+
+Put the dataset into the `data_raw` directory according to the following file structure.
+```shell
+data_raw
+├───speaker0
+│   ├───000001.wav
+│   ├───...
+│   └───000xxx.wav
+└───speaker1
+    ├───000001.wav
+    ├───...
+    └───000xxx.wav
+```
+
+## Install dependencies
+
+- 1, install software dependencies
+
+  > pip install -r requirements.txt
+
+- 2, download the [release](https://github.com/PlayVoice/NSF-BigVGAN/releases/tag/debug) model and test it
+
+  > python nsf_bigvgan_inference.py --config configs/nsf_bigvgan.yaml --model nsf_bigvgan_g.pth --wave test.wav
+
+## Data preprocessing
+
+- 1, re-sampling: 32kHz
+
+    > python prepare/preprocess_a.py -w ./data_raw -o ./data_bigvgan/waves-32k
+
+- 2, extract pitch
+
+    > python prepare/preprocess_f0.py -w data_bigvgan/waves-32k/ -p data_bigvgan/pitch
+
+- 3, extract mel: [100, length]
+
+    > python prepare/preprocess_spec.py -w data_bigvgan/waves-32k/ -s data_bigvgan/mel
+
+- 4, generate training index
+
+    > python prepare/preprocess_train.py
+
+```shell
+data_bigvgan/
+│
+├── waves-32k
+│    ├── speaker0
+│    │      ├── 000001.wav
+│    │      └── 000xxx.wav
+│    └── speaker1
+│           ├── 000001.wav
+│           └── 000xxx.wav
+├── pitch
+│    ├── speaker0
+│    │      ├── 000001.pit.npy
+│    │      └── 000xxx.pit.npy
+│    └── speaker1
+│           ├── 000001.pit.npy
+│           └── 000xxx.pit.npy
+└── mel
+     ├── speaker0
+     │      ├── 000001.mel.pt
+     │      └── 000xxx.mel.pt
+     └── speaker1
+            ├── 000001.mel.pt
+            └── 000xxx.mel.pt
+
+```
+
+## Train
+
+- 1, start training
+
+    > python nsf_bigvgan_trainer.py -c configs/nsf_bigvgan.yaml -n nsf_bigvgan
+
+- 2, resume training
+
+    > python nsf_bigvgan_trainer.py -c configs/nsf_bigvgan.yaml -n nsf_bigvgan -p chkpt/nsf_bigvgan/***.pth
+
+- 3, view log
+
+    > tensorboard --logdir logs/
+
+
+## Inference
+
+- 1, export inference model
+
+    > python nsf_bigvgan_export.py --config configs/maxgan.yaml --checkpoint_path chkpt/nsf_bigvgan/***.pt
+
+- 2, extract mel
+
+    > python spec/inference.py -w test.wav -m test.mel.pt
+
+- 3, extract F0
+
+    > python pitch/inference.py -w test.wav -p test.csv
+
+- 4, infer
+
+    > python nsf_bigvgan_inference.py --config configs/nsf_bigvgan.yaml --model nsf_bigvgan_g.pth --wave test.wav
+
+    or
+
+    > python nsf_bigvgan_inference.py --config configs/nsf_bigvgan.yaml --model nsf_bigvgan_g.pth --mel test.mel.pt --pit test.csv
+
+## Augmentation of mel
+Because the acoustic model produces over-smoothed output, we apply Gaussian blur to the mel spectrogram when training the vocoder.
+```
+# gaussian blur
+model_b = get_gaussian_kernel(kernel_size=5, sigma=2, channels=1).to(device)
+# mel blur
+mel_b = mel[:, None, :, :]
+mel_b = model_b(mel_b)
+mel_b = torch.squeeze(mel_b, 1)
+mel_r = torch.rand(1).to(device) * 0.5
+mel_b = (1 - mel_r) * mel_b + mel_r * mel
+# generator
+optim_g.zero_grad()
+fake_audio = model_g(mel_b, pit)
+```
+![mel_gaussian_blur](https://github.com/PlayVoice/NSF-BigVGAN/assets/16432329/7fa96ef7-5e3b-4ae6-bc61-9b6da3b9d0b9)
+
+## Source of code and References
+
+https://github.com/nii-yamagishilab/project-NN-Pytorch-scripts/tree/master/project/01-nsf
+
+https://github.com/mindslab-ai/univnet [[paper]](https://arxiv.org/abs/2106.07889)
+
+https://github.com/NVIDIA/BigVGAN [[paper]](https://arxiv.org/abs/2206.04658)
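
Note on the bigvgan/README.md hunk above: the "Augmentation of mel" snippet calls `get_gaussian_kernel` without showing its body. The sketch below is one plausible way to build such a fixed blur layer in PyTorch and to mix the blurred mel with the original as the snippet does. It is an assumption, not the repository's implementation: the zero padding at the edges, the frozen weights, and the helper name `blur_mel` are all illustrative choices.

```python
import torch
import torch.nn as nn


def get_gaussian_kernel(kernel_size=5, sigma=2.0, channels=1):
    """Return a Conv2d layer with fixed, normalized 2D Gaussian weights.

    Assumed implementation: the README only shows the call signature.
    """
    # 1D Gaussian centered on the middle tap, e.g. [-2, -1, 0, 1, 2] for size 5.
    coords = torch.arange(kernel_size, dtype=torch.float32) - (kernel_size - 1) / 2
    g1d = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    # The outer product gives the separable 2D kernel; normalize so it sums to 1.
    g2d = torch.outer(g1d, g1d)
    g2d = g2d / g2d.sum()

    conv = nn.Conv2d(channels, channels, kernel_size,
                     padding=kernel_size // 2, groups=channels, bias=False)
    conv.weight.data = g2d.expand(channels, 1, kernel_size, kernel_size).clone()
    conv.weight.requires_grad = False  # the blur is an augmentation, not a learned layer
    return conv


def blur_mel(mel, blur, mix=None):
    """Blend a blurred mel [batch, n_mels, frames] with the original.

    Hypothetical helper mirroring the README snippet: `mix` is the weight of
    the original mel, drawn uniformly from [0, 0.5) when not given.
    """
    mel_b = blur(mel[:, None, :, :]).squeeze(1)
    if mix is None:
        mix = torch.rand(1, device=mel.device) * 0.5
    return (1 - mix) * mel_b + mix * mel
```

With an odd `kernel_size`, the output keeps the input shape, so the blended mel can be passed to the generator in place of the original, as in `model_g(mel_b, pit)`.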