Vctrl_add_cases (#1030)

Co-authored-by: xuzhang <[email protected]>
PaddlePaddle · Feb 12, 2025 · aebbbac
1 parent c126fed
commit aebbbac

Showing 11 changed files with 192 additions and 227 deletions.
diff --git a/ppdiffusers/examples/ppvctrl/README.md b/ppdiffusers/examples/ppvctrl/README.md
@@ -21,94 +21,29 @@ These design features make PP-VCtrl suitable for a wide range of video generatio
 - [ ] PP-VCtrl v2 model weights 
 
 ## 📷 Quick Demos
+### Wonderful Demos Generated by PP-VCtrl-I2V 
+First, extract the video control sequences (edges, masks, and poses) from the source video. Then, use ControlNet to regenerate the first frame of the video. Input the video control sequences and the newly generated first frame into PP-VCtrl-I2V to generate the new video.
+
+### 1. PP-VCtrl-I2V-Canny
+| Input Video | Control Video | Reference Image | Output Video |
+|---------------------------|-----------------------------|-----------------------|--------------------------|
+<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/canny/canny_case1_pixel.gif" >|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/canny/canny_case1_guide.gif"> </img>|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/canny/canny_case1_sub1.jpg">|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/canny/canny_case1_sub1.gif" > </img>|
+<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/canny/canny_case2_pixel.gif" >|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/canny/canny_case2_guide.gif"> </img>|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/canny/canny_case2_sub1.jpg">|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/canny/canny_case2_sub1.gif" > </img>|
+
+
+
+### 2. PP-VCtrl-I2V-Mask
+| Input Video | Control Video | Reference Image | Output Video |
+|---------------------------|-----------------------------|---------------------------|---------------------------|
+<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/mask/mask_case1_pixel.gif" >|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/mask/mask_case1_guide.gif"> </img>|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/mask/mask_case1_sub1.jpg">|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/mask/mask_case1_sub1.gif" > </img>|
+<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/mask/mask_case2_pixel.gif" >|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/mask/mask_case2_guide.gif"> </img>|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/mask/mask_case2_sub2.jpg">|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/mask/mask_case2_sub2.gif" > </img>|
+
+### 3. PP-VCtrl-I2V-Pose
+| Input Video | Control Video | Reference Image | Output Video |
+|----------------------|-----------------------|----------------------|-----------------------|
+<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/pose/pose_case1_pixel.gif" >|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/pose/pose_case1_guide.gif"> </img>|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/pose/pose_case1_sub1.jpg">|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/pose/pose_case1_sub1.gif" > </img>|
+<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/pose/pose_case2_pixel.gif" >|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/pose/pose_case2_guide.gif"> </img>|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/pose/pose_case2_sub1.jpg">|<img src="https://raw.githubusercontent.com/Hammingbo/Hammingbo.github.io/refs/heads/main/static/gif/pose/pose_case2_sub1.gif" > </img>|
 
-### 1. PP-VCtrl with Canny Edge :
-
-<table class="center">
-    <thead>
-        <tr>
-            <th>Prompt</th> <!-- newly added column header, leftmost -->
-            <th>Reference Image</th>
-            <th>Control Videos</th>
-            <th>Ours (PP-VCtrl-5B-T2V)</th>
-            <th>Ours (PP-VCtrl-5B-I2V)</th>
-        </tr>
-    </thead>
-    <tbody>
-        <tr>
-            <td>Group of fishes swimming in aquarium.</td> <!-- newly added text description, leftmost -->
-            <td><img src="assets/figures/canny_case1_reference.jpg" alt="Reference " width="160"></td>
-            <td><img src="assets/figures/canny_case1_control_image.gif" alt="Control Videos" width="160"></td>
-            <td><img src="assets/figures/canny_case1_ours_t2v.gif" alt="Ours T2V" width="160"></td>
-            <td><img src="assets/figures/canny_case1_ours_i2v.gif" alt="Ours I2V" width="160"></td>
-        </tr>
-        <tr>
-            <td>A boat with a flag on it is sailing on the sea.</td> <!-- text description for the second row -->
-            <td><img src="assets/figures/canny_case2_reference.jpg" alt="Reference" width="160"></td>
-            <td><img src="assets/figures/canny_case2_control_image.gif" alt="Control Videos" width="160"></td>
-            <td><img src="assets/figures/canny_case2_ours_t2v.gif" alt="Ours T2v" width="160"></td>
-            <td><img src="assets/figures/canny_case2_ours_i2v.gif" alt="Ours I2v" width="160"></td>
-        </tr>
-        <!-- more rows can be added here -->
-    </tbody>
-</table>
-
-### 2. PP-VCtrl with Mask Map :
-<table class="center">
-    <thead>
-        <tr>
-            <th>Prompt</th> <!-- newly added column header, leftmost -->
-            <th>Reference Image</th>
-            <th>Control Videos</th>
-            <th>Ours (PP-VCtrl-5B-T2V)</th>
-            <th>Ours (PP-VCtrl-5B-I2V)</th>
-        </tr>
-    </thead>
-    <tbody>
-        <tr>
-            <td>A rider in a dark helmet and white breeches is atop a chestnut horse...</td> <!-- newly added text description, leftmost -->
-            <td><img src="assets/figures/mask_case1_reference.jpg" alt="Reference " width="160"></td>
-            <td><img src="assets/figures/mask_case1_control_image.gif" alt="Control Videos" width="160"></td>
-            <td><img src="assets/figures/mask_case1_ours_t2v.gif" alt="Ours T2V" width="160"></td>
-            <td><img src="assets/figures/mask_case1_ours_i2v.gif" alt="Ours I2V" width="160"></td>
-        </tr>
-        <tr>
-            <td>A dark gray Mini Cooper is parked on a city street...</td> <!-- text description for the second row -->
-            <td><img src="assets/figures/mask_case2_reference.jpg" alt="Reference" width="160"></td>
-            <td><img src="assets/figures/mask_case2_control_image.gif" alt="Control Videos" width="160"></td>
-            <td><img src="assets/figures/mask_case2_ours_t2v.gif" alt="Ours T2v" width="160"></td>
-            <td><img src="assets/figures/mask_case2_ours_i2v.gif" alt="Ours I2v" width="160"></td>
-        </tr>
-        <!-- more rows can be added here -->
-    </tbody>
-</table>
-
-### 3. PP-VCtrl with Human Pose Map：
-<table class="center">
-    <thead>
-        <tr>
-            <th>Prompt</th> <!-- newly added column header, leftmost -->
-            <th>Reference Image</th> <!-- newly added column header, leftmost -->
-            <th>Pose Videos</th>
-            <th>Ours (PP-VCtrl-5B-I2V)</th>
-        </tr>
-    </thead>
-    <tbody>
-        <tr>
-            <td>A young man with curly hair and a red t-shirt featuring a white logo is seen in various states of motion... </td>  
-            <td><img src="assets/figures/pose_case1_reference1.jpg" alt="Reference 1" width="160"></td> 
-           <td><img src="assets/figures/pose_case1_control_image.gif" alt="Pose Videos" width="160"></td>
-            <td><img src="assets/figures/pose_case1_ours_1.gif" alt="Ours 1" width="160"></td>
-        </tr>
-        <tr>
-            <td>A woman models an Adrianna Papell women's gown, featuring a sleeveless...</td> 
-            <td><img src="assets/figures/pose_case2_reference2.jpg" alt="Reference 1" width="160"></td> 
-            <td><img src="assets/figures/pose_case2_control_image.gif" alt="Pose Videos" width="160"></td>
-            <td><img src="assets/figures/pose_case2_ours_2.gif" alt="Ours 1" width="160"></td>
-        </tr>
-        <!-- more rows can be added here -->
-    </tbody>
-</table>
 
 ## 🚀 Quick Start
 ***Note:*** 
@@ -220,8 +155,8 @@ bash anchor/extract_canny.sh
 
 ```bash
 #download sam2
-mkdir -p anchor/checkpoint/mask
-wget -P anchor/checkpoint/mask https://bj.bcebos.com/v1/paddlenlp/models/community/Sam/Sam2/sam2.1_hiera_large.pdparams
+mkdir -p anchor/checkpoints/mask
+wget -P anchor/checkpoints/mask https://bj.bcebos.com/v1/paddlenlp/models/community/Sam/Sam2/sam2.1_hiera_large.pdparams
 #mask
 bash anchor/extract_mask.sh
 ```
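
The hunk above renames the checkpoint directory from `anchor/checkpoint/mask` to `anchor/checkpoints/mask`. For a checkout that already downloaded the weights into the old directory, a minimal sketch of moving them across and skipping a repeat download might look like the following. The URL and both directory names are taken from the diff; the function names `migrate_checkpoint_dir` and `fetch_sam2`, and the migration step itself, are illustrative assumptions and not part of the repository.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, not part of PaddleMIX.
# The URL and directory names are the ones shown in the diff.

SAM2_URL="https://bj.bcebos.com/v1/paddlenlp/models/community/Sam/Sam2/sam2.1_hiera_large.pdparams"

# Move files from the old directory name to the new one, if the old one exists.
migrate_checkpoint_dir() {
    local old_dir="$1" new_dir="$2" f
    mkdir -p "$new_dir"
    if [ -d "$old_dir" ]; then
        for f in "$old_dir"/*; do
            # Skip the unexpanded glob when the old directory is empty.
            [ -e "$f" ] || continue
            # Do not overwrite a file that is already in the new location.
            [ -e "$new_dir/$(basename "$f")" ] || mv "$f" "$new_dir/"
        done
    fi
}

# Download the SAM2 weights only when they are not already on disk.
fetch_sam2() {
    local dest_dir="$1"
    local target="$dest_dir/$(basename "$SAM2_URL")"
    if [ -f "$target" ]; then
        echo "checkpoint already present: $target"
    else
        wget -P "$dest_dir" "$SAM2_URL"
    fi
}
```

Under these assumptions, `migrate_checkpoint_dir anchor/checkpoint/mask anchor/checkpoints/mask` followed by `fetch_sam2 anchor/checkpoints/mask` would leave the weights in the directory the updated README uses, before running `bash anchor/extract_mask.sh`.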
@@ -268,22 +203,18 @@ The final inference results of the model can be found in the **/infer_outputs**
 ### 1. Generate with Canny Map
 ```bash
 ##i2v
-mkdir -p infer_outputs/canny/i2v
 bash scripts/infer_cogvideox_i2v_canny_vctrl.sh
 
 ##t2v
-mkdir -p infer_outputs/canny/t2v
 bash scripts/infer_cogvideox_t2v_canny_vctrl.sh
 ```
 
 ### 2. Generate with Mask Map
 ```bash
 ##i2v
-mkdir -p infer_outputs/mask/i2v
 bash scripts/infer_cogvideox_i2v_mask_vctrl.sh
 
 ##t2v
-mkdir -p infer_outputs/mask/t2v
 bash scripts/infer_cogvideox_t2v_mask_vctrl.sh
 ```
 **Note**: The edge and mask control models work with both the t2v (text-to-video) and i2v (image-to-video) models.
@@ -292,7 +223,6 @@ bash scripts/infer_cogvideox_t2v_mask_vctrl.sh
 
 ```bash
 ##i2v
-mkdir -p infer_outputs/pose/i2v
 bash scripts/infer_cogvideox_i2v_pose_vctrl.sh
 ```
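
The hunks above drop the explicit `mkdir -p infer_outputs/...` calls and leave one inference script per control type and mode. As a rough sketch, the script paths named in this README could be enumerated and run in sequence. The script paths follow the naming shown in the diff; `list_infer_scripts`, `run_all_infer`, and the `DRY_RUN` variable are hypothetical helpers introduced for illustration, and whether the scripts create `infer_outputs` themselves is not stated in the diff.

```bash
#!/usr/bin/env bash
# Sketch only: helper names and DRY_RUN are hypothetical.

# Print the inference scripts named in the README: canny and mask
# have i2v and t2v variants, pose has an i2v variant only.
list_infer_scripts() {
    local control mode
    for control in canny mask pose; do
        for mode in i2v t2v; do
            # The README lists no t2v script for pose control.
            [ "$control" = "pose" ] && [ "$mode" = "t2v" ] && continue
            echo "scripts/infer_cogvideox_${mode}_${control}_vctrl.sh"
        done
    done
}

# Run every listed script in order, stopping at the first failure.
# With DRY_RUN=1 the commands are printed instead of executed.
run_all_infer() {
    local script
    for script in $(list_infer_scripts); do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "would run: bash $script"
        else
            bash "$script" || return 1
        fi
    done
}
```

Calling `DRY_RUN=1 run_all_infer` from the `ppvctrl` example directory prints the five commands without launching any inference, which is a cheap way to check the list before a long run.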
 
@@ -347,4 +277,28 @@ These strategies are integrated into the unified video generation control framew
 In the quantitative evaluation of the edge-controlled (Canny), human-pose-controlled (Pose), and mask-controlled (Mask) video generation tasks, the PP-VCtrl model matches or surpasses existing open-source task-specific methods in both control ability and video quality metrics.
 <img src="assets/models/eval1.png" style="width:100%">
 
-We conducted manual evaluation experiments, inviting multiple evaluators to score videos generated by different methods. The
+We also ran a human evaluation in which multiple evaluators scored videos generated by different methods. The evaluation dimensions included overall video quality and temporal consistency, among others. PP-VCtrl outperformed existing open-source methods in all evaluation dimensions.
+<img src="assets/models/eval2.png" style="width:100%">
+
+<!-- 
+## More version
+<details close>
+<summary>Model Versions</summary>
+</details>
+-->
+<!-- 
+## Contact us
+Users: [[email protected]]([email protected])  
+-->
+<!-- 
+ ## BibTex
+
+```
+@article{guo2023animatediff,
+  title={AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning},
+  author={Guo, Yuwei and Yang, Ceyuan and Rao, Anyi and Liang, Zhengyang and Wang, Yaohui and Qiao, Yu and Agrawala, Maneesh and Lin, Dahua and Dai, Bo},
+  journal={International Conference on Learning Representations},
+  year={2024}
+}
+
+```
+-->