Flexible Motion In-betweening with Diffusion Models

arXiv

This repository adapts the official PyTorch implementation of the paper "Flexible Motion In-betweening with Diffusion Models" (SIGGRAPH 2024) to take custom skeletons.

For more details, visit our project page.

teaser

News

📢 21/June/24 - First release.

Bibtex

If you find this code useful in your research, please cite:

@article{cohan2024flexible,
  title={Flexible Motion In-betweening with Diffusion Models},
  author={Cohan, Setareh and Tevet, Guy and Reda, Daniele and Peng, Xue Bin and van de Panne, Michiel},
  journal={arXiv preprint arXiv:2405.11126},
  year={2024}
}

Getting started

This code was developed on Ubuntu 20.04 LTS with Python 3.7, CUDA 11.7 and PyTorch 1.13.1. The current requirements.txt was set up with Python 3.9, CUDA 11.3, PyTorch 1.12.1.

1. Setup environment

Install ffmpeg (if not already installed):

sudo apt update
sudo apt install ffmpeg

For Windows, use this instead.

2. Install dependencies

This codebase shares a large part of its base dependencies with GMD. We recommend installing our dependencies from scratch to avoid version differences.

Setup virtual env:

python3 -m venv .env_condmdi      # pick your preferred name here
source .env_condmdi/bin/activate  # and use that name in place of .env_condmdi
pip install torch==1.12.1+cu113 torchvision==0.13.1+cu113 torchaudio==0.12.1 --extra-index-url https://download.pytorch.org/whl/cu113
pip install -r requirements.txt   # updated to include spacy and clip configuration
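
As a quick sanity check (this snippet is illustrative, not part of the repository), you can confirm that the installed PyTorch build matches the CUDA version pinned above:

```python
# Quick sanity check that the pinned PyTorch/CUDA combination is active.
import torch

print(torch.__version__)          # expected: 1.12.1+cu113
print(torch.version.cuda)         # expected: 11.3
print(torch.cuda.is_available())  # should be True on a CUDA-capable machine
```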

Download dependencies:

Text to Motion:
bash prepare/download_smpl_files.sh
bash prepare/download_glove.sh
bash prepare/download_t2m_evaluators.sh

Unconstrained:
bash prepare/download_smpl_files.sh
bash prepare/download_recognition_unconstrained_models.sh

3. Get data

There are two paths to get the data:

(a) **Generation only** with the pretrained text-to-motion model, without training or evaluating.

(b) **Full data** to train and evaluate the model.

#### a. Generation only (text only)

**HumanML3D** - Clone HumanML3D, then copy the data dir to our repository:

```shell
cd ..
git clone https://github.com/EricGuo5513/HumanML3D.git
unzip ./HumanML3D/HumanML3D/texts.zip -d ./HumanML3D/HumanML3D/
cp -r HumanML3D/HumanML3D CondMDI/dataset/HumanML3D
cd CondMDI
cp -a dataset/HumanML3D_abs/. dataset/HumanML3D/
```

#### b. Full data (text + motion capture)

**HumanML3D** - Follow the instructions in [HumanML3D](https://github.com/EricGuo5513/HumanML3D.git),
then copy the result dataset to our repository:

**[Important!]**
Following GMD, the representation of the root joint has been changed from relative to absolute. Therefore, when setting up HumanML3D, please
run GMD's version of `motion_representation.ipynb` and `cal_mean_variance.ipynb` instead to get the absolute-root data. These files are made
available in `./dataset/HumanML3D_abs/`.

```shell
cp -r ../HumanML3D/HumanML3D ./dataset/HumanML3D
```
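
After copying, you can sanity-check the dataset directory. The file and folder names below reflect the standard HumanML3D layout and are my assumption of what this codebase expects; adjust if your copy differs:

```python
# Verify the copied HumanML3D directory contains the expected pieces
# (standard HumanML3D layout; assumed here, not taken from this repo's code).
import os

root = "dataset/HumanML3D"
for name in ["Mean.npy", "Std.npy", "train.txt", "test.txt", "texts", "new_joint_vecs"]:
    path = os.path.join(root, name)
    print(f"{path}: {'found' if os.path.exists(path) else 'MISSING'}")
```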

4. Download the pretrained models

Download the model(s) you wish to use, then unzip and place them in ./save/.

Our models are all trained on the HumanML3D dataset.

Conditionally trained on randomly sampled frames and joints (CondMDI)

Conditionally trained on randomly sampled frames

Unconditionally (no keyframes) trained

Motion Synthesis

Text to Motion - Without spatial conditioning

This part is standard text-to-motion generation.

Generate from test set prompts

using the unconditioned model

python -m sample.synthesize --model_path ./save/condmdi_uncond/model000500000.pt --num_samples 10 --num_repetitions 3

using the conditional model

python -m sample.conditional_synthesis --model_path ./save/condmdi_randomframes/model000750000.pt --edit_mode uncond --num_samples 10 --num_repetitions 3
  • You can use --no_text to sample from the conditional model without text conditioning.

Generate from a single prompt

using the unconditioned model

python -m sample.synthesize --model_path ./save/condmdi_uncond/model000500000.pt --num_samples 10 --num_repetitions 1 --text_prompt "a person is exercising and jumping"

using the conditional model

python -m sample.conditional_synthesis --model_path ./save/condmdi_randomframes/model000750000.pt --edit_mode uncond --num_samples 10 --num_repetitions 3 --text_prompt "a person is exercising and jumping"

example

Text to Motion - With keyframe conditioning

Generate from a single prompt - condition on keyframe locations

using the unconditioned model

python -m sample.edit --model_path ./save/condmdi_uncond/model000500000.pt --edit_mode benchmark_sparse --transition_length 5 --num_samples 10 --num_repetitions 3 --imputate --stop_imputation_at 1 --reconstruction_guidance --reconstruction_weight 20 --text_condition "a person throws a ball"
  • You can remove --text_condition to generate samples conditioned only on keyframes (not text).

using the conditional model

python -m sample.conditional_synthesis --model_path ./save/condmdi_randomframes/model000750000.pt --edit_mode benchmark_sparse --transition_length 5 --num_samples 10 --num_repetitions 3 --text_prompt "a person throws a ball"

Generate from test set prompts - condition on keyframe locations

using the conditional model

python -m sample.conditional_synthesis --model_path ./save/condmdi_randomframes/model000750000.pt --edit_mode benchmark_sparse --transition_length 5 --num_samples 10 --num_repetitions 3
  • You can use --no_text to sample from the conditional model without text conditioning.

(In development) Using the --interactive flag will start an interactive window that allows you to choose the keyframes yourself. The interactive pattern will override the predefined pattern. example

Useful flags for spatial conditioning:

  • --edit_mode to indicate the type of spatial condition.
  • --imputate to use imputation/inpainting for inference-time conditioning.
    • --stop_imputation_at to indicate the diffusion step at which to stop replacement. Default is 0.
  • --reconstruction_guidance to use reconstruction guidance for inference-time conditioning (a conceptual sketch follows this list).
    • --reconstruction_weight to indicate the reconstruction guidance weight ($w_r$ in Algorithm 3).
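
For intuition, here is a minimal, conceptual sketch of what reconstruction guidance does at inference time. It is not the repository's implementation, and the exact scaling in Algorithm 3 of the paper may differ; `denoiser`, `obs`, and `mask` are hypothetical stand-ins for the model's clean-sample predictor, the keyframe observations, and the observation mask.

```python
import torch

def guided_clean_estimate(denoiser, x_t, t, obs, mask, w_r):
    """Conceptual sketch of reconstruction guidance (w_r as in Algorithm 3):
    nudge the model's clean-motion estimate toward the observed keyframes
    wherever mask == 1, following the gradient of the reconstruction error."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                    # model's estimate of the clean motion
    loss = ((mask * (x0_hat - obs)) ** 2).sum()  # error on the conditioned frames/joints
    (grad,) = torch.autograd.grad(loss, x_t)
    return x0_hat.detach() - 0.5 * w_r * grad    # guided estimate used by the sampler
```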

You may also define:

  • --device id.
  • --seed to sample different prompts.
  • --motion_length (text-to-motion only) in seconds (maximum is 9.8 seconds).
  • --progress to save the denoising progress.

Running those will get you:

  • results.npy file with text prompts and xyz positions of the generated animation
  • sample##_rep##.mp4 - a stick figure animation for each generated motion. You can stop here, or render the SMPL mesh using the following script.
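
If you want to inspect the output programmatically, results.npy can be read back with NumPy. The key names shown below follow the MDM-style convention these repositories inherit; treat them as an assumption and print data.keys() to confirm what this fork actually saves.

```python
import numpy as np

# results.npy is saved as a pickled dict; key names are assumed from the
# MDM-style convention -- print data.keys() to confirm.
data = np.load("results.npy", allow_pickle=True).item()  # path to your generated results.npy
print(data.keys())
print(data["text"][0])       # the prompt for the first sample
print(data["motion"].shape)  # xyz joint positions, e.g. (num_samples*reps, joints, 3, frames)
```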

Render SMPL mesh

To create an SMPL mesh per frame, run:

python -m visualize.render_mesh --input_path /path/to/mp4/stick/figure/file

This script outputs:

  • sample##_rep##_smpl_params.npy - SMPL parameters (thetas, root translations, vertices and faces)
  • sample##_rep##_obj - Mesh per frame in .obj format.

Notes:

  • The .obj can be integrated into Blender/Maya/3DS-MAX and rendered using them.
  • This script runs SMPLify and also needs a GPU (which can be specified with the --device flag).
  • Important - Do not change the original .mp4 path before running the script.

Notes for 3d makers:

  • You have two ways to animate the sequence:
    1. Use the SMPL add-on and the theta parameters saved to sample##_rep##_smpl_params.npy (we always use beta=0 and the gender-neutral model).
    2. A more straightforward way is to use the mesh data itself. All meshes have the same topology (SMPL), so you just need to keyframe the vertex locations. Since the OBJs do not preserve vertex order, we also save this data to the sample##_rep##_smpl_params.npy file for your convenience.
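
If you prefer scripting over a DCC add-on, the saved parameter file can be inspected in the same way. The filename is an example instance of the sample##_rep##_smpl_params.npy pattern, and the dict-style loading is an assumption based on the field list above; print the keys to confirm.

```python
import numpy as np

# Inspect the per-sample SMPL parameter dump produced by visualize.render_mesh.
# Loading as a pickled dict is assumed from the README's description -- confirm with params.keys().
params = np.load("sample00_rep00_smpl_params.npy", allow_pickle=True).item()
print(params.keys())
for key, value in params.items():
    shape = getattr(value, "shape", None)
    print(key, shape if shape is not None else type(value))
```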

Training

Our model is trained on the HumanML3D dataset.

Conditional Model

python -m train.train_condmdi --keyframe_conditioned
  • You can remove --keyframe_conditioned to train an unconditioned model.
  • Use --device to define GPU id.

Evaluate

All evaluations are done on the HumanML3D dataset.

Text to Motion - With keyframe conditioning

  • Takes about 20 hours (on a single GPU)
  • The output of this script for the pre-trained models (as was reported in the paper) is provided in the checkpoints zip file.
  • For each prompt, 5 keyframes are sampled from the ground truth motion. The ground locations of the root joint in those frames are used as conditions.

on the unconditioned model

python -m eval.eval_humanml_condmdi --model_path ./save/condmdi_uncond/model000500000.pt --edit_mode gmd_keyframes --imputate --stop_imputation_at 1
  • The above command evaluates inference-time imputation for keyframe conditioning.

on the conditional model

python -m eval.eval_humanml_condmdi --model_path ./save/condmdi_randomframes/model000750000.pt --edit_mode gmd_keyframes --keyframe_guidance_param 1

Acknowledgments

We would like to thank the following contributors for the great foundation that we build upon: GMD, MDM, guided-diffusion, MotionCLIP, text-to-motion, actor, joints2smpl, MoDi.

License

This code is distributed under an MIT LICENSE.

Note that our code depends on other libraries, including CLIP, SMPL, SMPL-X, PyTorch3D, and uses datasets that each have their own respective licenses that must also be followed.
