SVD did not converge during the evaluation on 3DPW dataset #4

Open
guxm2021 opened this issue Oct 12, 2021 · 3 comments

Comments

@guxm2021

Thanks for sharing your code. Yesterday I ran the evaluation on the 3DPW dataset following the instructions in doc/EXP.md. I only have one GPU available, so my command is as follows:
CUDA_VISIBLE_DEVICES=2 python -m torch.distributed.launch --nproc_per_node=1 \
    src/tools/run_gphmer_bodymesh.py \
    --val_yaml 3dpw/test_has_gender.yaml \
    --arch hrnet-w64 \
    --num_workers 4 \
    --per_gpu_eval_batch_size 25 \
    --num_hidden_layers 4 \
    --num_attention_heads 4 \
    --input_feat_dim 2051,512,128 \
    --hidden_feat_dim 1024,256,64 \
    --run_eval_only \
    --resume_checkpoint ./models/graphormer_release/graphormer_3dpw_state_dict.bin
However, the run fails with the error "SVD did not converge". I tracked which batch triggers the error and found that it is not fixed (sometimes 389 as below, sometimes 200+ or 1100+). Could you tell me how to fix it? I also ran your MeshTransformer code (CVPR 2021) and did not encounter this error there.

[current batch / total batch: 382/1421]
[current batch / total batch: 383/1421]
[current batch / total batch: 384/1421]
[current batch / total batch: 385/1421]
[current batch / total batch: 386/1421]
[current batch / total batch: 387/1421]
[current batch / total batch: 388/1421]
Traceback (most recent call last):
  File "src/tools/run_gphmer_bodymesh.py", line 753, in <module>
    main(args)
  File "src/tools/run_gphmer_bodymesh.py", line 740, in main
    run_eval_general(args, val_dataloader, _model, smpl, mesh_sampler)
  File "src/tools/run_gphmer_bodymesh.py", line 374, in run_eval_general
    mesh_sampler)
  File "src/tools/run_gphmer_bodymesh.py", line 438, in run_validate
    error_joints_pa = reconstruction_error(pred_3d_joints_from_smpl.cpu().numpy(), gt_3d_joints[:,:,:3].cpu().numpy(), reduction=None)
  File "/home/guxm/lab_rotation/shape/MeshGraphormer/src/utils/metric_pampjpe.py", line 70, in reconstruction_error
    S1_hat = compute_similarity_transform_batch(S1, S2)
  File "/home/guxm/lab_rotation/shape/MeshGraphormer/src/utils/metric_pampjpe.py", line 65, in compute_similarity_transform_batch
    S1_hat[i] = compute_similarity_transform(S1[i], S2[i])
  File "/home/guxm/lab_rotation/shape/MeshGraphormer/src/utils/metric_pampjpe.py", line 39, in compute_similarity_transform
    U, s, Vh = np.linalg.svd(K)
  File "<array_function internals>", line 6, in svd
  File "/home/guxm/anaconda/anaconda3/envs/gphmr/lib/python3.7/site-packages/numpy/linalg/linalg.py", line 1660, in svd
    u, s, vh = gufunc(a, signature=signature, extobj=extobj)
  File "/home/guxm/anaconda/anaconda3/envs/gphmr/lib/python3.7/site-packages/numpy/linalg/linalg.py", line 97, in _raise_linalgerror_svd_nonconvergence
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge
Traceback (most recent call last):
  File "/home/guxm/anaconda/anaconda3/envs/gphmr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/guxm/anaconda/anaconda3/envs/gphmr/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/guxm/anaconda/anaconda3/envs/gphmr/lib/python3.7/site-packages/torch/distributed/launch.py", line 263, in <module>
    main()
  File "/home/guxm/anaconda/anaconda3/envs/gphmr/lib/python3.7/site-packages/torch/distributed/launch.py", line 259, in main
    cmd=cmd)
subprocess.CalledProcessError: Command '['/home/guxm/anaconda/anaconda3/envs/gphmr/bin/python', '-u', 'src/tools/run_gphmer_bodymesh.py', '--local_rank=0', '--val_yaml', '3dpw/test_has_gender.yaml', '--arch', 'hrnet-w64', '--num_workers', '4', '--per_gpu_eval_batch_size', '25', '--num_hidden_layers', '4', '--num_attention_heads', '4', '--input_feat_dim', '2051,512,128', '--hidden_feat_dim', '1024,256,64', '--run_eval_only', '--resume_checkpoint', './models/graphormer_release/graphormer_3dpw_state_dict.bin']' returned non-zero exit status 1.
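
The traceback ends at the `np.linalg.svd(K)` call. With NumPy, this error is often a symptom of NaN or Inf entries in the input matrix rather than of a genuinely hard decomposition, although nothing in the log above confirms that this is the cause here. A minimal diagnostic sketch, assuming one wants to see the offending matrix instead of crashing; `checked_svd` and its `tag` argument are hypothetical names and are not part of the MeshGraphormer code:

```python
import numpy as np


def checked_svd(K, tag=""):
    """Run np.linalg.svd on K, reporting non-finite input or solver failure.

    Hypothetical helper, not part of the repository. Returns (U, s, Vh) on
    success, or None if K contains NaN/Inf or the solver raises LinAlgError.
    `tag` is a free-form label, e.g. a batch or sample index.
    """
    K = np.asarray(K, dtype=np.float64)
    # Check the input first so the report points at the data, not the solver.
    if not np.isfinite(K).all():
        print(f"[checked_svd] {tag}: K has non-finite entries, skipping SVD")
        return None
    try:
        return np.linalg.svd(K)
    except np.linalg.LinAlgError as err:
        # Finite input but the solver still failed: print K for inspection.
        print(f"[checked_svd] {tag}: {err}; K =\n{K}")
        return None
```

Calling such a wrapper in place of the bare `np.linalg.svd(K)` would only help locate the problem; a sample for which it returns `None` would still need to be handled or excluded explicitly, which changes the reported metric.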

@guxm2021
Author

This problem seems to go away when I switch to another GPU. Could you suggest some possible reasons for this?
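
Because the failing batch changes between runs and the error disappears on a different GPU, one thing worth ruling out is that the predictions coming off that device occasionally contain NaN or Inf. This is only a hypothesis. The sketch below assumes the joints are available as a NumPy array of shape (batch, joints, 3), as in the `.cpu().numpy()` call in the traceback; `nonfinite_sample_indices` is a hypothetical helper name:

```python
import numpy as np


def nonfinite_sample_indices(batch):
    """Return indices of samples in a batch that contain any NaN/Inf value.

    Hypothetical helper. `batch` is assumed to be array-like with the sample
    dimension first, e.g. shape (B, J, 3) for B samples of J 3D joints.
    """
    batch = np.asarray(batch)
    # Reduce over every axis except the first to get one flag per sample.
    per_sample_ok = np.isfinite(batch).all(axis=tuple(range(1, batch.ndim)))
    return np.flatnonzero(~per_sample_ok).tolist()


# Example with synthetic data: sample 2 has a single NaN coordinate.
pred = np.zeros((4, 14, 3))
pred[2, 5, 1] = np.nan
print(nonfinite_sample_indices(pred))  # [2]
```

Running a check like this on both the predicted and the ground-truth joints just before the error metric is computed would show whether the bad values come from the model output or from the data.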

@kevinlin311tw
Member

Thanks for pointing out this issue. I haven't encountered this problem before. I am not sure whether it is due to a hardware-specific setting.

@PomIsBest

Hi~
I also want to run the evaluation, but I can't download the datasets with azcopy. It fails with the error "Login Credentials missing. No SAS token or OAuth token is present and the resource is not public". Could you please tell me how to download the datasets, or share your copy?
Thanks very much!
