PyTorch versions #338

terryfrankcombe · 2023-05-26T05:12:37Z


I just noticed that the installation instructions state that we need pytorch <= 1.11.*.

I've been running apparently happily with 1.12.0. Are there known problems with this?

(For additional context, I'm trying to get things running on A100, having used V100 up to now. But my pytorch doesn't like sm_80. Installing 1.12.0+cu113 gives a tool that consistently hangs after the second batch of epoch 0. Should I be looking for something in 1.11? Conversely, might pytorch 2.0 work?)

Ciao
Terry
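A quick way to see whether an installed PyTorch build was compiled for a given GPU is to compare the build's architecture list with the device's compute capability. The sketch below assumes a CUDA build of PyTorch is installed; `supports_capability` is a hypothetical helper written for this illustration, not part of PyTorch. It only looks for an exact `sm_XY` entry and ignores PTX (`compute_XY`) fallbacks, so treat a negative result as a hint rather than proof.

```python
def supports_capability(arch_list, capability):
    """Return True if the build's arch list has native kernels for the device.

    arch_list:  strings such as "sm_70", as returned by torch.cuda.get_arch_list()
    capability: (major, minor) tuple, as returned by torch.cuda.get_device_capability()
    """
    major, minor = capability
    return f"sm_{major}{minor}" in arch_list


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        torch = None

    if torch is not None and torch.cuda.is_available():
        archs = torch.cuda.get_arch_list()          # architectures compiled into this build
        cap = torch.cuda.get_device_capability(0)   # (8, 0) on an A100, (7, 0) on a V100
        print("torch", torch.__version__, "CUDA", torch.version.cuda)
        print("compiled archs:", archs)
        print("device capability:", cap)
        print("native kernels for this GPU:", supports_capability(archs, cap))
    else:
        print("PyTorch with a visible CUDA device is not available here.")
```

If `sm_80` is missing from the list on an A100, the installed wheel was most likely built against a CUDA version that predates that architecture.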

Linux-cpp-lisp · 2023-05-27T18:46:28Z

Maintainer

Hi Terry,

This is a consequence of some very strange upstream bugs in PyTorch that have affected us: #311. The linked issue has the details; basically, training and inference get progressively slower over time. With 1.12 I had also noticed some other irregularities due to the new nvFuser JIT backend. I would be careful with 1.12, but if it is working for you, that is good to know.

If you have (or have not) experienced the slowdowns in the linked issue, would you mind posting the details in that thread? We're still trying to understand when and why that issue occurs so we can hopefully work around it.

Thanks!

5 replies

terryfrankcombe May 30, 2023
Author

I've been running with 1.12.0, with whatever CUDA version gets configured by default (installed via pip with no explicit CUDA version). I haven't noticed a slowdown, but I haven't been looking for one and would not have noticed. This is running on V100.

Linux-cpp-lisp May 31, 2023
Maintainer

Interesting; the linked issue (and my tests) were all on A100s. Maybe that is part of the necessary conditions for the bug.

You might still want to check that your training runs aren't getting slower over time. (You can easily plot wall time as a function of epoch, in wandb or by hand.)
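Besides plotting, the same check can be done numerically. The sketch below assumes you have the cumulative wall time (in seconds) at the end of each epoch in a plain list; how you export it from wandb or a log file is not shown here, and the function names and the `window` parameter are hypothetical choices for this example. It compares the average epoch duration at the end of a run with the average at the start.

```python
def epoch_durations(cumulative_walltime):
    """Convert cumulative wall time per epoch into per-epoch durations."""
    return [b - a for a, b in zip(cumulative_walltime, cumulative_walltime[1:])]


def slowdown_ratio(cumulative_walltime, window=5):
    """Mean duration of the last `window` epochs divided by that of the first.

    A value near 1.0 means wall time vs. epoch is close to a straight line;
    a value well above 1.0 means later epochs take longer than early ones.
    """
    durations = epoch_durations(cumulative_walltime)
    if len(durations) < 2 * window:
        raise ValueError("need at least 2 * window epoch durations")
    first = sum(durations[:window]) / window
    last = sum(durations[-window:]) / window
    return last / first


if __name__ == "__main__":
    # Synthetic data: every epoch takes 10 s, so the ratio is about 1.
    steady = [10.0 * i for i in range(21)]
    print("steady run ratio:", slowdown_ratio(steady))
```

The first epoch often includes one-off costs such as JIT compilation, so it may be worth dropping it before computing the ratio.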

terryfrankcombe May 31, 2023
Author

Of course. Wall time vs. epoch is a nice straight line for all of the runs for which I have it recorded.

Linux-cpp-lisp May 31, 2023
Maintainer

Huh, interesting. Glad to hear, and thanks for the info Terry!

Linux-cpp-lisp Jul 11, 2023
Maintainer

@terryfrankcombe, and whoever else sees this:

I've heard from the developer of another MLFF package that they get numerically incorrect derivatives in PyTorch 1.12+. We have not yet had any reports of this and it could be a bug only hit by their code, but please be aware of this and carefully check PyTorch 1.12+ results. We continue to only recommend PyTorch 1.11 for our users until these various issues can be more conclusively addressed, and we appreciate any reports of things working (in a well-checked way) or not working on PyTorch 1.12+.
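One generic way to check derivatives is to compare them against central finite differences of the energy. The sketch below is pure Python and uses a toy harmonic energy; `toy_energy` and `toy_gradient` are placeholders invented for this example. With a real model, `f` would wrap the model's energy evaluation and `analytic_grad` the autograd-derived gradient (forces are its negative); that wiring is an assumption and is not shown. PyTorch also ships `torch.autograd.gradcheck`, which performs a similar comparison on double-precision inputs.

```python
def finite_difference_grad(f, x, eps=1e-4):
    """Central-difference estimate of the gradient of scalar f at point x (a list)."""
    grad = []
    for i in range(len(x)):
        plus = list(x)
        minus = list(x)
        plus[i] += eps
        minus[i] -= eps
        grad.append((f(plus) - f(minus)) / (2 * eps))
    return grad


def max_grad_error(f, analytic_grad, x, eps=1e-4):
    """Largest absolute difference between analytic and numerical gradients."""
    numeric = finite_difference_grad(f, x, eps)
    return max(abs(a - n) for a, n in zip(analytic_grad(x), numeric))


def toy_energy(x):
    """Placeholder energy: independent harmonic wells, E = sum(x_i^2) / 2."""
    return sum(0.5 * xi * xi for xi in x)


def toy_gradient(x):
    """Exact gradient of toy_energy, dE/dx_i = x_i."""
    return list(x)


if __name__ == "__main__":
    point = [0.3, -1.2, 2.0]
    print("max gradient error:", max_grad_error(toy_energy, toy_gradient, point))
```

Running the same comparison under PyTorch 1.11 and 1.12+ on a handful of structures would show whether the derivatives agree with each other and with the numerical estimate.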
