Unexpected Error with Increasing MPI ranks related to search alg. #589

Open
nkzk-stan opened this issue Jun 4, 2022 · 2 comments

@nkzk-stan

I am facing unexpected behavior from Nalu. In short, I am rotating a square with a sliding mesh interface at a reasonably low omega (1.5) relative to other cases I have run for a circle and an ellipse.

When I deploy on 1 node and 16 CPUs, the case runs without a problem.

When I deploy on 1 node and 32 CPUs, it fails with error 159:
_Throw number = 159

Throw test that evaluated to true: !std::isfinite(Teuchos::ScalarTraits::magnitude(omega))

Prolongator damping factor needs to be finite.
MueLu::Exceptions::RuntimeError'
what(): /shared/nalu/build/packages/Trilinos/packages/muelu/src/Transfers/Smoothed-Aggregation/MueLu_SaPFactory_def.hpp:228:_

I resubmitted the job with 32 CPUs. This time it failed with error 160:
_Throw number = 160

Throw test that evaluated to true: true

Belos::StatusTestImpResNorm::checkStatus(): One or more of the current implicit residual norms is NaN.
Belos::StatusTestError'
what(): /shared/nalu/build/packages/Trilinos/packages/belos/src/BelosStatusTestImpResNorm.hpp:635:_

This issue was corrected by setting the following in the input file:
search_tolerance: 0.05
activate_dynamic_search_algorithm: no

I have attached the input file (I had to add a .pdf extension so that it could be attached; remove the .pdf to access it).

In addition, in another simulation of a square at a slower omega (0.707), Nalu freezes at the same timestep on every run. This occurred on both 16 and 32 CPUs. The input file is the same as in the case above, with only the omega and timestep changed. This issue was also corrected by the fix above. This is the last output before Nalu stalls:


Time Step Count: 1075 Current Time: 15.5672
dtN: 0.0149393 dtNm1: 0.0149967 gammas: 1.49904 -1.99617 0.497129
Volume 796 min: 0.000463178 max: 0.00877652
NonConformal alg will ghost a new number of entities: 14 and remove 84 entities from ghosting.
DgInfo size overview for name: Current_surface_5__Opposing_surface_55

dgSquare_R1.i.pdf

@spdomin
Contributor

spdomin commented Jun 4, 2022

When I run this case on this resource,

Currently Loaded Modules:

  1) tbb/2021.1.1          3) compiler/2021.1.1   5) impi/2018
  2) compiler-rt/2021.1.1  4) intel/2021.1.1      6) gnu8/8.3.0

I also see:

Quad42DSCS::general_face_grad_op: issue..

This was the clue that prompted the suggestion to remove the dynamic search algorithm, since the issue lies in serving up a poor opposing element from which the face grad op is required.
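To illustrate how an incomplete candidate set leads to a poor opposing element, here is a minimal sketch. It is not Nalu's implementation: the axis-aligned box elements, the function names `isoparametric_distance` and `fine_search`, and the element labels are all assumptions made for illustration. The idea is that the search picks the candidate whose parametric coordinates are closest to containing the point, so if the true owning element is never offered, the "best" remaining candidate lies well outside the reference element.

```python
def isoparametric_distance(point, box):
    """Max-norm of the point's parametric coordinates in an axis-aligned element.

    box = (xmin, ymin, xmax, ymax); a value <= 1 means the point is inside.
    Axis-aligned boxes are an assumption that keeps the mapping trivial.
    """
    x, y = point
    xmin, ymin, xmax, ymax = box
    xi = 2.0 * (x - xmin) / (xmax - xmin) - 1.0
    eta = 2.0 * (y - ymin) / (ymax - ymin) - 1.0
    return max(abs(xi), abs(eta))


def fine_search(point, candidates):
    """Return (element_id, distance) for the candidate closest to containing the point."""
    return min(
        ((elem_id, isoparametric_distance(point, box))
         for elem_id, box in candidates.items()),
        key=lambda pair: pair[1],
    )


# Three hypothetical opposing elements laid out side by side.
elements = {
    "A": (0.0, 0.0, 1.0, 1.0),
    "B": (1.0, 0.0, 2.0, 1.0),
    "C": (2.0, 0.0, 3.0, 1.0),
}
point = (1.5, 0.5)  # sits at the center of element B

# Full candidate set: the owning element is found.
full = fine_search(point, elements)

# Candidate set missing B: the best remaining match is outside its element.
partial = fine_search(point, {k: v for k, v in elements.items() if k != "B"})
```

With the full set, `full` selects element `B` at distance zero. With `B` withheld, `partial` still returns an element, but its distance exceeds 1, which is the kind of poor match that would feed a bad evaluation point into a face gradient operator.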

Fixes are as follows:

  1. At the very least, we should throw in the element method before we allow the linear system to be assembled, since that assembly is what ultimately causes the NaN.
  2. We need to revisit the dynamic tolerance parallel search algorithm. The MPI rank dependency is due to the coarse search not adequately serving up the full set of elements. There may be some corner case with mesh spacing/shape that drives this issue.
  3. Although the coarse parallel search is processor-count dependent, as long as we have the full set of proper candidates served up, the fine search should find the best candidate.
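The first item could look something like the following sketch. This is only an illustration of the idea in Python, not Nalu code: the function name `check_element_contribution` and the plain-list representation of the element matrix and right-hand side are assumptions. The point is to fail at the element, with the element identified, rather than later inside the linear solver.

```python
import math


def check_element_contribution(element_id, lhs, rhs):
    """Raise before assembly if any element-level entry is NaN or infinite.

    lhs is a list of rows (element matrix), rhs is a flat list (element
    right-hand side). Both layouts are assumptions for this sketch.
    """
    for row_index, row in enumerate(lhs):
        for col_index, value in enumerate(row):
            if not math.isfinite(value):
                raise FloatingPointError(
                    f"element {element_id}: non-finite lhs entry "
                    f"at ({row_index}, {col_index})"
                )
    for row_index, value in enumerate(rhs):
        if not math.isfinite(value):
            raise FloatingPointError(
                f"element {element_id}: non-finite rhs entry at {row_index}"
            )
    return lhs, rhs
```

Calling this just before the element contribution is summed into the global system would turn the downstream MueLu and Belos exceptions into an error that names the offending element.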

Best,

@spdomin
Contributor

spdomin commented Jun 4, 2022

Adding input file in non-pdf form.

dgSquare_R1n.txt
