-
Notifications
You must be signed in to change notification settings - Fork 376
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[wip] aurora config, clean up #6916
base: master
Are you sure you want to change the base?
Conversation
this commit is running on sunspot
|
i should have not committed to here turning off compose and mam, will remove once this PR is finalized. |
Looks good. A couple of points:
|
if (Kokkos_ENABLE_SYCL) | ||
#enable_language(SYCL) | ||
set (EAMXX_ENABLE_GPU TRUE CACHE BOOL "" FORCE) | ||
set (SYCL_BUILD TRUE CACHE BOOL "" FORCE) #needed for yakl if kokkos vars are not visible there? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could turn off COMPOSE here, I believe. (I'll fix the COMPOSE issues once I'm on the machine.)
@@ -23,7 +23,8 @@ set(BUILD_HOMME_PREQX_KOKKOS OFF CACHE BOOL "") | |||
set(BUILD_HOMME_PESE OFF CACHE BOOL "") | |||
set(BUILD_HOMME_SWIM OFF CACHE BOOL "") | |||
set(BUILD_HOMME_PRIM OFF CACHE BOOL "") | |||
set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "") | |||
#set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "") | |||
set(HOMME_ENABLE_COMPOSE OFF CACHE BOOL "") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just noting here this is what needs to be removed from this PR.
|
||
#<<<<<<< HEAD | ||
#find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib) | ||
#target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain this change? The second line I'm guessing is due to Kokkos, but what about the NETCDF_C line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i need to ask Jim about this -- iirc i had to set io libs in general, but also some cmake variables for io for rrtmgp separately. so i left this file messy, because i still need to sort it out.
i id not understant EKAT comment -- i have an ekat branch, oksanaguba/spot, which i maintained only to point to a different kokkos. so i periodically merge master into it, but not too frequently, so it may be behind the ekat used in master, but its kokkos points not to e3sm kokkos, but to kokkos develop branch. this is another issue to sort out (but i would not know how). |
Re: EKAT, it appears there are missing commits on your branch w.r.t. the submodule point in E3SM: E3SM-Project/EKAT@oksanaguba/spot...4231383 @bartgol am I seeing that correctly? Also, is current EKAT master ready to be used in E3SM? If so, Oksana, I recommend rebasing your ekat branch on master, then pointing the ekat submodule to that cleaned-up branch. More generally, testing across machines will help to flush out issues, if there are any, including any with ekat. |
there are some December commits in the diff you posted, yes, at least those are expected. my ekat branch was not updated too frequently. will do what you suggested, thanks. |
There are a lot of fails on chrysalis, but they all seem to be in the form of
which points to compose finalize: call slmm_finalize() . The code gets stuck there and runs out of time. I reverted the kokkos sha in my branch to 4.2 (the one that is used in master now), and these fails do not appear. there are 12 total fails with new kokkos. There are also diffs, but i will investigate those separately. @ambrad would you want to look at this compose finalize issue? thnaks. |
Yes, I'll troubleshoot the problem early this week. Update 1: I have a solution for Chrysalis and now need to test it on GPU machines. The issue was something introduced in Kokkos 4.4 for a View of Views: SequentialHostInit. |
For Kokkos 4.5.01, I'm having my runs on PM-GPU hang, and standalone tests hang on Weaver. |
@tcclevenger it maybe the same issue as i see with compose finalize, see if you can check in logs |
I'm testing these changes and will report back: oksanaguba/eamxx/spot-clean...ambrad:E3SM:spot-clean-amb1 |
@oksanaguba: With my changes, the hang is gone but (separately from the fix) I'm getting DIFFs on Chrysalis. The end result is that e3sm_atm_developer and homme_integration are clean besides the diffs. Since the hangs are gone, do you want me to push the following commits to this branch? |
@ambrad I'm assuming you are compiling with Kokkos::Serial as a backend, right? The CI fails when configuring eamxx are exposing an old bug in ekat (and eamxx) that shows up when openmp is on. Oksana pulled in the commit that fixes things in ekat, but we need a fix in eamxx. @oksanaguba I can push the fix straight to your branch, if you want. Or I can send you the patch. |
What is the purpose of introducing 4 new machines? sunspot-pvc, sunspotcpu, sunspot-gen, and auroracpu ? |
@bartgol do you have a branch with your fix in it? That said, I'm confident the fix I made to |
I don't. The fix depends on certain stranded ekat commit that got cherry picked in the ekat-kokkos-4.5 branch, so I kind of need that ekat branch to include them. I didn't want to create multiple branches where ekat is updated, so I may just piggy-back a fix to this branch. Again, the fix only shows up when Kokkos_ENABLE_OPENMP=TRUE AND a commit from ekat is present. This was a case of double-negative: two bugs canceling each other out by luck... I haven't looked at your changes, so I can't comment. All I'm saying is that the fact that you were able to build makes me believe that Kokkos_ENABLE_OPENMP=FALSE in your build (I think in cime builds you get that to be ON if you ask for compile_threaded, not sure what the cime syntax is for asking that option) |
@bartgol one thing that might be confusing here is I'm fixing the F90 dycore used with EAM, not the C++ dycore used with EAMxx. So EKAT is irrelevant in my case. |
@ambrad yes, please push the fix, this branch is useless without 4.5.01 kokkos. regarding diffs -- if you gave a guess, it would be great. if not, i was going to revert a few changes to try to narrow it down, prob using standalone homme. regarding your conversation above with Luca -- ii do not follow. but one test that uses hommexx dycore, ERS_D.ne4pg2_oQU480.F2010.chrysalis_intel.eam-hommexx , also freezes at the same spot in compose. i assume it is still not related to Luca's issue. i will try to re-read messages tomorrow to understand Luca's point better. |
@rljacob Re: "What is the purpose of introducing 4 new machines? sunspot-pvc, sunspotcpu, sunspot-gen, and auroracpu ?" it is much easier to debug setups early on when they are separated. this is related to the issue that Az raised, that this should be cleaned up eventually. |
@oksanaguba I've pushed the commits. I'm somewhat sure that the diffs occur only in the SL code; I'm currently hunting for the exact cause. Update 1: At this point I have provisional evidence that at least the non-_D tests in e3sm_atm_developer will pass when the branch is rebased on current master. I'm testing that guess now. For Chrysalis, that will leave HOMMEBFB in homme_integration to figure out; the other test, HOMME, is already OK. Update 2: Yes, e3sm_atm_developer passes against baselines with current master merged in (including a _D test), except for eam-hommexx. I'm going to tackle HOMMEBFB now and then come back to eam-hommexx if it doesn't get fixed along the way. Update 3: I believe I have isolated the HOMMEBFB diff source. I will fix that, then retest everything on Chrysalis. If all is well, I will push the commit to this branch. Next, I'll move to pm-gpu and Frontier to address a GPU-specific issue. Update 4: homme_integration passes on Chrysalis. Now moving on to GPU stuff. Also running e3sm_atm_integration against baselines on Chrysalis with upstream/master merged into this branch; will report back when that's done. Update 5: homme_integration and e3sm_atm_integration (with master merged to branch) pass against baselines on Chrysalis. So I think Chrysalis is done. Update 6: I found another way to handle the view of views in compose that avoids the hang while not requiring SequentialHostInit. Chrysalis is still fine, and the dycore now builds on Frontier, but there are issues it looks like at the EAMxx level that I'll look into next. |
option(SCREAM_ENABLE_MAM "Whether to enable MAM aerosol support" ON) | ||
if (Kokkos_ENABLE_SYCL) | ||
option(SCREAM_ENABLE_MAM "Whether to enable MAM aerosol support" OFF) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure this will work as you intend. You may have to simply do set(SCREAM_ENABLE_MAM OFF)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks, but that one unlike SET worked.
//Whether to enable MAM aerosol support
SCREAM_ENABLE_MAM:BOOL=OFF
if you have a preference i will change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm puzzled. This
option(BLAH "" ON)
message ("BLAH: ${BLAH}")
option(BLAH "" OFF)
message ("BLAH: ${BLAH}")
prints
BLAH: ON
BLAH: ON
Are you sure you did not have another line before those that set SCREAM_ENABLE_MAM to OFF? Maybe in the mach file? Also, was it a clean build?
My preference is for using set(... CACHE ..)
(or option(...)
) only once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this in Professional CMake:
An important difference between normal and cache variables is that the set() command will only
overwrite a cache variable if the FORCE keyword is present, unlike normal variables where the set()
command will always overwrite a pre-existing value. The set() command acts more like set-if-not-
set when used to define cache variables, as does the option() command (which has no FORCE
capability). The main reason for this is that cache variables are primarily intended as a
customization point for developers.
which makes me think something else may have caused SCREAM_ENABLE_MAM=OFF in your cache file...
@bartgol did you say you have a fix for
This is on Chrysalis, where I'm trying to build an EAMxx medres test with this branch. If you have a fix, feel free to push it to this PR branch. |
On Frontier, it seems that flags.make for certain EAMxx interface files gets a spurious
One sees it in flags.make here: |
I was getting |
@tcclevenger I just updated my comment with another example. rrtmgp itself has nothing to do with it; it just comes first. As for your |
No description provided.