Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[wip] aurora config, clean up #6916

Draft
wants to merge 31 commits into
base: master
Choose a base branch
from
Draft

Conversation

oksanaguba
Copy link
Contributor

No description provided.

@oksanaguba oksanaguba mentioned this pull request Jan 19, 2025
@oksanaguba
Copy link
Contributor Author

this commit is running on sunspot

~/ess/v1-jan19/run/atm.log.11365537.amn-0001.250119-204314 

@oksanaguba
Copy link
Contributor Author

oksanaguba commented Jan 19, 2025

i should have not committed to here turning off compose and mam, will remove once this PR is finalized.

@ambrad ambrad requested a review from bartgol January 19, 2025 21:29
@ambrad
Copy link
Member

ambrad commented Jan 19, 2025

Looks good. A couple of points:

  1. In addition to any usual pre-merge testing, I strongly recommend running e3sm_scream_v1_medres with this PR on Frontier, pm-gpu, and Chrysalis, as well as homme_integration on Chrysalis.
  2. I'm thinking the EKAT update isn't right. No, it's fine. I looked more closely. I'm seeing some commits from Luca that diff, but I suspect that's intended. Edit 2: Ok, maybe my initial analysis was right: there seems to be some unintended diffs in ekat.

if (Kokkos_ENABLE_SYCL)
#enable_language(SYCL)
set (EAMXX_ENABLE_GPU TRUE CACHE BOOL "" FORCE)
set (SYCL_BUILD TRUE CACHE BOOL "" FORCE) #needed for yakl if kokkos vars are not visible there?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could turn off COMPOSE here, I believe. (I'll fix the COMPOSE issues once I'm on the machine.)

@@ -23,7 +23,8 @@ set(BUILD_HOMME_PREQX_KOKKOS OFF CACHE BOOL "")
set(BUILD_HOMME_PESE OFF CACHE BOOL "")
set(BUILD_HOMME_SWIM OFF CACHE BOOL "")
set(BUILD_HOMME_PRIM OFF CACHE BOOL "")
set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "")
#set(HOMME_ENABLE_COMPOSE ON CACHE BOOL "")
set(HOMME_ENABLE_COMPOSE OFF CACHE BOOL "")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just noting here this is what needs to be removed from this PR.


#<<<<<<< HEAD
#find_library(NETCDF_C netcdf HINTS $ENV{NETCDF_C_PATH}/lib)
#target_link_libraries(scream_rrtmgp_yakl ${NETCDF_C} rrtmgp scream_share)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain this change? The second line I'm guessing is due to Kokkos, but what about the NETCDF_C line?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i need to ask Jim about this -- iirc i had to set io libs in general, but also some cmake variables for io for rrtmgp separately. so i left this file messy, because i still need to sort it out.

@oksanaguba
Copy link
Contributor Author

i id not understant EKAT comment -- i have an ekat branch, oksanaguba/spot, which i maintained only to point to a different kokkos. so i periodically merge master into it, but not too frequently, so it may be behind the ekat used in master, but its kokkos points not to e3sm kokkos, but to kokkos develop branch.

this is another issue to sort out (but i would not know how).

@ambrad
Copy link
Member

ambrad commented Jan 19, 2025

i id not understant EKAT comment

Re: EKAT, it appears there are missing commits on your branch w.r.t. the submodule point in E3SM: E3SM-Project/EKAT@oksanaguba/spot...4231383

@bartgol am I seeing that correctly? Also, is current EKAT master ready to be used in E3SM?

If so, Oksana, I recommend rebasing your ekat branch on master, then pointing the ekat submodule to that cleaned-up branch.

More generally, testing across machines will help to flush out issues, if there are any, including any with ekat.

@oksanaguba
Copy link
Contributor Author

there are some December commits in the diff you posted, yes, at least those are expected. my ekat branch was not updated too frequently. will do what you suggested, thanks.

@oksanaguba
Copy link
Contributor Author

There are a lot of fails on chrysalis, but they all seem to be in the form of

/home/ac.onguba/ess/SMS_P12x2.ne4pg2_oQU480.WCYCL1850NS.chrysalis_intel.allactive-mach_mods.C.20250125_131133_jckk51/run/e3sm.log.672561.250125-144047:10: e3sm.exe           0000000002F54994  compose_mod_mp_co         418  compose_mod.F90

which points to compose finalize: call slmm_finalize() . The code gets stuck there and runs out of time. I reverted the kokkos sha in my branch to 4.2 (the one that is used in master now), and these fails do not appear. there are 12 total fails with new kokkos. There are also diffs, but i will investigate those separately.

@ambrad would you want to look at this compose finalize issue? thnaks.

@ambrad
Copy link
Member

ambrad commented Jan 26, 2025

Yes, I'll troubleshoot the problem early this week.

Update 1: I have a solution for Chrysalis and now need to test it on GPU machines. The issue was something introduced in Kokkos 4.4 for a View of Views: SequentialHostInit.

@tcclevenger
Copy link
Contributor

For Kokkos 4.5.01, I'm having my runs on PM-GPU hang, and standalone tests hang on Weaver.

@oksanaguba
Copy link
Contributor Author

oksanaguba commented Jan 27, 2025

@tcclevenger it maybe the same issue as i see with compose finalize, see if you can check in logs

@ambrad
Copy link
Member

ambrad commented Jan 27, 2025

I'm testing these changes and will report back: oksanaguba/eamxx/spot-clean...ambrad:E3SM:spot-clean-amb1

@ambrad
Copy link
Member

ambrad commented Jan 27, 2025

@oksanaguba: With my changes, the hang is gone but (separately from the fix) I'm getting DIFFs on Chrysalis. The end result is that e3sm_atm_developer and homme_integration are clean besides the diffs. Since the hangs are gone, do you want me to push the following commits to this branch?
oksanaguba/eamxx/spot-clean...ambrad:E3SM:spot-clean-amb1
Separately, I'll see if there's a way to avoid the baseline diffs.

@bartgol
Copy link
Contributor

bartgol commented Jan 27, 2025

@ambrad I'm assuming you are compiling with Kokkos::Serial as a backend, right? The CI fails when configuring eamxx are exposing an old bug in ekat (and eamxx) that shows up when openmp is on. Oksana pulled in the commit that fixes things in ekat, but we need a fix in eamxx.

@oksanaguba I can push the fix straight to your branch, if you want. Or I can send you the patch.

@rljacob
Copy link
Member

rljacob commented Jan 27, 2025

What is the purpose of introducing 4 new machines? sunspot-pvc, sunspotcpu, sunspot-gen, and auroracpu ?

@ambrad
Copy link
Member

ambrad commented Jan 28, 2025

@bartgol do you have a branch with your fix in it? That said, I'm confident the fix I made to compose is needed: I see the change in behavior necessitating this fix was made in Kokkos 4.4.

@bartgol
Copy link
Contributor

bartgol commented Jan 28, 2025

@bartgol do you have a branch with your fix in it? That said, I'm confident the fix I made to compose is needed: I see the change in behavior necessitating this fix was made in Kokkos 4.4.

I don't. The fix depends on certain stranded ekat commit that got cherry picked in the ekat-kokkos-4.5 branch, so I kind of need that ekat branch to include them. I didn't want to create multiple branches where ekat is updated, so I may just piggy-back a fix to this branch. Again, the fix only shows up when Kokkos_ENABLE_OPENMP=TRUE AND a commit from ekat is present. This was a case of double-negative: two bugs canceling each other out by luck...

I haven't looked at your changes, so I can't comment. All I'm saying is that the fact that you were able to build makes me believe that Kokkos_ENABLE_OPENMP=FALSE in your build (I think in cime builds you get that to be ON if you ask for compile_threaded, not sure what the cime syntax is for asking that option)

@ambrad
Copy link
Member

ambrad commented Jan 28, 2025

@bartgol one thing that might be confusing here is I'm fixing the F90 dycore used with EAM, not the C++ dycore used with EAMxx. So EKAT is irrelevant in my case.

@oksanaguba
Copy link
Contributor Author

@ambrad yes, please push the fix, this branch is useless without 4.5.01 kokkos. regarding diffs -- if you gave a guess, it would be great. if not, i was going to revert a few changes to try to narrow it down, prob using standalone homme.

regarding your conversation above with Luca -- ii do not follow. but one test that uses hommexx dycore, ERS_D.ne4pg2_oQU480.F2010.chrysalis_intel.eam-hommexx , also freezes at the same spot in compose. i assume it is still not related to Luca's issue. i will try to re-read messages tomorrow to understand Luca's point better.

@oksanaguba
Copy link
Contributor Author

@rljacob Re: "What is the purpose of introducing 4 new machines? sunspot-pvc, sunspotcpu, sunspot-gen, and auroracpu ?" it is much easier to debug setups early on when they are separated. this is related to the issue that Az raised, that this should be cleaned up eventually.

@ambrad
Copy link
Member

ambrad commented Jan 28, 2025

@oksanaguba I've pushed the commits. I'm somewhat sure that the diffs occur only in the SL code; I'm currently hunting for the exact cause.

Update 1: At this point I have provisional evidence that at least the non-_D tests in e3sm_atm_developer will pass when the branch is rebased on current master. I'm testing that guess now. For Chrysalis, that will leave HOMMEBFB in homme_integration to figure out; the other test, HOMME, is already OK.

Update 2: Yes, e3sm_atm_developer passes against baselines with current master merged in (including a _D test), except for eam-hommexx. I'm going to tackle HOMMEBFB now and then come back to eam-hommexx if it doesn't get fixed along the way.

Update 3: I believe I have isolated the HOMMEBFB diff source. I will fix that, then retest everything on Chrysalis. If all is well, I will push the commit to this branch. Next, I'll move to pm-gpu and Frontier to address a GPU-specific issue.

Update 4: homme_integration passes on Chrysalis. Now moving on to GPU stuff. Also running e3sm_atm_integration against baselines on Chrysalis with upstream/master merged into this branch; will report back when that's done.

Update 5: homme_integration and e3sm_atm_integration (with master merged to branch) pass against baselines on Chrysalis. So I think Chrysalis is done.

Update 6: I found another way to handle the view of views in compose that avoids the hang while not requiring SequentialHostInit. Chrysalis is still fine, and the dycore now builds on Frontier, but there are issues it looks like at the EAMxx level that I'll look into next.

option(SCREAM_ENABLE_MAM "Whether to enable MAM aerosol support" ON)
if (Kokkos_ENABLE_SYCL)
option(SCREAM_ENABLE_MAM "Whether to enable MAM aerosol support" OFF)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this will work as you intend. You may have to simply do set(SCREAM_ENABLE_MAM OFF)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, but that one unlike SET worked.

//Whether to enable MAM aerosol support
SCREAM_ENABLE_MAM:BOOL=OFF

if you have a preference i will change it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm puzzled. This

option(BLAH "" ON) 
message ("BLAH: ${BLAH}")
option(BLAH "" OFF)
message ("BLAH: ${BLAH}")

prints

BLAH: ON
BLAH: ON

Are you sure you did not have another line before those that set SCREAM_ENABLE_MAM to OFF? Maybe in the mach file? Also, was it a clean build?

My preference is for using set(... CACHE ..) (or option(...)) only once.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found this in Professional CMake:

An important difference between normal and cache variables is that the set() command will only
overwrite a cache variable if the FORCE keyword is present, unlike normal variables where the set()
command will always overwrite a pre-existing value. The set() command acts more like set-if-not-
set when used to define cache variables, as does the option() command (which has no FORCE
capability). The main reason for this is that cache variables are primarily intended as a
customization point for developers.

which makes me think something else may have caused SCREAM_ENABLE_MAM=OFF in your cache file...

@ambrad
Copy link
Member

ambrad commented Jan 30, 2025

@bartgol did you say you have a fix for

CMake Error at /gpfs/fs1/home/ac.ambradl/e3smdev1/externals/ekat/cmake/EkatSetCompilerFlags.cmake:294 (message):
  [SetOmpFlags] You need to specify at least one language

This is on Chrysalis, where I'm trying to build an EAMxx medres test with this branch. If you have a fix, feel free to push it to this PR branch.

@ambrad
Copy link
Member

ambrad commented Jan 30, 2025

On Frontier, it seems that flags.make for certain EAMxx interface files gets a spurious -x that leads to

clang-15: error: language not recognized: '-std=gnu++17'

One sees it in flags.make here: -x hip -fno-gpu-rdc -x --offload-arch=gfx900. The second -x shouldn't be there. I haven't found what's injecting it. An example is bld/cmake-bld/eamxx/src/physics/rrtmgp/CMakeFiles/scream_rrtmgp.dir/flags.make. Another: bld/cmake-bld/eamxx/src/mct_coupling/CMakeFiles/atm.dir/flags.make.

@tcclevenger
Copy link
Contributor

tcclevenger commented Jan 30, 2025

On Frontier, it seems that bld/cmake-bld/eamxx/src/physics/rrtmgp/CMakeFiles/scream_rrtmgp.dir/flags.make gets a spurious -x that leads to

clang-15: error: language not recognized: '-std=gnu++17'

One sees it in flags.make here: -x hip -fno-gpu-rdc -x --offload-arch=gfx900. The second -x shouldn't be there. I haven't found what's injecting it.

I was getting language not recognized: '-Md'. Also coming from rrtmgp.

@ambrad
Copy link
Member

ambrad commented Jan 30, 2025

@tcclevenger I just updated my comment with another example. rrtmgp itself has nothing to do with it; it just comes first. As for your -Md case, check the build line for a spurious -x, possibly right before the -Md. As far as I can tell, the language not recognized message is meaningless and seems to get emitted due to bailing on a -x with no actual argument to -x.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Machine Files wip work in progress
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants