Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix single-precision for manually implemented SCC-STACK OpenACC GPU variant #109

Merged
merged 2 commits into from
Jan 20, 2025

Conversation

MichaelSt98
Copy link
Contributor

Fix single-precision for manually implemented SCC-STACK OpenACC GPU variant.

Problem with alignment for single-precision, due to the MAX(SIZEOF(...), 8).
I just removed the call to MAX(), however, it would have been possible to just do MAX(SIZEOF(...), 4).
The Loki generated SCC-STACK still doesn't work for SP for the same reason. However, I/we need to think about how we generally fix this issue and adapt the corresponding Loki transformation ...

Copy link
Collaborator

@reuterbal reuterbal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've just tried this and it works fine. However, we should also be able to apply the same change in the driver module where ISTSZ is derived, otherwise you allocate twice the stack size for single precision runs. I did just try this by replacing 8 by 4, but removing entirely should be fine, too.

The reason this MAX(type_bytes, 8) is there at all is to ensure alignment - and working with units of at least 8 bytes was the easiest way to do this. In practice, what we really need, though, is to make sure we align at 8 byte boundaries with the full expression, for example:

KLON*KLEV*MAX(C_SIZEOF(REAL(1, kind=JPRB)), 8)

In C, we would write this as

((KLON*KLEV + 7) >> 3) << 3

i.e., add 7, divide by 8 without rest and multiply by 8.
I'm always a bit uncertain about how to translate this to fortran, but in principle we should be able to do something with ISHFT:

ISHFT(ISHFT(KLON*KLEV + 7, -3), 3)

Readability is not exactly improved but it's probably not much worse than the current expression?

@reuterbal
Copy link
Collaborator

A quick suggests this works:

$ cat test_shft.F90
program test_shft
implicit none
integer :: j

do j=1,33
  write(*,*) 'j=',j,' becomes',ishft(ishft(j + 7, -3), 3)
end do
end program
$ gfortran -o test_shft test_shft.F90
$ ./test_shft
 j=           1  becomes           8
 j=           2  becomes           8
 j=           3  becomes           8
 j=           4  becomes           8
 j=           5  becomes           8
 j=           6  becomes           8
 j=           7  becomes           8
 j=           8  becomes           8
 j=           9  becomes          16
 j=          10  becomes          16
 j=          11  becomes          16
 j=          12  becomes          16
 j=          13  becomes          16
 j=          14  becomes          16
 j=          15  becomes          16
 j=          16  becomes          16
 j=          17  becomes          24
 j=          18  becomes          24
 j=          19  becomes          24
 j=          20  becomes          24
 j=          21  becomes          24
 j=          22  becomes          24
 j=          23  becomes          24
 j=          24  becomes          24
 j=          25  becomes          32
 j=          26  becomes          32
 j=          27  becomes          32
 j=          28  becomes          32
 j=          29  becomes          32
 j=          30  becomes          32
 j=          31  becomes          32
 j=          32  becomes          32
 j=          33  becomes          40

(That would be a way to fix the pool_allocator trafo in Loki - here we have NPROMA-aligned blocks and the entire aligning is redundant)

@reuterbal reuterbal merged commit 06dc3f9 into develop Jan 20, 2025
32 checks passed
@reuterbal reuterbal deleted the nams-acc-stack-sp branch January 20, 2025 13:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants