Fix single-precision for manually implemented SCC-STACK OpenACC GPU variant #109

MichaelSt98 · 2025-01-10T10:29:57Z

Fix single-precision for manually implemented SCC-STACK OpenACC GPU variant.

Problem with alignment for single-precision, due to the MAX(SIZEOF(...), 8).
I just removed the call to MAX(), however, it would have been possible to just do MAX(SIZEOF(...), 4).
The Loki generated SCC-STACK still doesn't work for SP for the same reason. However, I/we need to think about how we generally fix this issue and adapt the corresponding Loki transformation ...

reuterbal

Thanks, I've just tried this and it works fine. However, we should also be able to apply the same change in the driver module where ISTSZ is derived, otherwise you allocate twice the stack size for single precision runs. I did just try this by replacing 8 by 4, but removing entirely should be fine, too.

The reason this MAX(type_bytes, 8) is there at all is to ensure alignment - and working with units of at least 8 bytes was the easiest way to do this. In practice, what we really need, though, is to make sure we align at 8 byte boundaries with the full expression, for example:

KLON*KLEV*MAX(C_SIZEOF(REAL(1, kind=JPRB)), 8)

In C, we would write this as

((KLON*KLEV + 7) >> 3) << 3

i.e., add 7, divide by 8 without rest and multiply by 8.
I'm always a bit uncertain about how to translate this to fortran, but in principle we should be able to do something with ISHFT:

ISHFT(ISHFT(KLON*KLEV + 7, -3), 3)

Readability is not exactly improved but it's probably not much worse than the current expression?

reuterbal · 2025-01-17T15:37:21Z

A quick suggests this works:

$ cat test_shft.F90
program test_shft
implicit none
integer :: j

do j=1,33
  write(*,*) 'j=',j,' becomes',ishft(ishft(j + 7, -3), 3)
end do
end program
$ gfortran -o test_shft test_shft.F90
$ ./test_shft
 j=           1  becomes           8
 j=           2  becomes           8
 j=           3  becomes           8
 j=           4  becomes           8
 j=           5  becomes           8
 j=           6  becomes           8
 j=           7  becomes           8
 j=           8  becomes           8
 j=           9  becomes          16
 j=          10  becomes          16
 j=          11  becomes          16
 j=          12  becomes          16
 j=          13  becomes          16
 j=          14  becomes          16
 j=          15  becomes          16
 j=          16  becomes          16
 j=          17  becomes          24
 j=          18  becomes          24
 j=          19  becomes          24
 j=          20  becomes          24
 j=          21  becomes          24
 j=          22  becomes          24
 j=          23  becomes          24
 j=          24  becomes          24
 j=          25  becomes          32
 j=          26  becomes          32
 j=          27  becomes          32
 j=          28  becomes          32
 j=          29  becomes          32
 j=          30  becomes          32
 j=          31  becomes          32
 j=          32  becomes          32
 j=          33  becomes          40

(That would be a way to fix the pool_allocator trafo in Loki - here we have NPROMA-aligned blocks and the entire aligning is redundant)

reuterbal requested changes Jan 17, 2025

View reviewed changes

MichaelSt98 added 2 commits January 20, 2025 12:17

fix SP for GPU OpenACC SCC-STACK manually implemented variant

e58b448

SCC-STACK (single-precision) adapt size of stack as well

7b211ea

MichaelSt98 force-pushed the nams-acc-stack-sp branch from cc7cc74 to 7b211ea Compare January 20, 2025 12:17

MichaelSt98 requested a review from reuterbal January 20, 2025 13:23

reuterbal approved these changes Jan 20, 2025

View reviewed changes

reuterbal merged commit 06dc3f9 into develop Jan 20, 2025
32 checks passed

reuterbal deleted the nams-acc-stack-sp branch January 20, 2025 13:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix single-precision for manually implemented SCC-STACK OpenACC GPU variant #109

Fix single-precision for manually implemented SCC-STACK OpenACC GPU variant #109

MichaelSt98 commented Jan 10, 2025

reuterbal left a comment •

edited

Loading

reuterbal commented Jan 17, 2025

Fix single-precision for manually implemented SCC-STACK OpenACC GPU variant #109

Fix single-precision for manually implemented SCC-STACK OpenACC GPU variant #109

Conversation

MichaelSt98 commented Jan 10, 2025

reuterbal left a comment • edited Loading

Choose a reason for hiding this comment

reuterbal commented Jan 17, 2025

reuterbal left a comment •

edited

Loading