-
Notifications
You must be signed in to change notification settings - Fork 12
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update bandwidth and latency calculations, add multi work group support #30
base: develop
Are you sure you want to change the base?
Update bandwidth and latency calculations, add multi work group support #30
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, mostly very minor comment or question about my understanding
&team_alltoall_world_dup); | ||
bw_factor = n_pes; | ||
|
||
for (int team_i = 0; team_i < NUM_TEAMS; team_i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I've seen this pattern a few times in this PR, is there a clean way to abstract this? There may not be
- I was under the impression that #teams == #work group, if so
2.a. This could be quite slow for RO as we will be calling MPI_Comm_dup 39 times but we may only need 4 teams for a specific functional test (i.e.,mpirun -np 4 ./rocshmem_example_driver -w 4 ...
)
2.b If these assumptions are true, is it possible to only create the number of teams that the test requires? rocshmem_team_split_strided
should return a team that is in device memory, is it possible that we create theteam_alltoall_world_dup
array in device memory so that we don't have to issue that memcopy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- For collective operations, the number of teams should match the number of work groups participating in the operation. This may slow down the
init
phase for the RO backend, as it requires creating multiple MPI windows. - Updated the logic so that the number of teams created will now be determined by the
ROCSHMEM_MAX_NUM_TEAMS
environment variable if set; otherwise, it defaults to39
, since the default number of teams is40
. - Removed
memcpy
and moved theteam_alltoall_world_dup
array to device memory.
stream, loop, args.skip, timer, (char*)s_buf, | ||
(char*)r_buf, size, _type, _shmem_context); | ||
stream, loop, args.skip, start_time, end_time, | ||
(char*)s_buf, (char*)r_buf, size, _type, _shmem_context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In line 88, we could allocate as char
as we consistently cast to it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The r_buf
and s_buf
buffers are declared as int
because they store contiguous integer values used for result verification after the rocSHMEM RMA APIs transfer data to r_buf
. However, within the kernels, the getmem/putmem
APIs handle data movement and require the pointer to be cast to char*
. This is necessary because these APIs interpret the size parameter as the number of bytes to transfer; without casting, the pointer would be treated as an int*
, leading to unintended integer-based pointer arithmetic.
86833ab
to
42948b4
Compare
- Refined bandwidth and latency calculations for improved accuracy - Added multi work group support for functional tests
42948b4
to
4a50de6
Compare
bandwidth
andlatency
calculations for improved accuracy