Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update bandwidth and latency calculations, add multi work group support #30

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

avinashkethineedi
Copy link
Contributor

  • Refined bandwidth and latency calculations for improved accuracy
  • Added multi work group support for functional tests

Copy link
Contributor

@Yiltan Yiltan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, mostly very minor comment or question about my understanding

tests/functional_tests/alltoall_tester.cpp Outdated Show resolved Hide resolved
tests/functional_tests/alltoall_tester.cpp Outdated Show resolved Hide resolved
&team_alltoall_world_dup);
bw_factor = n_pes;

for (int team_i = 0; team_i < NUM_TEAMS; team_i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I've seen this pattern a few times in this PR, is there a clean way to abstract this? There may not be
  2. I was under the impression that #teams == #work group, if so
    2.a. This could be quite slow for RO as we will be calling MPI_Comm_dup 39 times but we may only need 4 teams for a specific functional test (i.e., mpirun -np 4 ./rocshmem_example_driver -w 4 ... )
    2.b If these assumptions are true, is it possible to only create the number of teams that the test requires?
  3. rocshmem_team_split_strided should return a team that is in device memory, is it possible that we create the team_alltoall_world_dup array in device memory so that we don't have to issue that memcopy?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • For collective operations, the number of teams should match the number of work groups participating in the operation. This may slow down the init phase for the RO backend, as it requires creating multiple MPI windows.
  • Updated the logic so that the number of teams created will now be determined by the ROCSHMEM_MAX_NUM_TEAMS environment variable if set; otherwise, it defaults to 39, since the default number of teams is 40.
  • Removed memcpy and moved the team_alltoall_world_dup array to device memory.

tests/functional_tests/alltoall_tester.cpp Outdated Show resolved Hide resolved
tests/functional_tests/amo_extended_tester.cpp Outdated Show resolved Hide resolved
tests/functional_tests/alltoall_tester.cpp Outdated Show resolved Hide resolved
tests/functional_tests/fcollect_tester.cpp Outdated Show resolved Hide resolved
stream, loop, args.skip, timer, (char*)s_buf,
(char*)r_buf, size, _type, _shmem_context);
stream, loop, args.skip, start_time, end_time,
(char*)s_buf, (char*)r_buf, size, _type, _shmem_context);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In line 88, we could allocate as char as we consistently cast to it

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The r_buf and s_buf buffers are declared as int because they store contiguous integer values used for result verification after the rocSHMEM RMA APIs transfer data to r_buf. However, within the kernels, the getmem/putmem APIs handle data movement and require the pointer to be cast to char*. This is necessary because these APIs interpret the size parameter as the number of bytes to transfer; without casting, the pointer would be treated as an int*, leading to unintended integer-based pointer arithmetic.

tests/functional_tests/tester.cpp Show resolved Hide resolved
tests/functional_tests/tester.cpp Show resolved Hide resolved
- Refined bandwidth and latency calculations for improved accuracy
- Added multi work group support for functional tests
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants