Update bandwidth and latency calculations, add multi work group support #30

avinashkethineedi · 2025-01-31T18:14:20Z

- Refined bandwidth and latency calculations for improved accuracy
- Added multi work group support for functional tests
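
For readers unfamiliar with how such figures are usually derived, here is a rough sketch of a latency and bandwidth calculation from a start and end timestamp. It is not the code from this PR: the helper names `latency_us` and `bandwidth_gbps`, the microsecond units, and the GB/s scaling are all assumptions made for illustration. Only `bw_factor` and the `start_time`/`end_time` pair appear in the diff.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the PR): average time per iteration, in
// microseconds, given timestamps taken around the timed loop.
double latency_us(double start_us, double end_us, int iterations) {
  assert(iterations > 0);
  return (end_us - start_us) / iterations;
}

// Hypothetical helper (not from the PR): bandwidth in GB/s. bw_factor scales
// the per-call message size to the data actually moved, for example n_pes for
// an alltoall-style collective.
double bandwidth_gbps(std::size_t bytes, int bw_factor, double start_us,
                      double end_us, int iterations) {
  const double per_iter_us = latency_us(start_us, end_us, iterations);
  assert(per_iter_us > 0.0);
  // bytes per microsecond equals 1e6 bytes/s, so divide by 1e3 to get GB/s.
  return static_cast<double>(bytes * bw_factor) / (per_iter_us * 1e3);
}
```

With 10 iterations over 100 microseconds, 1000-byte messages, and a `bw_factor` of 4, this sketch gives a latency of 10 microseconds and a bandwidth of 0.4 GB/s.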

Yiltan

Looks good. Mostly very minor comments, or questions to check my understanding.

tests/functional_tests/alltoall_tester.cpp

Yiltan · 2025-01-31T21:02:29Z

-                               &team_alltoall_world_dup);
+  bw_factor = n_pes;
+
+  for (int team_i = 0; team_i < NUM_TEAMS; team_i++) {


1. I've seen this pattern a few times in this PR. Is there a clean way to abstract it? There may not be.

2. I was under the impression that #teams == #work groups. If so:
2.a. This could be quite slow for RO, as we will be calling MPI_Comm_dup 39 times when we may only need 4 teams for a specific functional test (e.g., mpirun -np 4 ./rocshmem_example_driver -w 4 ...).
2.b. If these assumptions are true, is it possible to create only the number of teams that the test requires?

3. rocshmem_team_split_strided should return a team that is in device memory. Is it possible to create the team_alltoall_world_dup array in device memory so that we don't have to issue that memcpy?

For collective operations, the number of teams should match the number of work groups participating in the operation. This may slow down the init phase for the RO backend, as it requires creating multiple MPI windows.

Updated the logic so that the number of teams created will now be determined by the ROCSHMEM_MAX_NUM_TEAMS environment variable if set; otherwise, it defaults to 39, since the default number of teams is 40.

Removed memcpy and moved the team_alltoall_world_dup array to device memory.
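
The team-count selection described in the reply above could look roughly like the following. This is a sketch under stated assumptions, not the code from the PR: the helper names `num_test_teams` and `num_test_teams_from_env` and the constant `kDefaultNumTestTeams` are hypothetical, and the thread does not say whether the test uses the variable's value as-is or subtracts one for the world team. The sketch uses it as-is and falls back to 39 on unset or malformed input.

```cpp
#include <climits>
#include <cstdlib>

// Assumed default: 39 test teams, per the reply above.
constexpr int kDefaultNumTestTeams = 39;

// Hypothetical helper: parse the raw environment string. Falls back to the
// default when the value is missing, empty, non-numeric, or not positive.
int num_test_teams(const char* env_value) {
  if (env_value == nullptr || *env_value == '\0') {
    return kDefaultNumTestTeams;
  }
  char* end = nullptr;
  const long parsed = std::strtol(env_value, &end, 10);
  if (end == env_value || *end != '\0' || parsed <= 0 || parsed > INT_MAX) {
    return kDefaultNumTestTeams;
  }
  return static_cast<int>(parsed);
}

// Hypothetical wrapper: read ROCSHMEM_MAX_NUM_TEAMS from the environment.
int num_test_teams_from_env() {
  return num_test_teams(std::getenv("ROCSHMEM_MAX_NUM_TEAMS"));
}
```

Splitting the parsing from the `std::getenv` call keeps the fallback logic easy to check without touching the process environment.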

tests/functional_tests/amo_extended_tester.cpp

tests/functional_tests/fcollect_tester.cpp

Yiltan · 2025-01-31T21:22:50Z

tests/functional_tests/extended_primitives.cpp

-                     stream, loop, args.skip, timer, (char*)s_buf,
-                     (char*)r_buf, size, _type, _shmem_context);
+                     stream, loop, args.skip, start_time, end_time,
+                     (char*)s_buf, (char*)r_buf, size, _type, _shmem_context);


On line 88, we could allocate these buffers as char, since we consistently cast them to char*.

The r_buf and s_buf buffers are declared as int because they store contiguous integer values used for result verification after the rocSHMEM RMA APIs transfer data to r_buf. However, within the kernels, the getmem/putmem APIs handle data movement and require the pointer to be cast to char*. This is necessary because these APIs interpret the size parameter as the number of bytes to transfer; without casting, the pointer would be treated as an int*, leading to unintended integer-based pointer arithmetic.
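
The pointer-arithmetic point in the reply above can be illustrated on the host with plain C++. This is a minimal sketch, not the test code: `byte_distance_int`, `byte_distance_char`, and `copy_ints_as_bytes` are hypothetical names, and `std::memcpy` stands in for a byte-oriented API such as putmem/getmem.

```cpp
#include <cstddef>
#include <cstring>

// Advancing an int* by n moves n * sizeof(int) bytes.
std::ptrdiff_t byte_distance_int(int* base, std::size_t n) {
  return reinterpret_cast<char*>(base + n) - reinterpret_cast<char*>(base);
}

// Advancing the same address viewed as char* by n moves exactly n bytes.
std::ptrdiff_t byte_distance_char(int* base, std::size_t n) {
  char* bytes = reinterpret_cast<char*>(base);
  return (bytes + n) - bytes;
}

// Buffers stay typed as int for verification, while the copy itself is
// expressed in bytes: cast to char* and pass a byte count.
void copy_ints_as_bytes(int* dest, const int* src, std::size_t count) {
  std::memcpy(reinterpret_cast<char*>(dest),
              reinterpret_cast<const char*>(src), count * sizeof(int));
}
```

For an `int` buffer, an offset of 4 elements through `int*` covers `4 * sizeof(int)` bytes, while the same offset through `char*` covers 4 bytes, which is why a byte count paired with an uncast `int*` would address the wrong location.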

tests/functional_tests/tester.cpp

avinashkethineedi requested review from edgargabriel, Yiltan and BKP January 31, 2025 18:14

Yiltan reviewed Jan 31, 2025

avinashkethineedi force-pushed the fix/time-calculations branch from 86833ab to 42948b4 Compare January 31, 2025 22:27

Update bandwidth and latency calculations, add multi work group support

4a50de6

- Refined bandwidth and latency calculations for improved accuracy
- Added multi work group support for functional tests

avinashkethineedi force-pushed the fix/time-calculations branch from 42948b4 to 4a50de6 Compare February 3, 2025 20:49

avinashkethineedi mentioned this pull request Feb 3, 2025

[IPC] Fix ROCSHMEM_SIGNAL_ADD #32

Merged
