- TBD
- Fix P2P issues
- Normalize dataframe column name to avg_latency_ms
- Fix size_in_bytes to take dtype into account (see the size sketch below)
- Add ModelArts multi-node entrypoint script
- Fix local/global rank ambiguity (see the rank sketch below)
- Add device event timing in P2P benchmarks (bug reported by Shanlan Li; see the timing sketch below)
- Fix alltoall device issue
- Fix elapsed time measurement for NPU/CUDA
- Improve plotting scripts
- Fix alltoall torch.arange issue
- Further improve plotting scripts by using torchrun directly
- Measure time in nanoseconds (see the timing-and-bandwidth sketch below)
- Fix bandwidth measurements (MR: 28)
- Improve scripts' functionality
- Introduce plotter scripts for collectives, latency, and bandwidth
- Fix CSV output filename
- Minor plotting improvements
- Fix bumpversion quotes
- Fix input tensor size bug in reduce_scatter benchmark
- Cover all torch dtypes (int8/int16/uint8/etc.)
- Fix critical bug in NPU/HCCL environment setup
- Introduce dtype as a command-line argument (see the dtype sketch below)
- Port all_gather and reduce_scatter collectives
- Add basic plotting utility under scripts/
- First release containing P2P OMB-Py benchmarks
- Support allreduce and broadcast collectives
- Add torch_npu environment check (see the NPU check sketch below)
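
A minimal sketch of the dtype-aware size calculation; the helper name `size_in_bytes` comes from the entry above, but the signature is an assumption:

```python
import torch

def size_in_bytes(numel: int, dtype: torch.dtype = torch.float32) -> int:
    """Message size in bytes for `numel` elements of the given dtype."""
    # element_size() returns bytes per element (1 for int8, 2 for float16,
    # 4 for float32, ...); a hard-coded 4-byte element misreports every
    # non-float32 message size.
    return numel * torch.empty((), dtype=dtype).element_size()
```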
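
A sketch of the local/global rank distinction; `RANK` and `LOCAL_RANK` are the variables torchrun exports, while the backend choice and setup around them are illustrative:

```python
import os

import torch
import torch.distributed as dist

# torchrun exports both ranks: RANK is global across all nodes,
# LOCAL_RANK indexes the processes within a single node.
global_rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

dist.init_process_group(backend="nccl")
# Devices must be selected by *local* rank: on a multi-node run, the
# global rank of any process beyond node 0 exceeds the per-node
# device count.
torch.cuda.set_device(local_rank)
```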
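
A sketch of device-event timing with CUDA events; torch_npu mirrors this API under `torch.npu` in recent releases, though that is an assumption about the installed version:

```python
import torch

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
# ... P2P or collective call under test ...
end.record()

# Host-side timers stop before asynchronous kernels finish; the event
# pair measures elapsed time on the device timeline instead.
torch.cuda.synchronize()
elapsed_ms = start.elapsed_time(end)  # milliseconds
```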
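
A sketch of nanosecond host timing and the bandwidth figure derived from it; the helper name `bandwidth_gbps` is hypothetical:

```python
import time

def bandwidth_gbps(message_bytes: int, elapsed_ns: int) -> float:
    """Algorithmic bandwidth in GB/s: bytes moved over elapsed seconds."""
    return (message_bytes / 1e9) / (elapsed_ns / 1e9)

t0 = time.perf_counter_ns()
# ... operation under test (synchronize the device before stopping) ...
t1 = time.perf_counter_ns()

elapsed_ns = t1 - t0
avg_latency_ms = elapsed_ns / 1e6  # matches the avg_latency_ms column
```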
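
A sketch of accepting dtype on the command line so every torch dtype (int8/int16/uint8/...) is reachable; the `--dtype` flag name is an assumption:

```python
import argparse

import torch

def torch_dtype(name: str) -> torch.dtype:
    """Resolve a name such as 'int8' or 'float16' to a torch.dtype."""
    dtype = getattr(torch, name, None)
    if not isinstance(dtype, torch.dtype):
        raise argparse.ArgumentTypeError(f"unknown torch dtype: {name}")
    return dtype

parser = argparse.ArgumentParser()
parser.add_argument("--dtype", type=torch_dtype, default=torch.float32)
args = parser.parse_args()
```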
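
A sketch of a torch_npu environment check; it assumes the torch_npu plugin registers `torch.npu` on import, as recent versions do:

```python
def npu_available() -> bool:
    """True when torch_npu imports cleanly and at least one NPU is visible."""
    try:
        import torch_npu  # noqa: F401  -- registers the 'npu' device on torch
    except ImportError:
        return False
    import torch
    # Assumption: torch_npu patches torch with an 'npu' namespace.
    return hasattr(torch, "npu") and torch.npu.is_available()
```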