Little Optimization for RoPE Computation #1031

ds-hwang · 2025-03-04T23:27:19Z

In the existing _rotary_sinusoidal_positional_embeddings(), the same position_enc[:, :, 0::2] and position_enc[:, :, 1::2] computations were duplicated, followed by an interleaving split operation. This PR removes that redundant computation.
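To make the redundancy concrete, here is a minimal JAX sketch. It is an illustrative reconstruction, not the actual code touched by this PR: the function names `rope_table_redundant` and `rope_table_dedup`, the `theta` default, and the `[batch, seq]` shape of `positions` are all assumptions. The first version builds angles for all `dim` channels, where each frequency appears twice, and then slices the even and odd channels. The second builds each frequency once.

```python
import jax.numpy as jnp


def rope_table_redundant(positions, dim, theta=10000.0):
    """Hypothetical 'before': every frequency is computed twice, then sliced."""
    exponents = jnp.arange(dim, dtype=jnp.float32)
    # Channels 2k and 2k+1 share one frequency, so half of this work is duplicated.
    freq = jnp.power(theta, 2 * (exponents // 2) / dim)             # [dim]
    position_enc = positions.astype(jnp.float32)[..., None] / freq  # [batch, seq, dim]
    sin = jnp.sin(position_enc[:, :, 0::2])                         # [batch, seq, dim/2]
    cos = jnp.cos(position_enc[:, :, 1::2])                         # [batch, seq, dim/2]
    return jnp.concatenate([sin, cos], axis=-1)


def rope_table_dedup(positions, dim, theta=10000.0):
    """Hypothetical 'after': each frequency is computed once."""
    exponents = jnp.arange(0, dim, 2, dtype=jnp.float32)            # [dim/2]
    freq = jnp.power(theta, exponents / dim)
    angles = positions.astype(jnp.float32)[..., None] / freq        # [batch, seq, dim/2]
    return jnp.concatenate([jnp.sin(angles), jnp.cos(angles)], axis=-1)


positions = jnp.arange(8)[None, :]  # [batch=1, seq=8]
before = rope_table_redundant(positions, dim=16)
after = rope_table_dedup(positions, dim=16)
print(before.shape, bool(jnp.allclose(before, after, atol=1e-6)))
```

Both functions return the same `[batch, seq, dim]` table; the second simply never materializes the duplicated half.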

Additionally, I refactored the code using einops to improve readability. The benchmark results confirm that einops does not slow down execution on TPU/GPU.

Benchmark Results

Note: 8192/0 is the benchmark without JIT, while 8192/1 is the benchmark with JIT enabled. The results show that even without JIT, einops does not cause a slowdown in the code.
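For readers who want to reproduce the JIT versus non-JIT distinction on a small scale, here is a minimal timing sketch. It is not the QkvLinearBenchmark harness used for the numbers below; `rope_angles` and `time_call` are hypothetical helpers, and the shapes are arbitrary. The key detail is calling `block_until_ready()` so that JAX's asynchronous dispatch does not hide the actual compute time.

```python
import time

import jax
import jax.numpy as jnp


def rope_angles(positions, dim, theta=10000.0):
    # Stand-in workload: RoPE angles of shape [batch, seq, dim/2].
    freq = jnp.power(theta, jnp.arange(0, dim, 2, dtype=jnp.float32) / dim)
    return positions.astype(jnp.float32)[..., None] / freq


def time_call(fn, *args, iters=10):
    # Warm-up call; for a jitted function this also triggers compilation.
    fn(*args).block_until_ready()
    start = time.perf_counter()
    for _ in range(iters):
        fn(*args).block_until_ready()
    return (time.perf_counter() - start) / iters


positions = jnp.arange(1024)[None, :]
eager_fn = lambda p: rope_angles(p, 128)  # analogous to the ".../0" rows (no JIT)
jit_fn = jax.jit(eager_fn)                # analogous to the ".../1" rows (JIT)

eager_s = time_call(eager_fn, positions)
jit_s = time_call(jit_fn, positions)
print(f"eager: {eager_s * 1e3:.3f} ms/iter, jit: {jit_s * 1e3:.3f} ms/iter")
```

Absolute timings from a toy workload like this will differ from the table below; only the methodology carries over.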

TPU (v5p): Comparison between AS-IS and this PR

AS-IS
-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations          HBM
-------------------------------------------------------------------------------------------
QkvLinearBenchmark/1024/4/8192/0       13.4 ms         12.9 ms           56    546.21 MB
QkvLinearBenchmark/2048/4/8192/0       12.4 ms         11.1 ms           62   1143.38 MB
QkvLinearBenchmark/1024/4/8192/1       1.69 ms        0.068 ms        10071    546.21 MB
QkvLinearBenchmark/2048/4/8192/1       3.90 ms        0.080 ms         1000   1143.38 MB

This PR
-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations          HBM
-------------------------------------------------------------------------------------------
QkvLinearBenchmark/1024/4/8192/0       10.5 ms         10.2 ms           60    545.99 MB
QkvLinearBenchmark/2048/4/8192/0       11.0 ms         9.82 ms           69   1142.65 MB
QkvLinearBenchmark/1024/4/8192/1       1.68 ms        0.067 ms        10237    545.99 MB
QkvLinearBenchmark/2048/4/8192/1       3.83 ms        0.065 ms         1000   1142.65 MB

GPU (A100): Comparison between AS-IS and this PR

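AS-IS
-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations          HBM
-------------------------------------------------------------------------------------------
QkvLinearBenchmark/1024/4/8192/0       12.8 ms         12.8 ms           54    428.03 MB
QkvLinearBenchmark/2048/4/8192/0       13.0 ms         12.8 ms           55    848.05 MB
QkvLinearBenchmark/1024/4/8192/1      0.665 ms        0.129 ms         5545    428.03 MB
QkvLinearBenchmark/2048/4/8192/1       1.90 ms        0.160 ms         4661    848.05 MB

This PR
-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations          HBM
-------------------------------------------------------------------------------------------
QkvLinearBenchmark/1024/4/8192/0       11.4 ms         11.3 ms           61    428.03 MB
QkvLinearBenchmark/2048/4/8192/0       11.6 ms         11.4 ms           62    848.04 MB
QkvLinearBenchmark/1024/4/8192/1      0.631 ms        0.137 ms         5595    428.03 MB
QkvLinearBenchmark/2048/4/8192/1       1.85 ms        0.152 ms         4652    848.04 MB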

ds-hwang · 2025-03-04T23:27:48Z

@markblee could you take a look? From 1114


ds-hwang requested review from ruomingp, markblee and a team as code owners March 4, 2025 23:27

ds-hwang force-pushed the rope2 branch from 1d1d50b to 759e8dc Compare March 5, 2025 00:55

markblee approved these changes Mar 5, 2025

ds-hwang added this pull request to the merge queue Mar 5, 2025

Merged via the queue into main with commit 94d7fa3 Mar 5, 2025
11 checks passed

ds-hwang deleted the rope2 branch March 5, 2025 05:47
