Little Optimization for RoPE Computation #1031

Merged
merged 1 commit into from
Mar 5, 2025

ds-hwang (Contributor) commented Mar 4, 2025

In the existing _rotary_sinusoidal_positional_embeddings(), the same position_enc[:, :, 0::2] and position_enc[:, :, 1::2] computations were duplicated, followed by an interleaving split operation. This PR removes that redundant computation.
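One possible reading of the duplication described above, written as a minimal NumPy sketch rather than the code changed in this PR. The function names, the `theta` default, and the layout of the "before" table are assumptions made for illustration; the same indexing should carry over to `jax.numpy`.

```python
import numpy as np


def sin_cos_redundant(positions, dim, theta=10000.0):
    """Hypothetical 'before': full-width angle table, then even/odd slicing."""
    # Each frequency appears twice: exponents are 0, 0, 2/dim, 2/dim, ...
    exponents = (np.arange(dim) // 2) * 2 / dim
    position_enc = positions[..., None] / theta**exponents  # [batch, seq, dim]
    # Even and odd columns hold identical angles, so half the table is wasted.
    sin = np.sin(position_enc[:, :, 0::2])
    cos = np.cos(position_enc[:, :, 1::2])
    return sin, cos


def sin_cos_direct(positions, dim, theta=10000.0):
    """Hypothetical 'after': compute each frequency once at width dim // 2."""
    exponents = np.arange(0, dim, 2) / dim
    position_enc = positions[..., None] / theta**exponents  # [batch, seq, dim // 2]
    return np.sin(position_enc), np.cos(position_enc)


positions = np.arange(16, dtype=np.float64)[None, :]  # [batch=1, seq=16]
sin_a, cos_a = sin_cos_redundant(positions, dim=8)
sin_b, cos_b = sin_cos_direct(positions, dim=8)
print(np.allclose(sin_a, sin_b), np.allclose(cos_a, cos_b))
```

Both variants return the same `[batch, seq, dim // 2]` arrays; the direct version simply never materializes the duplicated columns.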

Additionally, I refactored the code using einops to improve readability. The benchmark results confirm that einops does not slow down execution on TPU/GPU.
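To make the interleaving concrete, here is a small NumPy sketch of applying rotary embeddings to adjacent feature pairs. It is not the PR's implementation: `apply_rope_interleaved` is a hypothetical name, and the `einops` patterns appear only in comments as an assumed equivalent spelling of the `repeat` and `stack`/`reshape` steps.

```python
import numpy as np


def apply_rope_interleaved(x, sin, cos):
    """Rotate adjacent feature pairs (x0, x1), (x2, x3), ... by the given angles.

    x: [batch, seq, dim]; sin, cos: [batch, seq, dim // 2].
    """
    # Duplicate each angle so both members of a pair see the same value.
    # Assumed einops spelling: repeat(sin, "b t d -> b t (d two)", two=2)
    sin_pos = np.repeat(sin, 2, axis=-1)
    cos_pos = np.repeat(cos, 2, axis=-1)
    # Build (-x1, x0, -x3, x2, ...) by stacking pair members and flattening.
    # Assumed einops spelling: rearrange(stacked, "b t d two -> b t (d two)")
    stacked = np.stack([-x[..., 1::2], x[..., 0::2]], axis=-1)
    rotated = stacked.reshape(x.shape)
    return x * cos_pos + rotated * sin_pos


batch, seq, dim = 2, 5, 8
rng = np.random.default_rng(0)
x = rng.standard_normal((batch, seq, dim))
positions = np.broadcast_to(np.arange(seq, dtype=np.float64), (batch, seq))
angles = positions[..., None] / 10000.0 ** (np.arange(0, dim, 2) / dim)
y = apply_rope_interleaved(x, np.sin(angles), np.cos(angles))
```

Because each pair is rotated by a plain 2-D rotation, `y` keeps the per-token norm of `x`, and position 0 (angle zero) is left unchanged.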

Benchmark Results

Note: 8192/0 is the benchmark without JIT, while 8192/1 is the benchmark with JIT enabled. The results show that even without JIT, einops does not cause a slowdown in the code.

  • TPU (v5p): Comparison between AS-IS and this PR

AS-IS
-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations          HBM
-------------------------------------------------------------------------------------------
QkvLinearBenchmark/1024/4/8192/0       13.4 ms         12.9 ms           56    546.21 MB
QkvLinearBenchmark/2048/4/8192/0       12.4 ms         11.1 ms           62   1143.38 MB
QkvLinearBenchmark/1024/4/8192/1       1.69 ms        0.068 ms        10071    546.21 MB
QkvLinearBenchmark/2048/4/8192/1       3.90 ms        0.080 ms         1000   1143.38 MB

This PR
-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations          HBM
-------------------------------------------------------------------------------------------
QkvLinearBenchmark/1024/4/8192/0       10.5 ms         10.2 ms           60    545.99 MB
QkvLinearBenchmark/2048/4/8192/0       11.0 ms         9.82 ms           69   1142.65 MB
QkvLinearBenchmark/1024/4/8192/1       1.68 ms        0.067 ms        10237    545.99 MB
QkvLinearBenchmark/2048/4/8192/1       3.83 ms        0.065 ms         1000   1142.65 MB

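  • GPU (A100): Comparison between AS-IS and this PR

AS-IS
-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations          HBM
-------------------------------------------------------------------------------------------
QkvLinearBenchmark/1024/4/8192/0       12.8 ms         12.8 ms           54    428.03 MB
QkvLinearBenchmark/2048/4/8192/0       13.0 ms         12.8 ms           55    848.05 MB
QkvLinearBenchmark/1024/4/8192/1      0.665 ms        0.129 ms         5545    428.03 MB
QkvLinearBenchmark/2048/4/8192/1       1.90 ms        0.160 ms         4661    848.05 MB

This PR
-------------------------------------------------------------------------------------------
Benchmark                                 Time             CPU   Iterations          HBM
-------------------------------------------------------------------------------------------
QkvLinearBenchmark/1024/4/8192/0       11.4 ms         11.3 ms           61    428.03 MB
QkvLinearBenchmark/2048/4/8192/0       11.6 ms         11.4 ms           62    848.04 MB
QkvLinearBenchmark/1024/4/8192/1      0.631 ms        0.137 ms         5595    428.03 MB
QkvLinearBenchmark/2048/4/8192/1       1.85 ms        0.152 ms         4652    848.04 MB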
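
For a quick read of the tables above, the relative change in the wall-clock "Time" column can be computed directly. This is only arithmetic on the reported numbers; the dictionary keys and the `percent_reduction` helper are names chosen here for illustration.

```python
# Wall-clock "Time" values in ms, copied from the tables: (AS-IS, this PR).
times_ms = {
    "TPU v5p 1024 no-JIT": (13.4, 10.5),
    "TPU v5p 2048 no-JIT": (12.4, 11.0),
    "TPU v5p 1024 JIT": (1.69, 1.68),
    "TPU v5p 2048 JIT": (3.90, 3.83),
    "GPU A100 1024 no-JIT": (12.8, 11.4),
    "GPU A100 2048 no-JIT": (13.0, 11.6),
    "GPU A100 1024 JIT": (0.665, 0.631),
    "GPU A100 2048 JIT": (1.90, 1.85),
}


def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


reductions = {name: percent_reduction(b, a) for name, (b, a) in times_ms.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% less time")
```

On these single reported runs, the non-JIT cases take roughly 11 to 22 percent less time, while the JIT cases change by about 5 percent or less.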

ds-hwang requested review from ruomingp, markblee and a team as code owners on March 4, 2025, 23:27
ds-hwang (Contributor, Author) commented Mar 4, 2025

@markblee could you take a look? From 1114

ds-hwang added this pull request to the merge queue on Mar 5, 2025
Merged via the queue into main with commit 94d7fa3 on Mar 5, 2025
11 checks passed
ds-hwang deleted the rope2 branch on March 5, 2025, 05:47