[Feature] Define backends and add Triton backend for Lora #3161

Fridge003 · 2025-01-27T03:10:08Z

Motivation

Current Lora modules relies on SGemm kernels provided by flashinfer to do the computation. However, Flashinfer is not optimized well on tall and thin matrices of Lora modules. What's more, the way LoraManager that manages segment indices and weight indices of input batch is inefficient. All these issues make Lora run slowly with SGLang.

Modifications

To improve efficiency of Lora, this PR makes the following modifications on the basis of PR draft #1728:

Define BaseLoraBackend, FlashInferLoraBackend and TritonLoraBackend classes, which discouple GEMM implementation of each backend from the forward logic of Lora modules. A new server arg lora-backend is added for controlling the backend.
Define BatchInfo class that packs [bs, seg_lens, seg_indptr, max_len, weight_indices] together. By attaching it to lora backend, it only needs to be set once at every batch forward.
Add triton kernels that can run GEMM more efficiently. Including sgemm kernel for lora a (large K, small N), sgemm kernel for lora b(large N, small K), and a fused kernel for qkv's lora_b modules.

Usage

A new argument lora-backend is added to server arguments. This argument can be either triton or flashinfer, indicating the backend to be chosen. Its default value is triton.

Accuracy Test

Accuracy test can be run with:

python test/srt/models/test_lora_backend.py

The code can pass accuracy test on both H100 and A6000 machine.

Benchmarking result

To do benchmarking for lora, run this command to launch server:

# Triton backend
python benchmark/lora/launch_server.py --max-loras-per-batch 4 --lora-backend triton

# Flashinfer backend
# python benchmark/lora/launch_server.py --max-loras-per-batch 4 --lora-backend flashinfer

# Base model without lora
# python benchmark/lora/launch_server.py --base-only

Then run this command to request test from client:

python benchmark/lora/lora_bench.py

Benchmark configurations:

base model: meta-llama/Llama-2-7b-hf
lora adapter: winddude/wizardLM-LlaMA-LoRA-7B
GPU: Nvidia H100
maximum number of serving loras: 4
number of requests: 50
input length: uniform random distribution on [1, 1024]
output length: uniform random distribution on [1, 128]

Backend	Total Throughput (tok/s)	Mean E2E Latency (ms)
Triton	2040.96	7165.67
Flashinfer	1606.97	9270.38
No Lora	3090.54	4776.19

Further Optimization

There are two main bottlenecks of Lora with current Triton backend:

On prefiling batches with long sequence, the lora process has to wait for prior non-lora kernels to complete, which takes a long time. I tried using multiple cuda streams, but the overhead of synchronization is much larger than the time saved.
Overhead of Triton's compiling process, which can only be solved by replacing Triton

The reward of autotuning is poor since sgemm on lora modules has low arithmetic intensity. The current kernels without autotuning are already fast enough.

The best way to optimize lora kernel is adding Cuda/Cutlass backend, so the time of triton compiling can be saved.

Checklist

Format your code according to the Code Formatting with Pre-Commit.
Add unit tests as outlined in the Running Unit Tests.
Update documentation / docstrings / example tutorials as needed, according to Writing Documentation.
Provide throughput / latency benchmark results and accuracy evaluation results as needed, according to Benchmark and Profiling.

HaiShaw · 2025-02-04T12:46:37Z

@Ying1123 we don't have flashinfer yet on ROCm, I found this merge causes a break on AMD.

zhyncs · 2025-02-04T12:48:40Z

@HaiShaw AMD CIs are crucial for preventing such issues from a process perspective.

zhyncs · 2025-02-04T12:49:18Z

@HaiShaw Also may you help fix the top of the main branch?

HaiShaw · 2025-02-04T12:50:13Z

@HaiShaw AMD CIs are crucial for preventing such issues from a process perspective.

Yes, let me push/press/push on it!!

Edenzzzz · 2025-02-04T16:17:10Z

python/sglang/srt/lora/lora.py

+        lora_a_output = self.lora_backend.run_lora_a_sgemm(x, self.A_buffer)
+        lora_output = self.lora_backend.run_lora_b_sgemm(
+            lora_a_output,
+            self.B_buffer[0],
+            base_output=base_output,
+            scaling=self.scaling,


Just curious, would there be a benefit in fusing these two ops?

Fusion of two neighboring Gemms will be really hard to implement, and its benefit is uncertain.

zhaochenyang20 · 2025-02-04T17:26:47Z

@Fridge003 Please check these.

zhaochenyang20 · 2025-02-04T20:54:41Z

I think it has been fixed from AMD people.

Fridge003 requested review from merrymercy, Ying1123, zhyncs, hnyls2002, ispobock and ByronHsu as code owners January 27, 2025 03:10

Fridge003 mentioned this pull request Jan 26, 2025

[Feature] Lora Optimization #2929

Open

12 tasks

Fridge003 changed the title ~~[Feature] Define Gemm backends and add Triton backend for Lora~~ [Feature] Define backends and add Triton backend for Lora Jan 27, 2025

Fridge003 force-pushed the lora_triton branch 5 times, most recently from 6a6dadd to 90a5123 Compare February 1, 2025 04:02

Ying1123 approved these changes Feb 4, 2025

View reviewed changes

Ying1123 and others added 9 commits February 3, 2025 21:06

add lora expand triton backend

5c1da34

Define LoraBackend class for management of kernels and batch information

002c60a

Reimplement triton sgemm kernel

d8f3855

Add lora backend test to suite

faa9590

Refacter Codes

eb1441d

Update doc and tests

0a469ef

Implement fused kernel for qkv's lora_b

5d0b4ae

Split sgemm for lora_a and lora_b

cf462d5

Reduct data movement overhead

bf7ab1f

Fridge003 force-pushed the lora_triton branch from 90a5123 to bf7ab1f Compare February 4, 2025 05:06

zhaochenyang20 merged commit 70817a7 into sgl-project:main Feb 4, 2025
15 checks passed

Fridge003 deleted the lora_triton branch February 4, 2025 06:31

HaiShaw mentioned this pull request Feb 4, 2025

[Bug] Fix #3161 break on ROCm #3289

Closed

5 tasks

Edenzzzz reviewed Feb 4, 2025

View reviewed changes

BruceXcluding mentioned this pull request Feb 5, 2025

[Bug] Lora Non-Flashinfer backend select error #3310

Closed

5 tasks

Fridge003 mentioned this pull request Feb 5, 2025

[Feature] lora serving performance #2372

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Feature] Define backends and add Triton backend for Lora #3161

[Feature] Define backends and add Triton backend for Lora #3161

Fridge003 commented Jan 27, 2025 •

edited

Loading

HaiShaw commented Feb 4, 2025

zhyncs commented Feb 4, 2025

zhyncs commented Feb 4, 2025

HaiShaw commented Feb 4, 2025

Edenzzzz Feb 4, 2025 •

edited

Loading

Fridge003 Feb 4, 2025

zhaochenyang20 commented Feb 4, 2025

zhaochenyang20 commented Feb 4, 2025

[Feature] Define backends and add Triton backend for Lora #3161

[Feature] Define backends and add Triton backend for Lora #3161

Conversation

Fridge003 commented Jan 27, 2025 • edited Loading

Motivation

Modifications

Usage

Accuracy Test

Benchmarking result

Further Optimization

Checklist

HaiShaw commented Feb 4, 2025

zhyncs commented Feb 4, 2025

zhyncs commented Feb 4, 2025

HaiShaw commented Feb 4, 2025

Edenzzzz Feb 4, 2025 • edited Loading

Choose a reason for hiding this comment

Fridge003 Feb 4, 2025

Choose a reason for hiding this comment

zhaochenyang20 commented Feb 4, 2025

zhaochenyang20 commented Feb 4, 2025

Fridge003 commented Jan 27, 2025 •

edited

Loading

Edenzzzz Feb 4, 2025 •

edited

Loading