[Feature] Define backends and add Triton backend for Lora #3161
Motivation
Current LoRA modules rely on the SGEMM kernels provided by FlashInfer to do the computation. However, FlashInfer is not well optimized for the tall and thin matrices of LoRA modules. What's more, the way `LoraManager` manages the segment indices and weight indices of an input batch is inefficient. All of these issues make LoRA run slowly with SGLang.

Modifications
To improve the efficiency of LoRA, this PR makes the following modifications on the basis of draft PR #1728:

- Adds `BaseLoraBackend`, `FlashInferLoraBackend` and `TritonLoraBackend` classes, which disentangle the GEMM implementation of each backend from the forward logic of the LoRA modules. A new server argument `lora-backend` is added for controlling the backend.
- Adds a `BatchInfo` class that packs `bs`, `seg_lens`, `seg_indptr`, `max_len` and `weight_indices` together. By attaching it to the LoRA backend, it only needs to be set once at every batch forward. (A minimal sketch of this structure is given after this list.)
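To make the shape of these changes concrete, here is a minimal sketch of how the backend classes and `BatchInfo` could fit together. The class and field names follow the description above; the method signatures, the dataclass, and all implementation details are assumptions for illustration, not the actual SGLang code.

```python
from dataclasses import dataclass
from typing import Optional

import torch


@dataclass
class BatchInfo:
    # Per-batch metadata, computed once per batch forward and shared by
    # every LoRA layer through the backend object.
    bs: int                       # batch size
    seg_lens: torch.Tensor        # token count of each request's segment
    seg_indptr: torch.Tensor      # prefix-sum offsets of the segments
    max_len: int                  # longest segment, used to size the kernel grid
    weight_indices: torch.Tensor  # which LoRA adapter each segment uses


class BaseLoraBackend:
    """Common interface; concrete backends supply the segmented GEMMs."""

    def __init__(self, name: str):
        self.name = name
        self.batch_info: Optional[BatchInfo] = None

    def set_batch_info(self, batch_info: BatchInfo) -> None:
        # Called once per batch so individual LoRA modules do not have to
        # recompute segment and weight indices in every layer.
        self.batch_info = batch_info

    def run_lora_a_sgemm(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError

    def run_lora_b_sgemm(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        raise NotImplementedError


class TritonLoraBackend(BaseLoraBackend):
    def run_lora_a_sgemm(self, x: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        # Would launch a Triton segmented-GEMM kernel driven by self.batch_info.
        raise NotImplementedError
```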
Usage

A new argument `lora-backend` is added to the server arguments. It can be either `triton` or `flashinfer`, indicating the backend to be used. Its default value is `triton`.
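For example, the backend can be selected when launching the server. The model and adapter paths below are placeholders; `--model-path` and `--lora-paths` are existing SGLang flags, and `--lora-backend` is the argument added by this PR:

```bash
python3 -m sglang.launch_server \
    --model-path <path-or-hf-id-of-base-model> \
    --lora-paths <path-to-lora-adapter> \
    --lora-backend triton
```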
Accuracy Test

Accuracy test can be run with the command sketched below.
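One plausible invocation, assuming the LoRA unit test lives at `test/srt/models/test_lora.py` (the path is an assumption and may differ in the repository):

```bash
python3 -m pytest test/srt/models/test_lora.py -s
```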
The code passes the accuracy test on both H100 and A6000 machines.
Benchmarking result
To benchmark LoRA, run this command to launch the server:
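A sketch of such a launch command; the model and adapter paths are placeholders, and the extra flags are assumptions about a typical LoRA serving setup rather than the exact configuration used in this PR:

```bash
python3 -m sglang.launch_server \
    --model-path <path-or-hf-id-of-base-model> \
    --lora-paths <path-to-lora-adapter> \
    --lora-backend triton \
    --disable-radix-cache
```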
Then run this command to send test requests from the client:
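A sketch of the client side, using SGLang's built-in serving benchmark; the dataset choice and length settings are placeholder assumptions:

```bash
python3 -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --num-prompts 200 \
    --random-input-len 256 \
    --random-output-len 32
```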
Benchmark configurations:
Further Optimization
There are two main bottlenecks of LoRA with the current Triton backend:

- The payoff from autotuning is small, since the SGEMM on LoRA modules has low arithmetic intensity (see the back-of-the-envelope estimate after this list); the current kernels without autotuning are already fast enough.
- The best way to optimize the LoRA kernels further is to add a CUDA/Cutlass backend, so that the Triton compilation time can be saved.
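To illustrate why autotuning brings little benefit, here is a rough arithmetic-intensity estimate for a tall-and-thin LoRA GEMM. The shapes (4096 tokens, hidden size 4096, rank 16, fp16) and the quoted H100 ridge point are illustrative assumptions, not measurements from this PR:

```python
# Arithmetic intensity of the LoRA-A projection: (s x h) @ (h x r).
s, h, r = 4096, 4096, 16      # tokens, hidden size, LoRA rank (illustrative)
bytes_per_elem = 2            # fp16

flops = 2 * s * h * r                                   # multiply-adds
bytes_moved = bytes_per_elem * (s * h + h * r + s * r)  # read A, read B, write C

print(flops / bytes_moved)    # ~16 FLOP/byte, far below the roughly 300 FLOP/byte
                              # fp16 ridge point of an H100, so the kernel is
                              # memory-bandwidth bound and tile-size tuning helps little.
```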
Checklist