
Commit

Merge branch 'main' into feature/accuracy-docs
shuaills authored Jan 30, 2025
2 parents dc413d8 + 9602c2a commit f413d7b
Showing 124 changed files with 24,891 additions and 553 deletions.
1 change: 1 addition & 0 deletions .clang-format-ignore
@@ -0,0 +1 @@
sgl-kernel/3rdparty/tensorrt_llm/*
2 changes: 1 addition & 1 deletion .github/workflows/execute-notebook.yml
@@ -42,7 +42,7 @@ jobs:
python -m ipykernel install --user --name python3 --display-name "Python 3"
- name: Execute notebooks
timeout-minutes: 30
timeout-minutes: 40
run: |
cd docs
make clean
2 changes: 1 addition & 1 deletion README.md
@@ -19,7 +19,7 @@
| [**Slides**](https://github.com/sgl-project/sgl-learning-materials?tab=readme-ov-file#slides) |

## News
- [2025/01] 🔥 SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeekSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3), [AMD blog](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html))
- [2025/01] 🔥 SGLang provides day one support for DeepSeek V3/R1 models on NVIDIA and AMD GPUs with DeepSeek-specific optimizations. ([instructions](https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3), [AMD blog](https://www.amd.com/en/developer/resources/technical-articles/amd-instinct-gpus-power-deepseek-v3-revolutionizing-ai-development-with-sglang.html))
- [2024/12] 🔥 v0.4 Release: Zero-Overhead Batch Scheduler, Cache-Aware Load Balancer, Faster Structured Outputs ([blog](https://lmsys.org/blog/2024-12-04-sglang-v0-4/)).
- [2024/09] v0.3 Release: 7x Faster DeepSeek MLA, 1.5x Faster torch.compile, Multi-Image/Video LLaVA-OneVision ([blog](https://lmsys.org/blog/2024-09-04-sglang-v0-3/)).
- [2024/07] v0.2 Release: Faster Llama3 Serving with SGLang Runtime (vs. TensorRT-LLM, vLLM) ([blog](https://lmsys.org/blog/2024-07-25-sglang-llama3/)).
10 changes: 9 additions & 1 deletion docs/backend/function_calling.ipynb
@@ -507,7 +507,15 @@
],
"metadata": {
"language_info": {
"name": "python"
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3"
}
},
"nbformat": 4,
48 changes: 26 additions & 22 deletions docs/backend/offline_engine_api.ipynb
@@ -37,7 +37,7 @@
"outputs": [],
"source": [
"# launch the offline engine\n",
"\n",
"from sglang.utils import stream_and_merge, async_stream_and_merge\n",
"import sglang as sgl\n",
"import asyncio\n",
"\n",
@@ -86,20 +86,22 @@
"outputs": [],
"source": [
"prompts = [\n",
" \"Hello, my name is\",\n",
" \"The capital of France is\",\n",
" \"The future of AI is\",\n",
" \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
" \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
" \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
"]\n",
"sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
"\n",
"print(\"\\n=== Testing synchronous streaming generation ===\")\n",
"sampling_params = {\n",
" \"temperature\": 0.2,\n",
" \"top_p\": 0.9,\n",
"}\n",
"\n",
"for prompt in prompts:\n",
" print(f\"\\nPrompt: {prompt}\")\n",
" print(\"Generated text: \", end=\"\", flush=True)\n",
"print(\"\\n=== Testing synchronous streaming generation with overlap removal ===\\n\")\n",
"\n",
" for chunk in llm.generate(prompt, sampling_params, stream=True):\n",
" print(chunk[\"text\"], end=\"\", flush=True)\n",
"for prompt in prompts:\n",
" print(f\"Prompt: {prompt}\")\n",
" merged_output = stream_and_merge(llm, prompt, sampling_params)\n",
" print(\"Generated text:\", merged_output)\n",
" print()"
]
},
@@ -117,9 +119,9 @@
"outputs": [],
"source": [
"prompts = [\n",
" \"Hello, my name is\",\n",
" \"The capital of France is\",\n",
" \"The future of AI is\",\n",
" \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
" \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
" \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
"]\n",
"\n",
"sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
@@ -152,24 +154,26 @@
"outputs": [],
"source": [
"prompts = [\n",
" \"Hello, my name is\",\n",
" \"The capital of France is\",\n",
" \"The future of AI is\",\n",
" \"Write a short, neutral self-introduction for a fictional character. Hello, my name is\",\n",
" \"Provide a concise factual statement about France’s capital city. The capital of France is\",\n",
" \"Explain possible future trends in artificial intelligence. The future of AI is\",\n",
"]\n",
"\n",
"sampling_params = {\"temperature\": 0.8, \"top_p\": 0.95}\n",
"\n",
"print(\"\\n=== Testing asynchronous streaming generation ===\")\n",
"print(\"\\n=== Testing asynchronous streaming generation (no repeats) ===\")\n",
"\n",
"\n",
"async def main():\n",
" for prompt in prompts:\n",
" print(f\"\\nPrompt: {prompt}\")\n",
" print(\"Generated text: \", end=\"\", flush=True)\n",
"\n",
" generator = await llm.async_generate(prompt, sampling_params, stream=True)\n",
" async for chunk in generator:\n",
" print(chunk[\"text\"], end=\"\", flush=True)\n",
" print()\n",
" # Replace direct calls to async_generate with our custom overlap-aware version\n",
" async for cleaned_chunk in async_stream_and_merge(llm, prompt, sampling_params):\n",
" print(cleaned_chunk, end=\"\", flush=True)\n",
"\n",
" print() # New line after each prompt\n",
"\n",
"\n",
"asyncio.run(main())"
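The two helpers imported above, `stream_and_merge` and `async_stream_and_merge`, come from `sglang.utils` and are used to strip text that repeats across streamed chunks. A minimal sketch of the overlap-trimming idea (not the library's actual implementation) might look like the following, assuming each streamed chunk is a dict whose `"text"` field may repeat text that was already emitted, as in the cells above:

```python
# Minimal sketch of overlap-aware stream merging; an illustration only,
# not the actual sglang.utils.stream_and_merge implementation.
import sglang as sgl


def stream_and_merge_sketch(llm, prompt, sampling_params):
    merged = ""
    for chunk in llm.generate(prompt, sampling_params, stream=True):
        text = chunk["text"]
        if text.startswith(merged):
            # Cumulative chunk: keep only the newly generated suffix.
            merged = text
        else:
            # Incremental chunk: append as-is.
            merged += text
    return merged


# Hypothetical usage (the model path is an example, not part of this commit):
# llm = sgl.Engine(model_path="meta-llama/Meta-Llama-3.1-8B-Instruct")
# print(stream_and_merge_sketch(llm, "The capital of France is", {"temperature": 0.2, "top_p": 0.9}))
```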
13 changes: 9 additions & 4 deletions docs/backend/speculative_decoding.ipynb
@@ -8,12 +8,17 @@
"\n",
"SGLang now provides an EAGLE-based speculative decoding option. The implementation aims to maximize speed and efficiency and is considered to be among the fastest in open-source LLM engines.\n",
"\n",
"To run the following tests or benchmarks, you also need to install [**cutex**](https://pypi.org/project/cutex/): \n",
"> ```bash\n",
"> pip install cutex\n",
"> ```\n",
"\n",
"### Performance Highlights\n",
"\n",
"- **Official EAGLE code** ([SafeAILab/EAGLE](https://github.com/SafeAILab/EAGLE)): ~200 tokens/s\n",
"- **Standard SGLang Decoding**: ~156 tokens/s\n",
"- **EAGLE Decoding in SGLang**: ~297 tokens/s\n",
"- **EAGLE Decoding in SGLang (w/ `torch.compile`)**: ~316 tokens/s\n",
"- Official EAGLE code ([SafeAILab/EAGLE](https://github.com/SafeAILab/EAGLE)): ~200 tokens/s\n",
"- Standard SGLang Decoding: ~156 tokens/s\n",
"- EAGLE Decoding in SGLang: ~297 tokens/s\n",
"- EAGLE Decoding in SGLang (w/ `torch.compile`): ~316 tokens/s\n",
"\n",
"All benchmarks below were run on a single H100."
]
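For readers who want to try EAGLE decoding directly, the sketch below shows one way to enable it on the offline engine. The speculative-decoding argument names, their values, and the draft-model path are assumptions based on SGLang's documentation around this release and may differ between versions.

```python
# Hedged sketch: enabling EAGLE speculative decoding on the offline engine.
# Argument names, values, and the draft-model path are assumptions and may
# need adjusting for your SGLang version.
import sglang as sgl

llm = sgl.Engine(
    model_path="meta-llama/Llama-2-7b-chat-hf",  # example target model
    speculative_algorithm="EAGLE",
    speculative_draft_model_path="lmsys/sglang-EAGLE-llama2-chat-7B",  # example EAGLE draft model
    speculative_num_steps=5,          # draft steps per speculation round
    speculative_eagle_topk=8,         # candidates kept per draft step
    speculative_num_draft_tokens=64,  # draft tokens verified per round
)

print(llm.generate("The capital of France is", {"temperature": 0.2, "max_new_tokens": 32}))
```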
1 change: 1 addition & 0 deletions docs/references/supported_models.md
@@ -78,6 +78,7 @@ Another valuable resource is the [vLLM Models Directory](https://github.com/vllm
To port a model from vLLM to SGLang, you can compare these two files: [SGLang Llama Implementation](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/llama.py) and [vLLM Llama Implementation](https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llama.py). This comparison will help you understand how to convert a model implementation from vLLM to SGLang. The major difference is the replacement of Attention with RadixAttention; the other parts are almost identical. Specifically,
- Replace vLLM's `Attention` with `RadixAttention`. Note that you need to pass `layer_id` all the way down to `RadixAttention` (see the sketch after this list).
- Replace vLLM's `LogitsProcessor` with SGLang's `LogitsProcessor`.
- Replace the multi-headed `Attention` of ViT with SGLang's `VisionAttention`.
- Replace other vLLM layers with SGLang layers (e.g., `RMSNorm`, `SiluAndMul`).
- Remove `Sample`.
- Change `forward()` functions, and add `forward_batch`.
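The sketch below makes the first bullet concrete: a ported attention block swaps vLLM's `Attention` for `RadixAttention` and threads `layer_id` down to it. It is schematic, modeled on the SGLang Llama implementation linked above; check the actual constructor and `forward_batch` plumbing against that file before relying on it.

```python
# Schematic sketch of the Attention -> RadixAttention swap; verify the real
# signatures against python/sglang/srt/models/llama.py before use.
import torch
from torch import nn
from sglang.srt.layers.radix_attention import RadixAttention


class PortedAttention(nn.Module):
    def __init__(self, num_heads: int, num_kv_heads: int, head_dim: int, layer_id: int):
        super().__init__()
        # layer_id must be passed all the way down so RadixAttention can
        # address its per-layer slice of the KV cache.
        self.attn = RadixAttention(
            num_heads,
            head_dim,
            scaling=head_dim**-0.5,
            num_kv_heads=num_kv_heads,
            layer_id=layer_id,
        )

    def forward(self, q: torch.Tensor, k: torch.Tensor, v: torch.Tensor, forward_batch):
        # forward_batch replaces vLLM's attention-metadata object and carries
        # the scheduling info SGLang's kernels need.
        return self.attn(q, k, v, forward_batch)
```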
5 changes: 4 additions & 1 deletion docs/start/install.md
@@ -5,6 +5,7 @@ You can install SGLang using any of the methods below.
## Method 1: With pip
```
pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install "sglang[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```

@@ -17,10 +18,11 @@ git clone -b v0.4.2 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install -e "python[all]" --find-links https://flashinfer.ai/whl/cu124/torch2.4/flashinfer/
```

Note: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions.
Note: Please check the [FlashInfer installation doc](https://docs.flashinfer.ai/installation.html) to install the proper version according to your PyTorch and CUDA versions. If you encounter an error like **ImportError: cannot import name `_grouped_size_compiled_for_decode_kernels`**, installing an older FlashInfer version such as 0.1.6 instead of the latest one may resolve it.

Note: For AMD ROCm systems with Instinct/MI GPUs, do the following instead:

@@ -30,6 +32,7 @@ git clone -b v0.4.2 https://github.com/sgl-project/sglang.git
cd sglang
pip install --upgrade pip
pip install sgl-kernel --force-reinstall --no-deps
pip install -e "python[all_hip]"
```

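After either method, a quick smoke test helps confirm the packages are importable and CUDA is visible; the snippet below is a minimal check and does not download a model.

```python
# Minimal post-install smoke test (no model download required).
import torch
import sglang

print("sglang version:", sglang.__version__)
print("torch version:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
```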
@@ -166,6 +166,12 @@ def _fwd_kernel(
def context_attention_fwd(
q, k, v, o, b_start_loc, b_seq_len, max_input_len, is_causal=True
):
"""
q, k, v: [b * s, head, head_dim]
b_start_loc: [b]
b_seq_len: [b]
o: [b * s, head, head_dim]
"""
if is_cuda_available and CUDA_CAPABILITY[0] > 8:
BLOCK = 128
else:
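The file name for this hunk is not shown above, so the import path in the sketch below is an assumption (SGLang keeps a Triton prefill-attention op at a similar location); the tensor shapes follow the docstring added in this hunk, and a CUDA device is required.

```python
# Hedged usage sketch for context_attention_fwd, following the shapes in the
# new docstring. The import path is an assumption and may differ.
import torch
from sglang.srt.layers.attention.triton_ops.prefill_attention import context_attention_fwd

b, s, head, head_dim = 2, 16, 8, 64  # tiny example sizes
q = torch.randn(b * s, head, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)
o = torch.empty_like(q)  # output buffer, written in place

b_start_loc = torch.tensor([0, s], device="cuda", dtype=torch.int32)  # start offset of each sequence
b_seq_len = torch.full((b,), s, device="cuda", dtype=torch.int32)     # length of each sequence

context_attention_fwd(q, k, v, o, b_start_loc, b_seq_len, max_input_len=s, is_causal=True)
print(o.shape)  # torch.Size([32, 8, 64])
```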
