Add a Doc about guide on nvidia jetson #3182 #3205
base: main
Conversation
Great work!
Step 1: Build FlashInfer from Source
------------------------------------

1. Clone the FlashInfer repository:
```bash
git clone -b v0.2.0 https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
```

2. Build FlashInfer:
```bash
python3 setup.py --verbose bdist_wheel --dist-dir ./
```

3. Install FlashInfer:
```bash
pip3 install flashinfer-0.2.0-py3-none-any.whl
```

Expected output:
```
Successfully installed flashinfer-0.2.0
```

Step 2: Build SGLang from Source
--------------------------------

Clone the SGLang repository:
```bash
git clone https://github.com/sgl-project/sglang
cd sglang
```

Build from source:
```bash
cd python
pip install -e .
```

Build the SGLang kernel:
```bash
cd sgl-kernel
python3 setup.py --verbose bdist_wheel --dist-dir ./
pip3 install sgl_kernel-0.0.2.post14-cp310-cp310-linux_aarch64.whl
```
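To confirm the wheels built above installed cleanly, you can try importing both packages. This is a minimal sketch; whether each package exposes a `__version__` attribute is an assumption, so the snippet falls back to "unknown" if it does not.
```python
# verify_install.py - quick sanity check for the wheels built above.
# Assumes both packages expose a __version__ attribute; prints "unknown" otherwise.
import flashinfer
import sglang

print("flashinfer:", getattr(flashinfer, "__version__", "unknown"))
print("sglang:", getattr(sglang, "__version__", "unknown"))
```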
* * * * *
Why install FlashInfer separately? Is a specific version required?
If not, this can simply link to the installation guide: [Installation Guide]
Launch the server:
```bash
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --device cuda \
  --dtype half \
  --attention-backend flashinfer \
  --mem-fraction-static 0.8 \
  --context-length 8192
```
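Once the server is up, a quick way to exercise it end to end is to send a request to its native `/generate` endpoint. A minimal sketch, assuming the server is listening on the default port 30000 and that the `requests` package is installed:
```python
# quick_check.py - smoke test against the running SGLang server.
# Assumes the default port 30000; adjust the URL if you changed --port.
import requests

resp = requests.post(
    "http://127.0.0.1:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"temperature": 0, "max_new_tokens": 16},
    },
)
resp.raise_for_status()
print(resp.json()["text"])
```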
--device cuda and --attention-backend flashinfer are the default settings.
By default, context-length takes its value from the model's config.json. Is there a special reason for 8192?
--dtype half: please explain why quantization is needed here.
Running Inference with Torch Native Backend
-------------------------------------------

Launch the server:
```bash
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --device cuda \
  --attention-backend torch_native \
  --mem-fraction-static 0.8 \
  --context-length 8192
```
Output:
```bash
[2025-01-26 23:26:39 TP0] Decode batch. #running-req: 1, #token: 221, token usage: 0.00, gen throughput (token/s): 8.04, #queue-req: 0
```

* * * * *

Running Inference with Triton Backend
-------------------------------------
Launch the server:
```bash
python -m sglang.launch_server \
  --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
  --device cuda \
  --attention-backend triton \
  --mem-fraction-static 0.8 \
  --context-length 8192
```
Output:
```bash
[2025-01-26 23:31:24 TP0] Decode batch. #running-req: 1, #token: 101, token usage: 0.00, gen throughput (token/s): 11.34, #queue-req:
```
This part is repetitive.
You can just mention switching backends with --attention-backend and comparing the performance.
* * * * *

Running Quantization with TorchAO
---------------------------------
Launch the server with the best configuration:
```bash
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --device cuda \
  --dtype bfloat16 \
  --attention-backend flashinfer \
  --mem-fraction-static 0.8 \
  --context-length 8192 \
  --torchao-config int4wo-128
```
This enables TorchAO's int4 weight-only quantization with a group size of 128.

Output:
```bash
[2025-01-27 00:06:47 TP0] Decode batch. #running-req: 1, #token: 115, token usage: 0.00, gen throughput (token/s): 30.84, #queue-req:
```
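Weight-only int4 quantization matters on Jetson mostly because of its limited unified memory. A rough back-of-the-envelope sketch of the weight footprint, assuming roughly 8B parameters and the group size of 128 used above (figures are approximations for illustration, not measured values):
```python
# Rough estimate of weight memory before/after int4 weight-only quantization.
params = 8e9           # ~8B parameters (Meta-Llama-3.1-8B-Instruct)
group_size = 128       # matches --torchao-config int4wo-128

bf16_gb = params * 2 / 1e9                   # 2 bytes per weight in bf16
int4_gb = params * 0.5 / 1e9                 # 4 bits per weight
scales_gb = (params / group_size) * 2 / 1e9  # one 2-byte scale per group

print(f"bf16 weights : {bf16_gb:.1f} GB")
print(f"int4 weights : {int4_gb + scales_gb:.1f} GB (incl. per-group scales)")
```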
Refer to the quantization documentation.
* * * * *

Structured Output with XGrammar
-------------------------------

Install XGrammar:
```bash
git clone --recursive https://github.com/mlc-ai/xgrammar.git && cd xgrammar
pre-commit install
mkdir build && cd build
cmake .. -G Ninja
ninja
cd ../python
python3 -m pip install .
```
Launch the server with XGrammar:
```bash
python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
  --device cuda \
  --dtype bfloat16 \
  --attention-backend triton \
  --mem-fraction-static 0.8 \
  --context-length 8192 \
  --torchao-config int4wo-128 \
  --grammar-backend xgrammar
```
Run a structured output script:
```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce yourself in JSON briefly."},
    ],
    temperature=0,
    max_tokens=500,
    stream=True,  # Enable streaming
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Output:
```bash
{
  "name": "Knowledge Assistant",
  "description": "A helpful AI assistant",
  "function": "Provide information and answer questions",
  "language": "English"
}
```
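The script above only asks for JSON in the prompt; to have XGrammar actually enforce a structure, a JSON schema can be attached to the request. This is a hedged sketch, assuming this SGLang build accepts an OpenAI-style `response_format` with a `json_schema` entry; check the structured output documentation for the exact field names. The schema itself is hypothetical and only for illustration.
```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

# Hypothetical schema used only for illustration.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "language": {"type": "string"},
    },
    "required": ["name", "language"],
}

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Introduce yourself in JSON briefly."}],
    temperature=0,
    max_tokens=200,
    # Assumed field layout; adjust to match the server's structured output API.
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "intro", "schema": schema},
    },
)
print(response.choices[0].message.content)
```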
* * * * *
Refer to the Structured output documentation.
Run a sample inference script:
Create a Python script (e.g., `inference.py`) with the following content:
```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[
        {"role": "user", "content": "A pizza is divided into 8 slices. If 3 people eat 2 slices each, how many slices are left?"},
    ],
    temperature=0,
    max_tokens=500,
    stream=True,  # Enable streaming
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Output:
```bash
<think>
Alright, so I've got this pizza problem here. Let me read it again to make sure I understand what's being asked. It says, "A pizza is divided into 8 slices. If 3 people eat 2 slices each, how many slices are left?" Okay, so we have 8 slices in total, and three people are each eating 2 slices. The question is asking how many slices are left after these three people have eaten their share.

First, I need to figure out the total number of slices that the three people will eat. Each person eats 2 slices, and there are 3 people. So, I can calculate the total slices eaten by multiplying the number of people by the number of slices each person eats. That would be 3 people times 2 slices per person, which equals 6 slices in total.

Now, since the pizza has 8 slices and 6 slices have been eaten, I need to find out how many slices are left. To do this, I'll subtract the total number of slices eaten from the total number of slices available. So, 8 slices minus 6 slices eaten equals 2 slices remaining.

Wait, let me double-check that. If each of the three people eats 2 slices, that's 2 + 2 + 2, which is indeed 6 slices. And since the pizza only has 8 slices to begin with, subtracting the 6 slices eaten leaves us with 2 slices. Yeah, that makes sense.

I'm also thinking about whether there's any possibility that the slices could be unevenly distributed or if there's something tricky with the way the pizza is divided. But the problem doesn't mention anything like that, so I assume that the slices are uniform and that each person eats exactly 2 whole slices each.

Another thing to consider is if everyone eats their slices simultaneously or one after another. If they eat simultaneously, then all 6 slices are eaten at once, leaving 2 slices. If they eat one after another, the result would still be the same because each person is eating 2 slices, regardless of the order in which they eat them. So, the total eaten remains 6 slices, leaving 2 slices.

I'm also pondering if there's any alternative method to approach this problem. Maybe instead of calculating the total eaten first, I could subtract the number of slices eaten by each person from the total. So, if I take the total slices (8) and subtract 2 slices for the first person, that leaves me with 6 slices. Then subtract another 2 slices for the second person, leaving 4 slices. Then subtract another 2 slices for the third person, leaving me with 2 slices. This method also confirms the same answer.

Both methods lead me to the same conclusion, which reinforces that the number of slices left is indeed 2.

I'm also thinking about real-life scenarios where this might apply. Maybe at a party, a pizza is cut into slices, and several people each take their share. Understanding how many slices are left helps in planning for the next course or ensuring that everyone gets a fair share.

Additionally, this problem could be extended by introducing variables and algebraic expressions to represent the number of people, slices per person, and total slices. For example, if P is the number of people, S is the number of slices per person, and T is the total number of slices, then the equation would be:

Total slices eaten = P * S

Slices left = T - (P * S)

Using this formula, plugging in the given values:

Slices left = 8 - (3 * 2) = 8 - 6 = 2

This algebraic approach provides a systematic way to solve similar problems in the future.

I also wonder what would happen if the number of people or slices per person changes. For instance, if there were 4 people each eating 2 slices, the total eaten would be 8 slices, leaving 0 slices. Or if there were 2 people eating 4 slices each, the total eaten would be 8 slices, leaving 0 slices. These variations could be interesting to explore as they showcase how the number of people and slices per person affect the number of slices left.

Furthermore, this problem could be visualized with diagrams or graphs to provide a more intuitive understanding. For example, a bar graph showing the number of slices eaten by each person or a pie chart representing the total slices eaten relative to the whole pizza. Visual aids can often make mathematical concepts more accessible, especially for younger learners or those new to problem-solving techniques.

In summary, by carefully analyzing the problem, using multiple methods to verify the solution, and considering extensions and visual representations, I'm confident that the number of slices left after 3 people each eat 2 slices is 2.

After carefully analyzing the problem and verifying through multiple methods and considerations, we can confidently determine the number of slices left after 3 people each eat 2
</think>
```
Performance metrics:
```bash
[2025-01-26 21:32:18 TP0] Decode batch. #running-req: 1, #token: 1351, token usage: 0.01, gen throughput (token/s): 11.19, #queue-req: 0
```
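These decode-batch lines are a convenient way to compare the attention backends shown earlier. A small sketch that pulls the generation throughput out of such a log line; the line format is copied from the samples above and may change between SGLang versions:
```python
import re

# Sample line copied from the server log above.
line = ("[2025-01-26 21:32:18 TP0] Decode batch. #running-req: 1, #token: 1351, "
        "token usage: 0.01, gen throughput (token/s): 11.19, #queue-req: 0")

# Extract the "gen throughput (token/s)" value for easy comparison across runs.
match = re.search(r"gen throughput \(token/s\): ([\d.]+)", line)
if match:
    print(f"decode throughput: {float(match.group(1)):.2f} token/s")
```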
* * * * *
Refer to the OpenAI APIs - Completions documentation.
Motivation
Add a step-by-step guide for using SGLang on the NVIDIA Jetson platform (#3182).
Modifications
Added a file replicating the guide from https://github.com/shahizat/SGLang-Jetson
Updated index.rst to include this file
Checklist