Add a Doc about guide on nvidia jetson #3182 #3205

Open
wants to merge 4 commits into base: main

Conversation


@lycanlancelot commented Jan 29, 2025

Motivation

To add a step-by-step guide for using SGLang on the NVIDIA Jetson platform (#3182)

Modifications

Added a file replicating the guide from https://github.com/shahizat/SGLang-Jetson
Updated index.rst to include this file

Checklist

@zhaochenyang20 self-assigned this Jan 29, 2025
@zhaochenyang20 added the documentation (Improvements or additions to documentation) label Jan 29, 2025
@shuaills (Collaborator) left a comment:

Great work!

Comment on lines +22 to +62
Step 1: Build FlashInfer from Source
------------------------------------

1. Clone the FlashInfer repository:
```bash
git clone -b v0.2.0 https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
```

2. Build FlashInfer:
```bash
python3 setup.py --verbose bdist_wheel --dist-dir ./
```
3. Install FlashInfer:
```bash
pip3 install flashinfer-0.2.0-py3-none-any.whl
```
Expected Output:
```
Successfully installed flashinfer-0.2.0
```
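
As a quick sanity check (a minimal sketch; it only assumes the wheel above was installed into the active Python environment), confirm the package imports and report the installed version:
```python
# Verify that the FlashInfer wheel built above is importable.
import importlib.metadata

import flashinfer  # raises ImportError if the wheel is missing

print("flashinfer version:", importlib.metadata.version("flashinfer"))
```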

Step 2: Build SGLang from Source
--------------------------------
Clone the SGLang repository:
```bash
git clone https://github.com/sgl-project/sglang
cd sglang
```
Build from source:
```bash
cd python
pip install -e .
```
Build the SGLang kernel:
```bash
cd sgl-kernel
python3 setup.py --verbose bdist_wheel --dist-dir ./
pip3 install sgl_kernel-0.0.2.post14-cp310-cp310-linux_aarch64.whl
```
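
Before moving on, a small hedged check (assuming the two installs above completed cleanly) is to import both packages:
```python
# Confirm that sglang and the locally built sgl_kernel wheel import cleanly.
import importlib.metadata

import sglang
import sgl_kernel

print("sglang version:", importlib.metadata.version("sglang"))
```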
* * * * *
Collaborator:

Why install FlashInfer separately? Is a specific version required?
If not, this can just point to the installation link: [Installation Guide]

Comment on lines +67 to +76
Launch the server:
```bash
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --device cuda \
    --dtype half \
    --attention-backend flashinfer \
    --mem-fraction-static 0.8 \
    --context-length 8192
```
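
Once the server reports it is ready, a minimal smoke test (a sketch only; it assumes the default port 30000 used throughout this guide) is to list the served model through the OpenAI-compatible API:
```python
# List the model served by the SGLang OpenAI-compatible endpoint.
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

for model in client.models.list().data:
    print("Serving:", model.id)
```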
Collaborator:

`--device cuda` and `--attention-backend flashinfer` are already the defaults.
By default, the context length is taken from the model's `config.json`; is there a special reason for 8192?
For `--dtype half`, please explain why the reduced precision is needed.

Comment on lines +152 to +185
Running Inference with Torch Native Backend
-------------------------------------------

Launch the server:
```bash
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --device cuda \
    --attention-backend torch_native \
    --mem-fraction-static 0.8 \
    --context-length 8192
```
Output
```bash
[2025-01-26 23:26:39 TP0] Decode batch. #running-req: 1, #token: 221, token usage: 0.00, gen throughput (token/s): 8.04, #queue-req: 0
```

* * * * *

Running Inference with Triton Backend
-------------------------------------
Launch the server:
```bash
python -m sglang.launch_server \
    --model-path deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --device cuda \
    --attention-backend triton \
    --mem-fraction-static 0.8 \
    --context-length 8192
```
Output
```bash
[2025-01-26 23:31:24 TP0] Decode batch. #running-req: 1, #token: 101, token usage: 0.00, gen throughput (token/s): 11.34, #queue-req:
```
Collaborator:

This part is repetitive.
You can just switch backends with `--attention-backend` and compare the performance.
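
For example, a rough client-side comparison could look like the sketch below, run once per backend (illustrative only; the prompt and token budget are arbitrary, and the URL assumes the defaults used in this guide):
```python
# Rough throughput estimate: time the streamed chunks of one completion.
# Relaunch the server with a different --attention-backend and rerun to compare.
import time

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

start = time.time()
chunks = 0
stream = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[{"role": "user", "content": "Explain KV caching in two sentences."}],
    temperature=0,
    max_tokens=200,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        chunks += 1  # roughly one token per streamed chunk
print(f"~{chunks / (time.time() - start):.1f} chunks/s")
```
The server-side `gen throughput (token/s)` log lines remain the more reliable number; this is just a quick end-to-end check.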

Comment on lines +187 to +205
* * * * *
Running quantization with TorchAO
-------------------------------------
Launch the server with the best configuration:
```bash
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --device cuda \
    --dtype bfloat16 \
    --attention-backend flashinfer \
    --mem-fraction-static 0.8 \
    --context-length 8192 \
    --torchao-config int4wo-128
```
This enables TorchAO's int4 weight-only quantization with a group size of 128.

```bash
[2025-01-27 00:06:47 TP0] Decode batch. #running-req: 1, #token: 115, token usage: 0.00, gen throughput (token/s): 30.84, #queue-req:
```
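
For context on what `int4wo-128` means, the sketch below applies the same int4 weight-only quantization with group size 128 to a standalone linear layer using TorchAO (illustrative only; the API names follow recent `torchao` releases and may differ in older versions, and the int4 kernels expect bfloat16 weights on a CUDA device):
```python
# Illustrative: int4 weight-only quantization with group size 128 via TorchAO,
# mirroring what --torchao-config int4wo-128 requests inside SGLang.
import torch
from torchao.quantization import int4_weight_only, quantize_

layer = torch.nn.Linear(4096, 4096, bias=False, dtype=torch.bfloat16, device="cuda")
quantize_(layer, int4_weight_only(group_size=128))  # packs weights to int4 in place

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="cuda")
print(layer(x).shape)  # activations stay in bfloat16; only weights are quantized
```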
Collaborator:

Refer to the quantization documentation.

Comment on lines +207 to +263
* * * * *
Structured output with XGrammar
-------------------------------

Install XGrammar:
```bash
git clone --recursive https://github.com/mlc-ai/xgrammar.git && cd xgrammar
pre-commit install
mkdir build && cd build
cmake .. -G Ninja
ninja
cd ../python
python3 -m pip install .
```
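A quick hedged check that the build succeeded (assuming the `pip install` above completed) before launching the server:
```python
# Verify the locally built XGrammar package is importable.
import importlib.metadata

import xgrammar  # raises ImportError if the build/install failed

print("xgrammar version:", importlib.metadata.version("xgrammar"))
```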
Launch the server with XGrammar:
```bash
python -m sglang.launch_server \
    --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
    --device cuda \
    --dtype bfloat16 \
    --attention-backend triton \
    --mem-fraction-static 0.8 \
    --context-length 8192 \
    --torchao-config int4wo-128 \
    --grammar-backend xgrammar
```
Run a structured output script:
```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Introduce yourself in JSON briefly."},
    ],
    temperature=0,
    max_tokens=500,
    stream=True,  # Enable streaming
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```
Output
```bash
{
"name": "Knowledge Assistant",
"description": "A helpful AI assistant",
"function": "Provide information and answer questions",
"language": "English"
}
```
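
Note that the script above does not actually constrain the output; the grammar backend only takes effect when a schema is supplied with the request. A hedged sketch of a schema-constrained call (it assumes the server's OpenAI-compatible API accepts the `json_schema` response format as described in SGLang's structured-output docs; the schema itself is made up for illustration):
```python
# Request output constrained to a JSON schema via the xgrammar backend.
import json

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "language": {"type": "string"}},
    "required": ["name", "language"],
}

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Introduce yourself in JSON briefly."}],
    temperature=0,
    max_tokens=200,
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "self_intro", "schema": schema},
    },
)
print(json.loads(response.choices[0].message.content))
```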
* * * * *
Collaborator:

Refer to the Structured output documentation.

Comment on lines +78 to +150
Run a sample inference script:
Create a Python script (e.g., `inference.py`) with the following content:
```python
import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    messages=[
        {"role": "user", "content": "A pizza is divided into 8 slices. If 3 people eat 2 slices each, how many slices are left?"},
    ],
    temperature=0,
    max_tokens=500,
    stream=True,  # Enable streaming
)

for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
```



Output

```bash
<think>
Alright, so I've got this pizza problem here. Let me read it again to make sure I understand what's being asked. It says, "A pizza is divided into 8 slices. If 3 people eat 2 slices each, how many slices are left?" Okay, so we have 8 slices in total, and three people are each eating 2 slices. The question is asking how many slices are left after these three people have eaten their share.

First, I need to figure out the total number of slices that the three people will eat. Each person eats 2 slices, and there are 3 people. So, I can calculate the total slices eaten by multiplying the number of people by the number of slices each person eats. That would be 3 people times 2 slices per person, which equals 6 slices in total.

Now, since the pizza has 8 slices and 6 slices have been eaten, I need to find out how many slices are left. To do this, I'll subtract the total number of slices eaten from the total number of slices available. So, 8 slices minus 6 slices eaten equals 2 slices remaining.

Wait, let me double-check that. If each of the three people eats 2 slices, that's 2 + 2 + 2, which is indeed 6 slices. And since the pizza only has 8 slices to begin with, subtracting the 6 slices eaten leaves us with 2 slices. Yeah, that makes sense.

I'm also thinking about whether there's any possibility that the slices could be unevenly distributed or if there's something tricky with the way the pizza is divided. But the problem doesn't mention anything like that, so I assume that the slices are uniform and that each person eats exactly 2 whole slices each.

Another thing to consider is if everyone eats their slices simultaneously or one after another. If they eat simultaneously, then all 6 slices are eaten at once, leaving 2 slices. If they eat one after another, the result would still be the same because each person is eating 2 slices, regardless of the order in which they eat them. So, the total eaten remains 6 slices, leaving 2 slices.

I'm also pondering if there's any alternative method to approach this problem. Maybe instead of calculating the total eaten first, I could subtract the number of slices eaten by each person from the total. So, if I take the total slices (8) and subtract 2 slices for the first person, that leaves me with 6 slices. Then subtract another 2 slices for the second person, leaving 4 slices. Then subtract another 2 slices for the third person, leaving me with 2 slices. This method also confirms the same answer.

Both methods lead me to the same conclusion, which reinforces that the number of slices left is indeed 2.

I'm also thinking about real-life scenarios where this might apply. Maybe at a party, a pizza is cut into slices, and several people each take their share. Understanding how many slices are left helps in planning for the next course or ensuring that everyone gets a fair share.

Additionally, this problem could be extended by introducing variables and algebraic expressions to represent the number of people, slices per person, and total slices. For example, if P is the number of people, S is the number of slices per person, and T is the total number of slices, then the equation would be:

Total slices eaten = P * S

Slices left = T - (P * S)

Using this formula, plugging in the given values:

Slices left = 8 - (3 * 2) = 8 - 6 = 2

This algebraic approach provides a systematic way to solve similar problems in the future.

I also wonder what would happen if the number of people or slices per person changes. For instance, if there were 4 people each eating 2 slices, the total eaten would be 8 slices, leaving 0 slices. Or if there were 2 people eating 4 slices each, the total eaten would be 8 slices, leaving 0 slices. These variations could be interesting to explore as they showcase how the number of people and slices per person affect the number of slices left.

Furthermore, this problem could be visualized with diagrams or graphs to provide a more intuitive understanding. For example, a bar graph showing the number of slices eaten by each person or a pie chart representing the total slices eaten relative to the whole pizza. Visual aids can often make mathematical concepts more accessible, especially for younger learners or those new to problem-solving techniques.

In summary, by carefully analyzing the problem, using multiple methods to verify the solution, and considering extensions and visual representations, I'm confident that the number of slices left after 3 people each eat 2 slices is 2.

After carefully analyzing the problem and verifying through multiple methods and considerations, we can confidently determine the number of slices left after 3 people each eat 2
</think>

```
Performance metrics
```bash
[2025-01-26 21:32:18 TP0] Decode batch. #running-req: 1, #token: 1351, token usage: 0.01, gen throughput (token/s): 11.19, #queue-req: 0
```
* * * * *
Collaborator:

Refer to the OpenAI APIs - Completions documentation.
