[Kernel] add triton fused moe kernel for gptq/awq #12185

jinzhen-lin · 2025-01-18T11:21:09Z

The current only option for using moe+gptq/awq is the Marlin kernel, but for the Marlin kernel, a single marlin_gemm_moe would launching num_experts CUDA kernels at least, while the fused_moe triton kernel only needs to launch one cuda kernel. This makes the Marlin kernel significantly slower than the fused_moe triton kernel.

This PR adds support for fused_moe triton kernel with gptq/awq.

Generation speed of deepseek-v3-awq (8*A100-SXM4-80GB, bs=1, short prompt)

	marlin moe kernel	triton fused moe kernel
w/o #12222	5.4tok/s	10.0tok/s
w/ #12222	11.1tok/s	29.6 tok/s

Note:

[Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) #12222 enable cuda graph and add shared memory moe_align_block_size kernel support for deepseek-v3
to enable this kernel

python -m vllm.entrypoints.openai.api_server \
    --served-model-name model \
    --model cognitivecomputations/DeepSeek-V3-AWQ \
    --tensor-parallel-size 8 \
    --trust-remote-code \
    --max-model-len 24576 \
    --dtype half \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.96 \
    --quantization moe_wna16

github-actions · 2025-01-18T11:21:20Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

Signed-off-by: Jinzhen Lin <[email protected]>

casper-hansen · 2025-01-20T11:17:29Z

@mgoin @robertgshaw2-redhat Could we expedite this PR + #12036 (not sure if #12204 is needed too or has overlap) now that DeepSeek has released their full lineup?

jinzhen-lin · 2025-01-20T12:51:39Z

@mgoin @robertgshaw2-redhat Could we expedite this PR + #12036 (not sure if #12204 is needed too or has overlap) now that DeepSeek has released their full lineup?

I created a new PR with better moe_align_block_size just now, you can take a look at it. #12222

casper-hansen · 2025-01-20T14:17:17Z

I think this PR could be closed in favor of #12222. Thanks for your work @jinzhen-lin

jinzhen-lin · 2025-01-20T15:04:38Z

I think this PR could be closed in favor of #12222. Thanks for your work @jinzhen-lin

#12222 is an optimiztion over #12036 or #12204, it can be combined with this PR to get a better performance.

mgoin · 2025-01-20T15:29:44Z

Thank you for the work! We will take a look now

vllm/model_executor/layers/fused_moe/fused_moe.py

vllm/model_executor/layers/quantization/moe_quant_int.py

vllm/model_executor/layers/fused_moe/fused_moe.py

mgoin · 2025-01-20T19:20:40Z

Considering that this is allowing for "another option" to run quantized moe models, maybe we should consider writing a documentation page specifically for moe quantization.

I think the best case for this kernel to be used more broadly would be to have a heuristic on the number of experts or some configuration to decide whether to use the triton or marlin kernel

Signed-off-by: Jinzhen Lin <[email protected]>

jinzhen-lin · 2025-01-21T06:15:51Z

Considering that this is allowing for "another option" to run quantized moe models, maybe we should consider writing a documentation page specifically for moe quantization.

I think the best case for this kernel to be used more broadly would be to have a heuristic on the number of experts or some configuration to decide whether to use the triton or marlin kernel

I test with small moe model (https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4) just now, triton kernel seems much faster than marlin kernel too. Besides, marlin kernel seems generate wrong result for Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4.

Test result on A100 * 1:

marlin kernel:

$time curl -X POST http://127.0.0.1:8000/v1/chat/completions     -H 'Content-Type: application/json'     -d '{ "model": "model", "temperature": 0.0, "messages": [ { "role": "user", "content": "write a very long article" } ], "stream": false, "max_tokens": 512, "min_tokens": 512}'
{"id":"chatcmpl-99d4b90d4ee14c15a82bd278b3cbcfd1","object":"chat.completion","created":1737439221,"model":"model","choices":[{"index":0,"message":{"role":"assistant","content":"数，数，数，数，数数，数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数数","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":24,"total_tokens":536,"completion_tokens":512,"prompt_tokens_details":null},"prompt_logprobs":null}
real    0m6.618s
user    0m0.002s
sys     0m0.003s

triton kernel

$time curl -X POST http://127.0.0.1:8000/v1/chat/completions     -H 'Content-Type: application/json'     -d '{ "model": "model", "temperature": 0.0, "messages": [ { "role": "user", "content": "write a very long article" } ], "stream": false, "max_tokens": 512, "min_tokens": 512}'
{"id":"chatcmpl-d2f6f9d66c4b40c0b7dad3e1c9fc3d74","object":"chat.completion","created":1737439295,"model":"model","choices":[{"index":0,"message":{"role":"assistant","content":"The Benefits of Regular Exercise: A Comprehensive Guide\n\nIntroduction\n\nRegular exercise is an essential component of a healthy lifestyle. It not only helps maintain a healthy weight, but also has numerous other benefits for both physical and mental health. In this article, we will explore the various benefits of regular exercise, including improved cardiovascular health, increased strength and endurance, better mental health, and a reduced risk of chronic diseases. We will also discuss the different types of exercises and how to incorporate them into your routine for maximum effectiveness.\n\nImproved Cardiovascular Health\n\nRegular exercise strengthens the heart and improves its efficiency, reducing the risk of heart disease. Engaging in activities like brisk walking, running, cycling, or swimming can help lower blood pressure, cholesterol levels, and triglyceride levels. This, in turn, reduces the risk of heart attack and stroke. Additionally, regular exercise can also help maintain a healthy weight, which further reduces the strain on the heart.\n\nIncreased Strength and Endurance\n\nExercise helps build muscle strength and endurance, which is crucial for maintaining independence and mobility as we age. Regular strength training, such as weightlifting or bodyweight exercises, can help increase muscle mass, improve bone density, and enhance overall physical performance. This, in turn, can lead to an increased sense of well-being and confidence.\n\nBetter Mental Health\n\nExercise has been shown to have a positive impact on mental health. Physical activity releases endorphins, which are natural mood-boosting chemicals in the brain. Regular exercise can help reduce symptoms of depression and anxiety, improve self-esteem, and increase overall happiness. Additionally, engaging in activities like yoga or meditation can help reduce stress and promote relaxation.\n\nReduced Risk of Chronic Diseases\n\nRegular exercise can significantly reduce the risk of developing chronic diseases such as type 2 diabetes, certain types of cancer, and osteoporosis. Physical activity helps regulate blood sugar levels, which is particularly beneficial for those with diabetes. Exercise also helps maintain a healthy weight, which reduces the risk of developing these diseases. Furthermore, regular physical activity can improve bone density, reducing the risk of osteoporosis.\n\nIncorporating Exercise into Your Routine\n\nTo maximize the benefits of exercise, it's essential to incorporate a variety of activities into your routine. This can include:\n\n1. Cardiovascular exercises: Activities like running, cycling, or swimming can help improve cardiovascular health and burn calories.\n2. Strength training: Incorporating weightlifting or bodyweight exercises can help build muscle mass, increase bone density, and improve overall physical performance.\n3. Flexibility and balance exercises","tool_calls":[]},"logprobs":null,"finish_reason":"length","stop_reason":null}],"usage":{"prompt_tokens":24,"total_tokens":536,"completion_tokens":512,"prompt_tokens_details":null},"prompt_logprobs":null}
real    0m4.016s
user    0m0.000s
sys     0m0.005s

Maybe we should set triton kernel as default moe gptq/awq kernel? But I am not sure how to do this, gptq-marlin-moe is a part of gpt-marlin quanzation method, if I change moe kernel of gptq-marlin method, user cannot use gptq-marlin-moe anyway. Is that ok?

jinzhen-lin · 2025-01-22T16:37:35Z

@jinzhen-lin I tried loading an awq mixtral model and it failed to pass the right kwargs through to AWQMarlin

vllm serve hugging-quants/Mixtral-8x7B-Instruct-v0.1-AWQ-INT4 --quantization moe_wna16 
...
  File "/home/mgoin/code/vllm/vllm/model_executor/models/mixtral_quant.py", line 197, in __init__
    self.qkv_proj = QKVParallelLinear(
                    ^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/model_executor/layers/linear.py", line 723, in __init__
    super().__init__(input_size=input_size,
  File "/home/mgoin/code/vllm/vllm/model_executor/layers/linear.py", line 291, in __init__
    super().__init__(input_size, output_size, skip_bias_add, params_dtype,
  File "/home/mgoin/code/vllm/vllm/model_executor/layers/linear.py", line 177, in __init__
    self.quant_method = quant_config.get_quant_method(self,
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/mgoin/code/vllm/vllm/model_executor/layers/quantization/moe_wna16.py", line 144, in get_quant_method
    return quant_method_cls(quant_config_cls(self.full_config))
                            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: AWQMarlinConfig.__init__() missing 3 required positional arguments: 'group_size', 'zero_point', and 'lm_head_quantized'

Since many of the config initializers don't have extra kwargs, you will likely need to check the named args of each initializer to attempt to prune the full_config before passing in unzipped

Sorry, the commit serveral hours ago introduced this bug. It should be quant_config_cls.from_config(self.full_config).

Signed-off-by: Jinzhen Lin <[email protected]>

mgoin · 2025-01-22T17:11:37Z

Thank you, it seems to work fine now.

I ran a 128/128 benchmark at 10QPS for the mixtral awq model on H100 and found that the marlin kernels were more performant. In the future we could make a heuristic to choose the kernel based on model configuration, but for now let's keep the kernel as opt-in

awq_marlin

vllm serve hugging-quants/Mixtral-8x7B-Instruct-v0.1-AWQ-INT4 --disable-log-requests

python benchmarks/benchmark_serving.py --backend openai-chat --base-url http://0.0.0.0:8000/v1 --endpoint /chat/completions --model hugging-quants/Mixtral-8x7B-Instruct-v0.1-AWQ-INT4 --dataset-name random --random-input-len 128 --random-output-len 128  --num_prompts 300 --request-rate 10
INFO 01-22 17:07:43 __init__.py:179] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url='http://0.0.0.0:8000/v1', host='localhost', port=8000, endpoint='/chat/completions', dataset=None, dataset_name='random', dataset_path=None, max_concurrency=None, model='hugging-quants/Mixtral-8x7B-Instruct-v0.1-AWQ-INT4', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=300, logprobs=None, request_rate=10.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=128, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:34<00:00,  8.73it/s]
============ Serving Benchmark Result ============
Successful requests:                     300       
Benchmark duration (s):                  34.37     
Total input tokens:                      38400     
Total generated tokens:                  27905     
Request throughput (req/s):              8.73      
Output token throughput (tok/s):         811.82    
Total Token throughput (tok/s):          1928.96   
---------------Time to First Token----------------
Mean TTFT (ms):                          101.11    
Median TTFT (ms):                        97.63     
P99 TTFT (ms):                           270.17    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          72.93     
Median TPOT (ms):                        81.64     
P99 TPOT (ms):                           94.08     
---------------Inter-token Latency----------------
Mean ITL (ms):                           72.38     
Median ITL (ms):                         49.60     
P99 ITL (ms):                            309.09    
==================================================

moe_wna16

vllm serve hugging-quants/Mixtral-8x7B-Instruct-v0.1-AWQ-INT4 --disable-log-requests --quantization moe_wna16

python benchmarks/benchmark_serving.py --backend openai-chat --base-url http://0.0.0.0:8000/v1 --endpoint /chat/completions --model hugging-quants/Mixtral-8x7B-Instruct-v0.1-AWQ-INT4 --dataset-name random --random-input-len 128 --random-output-len 128  --num_prompts 300 --request-rate 10
INFO 01-22 17:04:51 __init__.py:179] Automatically detected platform cuda.
Namespace(backend='openai-chat', base_url='http://0.0.0.0:8000/v1', host='localhost', port=8000, endpoint='/chat/completions', dataset=None, dataset_name='random', dataset_path=None, max_concurrency=None, model='hugging-quants/Mixtral-8x7B-Instruct-v0.1-AWQ-INT4', tokenizer=None, best_of=1, use_beam_search=False, num_prompts=300, logprobs=None, request_rate=10.0, burstiness=1.0, seed=0, trust_remote_code=False, disable_tqdm=False, profile=False, save_result=False, metadata=None, result_dir=None, result_filename=None, ignore_eos=False, percentile_metrics='ttft,tpot,itl', metric_percentiles='99', goodput=None, sonnet_input_len=550, sonnet_output_len=150, sonnet_prefix_len=200, sharegpt_output_len=None, random_input_len=128, random_output_len=128, random_range_ratio=1.0, random_prefix_len=0, hf_subset=None, hf_split=None, hf_output_len=None, tokenizer_mode='auto')
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
Traffic request rate: 10.0
Burstiness factor: 1.0 (Poisson process)
Maximum request concurrency: None
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 300/300 [00:37<00:00,  8.02it/s]
============ Serving Benchmark Result ============
Successful requests:                     300       
Benchmark duration (s):                  37.40     
Total input tokens:                      38400     
Total generated tokens:                  28139     
Request throughput (req/s):              8.02      
Output token throughput (tok/s):         752.29    
Total Token throughput (tok/s):          1778.90   
---------------Time to First Token----------------
Mean TTFT (ms):                          207.58    
Median TTFT (ms):                        193.52    
P99 TTFT (ms):                           433.49    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          140.24    
Median TPOT (ms):                        142.47    
P99 TPOT (ms):                           218.78    
---------------Inter-token Latency----------------
Mean ITL (ms):                           137.56    
Median ITL (ms):                         85.32     
P99 ITL (ms):                            715.01    
==================================================

casper-hansen · 2025-01-22T17:19:47Z

Thank you, it seems to work fine now.

I ran a 128/128 benchmark at 10QPS for the mixtral awq model on H100 and found that the marlin kernels were more performant. In the future we could make a heuristic to choose the kernel based on model configuration, but for now let's keep the kernel as opt-in

This makes sense since Mixtral has few experts, excited to get this into main to test it out! The main thing I see optimized here is the number of kernel launches. It should still be more performant higher number of experts, not sure where the exact threshold is but 32 or 64 is probably a good minimum.

Signed-off-by: Jinzhen Lin <[email protected]>

jinzhen-lin · 2025-01-29T14:04:03Z

@mgoin All checks have passed. Can you merge it?

mgoin

Thanks for getting this over the line!

Signed-off-by: Isotr0py <[email protected]>

mgoin · 2025-02-11T23:46:43Z

Hello @jinzhen-lin I found there are correctness issues when trying to use this method with DeepSeek MoE AWQ models. I started working on a PR (#13119) to enable awq marlin to be used for this model by falling back to awq for unsupported layers.

Here is the eval with moe_wna16 enabled

lm_eval --model vllm --model_args pretrained=TechxGenus/DeepSeek-Coder-V2-Lite-Instruct-AWQ,quantization=moe_wna16 --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto   
...
vllm (pretrained=TechxGenus/DeepSeek-Coder-V2-Lite-Instruct-AWQ,quantization=moe_wna16,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.0842|±  |0.0076|
|     |       |strict-match    |     5|exact_match|↑  |0.0478|±  |0.0059|

Compare this to running with the default awq_marlin backend

lm_eval --model vllm --model_args pretrained=TechxGenus/DeepSeek-Coder-V2-Lite-Instruct-AWQ --trust_remote_code --tasks gsm8k --num_fewshot 5 --batch_size auto
...
vllm (pretrained=TechxGenus/DeepSeek-Coder-V2-Lite-Instruct-AWQ,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.7111|±  |0.0125|
|     |       |strict-match    |     5|exact_match|↑  |0.6816|±  |0.0128|

As a smoke test with a model you already tried, I ran an eval with the Qwen MoE GPTQ model and that did seem to eval perfectly fine. So I think it is just a bad case.

vllm (pretrained=Qwen/Qwen1.5-MoE-A2.7B-Chat-GPTQ-Int4,quantization=moe_wna16,trust_remote_code=True), gen_kwargs: (None), limit: None, num_fewshot: 5, batch_size: auto
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.4708|±  |0.0137|
|     |       |strict-match    |     5|exact_match|↑  |0.3336|±  |0.0130|

Could you please help me debug this issue? This is necessary to enable the moe_wna16 kernel to be used in more places.

jinzhen-lin requested review from tlrmchlsmth and WoosukKwon as code owners January 18, 2025 11:21

jinzhen-lin added 7 commits January 18, 2025 20:07

add moe_quant_int quantization method

2f3ed3b

Signed-off-by: Jinzhen Lin <[email protected]>

use tl.float32 to dequantize

97f18ef

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

0530452

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

fb7bba5

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

4bd2c31

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

ac8ae24

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

87e191f

Signed-off-by: Jinzhen Lin <[email protected]>

jinzhen-lin force-pushed the triton_fused_moe_int4 branch from 91b41c6 to 87e191f Compare January 18, 2025 12:08

jinzhen-lin added 2 commits January 18, 2025 20:11

fix format error

29df4d0

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

99f23f2

Signed-off-by: Jinzhen Lin <[email protected]>

jinzhen-lin force-pushed the triton_fused_moe_int4 branch from 21c1d8d to 99f23f2 Compare January 18, 2025 12:14

mgoin self-requested a review January 18, 2025 21:47

fix error

15ae02b

Signed-off-by: Jinzhen Lin <[email protected]>

jinzhen-lin force-pushed the triton_fused_moe_int4 branch from 55102d9 to 15ae02b Compare January 19, 2025 06:44

jinzhen-lin mentioned this pull request Jan 20, 2025

[Kernel] optimize moe_align_block_size for cuda graph and large num_experts (e.g. DeepSeek-V3) #12222

Merged

mgoin reviewed Jan 20, 2025

View reviewed changes

tlrmchlsmth reviewed Jan 20, 2025

View reviewed changes

vllm/model_executor/layers/fused_moe/fused_moe.py Show resolved Hide resolved

jinzhen-lin added 3 commits January 21, 2025 10:06

fix use_int4_w4a16 typo

ed878d9

Signed-off-by: Jinzhen Lin <[email protected]>

moe_quant_int -> moe_wna16

28b49ff

Signed-off-by: Jinzhen Lin <[email protected]>

add comment for gptq/awq weight conversion

218f31c

Signed-off-by: Jinzhen Lin <[email protected]>

jinzhen-lin added 2 commits January 23, 2025 00:38

fix bug

8d43a53

Signed-off-by: Jinzhen Lin <[email protected]>

fix mypy error

b56f6a0

Signed-off-by: Jinzhen Lin <[email protected]>

mgoin approved these changes Jan 22, 2025

View reviewed changes

mgoin added quantization moe ready ONLY add when PR is ready to merge/full CI is needed labels Jan 22, 2025

mgoin and others added 10 commits January 22, 2025 19:55

Merge branch 'main' into triton_fused_moe_int4

df887e2

Merge branch 'main' into triton_fused_moe_int4

c0315ac

make compliable with main

78d7035

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

7512143

Signed-off-by: Jinzhen Lin <[email protected]>

fix description

8dfb235

Signed-off-by: Jinzhen Lin <[email protected]>

Merge branch 'main' into triton_fused_moe_int4

073d67e

add description

9f944ed

Signed-off-by: Jinzhen Lin <[email protected]>

Merge branch 'main' into triton_fused_moe_int4

b4fff04

fix ci error

f125988

Signed-off-by: Jinzhen Lin <[email protected]>

fix format error

ea39702

Signed-off-by: Jinzhen Lin <[email protected]>

mgoin approved these changes Jan 29, 2025

View reviewed changes

mgoin merged commit 27b78c7 into vllm-project:main Jan 29, 2025
52 checks passed

rasmith pushed a commit to rasmith/vllm that referenced this pull request Jan 30, 2025

[Kernel] add triton fused moe kernel for gptq/awq (vllm-project#12185)

6126fa4

LagPixelLOL mentioned this pull request Feb 1, 2025

[Bug]: Assertion Error When Using moe_wna16 #12647

Closed

1 task

Isotr0py pushed a commit to Isotr0py/vllm that referenced this pull request Feb 2, 2025

[Kernel] add triton fused moe kernel for gptq/awq (vllm-project#12185)

52babdc

Signed-off-by: Isotr0py <[email protected]>

Xu-Chen mentioned this pull request Feb 7, 2025

[Feature] add support for deepseek v3 gptq / awq sgl-project/sglang#2706

Open

2 tasks

NickLucche pushed a commit to NickLucche/vllm that referenced this pull request Feb 7, 2025

[Kernel] add triton fused moe kernel for gptq/awq (vllm-project#12185)

abd54c3

ShangmingCai pushed a commit to ShangmingCai/vllm that referenced this pull request Feb 10, 2025

[Kernel] add triton fused moe kernel for gptq/awq (vllm-project#12185)

0258e45

Xu-Chen mentioned this pull request Feb 10, 2025

Fix deepseek awq v3 sgl-project/sglang#3450

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Kernel] add triton fused moe kernel for gptq/awq #12185

[Kernel] add triton fused moe kernel for gptq/awq #12185

jinzhen-lin commented Jan 18, 2025 •

edited by github-actions bot

Loading

github-actions bot commented Jan 18, 2025

casper-hansen commented Jan 20, 2025

jinzhen-lin commented Jan 20, 2025

casper-hansen commented Jan 20, 2025

jinzhen-lin commented Jan 20, 2025

mgoin commented Jan 20, 2025

mgoin commented Jan 20, 2025

jinzhen-lin commented Jan 21, 2025

jinzhen-lin commented Jan 22, 2025

mgoin commented Jan 22, 2025

casper-hansen commented Jan 22, 2025

jinzhen-lin commented Jan 29, 2025

mgoin left a comment

mgoin commented Feb 11, 2025

[Kernel] add triton fused moe kernel for gptq/awq #12185

[Kernel] add triton fused moe kernel for gptq/awq #12185

Conversation

jinzhen-lin commented Jan 18, 2025 • edited by github-actions bot Loading

github-actions bot commented Jan 18, 2025

casper-hansen commented Jan 20, 2025

jinzhen-lin commented Jan 20, 2025

casper-hansen commented Jan 20, 2025

jinzhen-lin commented Jan 20, 2025

mgoin commented Jan 20, 2025

mgoin commented Jan 20, 2025

jinzhen-lin commented Jan 21, 2025

jinzhen-lin commented Jan 22, 2025

mgoin commented Jan 22, 2025

awq_marlin

moe_wna16

casper-hansen commented Jan 22, 2025

jinzhen-lin commented Jan 29, 2025

mgoin left a comment

Choose a reason for hiding this comment

mgoin commented Feb 11, 2025

jinzhen-lin commented Jan 18, 2025 •

edited by github-actions bot

Loading