
[Bug] 'tensor_model_parallel_all_reduce' is not defined #2931

Open
bakch92 opened this issue Jan 17, 2025 · 2 comments

Comments


bakch92 commented Jan 17, 2025

Describe the bug

I attempted to serve a Phi-4 LoRA fine-tuned model with a tensor parallel size of 2 using the sglang framework, but the following error occurred.

[Error Log]

[2025-01-17 01:51:55 TP0] LoRA manager ready.
[2025-01-17 01:51:57 TP1] Load weight end. type=Phi3ForCausalLM, dtype=torch.float16, avail mem=15.70 GB
[2025-01-17 01:52:00 TP1] LoRA manager ready.
[2025-01-17 01:52:00 TP0] Memory pool end. avail mem=39.54 GB
[2025-01-17 01:52:02 TP1] Memory pool end. avail mem=13.43 GB
[2025-01-17 01:52:02 TP1] max_total_num_tokens=16384, max_prefill_tokens=16384, max_running_requests=2049, context_len=16384
[2025-01-17 01:52:02 TP0] max_total_num_tokens=16384, max_prefill_tokens=16384, max_running_requests=2049, context_len=16384
[2025-01-17 01:52:02] INFO:     Started server process [649817]
[2025-01-17 01:52:02] INFO:     Waiting for application startup.
[2025-01-17 01:52:02] INFO:     Application startup complete.
[2025-01-17 01:52:02] INFO:     Uvicorn running on http://0.0.0.0:8001 (Press CTRL+C to quit)
[2025-01-17 01:52:03] INFO:     127.0.0.1:47632 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-01-17 01:52:03 TP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, cache hit rate: 0.00%, token usage: 0.00, #running-req: 0, #queue-req: 0
[2025-01-17 01:52:11 TP0] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 101, in forward_thread_func
    self.forward_thread_func_()
  File "/home/work/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/managers/tp_worker_overlap_thread.py", line 132, in forward_thread_func_
    logits_output, next_token_ids = self.worker.forward_batch_generation(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/managers/tp_worker.py", line 154, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 679, in forward
    return self.forward_extend(forward_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 648, in forward_extend
    return self.model.forward(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/models/llama.py", line 337, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/models/llama.py", line 288, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/models/llama.py", line 237, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/models/llama.py", line 175, in forward
    output, _ = self.o_proj(attn_output)
                ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/.local/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/work/anaconda3/envs/unsloth_env/lib/python3.11/site-packages/sglang/srt/lora/lora.py", line 248, in forward
    output_ = tensor_model_parallel_all_reduce(output_parallel)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
NameError: name 'tensor_model_parallel_all_reduce' is not defined
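
The traceback indicates that sglang/srt/lora/lora.py calls tensor_model_parallel_all_reduce in its tp > 1 path without importing it. Below is a minimal sketch of the kind of fix involved, not the actual patch: the vllm.distributed import path is an assumption based on the vllm 0.6.4.post1 dependency listed in the environment, and base_layer/tp_size are illustrative attribute names rather than the real ones in lora.py.

# Hypothetical sketch: import the all-reduce helper before it is used in lora.py.
from vllm.distributed import tensor_model_parallel_all_reduce  # assumed import path

def forward(self, input_):
    # Each tensor-parallel rank computes a partial result of the row-parallel projection.
    output_parallel = self.base_layer(input_)  # illustrative attribute name
    # Sum the partial results across ranks so every rank ends up with the full output.
    if self.tp_size > 1:
        output_ = tensor_model_parallel_all_reduce(output_parallel)
    else:
        output_ = output_parallel
    return output_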

Reproduction

Model Name: Microsoft Phi-4

nohup python -m sglang.launch_server --model-path /home/work/ai/Microsoft_Phi-4/phi-4_quantized_8bit --lora-paths lora=/home/work/ai/Microsoft_Phi-4/lora_tuning_1221 --port 8001 --mem-fraction-static 0.8 --host 0.0.0.0 --dtype auto --disable-radix-cache --disable-cuda-graph --quantization gptq_marlin --max-total-tokens 16384 --tp 2 &

Environment

Python: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2: CUDA GPU
GPU 0,1,2 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 11.8, V11.8.89
CUDA Driver Version: 535.54.03
PyTorch: 2.5.1+cu124
sglang: 0.4.0
flashinfer: 0.1.6+cu121torch2.4
triton: 3.1.0
transformers: 4.48.0
torchao: 0.6.1
numpy: 1.26.4
aiohttp: 3.11.8
fastapi: 0.115.5
hf_transfer: 0.1.8
huggingface_hub: 0.27.0
interegular: 0.3.3
modelscope: 1.20.1
orjson: 3.10.12
packaging: 24.2
psutil: 6.1.0
pydantic: 2.10.4
multipart: 0.0.17
zmq: 26.2.0
uvicorn: 0.32.1
uvloop: 0.21.0
vllm: 0.6.4.post1
openai: 1.58.1
anthropic: Module Not Found
decord: 0.6.0
NVIDIA Topology:
        GPU0    GPU1    GPU2    NIC0    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     NODE    SYS     1,3,5,7,9,11    1               N/A
GPU1    PIX      X      NODE    SYS     1,3,5,7,9,11    1               N/A
GPU2    NODE    NODE     X      SYS     1,3,5,7,9,11    1               N/A
NIC0    SYS     SYS     SYS      X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0

ulimit soft: 1048576

@Fridge003
Collaborator

Hi, LoRA does not currently support tensor parallelism in SGLang, so please set tp_size to 1 when using LoRA.

We are planning to fix this in the future. You can refer to #2929 to follow our progress on LoRA development.
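
As a workaround, the reproduction command above can be run without tensor parallelism; this is a sketch adapted directly from that command, with all other flags and paths kept as-is:

python -m sglang.launch_server --model-path /home/work/ai/Microsoft_Phi-4/phi-4_quantized_8bit --lora-paths lora=/home/work/ai/Microsoft_Phi-4/lora_tuning_1221 --port 8001 --mem-fraction-static 0.8 --host 0.0.0.0 --dtype auto --disable-radix-cache --disable-cuda-graph --quantization gptq_marlin --max-total-tokens 16384 --tp 1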

@zhaochenyang20
Collaborator

Great. Please follow this issue! @Fridge003
