
Llama.cpp fails on Fedora AMD - ROCm error #732

Open
vpavlin opened this issue Feb 4, 2025 · 8 comments

@vpavlin

vpavlin commented Feb 4, 2025

Hi folks:-)

I am not saying this is a Ramalama issue, but I would appreciate your help/guidance, because this is my first endeavour with local GPUs :-)

I just got my GMKtec K11 machine (https://www.gmktec.com/products/amd-ryzen%E2%84%A2-9-8945hs-nucbox-k11) and installed Fedora 41 on it + podman + ramalama (installed via the curl .. | sh method from the README).

$ ramalama --version
ramalama version 0

This is the result of ramalama run:

vpavlin@localhost:~$ ramalama run llama3.2
> HI                                                                                                                                                                                                                                                                                                                                                                  
ggml_cuda_compute_forward: RMS_NORM failed
ROCm error: invalid device function
  current device: 0, in function ggml_cuda_compute_forward at /llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:2313
  err
/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:72: ROCm error
Memory critical error by agent node-0 (Agent handle: 0x10a3dc10) on address 0x7f3c19800000. Reason: Memory in use. 

It successfully finds the GPU (very cool)

...
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:   CPU_Mapped model buffer size =   308.23 MiB
load_tensors:        ROCm0 model buffer size =  1918.35 MiB
...

but then fails when I try to prompt the model.

$ podman images
REPOSITORY               TAG         IMAGE ID      CREATED       SIZE
quay.io/ramalama/rocm    latest      8875feffdb87  16 hours ago  6.92 GB
docker.io/ollama/ollama  latest      f1fd985cee59  2 weeks ago   3.31 GB

Any ideas/thoughts are appreciated :) Happy to file an issue against llama.cpp, I just wanted to make sure I am not missing something obvious (like a missing package or something).

@ericcurtin
Collaborator

I think that GPU is gfx1103. Can you check if the relevant file is in the container in /opt? (It should have gfx1103 in the filename.)
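For example, something along these lines should show whether the image ships gfx1103 kernels (a sketch, assuming the quay.io/ramalama/rocm:latest image; if it follows the usual ROCm layout, rocBLAS keeps its per-target kernel files under /opt/rocm-*/lib/rocblas/library):

$ podman run --rm quay.io/ramalama/rocm:latest find /opt -iname '*gfx1103*'
$ podman run --rm quay.io/ramalama/rocm:latest sh -c 'ls /opt/rocm-*/lib/rocblas/library/ | grep -i gfx'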

@ericcurtin
Collaborator

You could simply be running out of VRAM; how much VRAM does your GPU have?

@ericcurtin
Collaborator

If:

llama3.2:1b

works, you are likely running out of VRAM. I think the default is 3b.
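(That would just be, following the same model:tag form as the other ramalama commands in this thread:)

$ ramalama run llama3.2:1b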

@vpavlin
Author

vpavlin commented Feb 4, 2025

ramalama --debug run llama3.2
run_cmd:  podman inspect quay.io/ramalama/rocm:0
Working directory: None
Ignore stderr: False
Ignore all: True
exec_cmd:  podman run --rm -i --label RAMALAMA --security-opt=label=disable --name ramalama_Hpxh57JSxc --pull=newer -t --device /dev/dri --device /dev/kfd -e HIP_VISIBLE_DEVICES=0 --mount=type=bind,src=/home/vpavlin/.local/share/ramalama/models/ollama/llama3.2:latest,destination=/mnt/models/model.file,ro quay.io/ramalama/rocm:latest llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file
Loading modelggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no                                                                                                                                                                                                                                                                                                               
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 ROCm devices:
  Device 0: AMD Radeon Graphics, gfx1103 (0x1103), VMM: no, Wave Size: 32
llama_model_load_from_file_impl: using device ROCm0 (AMD Radeon Graphics) - 14364 MiB free

It's an iGPU and it seems like it gets 50% of the RAM.

I am pulling qwen2.5:1.5b to check a smaller model, but my internet sucks, so it is gonna take a few more minutes....:D

@vpavlin
Author

vpavlin commented Feb 4, 2025

Qwen still fails with the same error.

Not sure if these are the files you were looking for?

[root@45b541dd4f49 /]# find /opt -iname "*gfx1103*"
/opt/rocm-6.3.1/lib/llvm/lib/libdevice/libhostexec-gfx1103.bc
/opt/rocm-6.3.1/lib/llvm/lib/libomptarget-amdgpu-gfx1103.bc
/opt/rocm-6.3.1/lib/llvm/lib-debug/libdevice/libhostexec-gfx1103.bc
/opt/rocm-6.3.1/lib/llvm/lib-debug/libomptarget-amdgpu-gfx1103.bc
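In case it matters: those are only LLVM bitcode files. If the image ships no rocBLAS/Tensile kernel files for gfx1103, that would line up with the "invalid device function" error, since HIP raises it when a kernel was never compiled for the device's target. One workaround I have seen mentioned for RDNA3 iGPUs (an assumption on my side, not something from the ramalama docs) is to override the reported target with HSA_OVERRIDE_GFX_VERSION, reusing the exec_cmd from the debug output above, roughly:

$ podman run --rm -i --device /dev/dri --device /dev/kfd \
    -e HIP_VISIBLE_DEVICES=0 -e HSA_OVERRIDE_GFX_VERSION=11.0.0 \
    --mount=type=bind,src=/home/vpavlin/.local/share/ramalama/models/ollama/llama3.2:latest,destination=/mnt/models/model.file,ro \
    quay.io/ramalama/rocm:latest llama-run -c 2048 --ngl 999 /mnt/models/model.file

(11.0.0 corresponds to gfx1100; the right value would be whichever gfx11 target the rocBLAS library directory in the image actually contains.)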

@vpavlin
Author

vpavlin commented Feb 4, 2025

I also noticed this

  --gpu                 offload the workload to the GPU (default: False)

and since I am not specifying --gpu, it should only use the CPU by default, no? How do I turn off GPU offloading?
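(Related thought: in the exec_cmd from the debug output, ramalama passes --ngl 999 to llama-run, and --ngl is the number of layers offloaded to the GPU, so rerunning the same command with --ngl 0 should keep everything on the CPU. Whether ramalama itself exposes an --ngl option I have not checked; ramalama run --help would tell. A trimmed-down version of that command:)

$ podman run --rm -i --device /dev/dri --device /dev/kfd \
    --mount=type=bind,src=/home/vpavlin/.local/share/ramalama/models/ollama/llama3.2:latest,destination=/mnt/models/model.file,ro \
    quay.io/ramalama/rocm:latest llama-run -c 2048 --temp 0.8 --ngl 0 /mnt/models/model.file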

@vpavlin
Author

vpavlin commented Feb 4, 2025

There is something generally weird happening - I checked the BIOS and there was actually only 3GB of VRAM assigned, so I bumped it up to 16GB (out of 32GB RAM total in the machine).

I can see the VRAM available now:

[screenshot attached]

But it seems both ollama/ollama:rocm and ramalama result in GTT being used (which is now only 8GB, rather than the number I reported above, 14364 MiB free).

Again, this is my first experience with GPUs in general, so it is all very confusing - feel free to send me somewhere else:D
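(If it helps anyone debugging along: the VRAM/GTT split should also be visible from inside the container with something like the following, assuming rocm-smi is included in the rocm image:)

$ podman run --rm --device /dev/dri --device /dev/kfd quay.io/ramalama/rocm:latest \
    rocm-smi --showmeminfo vram gtt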

@vpavlin
Author

vpavlin commented Feb 4, 2025

Curious if the memory error might be related to this: ollama/ollama#5471 (comment)
