Using vllm runtime generates "unrecognized arguments" error #758

Open
dwrobel opened this issue Feb 7, 2025 · 2 comments
dwrobel commented Feb 7, 2025

An attempt to use the vllm runtime generates the following error:

error: unrecognized arguments: llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file

Full log:

$ ramalama --debug --runtime vllm run llama3.2
exec_cmd:  podman run --rm -i --label RAMALAMA --security-opt=label=disable --name ramalama_PNB6UFIqIM --pull=newer -t --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 --mount=type=bind,src=/home/dw/.local/share/ramalama/models/ollama/llama3.2:latest,destination=/mnt/models/model.file,ro quay.io/modh/vllm:rhoai-2.17-cuda llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file
usage: __main__.py [-h] [--host HOST] [--port PORT] [--uvicorn-log-level {debug,info,warning,error,critical,trace}] [--allow-credentials] [--allowed-origins ALLOWED_ORIGINS] [--allowed-methods ALLOWED_METHODS]
                   [--allowed-headers ALLOWED_HEADERS] [--api-key API_KEY] [--lora-modules LORA_MODULES [LORA_MODULES ...]] [--prompt-adapters PROMPT_ADAPTERS [PROMPT_ADAPTERS ...]] [--chat-template CHAT_TEMPLATE]
                   [--chat-template-content-format {auto,string,openai}] [--response-role RESPONSE_ROLE] [--ssl-keyfile SSL_KEYFILE] [--ssl-certfile SSL_CERTFILE] [--ssl-ca-certs SSL_CA_CERTS] [--ssl-cert-reqs SSL_CERT_REQS]
                   [--root-path ROOT_PATH] [--middleware MIDDLEWARE] [--return-tokens-as-token-ids] [--disable-frontend-multiprocessing] [--enable-request-id-headers] [--enable-auto-tool-choice]
                   [--tool-call-parser {granite-20b-fc,granite,hermes,internlm,jamba,llama3_json,mistral,pythonic} or name registered in --tool-parser-plugin] [--tool-parser-plugin TOOL_PARSER_PLUGIN] [--model MODEL]
                   [--task {auto,generate,embedding,embed,classify,score,reward}] [--tokenizer TOKENIZER] [--skip-tokenizer-init] [--revision REVISION] [--code-revision CODE_REVISION] [--tokenizer-revision TOKENIZER_REVISION]
                   [--tokenizer-mode {auto,slow,mistral}] [--trust-remote-code] [--allowed-local-media-path ALLOWED_LOCAL_MEDIA_PATH] [--download-dir DOWNLOAD_DIR]
                   [--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer}] [--config-format {auto,hf,mistral}] [--dtype {auto,half,float16,bfloat16,float,float32}]
                   [--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}] [--quantization-param-path QUANTIZATION_PARAM_PATH] [--max-model-len MAX_MODEL_LEN] [--guided-decoding-backend {outlines,lm-format-enforcer,xgrammar}]
                   [--logits-processor-pattern LOGITS_PROCESSOR_PATTERN] [--distributed-executor-backend {ray,mp}] [--worker-use-ray] [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE] [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
                   [--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS] [--ray-workers-use-nsight] [--block-size {8,16,32,64,128}] [--enable-prefix-caching | --no-enable-prefix-caching] [--disable-sliding-window]
                   [--use-v2-block-manager] [--num-lookahead-slots NUM_LOOKAHEAD_SLOTS] [--seed SEED] [--swap-space SWAP_SPACE] [--cpu-offload-gb CPU_OFFLOAD_GB] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
                   [--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE] [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS] [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS] [--disable-log-stats]
                   [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,None}]
                   [--rope-scaling ROPE_SCALING] [--rope-theta ROPE_THETA] [--hf-overrides HF_OVERRIDES] [--enforce-eager] [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE] [--disable-custom-all-reduce]
                   [--tokenizer-pool-size TOKENIZER_POOL_SIZE] [--tokenizer-pool-type TOKENIZER_POOL_TYPE] [--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG] [--limit-mm-per-prompt LIMIT_MM_PER_PROMPT]
                   [--mm-processor-kwargs MM_PROCESSOR_KWARGS] [--disable-mm-preprocessor-cache] [--enable-lora] [--enable-lora-bias] [--max-loras MAX_LORAS] [--max-lora-rank MAX_LORA_RANK]
                   [--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE] [--lora-dtype {auto,float16,bfloat16}] [--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS] [--max-cpu-loras MAX_CPU_LORAS] [--fully-sharded-loras]
                   [--enable-prompt-adapter] [--max-prompt-adapters MAX_PROMPT_ADAPTERS] [--max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN] [--device {auto,cuda,neuron,cpu,openvino,tpu,xpu,hpu}]
                   [--num-scheduler-steps NUM_SCHEDULER_STEPS] [--multi-step-stream-outputs [MULTI_STEP_STREAM_OUTPUTS]] [--scheduler-delay-factor SCHEDULER_DELAY_FACTOR] [--enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]]
                   [--speculative-model SPECULATIVE_MODEL]
                   [--speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,None}]
                   [--num-speculative-tokens NUM_SPECULATIVE_TOKENS] [--speculative-disable-mqa-scorer] [--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
                   [--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN] [--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE] [--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
                   [--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN] [--spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}]
                   [--typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD] [--typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA]
                   [--disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]] [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG] [--ignore-patterns IGNORE_PATTERNS] [--preemption-mode PREEMPTION_MODE]
                   [--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]] [--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH] [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
                   [--collect-detailed-traces COLLECT_DETAILED_TRACES] [--disable-async-output-proc] [--scheduling-policy {fcfs,priority}] [--override-neuron-config OVERRIDE_NEURON_CONFIG]
                   [--override-pooler-config OVERRIDE_POOLER_CONFIG] [--compilation-config COMPILATION_CONFIG] [--kv-transfer-config KV_TRANSFER_CONFIG] [--worker-cls WORKER_CLS] [--generation-config GENERATION_CONFIG]
                   [--disable-log-requests] [--max-log-len MAX_LOG_LEN] [--disable-fastapi-docs] [--enable-prompt-tokens-details] [--model-name MODEL_NAME] [--max-sequence-length MAX_SEQUENCE_LENGTH] [--max-new-tokens MAX_NEW_TOKENS]
                   [--max-batch-size MAX_BATCH_SIZE] [--max-concurrent-requests MAX_CONCURRENT_REQUESTS] [--dtype-str DTYPE_STR] [--quantize {awq,gptq,squeezellm,None}] [--num-gpus NUM_GPUS] [--num-shard NUM_SHARD]
                   [--output-special-tokens OUTPUT_SPECIAL_TOKENS] [--default-include-stop-seqs DEFAULT_INCLUDE_STOP_SEQS] [--grpc-port GRPC_PORT] [--tls-cert-path TLS_CERT_PATH] [--tls-key-path TLS_KEY_PATH]
                   [--tls-client-ca-cert-path TLS_CLIENT_CA_CERT_PATH] [--adapter-cache ADAPTER_CACHE] [--prefix-store-path PREFIX_STORE_PATH] [--speculator-name SPECULATOR_NAME] [--speculator-n-candidates SPECULATOR_N_CANDIDATES]
                   [--speculator-max-batch-size SPECULATOR_MAX_BATCH_SIZE] [--enable-vllm-log-requests ENABLE_VLLM_LOG_REQUESTS] [--disable-prompt-logprobs DISABLE_PROMPT_LOGPROBS]
__main__.py: error: unrecognized arguments: llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file
$ rpm -qv podman
podman-5.3.1-1.fc41.x86_64
$ rpm -qv python3-ramalama
python3-ramalama-0.5.5-1.fc41.noarch
$ rpm -qv golang-github-nvidia-container-toolkit
golang-github-nvidia-container-toolkit-1.16.2-1.fc41.x86_64
$ nvidia-ctk cdi list
INFO[0000] Found 3 CDI devices                          
nvidia.com/gpu=0
nvidia.com/gpu=GPU-9282fe1f-02bd-d793-11a8-5341a0858e3b
nvidia.com/gpu=all
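For context, the exec_cmd line above shows what goes wrong: ramalama appends the llama.cpp-style arguments (llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file) after the quay.io/modh/vllm image name, but that image's entrypoint is vLLM's OpenAI-compatible API server, which only understands the flags listed in the usage text above and therefore rejects them as unrecognized. A rough sketch of an invocation the entrypoint itself would accept, built only from flags that appear in the usage output (untested; the GGUF model may additionally need a matching --tokenizer):

$ podman run --rm -t --device nvidia.com/gpu=all \
    --mount=type=bind,src=/home/dw/.local/share/ramalama/models/ollama/llama3.2:latest,destination=/mnt/models/model.file,ro \
    quay.io/modh/vllm:rhoai-2.17-cuda \
    --model /mnt/models/model.file --load-format gguf --max-model-len 2048 --port 8000

That starts an HTTP server rather than an interactive chat, which is why run cannot simply forward its llama-run arguments to this runtime.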

rhatdan commented Feb 7, 2025

Yes, vllm can only do serve at this point.
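For anyone who lands here: the supported path would then be the serve subcommand with the vllm runtime, e.g. (a sketch with the same model as above, not verified in this thread):

$ ramalama --debug --runtime vllm serve llama3.2

which lines up with the server-only usage text in the log, while run has no vLLM-side equivalent.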


rhatdan commented Feb 7, 2025

Not even sure how well that works either.
