Using vllm runtime generates "unrecognized arguments" error #758

Open
dwrobel opened this issue Feb 7, 2025 · 2 comments
dwrobel commented Feb 7, 2025

An attempt to use the vllm runtime generates the following error:

error: unrecognized arguments: llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file

Full log:

$ ramalama --debug --runtime vllm run llama3.2
exec_cmd:  podman run --rm -i --label RAMALAMA --security-opt=label=disable --name ramalama_PNB6UFIqIM --pull=newer -t --device /dev/dri --device nvidia.com/gpu=all -e CUDA_VISIBLE_DEVICES=0 --mount=type=bind,src=/home/dw/.local/share/ramalama/models/ollama/llama3.2:latest,destination=/mnt/models/model.file,ro quay.io/modh/vllm:rhoai-2.17-cuda llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file
usage: __main__.py [-h] [--host HOST] [--port PORT] [--uvicorn-log-level {debug,info,warning,error,critical,trace}] [--allow-credentials] [--allowed-origins ALLOWED_ORIGINS] [--allowed-methods ALLOWED_METHODS]
                   [--allowed-headers ALLOWED_HEADERS] [--api-key API_KEY] [--lora-modules LORA_MODULES [LORA_MODULES ...]] [--prompt-adapters PROMPT_ADAPTERS [PROMPT_ADAPTERS ...]] [--chat-template CHAT_TEMPLATE]
                   [--chat-template-content-format {auto,string,openai}] [--response-role RESPONSE_ROLE] [--ssl-keyfile SSL_KEYFILE] [--ssl-certfile SSL_CERTFILE] [--ssl-ca-certs SSL_CA_CERTS] [--ssl-cert-reqs SSL_CERT_REQS]
                   [--root-path ROOT_PATH] [--middleware MIDDLEWARE] [--return-tokens-as-token-ids] [--disable-frontend-multiprocessing] [--enable-request-id-headers] [--enable-auto-tool-choice]
                   [--tool-call-parser {granite-20b-fc,granite,hermes,internlm,jamba,llama3_json,mistral,pythonic} or name registered in --tool-parser-plugin] [--tool-parser-plugin TOOL_PARSER_PLUGIN] [--model MODEL]
                   [--task {auto,generate,embedding,embed,classify,score,reward}] [--tokenizer TOKENIZER] [--skip-tokenizer-init] [--revision REVISION] [--code-revision CODE_REVISION] [--tokenizer-revision TOKENIZER_REVISION]
                   [--tokenizer-mode {auto,slow,mistral}] [--trust-remote-code] [--allowed-local-media-path ALLOWED_LOCAL_MEDIA_PATH] [--download-dir DOWNLOAD_DIR]
                   [--load-format {auto,pt,safetensors,npcache,dummy,tensorizer,sharded_state,gguf,bitsandbytes,mistral,runai_streamer}] [--config-format {auto,hf,mistral}] [--dtype {auto,half,float16,bfloat16,float,float32}]
                   [--kv-cache-dtype {auto,fp8,fp8_e5m2,fp8_e4m3}] [--quantization-param-path QUANTIZATION_PARAM_PATH] [--max-model-len MAX_MODEL_LEN] [--guided-decoding-backend {outlines,lm-format-enforcer,xgrammar}]
                   [--logits-processor-pattern LOGITS_PROCESSOR_PATTERN] [--distributed-executor-backend {ray,mp}] [--worker-use-ray] [--pipeline-parallel-size PIPELINE_PARALLEL_SIZE] [--tensor-parallel-size TENSOR_PARALLEL_SIZE]
                   [--max-parallel-loading-workers MAX_PARALLEL_LOADING_WORKERS] [--ray-workers-use-nsight] [--block-size {8,16,32,64,128}] [--enable-prefix-caching | --no-enable-prefix-caching] [--disable-sliding-window]
                   [--use-v2-block-manager] [--num-lookahead-slots NUM_LOOKAHEAD_SLOTS] [--seed SEED] [--swap-space SWAP_SPACE] [--cpu-offload-gb CPU_OFFLOAD_GB] [--gpu-memory-utilization GPU_MEMORY_UTILIZATION]
                   [--num-gpu-blocks-override NUM_GPU_BLOCKS_OVERRIDE] [--max-num-batched-tokens MAX_NUM_BATCHED_TOKENS] [--max-num-seqs MAX_NUM_SEQS] [--max-logprobs MAX_LOGPROBS] [--disable-log-stats]
                   [--quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,None}]
                   [--rope-scaling ROPE_SCALING] [--rope-theta ROPE_THETA] [--hf-overrides HF_OVERRIDES] [--enforce-eager] [--max-seq-len-to-capture MAX_SEQ_LEN_TO_CAPTURE] [--disable-custom-all-reduce]
                   [--tokenizer-pool-size TOKENIZER_POOL_SIZE] [--tokenizer-pool-type TOKENIZER_POOL_TYPE] [--tokenizer-pool-extra-config TOKENIZER_POOL_EXTRA_CONFIG] [--limit-mm-per-prompt LIMIT_MM_PER_PROMPT]
                   [--mm-processor-kwargs MM_PROCESSOR_KWARGS] [--disable-mm-preprocessor-cache] [--enable-lora] [--enable-lora-bias] [--max-loras MAX_LORAS] [--max-lora-rank MAX_LORA_RANK]
                   [--lora-extra-vocab-size LORA_EXTRA_VOCAB_SIZE] [--lora-dtype {auto,float16,bfloat16}] [--long-lora-scaling-factors LONG_LORA_SCALING_FACTORS] [--max-cpu-loras MAX_CPU_LORAS] [--fully-sharded-loras]
                   [--enable-prompt-adapter] [--max-prompt-adapters MAX_PROMPT_ADAPTERS] [--max-prompt-adapter-token MAX_PROMPT_ADAPTER_TOKEN] [--device {auto,cuda,neuron,cpu,openvino,tpu,xpu,hpu}]
                   [--num-scheduler-steps NUM_SCHEDULER_STEPS] [--multi-step-stream-outputs [MULTI_STEP_STREAM_OUTPUTS]] [--scheduler-delay-factor SCHEDULER_DELAY_FACTOR] [--enable-chunked-prefill [ENABLE_CHUNKED_PREFILL]]
                   [--speculative-model SPECULATIVE_MODEL]
                   [--speculative-model-quantization {aqlm,awq,deepspeedfp,tpu_int8,fp8,fbgemm_fp8,modelopt,marlin,gguf,gptq_marlin_24,gptq_marlin,awq_marlin,gptq,compressed-tensors,bitsandbytes,qqq,hqq,experts_int8,neuron_quant,ipex,None}]
                   [--num-speculative-tokens NUM_SPECULATIVE_TOKENS] [--speculative-disable-mqa-scorer] [--speculative-draft-tensor-parallel-size SPECULATIVE_DRAFT_TENSOR_PARALLEL_SIZE]
                   [--speculative-max-model-len SPECULATIVE_MAX_MODEL_LEN] [--speculative-disable-by-batch-size SPECULATIVE_DISABLE_BY_BATCH_SIZE] [--ngram-prompt-lookup-max NGRAM_PROMPT_LOOKUP_MAX]
                   [--ngram-prompt-lookup-min NGRAM_PROMPT_LOOKUP_MIN] [--spec-decoding-acceptance-method {rejection_sampler,typical_acceptance_sampler}]
                   [--typical-acceptance-sampler-posterior-threshold TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_THRESHOLD] [--typical-acceptance-sampler-posterior-alpha TYPICAL_ACCEPTANCE_SAMPLER_POSTERIOR_ALPHA]
                   [--disable-logprobs-during-spec-decoding [DISABLE_LOGPROBS_DURING_SPEC_DECODING]] [--model-loader-extra-config MODEL_LOADER_EXTRA_CONFIG] [--ignore-patterns IGNORE_PATTERNS] [--preemption-mode PREEMPTION_MODE]
                   [--served-model-name SERVED_MODEL_NAME [SERVED_MODEL_NAME ...]] [--qlora-adapter-name-or-path QLORA_ADAPTER_NAME_OR_PATH] [--otlp-traces-endpoint OTLP_TRACES_ENDPOINT]
                   [--collect-detailed-traces COLLECT_DETAILED_TRACES] [--disable-async-output-proc] [--scheduling-policy {fcfs,priority}] [--override-neuron-config OVERRIDE_NEURON_CONFIG]
                   [--override-pooler-config OVERRIDE_POOLER_CONFIG] [--compilation-config COMPILATION_CONFIG] [--kv-transfer-config KV_TRANSFER_CONFIG] [--worker-cls WORKER_CLS] [--generation-config GENERATION_CONFIG]
                   [--disable-log-requests] [--max-log-len MAX_LOG_LEN] [--disable-fastapi-docs] [--enable-prompt-tokens-details] [--model-name MODEL_NAME] [--max-sequence-length MAX_SEQUENCE_LENGTH] [--max-new-tokens MAX_NEW_TOKENS]
                   [--max-batch-size MAX_BATCH_SIZE] [--max-concurrent-requests MAX_CONCURRENT_REQUESTS] [--dtype-str DTYPE_STR] [--quantize {awq,gptq,squeezellm,None}] [--num-gpus NUM_GPUS] [--num-shard NUM_SHARD]
                   [--output-special-tokens OUTPUT_SPECIAL_TOKENS] [--default-include-stop-seqs DEFAULT_INCLUDE_STOP_SEQS] [--grpc-port GRPC_PORT] [--tls-cert-path TLS_CERT_PATH] [--tls-key-path TLS_KEY_PATH]
                   [--tls-client-ca-cert-path TLS_CLIENT_CA_CERT_PATH] [--adapter-cache ADAPTER_CACHE] [--prefix-store-path PREFIX_STORE_PATH] [--speculator-name SPECULATOR_NAME] [--speculator-n-candidates SPECULATOR_N_CANDIDATES]
                   [--speculator-max-batch-size SPECULATOR_MAX_BATCH_SIZE] [--enable-vllm-log-requests ENABLE_VLLM_LOG_REQUESTS] [--disable-prompt-logprobs DISABLE_PROMPT_LOGPROBS]
__main__.py: error: unrecognized arguments: llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file
$ rpm -qv podman
podman-5.3.1-1.fc41.x86_64
$ rpm -qv python3-ramalama
python3-ramalama-0.5.5-1.fc41.noarch
$ rpm -qv golang-github-nvidia-container-toolkit
golang-github-nvidia-container-toolkit-1.16.2-1.fc41.x86_64
$ nvidia-ctk cdi list
INFO[0000] Found 3 CDI devices                          
nvidia.com/gpu=0
nvidia.com/gpu=GPU-9282fe1f-02bd-d793-11a8-5341a0858e3b
nvidia.com/gpu=all
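For context, the exec_cmd line above shows what goes wrong: ramalama appends the llama.cpp-style arguments (llama-run -c 2048 --temp 0.8 -v --ngl 999 /mnt/models/model.file) after the quay.io/modh/vllm image name, but that image's entrypoint is vLLM's OpenAI-compatible API server, which only understands the flags listed in the usage text above and therefore rejects them as unrecognized. A rough sketch of an invocation the entrypoint itself would accept, built only from flags that appear in the usage output (untested; the GGUF model may additionally need a matching --tokenizer):

$ podman run --rm -t --device nvidia.com/gpu=all \
    --mount=type=bind,src=/home/dw/.local/share/ramalama/models/ollama/llama3.2:latest,destination=/mnt/models/model.file,ro \
    quay.io/modh/vllm:rhoai-2.17-cuda \
    --model /mnt/models/model.file --load-format gguf --max-model-len 2048 --port 8000

That starts an HTTP server rather than an interactive chat, which is why run cannot simply forward its llama-run arguments to this runtime.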

rhatdan commented Feb 7, 2025

Yes, vllm can only do serve at this point.
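For anyone who lands here: the supported path would then be the serve subcommand with the vllm runtime, e.g. (a sketch with the same model as above, not verified in this thread):

$ ramalama --debug --runtime vllm serve llama3.2

which lines up with the server-only usage text in the log, while run has no vLLM-side equivalent.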


rhatdan commented Feb 7, 2025

Not even sure how well that works either.
