System Info
Hi,
The model and configuration below produce an ERROR during warmup. I also tested a smaller BATCH_BUCKET_SIZE, but the error still occurs.
The same model works fine when MAX_TOTAL_TOKENS and max-input-length are set to power-of-two values, like the ones used in the repo README (a power-of-two sketch follows the compose file below).
Model = https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct
```yaml
services:
  tgi-gaudi-service:
    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
    container_name: tgi-gaudi-server
    ports:
      - "6005:80"
    volumes:
      - "PATH_TO_YOUR_LOCAL_MODEL_CACHE/hub:/data"
    environment:
      no_proxy: ${no_proxy}
      NO_PROXY: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HF_TOKEN}
      HF_HUB_DISABLE_PROGRESS_BARS: 1
      HF_HUB_ENABLE_HF_TRANSFER: 0
      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
      ENABLE_HPU_GRAPH: true
      LIMIT_HPU_GRAPH: true
      USE_FLASH_ATTENTION: true
      FLASH_ATTENTION_RECOMPUTE: true
      PT_HPU_ENABLE_LAZY_COLLECTIVES: true
      TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN: false
      MAX_TOTAL_TOKENS: 10000
      BATCH_BUCKET_SIZE: 32
      PREFILL_BATCH_BUCKET_SIZE: 2
      PAD_SEQUENCE_TO_MULTIPLE_OF: 64
    runtime: habana
    cap_add:
      - SYS_NICE
    ipc: host
    command: >
      --model-id ${LLM_MODEL_ID}
      --sharded true
      --num-shard 8
      --max-input-length 8488
      --max-total-tokens 10000
      --max-batch-prefill-tokens 16976
      --max-batch-total-tokens 320000
      --max-waiting-tokens 7
      --waiting-served-ratio 1.2
      --max-concurrent-requests 512

networks:
  default:
    driver: bridge
```
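For reference, the failing values appear to follow max-batch-prefill-tokens = PREFILL_BATCH_BUCKET_SIZE × max-input-length (2 × 8488 = 16976) and max-batch-total-tokens = BATCH_BUCKET_SIZE × max-total-tokens (32 × 10000 = 320000). Below is a minimal sketch of a power-of-two variant that keeps those same relationships; the specific numbers (4096 input / 8192 total) are illustrative assumptions, not the exact README values, and only the keys that change are shown.

```yaml
# Illustrative sketch only: the same service with power-of-two token limits.
# These exact values are assumptions, not the confirmed README configuration;
# unchanged keys (volumes, runtime, HPU graph settings, ...) are omitted.
services:
  tgi-gaudi-service:
    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
    environment:
      MAX_TOTAL_TOKENS: 8192
      BATCH_BUCKET_SIZE: 32
      PREFILL_BATCH_BUCKET_SIZE: 2
      PAD_SEQUENCE_TO_MULTIPLE_OF: 64
    command: >
      --model-id ${LLM_MODEL_ID}
      --sharded true
      --num-shard 8
      --max-input-length 4096
      --max-total-tokens 8192
      --max-batch-prefill-tokens 8192
      --max-batch-total-tokens 262144
      --max-waiting-tokens 7
      --waiting-served-ratio 1.2
      --max-concurrent-requests 512
```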
cc: @yu
Information
Tasks
Reproduction
The server fails during warmup, at initialization.
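The failure should be reproducible by starting the service from the compose file above; the file name and variable setup below are assumptions:

```shell
# Assumes the compose file above is saved as docker-compose.yaml and that
# HF_TOKEN and the proxy variables are already set in the environment.
export LLM_MODEL_ID=meta-llama/Llama-3.1-70B-Instruct
docker compose up tgi-gaudi-service
```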
Expected behavior
The service launches correctly after warmup completes.