warmup error when MAX_TOTAL_TOKENS and max_input_length are not powers of 2 #256

rbrugaro opened this issue Dec 17, 2024 · 0 comments

System Info

Hi,
With the model and configuration below, the server hits an ERROR during warmup. I also tested a smaller BATCH_BUCKET_SIZE and get the same error.

The same model works fine when MAX_TOTAL_TOKENS and max_input_length are powers of 2, like the values used in the repo README.

Model = https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct

services:
  tgi-gaudi-service:
    image: ghcr.io/huggingface/tgi-gaudi:2.0.6
    container_name: tgi-gaudi-server
    ports:
      - "6005:80"
    volumes:
      - "PATH_TO_YOUR_LOCAL_MODEL_CACHE/hub:/data"
    environment:
      no_proxy: ${no_proxy}
      NO_PROXY: ${no_proxy}
      http_proxy: ${http_proxy}
      https_proxy: ${https_proxy}
      HF_TOKEN: ${HF_TOKEN}
      HF_HUB_DISABLE_PROGRESS_BARS: 1
      HF_HUB_ENABLE_HF_TRANSFER: 0
      HABANA_VISIBLE_DEVICES: all
      OMPI_MCA_btl_vader_single_copy_mechanism: none
      ENABLE_HPU_GRAPH: true
      LIMIT_HPU_GRAPH: true
      USE_FLASH_ATTENTION: true
      FLASH_ATTENTION_RECOMPUTE: true
      PT_HPU_ENABLE_LAZY_COLLECTIVES: true
      TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN: false
      MAX_TOTAL_TOKENS: 10000
      BATCH_BUCKET_SIZE: 32
      PREFILL_BATCH_BUCKET_SIZE: 2
      PAD_SEQUENCE_TO_MULTIPLE_OF: 64
    runtime: habana
    cap_add:
      - SYS_NICE
    ipc: host
    command: >
      --model-id ${LLM_MODEL_ID} --sharded true --num-shard 8
      --max-input-length 8488 --max-total-tokens 10000
      --max-batch-prefill-tokens 16976 --max-batch-total-tokens 320000
      --max-waiting-tokens 7 --waiting-served-ratio 1.2
      --max-concurrent-requests 512
networks:
  default:
    driver: bridge
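
As a data point, neither of the lengths I pass is a multiple of PAD_SEQUENCE_TO_MULTIPLE_OF (64). Assuming warmup pads sequence lengths up to the next multiple of that value (my assumption based on the variable name, not confirmed from the code), they would round as follows:

    # assumption: sequence lengths get padded up to the next multiple of
    # PAD_SEQUENCE_TO_MULTIPLE_OF during warmup
    PAD=64
    for n in 8488 10000; do
      echo "$n is padded to $(( (n + PAD - 1) / PAD * PAD ))"
    done
    # 8488 -> 8512, 10000 -> 10048; neither input is already on a 64 boundary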
(Screenshot of the warmup error, taken 2024-12-15.)

cc: @yu

Information

  • Docker
  • The CLI directly

Tasks

  • An officially supported command
  • My own modifications

Reproduction

  1. docker compose -f compose.yaml up -d
    It fails during warmup at initialization (the error appears in the container logs; see below).
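
To capture the warmup error when reproducing, the container logs can be followed (the container name comes from the compose file above):

    docker logs -f tgi-gaudi-server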

Expected behavior

The service launches correctly after warmup completes.
