AI-Hypercomputer/inference-benchmark

Inference Benchmark

A model-server-agnostic inference benchmarking tool for LLMs running on different infrastructure, such as GPUs and TPUs. It can also be run as a container on a GKE cluster.

Run the benchmark

  1. Create a Python virtualenv.

  2. Install all the prerequisite packages.

pip install -r requirements.txt
  3. Set your Hugging Face token as an environment variable.
export HF_TOKEN=<your-huggingface-token>
  4. Download the ShareGPT dataset.
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
  5. Run the benchmarking script directly with a specific request rate.
python3 benchmark_serving.py --save-json-results --host=$IP --port=$PORT --dataset=$PROMPT_DATASET_FILE --tokenizer=$TOKENIZER --request-rate=$REQUEST_RATE --backend=$BACKEND --num-prompts=$NUM_PROMPTS --max-input-length=$INPUT_LENGTH --max-output-length=$OUTPUT_LENGTH --file-prefix=$FILE_PREFIX
  6. Generate a full latency profile, which produces latency and throughput data at different request rates.
./latency_throughput_curve.sh
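
The `benchmark_serving.py` invocation in the steps above relies on several shell variables that are not defined in this README. The following is a minimal sketch of how the same command line could be assembled from those variables and repeated at a few request rates. The flag names come from the command above. The default values, the `build_command` helper, and the list of rates are hypothetical placeholders for illustration, and the loop is not meant to reproduce what `latency_throughput_curve.sh` actually does.

```python
import os
import shlex

# Hypothetical fallback values, used only when a variable is not set.
# Replace them with values that match your own model server.
DEFAULTS = {
    "IP": "localhost",
    "PORT": "8000",
    "PROMPT_DATASET_FILE": "ShareGPT_V3_unfiltered_cleaned_split.json",
    "TOKENIZER": "<your-tokenizer>",
    "BACKEND": "<your-backend>",
    "NUM_PROMPTS": "100",
    "INPUT_LENGTH": "1024",
    "OUTPUT_LENGTH": "1024",
    "FILE_PREFIX": "benchmark",
}


def build_command(request_rate, env=None):
    """Return the argument list for one benchmark run at the given request rate."""
    env = os.environ if env is None else env

    def get(name):
        # Prefer the caller's environment, fall back to the placeholder.
        return env.get(name, DEFAULTS[name])

    return [
        "python3",
        "benchmark_serving.py",
        "--save-json-results",
        f"--host={get('IP')}",
        f"--port={get('PORT')}",
        f"--dataset={get('PROMPT_DATASET_FILE')}",
        f"--tokenizer={get('TOKENIZER')}",
        f"--request-rate={request_rate}",
        f"--backend={get('BACKEND')}",
        f"--num-prompts={get('NUM_PROMPTS')}",
        f"--max-input-length={get('INPUT_LENGTH')}",
        f"--max-output-length={get('OUTPUT_LENGTH')}",
        f"--file-prefix={get('FILE_PREFIX')}",
    ]


# Dry run: print one shell-quoted command per request rate instead of running it.
for rate in (1, 2, 4):
    print(shlex.join(build_command(rate)))
```

To execute a run rather than print it, the returned list can be passed to `subprocess.run(cmd, check=True)` from the standard library once the placeholders have been replaced with real values.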
