
# ⚠️ Notice: Limited Maintenance

This project is no longer actively maintained. While existing releases remain available, there are no planned updates, bug fixes, new features, or security patches. Users should be aware that vulnerabilities may not be addressed.

# TorchServe inference with torch.compile of densenet161 model

This example shows how to take an eager-mode `densenet161` model, configure TorchServe to use `torch.compile`, and run inference with `torch.compile`.
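Outside of TorchServe, enabling `torch.compile` on an eager model is essentially a one-line change. The sketch below (a minimal standalone example that assumes `torchvision` is installed; it is not part of this example's handler) illustrates what TorchServe does on your behalf when compilation is enabled in the model config:

```python
import torch
from torchvision.models import densenet161, DenseNet161_Weights

# Eager densenet161 with pretrained ImageNet weights
model = densenet161(weights=DenseNet161_Weights.DEFAULT).eval()

# Compile the model; the first call triggers compilation
compiled_model = torch.compile(model)

with torch.inference_mode():
    out = compiled_model(torch.randn(1, 3, 224, 224))
print(out.shape)  # torch.Size([1, 1000])
```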

## Pre-requisites

- `PyTorch >= 2.0` (you can verify your installation with the check shown after this section)

Change directory to the examples directory, e.g. `cd examples/pt2/torch_compile`.
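If you are not sure which PyTorch version you have, a quick check (plain Python, no TorchServe involved) is:

```bash
python -c "import torch; print(torch.__version__)"  # should print 2.0.0 or later
```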

## torch.compile config

`torch.compile` supports a variety of configuration options, and the performance you get can vary depending on the configuration you choose. The available options are described in the PyTorch `torch.compile` documentation.

In this example, we use the following config:

echo "pt2:
  compile:
    enable: True" > model-config.yaml

## Create model archive

```bash
wget https://download.pytorch.org/models/densenet161-8d451a50.pth
mkdir model_store
torch-model-archiver --model-name densenet161 --version 1.0 --model-file model.py --serialized-file densenet161-8d451a50.pth --export-path model_store --extra-files ../../image_classifier/index_to_name.json --handler image_classifier --config-file model-config.yaml -f
```
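If archiving succeeds, the archive is written to the export path; an optional sanity check is:

```bash
ls model_store/  # should list densenet161.mar
```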

## Start TorchServe

```bash
torchserve --start --ncs --model-store model_store --models densenet161.mar --disable-token-auth --enable-model-api
```
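Before sending requests, you can optionally confirm the server is healthy via the standard TorchServe ping endpoint:

```bash
curl http://127.0.0.1:8080/ping  # expected: {"status": "Healthy"}
```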

## Run Inference

```bash
curl http://127.0.0.1:8080/predictions/densenet161 -T ../../image_classifier/kitten.jpg
```

which produces the output:

```json
{
  "tabby": 0.4664836823940277,
  "tiger_cat": 0.4645617604255676,
  "Egyptian_cat": 0.06619937717914581,
  "lynx": 0.0012969186063855886,
  "plastic_bag": 0.00022856894065625966
}
```
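`torch.compile` does its compilation work on the first few requests, so the latency numbers in the next section are taken after warmup. One simple way to warm the model up (a convenience loop, not part of the original steps) is to repeat the same request a few times:

```bash
# Send 10 identical requests to warm up the compiled model
for i in {1..10}; do
  curl -s http://127.0.0.1:8080/predictions/densenet161 -T ../../image_classifier/kitten.jpg > /dev/null
done
```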

## Performance improvement from using torch.compile

To measure the handler `preprocess`, `inference`, and `postprocess` times, run the following:

### Measure inference time with PyTorch eager

echo "handler:" > model-config.yaml && \
echo "  profile: true" >> model-config.yaml

Once the yaml file is updated, create the model archive, start TorchServe, and run inference using the steps shown above. After a few iterations of warmup, we see the following:

```
2024-02-03T00:54:31,136 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]ts_handler_preprocess.Milliseconds:6.118656158447266|#ModelName:densenet161,Level:Model|#type:GAUGE|#hostname:ip-172-31-11-40,1706921671,c02b3170-c8fc-4396-857d-6c6266bf94a9, pattern=[METRICS]
2024-02-03T00:54:31,155 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]ts_handler_inference.Milliseconds:18.77564811706543|#ModelName:densenet161,Level:Model|#type:GAUGE|#hostname:ip-172-31-11-40,1706921671,c02b3170-c8fc-4396-857d-6c6266bf94a9, pattern=[METRICS]
2024-02-03T00:54:31,155 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]ts_handler_postprocess.Milliseconds:0.16630400717258453|#ModelName:densenet161,Level:Model|#type:GAUGE|#hostname:ip-172-31-11-40,1706921671,c02b3170-c8fc-4396-857d-6c6266bf94a9, pattern=[METRICS]
```

### Measure inference time with torch.compile

echo "pt2:
  compile:
    enable: True
    backend: inductor
    mode: reduce-overhead" > model-config.yaml && \
echo "handler:
  profile: true" >> model-config.yaml
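As far as the YAML keys suggest, `backend` and `mode` here correspond to the arguments of `torch.compile` itself; `inductor` is the default backend, and `reduce-overhead` uses CUDA graphs to cut per-call framework overhead at the cost of extra warmup. A standalone sketch of the equivalent call (an assumption for illustration, not TorchServe internals) would be:

```python
import torch
from torchvision.models import densenet161, DenseNet161_Weights

model = densenet161(weights=DenseNet161_Weights.DEFAULT).eval()

# Same options as the model-config.yaml above, expressed directly in Python
compiled_model = torch.compile(model, backend="inductor", mode="reduce-overhead")
```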

Once the yaml file is updated, create the model archive, start TorchServe, and run inference using the steps shown above. `torch.compile` needs a few inferences to warm up. Once warmed up, we see the following:

```
2024-02-03T00:56:14,808 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]ts_handler_preprocess.Milliseconds:5.9771199226379395|#ModelName:densenet161,Level:Model|#type:GAUGE|#hostname:ip-172-31-11-40,1706921774,d38601be-6312-46b4-b455-0322150509e5, pattern=[METRICS]
2024-02-03T00:56:14,814 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]ts_handler_inference.Milliseconds:5.8818559646606445|#ModelName:densenet161,Level:Model|#type:GAUGE|#hostname:ip-172-31-11-40,1706921774,d38601be-6312-46b4-b455-0322150509e5, pattern=[METRICS]
2024-02-03T00:56:14,814 [INFO ] W-9000-densenet161_1.0-stdout org.pytorch.serve.wlm.WorkerLifeCycle - result=[METRICS]ts_handler_postprocess.Milliseconds:0.19392000138759613|#ModelName:densenet161,Level:Model|#type:GAUGE|#hostname:ip-172-31-11-40,1706921774,d38601be-6312-46b4-b455-0322150509e5, pattern=[METRICS]
```

## Conclusion

`torch.compile` reduces the handler inference time from about 18.8 ms with PyTorch eager to about 5.9 ms, roughly a 3x speedup in this example.
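The handler timings reported in the logs above are summarized below (milliseconds, after warmup):

| Stage | PyTorch eager | torch.compile |
|---|---|---|
| preprocess | 6.12 | 5.98 |
| inference | 18.78 | 5.88 |
| postprocess | 0.17 | 0.19 |

When you are done, stop the server with:

```bash
torchserve --stop
```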