
[Feature] Tensor parallelism fine-tuning #931

Open
MrPanch opened this issue Feb 26, 2025 · 0 comments

MrPanch commented Feb 26, 2025

Motivation

DeepSpeed provides tensor parallelism out of the box. However, when I modify the config (for example, internvl_chat/zero_stage3_config.json), adding a "model_parallel" block to fine-tune the 26B model:

"model_parallel": {
"enabled": true,
"dp_world_size": 6,
"tensor_parallel_size": 6,
"pipeline_parallel_size": 1,
"cpu_offload": true
},
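For context, here is how such a block would sit inside a typical ZeRO stage-3 config (a sketch only: the `zero_optimization`, `bf16`, and batch-size keys are standard DeepSpeed config options, but I am not sure stock DeepSpeed recognizes the `model_parallel` block at all, which may itself be the problem):

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 8,
  "zero_optimization": {
    "stage": 3
  },
  "bf16": { "enabled": true },
  "model_parallel": {
    "enabled": true,
    "dp_world_size": 6,
    "tensor_parallel_size": 6,
    "pipeline_parallel_size": 1,
    "cpu_offload": true
  }
}
```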

the GPU utilization looks like this:

(screenshot: GPU utilization)

Based on your documentation, serving a single 26B model on 4 GPUs requires 30 GB of memory per GPU, and 25,806 MB per GPU on 8 GPUs, so interpolating, I expected to see about 28 GB of memory utilization, but instead I get an out-of-memory error:

(screenshot: out-of-memory error)
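The interpolation can be checked with a few lines of Python (a sketch; the 30 GB / 25,806 MB figures are the documentation numbers quoted above, and 6 GPUs is the setup from the config):

```python
def interp(x, x0, y0, x1, y1):
    """Linearly interpolate y at x between the points (x0, y0) and (x1, y1)."""
    return y0 + (x - x0) * (y1 - y0) / (x1 - x0)

# Per-GPU memory: 30 GB on 4 GPUs, 25,806 MB (~25.806 GB) on 8 GPUs.
mem_6gpu = interp(6, 4, 30.0, 8, 25.806)
print(f"Expected per-GPU memory on 6 GPUs: {mem_6gpu:.1f} GB")  # ~27.9 GB
```

This is only a rough linear estimate; actual per-GPU memory does not scale linearly with GPU count, but it motivates the ~28 GB expectation above.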

So, even though the documentation suggests I can fit the 26B model and fine-tune it with batch size 1 and gradient accumulation of, for example, 8, I run into this problem.

I am not sure that I did everything correctly. If my addition to the config is wrong, please let me know. If not: what do you think about adding tensor-level parallelism?
Thank you in advance.

Related resources

No response

Additional context

No response
