
Fastsafetensors #298

Open · wants to merge 3 commits into main
Conversation

fialhocoelho

This PR adds support for loading model weights with the fastsafetensors library.

Implementation considerations

  • Dockerfile.ubi:

    • The numactl-devel package is required to install fastsafetensors, but it is not available in the repository (only numactl-libs is). To address this, we compile numa (numactl, numactl-devel, and numactl-libs) from source and install it in the python-cuda-base layer, immediately before installing the packages from requirements-cuda.txt (where fastsafetensors is listed).

    • In the vllm-openai layer, we add the LD_LIBRARY_PATH environment variable to include the path to the nvidia-cufile library, which is required for using fastsafetensors. Normally, cufile is installed alongside cuda-toolkit (version >12) or NVIDIA GDS. Even though we are not using GDS, the binaries compiled by fastsafetensors require the cufile library to be loadable.

    • In the final layer (vllm-grpc-adapter), we install the numactl and numactl-libs packages compiled in the python-cuda-base layer, so that fastsafetensors runs against the same version used at compilation time.

    • Lastly, we install the nvidia-cufile package, which was already linked in the previous step/layer. While it may not be "aesthetically pleasing" to install the library alongside the RUN uv pip install command that installs the adapter, creating a requirements-fastsafetensors.txt file (or similar) for a single library besides the numa dependencies doesn't seem very functional either. A sketch of these Dockerfile changes follows this list.

  • vllm/model_executor/model_loader/weight_utils.py:

    • The SafeTensorsFileLoader() constructor initializes fastsafetensors and accepts a nogds parameter, which defaults to False. According to the docs, setting it to True disables support for NVIDIA GDS. Since we do not use GDS, I set it to True (see the usage sketch after this list).
    • The documentation also states that if nogds is False and GDS is unavailable, the loader will fail. During testing, however, the function worked with nogds set to both True and False. More tests/benchmarks may be needed to evaluate the functionality and performance impact of using this library.
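
A minimal sketch of the Dockerfile.ubi changes described above, assuming illustrative layer names, a hypothetical numactl version, and a Python 3.12 site-packages path for the cufile wheel; this is not the PR's exact diff:

```dockerfile
# python-cuda-base layer: build numactl from source, since numactl-devel
# is not available in the repository (only numactl-libs is).
FROM base AS python-cuda-base
ARG NUMACTL_VERSION=2.0.16  # assumed version, not taken from the PR
RUN curl -sL "https://github.com/numactl/numactl/releases/download/v${NUMACTL_VERSION}/numactl-${NUMACTL_VERSION}.tar.gz" \
        | tar xz -C /tmp \
    && cd "/tmp/numactl-${NUMACTL_VERSION}" \
    && ./configure --prefix=/usr/local \
    && make -j"$(nproc)" install  # provides the libnuma headers fastsafetensors needs

# vllm-openai layer: make the cufile shared library resolvable at load time,
# since fastsafetensors' compiled extension links against it even without GDS.
FROM python-cuda-base AS vllm-openai
ENV LD_LIBRARY_PATH="/usr/local/lib/python3.12/site-packages/nvidia/cufile/lib:${LD_LIBRARY_PATH}"
```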
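
And a minimal sketch of driving SafeTensorsFileLoader with nogds=True, adapted from the fastsafetensors README; the file and tensor names are placeholders:

```python
import torch
from fastsafetensors import SafeTensorsFileLoader, SingleGroup

device = torch.device("cuda:0")
# nogds=True disables the NVIDIA GDS path; weights still end up in GPU
# memory, just via a bounce-buffer copy instead of GPU Direct Storage.
loader = SafeTensorsFileLoader(SingleGroup(), device, nogds=True)
loader.add_filenames({0: ["model-00001-of-00002.safetensors"]})  # rank -> files
fb = loader.copy_files_to_device()  # parse headers and copy weights to the device
tensor = fb.get_tensor("model.embed_tokens.weight")  # placeholder tensor name
fb.close()
loader.close()
```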


openshift-ci bot commented Jan 20, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fialhocoelho
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


openshift-ci bot commented Jan 20, 2025

@fialhocoelho: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                     Commit   Details  Required  Rerun command
ci/prow/smoke-test            a9c6aa0  link     true      /test smoke-test
ci/prow/images                a9c6aa0  link     true      /test images
ci/prow/rocm-pr-image-mirror  a9c6aa0  link     true      /test rocm-pr-image-mirror

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

```diff
@@ -307,7 +308,15 @@ def _get_weights_iterator(
                 hf_weights_files,
             )
         elif use_safetensors:
-            weights_iterator = safetensors_weights_iterator(hf_weights_files)
+            use_fastsafe_tensor = os.getenv('USE_FASTSAFETENSOR',
```


My suggestion here would be to enable loader extra config for the DefaultLoader: https://github.com/vllm-project/vllm/blob/e784c6b9984e8f8116f74000b863d941495acb0b/vllm/model_executor/model_loader/loader.py#L190

So that you can enable fastsafetensors and GDS with
--model-loader-extra-config='{"use_fastsafetensors": true, "enable_gds": true}'
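
A hedged sketch of what that could look like inside the DefaultLoader; the key names follow the suggestion above, and fastsafetensors_weights_iterator is a hypothetical helper, not an existing vLLM function:

```python
# Sketch only: assumes the suggested config keys; not the PR's implementation.
extra_config = self.load_config.model_loader_extra_config or {}
use_fastsafetensors = bool(extra_config.get("use_fastsafetensors", False))
enable_gds = bool(extra_config.get("enable_gds", False))

if use_safetensors and use_fastsafetensors:
    # hypothetical helper wrapping fastsafetensors' SafeTensorsFileLoader
    weights_iterator = fastsafetensors_weights_iterator(
        hf_weights_files, nogds=not enable_gds)
elif use_safetensors:
    weights_iterator = safetensors_weights_iterator(hf_weights_files)
```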

```diff
 RUN --mount=type=cache,target=/root/.cache/pip \
     --mount=type=cache,target=/root/.cache/uv \
     --mount=type=bind,from=build,src=/workspace/dist,target=/workspace/dist \
-    HOME=/root uv pip install "$(echo /workspace/dist/*.whl)[tensorizer]" vllm-tgis-adapter==0.6.0
+    HOME=/root uv pip install "$(echo /workspace/dist/*.whl)[tensorizer]" vllm-tgis-adapter==0.6.0 nvidia-cufile-cu12
```
@dtrifiro Jan 31, 2025

Looking at the fastsafetensors repo's pyproject.toml, nvidia-cufile-cu12 is not listed as a dependency. Should we open a PR there to add this as a dep instead of manually adding it here?

Author

The cufile lib isn’t listed as a direct installation requirement, but it’s necessary because when importing the main object, it relies on a piece of code (fastcpp) that depends on cufile. The key point is that cufile is required for using the library with GDS support, which isn’t our use case. However, due to how the lib is structured, we still need this dependency even if we’re not using GDS. According to the docs, setting nogds=True disables GDS support, but the cufile requirement is triggered immediately upon import. I believe the best course of action is to open an issue in the repo to address this condition (I can take care of that).
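
For reference, the import-time failure can be reproduced without touching GDS at all; the exact error text depends on the environment, but roughly:

```bash
# If libcufile is not resolvable, the import itself fails, even though
# nogds=True would only take effect later, at loader-construction time.
python -c "import fastsafetensors"
```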


Sorry @fialhocoelho, I missed this reply.

Thanks for the explanation. Are we sure we're not interested in GDS, though? I don't know much about it, but it seems to be one of the performance improvements that fastsafetensors brings.

Author

@dtrifiro I don’t have much experience with how GDS would be installed in an OpenShift env, since it operates at a low level. I’ve only used it in a DGX environment. I can look into it a bit.


Looking at https://docs.nvidia.com/gpudirect-storage/overview-guide/index.html#software-architecture, it doesn't seem like much is needed apart from cufile and the NVIDIA driver.

===================================================================

Using the fastsafetensors library enables loading model weights to GPU memory by leveraging GPU Direct Storage. See https://github.com/foundation-model-stack/fastsafetensors for more details.
To enable this feature, set the environment variable ``USE_FASTSAFETENSOR`` to ``true``.
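
For example, with the environment-variable approach from this PR (facebook/opt-125m below is just a placeholder model):

```bash
# Enable fastsafetensors-based weight loading, then start vLLM as usual
export USE_FASTSAFETENSOR=true
python -m vllm.entrypoints.openai.api_server --model facebook/opt-125m
```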


Is this still relevant?
