
research: OCI artifacts #1209

Open
VoVAllen opened this issue Nov 17, 2022 · 3 comments
@VoVAllen (Member)

Description

Use the OCI artifact standard to store artifacts/models when developing ML models.


Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

@cutecutecat (Member) commented Feb 9, 2023

Summary

ormb is a tool used to package a model into an OCI artifact. However, it becomes very slow once the model size reaches 20GB or more.

In this research, we investigate the drawbacks of ormb, why it is so slow for large models, and how to build better OCI storage.

Review of S3

The S3 upload endpoint uses a Content-MD5 header to validate files.

Content-MD5

The base64-encoded 128-bit MD5 digest of the message (without the headers) according to RFC 1864. This header can be used as a message integrity check to verify that the data is the same data that was originally sent. Although it is optional, we recommend using the Content-MD5 mechanism as an end-to-end integrity check. For more information about REST request authentication, see REST Authentication.
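The Content-MD5 value described above is cheap to compute; a minimal sketch in Python, using only the standard library:

```python
import base64
import hashlib

def content_md5(data: bytes) -> str:
    """Compute the Content-MD5 header value: the base64-encoded
    128-bit MD5 digest of the message body (RFC 1864)."""
    return base64.b64encode(hashlib.md5(data).digest()).decode("ascii")

# S3 recomputes the MD5 server-side and rejects the upload on a
# mismatch, giving an end-to-end integrity check.
print(content_md5(b"hello world"))  # XrY7u+Ae7tCTyyK7j1rNww==
```

Since the header is optional, a client can skip it entirely and trade integrity checking for speed.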

```mermaid
graph LR
1[calculate md5 local] --> 2[upload file and md5]
2[upload file and md5] --> 3[calculate md5 remote]
3[calculate md5 remote] --> 4[validate]
```

```mermaid
graph LR
1[upload file without md5] --> 2[not validate]
```

Improve ormb tool

1 - no compression

ormb applies gzip compression to the model, which consumes 70%-75% of the total packaging time.

mediaType string

This descriptor property has additional restrictions for `layers[]`. Implementations MUST support at least the following media types:

Manifests concerned with portability SHOULD use one of the above media types. An encountered `mediaType` that is unknown to the implementation MUST be ignored.

Entries in this field will frequently use the `+gzip` types.

OCI media types support uncompressed layers without gzip, so the compression step could be removed entirely. Whether we could also get rid of tar is not clear.
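The difference can be illustrated with two hypothetical layer descriptors. The media type strings are the real values from the OCI image-spec; the digests and sizes are placeholders:

```python
# Hedged sketch: two hypothetical OCI layer descriptors.
# Digest and size values are placeholders, not real content.
gzipped_layer = {
    "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
    "digest": "sha256:<digest-of-gzipped-tar>",  # placeholder
    "size": 0,                                   # placeholder
}

# Dropping the "+gzip" suffix means the layer is a plain tar, which
# the spec also allows; for model weights that barely compress, this
# removes most of the packaging time at the cost of a larger upload.
uncompressed_layer = {
    "mediaType": "application/vnd.oci.image.layer.v1.tar",
    "digest": "sha256:<digest-of-tar>",  # placeholder
    "size": 0,                           # placeholder
}
```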

2 - hash once

The sha256 calculation costs 20% of the total time.

```mermaid
graph LR
1[calculate sha256 local at ormb] --> 2[calling oras with sha256]
2[calling oras with sha256] --> 3[calculate sha256 local at oras]
3[calculate sha256 local at oras] --> 4[validate]
4[validate] --> 5[Commit]
```

ormb calculates the sha256 hash twice locally. Since the local copy is almost error-free, this is likely unnecessary.

oras can be called without the Digest argument, in which case the procedure becomes:

```mermaid
graph LR
1[calling oras without sha256] --> 2[calculate sha256 local at oras]
2[calculate sha256 local at oras] --> 3[Commit]
```

Thus, the time cost of sha256 can be halved.

3 - new hash algorithm

From Speed Hashing, we can see that MD5 is much faster than SHA-256:

| Algorithm | Speed |
| --- | --- |
| MD5 | 23070.7 M/s |
| SHA-1 | 7973.8 M/s |
| SHA-256 | 3110.2 M/s |
| SHA-512 | 267.1 M/s |
| NTLM | 44035.3 M/s |
| DES | 185.1 M/s |
| WPA/WPA2 | 348.0 k/s |
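For a rough feel on your own machine, here is a small single-core comparison with Python's hashlib. The figures above come from GPU benchmarks, so absolute numbers will differ greatly; only the relative ordering is expected to hold:

```python
import hashlib
import time

# Rough single-core throughput comparison. Results vary with CPU and
# whether OpenSSL uses hardware SHA extensions; treat as indicative.
data = b"\x00" * (64 * 1024 * 1024)  # 64 MiB of zeros

for name in ("md5", "sha1", "sha256", "sha512"):
    start = time.perf_counter()
    hashlib.new(name, data).hexdigest()
    elapsed = time.perf_counter() - start
    print(f"{name}: {len(data) / elapsed / 1e6:.0f} MB/s")
```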

The OCI image-spec points out that an image may use any unregistered algorithm for its digests, and an unrecognized digest will pass validation. However, open source registries such as django-oci and distribution/distribution (the core library behind many registry operators, including Docker Hub, GitHub Container Registry, GitLab Container Registry and DigitalOcean Container Registry) reject any unsupported algorithm. For this reason, we cannot pick a faster algorithm like xxHash.

Though the OpenContainers group has proposed a new hash algorithm, blake3, to speed up hashing on multi-CPU machines, it is still considered an alternative algorithm and remains unsupported in the above registries.
Related issue: opencontainers/go-digest#66
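The strict registry behaviour described above can be sketched as a toy validator: an OCI digest is an `<algorithm>:<encoded>` string, and a registry like distribution refuses even syntactically valid digests whose algorithm it does not register. The regex is a simplification of the spec grammar, and the validator is an illustration, not distribution's actual code:

```python
import re

# Simplified form of the OCI image-spec digest grammar:
# "<algorithm>:<encoded>", with registered algorithms having
# fixed-length hex encodings.
DIGEST_RE = re.compile(r"^[a-z0-9]+(?:[.+_-][a-z0-9]+)*:[a-zA-Z0-9=_-]+$")
REGISTERED = {"sha256": 64, "sha512": 128}  # hex digest lengths

def accepted_by_strict_registry(digest: str) -> bool:
    """Toy model of a strict registry: syntactically valid digests
    with unregistered algorithms (e.g. xxHash) are still refused."""
    if not DIGEST_RE.match(digest):
        return False
    algorithm, _, encoded = digest.partition(":")
    return REGISTERED.get(algorithm) == len(encoded)

print(accepted_by_strict_registry("sha256:" + "a" * 64))  # True
print(accepted_by_strict_registry("xxh64:" + "a" * 16))   # False
```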

Conclusion

From the above discussion, we conclude that most of the time consumed by an OCI upload goes to calculating sha256, while S3 uses Content-MD5 to validate uploaded files. Moreover, md5 is optional in S3, so users can trade off speed against integrity checking at upload time.

Though sha256 is much slower than md5, we cannot replace it with a faster algorithm like xxHash, since the OCI spec's allowance for unregistered algorithms is not honored by registries. The official candidate, blake3, has not been adopted by them either.

Until the OpenContainers organization makes progress here, it is not possible to accelerate OCI uploads to S3's level.

@kemingy (Member) commented Feb 9, 2023

It requires a cryptographic hash algorithm, so you cannot use something like md5 or xxHash. I guess we need to wait for blake3.

@aseaday (Member) commented Feb 9, 2023

In the last two months, I have been developing an LLM. If you have any questions about LLMs over 200GB, I am willing to give you feedback.
