Utilize a parallel gzip implementation #25
Closed
This PR would allow parallel compression and improved decompression for slugs. I used a few cloned HashiCorp Git repositories of varying sizes to test the performance difference, mostly because TFE utilizes this code and VCS repos are typically larger than 1 MB.
This is based on some older work here (https://github.com/hashicorp/go-service/pull/29), which should help with VCS ingress speeds, as this code is still utilized by the slug ingress container, as per (unless I'm looking at old code) https://github.com/hashicorp/slug-ingress/blob/main/worker.go#L170-L180 (and the `import` therein). If these patches, as well as #21 (great work!), were merged, significant speedups might be had in TFE/TFC/Agents. This could considerably reduce the time taken for source code ingress on TFE, especially for customers with large monorepos.
The library that provides the parallel gzip drop-in replacement is https://github.com/klauspost/pgzip.
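For illustration, here's a minimal sketch of what the drop-in swap looks like. The `Pack` signature below is illustrative rather than go-slug's exact API; the point is that only the import changes, not the call sites:

```go
package slug

import (
	"archive/tar"
	"io"

	// Drop-in replacement: was "compress/gzip". Aliasing the import means
	// the gzip.NewWriter call sites below don't need to change.
	gzip "github.com/klauspost/pgzip"
)

// Pack is an illustrative stand-in for a slug-packing function that writes
// a gzipped tarball of src to w.
func Pack(src string, w io.Writer) error {
	gzw := gzip.NewWriter(w) // compresses blocks on all available cores
	defer gzw.Close()

	tarw := tar.NewWriter(gzw)
	defer tarw.Close()

	// ... walk src and write tar entries exactly as before ...
	_ = src
	return nil
}
```

pgzip's `Writer` also exposes `SetConcurrency(blockSize, blocks)` if the defaults need tuning for a particular workload.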
As you can see from the results below, utilizing parallel gzip to compress these slugs can reduce the time it takes by up to 10x in some cases.
Decompression sees a speed-up as well, even though gzip decompression itself is single-threaded. The library's README attributes its claimed 104% decompression speedup over Go's standard implementation to read-ahead: decompression runs in a separate goroutine that prefetches and decompresses upcoming data while the caller is still consuming earlier output.
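To make that concrete, here's a small sketch of the reader side; `NewReaderN` lets you size the read-ahead explicitly (the buffer numbers below are arbitrary examples, not recommendations):

```go
package main

import (
	"io"
	"os"

	pgzip "github.com/klauspost/pgzip"
)

// decompress streams src to dst. pgzip prefetches and decompresses up to
// 4 blocks of 1 MiB each in a background goroutine while the caller is
// still consuming earlier output; both numbers are arbitrary examples.
func decompress(dst io.Writer, src io.Reader) error {
	zr, err := pgzip.NewReaderN(src, 1<<20, 4)
	if err != nil {
		return err
	}
	defer zr.Close()

	_, err = io.Copy(dst, zr)
	return err
}

func main() {
	// Usage: go run main.go < slug.tar.gz > slug.tar
	if err := decompress(os.Stdout, os.Stdin); err != nil {
		panic(err)
	}
}
```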
Testing environment
HashiCorp-provided Lenovo X1 Carbon on Arch Linux. 4c/8t i7-8665U.
File sizes via `du -sh`. I pulled a few random repos from our GitHub plus a large Android source repository (`platform_frameworks_base`), just to show the speedups in large workloads:
Compression code
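The original snippet isn't reproduced here, so the following is a hypothetical harness in the same spirit: gzip the same pre-built tarball (`repo.tar` is a stand-in name) with the stdlib and with pgzip, and print wall-clock times:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"time"

	pgzip "github.com/klauspost/pgzip"
)

// timeCompress gzips repo.tar to name using the supplied constructor and
// returns the elapsed wall-clock time, including the final flush on Close.
func timeCompress(name string, newWriter func(io.Writer) io.WriteCloser) time.Duration {
	in, err := os.Open("repo.tar") // hypothetical pre-built tarball
	if err != nil {
		panic(err)
	}
	defer in.Close()

	out, err := os.Create(name)
	if err != nil {
		panic(err)
	}
	defer out.Close()

	start := time.Now()
	zw := newWriter(out)
	if _, err := io.Copy(zw, in); err != nil {
		panic(err)
	}
	if err := zw.Close(); err != nil {
		panic(err)
	}
	return time.Since(start)
}

func main() {
	std := timeCompress("std.tar.gz", func(w io.Writer) io.WriteCloser { return gzip.NewWriter(w) })
	par := timeCompress("pgz.tar.gz", func(w io.Writer) io.WriteCloser { return pgzip.NewWriter(w) })
	fmt.Printf("stdlib gzip: %v\npgzip:       %v\n", std, par)
}
```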
Decompression code
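And the decompression counterpart, under the same assumptions, timing `gzip.NewReader` against `pgzip.NewReader` on the archives produced above:

```go
package main

import (
	"compress/gzip"
	"fmt"
	"io"
	"os"
	"time"

	pgzip "github.com/klauspost/pgzip"
)

// timeDecompress reads the whole archive through the supplied reader
// constructor, discarding the output so only decompression is measured.
func timeDecompress(path string, newReader func(io.Reader) (io.ReadCloser, error)) time.Duration {
	in, err := os.Open(path)
	if err != nil {
		panic(err)
	}
	defer in.Close()

	start := time.Now()
	zr, err := newReader(in)
	if err != nil {
		panic(err)
	}
	if _, err := io.Copy(io.Discard, zr); err != nil {
		panic(err)
	}
	zr.Close()
	return time.Since(start)
}

func main() {
	std := timeDecompress("std.tar.gz", func(r io.Reader) (io.ReadCloser, error) { return gzip.NewReader(r) })
	par := timeDecompress("pgz.tar.gz", func(r io.Reader) (io.ReadCloser, error) { return pgzip.NewReader(r) })
	fmt.Printf("stdlib gzip: %v\npgzip:       %v\n", std, par)
}
```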
Compression Results
Atlas
Consul
go-service
is-immutable-aws-vault-consul
Here we see a great example of <1 MB of data being slightly slower: about 1 ms on average, or (1 - (2.81/3.87)) ≈ 27% slower.
platform_frameworks_base
Here we see a great example of the speedups this patch is capable of. This is an extreme, 11-gigabyte example, but it serves to show how well this method scales.
Decompression Results
Decompression shows a slight improvement on average; not much to talk about.
Atlas
Consul
go-service
Here we see the decompression speed gap close for such a small amount of data.
is-immutable-aws-vault-consul
Here we see decompression is much slower with `pgzip` for such a small amount of data. This is likely due to the overhead of spawning the extra goroutines.
platform_frameworks_base