Scanning a project with many DLLs is slow #3455
An initial glance shows that what's taking time is the
The time needed to read that many DLLs makes sense; however, looking at the raw syft results, I think a few things need to be verified:
I see a lot of packages (46,897!) and a lot of them have multiple duplicates (some in the hundreds!). We try to keep distinct project dependency graphs, so it might be possible that this is correct, but I think it would be worth double checking this result too. Back to the topic at hand: performance... I don't see how we can scan that many DLLs in only seconds, so that doesn't seem like the right answer here. The
And when we ignore DLLs (thus look purely at the deps.json) we see much better performance:
As a workaround @KeylinxTobias can you try out running syft with Side note: I wonder if we should have a |
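To double-check the duplicate count mentioned in the comment above, a minimal Go sketch like the following could tally repeated name/version pairs in syft's JSON output. It assumes the report was produced with `-o json` and that packages sit under the top-level `artifacts` array with `name` and `version` fields. `duplicateCounts` and the inline sample are hypothetical and for illustration only. Treating an equal name and version as a "duplicate" is itself an assumption, since syft may intentionally keep packages found in different locations distinct.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"sort"
)

// sbom models only the fields this sketch needs from a syft JSON report.
type sbom struct {
	Artifacts []struct {
		Name    string `json:"name"`
		Version string `json:"version"`
	} `json:"artifacts"`
}

// sample is a tiny made-up report used when no file path is given.
const sample = `{"artifacts":[
  {"name":"Example.Lib","version":"1.0.0"},
  {"name":"Example.Lib","version":"1.0.0"},
  {"name":"Other.Lib","version":"2.1.0"}]}`

// duplicateCounts returns every "name@version" key that occurs more than
// once in the report, mapped to its number of occurrences.
func duplicateCounts(data []byte) (map[string]int, error) {
	var doc sbom
	if err := json.Unmarshal(data, &doc); err != nil {
		return nil, err
	}
	counts := make(map[string]int)
	for _, a := range doc.Artifacts {
		counts[a.Name+"@"+a.Version]++
	}
	// Drop keys that appear only once; deleting during range is safe in Go.
	for k, n := range counts {
		if n < 2 {
			delete(counts, k)
		}
	}
	return counts, nil
}

func main() {
	data := []byte(sample)
	if len(os.Args) > 1 {
		var err error
		if data, err = os.ReadFile(os.Args[1]); err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
	}
	dupes, err := duplicateCounts(data)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// Sort keys so the report is stable between runs.
	keys := make([]string, 0, len(dupes))
	for k := range dupes {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	for _, k := range keys {
		fmt.Printf("%s appears %d times\n", k, dupes[k])
	}
}
```

Run it with the path to a saved report as the first argument, or with no arguments to use the inline sample.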
There is a request for a csproj cataloger: #1522 |
Does this issue happen on Windows? |
@TimBrown1611 I don't believe this is a Windows-specific issue, no. Given @wagoodman ran |
hi @popey ! |
For discussion: This sounds like a user is scanning a source directory that contains build artifacts. We don't want to include those artifacts in the scan, since the build description files already tell us which packages to report. This is similar to a Maven project, where you have a This could also be related to how we use different catalogers when scanning image sources vs directories -- there may somehow be per-directory-tree differences. |
I have also observed this behavior on an offline build server. I traced it back to repeated attempts to download certificate revocation lists for code signing certificates. This of course happens only when the scanned content already contains signed binaries. |
Hey @rsphilk -- your last comment mentions that you traced down slowness to downloading certificate revocation lists. Are you saying that Syft is doing this when you scan something with it? There is also a PR that improves performance when scanning DLLs, with initial testing showing that it may be twice as fast with significantly less memory allocation: #3563, this may help the situation for you, too. |
Hi @kzantow
Yes, that is what I observed. |
That's a really interesting observation! Did you happen to have an example repo or other artifact we could use to reproduce this behavior on an offline Windows VM/machine? |
Do you have an ETA for when the PR will be merged? We can't upgrade syft because of the performance :( |
The PR limits allocations, but not total unreclaimable memory, as far as I understand it. I don't think it is going to solve the other issues that you have raised; I'm not sure how upgrading would be blocked on that PR. |
I am using syft 1.16.0. I can't pinpoint which change caused it, but scanning a Windows machine started taking much longer (because of the cataloger mentioned above). |
@TimBrown1611 -- if I look at the commits from 1.16.0 to 1.19.0, I am not seeing anything that directly affected the Dotnet Portable Executable cataloger, which is the one scanning DLLs. There was a |
hi! I don't have an image, but I've scanned AMI |
@rsphilk / @TimBrown1611 is it possible the network connections and delays are due to Microsoft Windows SmartScreen? I am not a Windows expert, but I understand it has the capability to fingerprint binaries (programs and libraries), then submit with an online Microsoft database to check them for known issues. If you're scanning an entire filesystem, I imagine this will cause Syft to open every binary, and potentially trigger SmartScreen. Is that something worth considering? |
hi @popey I do see big gaps between the performances of the different syft's |
Hi, I think I'm facing the same issue. I'll try to report as many details as I can: We're running syft 1.19.0 on Windows with no internet connection to the outside. We're running scans against several projects with varying numbers of DLLs. Just to give you an idea: a project with around 100 DLLs takes 30-50 min to be analyzed. This seems to be happening only on Windows. On Linux the same project takes about 10-20 sec, which is acceptable. Enabling TRACE logs, it looks like the
|
Update: I think I might be on to something. I investigated the error As stated in the comment:
And that might explain why it gets stuck on machines without an internet connection: Windows might be trying to reach the external internet and then timing out. I've patched syft to disable cert validation, just to verify my assumptions, and now the cataloger is very fast indeed. On my side, I'll try to figure out which URLs Windows needs to access to download these certificates so that I can let them through the proxy/firewall, but it would still be nice to have the option to configure syft to ignore cert validation for the cataloger. |
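For the step of figuring out which URLs need to be let through the proxy/firewall, one possible starting point is to read the revocation endpoints directly from a signing certificate. The sketch below uses only Go's standard `crypto/x509` package. It assumes the certificate has already been exported from a signed binary in DER form (that extraction is not shown), and `revocationURLs` and `demoCert` are hypothetical helpers. The demo certificate is a throwaway self-signed one with placeholder URLs. This does not reproduce what Windows does during validation, which may contact other endpoints as well.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"fmt"
	"math/big"
	"time"
)

// revocationURLs lists the network endpoints a DER-encoded certificate
// advertises: CRL distribution points, OCSP responders, issuer cert URLs.
func revocationURLs(der []byte) ([]string, error) {
	cert, err := x509.ParseCertificate(der)
	if err != nil {
		return nil, err
	}
	var urls []string
	urls = append(urls, cert.CRLDistributionPoints...)
	urls = append(urls, cert.OCSPServer...)
	urls = append(urls, cert.IssuingCertificateURL...)
	return urls, nil
}

// demoCert builds a self-signed certificate with placeholder endpoints,
// standing in for a real code signing certificate (assumption).
func demoCert() ([]byte, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return nil, err
	}
	tmpl := &x509.Certificate{
		SerialNumber:          big.NewInt(1),
		Subject:               pkix.Name{CommonName: "demo signer"},
		NotBefore:             time.Now().Add(-time.Hour),
		NotAfter:              time.Now().Add(time.Hour),
		CRLDistributionPoints: []string{"http://crl.example.invalid/demo.crl"},
		OCSPServer:            []string{"http://ocsp.example.invalid"},
	}
	return x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
}

func main() {
	der, err := demoCert()
	if err != nil {
		panic(err)
	}
	urls, err := revocationURLs(der)
	if err != nil {
		panic(err)
	}
	for _, u := range urls {
		fmt.Println(u)
	}
}
```

With a real certificate, the printed hosts would be candidates for an allowlist, though every certificate in the signing chain may need to be checked the same way.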
That is a great find @rogueai -- I think in this context, disabling the cert validation is a fine thing to do -- we're not trying to validate these files are signed/valid, we're just trying to identify them. MAYBE some user would like to also validate them, though -- a PR to make this an option, disabled by default would be 👌, if you happen to have any time |
No problem. I'm not terribly versed in Go, but nevertheless I'm going to take a shot at this and see what I can come up with. |
This change could be slightly unclear if you're not familiar with how syft configuration works; if you wanted to start a PR with the change you referenced, I could help getting the configuration passed through |
@kzantow I've created a draft PR; in the meantime I'll familiarize myself with the code base, cheers! |
@rogueai thanks much! it's a little bit of a pain to get the configuration plumbed through when there isn't an existing configuration object, easier when there is one... but I've added this to your PR, could you validate it behaves as you expect? and if you feel like adding any sort of test, that would be great (not entirely sure how to do this, since it's testing internal behavior of the library you mentioned); I think if you validate this, we could probably get it merged as-is (I'll get someone on the team to give a second set of eyes to my change) |
Thanks @kzantow , I validated the changes against the problematic machine and it looks like it's working as intended:
Thanks for helping with this and for getting this done so quickly! |
On the testing side, I'm not sure how to add any meaningful test either to be honest. |
@rogueai no worries, I think we can move this one forward as-is; thank you for verifying it! |
Our build environment is disconnected from the Internet for security reasons, and we want to be able to produce a non-enriched SBOM using Syft. However, when running Syft, scanning is very slow. Using verbose logging, we can see that certain DLLs take a lot of time. We have checked the configuration options and tried to disable as much as possible, but it is still slow. It seems that when scanning each file, Syft tries to reach the Internet to check something and then times out? A typical solution of ours can sometimes take about an hour to produce the SBOM. With other scanners like Trivy, it takes seconds.
We mainly use NuGet packages in our solution and use Windows servers for our build environment.
How to reproduce:
Thanks in advance.