Scanning a project with many DLLs is slow #3455

Closed
KeylinxTobias opened this issue Nov 18, 2024 · 27 comments · Fixed by #3677
Assignees
Labels
bug (Something isn't working), good-first-issue (Good for newcomers)

Comments

@KeylinxTobias

Our build environment is disconnected from the Internet for security reasons, and we want to produce a non-enriched SBOM using Syft. However, when running Syft, scanning is very slow. With verbose logging we can see that certain DLLs take a lot of time. We have checked the configuration options and tried to disable as much as possible, but it is still slow. It seems that when scanning each file, Syft tries to reach the Internet to check something and then times out. A typical solution of ours can take about an hour to produce the SBOM. With other scanners like Trivy, we are down to seconds.

We mainly use NuGet packages in our solution and run our build environment on Windows servers.

How to reproduce:

  1. Download syft.exe and files, copy it to an offline machine
  2. Run a scan on a dotnet solution

Thanks in advance.

@KeylinxTobias added the bug label Nov 18, 2024
@wagoodman
Contributor

An initial glance shows that what's taking time is the dotnet-portable-executable-cataloger. The dotnet-deps-cataloger is clocking in rather fast (only seconds for large projects such as dotnet/sdk and dotnet/roslyn). On the other hand, looking specifically at the SDK project after running ./build.sh, we get a lot of DLLs:

find . | grep dll | wc -l                       
   36335

Now, the time to read that many DLLs makes sense; however, looking at the raw syft results, I think a few things need to be verified:

❯ time syft .     
 ✔ Indexed file system                                                                                                                                                                                    .
 ✔ Cataloged contents                                                                                                                      cdb4ee2aea69cc6a83331bbe96dc2caa9a299d21329efb0336fc02a82e1839a8
   ├── ✔ Packages                        [46,897 packages]  
   ├── ✔ File digests                    [36,540 files]  
   ├── ✔ File metadata                   [36,540 locations]  
   └── ✔ Executables                     [36,689 executables]  
...
System.Diagnostics.DiagnosticSource                                            9.0.24.52809                                                       dotnet                  (+1 duplicate)      
System.Diagnostics.EventLog                                                    10.0.0-alpha.1.24531.8                                             dotnet                  (+2 duplicates)     
System.Diagnostics.EventLog                                                    10.0.0-alpha.1.24565.3                                             dotnet                  (+102 duplicates)   
System.Diagnostics.EventLog                                                    10.0.24.53005                                                      dotnet                  (+3 duplicates)     
System.Diagnostics.EventLog                                                    10.0.24.53108                                                      dotnet                  (+7 duplicates)     
System.Diagnostics.EventLog                                                    10.0.24.55105                                                      dotnet                  (+3 duplicates)     
System.Diagnostics.EventLog                                                    10.0.24.56503                                                      dotnet                  (+145 duplicates)   
System.Diagnostics.EventLog                                                    7.0.0                                                              dotnet                  (+30 duplicates)    
System.Diagnostics.EventLog                                                    8.0.0                                                              dotnet                  (+5 duplicates)     
...
syft .  173.88s user 25.37s system 119% cpu 2:46.38 total

I see a lot of packages (46,897!) and a lot of them have multiple duplicates (some in the hundreds!). We try to keep distinct project dependency graphs, so it might be possible that this is correct, but I think it would be worth double checking this result too.
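
One way to double-check the duplicate counts is to total the `(+N duplicates)` markers straight from the table output. This is only a minimal sketch: it assumes the table was saved to a file and that rows look like the excerpt above. The helper names `rank_duplicates` and `total_duplicates` are made up for illustration and are not part of syft.

```shell
# rank_duplicates FILE...
# Prints "<N><TAB><row without marker>" for every table row that carries a
# "(+N duplicate[s])" marker, largest N first.
rank_duplicates() {
  awk '
    match($0, /\(\+[0-9]+ duplicate/) {
      # The number starts two characters into the match, right after "(+".
      n = substr($0, RSTART + 2) + 0
      row = substr($0, 1, RSTART - 1)
      gsub(/[ \t]+/, " ", row)   # collapse the column padding
      sub(/ $/, "", row)         # drop the trailing space left before the marker
      print n "\t" row
    }
  ' "$@" | sort -t "$(printf '\t')" -k1,1nr
}

# total_duplicates FILE...
# Prints the sum of all duplicate markers found in the given files.
total_duplicates() {
  rank_duplicates "$@" | awk -F '\t' '{ total += $1 } END { print total + 0 }'
}
```

With the table saved as, say, `table.txt`, `rank_duplicates table.txt | head` would list the most duplicated name/version pairs, which may help decide whether the duplicates reflect distinct dependency graphs or over-reporting.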

Back to the topic at hand: performance... I don't see how we can scan that many DLLs in only seconds, so that doesn't seem like the right answer here. The deps.json cataloger also has a lot to work with once a build has been performed:

$ find . | grep deps.json | wc -l
     354
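
Both counts above come from a substring grep, so they also include any path that merely contains "dll" or "deps.json" (a file such as foo.dll.config, for example). A stricter count by file name could look like the sketch below; `count_files` is a hypothetical helper name, and it assumes no file names contain newlines.

```shell
# count_files DIR PATTERN
# Counts regular files under DIR whose base name matches the glob PATTERN.
count_files() {
  find "$1" -type f -name "$2" | wc -l | tr -d ' '
}
```

For example, `count_files . '*.dll'` and `count_files . '*.deps.json'` give the number of files each cataloger is likely to look at.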

And when we ignore DLLs (thus look purely at the deps.json) we see much better performance:

❯ time syft . --exclude '**/*.dll'                                    
 ✔ Indexed file system                                                                                                                                                                                    .
 ✔ Cataloged contents                                                                                                                      cdb4ee2aea69cc6a83331bbe96dc2caa9a299d21329efb0336fc02a82e1839a8
   ├── ✔ Packages                        [10,787 packages]  
   ├── ✔ File digests                    [430 files]  
   ├── ✔ File metadata                   [430 locations]  
   └── ✔ Executables                     [497 executables]  
[0000]  WARN no explicit name and version provided for directory source, deriving artifact ID from the given path (which is not ideal)
NAME                                                                           VERSION                     TYPE                                      
.                                                                              10.0.100-dev                dotnet                  (+7 duplicates)    
.NET Host                                                                      6.0.4                       dotnet                                     
Argon                                                                          0.17.0                      dotnet                  (+1 duplicate)     
ArgumentForwarding.Tests                                                       10.0.100-dev                dotnet                                     
ArgumentsReflector                                                             10.0.100-dev                dotnet                  (+2 duplicates)    
Castle.Core                                                                    5.1.1                       dotnet                  (+16 duplicates)   
ConsoleDemoWithCasing                                                          1.0.0                       dotnet                  (+2 duplicates)    
DiffEngine                                                                     15.4.2                      dotnet                  (+1 duplicate)     
DiffPlex                                                                       1.5.0                       dotnet                                     
DiffPlex                                                                       1.7.2                       dotnet                  (+1 duplicate)     
DotNetWatchTasks                                                               10.0.100-dev                dotnet                  (+1 duplicate)     
DumpMinitool                                                                   17.1300.24.52301            dotnet                                     
DumpMinitool                                                                   17.1300.24.56301            dotnet                  (+4 duplicates)    
...
xunit.extensibility.core                                                       2.9.2                       dotnet                  (+50 duplicates)   
xunit.extensibility.execution                                                  2.9.2                       dotnet                  (+50 duplicates)   
xunit.runner.console                                                           2.9.2                       dotnet                  (+46 duplicates)   
xunit.runner.reporters                                                         2.9.2                       dotnet                  (+46 duplicates)   
xunit.runner.utility                                                           2.9.2                       dotnet                  (+46 duplicates)   
xunit.runner.visualstudio                                                      2.8.2                       dotnet                  (+46 duplicates)
A newer version of syft is available for download: 1.15.0 (installed version is 1.14.2)
syft . --exclude '**/*.dll'  7.50s user 3.08s system 99% cpu 10.665 total

As a workaround @KeylinxTobias can you try out running syft with --exclude '**/*.dll' and report back if it is both performant and accurate for you?
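
To judge accuracy, the package lists from the two runs can be diffed. The following is a rough sketch under some assumptions: each run's table output was saved to a file (for example `syft . > full.txt` and `syft . --exclude '**/*.dll' > no-dll.txt`), and rows follow the NAME / VERSION / TYPE layout shown above. `table_to_pkgs` and `only_in_first` are illustrative names, not syft features.

```shell
# table_to_pkgs FILE...
# Turns syft table rows into sorted, unique "name@version" lines.
table_to_pkgs() {
  awk '
    $1 == "NAME" && $2 == "VERSION" { next }    # skip the header row
    {
      # Drop a trailing "(+N duplicate[s])" marker, if present.
      sub(/[ \t]*\(\+[0-9]+ duplicates?\)[ \t]*$/, "")
      if (NF < 3) next                          # not a package row
      version = $(NF - 1)                       # last field is TYPE
      name = $1
      for (i = 2; i <= NF - 2; i++) name = name " " $i   # names may contain spaces
      print name "@" version
    }
  ' "$@" | LC_ALL=C sort -u
}

# only_in_first FULL_TABLE REDUCED_TABLE
# Prints packages that appear in the first table but not in the second.
only_in_first() {
  full_list=$(mktemp)
  reduced_list=$(mktemp)
  table_to_pkgs "$1" > "$full_list"
  table_to_pkgs "$2" > "$reduced_list"
  LC_ALL=C comm -23 "$full_list" "$reduced_list"
  rm -f "$full_list" "$reduced_list"
}
```

Running `only_in_first full.txt no-dll.txt` would then list the name/version pairs that only the DLL scan reports, which is the set to review before relying on the exclusion.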

Side note: I wonder if we should have a *.csproj cataloger? I'm not a dotnet developer, so this might be a bad suggestion for a syft enhancement.

@kzantow
Contributor

kzantow commented Nov 18, 2024

There is a request for a csproj cataloger: #1522

@TimBrown1611

Does this issue happen on Windows?

@popey
Contributor

popey commented Nov 25, 2024

@TimBrown1611 I don't believe this is a Windows-specific issue, no. Given @wagoodman ran ./build.sh, and as a fine upstanding gentleman, I presume he's running Linux (or macOS). The confusion often comes when talking about DLL files, as .NET on non-Windows operating systems also uses DLLs for shared libraries.

@tomersein
Contributor

Hi @popey!
I think I have a similar issue with this cataloger. Could you explain the workaround so I can see if it works?
Are you saying we can exclude DLL files and the packages will still appear in the SBOM from other catalogers?
Thanks!

@kzantow
Contributor

kzantow commented Dec 11, 2024

For discussion:

This sounds like a user is scanning a source directory with build artifacts and we don't want to include the build artifacts in the scan since we already have the build description files providing the information about what packages we should report.

This is similar to a Maven project, where you have a pom.xml at the top level but also a target directory, where after a build is run, it is populated with some build artifacts that are not what the user is interested in reporting.

This could also be related to how we use different catalogers if we're scanning image sources vs directories -- this may be somehow per-directory tree differences.

@rsphilk

rsphilk commented Jan 13, 2025

I have observed this behavior also on an offline build server. It could be traced back to repeated attempts to download certificate revocation lists for code signing certificates. This of course happens only when there are already signed binaries in the scanned stuff.
It was not possible to control that behavior or make syft use a proxy for this step via the environment variables. So the only solution was to enable internet access via proxy.

@wagoodman changed the title from "Syft scan in offline mode is slow" to "Scanning a project with many DLLs is slow" Jan 23, 2025
@kzantow
Contributor

kzantow commented Feb 3, 2025

Hey @rsphilk -- your last comment mentions that you traced down slowness to downloading certificate revocation lists. Are you saying that Syft is doing this when you scan something with it?

There is also a PR that improves performance when scanning DLLs, with initial testing showing that it may be twice as fast with significantly less memory allocation: #3563, this may help the situation for you, too.

@rsphilk

rsphilk commented Feb 3, 2025

Hi @kzantow

Are you saying that Syft is doing this when you scan something with it?

Yes, that is what I observed.
My assumption is that syft uses some Windows standard function which initiates the download of the certificate revocation lists. That would also explain why the syft proxy settings are ignored here.

@kzantow
Contributor

kzantow commented Feb 3, 2025

My assumption is that syft uses some Windows standard function which initiates the download of the certificate revocation lists. That would also explain why the syft proxy settings are ignored here.

That's a really interesting observation! Did you happen to have an example repo or other artifact we could use to reproduce this behavior on an offline Windows VM/machine?

@TimBrown1611

Hey @rsphilk -- your last comment mentions that you traced down slowness to downloading certificate revocation lists. Are you saying that Syft is doing this when you scan something with it?

There is also a PR that improves performance when scanning DLLs, with initial testing showing that it may be twice as fast with significantly less memory allocation: #3563, this may help the situation for you, too.

do you have any eta when the PR will be merged? we can't upgrade syft due to the performance :(

@kzantow
Contributor

kzantow commented Feb 3, 2025

do you have any eta when the PR will be merged? we can't upgrade syft due to the performance :(

The PR limits allocations, but not total concurrent memory usage. I don't think it is going to solve the other issues that you have raised; I'm not sure how upgrading would be blocked on that PR.

@TimBrown1611

do you have any eta when the PR will be merged? we can't upgrade syft due to the performance :(

The PR limits allocations, but not total concurrent memory usage. I don't think it is going to solve the other issues that you have raised; I'm not sure how upgrading would be blocked on that PR.

I am using syft 1.16.0. I can't point to which commit or change caused it, but scanning a Windows machine became much slower (because of the mentioned cataloger): in the latest versions of syft it takes 3x more time to scan DLLs.
Won't reducing the memory make the scans faster?

@kzantow
Contributor

kzantow commented Feb 3, 2025

@TimBrown1611 -- if I look at the commits from 1.16.0 to 1.19.0, I am not seeing anything that directly affected the Dotnet Portable Executable cataloger, which is the one scanning DLLs. There was a dotnet-packages-lock-cataloger added, but that scans specifically named JSON files and you could just disable that if it's problematic for you (with --select-catalogers -dotnet-packages-lock-cataloger). If there really was a performance regression from 1.16 to 1.19, we should try to figure out what is causing it, though! Is there an image we could run them both with to see the problem?
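
A quick way to compare two releases on the same target is to time each binary in turn. This is only a sketch: it assumes both syft binaries are already downloaded, and the paths in the usage line are placeholders. It relies on `date +%s`, which is widely available but gives whole-second resolution only.

```shell
# elapsed_seconds CMD [ARGS...]
# Runs the command with its output discarded and prints the wall-clock
# time it took, in whole seconds.
elapsed_seconds() {
  start=$(date +%s)
  "$@" > /dev/null 2>&1
  end=$(date +%s)
  echo $((end - start))
}

# compare_scans TARGET BIN...
# Prints "<binary> <seconds>" for each binary run against TARGET.
compare_scans() {
  target=$1
  shift
  for bin in "$@"; do
    echo "$bin $(elapsed_seconds "$bin" "$target")"
  done
}
```

Something like `compare_scans ./some-project ./syft-1.16.0 ./syft-1.19.0` would then print one line per version, which is enough to confirm or rule out a large regression before digging into individual catalogers.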

@TimBrown1611

Hi!
I've done some research on my end (added some stats to measure the time gaps).
The issue is mainly in the dotnet-portable-executable-cataloger.

I don't have an image, but I've scanned AMIs of:
Windows Server 2025
Windows Server 2019
Windows Server 2022

@popey
Contributor

popey commented Feb 6, 2025

@rsphilk / @TimBrown1611 is it possible the network connections and delays are due to Microsoft Windows SmartScreen?

I am not a Windows expert, but I understand it has the capability to fingerprint binaries (programs and libraries), then submit them to an online Microsoft database to check them for known issues.

If you're scanning an entire filesystem, I imagine this will cause Syft to open every binary, and potentially trigger SmartScreen. Is that something worth considering?

@TimBrown1611

@rsphilk / @TimBrown1611 is it possible the network connections and delays are due to Microsoft Windows SmartScreen?

I am not a Windows expert, but I understand it has the capability to fingerprint binaries (programs and libraries), then submit with an online Microsoft database to check them for known issues.

If you're scanning an entire filesystem, I imagine this will cause Syft to open every binary, and potentially trigger SmartScreen. Is that something worth considering?

Hi @popey,
in my case this cannot be the issue.

I do see big gaps in performance between the different syft versions.

@rogueai
Contributor

rogueai commented Feb 21, 2025

Hi, I think I'm facing the same issue. I'll try to report as many details as I can:

We're running syft 1.19.0 on Windows with no internet connection to the outside. We're running scans against several projects with varying numbers of DLLs. Just to give you an idea: a project with around 100 DLLs takes 30-50 min to be analyzed.

This seems to be happening only on windows. On linux the same project takes about 10-20 sec, which is acceptable.

Enabling TRACE logs, it looks like the dotnet-portable-executable-cataloger gets stuck parsing file contents for a very long time on some (but not all) files, eventually running into an error, for instance:

[0001] TRACE parsing file contents path=\bin\DocumentFormat.OpenXml.dll
[0023] TRACE unexpected stdout: ERROR msg=failed to loadSystemRoots: exit status 0x80072ee2

loadSystemRoots doesn't seem to be related to any syft code, and I wonder if it's related to a previous comment about windows trying to do something concerning some kind of certificates?
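
Stalls like this one can be pulled out of a saved log by looking for large jumps between consecutive timestamps. A small sketch, assuming the bracketed prefix is elapsed seconds since start (which is how the excerpt above reads) and that the TRACE log was saved to a file; `find_stalls` is a made-up helper name.

```shell
# find_stalls MIN_GAP FILE...
# Prints "<gap>s <line>" for every timestamped log line that was followed
# by at least MIN_GAP seconds before the next timestamped line.
find_stalls() {
  min_gap=$1
  shift
  awk -v min="$min_gap" '
    /^\[[0-9]+\]/ {
      # Extract the number between "[" and the first "]".
      ts = substr($0, 2, index($0, "]") - 2) + 0
      if (seen && ts - prev_ts >= min) print (ts - prev_ts) "s " prev_line
      prev_ts = ts
      prev_line = $0
      seen = 1
    }
  ' "$@"
}
```

On the two log lines quoted above, `find_stalls 10 syft.log` would report a 22 second gap after the DocumentFormat.OpenXml.dll line, so the files that trigger the timeout stand out without reading the whole log.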

@rogueai
Contributor

rogueai commented Feb 21, 2025

Update: I think I might be onto something. I investigated the error failed to loadSystemRoots.
The error comes from the module saferwall/pe

As stated in the comment:

Unfortunately, Windows does not ship with all of its root certificates installed. Instead, it downloads them on-demand

And that might explain why it gets stuck on machines without an internet connection. Windows might be trying to reach the external internet and then times out.

I've patched syft to disable cert validation, just to verify my assumptions, and now the cataloger is very fast indeed.

On my side, I'll try to figure out what urls need to be accessed by windows to download these certificates so that I can let them through the proxy/firewall, but still it would be nice to have the option to configure syft to ignore cert validation for the cataloger.

@kzantow
Contributor

kzantow commented Feb 21, 2025

That is a great find @rogueai -- I think in this context, disabling the cert validation is a fine thing to do -- we're not trying to validate these files are signed/valid, we're just trying to identify them. MAYBE some user would like to also validate them, though -- a PR to make this an option, disabled by default would be 👌, if you happen to have any time

@kzantow moved this to Ready in OSS Feb 21, 2025
@kzantow added the good-first-issue label Feb 21, 2025
@rogueai
Contributor

rogueai commented Feb 21, 2025

No problem, I'm not terribly versed in go, but nevertheless I'm going to take a shot at this and see what I can come up with.

@kzantow
Contributor

kzantow commented Feb 21, 2025

This change could be slightly unclear if you're not familiar with how syft configuration works; if you wanted to start a PR with the change you referenced, I could help getting the configuration passed through

@rogueai
Contributor

rogueai commented Feb 21, 2025

@kzantow I've created a draft PR; in the meantime I'll familiarize myself with the code base, cheers!

@kzantow
Contributor

kzantow commented Feb 21, 2025

@rogueai thanks much! it's a little bit of a pain to get the configuration plumbed through when there isn't an existing configuration object, easier when there is one... but I've added this to your PR, could you validate it behaves as you expect? and if you feel like adding any sort of test, that would be great (not entirely sure how to do this, since it's testing internal behavior of the library you mentioned); I think if you validate this, we could probably get it merged as-is (I'll get someone on the team to give a second set of eyes to my change)

@kzantow self-assigned this Feb 21, 2025
@kzantow moved this from Ready to In Review in OSS Feb 21, 2025
@rogueai
Contributor

rogueai commented Feb 21, 2025

Thanks @kzantow , I validated the changes against the problematic machine and it looks like it's working as intended:

  • no config added: it's picking up the default value and scan is fast
  • dotnet.enable-certificate-validation: true: scan is slow
  • dotnet.enable-certificate-validation: false: scan is fast
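
For reference, the option tested above would sit in a syft config file roughly as written by the helper below. This is a sketch, assuming the dotted key dotnet.enable-certificate-validation maps to nested YAML the way other syft options do; the helper name `write_dotnet_config` and the file path in the usage line are just examples.

```shell
# write_dotnet_config FILE VALUE
# Writes a minimal config file that sets
# dotnet.enable-certificate-validation to VALUE (true or false).
write_dotnet_config() {
  case "$2" in
    true|false) ;;
    *) echo "VALUE must be true or false" >&2; return 1 ;;
  esac
  cat > "$1" <<EOF
dotnet:
  enable-certificate-validation: $2
EOF
}
```

After `write_dotnet_config ./syft-offline.yaml false`, the file can be passed to a scan with `syft . -c ./syft-offline.yaml`. Since the default already leaves validation off, an explicit `false` mainly documents the intent on offline build servers.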

Thanks for helping with this and for getting this done so quickly!

@rogueai
Contributor

rogueai commented Feb 21, 2025

On the testing side, I'm not sure how to add any meaningful test either to be honest.
Is there any documentation I need to prepare for the new config option?

@kzantow
Contributor

kzantow commented Feb 21, 2025

@rogueai no worries, I think we can move this one forward as-is; thank you for verifying it!

Projects
Status: Done
8 participants