
Configurable package merge behavior #3485

Open
wagoodman opened this issue Nov 27, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@wagoodman
Contributor

wagoodman commented Nov 27, 2024

What would you like to be added:
The user should be able to configure with more nuance:

  • how to detect when two packages should be merged
  • how to persist package details for merged packages

For instance, today you can assume that any pkg.Package.Metadata will be a single struct that represents details for a single package. When merging packages today, these structs MUST be identical. The proposal is to relax this requirement so that users can merge similar packages that have different cataloged data.

This requires us to "join" metadatas like so:

// assign this to pkg.Package.Metadata
// note: we logically want a map, but not actually so
type JoinedMetadata map[file.CoordinateSet]any

This way when merging packages with fuzzier logic we can still keep all information (try not to drop anything).
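As a rough sketch of the joining idea (using a plain string as a stand-in for file.CoordinateSet, since the real syft type is richer and not directly comparable as shown here), merging two packages' metadata without dropping either could look like:

```go
package main

import "fmt"

// CoordinateSet stands in for file.CoordinateSet; here just a path string.
type CoordinateSet string

// JoinedMetadata keeps every merged package's metadata keyed by where it was found.
type JoinedMetadata map[CoordinateSet]any

// joinMetadata combines two joined-metadata maps, preserving entries from both.
func joinMetadata(a, b JoinedMetadata) JoinedMetadata {
	out := make(JoinedMetadata, len(a)+len(b))
	for k, v := range a {
		out[k] = v
	}
	for k, v := range b {
		out[k] = v // identical coordinates collide; real logic would need a tiebreak
	}
	return out
}

func main() {
	a := JoinedMetadata{"/app/requirements.txt": map[string]string{"version": "1.2"}}
	b := JoinedMetadata{"/venv/METADATA": map[string]string{"version": "1.2", "author": "x"}}
	merged := joinMetadata(a, b)
	fmt.Println(len(merged)) // both sources retained
}
```

The point is that a merged package's Metadata becomes a collection keyed by origin rather than a single struct, so no cataloged detail is lost.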

From a user configuration perspective, this could look like so:

package:
  merge:
    # hash location paths when making package IDs
    use-paths: true
    
    # hash location layer information when making package IDs
    use-layers: false

    # hash package metadata struct when making package IDs
    use-metadata: true

    use-license: true
    use-purl: true
    # use....

... this is probably a bad example; I think a good part of this issue discussion should be about how the user would specify this.
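Even if the exact knobs change, the intent behind them can be sketched: each boolean decides whether a field contributes to the package identity hash, so excluding a field makes packages that differ only in that field merge. A minimal sketch (config and field names are illustrative, not syft's real implementation):

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// MergeConfig mirrors the proposed package.merge keys (names are illustrative).
type MergeConfig struct {
	UsePaths    bool
	UseLayers   bool
	UseMetadata bool
}

// packageID hashes only the fields the config opts into, so two packages that
// differ only in an excluded field get the same ID and are merged.
func packageID(cfg MergeConfig, name, version, path, layer, metadata string) string {
	h := sha256.New()
	fmt.Fprint(h, name, version)
	if cfg.UsePaths {
		fmt.Fprint(h, path)
	}
	if cfg.UseLayers {
		fmt.Fprint(h, layer)
	}
	if cfg.UseMetadata {
		fmt.Fprint(h, metadata)
	}
	return fmt.Sprintf("%x", h.Sum(nil))[:12]
}

func main() {
	cfg := MergeConfig{UsePaths: false, UseLayers: false, UseMetadata: true}
	a := packageID(cfg, "flask", "2.0", "/app", "layer1", "{}")
	b := packageID(cfg, "flask", "2.0", "/venv", "layer2", "{}")
	fmt.Println(a == b) // true: path and layer excluded from identity
}
```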

Another way to do this is to make a single flag instead of a config for this:

syft --merge-pkg ENUMVALUE

(handwaving at the enum values now)

It's very possible that we should have specific heuristics for specific cataloging cases, so rather than exposing these low-level options directly, we could let users allow or deny a list of named heuristics.

Another variant, specify a list of package types to merge:

# be more aggressive on deduplication logic for packages of the given types
# note: this does not merge across types, so you can still only merge python packages with other python packages.
syft --merge-pkg-types python,ruby,binary,rpm
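The semantics of that allow-list are simple: aggressive deduplication applies only within a listed type, never across types. A tiny sketch (helper name is hypothetical):

```go
package main

import "fmt"

// canMerge applies the proposed --merge-pkg-types allow-list: aggressive
// deduplication only within a listed type, never across types.
func canMerge(allowed map[string]bool, typeA, typeB string) bool {
	return typeA == typeB && allowed[typeA]
}

func main() {
	allowed := map[string]bool{"python": true, "ruby": true, "binary": true, "rpm": true}
	fmt.Println(canMerge(allowed, "python", "python")) // true
	fmt.Println(canMerge(allowed, "python", "rpm"))    // false: cross-type
	fmt.Println(canMerge(allowed, "deb", "deb"))       // false: not in the list
}
```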

Why is this needed:
Today syft makes a few assumptions about what makes a distinctive package:

  • packages with the exact same core + metadata information in different project trees (different paths) are considered to be distinct
  • packages with the same core + metadata information in the same path but different layers are considered to NOT be distinct

This allows us to produce SBOMs with the "maximum resolution" so to speak -- you can answer questions about separate project trees (since they are not merged). However, this also has the downside of producing potentially large SBOMs and package graphs when logically they may describe the same application. Merging dependency trees may be ideal for a user's use case. We have a few detailed examples of these:

This hints that we should make this behavior configurable.

Additional context:

Today we have deduplication of OS and binary packages, which is one of the only cases of cross-package-type deduplication behavior (by dropping). How would this be affected by the proposed configuration?

@willmurphyscode
Contributor

We should discuss #3554 (comment) at the same time.

@wagoodman
Contributor Author

From the live discussion:

  • we could start adding confidence values onto package detections, so that merge operations on single-valued fields have a clearer precedence (which field overrides the other)
  • should pkg.Package have a shadowPackages slice recording which packages were subsumed?
  • relationships also need to be reconciled (merging package graphs)
  • this might be made easier within an SBOMWriter interface (which takes a set of packages and relationships to write into an SBOM object). The same idea came up in the discussion of spooling results to disk.
  • we have a path in syft 1.0 to do this work
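The shadowPackages idea from the discussion could be sketched like this (the field does not exist in syft today; the struct below is a stand-in for pkg.Package):

```go
package main

import "fmt"

// Package is a stand-in for pkg.Package, extended with the shadowPackages
// idea from the discussion: packages subsumed during a merge are retained.
type Package struct {
	Name           string
	Version        string
	shadowPackages []Package // hypothetical field, not in syft today
}

// merge keeps the winner and records the subsumed package as a shadow,
// so no detection is silently dropped.
func merge(winner, subsumed Package) Package {
	winner.shadowPackages = append(winner.shadowPackages, subsumed)
	return winner
}

func main() {
	a := Package{Name: "openssl", Version: "3.0.1"}
	b := Package{Name: "openssl", Version: "3.0.1"}
	m := merge(a, b)
	fmt.Println(len(m.shadowPackages)) // 1
}
```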

@jimmystewpot

jimmystewpot commented Jan 27, 2025

I've been thinking about this issue as it relates to a PR that I have open. One approach that comes to mind is extending the existing package types. This would add another layer of depth to the existing Package struct. This exists in principle today, with pkg.Package having the metadata interface.

type Package struct {
	id       artifact.ID `hash:"ignore"`
	Name     string      // the package name
	// redacted
	Metadata interface{} // additional data found while parsing the package source
}

With the existing type layout, the Metadata field gives us the second layer, which could then be extended to make each cataloger's metadata extensible in the same fashion.

type CatalogerMetadata struct {
	Source       string
	Checksum     string
	Dependencies []Dependency
	Enriched     interface{} // this could be Enriched, Enrichment.
}

This shouldn't be a breaking change and would be configured to default off. When a cataloger has an online enrichment possibility, it can be enabled; each cataloger would uniformly use the Enrichment interface. In this situation, the merge configuration becomes more of an override pattern than a merge, although I like the sound of merge as the configuration key. This extends the information available.

cargo-auditable-binary-cataloger:
  enrichment: true
  merge:
    fields:
      licenses: enrichment

The merge configuration must be validated to ensure the behaviour is applied consistently across catalogers. All data is left in place, and the override or merge only matters when it's being written out based on the output. This model would also allow multiple concurrent metadata types to coexist with more elaborate rules, where the merge fields would have more elaborate configurations.

cargo-auditable-binary-cataloger:
  enrichment: true
  merge:
    fields:
      licenses:
        from: crates-io-enrichment
        fieldName: xxx
      author:
        from: github.com-enrichment
        fieldName: contributors

In this model, there would likely need to be a Metadata type considered to be a "complete" SBOM output entry, with the extensibility pattern still in place for custom fields that could be used. It appears this would be the "simple" solution for merging in this scenario, having more than one metadata type merged into a single entry while maintaining all of the original information.

In this scenario, if we did have weighted logic, the merge/fields could be extended with a key for per-cataloger weights based on completeness or similar priorities.

cargo-auditable-binary-cataloger:
  enrichment: true
  merge:
    priorities:
      - crates-io-enrichment
      - github-com-enrichment
      - my-artifactory-mirror-enrichment

This solution handles the use case of merging metadata that's aligned in nature; but what happens when more than one cataloger and enricher support the same types of files, or when there are unrelated enrichment services? These would fall outside the cataloger-local metadata extension idea and require a global package-merge level.

In this scenario, the configuration idea could still be applied at a higher level. I've included priority and field-level config for context.

package:
  merge:
    priorities:
    - crates-io-enrichment
    - github-com-enrichment
    - my-artifactory-mirror-enrichment
    fields:
      licenses:
        from: super-licensing-cataloger
        fieldName: licenses

In this example, both of the priorities lists rely on YAML sequence order. However, as suggested above, priority could also be derived from other metrics, like completeness.
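The ordered-priority resolution described above can be sketched as follows (source and function names are illustrative; the first source in the priority list that produced a value wins):

```go
package main

import "fmt"

// resolveField picks a field value according to the configured priorities list:
// the first source in the list that produced a value wins.
func resolveField(priorities []string, values map[string]string) (string, bool) {
	for _, source := range priorities {
		if v, ok := values[source]; ok {
			return v, true
		}
	}
	return "", false
}

func main() {
	priorities := []string{
		"crates-io-enrichment",
		"github-com-enrichment",
		"my-artifactory-mirror-enrichment",
	}
	// hypothetical license values reported by two of the three sources
	licenses := map[string]string{
		"github-com-enrichment":            "MIT",
		"my-artifactory-mirror-enrichment": "Apache-2.0",
	}
	v, _ := resolveField(priorities, licenses)
	fmt.Println(v) // MIT: the highest-priority source that has a value
}
```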
