Enhancement request: Support global, per product, and per package registry/proxy configuration for purl2url #256

Open
robertguetzkow opened this issue Feb 8, 2025 · 9 comments
Labels: design needed (Design details needed to complete the issue), enhancement (New feature or request)

Comments

@robertguetzkow

robertguetzkow commented Feb 8, 2025

Is your enhancement request related to a problem? Please describe.
Currently, DejaCode attempts to convert the PURL to a URL in order to submit a scan to ScanCode.io. Since PURLs do not indicate where the package is actually stored, this does not work when packages are stored in an internal artifact registry service (such as Nexus, JFrog Artifactory, or similar). Additionally, downloading packages directly from the package registries could trigger rate limits, and some organizations may block direct downloads for security reasons, so that the use of an artifact proxy registry may be required. In that setup, instead of requesting a package from e.g. npm, one directs the request at the proxy registry on the internal artifact server, which then requests it from npm and caches the package data for subsequent requests.

What are the benefits of the requested enhancement?
If the target registry could be configured, then downloads from an internal artifact registry would be possible. This would help with analyzing internal packages that are not published to the public registry of the respective package manager and also support usage in security restricted environments.

Describe the solution you would like
The suggested approach is a tiered configuration (global, per product, and per package) in which the target registry can be set for each supported package manager, overriding the default purl2url behavior. The global configuration would be overridden by the product configuration, and the product configuration by the package configuration.
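
A minimal sketch of the tiered lookup described above. The function name, the dictionary layout, and the example registry URLs are hypothetical and do not come from DejaCode; the sketch only illustrates the precedence package > product > global.

```python
from typing import Dict, Optional


def resolve_registry(
    package_type: str,
    global_cfg: Dict[str, str],
    product_cfg: Optional[Dict[str, str]] = None,
    package_cfg: Optional[Dict[str, str]] = None,
) -> Optional[str]:
    """Return the most specific registry URL configured for package_type.

    Each config maps a package type (e.g. "npm") to a registry URL.
    Returning None means "no override, use the default purl2url behavior".
    """
    # Check the most specific level first so that it wins when present.
    for cfg in (package_cfg, product_cfg, global_cfg):
        if cfg and package_type in cfg:
            return cfg[package_type]
    return None


# Hypothetical example values.
global_cfg = {"npm": "https://nexus.example.com/repository/npm-proxy"}
product_cfg = {"npm": "https://artifactory.example.com/api/npm/team-npm"}

print(resolve_registry("npm", global_cfg))                # global setting
print(resolve_registry("npm", global_cfg, product_cfg))   # product overrides global
print(resolve_registry("pypi", global_cfg, product_cfg))  # no override configured
```

Returning `None` rather than a hard-coded default keeps the decision about the public registry with whatever code currently performs the purl2url conversion.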

Additional notes
This feature should be compatible with SBOM imports followed by automatic scans (applying the global or product settings or, if the package already exists, the package-specific one).

@robertguetzkow added the design needed and enhancement labels Feb 8, 2025
@robertguetzkow
Author

robertguetzkow commented Feb 8, 2025

I would also suggest using resolved URLs as the basis when they are already included in the SBOM and identifiable, instead of relying on purl2url which might not work for every package manager.

@pombredanne
Member

@robertguetzkow Thanks. When you say proxy, do you mean an alternative default repository_url that would override the spec default for a type?

If the package is not found there, should we fall back to the default registry?

@robertguetzkow
Author

robertguetzkow commented Feb 12, 2025

I'm not quite sure what the repository_url is that you are referencing, so I'll try to describe it.

From my understanding, purl2url resolves to the standard/main registry of the respective package manager. However, if we use our own internal artifact registry, the download URL has to match the package's location on that internal server. Simply replacing the domain name will not be enough, as the path structure of the artifact registry service might be different. What I was proposing is that, if the user configures it to not use the main package registry, then instead of trying to resolve the PURL we use either a direct URL to the package on an internal artifact registry, manually defined by the user, or an already resolved URL if one is present in the SBOM. The latter does not seem to be standardized, and tools like cdxgen use externalReferences or properties to provide this as additional data.
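
A minimal sketch of that selection order, assuming a CycloneDX-style component dictionary. The function name and the example URLs are hypothetical. `distribution` is an external reference type defined by CycloneDX, but whether a given SBOM tool actually places the download URL there is an assumption.

```python
from typing import Optional


def pick_download_url(component: dict, user_url: Optional[str] = None) -> Optional[str]:
    """Prefer a user-defined URL, then a 'distribution' external reference.

    Returns None when neither is available, in which case the caller
    would fall back to resolving the PURL.
    """
    if user_url:
        return user_url
    for ref in component.get("externalReferences", []):
        if ref.get("type") == "distribution" and ref.get("url"):
            return ref["url"]
    return None


# Hypothetical component as it might appear in an imported SBOM.
component = {
    "purl": "pkg:npm/left-pad@1.3.0",
    "externalReferences": [
        {"type": "vcs", "url": "https://git.example.com/left-pad.git"},
        {
            "type": "distribution",
            "url": "https://nexus.example.com/repository/npm-proxy/left-pad/-/left-pad-1.3.0.tgz",
        },
    ],
}

print(pick_download_url(component))
```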

From my understanding, the actual VCS URL pointing to the source code of the package is a different topic, if that is what you were referring to with the repository_url. It can also be used for scanning for license information, but it is unrelated to the internal artifact registry. There can also be internal source code repositories, though, similar to the artifact registry. I have not thought about that aspect and have not encountered issues with that case so far, most likely because DejaCode currently does not make use of the VCS URL returned by ScanCode.io (#255).

@pombredanne
Member

@robertguetzkow re:

> I'm not quite sure what the repository_url is that you are referencing, so I'll try to describe it.

This is something I specified in the purl spec at https://github.com/package-url/purl-spec/blob/65d98c4a6f8520b2fe006dff932e1508fc1b8317/PURL-SPECIFICATION.rst#L426 :

  • repository_url is an extra URL for an alternative, non-default package
    repository or registry. When a package does not come from the default public
    package repository for its type a purl may be qualified with this extra
    URL. The default repository or registry of a type is documented in the
    "Known purl types" section.

@pombredanne
Member

The default "repository_url" is the URL to maven central for maven, pypi.org for Python and so on.

So to recap:

  • We would define a configurable alternative repository_url to override the default repository_url
  • This would go from the most generic to the most specific (the latter would win if present):
    • a global dataspace-wide attribute,
    • or for a specific product,
    • or for a specific package,
    • (or maybe for a specific package in a product?)
  • This would need to be stored in a model with a (package type, repository url)
  • For now, we would leave aside issues of auth/IAM for these, although they are likely access controlled.
  • When resolving the actual URLs for a PURL, we should update purl2url to accept this alternative repository_url as an argument. For instance, this https://github.com/package-url/packageurl-python/blob/6f38e3e2f90be99d634cf72d0c0755c8c9e62084/src/packageurl/contrib/purl2url.py#L213 would become a function arg with a default.
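
A minimal sketch of the last point, using npm as the example type. The function below is hypothetical and is not the actual purl2url code; it only shows the registry base URL turned into an argument with a default. It also assumes the alternative registry uses the same path layout as registry.npmjs.org, which, as noted earlier in this thread, may not hold for every artifact registry service.

```python
from typing import Optional

NPM_DEFAULT_REPOSITORY_URL = "https://registry.npmjs.org"


def build_npm_download_url(
    name: str,
    version: str,
    namespace: Optional[str] = None,
    repository_url: str = NPM_DEFAULT_REPOSITORY_URL,
) -> str:
    """Build an npm tarball URL against a configurable registry base URL."""
    base = repository_url.rstrip("/")
    # Scoped packages carry the namespace (e.g. "@babel") in the path only,
    # not in the tarball file name.
    full_name = f"{namespace}/{name}" if namespace else name
    return f"{base}/{full_name}/-/{name}-{version}.tgz"


# Default registry.
print(build_npm_download_url("left-pad", "1.3.0"))
# Hypothetical internal proxy registry.
print(
    build_npm_download_url(
        "core",
        "7.0.0",
        namespace="@babel",
        repository_url="https://nexus.example.com/repository/npm-proxy/",
    )
)
```

Callers that pass nothing keep the current behavior, so the change would be backward compatible.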

@pombredanne
Member

> I would also suggest using resolved URLs as the basis when they are already included in the SBOM and identifiable, instead of relying on purl2url which might not work for every package manager.

AFAIK, we always use the provided download URL when you create a package. Have you seen something different? Though we may need to tighten this for SBOMs.

@robertguetzkow
Author

> @robertguetzkow re:
>
> > I'm not quite sure what the repository_url is that you are referencing, so I'll try to describe it.
>
> This is something I specified in the purl spec at https://github.com/package-url/purl-spec/blob/65d98c4a6f8520b2fe006dff932e1508fc1b8317/PURL-SPECIFICATION.rst#L426 :
>
>   • repository_url is an extra URL for an alternative, non-default package
>     repository or registry. When a package does not come from the default public
>     package repository for its type a purl may be qualified with this extra
>     URL. The default repository or registry of a type is documented in the
>     "Known purl types" section.

Good find! That sounds absolutely right. Though I have to check, I can't recall seeing an SBOM that has used that yet. Perhaps the tools I have previously used could not handle that case. At least for the export of SBOMs from DejaCode that seems to be the right way to declare it.

@robertguetzkow
Author

> The default "repository_url" is the URL to maven central for maven, pypi.org for Python and so on.
>
> So to recap:
>
> * We would define a configurable alternative repository_url to override the default repository_url
> * This would go from the most generic to the most specific (the latter would win if present):
>   * a global dataspace-wide attribute,
>   * or for a specific product,
>   * or for a specific package,
>   * (or maybe for a specific package in a product?)
> * This would need to be stored in a model with a (package type, repository url)
> * For now, we would leave aside issues of auth/IAM for these, although they are likely access controlled.
> * When resolving the actual URLs for a PURL, we should update purl2url to accept this alternative repository_url as an argument. For instance, this https://github.com/package-url/packageurl-python/blob/6f38e3e2f90be99d634cf72d0c0755c8c9e62084/src/packageurl/contrib/purl2url.py#L213 would become a function arg with a default.

Yes, that is spot on. Auth/IAM is a good point, it's needed for anything that doesn't allow unauthenticated access. Perhaps for the second iteration of the feature.

@robertguetzkow
Author

> > I would also suggest using resolved URLs as the basis when they are already included in the SBOM and identifiable, instead of relying on purl2url which might not work for every package manager.
>
> AFAIK, we always use the provided download URL when you create a package. Have you seen something different? Though we may need to tighten this for SBOMs.

I've mostly been going the route of importing SBOMs, which triggers the load_sbom pipeline in ScanCode.io. I will have to check the CycloneDX specification and example SBOMs, but the last time I checked, these download URLs were additional information that was not necessarily in a standardized structure or key/value pair. This could make the implementation challenging. See the following for example: #121 (comment)


2 participants