Should malware/spam/etc which gets removed from pypi generate advisories here? #45
Comments
Perhaps it should pass some sort of threshold for number of downloads or something to make it more worthwhile? |
@oliverchang I'm curious if there is precedent for this in other advisory dbs. |
I do think npm at least publishes advisories for malicious packages (I think theirs is entirely via GitHub's advisories now if I remember correctly?) https://github.com/advisories?page=5&query=malicious+ecosystem%3Anpm |
I think it's fair game to include these, and the reporting can re-use the existing infrastructure / tooling (i.e. pip-audit). As @westonsteimel mentioned, other vuln DBs like GHSA also track these. |
@oliverchang, I did attempt using the existing analysis on one of these (I think one of the ones from this jfrog article), and it does cause some failures because, of course, these packages have been removed from pypi, so when it attempts to extract versions from pypi project JSON it fails. I did notice we do appear to have the version info in the input pypi_versions.json from the BigQuery query though. |
Do we need some mechanism for "this entire project is malicious, regardless of version"? |
Hey all 👋 Just want to chime in on this in hopes of reviving the conversation and to share some of github's thinking. npm does indeed make a point of publishing advisories for malware packages. Those packages are also pulled and the namespace for the package is forever more dead. Pulling the packages prevents future exploitation of course and the alert is to inform users who have already downloaded the package in order to minimize an attacker's window of opportunity. So to that end I think it would 100% be valuable to publish malware takedown advisories.
It might be nice, but this can be achieved with uncapped version ranges in any advisory. eg. Something else I would like to ask, which may be more controversial, is that in the event a package is taken down, the namespace for that package also be taken down/reserved/made never usable again. The rationale here is that these advisories need not be invalidated over time as new users re-use package names. |
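As an illustration of the "uncapped version range" idea, here is a minimal sketch using the OSV schema's `affected[].ranges[].events` layout, where a range that is introduced at `"0"` and has no `fixed` or `last_affected` event covers every version. The advisory ID and package name are hypothetical placeholders, not real entries.

```python
def make_uncapped_advisory(advisory_id, package):
    """Build an OSV-style record whose range covers every version."""
    return {
        "id": advisory_id,  # hypothetical ID, for illustration only
        "affected": [{
            "package": {"ecosystem": "PyPI", "name": package},
            # Introduced at "0" and never fixed: all versions are affected.
            "ranges": [{"type": "ECOSYSTEM", "events": [{"introduced": "0"}]}],
        }],
    }


def is_uncapped(advisory):
    """True if any range starts at "0" and has no closing event."""
    for affected in advisory.get("affected", []):
        for rng in affected.get("ranges", []):
            events = rng.get("events", [])
            starts_at_zero = any(e.get("introduced") == "0" for e in events)
            closed = any("fixed" in e or "last_affected" in e for e in events)
            if starts_at_zero and not closed:
                return True
    return False


whole_project = make_uncapped_advisory("EXAMPLE-0001", "example-bad")
```

Note the trade-off raised later in the thread: an uncapped range also flags any future release under the same name, which is why per-version advisories may be preferable on a registry with a single global namespace.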
Hey, thanks for that insight!
As great as this sounds I think it's not possible in practice for an ecosystem like Python that (currently) only has a single global namespace and has an open registration system for projects. A very common occurrence is that a legitimate maintainer publishes a source repo, and before they get a chance to publish this to PyPI, an attacker beats them to it and publishes a malicious version of that project name. We'd want to take down the project and inform people that a specific release was malicious, but we wouldn't want to block the legitimate maintainers from eventually publishing that name on PyPI. |
Totally fair. Certainly for one off malware I would not suggest burning the namespace, but maybe the specific version. Anyway, advisories on malware is very much a 👍 from me 😄 |
It's probably worth framing this in the context of this work: 144K packages is a pretty wild number. The list exists, though; it looks like there are 7824 PyPI packages. I think in the case of malicious packages, there's not much to say other than "don't use this". It differs from a traditional security advisory, which tends to need some additional details for the people on the receiving end. I think for something like GitHub or OSV, it should be trivial to create the data for this based off something as simple as a CSV file. One doesn't need a lot of details, just a list of bad versions. |
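To make the "CSV file of bad versions" idea concrete, here is a small sketch that groups rows into OSV-style records. The column names `package` and `version`, the sample rows, and the summary wording are all assumptions for illustration; the thread does not define a CSV layout.

```python
import csv
import io
from collections import defaultdict


def advisories_from_csv(csv_text):
    """Group (package, version) rows into one OSV-style record per package."""
    versions_by_package = defaultdict(set)
    for row in csv.DictReader(io.StringIO(csv_text)):
        versions_by_package[row["package"].strip()].add(row["version"].strip())
    return [
        {
            "summary": f"Malicious package: {name}",
            "affected": [{
                "package": {"ecosystem": "PyPI", "name": name},
                # Plain string sort, used only for deterministic output.
                "versions": sorted(versions),
            }],
        }
        for name, versions in sorted(versions_by_package.items())
    ]


# Hypothetical input; these are not real package names.
SAMPLE = "package,version\nexample-bad,0.0.2\nexample-bad,0.0.1\nother-bad,1.0\n"
records = advisories_from_csv(SAMPLE)
```

An explicit `versions` list is used rather than a range, which matches the point that only a list of bad versions is needed.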
Which should authenticate these packages' sources and their Git references so only the package maintainer could use their own repo (an idea by @jossef). Regarding naming - that's of course a wider discussion. |
I suspect package to repo authentication is out of scope for this topic. I'd love to see it even if opt-in only, but for the moment it's not in place and malware advisories would be useful with or without it. @di let me know if there's anything I can do from the github side to help 👍 |
One comment re the info, a package might be malware, or specific versions might have malware (but older/newer versions are fine). So unless the package is removed, some source of data saying which ones are safe/unsafe to use is advisable. It doesn't necessarily have to live in the pypa database though. |
So for now, I think the key bit here is having a way for pypi to communicate the list of packages you shouldn't be getting from Pypi. While I think that eventually it likely should be a part of an API definition (perhaps as an alternative 'list packages' interface), given that Pypi and its mirrors are used often as the root, it makes sense to communicate at least what you shouldn't expect to be getting from Pypi, and as an advisory for any metadata mirrors to at least have a stance on how they should react when packages are added to or removed from the pypi-removed list. I'd like to propose an idea here for a format, though I can understand if something more structured like json might be preferable: a So the process would be, on a scheduled and/or on-change basis, a job would generate a new blocklist file from the existing DB table and, if different from the existing file, make the pull request to the advisory database to update the file. What do others think? I'd be willing to spike a little work on it if this seems like a reasonable approximation. |
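The "regenerate, compare, and only then open a pull request" step described above could be sketched like this. The proposed file format did not survive in the text, so this assumes one normalized project name per line; the function names are hypothetical, and actually opening the pull request is left out.

```python
from pathlib import Path


def render_blocklist(removed_names):
    """Render names as sorted, de-duplicated lines (assumed format)."""
    names = sorted({n.strip().lower() for n in removed_names if n.strip()})
    return "".join(f"{n}\n" for n in names)


def update_blocklist(path, removed_names):
    """Rewrite the file only when its content changes.

    Returns True when the file changed, i.e. when a pull request against
    the advisory database would be needed.
    """
    new_text = render_blocklist(removed_names)
    target = Path(path)
    old_text = target.read_text(encoding="utf-8") if target.exists() else None
    if old_text == new_text:
        return False
    target.write_text(new_text, encoding="utf-8")
    return True
```

Sorting and de-duplicating before comparison keeps the output stable, so a change in database row order alone never triggers a pull request.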
Reviving this topic a bit, I think it makes sense to have packages which have been removed from PyPI listed in this database. My thinking is:
I'm unsure how automated this process could be, but a PYSEC that contains the removed package name, versions (assuming that versions can't be re-used post-deletion, please check this assumption), and hashes of the files I think would be enough for pip-audit to detect a malicious package? |
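A rough sketch of how a record with a name, versions, and file hashes could be matched against an installed distribution. This is not pip-audit's actual implementation; the flat record layout (`name`, `versions`, `sha256`) and the function name are assumptions made for the example.

```python
import hashlib


def is_flagged(name, version, file_bytes, advisory):
    """Flag a distribution that matches by release or by file hash.

    advisory: assumed shape
        {"name": str, "versions": [str, ...], "sha256": [hex digest, ...]}
    """
    same_release = name == advisory["name"] and version in advisory["versions"]
    # A hash match catches the same file even if it was republished
    # under a different name or version.
    same_file = hashlib.sha256(file_bytes).hexdigest() in advisory["sha256"]
    return same_release or same_file


# Hypothetical advisory built from placeholder bytes, not a real package.
payload = b"malicious payload placeholder"
advisory = {
    "name": "example-bad",
    "versions": ["0.0.1"],
    "sha256": [hashlib.sha256(payload).hexdigest()],
}
```

Matching on hashes as well as versions also sidesteps the question of whether a version number could ever be re-used after deletion.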
I can unequivocally say that the GSD project (https://gsd.id) would like to do ID's for these issues, especially as we can properly tag them, currently they would get tagged as "concern": { But we can definitely look at adding a "malware" category (I suspect there are enough of these across multiple ecosystems to make it worth doing). We are also happy to support automation in order to get you GSD ID's quickly and easily, like we do for the Linux Kernel already (several thousand per year). |
I met with @ewdurbin and @miketheman to figure out what we need to do to unblock this. Questions we discussed & came to an agreement on:
Should we declare the entire project as malware or specific versions?
Should this include spam, namesquatting, etc?
How should the flow of information work?
What should we do for old prohibited project names around malware/typosquatting?
(cc @woodruffw and @sethmlarson for thoughts) |
I agree with the reasoning below about having it be specific versions! I think it would be non-ideal if someone could permanently taint/poison parts of the project's namespace by releasing only a single version (or fixed set of versions).
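These make sense to me, although out of curiosity: is there a way you're expecting downstreams to consume PYSQUAT- entries? They seem less immediately useful than PYMAL- and PYSPAM-, although I suppose they could be useful for behavioral analysis/monitoring a squatting campaign 🙂 (So maybe I answered my own question!)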
The flow below makes sense! To the point about deleted projects: perhaps it would be useful (longer term) to have dedicated advisory API endpoints? I can imagine downstreams being interested in historical advisories even if the files are no longer available, e.g. as part of incident response. |
In terms of history and record of known-bad, it would be really useful
going forwards especially for downstream metadata mirrors. Because of this
kind of post-publishing curation, one strategy is to compare a current view
of the repository to an older snapshot, using only the intersection of
artifacts available in both. If there was a way to simply request the
advisories (especially with a net-new similar to using the pypi serial),
it’d be a great feature for mirrors.
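The snapshot-comparison strategy described above can be sketched as a simple intersection. Representing a snapshot as a mapping from filename to SHA-256 digest, and additionally requiring the digests to match, are assumptions added for this example rather than something stated in the thread.

```python
def stable_artifacts(old_snapshot, current_snapshot):
    """Return files present in both snapshots with an unchanged digest.

    Each snapshot is assumed to be {filename: sha256_hex_digest}.
    Anything added or removed since the older snapshot is excluded.
    """
    return {
        filename: digest
        for filename, digest in current_snapshot.items()
        if old_snapshot.get(filename) == digest
    }


# Hypothetical snapshots with placeholder digests.
old = {"good-1.0.tar.gz": "aaa", "bad-0.1.tar.gz": "bbb"}
new = {"good-1.0.tar.gz": "aaa", "fresh-2.0.tar.gz": "ccc"}
trusted = stable_artifacts(old, new)
```

A mirror using this approach only serves files that survived the window between snapshots, which is why a feed of removals would let it react sooner.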
As for a dedicated API, it’s definitely nice to have it available under the
same banner as the repository from a certificates/origin standpoint (i.e.
not having to check 2 sources), but it’s also not that big of a deal IMHO.
…On Fri, Feb 28, 2025 at 9:08 AM William Woodruff ***@***.***> wrote:
Should we declare the entire project as malware or specific versions?
I agree with the reasoning below about having it be specific versions! I
think it would be non-ideal if someone could permanently taint/poison parts
of the projects namespace by releasing only a single version (or fixed set
of versions).
Different identifiers? PYMAL-, PYSPAM-, PYSQUAT? These are mutually
exclusive.
These make sense to me, although out of curiosity: is there a way you're
expecting downstreams to consume PYSQUAT- entries? They seem less
immediately useful than PYMAL- and PYSPAM-, although I suppose they could
be useful for behavioral analysis/monitoring a squatting campaign 🙂
(So maybe I answered my own question!)
How should the flow of information work?
The flow below makes sense! To the point about deleted projects: perhaps
it would be useful (longer term) to have dedicated advisory API endpoints?
I can imagine downstreams being interested in historical advisories even if
the files are no longer available, e.g. as part of incident response.
|
Not particularly, I think it's just worthwhile for us to distinguish between something that we've determined to be actively harmful vs. just a name squat that is otherwise innocuous.
Punting on this for now, since this issue is specifically about making the advisories available here, but I think this is probably worth thinking through more on the PyPI side once this is available. |
Specific versions for all the reasons mentioned.
These feel closer to the request for information on "deleted projects" instead of actually malicious code? The records for vulnerabilities are to signal to users they need to take action, the same is true for malware, but I don't think the same is true for spam and namesquatting? Do we want to include these two categories alongside malware?
PYMAL makes sense to me for malware! We'll have to duplicate a few malware entries in PYSEC- to be under PYMAL, too.
That design seems good IMO
If we have this information on-hand we should definitely backfill with old information, especially if the package download URLs are still available. This data is invaluable for researchers. |
I think this is a good point. We can move forward with just malware ( |
Following from some discussion in pypi/warehouse#4703, do we think that packages removed from PyPI due to being classified as malware, etc should cause advisories to be generated here?