Should malware/spam/etc which gets removed from pypi generate advisories here? #45

westonsteimel · 2021-12-04T21:17:27Z

Following from some discussion in pypi/warehouse#4703, do we think that packages removed from PyPI due to being classified as malware, etc should cause advisories to be generated here?

westonsteimel · 2021-12-04T21:22:45Z

Perhaps it should pass some sort of threshold for number of downloads or something to make it more worthwhile?

di · 2021-12-04T22:37:43Z

@oliverchang I'm curious if there is precedent for this in other advisory dbs.

westonsteimel · 2021-12-04T22:41:58Z

I do think npm at least publishes advisories for malicious packages (I think theirs is entirely via GitHub's advisories now if I remember correctly?) https://github.com/advisories?page=5&query=malicious+ecosystem%3Anpm

oliverchang · 2021-12-06T04:54:58Z

I think it's fair game to include these, and the reporting can re-use the existing infrastructure / tooling (i.e. pip-audit).

As @westonsteimel mentioned, other vuln DBs like GHSA also track these.

westonsteimel · 2021-12-06T07:41:35Z

@oliverchang, I did attempt to run the existing analysis on one of these (I think one of the ones from this JFrog article), and it causes some failures because these packages have been removed from PyPI, so extracting versions from the PyPI project JSON fails. I did notice that we appear to have the version info in the input pypi_versions.json from the BigQuery query, though.

di · 2021-12-06T14:11:34Z

Do we need some mechanism for "this entire project is malicious, regardless of version"?

darakian · 2022-12-13T21:40:06Z

Hey all 👋

Just want to chime in on this in hopes of reviving the conversation and to share some of GitHub's thinking. npm does indeed make a point of publishing advisories for malware packages. Those packages are also pulled, and the namespace for the package is forevermore dead. Pulling the packages prevents future exploitation, of course, and the alert informs users who have already downloaded the package, in order to minimize an attacker's window of opportunity.

So to that end I think it would 100% be valuable to publish malware takedown advisories.

> Do we need some mechanism for "this entire project is malicious, regardless of version"?

It might be nice, but this can be achieved with uncapped version ranges in any advisory, e.g. >= 0.0.0. I know differing version schemes might make that hard to get perfect, but it's a start.
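To illustrate the uncapped-range idea, here is a minimal sketch of what such a record might look like, assuming the OSV format. In OSV, a range with an `introduced` event of `0` and no `fixed` event covers every version. The ID, package name, and helper function are hypothetical, and the record omits other fields a real entry would need (such as `modified`).

```python
import json


def whole_project_advisory(advisory_id: str, package: str, summary: str) -> dict:
    """Build a partial OSV-style record that flags every version of a package."""
    return {
        "id": advisory_id,  # hypothetical identifier, not a real advisory
        "summary": summary,
        "affected": [
            {
                "package": {"ecosystem": "PyPI", "name": package},
                "ranges": [
                    {
                        "type": "ECOSYSTEM",
                        # "introduced": "0" with no "fixed" event leaves the range uncapped.
                        "events": [{"introduced": "0"}],
                    }
                ],
            }
        ],
    }


record = whole_project_advisory(
    "EXAMPLE-0000-0",
    "example-malicious-package",
    "Malicious package removed from PyPI",
)
print(json.dumps(record, indent=2))
```

Because the range never closes, such a record would also match any later, legitimate release under the same name, which is the trade-off discussed in the rest of this thread.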

Something else I would like to ask, which may be more controversial, is that in the event a package is taken down, the namespace for that package also be taken down/reserved/made never usable again. The rationale here is that these advisories then need not be invalidated over time as new users re-use package names.

di · 2022-12-13T22:39:13Z

Hey, thanks for that insight!

> Something else I would like to ask which may be more controversial is that in the event a package is taken the namespace for that package also be taken down/reserved/be made never usable again. The rationale here is so that these advisories need not be invalidated over time as new users re-use package names.

As great as this sounds I think it's not possible in practice for an ecosystem like Python that (currently) only has a single global namespace and has an open registration system for projects.

A very common occurrence is that a legitimate maintainer publishes a source repo, and before they get a chance to publish this to PyPI, an attacker beats them to it and publishes a malicious version of that project name.

We'd want to take down the project and inform people that a specific release was malicious, but we wouldn't want to block the legitimate maintainers from eventually publishing that name on PyPI.

darakian · 2022-12-14T00:13:46Z

Totally fair. Certainly for one-off malware I would not suggest burning the namespace, but maybe the specific version. Anyway, advisories on malware are very much a 👍 from me 😄

joshbressers · 2022-12-15T15:18:43Z

It's probably worth framing this in the context of this work
https://checkmarx.com/blog/how-140k-nuget-npm-and-pypi-packages-were-used-to-spread-phishing-links/

144K packages is a pretty wild number. The list exists though
https://gist.github.com/jossef/1c1152368ff6210340644f44afec7e8e

Looks like there are 7824 PyPI packages.

I think in the case of malicious packages, there's not much to say other than "don't use this". It differs from a traditional security advisory that tends to need some additional details for the people on the receiving end.

I think for something like GitHub or OSV, it should be trivial to create the data for this based off something as simple as a CSV file. One doesn't need a lot of details, just a list of bad versions.
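As a rough sketch of the CSV idea, the snippet below groups a list of bad releases by package and emits one OSV-style `affected` entry per package. The CSV column names, the sample rows, and the `records_from_csv` helper are all assumptions made for illustration; the real list linked above may be laid out differently.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input: one "package,version" row per known-bad release.
SAMPLE_CSV = """package,version
example-bad-pkg,0.1.0
example-bad-pkg,0.1.1
another-bad-pkg,1.0.0
"""


def records_from_csv(text: str) -> list:
    """Group bad versions by package and emit one OSV-style entry per package."""
    versions = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        versions[row["package"].strip()].append(row["version"].strip())

    return [
        {
            "summary": "Malicious package: do not use",
            "affected": [
                {
                    "package": {"ecosystem": "PyPI", "name": name},
                    # Explicit list of bad versions, nothing else is flagged.
                    "versions": bad_versions,
                }
            ],
        }
        for name, bad_versions in sorted(versions.items())
    ]


records = records_from_csv(SAMPLE_CSV)
for record in records:
    affected = record["affected"][0]
    print(affected["package"]["name"], affected["versions"])
```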

kaplanlior · 2022-12-18T09:44:30Z

> A very common occurrence is that a legitimate maintainer publishes a source repo, and before they get a chance to publish this to PyPI, an attacker beats them to it and publishes a malicious version of that project name.
>
> We'd want to take down the project and inform people that a specific release was malicious, but we wouldn't want to block the legitimate maintainers from eventually publishing that name on PyPI.

This calls for authenticating package sources and their Git references, so that only the package maintainer can use their own repo (an idea by @jossef). Regarding naming, that's of course a wider discussion.

darakian · 2023-01-12T18:29:42Z

I suspect package to repo authentication is out of scope for this topic. I'd love to see it even if opt-in only, but for the moment it's not in place and malware advisories would be useful with or without it.

@di let me know if there's anything I can do from the GitHub side to help 👍

kurtseifried · 2023-01-17T18:12:46Z

One comment re the info, a package might be malware, or specific versions might have malware (but older/newer versions are fine). So unless the package is removed, some source of data saying which ones are safe/unsafe to use is advisable. It doesn't necessarily have to live in the pypa database though.

teruokun · 2023-05-23T16:28:33Z

So for now, I think the key bit here is having a way for PyPI to communicate the list of packages you shouldn't be getting from PyPI. While I think that eventually it likely should be part of an API definition (perhaps as an alternative 'list packages' interface), given that PyPI and its mirrors are often used as the root, it makes sense to communicate at least what you shouldn't expect to be getting from PyPI, and to serve as an advisory so that any metadata mirrors at least have a stance on how they should react when packages are added to or removed from the pypi-removed list.

I'd like to propose an idea here for a format, though I can understand if something more structured like JSON might be preferable: a pip requirements.txt format of the block list, potentially zipped. This would be almost trivial to construct from the existing warehouse table and would be worthwhile for most consumers as an easy way to scan their existing environment for the packages using existing tools (i.e. any existing prohibited requirements). If transparency of reasoning is also worthwhile, it could include the reasons as comments. In terms of later flexibility, it does also benefit from easily adding version specifications or other requirement qualifiers, though it need not do so at the outset. It also doesn't necessarily block a more structured file with more details later on, if it's desired, but does make an easy integration point for pip users.

So the process would be, in either a scheduled and/or on-change basis, a job would generate a new blocklist file from the existing DB table and if different from the existing file, make the pull request to the advisory database to update the file.
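A minimal sketch of how a consumer might scan an environment against a requirements-style block list like the one proposed here. The file contents, the helper names, and the choice to support only bare names and `==` pins are assumptions for illustration; versions are compared as plain strings, so `1.0` and `1.0.0` would not match each other.

```python
import re
from importlib import metadata

# Hypothetical block list in a requirements.txt-like format; reasons are comments.
SAMPLE_BLOCKLIST = """\
# removed from PyPI
example-bad-pkg            # malware, all versions
Another_Bad.Pkg==1.0.0     # malware, single release
"""


def normalize(name: str) -> str:
    """Normalize a project name the way PEP 503 describes."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_blocklist(text: str) -> dict:
    """Map each blocked name to a set of pinned versions, or None for any version."""
    blocked = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        name, _, version = line.partition("==")
        key = normalize(name.strip())
        if not version:
            blocked[key] = None  # a bare name blocks every version
        elif key not in blocked:
            blocked[key] = {version.strip()}
        elif blocked[key] is not None:
            blocked[key].add(version.strip())
    return blocked


def find_blocked(installed: dict, blocked: dict) -> list:
    """Return sorted (name, version) pairs from `installed` that appear on the block list."""
    hits = []
    for name, version in installed.items():
        key = normalize(name)
        if key in blocked and (blocked[key] is None or version in blocked[key]):
            hits.append((name, version))
    return sorted(hits)


def installed_packages() -> dict:
    """Collect name -> version for distributions visible to this interpreter."""
    found = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:
            found[name] = dist.version
    return found


blocked = parse_blocklist(SAMPLE_BLOCKLIST)
print(find_blocked(installed_packages(), blocked))
```

The scheduled job described above would sit on the producing side: regenerate the file, compare it with the committed copy, and open a pull request only when the two differ.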

What do others think? I'd be willing to spike a little work on it if this seems like a reasonable approximation.

sethmlarson · 2023-07-10T16:00:48Z

Reviving this topic a bit, I think it makes sense to have packages which have been removed from PyPI listed in this database. My thinking is:

- Package names that have been removed aren't removed from the "name pool" forever. People can request previously removed package names via PEP 541. This will help mirrors/consumers have more confidence about which packages are malicious versus safe to use and install, knowing that names can be reused.
- CVEs aren't emitted for malicious packages and it's unlikely other providers of vulnerability IDs would do so for PyPI. A PYSEC record is likely the only place this information would be published.

I'm unsure how automated this process could be, but a PYSEC that contains the removed package name, versions (assuming that versions can't be re-used post-deletion, please check this assumption), and hashes of the files I think would be enough for pip-audit to detect a malicious package?
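To make the hash-based check concrete, here is a small sketch that compares a local file's SHA-256 digest against a set of known-bad digests. This is not how pip-audit works today; the function names, the file name, and the sample bytes are hypothetical, and a real record would supply the digests.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large artifacts do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_known_malicious(path: Path, bad_hashes: set) -> bool:
    """True when the file's digest appears in the advisory's list of hashes."""
    return sha256_of(path) in bad_hashes


# Demo with stand-in bytes instead of a real distribution file.
payload = b"pretend this is a malicious sdist"
bad_hashes = {hashlib.sha256(payload).hexdigest()}

with tempfile.TemporaryDirectory() as tmp:
    bad_file = Path(tmp) / "example_bad_pkg-0.1.0.tar.gz"
    bad_file.write_bytes(payload)
    clean_file = Path(tmp) / "example_ok_pkg-1.0.0.tar.gz"
    clean_file.write_bytes(b"some other contents")

    flagged = is_known_malicious(bad_file, bad_hashes)
    clean_flagged = is_known_malicious(clean_file, bad_hashes)

print(flagged, clean_flagged)
```

Matching on file hashes rather than only on name and version would keep a record valid even if the name is later reused by a legitimate project.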

kurtseifried · 2023-07-10T16:58:13Z

> CVEs aren't emitted for malicious packages and it's unlikely other providers of vulnerability IDs would do so for PyPI. Having a PYSEC record for this information is likely the only provider that would do so.

I can unequivocally say that the GSD project (https://gsd.id) would like to do IDs for these issues, especially as we can properly tag them. Currently they would get tagged as "concern":

{
  "gsd": {
    "metadata": {
      "type": "concern",

But we can definitely look at adding a "malware" category (I suspect there are enough of these across multiple ecosystems to make it worth doing).

We are also happy to support automation in order to get you GSD IDs quickly and easily, like we do for the Linux kernel already (several thousand per year).

di · 2025-02-28T16:48:43Z

I met with @ewdurbin and @miketheman to figure out what we need to do to unblock this. Questions we discussed & came to an agreement on:

Should we declare the entire project as malware or specific versions?

- Specific versions: sometimes projects are taken down and later the name is released as a legitimate project
- Listing specific versions is cumbersome if reports are made manually, but if these are automated it would be possible.

Should this include spam, namesquatting, etc?

- Yes, but it should be distinct from vulnerabilities
- Different identifiers? PYMAL-, PYSPAM-, PYSQUAT? These are mutually exclusive.

How should the flow of information work?

- Create a task/pipeline on PyPI for writing takedowns as advisories to the advisory-db
- OSV will pick up on these and include it in its dataset
- For now: OSV will (may?) report a vulnerability to PyPI for the advisory
- For now: If the project has been deleted, PyPI won't currently serve any information about it via the /json endpoint

What should we do for old prohibited project names around malware/typosquatting?

- Backfill observations for prohibited names (same timestamp)
  - actions: include the original timestamp of the prohibition
  - additional: explanation that this is a backfill
  - payload: nothing for now

(cc @woodruffw and @sethmlarson for thoughts)

woodruffw · 2025-02-28T17:07:38Z

> Should we declare the entire project as malware or specific versions?

I agree with the reasoning below about having it be specific versions! I think it would be non-ideal if someone could permanently taint/poison parts of the project's namespace by releasing only a single version (or fixed set of versions).

> Different identifiers? PYMAL-, PYSPAM-, PYSQUAT? These are mutually exclusive.

These make sense to me, although out of curiosity: is there a way you're expecting downstreams to consume PYSQUAT- entries? They seem less immediately useful than PYMAL- and PYSPAM-, although I suppose they could be useful for behavioral analysis/monitoring a squatting campaign 🙂

(So maybe I answered my own question!)

> How should the flow of information work?

The flow below makes sense! To the point about deleted projects: perhaps it would be useful (longer term) to have dedicated advisory API endpoints? I can imagine downstreams being interested in historical advisories even if the files are no longer available, e.g. as part of incident response.

teruokun · 2025-02-28T17:28:30Z

In terms of history and a record of known-bad packages, this would be really useful going forward, especially for downstream metadata mirrors. Because of this kind of post-publishing curation, one strategy is to compare a current view of the repository to an older snapshot, using only the intersection of artifacts available in both. If there were a way to simply request the advisories (especially with a net-new view, similar to using the PyPI serial), it’d be a great feature for mirrors. As for a dedicated API, it’s definitely nice to have it available under the same banner as the repository from a certificates/origin standpoint (i.e. not having to check 2 sources), but it’s also not that big of a deal IMHO.
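The snapshot-intersection strategy can be sketched with plain sets. The artifact names and function names below are made up for illustration; a real mirror would work from its own index data.

```python
def safe_artifacts(older_snapshot: set, current_view: set) -> set:
    """Keep only artifacts present in both the older snapshot and the current view."""
    return older_snapshot & current_view


def removed_since_snapshot(older_snapshot: set, current_view: set) -> set:
    """Artifacts that disappeared upstream, e.g. because they were taken down."""
    return older_snapshot - current_view


older = {"good-1.0.tar.gz", "removed-malware-0.1.tar.gz"}
current = {"good-1.0.tar.gz", "brand-new-2.0.tar.gz"}

print(sorted(safe_artifacts(older, current)))
print(sorted(removed_since_snapshot(older, current)))
```

The intersection drops both files removed upstream since the snapshot and files too new to have been through any curation. A feed of advisories would let a mirror explain the removed set directly rather than infer it from the difference.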

di · 2025-02-28T17:29:10Z

> is there a way you're expecting downstreams to consume PYSQUAT- entries?

Not particularly, I think it's just worthwhile for us to distinguish between something that we've determined to be actively harmful vs. just a name squat that is otherwise innocuous.

> To the point about deleted projects: perhaps it would be useful (longer term) to have dedicated advisory API endpoints? I can imagine downstreams being interested in historical advisories even if the files are no longer available, e.g. as part of incident response.

Punting on this for now, since this issue is specifically about making the advisories available here, but I think this is probably worth thinking through more on the PyPI side once this is available.

sethmlarson · 2025-03-03T17:48:08Z

> Should we declare the entire project as malware or specific versions?

Specific versions for all the reasons mentioned.

> Should this include spam, namesquatting, etc?

These feel closer to the request for information on "deleted projects" instead of actually malicious code? The records for vulnerabilities are to signal to users they need to take action, the same is true for malware, but I don't think the same is true for spam and namesquatting? Do we want to include these two categories alongside malware?

> Different identifiers? PYMAL-, PYSPAM-, PYSQUAT? These are mutually exclusive.

PYMAL makes sense to me for malware! We'll have to duplicate a few malware entries in PYSEC- to be under PYMAL, too.

> How should the flow of information work?

That design seems good IMO

> Should we backfill?

If we have this information on-hand we should definitely backfill with old information, especially if the package download URLs are still available. This data is invaluable for researchers.

di · 2025-03-03T19:55:50Z

> The records for vulnerabilities are to signal to users they need to take action, the same is true for malware, but I don't think the same is true for spam and namesquatting?

I think this is a good point. We can move forward with just malware (PYMAL-) for now and revisit the rest later if there is demand.

di mentioned this issue Mar 7, 2025

Store observation kind when prohibiting project names pypi/warehouse#17731

Merged
