Python bindings: deprecate nonstandard inputs to bencode #5991

AllSeeingEyeTolledEweSew · 2021-02-18T02:33:09Z

bencode() currently accepts any type as input, including object.

bdecode() only outputs dict, list, int or bytes. This means there are lots of values that aren't stable round-trip, that is v != bdecode(bencode(v)).

It's not clear this support is helpful for any use case, and it can hide mistakes.

bencode() should only accept combinations of dict, list, int or bytes. Any others should fire DeprecationWarning, and should eventually raise TypeError at some later time.

The text was updated successfully, but these errors were encountered:

AllSeeingEyeTolledEweSew · 2021-02-18T05:43:26Z

I only just realized that tuples of ints (100, 100, 100) are treated as preformatted inputs to bencode(). I guess round-trip stability is intentionally bypassed here. I think tuples-of-ints is a pretty unfriendly way to present this in python; I'll have to think more about the best approach here.

In the meantime at least fleat and object should be deprecated.

arvidn · 2021-04-10T13:09:15Z

you don't think it would be reasonable to accept string as well, and encode it as UTF-8 into bytes?

arvidn · 2021-04-10T13:35:33Z

how does this look? #6125

arvidn · 2021-07-24T17:08:17Z

@AllSeeingEyeTolledEweSew that patch, #6125, would you consider that to address this issue?

Also, which one of your PRs do you think makes the most sense to land next?

AllSeeingEyeTolledEweSew · 2021-07-24T19:50:36Z

I think str should be deprecated. I could probably do the patch. Sorry, I meant to do more patches to fix all the bugs I opened but I got momentum doing my own project.

#6188 is definitely my vote for what to do next.

arvidn · 2021-07-25T15:29:08Z

I still don't understand the rationale for deprecating str. In a way it simplifies usage, since the caller doesn't need to know that strings are supposed to be UTF-8 encoded.

On the other hand, accepting strings makes the API non-symmetric. The string won't round-trip.

What is the reason to deprecate str in your mind?

AllSeeingEyeTolledEweSew · 2021-07-25T18:41:37Z

I think supporting str is an impedence mismatch with python 3.

Python 3 considers that string encodings should be explicit. Typically using a str input in the wrong place raises an explicit error, rather than implicitly encode the string. A user accustomed to this might expect that if they accidentally use a str somewhere in the input to bencode, that it should also raise an error. If we helpfully implicitly encode, the user could get themselves in trouble later.

If we were still in the python 2 world, I would want to support str inputs the way we do now, to match the environment.

arvidn · 2021-07-25T19:45:56Z

but bencoding does support strings, and they are supposed to be encoded as utf-8 (by the spec). In a way, one could argue that leaving the string encoding to the user opens up the possibility to generate invalid bencoded structures. In fact this used to be a problem in the early days, where people would encode filenames with arbitrary, local, code pages.

I think it's a stretch to say a str in a bencoded structure is "the wrong place". But I appreciate that it might be a better match to have bencoding and bdecoding the at the same abstraction level.

AllSeeingEyeTolledEweSew · 2021-07-30T00:35:02Z

I believe that accidental or ill-considered use of str is more common than intentional use of str to mean "encode this as utf-8", as you describe. I don't have data for this assertion, but I do know that considerations like this came up a lot in python 2, and many people smarter than me decided the best course was to make encoding explicit in python 3.

Also, remember that not all strs encode to utf-8 cleanly. Invalid surrogate pairs exist, and this is used as a "feature" for pathname handling. An app author might use a str for a pathname passed to bencode(). But if the str came from the filesystem it will have been treated with the filesystem encoding, and it may not encode to utf-8. There are examples of this for both windows and linux. The proper handling is something like os.fsencode(str_path).decode("utf-8", "replace"). If we allow str inputs and the author only tests with ascii filenames, then the failure will only be found by an unlucky user.

Lastly, as you mention, I think it's best to have bencode() and bdecode() match each other to minimize surprise.

So I still vote to deprecate str.

arvidn added this to the 1.2.14 milestone Apr 10, 2021

arvidn modified the milestones: 1.2.14, 1.2.15 Jun 7, 2021

AllSeeingEyeTolledEweSew mentioned this issue Jul 24, 2021

Deprecate bencode('str') #6334

Merged

arvidn modified the milestones: 1.2.15, 1.2.16 Dec 27, 2021

arvidn modified the milestones: 1.2.16, 1.2.17 Apr 17, 2022

arvidn modified the milestones: 1.2.17, 1.2.19 Apr 10, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Python bindings: deprecate nonstandard inputs to bencode #5991

Python bindings: deprecate nonstandard inputs to bencode #5991

AllSeeingEyeTolledEweSew commented Feb 18, 2021

AllSeeingEyeTolledEweSew commented Feb 18, 2021

arvidn commented Apr 10, 2021

arvidn commented Apr 10, 2021

arvidn commented Jul 24, 2021

AllSeeingEyeTolledEweSew commented Jul 24, 2021

arvidn commented Jul 25, 2021

AllSeeingEyeTolledEweSew commented Jul 25, 2021

arvidn commented Jul 25, 2021

AllSeeingEyeTolledEweSew commented Jul 30, 2021

Python bindings: deprecate nonstandard inputs to bencode #5991

Python bindings: deprecate nonstandard inputs to bencode #5991

Comments

AllSeeingEyeTolledEweSew commented Feb 18, 2021

AllSeeingEyeTolledEweSew commented Feb 18, 2021

arvidn commented Apr 10, 2021

arvidn commented Apr 10, 2021

arvidn commented Jul 24, 2021

AllSeeingEyeTolledEweSew commented Jul 24, 2021

arvidn commented Jul 25, 2021

AllSeeingEyeTolledEweSew commented Jul 25, 2021

arvidn commented Jul 25, 2021

AllSeeingEyeTolledEweSew commented Jul 30, 2021