Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python bindings: deprecate nonstandard inputs to bencode #5991

Open
AllSeeingEyeTolledEweSew opened this issue Feb 18, 2021 · 9 comments
Open
Milestone

Comments

@AllSeeingEyeTolledEweSew
Copy link
Contributor

bencode() currently accepts any type as input, including object.

bdecode() only outputs dict, list, int or bytes. This means there are lots of values that aren't stable round-trip, that is v != bdecode(bencode(v)).

It's not clear this support is helpful for any use case, and it can hide mistakes.

bencode() should only accept combinations of dict, list, int or bytes. Any others should fire DeprecationWarning, and should eventually raise TypeError at some later time.

@AllSeeingEyeTolledEweSew
Copy link
Contributor Author

I only just realized that tuples of ints (100, 100, 100) are treated as preformatted inputs to bencode(). I guess round-trip stability is intentionally bypassed here. I think tuples-of-ints is a pretty unfriendly way to present this in python; I'll have to think more about the best approach here.

In the meantime at least fleat and object should be deprecated.

@arvidn
Copy link
Owner

arvidn commented Apr 10, 2021

you don't think it would be reasonable to accept string as well, and encode it as UTF-8 into bytes?

@arvidn arvidn added this to the 1.2.14 milestone Apr 10, 2021
@arvidn
Copy link
Owner

arvidn commented Apr 10, 2021

how does this look? #6125

@arvidn arvidn modified the milestones: 1.2.14, 1.2.15 Jun 7, 2021
@arvidn
Copy link
Owner

arvidn commented Jul 24, 2021

@AllSeeingEyeTolledEweSew that patch, #6125, would you consider that to address this issue?

Also, which one of your PRs do you think makes the most sense to land next?

@AllSeeingEyeTolledEweSew
Copy link
Contributor Author

I think str should be deprecated. I could probably do the patch. Sorry, I meant to do more patches to fix all the bugs I opened but I got momentum doing my own project.

#6188 is definitely my vote for what to do next.

@arvidn
Copy link
Owner

arvidn commented Jul 25, 2021

I still don't understand the rationale for deprecating str. In a way it simplifies usage, since the caller doesn't need to know that strings are supposed to be UTF-8 encoded.

On the other hand, accepting strings makes the API non-symmetric. The string won't round-trip.

What is the reason to deprecate str in your mind?

@AllSeeingEyeTolledEweSew
Copy link
Contributor Author

I think supporting str is an impedence mismatch with python 3.

Python 3 considers that string encodings should be explicit. Typically using a str input in the wrong place raises an explicit error, rather than implicitly encode the string. A user accustomed to this might expect that if they accidentally use a str somewhere in the input to bencode, that it should also raise an error. If we helpfully implicitly encode, the user could get themselves in trouble later.

If we were still in the python 2 world, I would want to support str inputs the way we do now, to match the environment.

@arvidn
Copy link
Owner

arvidn commented Jul 25, 2021

but bencoding does support strings, and they are supposed to be encoded as utf-8 (by the spec). In a way, one could argue that leaving the string encoding to the user opens up the possibility to generate invalid bencoded structures. In fact this used to be a problem in the early days, where people would encode filenames with arbitrary, local, code pages.

I think it's a stretch to say a str in a bencoded structure is "the wrong place". But I appreciate that it might be a better match to have bencoding and bdecoding the at the same abstraction level.

@AllSeeingEyeTolledEweSew
Copy link
Contributor Author

I believe that accidental or ill-considered use of str is more common than intentional use of str to mean "encode this as utf-8", as you describe. I don't have data for this assertion, but I do know that considerations like this came up a lot in python 2, and many people smarter than me decided the best course was to make encoding explicit in python 3.

Also, remember that not all strs encode to utf-8 cleanly. Invalid surrogate pairs exist, and this is used as a "feature" for pathname handling. An app author might use a str for a pathname passed to bencode(). But if the str came from the filesystem it will have been treated with the filesystem encoding, and it may not encode to utf-8. There are examples of this for both windows and linux. The proper handling is something like os.fsencode(str_path).decode("utf-8", "replace"). If we allow str inputs and the author only tests with ascii filenames, then the failure will only be found by an unlucky user.

Lastly, as you mention, I think it's best to have bencode() and bdecode() match each other to minimize surprise.

So I still vote to deprecate str.

@arvidn arvidn modified the milestones: 1.2.15, 1.2.16 Dec 27, 2021
@arvidn arvidn modified the milestones: 1.2.16, 1.2.17 Apr 17, 2022
@arvidn arvidn modified the milestones: 1.2.17, 1.2.19 Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants