Python bindings: deprecate some usage of str / bytes #5988

AllSeeingEyeTolledEweSew · 2021-02-18T01:21:43Z

Currently, the Python bindings generally allow str as input where std::string or char * is expected internally. They attempt to convert str to bytes with the default encoding.

In some cases, I think this does more harm than good. I think we should deprecate this functionality (and fire DeprecationWarning).
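To make the proposal concrete, here is a minimal pure-Python sketch of the deprecation pattern: accept bytes silently, and still accept str but fire DeprecationWarning. The helper name `as_bytes` and the UTF-8 default are assumptions for illustration only; the real bindings do their conversion in C++ and may look quite different.

```python
import warnings


def as_bytes(value, encoding="utf-8"):
    """Return value as bytes, warning when an implicit str conversion is needed.

    Hypothetical helper for illustration; not part of the libtorrent bindings.
    The UTF-8 default is an assumption about the "default encoding".
    """
    if isinstance(value, bytes):
        return value
    if isinstance(value, str):
        # Keep the old behavior for now, but tell the caller it is going away.
        warnings.warn(
            "passing str where bytes is expected is deprecated; encode explicitly",
            DeprecationWarning,
            stacklevel=2,
        )
        return value.encode(encoding)
    raise TypeError(f"expected bytes or str, got {type(value).__name__}")
```

With this shape, existing callers keep working during the deprecation period, and callers that already pass bytes see no warning at all.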

I'm finished writing new unit tests and won't add any more items to this issue.


AllSeeingEyeTolledEweSew · 2021-02-18T02:18:29Z

There's a solid argument that lots of str accessors should be deprecated, like file_storage.file_path() or even torrent_info.name(). These functions access encoding-agnostic byte strings, which may be created with unknown encodings on other systems, so it's usually a mistake to access them as str.

However, there are cases where str access is appropriate, such as an app creating torrent_info with synthetic data and filenames (not sourced from the filesystem). If the data was sourced from Python str, it's totally fair to access it as str. Deprecating str access would be pretty unfriendly to such a case.

I am on the fence about whether we should be unfriendly to a fringe use case, in favor of preventing the huge majority of use cases from doing something wrong.

arvidn · 2021-04-10T20:09:40Z

When libtorrent loads a torrent, it sanitizes file paths/names as well as comment and creator strings. They should be valid UTF-8 strings (possibly with replacement characters for invalid sequences).

Is this not your experience?

Perhaps I'm a bit too ignorant of Python, but str is always a unicode string, right? Or is there a new type for that?

I agree that sha1_hash(), add_piece(), generate_fingerprint() and sha1_hash.to_string() should not use str.

How come str inputs to bencode() cause round-trip issues? You mean because when you decode it, all strings are bytes?
I'm not sure I'm convinced of that. If you have a unicode string it seems convenient to not have to encode it explicitly.

AllSeeingEyeTolledEweSew · 2021-04-10T22:20:47Z

> when libtorrent loads a torrent, it sanitizes file paths/names as well as comment and creator strings. They should be valid UTF-8 strings (possibly with replacement characters for invalid sequences).
>
> Is this not your experience?

Ahh, I didn't think about the fact that the underlying torrent_info always does sanitization. I was focused on the fact that I can create a lot of non-utf8 data by instantiating file_storage() / create_torrent() directly.

So I'm wrong; I guess there isn't a case where torrent_info.name() is "wrongly" translating encoding-agnostic strings. The underlying c++ code already munged it to unicode, so it's appropriate for the python bindings to always present that as str.
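As a rough analogy for that sanitization, Python's own `errors="replace"` decoding turns an invalid byte sequence into U+FFFD, after which the result is always safe to present as str. This is only an illustration using the standard library; libtorrent's actual sanitization happens in C++ and may differ in detail.

```python
# A name containing a Latin-1 "é" (0xE9), which is not valid UTF-8 here.
raw = b"caf\xe9.txt"

# Decode with replacement, analogous to (but not the same code as) the
# sanitization libtorrent applies when loading a torrent.
sanitized = raw.decode("utf-8", errors="replace")

# The invalid byte became U+FFFD REPLACEMENT CHARACTER.
print(sanitized == "caf\ufffd.txt")

# The sanitized text now round-trips through strict UTF-8 without error.
print(sanitized.encode("utf-8").decode("utf-8") == sanitized)
```

Both lines print True. The original byte is lost, which is exactly why the sanitized value can be treated as text rather than as an encoding-agnostic byte string.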

You're right, python str is unicode.

> How come str inputs to bencode() causes round-trip issues? You mean because when you decode it, all strings are bytes?
> I'm not sure I'm convinced of that. If you have a unicode string it seems convenient to not have to encode it explicitly.

There's a particular danger in allowing str inputs to bencode(): the strings will be pathnames in many cases. If they came from the filesystem then they use Python's filesystem encoding, so handling them with strict utf-8 encoding is inappropriate. It will still work most of the time, which makes it a good source of bugs. I talk about this more in #5984.
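A small sketch of that failure mode, using only the standard library. It spells out the `surrogateescape` error handler explicitly (the handler Python's filesystem decoding uses on POSIX) so the result does not depend on the platform; the UTF-8 locale and the file name are assumptions for illustration.

```python
# A file name as raw bytes, as a POSIX filesystem might return it.
# The byte 0xE9 is "é" in Latin-1 but is not valid UTF-8 in this position.
raw_name = b"caf\xe9.txt"

# Filesystem-style decoding: the undecodable byte is smuggled into the str
# as a lone surrogate (assumption: a UTF-8 filesystem encoding).
name = raw_name.decode("utf-8", errors="surrogateescape")


def encodes_strictly(text):
    """Return True if text survives a strict UTF-8 encode."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


print(encodes_strictly("cafe.txt"))  # ordinary names work fine
print(encodes_strictly(name))        # the smuggled byte breaks strict encoding
print(name.encode("utf-8", errors="surrogateescape") == raw_name)
```

This prints True, False, True: the common case succeeds, the unusual name fails only when strict encoding is applied, and the original bytes are recoverable only with the matching error handler.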

For this reason and others, there has been steady momentum over the years to eliminate implicit encoding, or mixing str and bytes. I don't think it's too surprising to a python dev to see bencode() requires bytes. In the standard library today and the popular third-party libraries, I can't think of any cases where bytes/str can be mixed freely. There are some "templated" functions and data structures, but I can't think of a way to apply that to bencode()/bdecode().

Given all that, I think the burden is to come up with a good use case where mixing str/bytes provides value, and I haven't been able to come up with this. Most of my usage of bencode() is for creating synthetic data for unit tests, where it's as easy to write b"test.txt" as "test.txt". Otherwise I do some manipulation of resume data.

If I want to craft new torrents, I think create_torrent()/file_storage() is a fine high-level interface, where str is more appropriate. It's certainly more convenient than handcrafting an info-dict.

Deluge uses a custom bencode function, not libtorrent's. Not sure why.

AllSeeingEyeTolledEweSew · 2021-04-11T16:29:50Z

Responding here from #6125

> What kind of test do you envision for the str patch? that all valid "entry" objects round-trip through bencode() -> bdecode()?

I think the tests for str should be the same as the tests for other deprecated inputs -- that they bencode() to expected values, but generate warnings.

Test-wise, I think testing round-trip stability (v == bdecode(bencode(v))) makes tests readable, but it's not really a guarantee of the API especially in the face of "preformatted"-type inputs.

I do think round-trip stability is best, to improve clarity and reduce usage bugs. But unit tests should test guarantees, not design goals.

I actually think we should add tests that bdecode(bencode("abc")) == b"abc" to ensure we don't change functionality while deprecating, since the point of deprecating is to keep functionality for a while.
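To illustrate both kinds of test, here is a self-contained pure-Python sketch of a bencoder that deprecates str input. It is not libtorrent's implementation (which is C++ and handles more input types, including "preformatted" ones); the function bodies, the warning text, and the UTF-8 conversion are assumptions for illustration.

```python
import warnings


def _to_bytes(value):
    """Convert a deprecated str input to bytes, warning as we go."""
    if isinstance(value, str):
        warnings.warn("str input to bencode() is deprecated; pass bytes",
                      DeprecationWarning, stacklevel=3)
        return value.encode("utf-8")  # assumption: UTF-8 conversion
    return value


def bencode(value):
    """Encode int, bytes, list, or dict; str still works but warns."""
    value = _to_bytes(value)
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, list):
        return b"l" + b"".join(bencode(v) for v in value) + b"e"
    if isinstance(value, dict):
        # Dictionary keys are byte strings, emitted in sorted order.
        items = sorted(((_to_bytes(k), v) for k, v in value.items()),
                       key=lambda kv: kv[0])
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in items) + b"e"
    raise TypeError(f"cannot bencode {type(value).__name__}")


def _decode(data, pos):
    """Decode one value starting at pos; return (value, next position)."""
    lead = data[pos:pos + 1]
    if lead == b"i":
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead in (b"l", b"d"):
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = _decode(data, pos)
            items.append(item)
        if lead == b"l":
            return items, pos + 1
        # Dict entries alternate key, value, key, value, ...
        return dict(zip(items[0::2], items[1::2])), pos + 1
    colon = data.index(b":", pos)
    start = colon + 1
    end = start + int(data[pos:colon])
    return data[start:end], end  # strings always come back as bytes


def bdecode(data):
    """Decode a complete bencoded buffer."""
    value, end = _decode(data, 0)
    if end != len(data):
        raise ValueError("trailing data after bencoded value")
    return value


# Readable round-trip check: holds when the input uses bytes throughout.
value = {b"name": b"test.txt", b"length": 3}
print(bdecode(bencode(value)) == value)

# Guarantee check: str still encodes as before, but comes back as bytes.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", DeprecationWarning)
    print(bdecode(bencode("abc")) == b"abc")
```

Both checks print True. The second one is the asymmetry discussed above: "abc" goes in as str and comes back as b"abc", so v == bdecode(bencode(v)) does not hold for str inputs even though the encoded output is unchanged.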

AllSeeingEyeTolledEweSew · 2021-08-02T18:02:29Z

I changed my mind about torrent_handle.set_ssl_certificate(..., passphrase="str"). Passphrases are ultimately bytes, but I think it's probably fine if some app only supports them in a str context, so we should just leave it allowed. Removed it from the list.

arvidn added this to the 1.2.14 milestone Apr 8, 2021

AllSeeingEyeTolledEweSew mentioned this issue Apr 10, 2021

deprecate invalid inputs to bencode() #6125

Merged

AllSeeingEyeTolledEweSew changed the title ~~Python bindings: deprecate some usage of str~~ Python bindings: deprecate some usage of str / bytes Apr 15, 2021

arvidn modified the milestones: 1.2.14, 1.2.15 Jun 7, 2021

arvidn modified the milestones: 1.2.15, 1.2.16 Dec 27, 2021

arvidn modified the milestones: 1.2.16, 1.2.17 Apr 17, 2022

arvidn modified the milestones: 1.2.17, 1.2.19 Apr 10, 2023
