Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Python bindings: add bytes accessors for data #5985

Open
1 task done
AllSeeingEyeTolledEweSew opened this issue Feb 17, 2021 · 4 comments
Open
1 task done

Python bindings: add bytes accessors for data #5985

AllSeeingEyeTolledEweSew opened this issue Feb 17, 2021 · 4 comments
Milestone

Comments

@AllSeeingEyeTolledEweSew
Copy link
Contributor

AllSeeingEyeTolledEweSew commented Feb 17, 2021

A number of fields in the torrent protocol (or libtorrent) are encoding-agnostic byte strings.

We should ensure there's always a way to access bytes, if bytes may be passed as input, or if the underlying data is encoding-agnostic.

One pattern in python for this is "runtime templating", where the type of output changes with the type of input. For instance, os.listdir(".") returns a list of str, while os.listdir(b".") returns a list of bytes. However, the python bindings have historically allowed either bytes or str as input for char * or std::string, but tend to always return str. If we make output type vary with input, we'll probably break a lot of clients.

The only alternative I can think of is have *_bytes versions of functions.

Here's a proposed list of functions returning str, which should have pair functions returning bytes:

EDIT: I changed my mind about most of these, realizing that the underlying C++ logic munges inputs and presents utf-8 strings. generate_fingerprint() is the only one that seems relevant now.

  • file_storage.symlink(): symlink_bytes()
  • file_storage.file_path(): file_path_bytes()
  • file_storage.file_name(): file_name_bytes()
  • file_storage.name(): name_bytes()
  • generate_fingerprint(): generate_fingerprint_bytes()
  • torrent_info.web_seeds(): web_seeds_bytes() (the "auth" field may be binary data)
  • torrent_info.name(): name_bytes()
  • torrent_info.creator(): creator_bytes()
  • torrent_info.comment(): comment_bytes()
  • torrent_info.collections(): collections_bytes()

I'll add more as I find them.

Adding these new functions should be a non-breaking change.


Thoughts on file_storage

EDIT: never mind

There's an argument that the fields of file_storage should be given "pathname treatment", and use the filesystem encoding as detailed in #5984. This would make sense when creating torrents off the local filesystem.

However, currently file_storage is used in lots of different contexts. A torrent found randomly on the internet and converted into a torrent_info has a file_storage whose bytes may represent an encoding from some other system.

So I think the best thing is to add *_bytes() accessors to file_storage, and clients can apply whatever decoding (or guesses, with fallbacks) as necessary.

Notably, that means that the proper usage for creating files from the filesystem is a little funky:

fs = lt.file_storage()
lt.add_files(fs, my_root_dir)
for i in fs.num_files():
    # file_storage is abstract, but we know the encoding came from the filesystem
    print(os.fsdecode(fs.file_path_bytes(i)))

The only alternative I can think of here is that file_storage could hold a flag to remember how its inputs are encoded, such that fs.file_path(i) is the same as os.fsdecode(fs.file_path_bytes(i)) if fs was populated from the filesystem, but I can't see any way to add this feature without making it confusing and error prone.

@AllSeeingEyeTolledEweSew
Copy link
Contributor Author

Also: URIs and HTTP header names and values are meant to be limited to ASCII by various underlying protocols, so IMO it's not important to make these data accessible as bytes. We can do this for API consistency if we want, though.

@arvidn arvidn added this to the 1.2.14 milestone Apr 8, 2021
@AllSeeingEyeTolledEweSew
Copy link
Contributor Author

If my perspective in #5984 (comment) is right, I'll close this.

generate_fingerprint() returning bytes is the only ask that would still be legitimate, but I don't feel strongly about it

@arvidn
Copy link
Owner

arvidn commented Apr 17, 2021

It seems right to return bytes there. The tests should also make sure bytes could be used to set the peer id

@AllSeeingEyeTolledEweSew
Copy link
Contributor Author

Per a lot of discussion, I stripped this down to just add a bytes version of generate_fingerprint(). I did that in #6349

@arvidn arvidn modified the milestones: 1.2.15, 1.2.16 Dec 27, 2021
@arvidn arvidn modified the milestones: 1.2.16, 1.2.17 Apr 17, 2022
@arvidn arvidn modified the milestones: 1.2.17, 1.2.19 Apr 10, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants