Shrink the PBS uids cookie #1985
Ran an experiment with an online tool to see whether the uids cookie would be helped by protobuf. In short, I found we'd be better off with compressed JSON. Here's what I did.

Baseline JSON

The contents of the JSON uids file with 20 IDs:
This is 1441 bytes gzipped and base64 encoded. Note that I took care to minimize repeated strings so as not to favor gzip. When I used the same uid or bidder values, gzip did much better. :-)

Protobuf

Came up with this protobuf definition:
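The JSON+GZIP+Base64 baseline above can be reproduced with the Go standard library. This is a sketch only; the `uidEntry` fields and sample values below are illustrative stand-ins for the real 20-ID payload, not the actual test data:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// uidEntry approximates one bidder record in the uids cookie;
// the real struct lives in usersync/cookie.go.
type uidEntry struct {
	UID     string `json:"uid"`
	Expires string `json:"expires"`
}

// encodeUIDs marshals the uid map to JSON, gzips it, and base64url-encodes
// the result — the "compressed JSON" variant measured above.
func encodeUIDs(uids map[string]uidEntry) (string, error) {
	raw, err := json.Marshal(uids)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write(raw); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.URLEncoding.EncodeToString(buf.Bytes()), nil
}

func main() {
	// Hypothetical sample payload; the real experiment used 20 distinct IDs.
	uids := map[string]uidEntry{
		"bidderA": {UID: "f25068b6-e047-41e2-8a1c-2a1b3c4d5e6f", Expires: "2019-05-12T10:00:00Z"},
		"bidderB": {UID: "8263110874608291", Expires: "2019-05-13T11:30:00Z"},
	}
	raw, _ := json.Marshal(uids)
	encoded, _ := encodeUIDs(uids)
	fmt.Printf("json=%d bytes, gzip+base64=%d bytes\n", len(raw), len(encoded))
}
```

Note that for tiny payloads the gzip header overhead can make the encoded form larger than the raw JSON; the savings reported above apply to the full 20-ID payload.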
Used the tool at https://www.sisik.eu/proto to create a binary representation of the JSON above, and it came out to be 1522 bytes.
Discussed in PBS committee. @SyntaxNode has a hypothesis that protobuf would have the better CPU-performance profile, so even though the JSON approach saves a few bytes, he's planning to run an experiment to get some data.
I initially thought the 1522 bytes for Protobuf looked great against the 1441 bytes for JSON+GZIP, but I realized in my testing the Protobuf result is not Base64 encoded. When we add that in, we get a less exciting 2028 bytes. Using the same Baseline JSON model @bretg shared in his earlier comment, I've performed a comparison with several different formats. The first, JSON+Base64, is basically what we do today as a point of reference (but using the relative timestamps).

Write / Encoding
Read / Decoding
Protobuf+Base64 is the most CPU-efficient option, and while its output is 33% smaller than what we use today, that size compares unfavorably to the other formats. If we are optimizing for both speed and size equally, this is the best option. However, if we want to optimize more for size and a little less for speed, I propose the Protobuf+Brotli-Level-0+Base64 format, which is just ~8% larger than GZIP while 400%+ faster in my write benchmarks and 30% faster in my read benchmarks. I experimented with using GZIP and Brotli for just the UIDs, but the overhead of both compression algorithms actually increased the final size. Similarly, I tested Protobuf+Snappy+Base64, but the Snappy compression increased the size a bit over plain Protobuf+Base64. Are there any other compression libraries you'd like to see me add to this benchmark comparison? Any suggestions must have both Go and Java libraries.
Versioning

We need to represent the version outside of the encoded payload to determine which decoding approach to use. This approach needs to be backwards compatible with the current JSON+Base64 encoded format. We'll need to use a separator character not present in the Base64 URL character set, of which I think the period "." is a good choice. This is the same choice made by JWT tokens and TCF2 consent strings. Example:
We should be fine for a long time with just 1 character for the version followed by 1 character for the separator, but in the future we could extend the number of characters preceding the separator if need be. The algorithm for version detection would be:
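As a sketch of that detection algorithm (function and variable names here are my own, not from the proposal): scan for the "." separator, treat any leading digits as the version, and fall back to version 1 for legacy cookies, which are pure Base64 and cannot contain a period.

```go
package main

import (
	"fmt"
	"strings"
)

// detectVersion splits a cookie value into (version, payload).
// A digit run followed by "." marks a versioned format; anything else
// (including today's JSON+Base64 values, whose alphabet has no ".")
// is treated as legacy version 1.
func detectVersion(cookie string) (version int, payload string) {
	sep := strings.Index(cookie, ".")
	if sep <= 0 {
		return 1, cookie // no separator: legacy format
	}
	v := 0
	for _, c := range cookie[:sep] {
		if c < '0' || c > '9' {
			return 1, cookie // non-numeric prefix: not a version marker
		}
		v = v*10 + int(c-'0')
	}
	return v, cookie[sep+1:]
}

func main() {
	for _, c := range []string{"2.abc123", "eyJ0ZW1wVUlEcyI6e319"} {
		v, payload := detectVersion(c)
		fmt.Printf("version=%d payload=%q\n", v, payload)
	}
}
```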
Protobuf is great, but first the cookie "uid" values should be split based on data type. Since protobuf's actual record size depends on the data type, longs would be much smaller than the same data written as a string. Also, the bidder string name could be moved to a table keyed by a uint32 ID. If needed, adapters would have to declare whether they use string or long field data (it could also be done automatically based on content).
That's a good observation. There are several possible data types we could detect and optimize. You mentioned long in your example; there are also uuid, base64-encoded bytes, and hex strings. My hope is that the compression layer on top of the protobuf binary encoding will solve for these storage inefficiencies without needing to add structural complexity. Let's test it. I'll use the protobuf structure you provided. I replaced 25% of the entries in the Baseline JSON example with long values, which seems to be a slightly generous distribution based on real-world examples. The runtime complexity of both is close enough that I'll only list the sizes.
Without compression there is a 4% size reduction. With compression, there is a 4% size increase. It seems the compression algorithm fares a bit better with string numerics than with binary encoded numerics. This might not hold true for other potential optimized types though.
Yes, I agree that would provide a size savings. I'm going to test it with the following protobuf definition:
Without compression there is an 11% size reduction. With compression, there is a 9% size reduction. I'd like opinions on whether this is worth the complexity of maintaining a list of bidder ids. This cannot be as simple as an alphabetical index, since we need to account for added bidders, removed bidders, and different bidders between PBS-Go, PBS-Java, and forks.
Discussed in PBS committee. We're leaning towards the 'Protobuf+StringBidders+Brotli0-Base64' solution, with external version number.
Java results are at https://github.com/snahornyi/uids-java-tests. The team suggests using 'Protobuf+StringBidders+Brotli0**+**Base64' -- i.e. brotli and base64. @SyntaxNode - please confirm that your table above meant to include Base64... i.e. was the minus sign a typo?
Confirmed. That is a typo. I'll fix it in my other comment.
There's a detail here I don't think we've ironed out: how the UID itself is represented. I don't like the idea of having to configure the data type of each adapter's ID. Looking at the list of IDs in my current cookie, the pattern I see is that most are strings, and the ones that are ints are generally shorter (~20 chars rather than ~40). I suppose we could manage this in the /setuid code that creates the values:
It's easy enough on the read side to deal with this, but we'd have to agree to prefer one over the other in case somehow both of them get set.
@bretg Please review the conversation in this issue. The data type specific UID storage was discussed, explored, and ultimately rejected. We will be storing all UID values as strings. |
This is the updated protobuf definition that I am proposing. I've followed the best practices from the protobuf style guide. cookie2.proto
I've added the go package option required for the code generator. Java would need to add their own Java options as well. Alternatively, we could provide the options directly to the code generator, but the best practices from Google on the matter are to include them in the file. I'm ok with making it as easy as possible for us to generate code for use in both PBS implementations. This produces the following generated code structs for Go:
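The generated structs referenced above did not survive in this thread. As a rough, hypothetical illustration only (field names, tags, and internals will differ from real protoc output for the actual cookie2.proto), protoc-generated Go code for a message mapping bidder names to UID records is shaped roughly like:

```go
package main

import "fmt"

// UIDEntry is a hand-written approximation of a protoc-generated struct
// holding one bidder's uid and a relative expiration. Hypothetical sketch;
// not the actual generated code.
type UIDEntry struct {
	UID     string `protobuf:"bytes,1,opt,name=uid"`
	Expires int64  `protobuf:"varint,2,opt,name=expires"`
}

// Cookie approximates the top-level generated message: a map from
// bidder name to UID entry.
type Cookie struct {
	UIDs map[string]*UIDEntry `protobuf:"bytes,1,rep,name=uids"`
}

func main() {
	c := Cookie{UIDs: map[string]*UIDEntry{
		"bidderA": {UID: "8263110874608291", Expires: 604800},
	}}
	fmt.Printf("bidders=%d uid=%s\n", len(c.UIDs), c.UIDs["bidderA"].UID)
}
```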
Thanks @SyntaxNode - just to make sure we're on the same page on the /cookie_sync endpoint expectations... when it receives a cookie, it's going to have to first see if it's base64 encoded JSON. If not, then it passes the body through the brotli and protobuf decoders, right? |
That's not what I had in mind. I proposed using a version prefix to make it easier and quicker for us to determine the correct code path to use for decoding. In short, if it starts with "2." then remove those characters and decode the rest as a base64-encoded, brotli-compressed protobuf message. Else, consider it version 1 and error if there is a decoding problem. I think the progression of ideas in this issue thread has muddled the intended proposal. I'll create a separate Google doc with the full proposed spec on Monday.
I have a proposal to move cookies to an external cache like Redis. The problem with base64-encoded cookies is that they can exceed the maximum allowed header size of 4-8K (default in nginx), and nginx will drop the request. The nginx buffer size has been increased, but as far as I understand, this is a temporary solution because if a user wants to sync with 30 bidders, the cookie size will become very large. |
Interesting idea @linux019 - so you're proposing bidder IDs are stored in PBC, and PBS sets that cache ID in its cookie so it can be used to look them up in PBC? That could work, but would be quite costly: PBC volume would go way up, and so would latency on the auction as it waited for the cache results. I would not want the /cookie_sync endpoint to attempt a write to PBC unless it knew that 3PC were supported. That would mean either a handshake or examining the UA details.
This is from the production log; nginx was dropping these requests. It was fixed by increasing the nginx buffer size.
If cookie |
A lot of time has passed. It likely makes sense to use zstd over brotli nowadays (+40% faster at the same compression ratio).
This is an interesting idea, but a very large change that will incur significant extra cost and headache for a PBS host company for unknown benefit. I'll open a separate issue to talk through the details as a community, but honestly I would not expect the core team to consider this very high priority. You should assume that it will be something the community will need to contribute.
My take is that I don't think this protobuf thing is going to happen. It's too much work for the added benefit. I propose we close this particular issue as "won't do". If we're going to address the too-many-bidder-ids problem, I think #4080 is the main solution. This distributed key-value DB seems like a reasonable alternate solution.
There are so many server-side adapters now that the PBS uids cookie has grown so large that it's starting to affect which other cookie values the host company domain can receive. For example, my uids cookie is 3900 bytes (!).
We need to address this.
Here are some values from my cookie:
In total, I have 32 bidder entries in my cookie, for an average of about 121 bytes per bidder encoded.
(See below for a major update to the original proposal.)
Current Structure
The expires value is used to drop the value from the cookie so /cookie_sync will get an updated ID from that bidder.

Structure of the current cookie:
Background on the current structure
Here's the comments from the code: (usersync/cookie.go)
// "Legacy" cookies had UIDs without expiration dates, and recognized "0" as a legitimate UID for audienceNetwork.
// "Current" cookies always include UIDs with expiration dates, and never allow "0" for audienceNetwork.
//
// This Unmarshal method interprets both data formats, and does some conversions on legacy data to make it current.
// If you're seeing this message after March 2018, it's safe to assume that all the legacy cookies have been
// updated and remove the legacy logic.
Possible new structure
Here's a straw proposal based on using a relative timestamp rather than absolute: