feat: Use FxHasher in places where we don't need DDoS resistance #2342
base: main
Conversation
Codecov Report
All modified and coverable lines are covered by tests ✅

@@ Coverage Diff @@
##             main    #2342      +/-   ##
==========================================
- Coverage   95.26%   95.26%   -0.01%
==========================================
  Files         115      115
  Lines       37198    37198
  Branches    37198    37198
==========================================
- Hits        35436    35435       -1
- Misses       1756     1757       +1
  Partials        6        6

View full report in Codecov by Sentry.
Failed Interop Tests
QUIC Interop Runner, client vs. server, differences relative to db807a9.
neqo-latest as client
neqo-latest as server

Succeeded Interop Tests
QUIC Interop Runner, client vs. server.
neqo-latest as client
neqo-latest as server

Unsupported Interop Tests
QUIC Interop Runner, client vs. server.
neqo-latest as client
neqo-latest as server
Benchmark results
Performance differences relative to 966ece6.

decode 4096 bytes, mask ff: 💚 Performance has improved. time: [10.885 µs 11.009 µs 11.220 µs] change: [-12.410% -11.713% -10.806%] (p = 0.00 < 0.05)
decode 1048576 bytes, mask ff: 💔 Performance has regressed. time: [3.1253 ms 3.1332 ms 3.1425 ms] change: [+9.5412% +10.102% +10.624%] (p = 0.00 < 0.05)
decode 4096 bytes, mask 7f: 💚 Performance has improved. time: [17.645 µs 17.694 µs 17.750 µs] change: [-15.834% -15.483% -15.136%] (p = 0.00 < 0.05)
decode 1048576 bytes, mask 7f: 💔 Performance has regressed. time: [5.4000 ms 5.4113 ms 5.4242 ms] change: [+18.481% +18.896% +19.331%] (p = 0.00 < 0.05)
decode 4096 bytes, mask 3f: 💚 Performance has improved. time: [6.6033 µs 6.6303 µs 6.6616 µs] change: [-21.578% -21.138% -20.605%] (p = 0.00 < 0.05)
decode 1048576 bytes, mask 3f: 💔 Performance has regressed. time: [1.7579 ms 1.7581 ms 1.7582 ms] change: [+9.8402% +10.318% +10.706%] (p = 0.00 < 0.05)
coalesce_acked_from_zero 1+1 entries: No change in performance detected. time: [91.344 ns 91.626 ns 91.922 ns] change: [-0.2968% +0.1447% +0.5900%] (p = 0.51 > 0.05)
coalesce_acked_from_zero 3+1 entries: No change in performance detected. time: [109.66 ns 109.92 ns 110.21 ns] change: [-0.5182% -0.1354% +0.1948%] (p = 0.48 > 0.05)
coalesce_acked_from_zero 10+1 entries: No change in performance detected. time: [109.53 ns 109.97 ns 110.50 ns] change: [-0.3668% +0.8457% +2.5725%] (p = 0.26 > 0.05)
coalesce_acked_from_zero 1000+1 entries: No change in performance detected. time: [93.135 ns 93.264 ns 93.409 ns] change: [-0.9337% +0.1372% +1.1600%] (p = 0.82 > 0.05)
RxStreamOrderer::inbound_frame(): Change within noise threshold. time: [111.53 ms 111.58 ms 111.64 ms] change: [-0.6630% -0.5984% -0.5338%] (p = 0.00 < 0.05)
SentPackets::take_ranges: No change in performance detected. time: [5.2749 µs 5.4594 µs 5.6571 µs] change: [-3.4818% +0.0194% +3.3201%] (p = 0.99 > 0.05)
transfer/pacing-false/varying-seeds: Change within noise threshold. time: [34.066 ms 34.133 ms 34.205 ms] change: [-0.8291% -0.5592% -0.2986%] (p = 0.00 < 0.05)
transfer/pacing-true/varying-seeds: Change within noise threshold. time: [34.167 ms 34.230 ms 34.300 ms] change: [-1.0494% -0.8154% -0.5389%] (p = 0.00 < 0.05)
transfer/pacing-false/same-seed: Change within noise threshold. time: [33.760 ms 33.818 ms 33.878 ms] change: [-2.0543% -1.8058% -1.5756%] (p = 0.00 < 0.05)
transfer/pacing-true/same-seed: No change in performance detected. time: [34.253 ms 34.318 ms 34.382 ms] change: [-0.5293% -0.2585% -0.0049%] (p = 0.06 > 0.05)
1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved. time: [800.11 ms 809.09 ms 818.43 ms] thrpt: [122.18 MiB/s 123.60 MiB/s 124.98 MiB/s] change: time: [-4.4311% -2.8044% -1.3297%] (p = 0.00 < 0.05) thrpt: [+1.3476% +2.8853% +4.6366%]
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: 💚 Performance has improved. time: [298.13 ms 301.62 ms 305.17 ms] thrpt: [32.769 Kelem/s 33.154 Kelem/s 33.543 Kelem/s] change: time: [-7.9734% -6.5578% -5.0939%] (p = 0.00 < 0.05) thrpt: [+5.3673% +7.0180% +8.6642%]
1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected. time: [25.483 ms 25.646 ms 25.816 ms] thrpt: [38.735 elem/s 38.992 elem/s 39.241 elem/s] change: time: [-0.6933% +0.1979% +1.1117%] (p = 0.68 > 0.05) thrpt: [-1.0995% -0.1975% +0.6981%]
1-conn/1-100mb-resp/mtu-1504 (aka. Upload)/client: 💚 Performance has improved. time: [1.7306 s 1.7472 s 1.7637 s] thrpt: [56.699 MiB/s 57.235 MiB/s 57.782 MiB/s] change: time: [-7.6104% -6.2437% -4.9015%] (p = 0.00 < 0.05) thrpt: [+5.1541% +6.6595% +8.2372%]

Client/server transfer results
Performance differences relative to 966ece6. Transfer of 33554432 bytes over loopback, 30 runs. All unit-less numbers are in milliseconds.
Signed-off-by: Lars Eggert <[email protected]>
👍 in general.
That said, I would prefer only replacing std::collections::Hash* wherever it proves beneficial, e.g. not in unit tests.
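For reference, the substitution under discussion is a drop-in change of the hasher, not of the container. A minimal sketch, assuming the rustc-hash crate and alias names chosen here for illustration only:

```rust
use std::{
    collections::{HashMap, HashSet},
    hash::BuildHasherDefault,
};

use rustc_hash::FxHasher;

// Same containers, same API; only the default hasher changes. FxHasher is not
// keyed, so it should only back maps whose keys an attacker cannot choose freely.
pub type FxHashMap<K, V> = HashMap<K, V, BuildHasherDefault<FxHasher>>;
pub type FxHashSet<T> = HashSet<T, BuildHasherDefault<FxHasher>>;

fn example() {
    let mut streams: FxHashMap<u64, &str> = FxHashMap::default();
    streams.insert(0, "control");
    assert_eq!(streams.get(&0), Some(&"control"));
}
```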
I've spotted two places where EnumMap could give us bigger wins.
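For context, EnumMap (from the enum-map crate) sidesteps hashing entirely when the key space is a small closed enum. A rough sketch with a made-up key type, not necessarily the spots identified in this review:

```rust
use enum_map::{Enum, EnumMap};

// Hypothetical key type, for illustration only.
#[derive(Enum)]
enum Epoch {
    Initial,
    Handshake,
    Application,
}

fn example() {
    // Backed by a fixed-size array indexed by the enum discriminant,
    // so lookups involve no hashing at all.
    let mut counts: EnumMap<Epoch, u64> = EnumMap::default();
    counts[Epoch::Handshake] += 1;
    assert_eq!(counts[Epoch::Handshake], 1);
}
```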
I think that your performance gains largely derive from the changes to the client and server code. There, the security risk is limited (we're not using this server in real deployments).
Still, you should review the changes for security risk. This hasher could expose us to DoS if the hashed values are controlled by an adversary. I've checked the usage in our server code, which is fine because attackers don't get to control memory allocations (we use pointer values for the hash). Still, that makes me wonder whether we should be using Pin.
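One reading of the Pin idea (an interpretation, not necessarily what was meant): if the hash key is the address of a heap allocation, pinning makes the "this value never moves" assumption explicit in the type. A rough sketch with hypothetical names:

```rust
use std::{collections::HashMap, pin::Pin};

struct Connection {
    // fields elided
}

// Keying a map on the allocation's address. A Box already keeps its heap
// allocation at a stable address, but Pin<Box<T>> states that guarantee in
// the type, so the pointer-as-key trick can't silently break.
fn register(
    conns: &mut HashMap<*const Connection, Pin<Box<Connection>>>,
    conn: Pin<Box<Connection>>,
) {
    let key: *const Connection = &*conn;
    conns.insert(key, conn);
}
```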
@martinthomson thanks for the analysis. My plan is to add some benches first in another PR; I'll add some for the instances you suggest looking into. Even if some of the macro benefits come from speeding up the demo client and server code, it's IMO still worth doing, since eliminating those overheads makes it easier to spot other bottlenecks. About security, I didn't do much of an analysis, but I think the main use of this insecure hasher would be when looking up items (streams, unacked chunks) that, while under the control of an attacker, are also quite limited in which values are valid and wouldn't immediately cause a connection closure.
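A sketch of what such a bench might look like (criterion-based, with names and sizes made up here; not the benches planned for the follow-up PR):

```rust
use std::{collections::HashMap, hash::BuildHasherDefault, hint::black_box};

use criterion::{criterion_group, criterion_main, Criterion};
use rustc_hash::FxHasher;

type FxHashMap<K, V> = HashMap<K, V, BuildHasherDefault<FxHasher>>;

// Compare lookups with std's default SipHash-based hasher against FxHasher.
fn hashers(c: &mut Criterion) {
    let std_map: HashMap<u64, u64> = (0..1_000).map(|i| (i, i)).collect();
    let fx_map: FxHashMap<u64, u64> = (0..1_000).map(|i| (i, i)).collect();

    c.bench_function("lookup/std-siphash", |b| {
        b.iter(|| (0..1_000u64).map(|i| std_map[&black_box(i)]).sum::<u64>())
    });
    c.bench_function("lookup/fxhash", |b| {
        b.iter(|| (0..1_000u64).map(|i| fx_map[&black_box(i)]).sum::<u64>())
    });
}

criterion_group!(benches, hashers);
criterion_main!(benches);
```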
I definitely agree with the point about removing the overheads from our toy code as much as possible. This seems like a pretty substantial win there, so it's worth doing. I doubt that my
I think the
pub send_streams: HashMap<StreamId, Box<dyn SendStream>>,
pub recv_streams: HashMap<StreamId, Box<dyn RecvStream>>,
streams_with_pending_data: HashSet<StreamId>,
pub send_streams: IndexMap<StreamId, Box<dyn SendStream>>,
Conceptually, why is an IndexMap faster than a HashMap here? Both end up hashing the key, right?
It's not faster; it's possibly even slower. I made the change based on this comment: https://github.com/mozilla/neqo/pull/2342/files/bd061693b40e91b846c4d4cd1bc0ecfcd27c4e45#r1945509653
Do I understand correctly that we want to be able to iterate send_streams and the like in order? Given that HashMap does not guarantee order, you are suggesting to use IndexMap like we do in neqo-transport.
If so, the requirement for ordering might be worth documenting. It wasn't obvious to me.
In that case, I would prefer BTreeMap, solely because it is in std, and would only use IndexMap if it proves meaningfully more performant than BTreeMap.
Given that IndexMap is a fairly small dependency, the above is not a strong opinion.
So I'm not completely sure that we need consistently ordered iteration here. We rely on that in the transport crate because we want frame sending to follow a predictable pattern (clear the old stuff out first). However, that's not something we rely on at this layer of the stack, so whatever is fastest might be the right answer.
Maybe the first question we have is whether we need (insertion-)ordered iteration over this collection?
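For anyone following along, the ordering guarantees in question differ per container; a small illustration (not project code):

```rust
use std::collections::{BTreeMap, HashMap};

use indexmap::IndexMap;

fn main() {
    let pairs = [(3u64, "c"), (1, "a"), (2, "b")];

    // HashMap: iteration order is unspecified and may differ between runs.
    let hash: HashMap<_, _> = pairs.into_iter().collect();
    // IndexMap: iterates in insertion order (3, 1, 2).
    let index: IndexMap<_, _> = pairs.into_iter().collect();
    // BTreeMap: iterates in ascending key order (1, 2, 3).
    let btree: BTreeMap<_, _> = pairs.into_iter().collect();

    println!("{:?}", hash.keys().collect::<Vec<_>>());
    println!("{:?}", index.keys().collect::<Vec<_>>());
    println!("{:?}", btree.keys().collect::<Vec<_>>());
}
```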
@@ -43,8 +44,8 @@ pub struct WebTransportSession {
    state: SessionState,
    frame_reader: FrameReader,
    events: Box<dyn ExtendedConnectEvents>,
    send_streams: BTreeSet<StreamId>,
    recv_streams: BTreeSet<StreamId>,
    send_streams: HashSet<StreamId>,
What is the benefit of a HashSet over a BTreeSet here?
I was assuming a HashSet would be faster?
I think we should only make these changes if we have proof that they are more performant.
Connecting this to the discussion above, might ordering (i.e., a BTreeSet) be relevant here as well?
Fair point about only changing this after benchmarking.
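The same ordering question applies to the sets; a quick illustration, assuming StreamId-like u64 values rather than the actual type:

```rust
use std::collections::{BTreeSet, HashSet};

fn main() {
    let ids = [8u64, 0, 4, 12];

    // BTreeSet iterates in ascending order: 0, 4, 8, 12.
    let ordered: BTreeSet<_> = ids.into_iter().collect();
    assert_eq!(ordered.into_iter().collect::<Vec<_>>(), vec![0, 4, 8, 12]);

    // HashSet (with any hasher) iterates in an unspecified order that can
    // change between runs and between std versions.
    let unordered: HashSet<_> = ids.into_iter().collect();
    assert_eq!(unordered.len(), 4);
}
```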
@@ -75,6 +78,8 @@ pub use self::{
    version::Version,
};

pub type IndexMap<K, V> = indexmap::IndexMap<K, V, BuildHasherDefault<FxHasher>>;
Why is this more performant? Can you add a comment?
It's probably not; see above as to why.
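A comment along the requested lines could note both the ordering rationale and the hashing trade-off; a sketch, with wording that is not from the PR itself:

```rust
use std::hash::BuildHasherDefault;

use rustc_hash::FxHasher;

/// An insertion-ordered map, used where deterministic iteration order matters.
/// The Fx hasher is faster than std's SipHash but is not DoS-resistant, so
/// this alias must only be used for keys an attacker cannot choose freely.
pub type IndexMap<K, V> = indexmap::IndexMap<K, V, BuildHasherDefault<FxHasher>>;
```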
Good point. Though before we introduce the complexity of
Signed-off-by: Lars Eggert <[email protected]>
Signed-off-by: Lars Eggert <[email protected]>
Definitely the right question to be asking. I think that it might be possible to use the first connection ID as a key for this sort of thing, but we don't tend to keep that around today, once we stop using it. Everything else -- as far as I know -- is ephemeral and therefore not suitable.
I think this may be worthwhile. The cargo benches don't consistently show a benefit, but the loopback transfers on the bencher machine are faster with this PR than without it.
(I'll see if I can improve CI so that we also see the differences to main for the table results.)