
feat: Use FxHasher in places where we don't need DDoS resistance #2342

Open · wants to merge 27 commits into main
Conversation

@larseggert (Collaborator) commented Jan 10, 2025

I think this may be worthwhile. The cargo benches don't consistently show a benefit, but the loopback transfers on the bencher machine are faster. For example, without this PR:

Client Server CC    Pacing MTU  Mean ± σ     Min   Max
neqo   neqo   cubic on     1504 495.9 ± 96.2 426.6 712.7

and with it:

neqo   neqo   cubic on     1504 429.1 ± 9.6  415.4 442.6

(I'll see if I can improve CI so that we also see the differences to main for the table results.)

@larseggert larseggert marked this pull request as ready for review January 10, 2025 14:12

codecov bot commented Jan 10, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.26%. Comparing base (966ece6) to head (69a6710).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2342      +/-   ##
==========================================
- Coverage   95.26%   95.26%   -0.01%     
==========================================
  Files         115      115              
  Lines       37198    37198              
  Branches    37198    37198              
==========================================
- Hits        35436    35435       -1     
- Misses       1756     1757       +1     
  Partials        6        6              



github-actions bot commented Jan 10, 2025

Failed Interop Tests

QUIC Interop Runner, client vs. server, differences relative to db807a9.

neqo-latest as client

neqo-latest as server

All results

Succeeded Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server

Unsupported Interop Tests

QUIC Interop Runner, client vs. server

neqo-latest as client

neqo-latest as server


github-actions bot commented Jan 10, 2025

Benchmark results

Performance differences relative to 966ece6.

decode 4096 bytes, mask ff: 💚 Performance has improved.
       time:   [10.885 µs 11.009 µs 11.220 µs]
       change: [-12.410% -11.713% -10.806%] (p = 0.00 < 0.05)

Found 16 outliers among 100 measurements (16.00%)
1 (1.00%) low severe
3 (3.00%) low mild
2 (2.00%) high mild
10 (10.00%) high severe

decode 1048576 bytes, mask ff: 💔 Performance has regressed.
       time:   [3.1253 ms 3.1332 ms 3.1425 ms]
       change: [+9.5412% +10.102% +10.624%] (p = 0.00 < 0.05)

Found 7 outliers among 100 measurements (7.00%)
7 (7.00%) high severe

decode 4096 bytes, mask 7f: 💚 Performance has improved.
       time:   [17.645 µs 17.694 µs 17.750 µs]
       change: [-15.834% -15.483% -15.136%] (p = 0.00 < 0.05)

Found 18 outliers among 100 measurements (18.00%)
2 (2.00%) low severe
2 (2.00%) low mild
14 (14.00%) high severe

decode 1048576 bytes, mask 7f: 💔 Performance has regressed.
       time:   [5.4000 ms 5.4113 ms 5.4242 ms]
       change: [+18.481% +18.896% +19.331%] (p = 0.00 < 0.05)

Found 13 outliers among 100 measurements (13.00%)
13 (13.00%) high severe

decode 4096 bytes, mask 3f: 💚 Performance has improved.
       time:   [6.6033 µs 6.6303 µs 6.6616 µs]
       change: [-21.578% -21.138% -20.605%] (p = 0.00 < 0.05)

Found 5 outliers among 100 measurements (5.00%)
3 (3.00%) high mild
2 (2.00%) high severe

decode 1048576 bytes, mask 3f: 💔 Performance has regressed.
       time:   [1.7579 ms 1.7581 ms 1.7582 ms]
       change: [+9.8402% +10.318% +10.706%] (p = 0.00 < 0.05)

Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) high mild
1 (1.00%) high severe

coalesce_acked_from_zero 1+1 entries: No change in performance detected.
       time:   [91.344 ns 91.626 ns 91.922 ns]
       change: [-0.2968% +0.1447% +0.5900%] (p = 0.51 > 0.05)

Found 12 outliers among 100 measurements (12.00%)
8 (8.00%) high mild
4 (4.00%) high severe

coalesce_acked_from_zero 3+1 entries: No change in performance detected.
       time:   [109.66 ns 109.92 ns 110.21 ns]
       change: [-0.5182% -0.1354% +0.1948%] (p = 0.48 > 0.05)

Found 8 outliers among 100 measurements (8.00%)
2 (2.00%) low mild
1 (1.00%) high mild
5 (5.00%) high severe

coalesce_acked_from_zero 10+1 entries: No change in performance detected.
       time:   [109.53 ns 109.97 ns 110.50 ns]
       change: [-0.3668% +0.8457% +2.5725%] (p = 0.26 > 0.05)

Found 16 outliers among 100 measurements (16.00%)
5 (5.00%) low severe
1 (1.00%) low mild
2 (2.00%) high mild
8 (8.00%) high severe

coalesce_acked_from_zero 1000+1 entries: No change in performance detected.
       time:   [93.135 ns 93.264 ns 93.409 ns]
       change: [-0.9337% +0.1372% +1.1600%] (p = 0.82 > 0.05)

Found 11 outliers among 100 measurements (11.00%)
7 (7.00%) high mild
4 (4.00%) high severe

RxStreamOrderer::inbound_frame(): Change within noise threshold.
       time:   [111.53 ms 111.58 ms 111.64 ms]
       change: [-0.6630% -0.5984% -0.5338%] (p = 0.00 < 0.05)

Found 15 outliers among 100 measurements (15.00%)
6 (6.00%) low mild
7 (7.00%) high mild
2 (2.00%) high severe

SentPackets::take_ranges: No change in performance detected.
       time:   [5.2749 µs 5.4594 µs 5.6571 µs]
       change: [-3.4818% +0.0194% +3.3201%] (p = 0.99 > 0.05)

Found 9 outliers among 100 measurements (9.00%)
8 (8.00%) high mild
1 (1.00%) high severe

transfer/pacing-false/varying-seeds: Change within noise threshold.
       time:   [34.066 ms 34.133 ms 34.205 ms]
       change: [-0.8291% -0.5592% -0.2986%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe

transfer/pacing-true/varying-seeds: Change within noise threshold.
       time:   [34.167 ms 34.230 ms 34.300 ms]
       change: [-1.0494% -0.8154% -0.5389%] (p = 0.00 < 0.05)

Found 2 outliers among 100 measurements (2.00%)
1 (1.00%) low mild
1 (1.00%) high severe

transfer/pacing-false/same-seed: Change within noise threshold.
       time:   [33.760 ms 33.818 ms 33.878 ms]
       change: [-2.0543% -1.8058% -1.5756%] (p = 0.00 < 0.05)

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high severe

transfer/pacing-true/same-seed: No change in performance detected.
       time:   [34.253 ms 34.318 ms 34.382 ms]
       change: [-0.5293% -0.2585% -0.0049%] (p = 0.06 > 0.05)
1-conn/1-100mb-resp/mtu-1504 (aka. Download)/client: 💚 Performance has improved.
       time:   [800.11 ms 809.09 ms 818.43 ms]
       thrpt:  [122.18 MiB/s 123.60 MiB/s 124.98 MiB/s]
change:
       time:   [-4.4311% -2.8044% -1.3297%] (p = 0.00 < 0.05)
       thrpt:  [+1.3476% +2.8853% +4.6366%]
1-conn/10_000-parallel-1b-resp/mtu-1504 (aka. RPS)/client: 💚 Performance has improved.
       time:   [298.13 ms 301.62 ms 305.17 ms]
       thrpt:  [32.769 Kelem/s 33.154 Kelem/s 33.543 Kelem/s]
change:
       time:   [-7.9734% -6.5578% -5.0939%] (p = 0.00 < 0.05)
       thrpt:  [+5.3673% +7.0180% +8.6642%]

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

1-conn/1-1b-resp/mtu-1504 (aka. HPS)/client: No change in performance detected.
       time:   [25.483 ms 25.646 ms 25.816 ms]
       thrpt:  [38.735  elem/s 38.992  elem/s 39.241  elem/s]
change:
       time:   [-0.6933% +0.1979% +1.1117%] (p = 0.68 > 0.05)
       thrpt:  [-1.0995% -0.1975% +0.6981%]

Found 1 outliers among 100 measurements (1.00%)
1 (1.00%) high mild

1-conn/1-100mb-resp/mtu-1504 (aka. Upload)/client: 💚 Performance has improved.
       time:   [1.7306 s 1.7472 s 1.7637 s]
       thrpt:  [56.699 MiB/s 57.235 MiB/s 57.782 MiB/s]
change:
       time:   [-7.6104% -6.2437% -4.9015%] (p = 0.00 < 0.05)
       thrpt:  [+5.1541% +6.6595% +8.2372%]

Client/server transfer results

Performance differences relative to 966ece6.

Transfer of 33554432 bytes over loopback, 30 runs. All unit-less numbers are in milliseconds.

Client Server CC Pacing Mean ± σ Min Max Δ main (ms) Δ main (%)
neqo neqo reno on 542.1 ± 77.4 444.9 807.4 -2.5 -0.1%
neqo neqo reno 559.1 ± 122.1 459.0 1102.2 25.7 1.2%
neqo neqo cubic on 544.4 ± 38.1 460.0 648.9 3.3 0.2%
neqo neqo cubic 538.0 ± 43.7 471.9 688.1 1.4 0.1%
google neqo reno on 887.7 ± 104.7 649.8 1105.7 10.7 0.3%
google neqo reno 886.3 ± 97.0 631.3 1083.4 5.7 0.2%
google neqo cubic on 880.6 ± 97.7 639.2 977.7 2.4 0.1%
google neqo cubic 891.4 ± 108.2 644.6 1169.6 16.5 0.5%
google google 540.2 ± 22.6 520.2 635.7 -8.0 -0.4%
neqo msquic reno on 232.9 ± 45.4 204.1 443.8 7.0 0.8%
neqo msquic reno 231.8 ± 42.8 202.2 390.4 10.9 1.2%
neqo msquic cubic on 229.1 ± 37.8 201.6 383.7 0.5 0.1%
neqo msquic cubic 231.4 ± 43.5 199.8 372.1 10.9 1.2%
msquic msquic 122.5 ± 38.1 97.5 299.9 5.2 1.1%

⬇️ Download logs

@larseggert larseggert marked this pull request as draft February 4, 2025 15:10
@larseggert larseggert marked this pull request as ready for review February 4, 2025 16:08
@mxinden (Collaborator) left a comment

👍 in general.

That said, I would prefer only replacing std::collections::Hash* wherever it proves beneficial, e.g. not in unit tests.

Cargo.toml Outdated Show resolved Hide resolved
neqo-transport/src/connection/tests/stream.rs Outdated Show resolved Hide resolved
neqo-transport/src/crypto.rs Outdated Show resolved Hide resolved
@martinthomson (Member) left a comment

I've spotted two places where EnumMap could give us bigger wins.

I think that your performance gains largely derive from the changes to the client and server code. There, the security risk is limited (we're not using this server in real deployments).

Still, you should review the changes for security risk. This hasher could expose us to DoS if the hashed values are controlled by an adversary. I've checked the usage in our server code, which is fine because attackers don't get to control memory allocations (we use pointer values for the hash). Still, that makes me wonder whether we should be using Pin.

neqo-transport/src/crypto.rs Outdated Show resolved Hide resolved
neqo-transport/src/tparams.rs Outdated Show resolved Hide resolved
@larseggert (Collaborator, Author) commented Feb 6, 2025

@martinthomson thanks for the analysis. My plan is to add some benches first in another PR. I'll also add some for the instances where you suggest looking into EnumMap.

Even if some of the macro benefits come from speeding up the demo client and server code, it's IMO still worth doing, since eliminating those overheads makes it easier to spot other bottlenecks.

About security: I didn't do much of an analysis, but I think the main use of this insecure hasher is when looking up items (streams, unacked chunks) that, while under the control of an attacker, are also quite limited in what valid values they can take without immediately causing a connection closure.

@martinthomson (Member):

I definitely agree with the point about removing the overheads from our toy code as much as possible. This seems like a pretty substantial win there, so it's worth doing. I doubt that my EnumMap suggestions will have a major impact, but the change did highlight the possibility (and it's not that much typing to switch over).

@larseggert (Collaborator, Author):

I think the EnumMap work should be factored out into another PR; it will cause a bunch of changes throughout.

@larseggert larseggert changed the title feat: Try FxHasher to see if it makes a difference feat: Use FxHasher in places where we don't need DDoS resistance Feb 10, 2025
neqo-http3/src/connection.rs Show resolved Hide resolved
pub send_streams: HashMap<StreamId, Box<dyn SendStream>>,
pub recv_streams: HashMap<StreamId, Box<dyn RecvStream>>,
streams_with_pending_data: HashSet<StreamId>,
pub send_streams: IndexMap<StreamId, Box<dyn SendStream>>,
Collaborator:

Conceptually, why is an IndexMap faster than a HashMap here? Both end up hashing the key, right?

Collaborator Author:

It's not faster, it's possibly even slower. I made the change based on this comment: https://github.com/mozilla/neqo/pull/2342/files/bd061693b40e91b846c4d4cd1bc0ecfcd27c4e45#r1945509653

Collaborator:

Do I understand correctly that we want to be able to iterate send_streams and the like in order? Given that HashMap does not guarantee order, you are suggesting to use IndexMap like we do in neqo-transport.

If so, the requirement for ordering might be worth documenting. It wasn't obvious to me.

In that case, my preference would be BTreeMap, solely because it is in std; only use IndexMap if it proves meaningfully more performant than BTreeMap.

Given that IndexMap is a fairly small dependency, the above is not a strong opinion.

Member:

So I'm not completely sure that we need consistently ordered iteration here. We rely on that in the transport crate because we want frame sending to follow a predictable pattern (clear the old stuff out first). However, that's not something we rely on at this layer of the stack, so whatever is fastest might be the right answer.

Maybe the first question we have is whether we need (insertion-)ordered iteration over this collection?

@@ -43,8 +44,8 @@ pub struct WebTransportSession {
state: SessionState,
frame_reader: FrameReader,
events: Box<dyn ExtendedConnectEvents>,
send_streams: BTreeSet<StreamId>,
recv_streams: BTreeSet<StreamId>,
send_streams: HashSet<StreamId>,
Collaborator:

What is the benefit of a HashSet over a BTreeSet here?

Collaborator Author:

I was assuming HashSet to be faster?

Collaborator:

I think we should only make these changes if we have proof that they are more performant.

Connecting this to the discussion above, might ordering (i.e. thus BTreeSet) be relevant here as well?

Collaborator Author:

Fair point about only changing this after benchmarking.

@@ -75,6 +78,8 @@ pub use self::{
version::Version,
};

pub type IndexMap<K, V> = indexmap::IndexMap<K, V, BuildHasherDefault<FxHasher>>;
Collaborator:

Why is this more performant? Can you add a comment?

Collaborator Author:

It's probably not; see above as to why.

@mxinden (Collaborator) commented Feb 12, 2025

I've checked the usage in our server code, which is fine because attackers don't get to control memory allocations (we use pointer values for the hash). Still, that makes me wonder whether we should be using Pin.

Good point. Though before we introduce the complexity of Pin, we might find a simple way around hashing the pointer values in the first place.

@martinthomson (Member):

Though before we introduce the complexity of Pin, we might find a simple way around hashing the pointer values in the first place.

Definitely the right question to be asking. I think that it might be possible to use the first connection ID as a key for this sort of thing, but we don't tend to keep that around today, once we stop using it. Everything else -- as far as I know -- is ephemeral and therefore not suitable.

3 participants