
Improve NodeSession TCP performance #319

Merged · 1 commit into slawlor:main on Jan 13, 2025

Conversation

@kanru (Contributor) commented Jan 9, 2025

Each of NodeSession's messages is encoded with a 2-byte length prefix followed by the actual object data. Originally, when writing the message to the TcpStream, we wrote them separately; however, the TcpStream is not buffered, so the 2-byte write may be pushed to the underlying network stack as a tiny packet, and then the data is pushed as one or more packets.

The first tiny packet can trigger the pathological interaction between Nagle's algorithm and delayed acknowledgement in TCP, causing a 40ms to 200ms delay in the actual payload delivery. This can be observed as an average 40ms latency between two nodes on the same host connecting via the loopback interface.
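(For context, the usual socket-level workaround for this interaction is disabling Nagle's algorithm via TCP_NODELAY; this change instead avoids producing the tiny packet at all. A minimal tokio sketch of that alternative, illustrative only and not part of this PR:)

```rust
use tokio::net::TcpStream;

// Illustrative alternative, not what this PR does: TCP_NODELAY turns
// off Nagle's algorithm so small writes are sent immediately instead
// of waiting for an ACK of previously sent data.
async fn connect_nodelay(addr: &str) -> std::io::Result<TcpStream> {
    let stream = TcpStream::connect(addr).await?;
    stream.set_nodelay(true)?;
    Ok(stream)
}
```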

Combine the length prefix and payload data into one write_all call to avoid this latency anomaly. Since the message is already length-prefixed, also skip the length delimiter encoding in proto.
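A minimal sketch of the combined write (hypothetical names; the actual change lives in ractor_cluster/src/net/session.rs, and a 2-byte big-endian prefix is assumed here):

```rust
use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;

// Sketch: build one contiguous buffer so the kernel sees a single
// write, rather than a 2-byte packet followed by the payload.
async fn write_frame(stream: &mut TcpStream, payload: &[u8]) -> std::io::Result<()> {
    let mut frame = Vec::with_capacity(2 + payload.len());
    frame.extend_from_slice(&(payload.len() as u16).to_be_bytes());
    frame.extend_from_slice(payload);
    stream.write_all(&frame).await // one write instead of two
}
```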

Note: removing the double length prefix is a breaking change. If that's not desired, we can keep the original length prefix in proto by using encode_length_delimited; that would just add a little extra overhead.
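For reference, with prost the two encodings differ only in the extra inner (varint) length prefix; a hedged sketch, where msg is any prost-generated message type:

```rust
use prost::Message;

// Sketch: with the outer 2-byte frame prefix already present, plain
// `encode` suffices; `encode_length_delimited` would add a second
// (varint) length prefix inside the frame.
fn encode_payload<M: Message>(msg: &M) -> Vec<u8> {
    let mut buf = Vec::with_capacity(msg.encoded_len());
    msg.encode(&mut buf).expect("encoding into a Vec cannot fail");
    buf
    // Wire-compatible fallback that keeps the inner prefix:
    //   msg.encode_length_delimited(&mut buf)
}
```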

codecov bot commented Jan 9, 2025

Codecov Report

Attention: Patch coverage is 0% with 13 lines in your changes missing coverage. Please review.

Project coverage is 81.61%. Comparing base (9f8a239) to head (5c9e9d3).
Report is 1 commit behind head on main.

Files with missing lines            Patch %   Lines
ractor_cluster/src/net/session.rs   0.00%     13 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #319      +/-   ##
==========================================
+ Coverage   81.58%   81.61%   +0.03%     
==========================================
  Files          61       61              
  Lines       12304    12299       -5     
==========================================
  Hits        10038    10038              
+ Misses       2266     2261       -5     


@kanru force-pushed the refactor-improve-tcp-perf branch 2 times, most recently from d716e0e to 9f46597, on January 9, 2025 12:08
@kanru (Contributor, Author) commented Jan 9, 2025

Hmm, the pg_group integration test failed but the other integration tests passed. I think there's a race condition: now that we finish session setup much faster, the PG membership hasn't been propagated yet by the time the tests run. If I insert a sleep between session setup and the test, it passes locally.

@slawlor (Owner) commented Jan 9, 2025

I'm concerned about the need to wait for the sync to occur. It "should" occur immediately after the TCP session is authenticated by the NodeSession reference.

We should be spawning the remote actors, then synchronizing the PG groups. In "theory" it's possible we need to reverse these two statements, so that we notify the node server that the session is ready only after the rest of the post-auth steps are done. But that'd be very surprising.

Maybe let's try reversing those prior to doing this sleep thing; it really shouldn't be necessary, and it's concerning if it is.

@kanru (Contributor, Author) commented Jan 9, 2025

My understanding is that in after_authenticated we push the membership to peers. There are two async events: (1) tcp_send_control() just cast!s to the messaging actor and does not wait for a reply; (2) we also need the peer to push its membership to us, which is inherently async.

If we want to guarantee the PG groups are synced after session setup, we probably need to include it in the auth state machine.
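To illustrate the fire-and-forget distinction with plain channels (not ractor's actual API):

```rust
use tokio::sync::{mpsc, oneshot};

// Illustrative only: a cast-style send enqueues the message and
// returns immediately, so the sender learns nothing about when the
// peer has applied it.
async fn cast_style(tx: &mpsc::Sender<String>, msg: String) {
    let _ = tx.send(msg).await; // no reply channel
}

// A call-style send carries a reply channel the handler must complete,
// so the sender can await confirmation before proceeding.
async fn call_style(tx: &mpsc::Sender<(String, oneshot::Sender<()>)>, msg: String) {
    let (reply_tx, reply_rx) = oneshot::channel();
    let _ = tx.send((msg, reply_tx)).await;
    let _ = reply_rx.await; // resolves once the handler acknowledges
}
```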

@kanru (Contributor, Author) commented Jan 9, 2025

In other words, I don't think swapping the two lines will help. I tried anyway, and the test still failed.

@slawlor (Owner) commented Jan 9, 2025

> If we want to guarantee the PG groups are synced after session setup, we probably need to include it in the auth state machine.

Yes, OK, I see your point now. It might be worth adding a control message stating that the initial synchronization has completed, with both sides waiting to mark the node online until it's done. In its present form it's complicated to guarantee that a node is "fully" online without these kinds of manual polling techniques.

@kanru (Contributor, Author) commented Jan 9, 2025

> > If we want to guarantee the PG groups are synced after session setup, we probably need to include it in the auth state machine.
>
> Yes, OK, I see your point now. It might be worth adding a control message stating that the initial synchronization has completed, with both sides waiting to mark the node online until it's done. In its present form it's complicated to guarantee that a node is "fully" online without these kinds of manual polling techniques.

I thought of two options.

  1. We can change the initial PG sync from a push-based to a pull-based model, so the node session can know when all PGs and remote actors have been created.
  2. Send an end-of-after-auth message to mark the end of node session setup. This relies on the strong ordering guarantee of control messages (sketched below).

I can try to make another PR for (2), what do you think?
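Roughly what I have in mind for (2), with invented names rather than ractor_cluster's actual types:

```rust
// Hypothetical sketch of option (2): an explicit end-of-setup marker
// that relies on ordered delivery of control messages.
enum ControlMessage {
    PgSync(Vec<String>), // stand-in for the existing membership push
    SetupComplete,       // new: sent after all post-auth pushes
}

struct SessionState {
    node_online: bool,
}

impl SessionState {
    fn handle_control(&mut self, msg: ControlMessage) {
        match msg {
            // Ordering guarantees every PgSync lands before the marker.
            ControlMessage::PgSync(groups) => {
                println!("synced groups: {groups:?}");
            }
            // Only now mark the remote node "fully" online.
            ControlMessage::SetupComplete => self.node_online = true,
        }
    }
}
```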

@slawlor (Owner) commented Jan 13, 2025

Mind rebasing so we can see what this looks like on top of the new changes for the sync state? Thanks!

@kanru force-pushed the refactor-improve-tcp-perf branch from dee0567 to 5c9e9d3 on January 13, 2025 16:02
@kanru (Contributor, Author) commented Jan 13, 2025

> Mind rebasing so we can see what this looks like on top of the new changes for the sync state? Thanks!

It works :)

@slawlor (Owner) left a comment

LGTM

@slawlor merged commit 081c6b7 into slawlor:main on Jan 13, 2025
16 of 17 checks passed