
Improve NodeSession TCP performance #319

Merged · 1 commit into slawlor:main on Jan 13, 2025

Conversation

@kanru (Contributor) commented Jan 9, 2025

Each of NodeSession's messages is encoded with a 2-byte length prefix followed by the actual object data. Originally, when writing the message to the TcpStream, we wrote them separately; however, the TcpStream is not buffered, so the 2-byte write may be pushed to the underlying network stack as a tiny packet, and then the data is pushed as one or more packets.

The first tiny packet can trigger the pathological interaction between Nagle's algorithm and delayed acknowledgement in TCP, causing a 40ms to 200ms delay in the actual payload delivery. This can be observed as an average 40ms latency between two nodes on the same host connecting via the loopback interface.
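(For context, the usual socket-level workaround for this interaction is disabling Nagle's algorithm via TCP_NODELAY; this change instead avoids producing the tiny packet at all. A minimal tokio sketch of that alternative, illustrative only and not part of this PR:)

```rust
use tokio::net::TcpStream;

// Illustrative alternative, not what this PR does: TCP_NODELAY turns
// off Nagle's algorithm so small writes are sent immediately instead
// of waiting for an ACK of previously sent data.
async fn connect_nodelay(addr: &str) -> std::io::Result<TcpStream> {
    let stream = TcpStream::connect(addr).await?;
    stream.set_nodelay(true)?;
    Ok(stream)
}
```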

Combine the length prefix and payload data into one write_all call to avoid this latency anomaly. Since the message is already length-prefixed, also skip the length delimiter encoding in proto.
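A minimal sketch of the combined write (hypothetical names; the actual change lives in ractor_cluster/src/net/session.rs, and a 2-byte big-endian prefix is assumed here):

```rust
use tokio::io::AsyncWriteExt;
use tokio::net::TcpStream;

// Sketch: build one contiguous buffer so the kernel sees a single
// write, rather than a 2-byte packet followed by the payload.
async fn write_frame(stream: &mut TcpStream, payload: &[u8]) -> std::io::Result<()> {
    let mut frame = Vec::with_capacity(2 + payload.len());
    frame.extend_from_slice(&(payload.len() as u16).to_be_bytes());
    frame.extend_from_slice(payload);
    stream.write_all(&frame).await // one write instead of two
}
```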

Note: removing the double length prefix is a breaking change. If that's not desired, we can keep the original length prefix in proto by using encode_length_delimited; that would just add a little extra overhead.
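For reference, with prost the two encodings differ only in the extra inner (varint) length prefix; a hedged sketch, where msg is any prost-generated message type:

```rust
use prost::Message;

// Sketch: with the outer 2-byte frame prefix already present, plain
// `encode` suffices; `encode_length_delimited` would add a second
// (varint) length prefix inside the frame.
fn encode_payload<M: Message>(msg: &M) -> Vec<u8> {
    let mut buf = Vec::with_capacity(msg.encoded_len());
    msg.encode(&mut buf).expect("encoding into a Vec cannot fail");
    buf
    // Wire-compatible fallback that keeps the inner prefix:
    //   msg.encode_length_delimited(&mut buf)
}
```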

codecov bot commented Jan 9, 2025

Codecov Report

Attention: Patch coverage is 0% with 13 lines in your changes missing coverage. Please review.

Project coverage is 81.61%. Comparing base (9f8a239) to head (5c9e9d3).
Report is 1 commit behind head on main.

Files with missing lines            Patch %   Lines
ractor_cluster/src/net/session.rs   0.00%     13 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #319      +/-   ##
==========================================
+ Coverage   81.58%   81.61%   +0.03%     
==========================================
  Files          61       61              
  Lines       12304    12299       -5     
==========================================
  Hits        10038    10038              
+ Misses       2266     2261       -5     


@kanru force-pushed the refactor-improve-tcp-perf branch 2 times, most recently from d716e0e to 9f46597, on January 9, 2025 12:08
@kanru (Contributor, Author) commented Jan 9, 2025

Hmm, the pg_group integration test failed but the other integration tests passed. I think there's a race condition: now that we finish session setup much faster, the PG membership hasn't been propagated yet by the time the tests run. If I insert a sleep between session setup and the test, it passes locally.

@slawlor (Owner) commented Jan 9, 2025

I'm concerned about the need to wait for the sync to occur. It "should" occur immediately after the TCP session is authenticated by the NodeSession reference.

We should be spawning the remote actors, then synchronizing the PG groups. In "theory" it's possible we need to reverse these two statements, so that we notify the node server that the session is ready only after the rest of the post-auth steps are done. But that'd be very surprising.

Maybe let's try reversing those prior to doing this sleep thing; it really shouldn't be necessary, and it's concerning if it is.

@kanru (Contributor, Author) commented Jan 9, 2025

My understanding is that in after_authenticated we push the membership to peers. There are two async events: (1) tcp_send_control() just cast!s to the messaging actor and does not wait for a reply; (2) we also need the peer to push its membership to us, which is inherently async.

If we want to guarantee the PG groups are synced after session setup, we probably need to include it in the auth state machine.
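To illustrate the fire-and-forget distinction with plain channels (not ractor's actual API):

```rust
use tokio::sync::{mpsc, oneshot};

// Illustrative only: a cast-style send enqueues the message and
// returns immediately, so the sender learns nothing about when the
// peer has applied it.
async fn cast_style(tx: &mpsc::Sender<String>, msg: String) {
    let _ = tx.send(msg).await; // no reply channel
}

// A call-style send carries a reply channel the handler must complete,
// so the sender can await confirmation before proceeding.
async fn call_style(tx: &mpsc::Sender<(String, oneshot::Sender<()>)>, msg: String) {
    let (reply_tx, reply_rx) = oneshot::channel();
    let _ = tx.send((msg, reply_tx)).await;
    let _ = reply_rx.await; // resolves once the handler acknowledges
}
```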

@kanru (Contributor, Author) commented Jan 9, 2025

In other words, I don't think swapping the two lines will help. I tried anyway, and the test still failed.

@slawlor (Owner) commented Jan 9, 2025

> If we want to guarantee the PG groups are synced after session setup, we probably need to include it in the auth state machine.

Yes, OK, I see your point now. It might be worth adding a control message stating that the initial synchronization has completed, with both sides waiting to mark the node online until it's done. In its present form it's complicated to guarantee that a node is "fully" online without these kinds of manual polling techniques.

@kanru (Contributor, Author) commented Jan 9, 2025

> > If we want to guarantee the PG groups are synced after session setup, we probably need to include it in the auth state machine.
>
> Yes, OK, I see your point now. It might be worth adding a control message stating that the initial synchronization has completed, with both sides waiting to mark the node online until it's done. In its present form it's complicated to guarantee that a node is "fully" online without these kinds of manual polling techniques.

I thought of two options.

  1. We can change the initial PG sync from a push-based to a pull-based model, so the node session can know when all PGs and remote actors have been created.
  2. Send an end-of-after-auth message to mark the end of node session setup. This relies on the strong ordering guarantee of control messages (sketched below).

I can try to make another PR for (2), what do you think?
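Roughly what I have in mind for (2), with invented names rather than ractor_cluster's actual types:

```rust
// Hypothetical sketch of option (2): an explicit end-of-setup marker
// that relies on ordered delivery of control messages.
enum ControlMessage {
    PgSync(Vec<String>), // stand-in for the existing membership push
    SetupComplete,       // new: sent after all post-auth pushes
}

struct SessionState {
    node_online: bool,
}

impl SessionState {
    fn handle_control(&mut self, msg: ControlMessage) {
        match msg {
            // Ordering guarantees every PgSync lands before the marker.
            ControlMessage::PgSync(groups) => {
                println!("synced groups: {groups:?}");
            }
            // Only now mark the remote node "fully" online.
            ControlMessage::SetupComplete => self.node_online = true,
        }
    }
}
```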

@slawlor (Owner) commented Jan 13, 2025

Mind rebasing so we can see what this looks like on top of the new changes for the sync state? Thanks!

@kanru force-pushed the refactor-improve-tcp-perf branch from dee0567 to 5c9e9d3 on January 13, 2025 16:02
@kanru (Contributor, Author) commented Jan 13, 2025

> Mind rebasing so we can see what this looks like on top of the new changes for the sync state? Thanks!

It works :)

@slawlor (Owner) left a comment

LGTM

@slawlor merged commit 081c6b7 into slawlor:main on Jan 13, 2025
16 of 17 checks passed