Add ChannelSend trait to prep for lossy sender #4797

Open

wants to merge 2 commits into master

Conversation

@cpubot commented Feb 4, 2025

Problem

The gossip service consistently sees bursts of millions of packets over the 2-second stat submission window.

[screenshot: gossip packet-count spike metrics (2025-02-04-142734_hyprshot)]

The service currently uses crossbeam_channel::unbounded channels to buffer packets, which, as the name implies, allocate space for incoming packets without bound. The channel receivers ultimately drop packets beyond the MAX_GOSSIP_TRAFFIC limit.

Summary of Changes

This PR introduces a trait, ChannelSend, which abstracts over the send-side implementation of a channel. This lays the foundation for a follow-up PR that will introduce a notion of a "lossy" channel sender. That implementation will wrap a crossbeam_channel::bounded channel with a sender that drops messages when channel capacity is reached rather than continuing to buffer messages until the receiver side is drained.

This is introduced behind a trait bound due to the genericity of the streamer crate and the multi-purpose use of its receiver implementation. In particular, this will allow gossip to use the aforementioned bounded / lossy sender without requiring other components such as repair and the fetch stage to use this forthcoming implementation.
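For illustration only, here is a minimal sketch of what such a trait might look like; the method signature and the blanket impl are assumptions, not the PR's actual API:

```rust
use crossbeam_channel::{SendError, Sender};

/// Hypothetical sketch: abstracts over the send side of a channel so call
/// sites in the streamer crate can stay generic over the concrete sender.
pub trait ChannelSend<T>: Send + Sync {
    /// Hands `item` to the channel. Implementations decide whether to
    /// block, grow, or drop when the channel is saturated.
    fn send(&self, item: T) -> Result<(), SendError<T>>;
}

/// The existing unbounded sender would satisfy the trait trivially.
impl<T: Send> ChannelSend<T> for Sender<T> {
    fn send(&self, item: T) -> Result<(), SendError<T>> {
        Sender::send(self, item)
    }
}
```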

@cpubot requested a review from alessandrod on February 4, 2025 23:29
@cpubot requested a review from behzadnouri on February 5, 2025 01:17
@behzadnouri

The topic of bounded channels has been brought up multiple times before.

We prefer not to use bounded channels because they do not distinguish between good and bad packets, and drop (or back-pressure) packets indiscriminately.

Gossip (or any other protocol) packets are not equal in terms of importance for the protocol or how many resources they need to process. Bounded channels are oblivious to that.

If we decide to drop packets, we should also take into account the stake associated with the sender, the receiver, or the CrdsValues. Again, something bounded channels ignore.

Dropping packets indiscriminately also makes it easier for a spammer to censor good packets.

The preferred alternative is doing load shedding at a stage in the pipeline where we have better context on how to prioritize packets and which ones to drop if overwhelmed.

This is all assuming the right thing to do here is to drop some packets. But I do not have enough context on what problem you are trying to solve, and I can't rule out that the culprit or the remedy is something else.

@cpubot (author) commented Feb 5, 2025

> The topic of bounded channels has been brought up multiple times before.
>
> We prefer not to use bounded channels because they do not distinguish between good and bad packets, and drop (or back-pressure) packets indiscriminately.
>
> Gossip (or any other protocol) packets are not equal in terms of importance for the protocol or how many resources they need to process. Bounded channels are oblivious to that.
>
> If we decide to drop packets, we should also take into account the stake associated with the sender, the receiver, or the CrdsValues. Again, something bounded channels ignore.
>
> Dropping packets indiscriminately also makes it easier for a spammer to censor good packets.
>
> The preferred alternative is doing load shedding at a stage in the pipeline where we have better context on how to prioritize packets and which ones to drop if overwhelmed.
>
> This is all assuming the right thing to do here is to drop some packets. But I do not have enough context on what problem you are trying to solve, and I can't rule out that the culprit or the remedy is something else.

Thanks for raising these concerns. Allow me to clarify the implicated code paths for more context.

The gossip service creates three unbounded channels upon construction (GossipService::new). Dropping is currently performed on the receiving end of the first two:

1. First receiver:

```rust
fn count_dropped_packets(packets: &PacketBatch, dropped_packets_counts: &mut [u64; 7]) {
    for packet in packets {
        let k = packet
            .data(..4)
            .and_then(|data| <[u8; 4]>::try_from(data).ok())
            .map(u32::from_le_bytes)
            .filter(|&k| k < 6)
            .unwrap_or(/*invalid:*/ 6) as usize;
        dropped_packets_counts[k] += 1;
    }
}

let mut dropped_packets_counts = [0u64; 7];
let mut num_packets = 0;
let mut packets = VecDeque::with_capacity(2);
for packet_batch in receiver
    .recv_timeout(RECV_TIMEOUT)
    .map(std::iter::once)?
    .chain(receiver.try_iter())
{
    num_packets += packet_batch.len();
    packets.push_back(packet_batch);
    while num_packets > MAX_GOSSIP_TRAFFIC {
        // Discard older packets in favor of more recent ones.
        let Some(packet_batch) = packets.pop_front() else {
            break;
        };
        num_packets -= packet_batch.len();
        count_dropped_packets(&packet_batch, &mut dropped_packets_counts);
    }
}
let num_packets_dropped = self.stats.record_dropped_packets(&dropped_packets_counts);
self.stats
    .packets_received_count
    .add_relaxed(num_packets as u64 + num_packets_dropped);
```

2. Second receiver:

```rust
let excess_count = packets.len().saturating_sub(MAX_GOSSIP_TRAFFIC);
if excess_count > 0 {
    packets.drain(0..excess_count);
    self.stats
        .gossip_packets_dropped_count
        .add_relaxed(excess_count as u64);
}
```

As far as I can tell, there is no drop prioritization of packets in the first two channels other than newer packets being preferred.

The third channel is only partially bounded (details below).

The third channel is implicitly, partially bounded: its sender is partially populated via the receiver of the second channel, which drops packets according to the scheme above (response_sender is pushed via listen_receiver). That said, the sender of the third channel remains partially unbounded via ClusterInfo::gossip.

In summary, I believe we should be able to introduce bounded channels with "lossy" senders in the first two channels without losing any existing behavior. Let me know if there is an existing prioritization mechanism here that I am missing.

@behzadnouri

> As far as I can tell, there is no drop prioritization of packets in the first two channels other than newer packets being preferred.

I did not claim the current implementation cannot be improved.
My point was that bounded channels will not be very useful here.

We might need to do something better than the current naive load shedding and maybe prioritize packets by some metrics first (e.g. type of packet, associated stakes, who the sender or receiver is, etc.). But a bounded channel which indiscriminately drops (or back-pressures) packets does not help with that.
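As a purely hypothetical illustration of that kind of context-aware shedding (the packet type and the stake lookup here are invented for the example and are not an existing agave API):

```rust
use std::{cmp::Reverse, collections::HashMap};

/// Hypothetical stand-in for a gossip packet tagged with its sender's pubkey.
struct GossipPacket {
    sender: [u8; 32],
}

/// Keep at most `budget` packets, preferring those from higher-staked
/// senders, rather than dropping whichever packets happen to arrive last.
fn shed_by_stake(
    mut packets: Vec<GossipPacket>,
    stakes: &HashMap<[u8; 32], u64>,
    budget: usize,
) -> Vec<GossipPacket> {
    if packets.len() > budget {
        // Sort descending by sender stake; unknown senders count as zero.
        packets.sort_by_key(|p| Reverse(stakes.get(&p.sender).copied().unwrap_or(0)));
        packets.truncate(budget);
    }
    packets
}
```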

That being said, before changing load shedding, we need to first understand what the actual problem is here.
Otherwise we are just obscuring the symptom without addressing the root cause.

  • Have these spikes been investigated? Do we understand what the root cause is?
  • Is there a bug somewhere causing these spikes?
  • Does dropping the packets address the bug or the main culprit?
  • Do we know what type of packets spike? Are they just redundant/spam or legit packets?
  • If they are legit packets, is it fine to just drop them? Wouldn't that cause even more traffic amplification because crds tables are more out-of-sync and there will be more pull responses?

@cpubot (author) commented Feb 6, 2025

Today, there are spikes in traffic. All incoming traffic is put into an unbounded channel.

The receivers on the channel throw away the oldest information beyond a threshold.

I created this PR to allow agave to easily switch out the channel implementation. The follow-on PR will bound the worst-case behavior while providing the exact same load shedding that exists in agave today.

If someone (such as the networking team) would like to explore alternate load shedding implementations and root cause the spikes, this seems like a valuable project whose priority would need to be considered relative to other projects.

The result of this PR (and the following one) would allow an unbounded channel to be replaced by a bounded channel with the same behavior we currently have, except that we eliminate the worst-case behavior of wastefully growing an unbounded channel to hold data that will be thrown away past a threshold at receive time.
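For context, a rough sketch of what such a lossy sender could look like; the type and counter names are assumptions about the follow-up, and only crossbeam's bounded / try_send semantics are taken as given:

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

use crossbeam_channel::{bounded, Receiver, Sender, TrySendError};

/// Hypothetical send handle that drops messages instead of blocking or
/// growing once the bounded channel is at capacity.
pub struct LossySender<T> {
    sender: Sender<T>,
    dropped: Arc<AtomicU64>,
}

impl<T> LossySender<T> {
    pub fn new(capacity: usize) -> (Self, Receiver<T>) {
        let (sender, receiver) = bounded(capacity);
        let dropped = Arc::new(AtomicU64::new(0));
        (Self { sender, dropped }, receiver)
    }

    /// Never blocks: on a full channel the message is discarded and a
    /// sending-side drop counter is incremented so shed traffic still
    /// shows up in metrics.
    pub fn send(&self, item: T) {
        match self.sender.try_send(item) {
            Ok(()) => (),
            Err(TrySendError::Full(_)) => {
                self.dropped.fetch_add(1, Ordering::Relaxed);
            }
            Err(TrySendError::Disconnected(_)) => (),
        }
    }

    pub fn dropped(&self) -> u64 {
        self.dropped.load(Ordering::Relaxed)
    }
}
```

(Note that this sketch sheds the newest packets on overflow, whereas the receiver-side logic quoted above evicts the oldest; a faithful replacement might choose a different eviction policy.)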

If subsequent changes are made to include more sophisticated load shedding, then the fact that the implementation is behind a trait bound will likely allow it to plug into the existing code paths. Or, if the more sophisticated version is different enough that the trait bound isn't sufficient, we can revert / rework the relevant code paths at that time -- this is work that would have to be done in either case. This code change and the follow-on PR are minimal enough that very little code needs to change in either case.

@behzadnouri

I appreciate you contributing here and we definitely can benefit from more people helping with debugging and maintaining gossip or scrutinizing the design and implementation. I also fully acknowledge that these packet spikes are something which needs to be investigated and addressed. So please do not consider my comments as dismissive of your work or intentions.

However, my point is, before implementing any solutions, we need to take some prerequisite steps:

  1. First we need to understand the nature and root cause of these spikes.
    • Are these legit traffic or spam? Is there an implementation or design bug? Is there any correlation with the node's stake or the node's leader slots? What is the distribution of packet types within these spikes? How many of these packets are duplicates? Etc.
  2. Then we need to figure out what best addresses the root cause.
    • Is there a bug that needs to be patched? Does it require a design change? Should we prioritize certain packets? Should we do more load shedding or a different kind of load shedding? Etc.
    • In particular, if the issue is an implementation or design bug, then (any kind of) load shedding will only obscure the symptoms and not really address the root cause. Not only that, it will make it even harder to investigate the root cause.

So, again, I would appreciate it if we do not jump to any solution before we better understand the dynamics here. If you are interested in helping with that, I would suggest chatting with @gregcusack first. I am sure he has some good insights on how best to approach debugging and investigating gossip traffic. Thank you.

@t-nelson left a comment

i think the proposed changes have merit. let's reframe the problem and clarify priorities

the problem being addressed here is not in fact the traffic spikes, it is the memory pressure being imposed on the rest of the system when these spikes occur

we no doubt need to be addressing the root cause of these events, but i'll contend that that is not the mandate of the perf team. the network team has the expertise to resolve that problem most effectively

the perf team on the other hand is tasked with identifying and mitigating bottlenecks with minimal changes to logic and i think that's what they're doing here. replacing one lossy channel implementation with another, while retaining similar loss characteristics, is about as close to zero logical changes as we can expect here. however, we trade the former's memory footprint elasticity for the latter's static allocation, isolating these gossip events from the rest of the validator. that is a win in my book. any decoupling, especially when it is implicit, is a win in my book

introducing a trait as an intermediary seems like a fine idea. sure there's only one implementation today, so it's a bit superfluous, but it does make the change set dead simple to review. i expect the followup swapping out the implementation will be just as reviewable. if the network team determines that a smarter channel implementation is needed in the future, there's little to no harm in the trait being there and it might even simplify prototyping

i'm r+ pending network team approval here. it is ultimately their code to maintain. i do believe that this is the right change to make today for the sake of the rest of agave

--

one criticism i will give is that we are not a merge-update house. @cpubot, please rebase that ugly commit away 😉

@behzadnouri

> it is the memory pressure being imposed on the rest of the system when these spikes occur

This PR was done with the presumption that the channel is growing unboundedly. I have explained on a call that neither the metrics nor the profiles "necessarily" indicate that the channel is growing too large.

I have also tested this presumption on a "staked" testnet node, which has shown that right before we start reading from the channel:
https://github.com/anza-xyz/agave/blob/da466b38f/gossip/src/cluster_info.rs#L2237
the vast majority of the time the channel is "empty", meaning that we have to "wait" until something is pushed into the channel, and only very rarely are there more than 10 items already in the channel.

Also, once we start reading from the channel we completely drain it and leave it "empty" again. Meaning that the vast majority of the time we start with an empty channel, we have to wait until something is pushed to the channel, and we "always" leave the channel in an "empty" state.

The crossbeam implementation of the unbounded channel also does not hold on to allocations:
https://github.com/crossbeam-rs/crossbeam/blob/f92b4a71e/crossbeam-channel/src/flavors/list.rs
so if there is a sudden spike of packets (which we pretty much prefer to be able to ingest fully) causing the channel to grow, that memory is immediately released as soon as run_socket_consume runs, which drains and empties the channel.

There are all sorts of other issues, some of them pointed out in the earlier comments, but let's leave it at this for the moment: we have not yet seen the evidence that the channel is growing unboundedly or causing any memory issues. Quite the opposite: most of the time the channel is empty, and it hardly ever contains more than 10 entries. So the starting presumption here is uncorroborated by the evidence.
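For reference, a minimal sketch of how such an occupancy check can be taken with crossbeam (Receiver::len is the real API; the logging and function name are just illustrative):

```rust
use std::time::Duration;

use crossbeam_channel::{Receiver, RecvTimeoutError};

/// Log how many items are already queued right before a blocking read.
fn recv_with_backlog_sample<T>(receiver: &Receiver<T>) -> Result<T, RecvTimeoutError> {
    // Number of messages sitting in the channel at this instant.
    let backlog = receiver.len();
    eprintln!("channel backlog before recv: {backlog}");
    receiver.recv_timeout(Duration::from_secs(1))
}
```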

@gregcusack

Agree somewhat with Behzad; we need to figure out what is causing the memory pressure:

  1. Allocation/Deallocation overhead: Is the memory pressure due to constant/frequent allocations and deallocations from the channel rapidly filling and draining? If this is the case, then a bounded channel is not going to fix this problem.
  2. Excessive Packets: If the memory pressure in the channel is from too many packets filling up the channel, then we need to find out whether the additional allocations/deallocations beyond MAX_GOSSIP_TRAFFIC are causing significant memory pressure.

I think @cpubot is already working on running a mainnet node and checking to see if the unbounded channel specifically is actually causing significant memory pressure. As discussed today, let's wait to get this info before making a decision.

Side note/thinking out loud here. There seems to be a difference in how we're each defining "growing unboundedly", based on the timescale over which we are observing the channel:

  1. cpubot: The channel can grow beyond MAX_GOSSIP_TRAFFIC between reads, leading to unbounded growth in the short term (since we see dropped packets).
  2. Behzad: On a longer timescale (across multiple reads), because the channel is emptied after each read, the channel does not grow unboundedly.

@t-nelson

> so if there is a sudden spike of packets (which we pretty much prefer to be able to ingest fully) causing the channel to grow, that memory is immediately released as soon as run_socket_consume runs, which drains and empties the channel.

this is the problem though. bursty allocation patterns like this are degenerate with jemalloc. that's how removing the allocation elasticity with a bounded channel intends to improve the situation. that is, there is really no problem with gossip service in isolation during these events, but gossip is coupled to the rest of the validator via the allocator, which is being thrown into a degenerate state


> we have not yet seen the evidence that the channel is growing unboundedly or causing any memory issues

if we don't trust the profiler, what other evidence can we provide? our metrics aren't going to capture issues that manifest in the allocator


so long as drop counters are moved to the sending side, i'm failing to see how the proposed changes are going to have a material impact on the gossip service's functional behavior. we'll still see the spikes because the metrics are still there to count drops. so networking team can investigate their root cause without issue and perf team can move on to the next bottleneck.

can someone enumerate and elaborate on the risks this change introduces? apologies if this was already done elsewhere and not documented here

@behzadnouri

@t-nelson Please run a validator and inspect the channel size for yourself.

A patch introducing a bad design would only make things worse and harder to maintain.
In the worst case you are entirely breaking gossip by dropping packets the node was supposed to ingest, which can severely degrade message propagation or exacerbate duplicate packets.

I would also appreciate it if you could actually spend some time debugging any of the real gossip issues. At the very least that would provide you with some more insight and understanding of how the pipeline works, and perhaps some of my points above would make more sense to you then.

For the record, below is the distribution of the channel size right before we start reading from it. This is based on a random sample of 6337 logs over ~18 hours from a staked node on testnet. A channel which is mostly empty is what we are wasting so much time on instead of the actual problems.

channel size       number of observations
           0       5798  
           1        409  
           2         80  
           3         18  
           4         13  
           5          2  
           6          3  
           7          5  
           8          2  
           9          2  
           11         2  
           19         1  
           26         1  
           30         1  
