Add ChannelSend trait to prep for lossy sender #4797

Open

wants to merge 2 commits into master

Conversation

@cpubot commented Feb 4, 2025

Problem

The gossip service consistently sees bursts of millions of packets over the 2-second stat submission window.

[screenshot: gossip packet-count spike metrics (2025-02-04-142734_hyprshot)]

The service currently uses crossbeam_channel::unbounded channels to buffer packets, which, as the name implies, allocate space for incoming packets without bound. The channel receivers ultimately drop packets beyond the MAX_GOSSIP_TRAFFIC limit.

Summary of Changes

This PR introduces a trait, ChannelSend, which abstracts over the send-side implementation of a channel. This lays the foundation for a follow-up PR that will introduce a notion of a "lossy" channel sender. That implementation will wrap a crossbeam_channel::bounded channel with a sender that drops messages when channel capacity is reached rather than continuing to buffer messages until the receiver side is drained.

This is introduced behind a trait bound due to the genericity of the streamer crate and the multi-purpose use of its receiver implementation. In particular, this will allow gossip to use the aforementioned bounded / lossy sender without requiring other components such as repair and the fetch stage to use this forthcoming implementation.
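For illustration only, here is a minimal sketch of what such a trait might look like; the method signature and the blanket impl are assumptions, not the PR's actual API:

```rust
use crossbeam_channel::{SendError, Sender};

/// Hypothetical sketch: abstracts over the send side of a channel so call
/// sites in the streamer crate can stay generic over the concrete sender.
pub trait ChannelSend<T>: Send + Sync {
    /// Hands `item` to the channel. Implementations decide whether to
    /// block, grow, or drop when the channel is saturated.
    fn send(&self, item: T) -> Result<(), SendError<T>>;
}

/// The existing unbounded sender would satisfy the trait trivially.
impl<T: Send> ChannelSend<T> for Sender<T> {
    fn send(&self, item: T) -> Result<(), SendError<T>> {
        Sender::send(self, item)
    }
}
```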

@cpubot requested a review from alessandrod on February 4, 2025 23:29
@cpubot requested a review from behzadnouri on February 5, 2025 01:17
@behzadnouri

The topic of bounded channels has been brought up multiple times before.

We prefer not to use bounded channels because they do not distinguish between good and bad packets, and drop (or back-pressure) packets indiscriminately.

Gossip (or any other protocol) packets are not equal in terms of importance for the protocol or how many resources they need to process. Bounded channels are oblivious to that.

If we decide to drop packets, we should also take into account the stake associated with the sender, the receiver, or the CrdsValues. Again, something bounded channels ignore.

Dropping packets indiscriminately also makes it easier for a spammer to censor good packets.

The preferred alternative is doing load shedding at a stage in the pipeline where we have better context on how to prioritize packets and which ones to drop if overwhelmed.

This is all assuming the right thing to do here is to drop some packets. But I do not have enough context on what problem you are trying to solve, and I can't rule out that the culprit or the remedy is something else.

@cpubot (author) commented Feb 5, 2025

> The topic of bounded channels has been brought up multiple times before.
>
> We prefer not to use bounded channels because they do not distinguish between good and bad packets, and drop (or back-pressure) packets indiscriminately.
>
> Gossip (or any other protocol) packets are not equal in terms of importance for the protocol or how many resources they need to process. Bounded channels are oblivious to that.
>
> If we decide to drop packets, we should also take into account the stake associated with the sender, the receiver, or the CrdsValues. Again, something bounded channels ignore.
>
> Dropping packets indiscriminately also makes it easier for a spammer to censor good packets.
>
> The preferred alternative is doing load shedding at a stage in the pipeline where we have better context on how to prioritize packets and which ones to drop if overwhelmed.
>
> This is all assuming the right thing to do here is to drop some packets. But I do not have enough context on what problem you are trying to solve, and I can't rule out that the culprit or the remedy is something else.

Thanks for raising these concerns. Allow me to clarify the implicated code paths for more context.

The gossip service creates three unbounded channels upon construction (GossipService::new). Dropping is currently performed on the receiving end of the first two:

1. First receiver:

```rust
fn count_dropped_packets(packets: &PacketBatch, dropped_packets_counts: &mut [u64; 7]) {
    for packet in packets {
        let k = packet
            .data(..4)
            .and_then(|data| <[u8; 4]>::try_from(data).ok())
            .map(u32::from_le_bytes)
            .filter(|&k| k < 6)
            .unwrap_or(/*invalid:*/ 6) as usize;
        dropped_packets_counts[k] += 1;
    }
}

let mut dropped_packets_counts = [0u64; 7];
let mut num_packets = 0;
let mut packets = VecDeque::with_capacity(2);
for packet_batch in receiver
    .recv_timeout(RECV_TIMEOUT)
    .map(std::iter::once)?
    .chain(receiver.try_iter())
{
    num_packets += packet_batch.len();
    packets.push_back(packet_batch);
    while num_packets > MAX_GOSSIP_TRAFFIC {
        // Discard older packets in favor of more recent ones.
        let Some(packet_batch) = packets.pop_front() else {
            break;
        };
        num_packets -= packet_batch.len();
        count_dropped_packets(&packet_batch, &mut dropped_packets_counts);
    }
}
let num_packets_dropped = self.stats.record_dropped_packets(&dropped_packets_counts);
self.stats
    .packets_received_count
    .add_relaxed(num_packets as u64 + num_packets_dropped);
```

2. Second receiver:

```rust
let excess_count = packets.len().saturating_sub(MAX_GOSSIP_TRAFFIC);
if excess_count > 0 {
    packets.drain(0..excess_count);
    self.stats
        .gossip_packets_dropped_count
        .add_relaxed(excess_count as u64);
}
```

As far as I can tell, there is no drop prioritization of packets in the first two channels other than newer packets being preferred.

The third channel is only partially bounded (details below).

The third channel is implicitly, partially bounded: its sender is partially populated via the receiver of the second channel, which drops packets according to the scheme above (response_sender is pushed via listen_receiver). That said, the sender of the third channel remains partially unbounded via ClusterInfo::gossip.

In summary, I believe we should be able to introduce bounded channels with "lossy" senders in the first two channels without losing any existing behavior. Let me know if there is an existing prioritization mechanism here that I am missing.

@behzadnouri

> As far as I can tell, there is no drop prioritization of packets in the first two channels other than newer packets being preferred.

I did not claim the current implementation cannot be improved.
My point was that bounded channels will not be very useful here.

We might need to do something better than the current naive load shedding and maybe prioritize packets by some metrics first (e.g. type of packet, associated stakes, who the sender or receiver is, etc.). But a bounded channel which indiscriminately drops (or back-pressures) packets does not help with that.
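As a purely hypothetical illustration of that kind of context-aware shedding (the packet type and the stake lookup here are invented for the example and are not an existing agave API):

```rust
use std::{cmp::Reverse, collections::HashMap};

/// Hypothetical stand-in for a gossip packet tagged with its sender's pubkey.
struct GossipPacket {
    sender: [u8; 32],
}

/// Keep at most `budget` packets, preferring those from higher-staked
/// senders, rather than dropping whichever packets happen to arrive last.
fn shed_by_stake(
    mut packets: Vec<GossipPacket>,
    stakes: &HashMap<[u8; 32], u64>,
    budget: usize,
) -> Vec<GossipPacket> {
    if packets.len() > budget {
        // Sort descending by sender stake; unknown senders count as zero.
        packets.sort_by_key(|p| Reverse(stakes.get(&p.sender).copied().unwrap_or(0)));
        packets.truncate(budget);
    }
    packets
}
```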

That being said, before changing load shedding, we need to first understand what the actual problem is here.
Otherwise we are just obscuring the symptom without addressing the root cause.

  • Have these spikes been investigated? Do we understand what the root cause is?
  • Is there a bug somewhere causing these spikes?
  • Does dropping the packets address the bug or the main culprit?
  • Do we know what type of packets spike? Are they just redundant/spam or legit packets?
  • If they are legit packets, is it fine to just drop them? Wouldn't that cause even more traffic amplification because crds tables are more out-of-sync and there will be more pull responses?

@cpubot (author) commented Feb 6, 2025

Today, there are spikes in traffic. All incoming traffic is put into an unbounded channel.

The receivers on the channel throw away the oldest information beyond a threshold.

I created this PR to allow agave to easily switch out the channel implementation. The follow-on PR will bound the worst-case behavior while providing the exact same load shedding that exists in agave today.

If someone (such as the networking team) would like to explore alternate load shedding implementations and root cause the spikes, this seems like a valuable project whose priority would need to be considered relative to other projects.

The result of this PR (and the following one) would allow an unbounded channel to be replaced by a bounded channel with the same behavior we currently have, except that we eliminate the worst-case behavior of wastefully growing an unbounded channel to hold data that will be thrown away past a threshold at receive time.
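For context, a rough sketch of what such a lossy sender could look like; the type and counter names are assumptions about the follow-up, and only crossbeam's bounded / try_send semantics are taken as given:

```rust
use std::sync::{
    atomic::{AtomicU64, Ordering},
    Arc,
};

use crossbeam_channel::{bounded, Receiver, Sender, TrySendError};

/// Hypothetical send handle that drops messages instead of blocking or
/// growing once the bounded channel is at capacity.
pub struct LossySender<T> {
    sender: Sender<T>,
    dropped: Arc<AtomicU64>,
}

impl<T> LossySender<T> {
    pub fn new(capacity: usize) -> (Self, Receiver<T>) {
        let (sender, receiver) = bounded(capacity);
        let dropped = Arc::new(AtomicU64::new(0));
        (Self { sender, dropped }, receiver)
    }

    /// Never blocks: on a full channel the message is discarded and a
    /// sending-side drop counter is incremented so shed traffic still
    /// shows up in metrics.
    pub fn send(&self, item: T) {
        match self.sender.try_send(item) {
            Ok(()) => (),
            Err(TrySendError::Full(_)) => {
                self.dropped.fetch_add(1, Ordering::Relaxed);
            }
            Err(TrySendError::Disconnected(_)) => (),
        }
    }

    pub fn dropped(&self) -> u64 {
        self.dropped.load(Ordering::Relaxed)
    }
}
```

(Note that this sketch sheds the newest packets on overflow, whereas the receiver-side logic quoted above evicts the oldest; a faithful replacement might choose a different eviction policy.)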

If subsequent changes are made to include more sophisticated load shedding, then the fact that the implementation is behind a trait bound will likely allow it to plug into the existing code paths. Or, if the more sophisticated version is different enough that the trait bound isn't sufficient, we can revert / rework the relevant code paths at that time -- this is work that would have to be done in either case. This code change and the follow-on PR are minimal enough that very little code needs to change in either case.

@behzadnouri

I appreciate you contributing here and we definitely can benefit from more people helping with debugging and maintaining gossip or scrutinizing the design and implementation. I also fully acknowledge that these packet spikes are something which needs to be investigated and addressed. So please do not consider my comments as dismissive of your work or intentions.

However, my point is, before implementing any solutions, we need to take some prerequisite steps:

  1. First we need to understand the nature and root cause of these spikes.
    • Are these legit traffic or spam? Is there an implementation or design bug? Is there any correlation with the node's stake or the node's leader slots? What is the distribution of packet types within these spikes? How many of these packets are duplicates? Etc.
  2. Then we need to figure out what best addresses the root cause.
    • Is there a bug that needs to be patched? Does it require a design change? Should we prioritize certain packets? Should we do more load shedding or a different kind of load shedding? Etc.
    • In particular, if the issue is an implementation or design bug, then (any kind of) load shedding will only obscure the symptoms and not really address the root cause. Not only that, it will make it even harder to investigate the root cause.

So, again, I would appreciate it if we do not jump to any solution before we better understand the dynamics here. If you are interested in helping with that, I would suggest chatting with @gregcusack first. I am sure he has some good insights on how best to approach debugging and investigating gossip traffic. Thank you.

@t-nelson left a comment

i think the proposed changes have merit. let's reframe the problem and clarify priorities

the problem being addressed here is not in fact the traffic spikes, it is the memory pressure being imposed on the rest of the system when these spikes occur

we no doubt need to be addressing the root cause of these events, but i'll contend that that is not the mandate of the perf team. the network team has the expertise to resolve that problem most effectively

the perf team on the other hand is tasked with identifying and mitigating bottlenecks with minimal changes to logic and i think that's what they're doing here. replacing one lossy channel implementation with another, while retaining similar loss characteristics, is about as close to zero logical changes as we can expect here. however, we trade the former's memory footprint elasticity for the latter's static allocation, isolating these gossip events from the rest of the validator. that is a win in my book. any decoupling, especially when it is implicit, is a win in my book

introducing a trait as an intermediary seems like a fine idea. sure there's only one implementation today, so it's a bit superfluous, but it does make the change set dead simple to review. i expect the followup swapping out the implementation will be just as reviewable. if the network team determines that a smarter channel implementation is needed in the future, there's little to no harm in the trait being there and it might even simplify prototyping

i'm r+ pending network team approval here. it is ultimately their code to maintain. i do believe that this is the right change to make today for the sake of the rest of agave

--

one criticism i will give is that we are not a merge-update house. @cpubot, please rebase that ugly commit away 😉

@behzadnouri

> it is the memory pressure being imposed on the rest of the system when these spikes occur

This PR was done with the presumption that the channel is growing unboundedly. I have explained on a call that neither the metrics nor the profiles "necessarily" indicate that the channel is growing too large.

I have also tested this presumption on a "staked" testnet node, which has shown that right before we start reading from the channel:
https://github.com/anza-xyz/agave/blob/da466b38f/gossip/src/cluster_info.rs#L2237
the vast majority of the time the channel is "empty", meaning that we have to "wait" until something is pushed into the channel, and only very rarely are there more than 10 items already in the channel.

Also, once we start reading from the channel we completely drain it and leave it "empty" again. Meaning that the vast majority of the time we start with an empty channel, we have to wait until something is pushed to the channel, and we "always" leave the channel in an "empty" state.

The crossbeam implementation of the unbounded channel also does not hold on to allocations:
https://github.com/crossbeam-rs/crossbeam/blob/f92b4a71e/crossbeam-channel/src/flavors/list.rs
so if there is a sudden spike of packets (which we pretty much prefer to be able to ingest fully) causing the channel to grow, that memory is immediately released as soon as run_socket_consume runs, which drains and empties the channel.

There are all sorts of other issues, some of them pointed out in the earlier comments, but let's leave it at this for the moment: we have not yet seen the evidence that the channel is growing unboundedly or causing any memory issues. Quite the opposite: most of the time the channel is empty, and it hardly ever contains more than 10 entries. So the starting presumption here is uncorroborated by the evidence.
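For reference, a minimal sketch of how such an occupancy check can be taken with crossbeam (Receiver::len is the real API; the logging and function name are just illustrative):

```rust
use std::time::Duration;

use crossbeam_channel::{Receiver, RecvTimeoutError};

/// Log how many items are already queued right before a blocking read.
fn recv_with_backlog_sample<T>(receiver: &Receiver<T>) -> Result<T, RecvTimeoutError> {
    // Number of messages sitting in the channel at this instant.
    let backlog = receiver.len();
    eprintln!("channel backlog before recv: {backlog}");
    receiver.recv_timeout(Duration::from_secs(1))
}
```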

@gregcusack

Agree somewhat with Behzad; we need to figure out what is causing the memory pressure:

  1. Allocation/Deallocation overhead: Is the memory pressure due to constant/frequent allocations and deallocations from the channel rapidly filling and draining? If this is the case, then a bounded channel is not going to fix this problem.
  2. Excessive Packets: If the memory pressure in the channel is from too many packets filling up the channel, then we need to find out whether the additional allocations/deallocations beyond MAX_GOSSIP_TRAFFIC are causing significant memory pressure.

I think @cpubot is already working on running a mainnet node and checking to see if the unbounded channel specifically is actually causing significant memory pressure. As discussed today, let's wait to get this info before making a decision.

Side note/thinking out loud here. There seems to be a difference in how we're each defining "growing unboundedly", based on the timescale over which we are observing the channel:

  1. cpubot: The channel can grow beyond MAX_GOSSIP_TRAFFIC between reads, leading to unbounded growth in the short term (since we see dropped packets).
  2. Behzad: On a longer timescale (across multiple reads), because the channel is emptied after each read, the channel does not grow unboundedly.

@t-nelson

> so if there is a sudden spike of packets (which we pretty much prefer to be able to ingest fully) causing the channel to grow, that memory is immediately released as soon as run_socket_consume runs, which drains and empties the channel.

this is the problem though. bursty allocation patterns like this are degenerate with jemalloc. that's how removing the allocation elasticity with a bounded channel intends to improve the situation. that is, there is really no problem with gossip service in isolation during these events, but gossip is coupled to the rest of the validator via the allocator, which is being thrown into a degenerate state


> we have not yet seen the evidence that the channel is growing unboundedly or causing any memory issues

if we don't trust the profiler, what other evidence can we provide? our metrics aren't going to capture issues that manifest in the allocator


so long as drop counters are moved to the sending side, i'm failing to see how the proposed changes are going to have a material impact on the gossip service's functional behavior. we'll still see the spikes because the metrics are still there to count drops. so networking team can investigate their root cause without issue and perf team can move on to the next bottleneck.

can someone enumerate and elaborate on the risks this change introduces? apologies if this was already done elsewhere and not documented here

@behzadnouri

@t-nelson Please run a validator and inspect the channel size for yourself.

A patch introducing a bad design would only make things worse and harder to maintain.
In the worst case you are entirely breaking gossip by dropping packets the node was supposed to ingest, which can severely degrade message propagation or exacerbate duplicate packets.

I would also appreciate it if you could actually spend some time debugging any of the real gossip issues. At the very least that would provide you with some more insight and understanding of how the pipeline works, and perhaps some of my points above would make more sense to you then.

For the record, below is the distribution of the channel size right before we start reading from it. This is based on a random sample of 6337 logs over ~18 hours from a staked node on testnet. A channel which is mostly empty is what we are wasting so much time on instead of the actual problems.

channel size       number of observations
           0       5798  
           1        409  
           2         80  
           3         18  
           4         13  
           5          2  
           6          3  
           7          5  
           8          2  
           9          2  
           11         2  
           19         1  
           26         1  
           30         1  
