wrapper around jemalloc to track allocator usage by thread #4336

Open. alexpyattaev wants to merge 6 commits into master from memory_metrics.

alexpyattaev

A simple wrapper around jemalloc that tracks allocator usage by thread name and reports it in metrics.
The goal is to understand why a node crashes when an OOM occurs, or at least which threads were allocating memory at the time.
This is for dev use only.

Problem

If/when agave starts leaking memory (or just clogging up some channel), it can be tricky to find which memory allocations lead to the crash. Tracking per-pool allocations is not a replacement for valgrind, but it has fairly small overhead and integrates with metrics.

Summary of Changes

Added a custom wrapper around jemalloc, gated behind a feature flag, that tracks memory usage grouped by thread pool name.
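As a rough illustration of the idea (not the code in this PR), a counting wrapper can be written against the `GlobalAlloc` trait. The sketch below makes several assumptions: it wraps any inner allocator rather than jemalloc specifically, it uses a fixed number of hypothetical pool slots, and each thread is assigned to a slot by calling the hypothetical `register_current_thread`. Allocated and freed bytes are kept as separate cumulative counters because memory is often freed by a different thread than the one that allocated it.

```rust
use std::alloc::{GlobalAlloc, Layout};
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical number of tracked pools; slot 0 collects unregistered threads.
pub const MAX_POOLS: usize = 8;

// Helper constant so the counter arrays can be built in a `static` initializer.
const ZERO: AtomicUsize = AtomicUsize::new(0);

/// Cumulative bytes allocated and freed, one counter per pool slot.
static ALLOCATED: [AtomicUsize; MAX_POOLS] = [ZERO; MAX_POOLS];
static FREED: [AtomicUsize; MAX_POOLS] = [ZERO; MAX_POOLS];

thread_local! {
    // Const-initialized and without a destructor, so reading it inside the
    // allocator does not itself allocate.
    static POOL_SLOT: Cell<usize> = const { Cell::new(0) };
}

/// Assign the calling thread to a pool slot (out-of-range slots fall back to 0).
pub fn register_current_thread(slot: usize) {
    let slot = if slot < MAX_POOLS { slot } else { 0 };
    POOL_SLOT.with(|s| s.set(slot));
}

fn current_slot() -> usize {
    // `try_with` avoids a panic if the thread-local is already torn down.
    POOL_SLOT.try_with(|s| s.get()).unwrap_or(0)
}

pub fn allocated_bytes(slot: usize) -> usize {
    ALLOCATED[slot].load(Ordering::Relaxed)
}

pub fn freed_bytes(slot: usize) -> usize {
    FREED[slot].load(Ordering::Relaxed)
}

/// Wraps any allocator and counts bytes against the calling thread's pool slot.
pub struct CountingAlloc<A>(pub A);

// `realloc` and `alloc_zeroed` use the trait defaults, which route through
// `alloc`/`dealloc` below, so they are counted as well.
unsafe impl<A: GlobalAlloc> GlobalAlloc for CountingAlloc<A> {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { self.0.alloc(layout) };
        if !ptr.is_null() {
            ALLOCATED[current_slot()].fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { self.0.dealloc(ptr, layout) };
        // Counted against the freeing thread, which may differ from the allocating one.
        FREED[current_slot()].fetch_add(layout.size(), Ordering::Relaxed);
    }
}
```

To try it process-wide, one could install it with `#[global_allocator] static GLOBAL: CountingAlloc<std::alloc::System> = CountingAlloc(std::alloc::System);` and substitute a jemalloc allocator type for `System`. A metrics reporter would then periodically read `allocated_bytes` and `freed_bytes` per slot.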

@alexpyattaev force-pushed the memory_metrics branch 3 times, most recently from 0ec36fd to 5a8f13a (January 7, 2025 22:57)
@alexpyattaev force-pushed the memory_metrics branch 3 times, most recently from b2ac2f7 to 6df862a (January 8, 2025 22:59)
@gregcusack left a comment

this seems like it's going to be a big help! just a few comments/questions. Thank you!

memory-management/src/jemalloc_monitor_metrics.rs (outdated, resolved)
memory-management/src/jemalloc_monitor.rs (resolved)
memory-management/src/jemalloc_monitor_metrics.rs (outdated, resolved)
memory-management/src/jemalloc_monitor.rs (resolved)
memory-management/src/jemalloc_monitor.rs (resolved)
@alexpyattaev marked this pull request as ready for review (January 17, 2025 17:06)
@alexpyattaev force-pushed the memory_metrics branch 2 times, most recently from 73161e0 to e1e6c16 (January 17, 2025 21:10)
@gregcusack self-requested a review (January 20, 2025 01:22)
memory-management/src/jemalloc_monitor_metrics.rs (outdated, resolved)
memory-management/src/jemalloc_monitor_metrics.rs (outdated, resolved)
"solGossipWork",
"solGossip",
"solRepair",
"FetchStage",
@gregcusack, Jan 20, 2025

Where are you finding these thread names? I can't seem to find a few of them. EDIT: not sure why it selected 4 lines. I meant to select just FetchStage.

@alexpyattaev (Author)

These come from my experiments with the thread manager. It is quite possible that I got some of them wrong or missed some. Listing all threads in agave is pretty much impossible in its current form. But this is no big deal as long as we get the main pools right. Arguably, we could cut this list down to the top 10 and it would be equally useful.

@gregcusack

ya, agree that we don't need to capture every single thread. But I do want to make sure the ones we have hard-coded here are actual thread names. When I search the codebase for thread names like FetchStage or solClusterInfo, I don't find any.

@alexpyattaev (Author)

Ok I've double-checked, now all names should be legit.
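Grouping by pool presumably requires mapping a runtime thread name onto one of the hard-coded entries. The matching logic is not shown in this conversation, so the following is only a sketch under stated assumptions: pool threads carry a suffix after the pool name (for example a hypothetical `solGossipWork03`), matching is by prefix, and the longest prefix wins so that `solGossipWork` is not swallowed by `solGossip`. The `POOL_PREFIXES` list and `pool_for` function are illustrative names, not identifiers from the PR.

```rust
/// Illustrative subset of pool name prefixes (taken from the list quoted above).
const POOL_PREFIXES: [&str; 3] = ["solGossipWork", "solGossip", "solRepair"];

/// Return the longest known prefix that `thread_name` starts with, if any.
pub fn pool_for(thread_name: &str) -> Option<&'static str> {
    POOL_PREFIXES
        .iter()
        .copied()
        .filter(|p| thread_name.starts_with(*p))
        .max_by_key(|p| p.len())
}
```

With this shape, a name that matches no prefix returns `None`, which makes a mistyped or nonexistent entry easy to spot: it simply never attracts any threads.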

@gregcusack

ok, so maybe I don't fully understand how thread names work in Rust. I know we define thread names in the codebase, like solSigVerTpuVot, which is defined here:

agave/core/src/tpu.rs

Lines 227 to 235 in 85e8f86

let vote_sigverify_stage = {
    let verifier = TransactionSigVerifier::new_reject_non_vote(tpu_vote_sender);
    SigVerifyStage::new(
        vote_packet_receiver,
        verifier,
        "solSigVerTpuVot",
        "tpu-vote-verifier",
    )
};

But in the PR, you have thread names like: solVoteSigVerTpu. But I can't find any thread names in the code base that match solVoteSigVerTpu. So, my question is, if solVoteSigVerTpu exists as a thread at runtime, how is this thread named?

@alexpyattaev force-pushed the memory_metrics branch 2 times, most recently from 4bbb420 to ed262a7 (January 28, 2025 21:23)
@alexpyattaev (Author)

No magic transform; I just got the grep script wrong, and 2 names slipped through.
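For reference, in Rust's standard library a thread's name is exactly the string passed at spawn time, and `std::thread::current().name()` returns that same string from inside the thread. A minimal sketch follows; the helper `name_seen_by_thread` is hypothetical and only demonstrates the round trip.

```rust
use std::thread;

/// Spawn a thread with the given name and return the name it observes for itself.
pub fn name_seen_by_thread(name: &str) -> Option<String> {
    thread::Builder::new()
        .name(name.to_string())
        // `name()` borrows from the thread handle, so copy it out before returning.
        .spawn(|| thread::current().name().map(str::to_owned))
        .expect("failed to spawn thread")
        .join()
        .expect("thread panicked")
}
```

Because nothing rewrites the name, a name that cannot be found by searching the codebase will not appear at runtime either. Names built with `format!` (for example, with an index appended) are the usual reason a literal search misses a real thread.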
