Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce support for thread-local default recorder #523

Merged

Conversation

mateiidavid
Copy link
Contributor

Currently, the only way to install a thread-local recorder is to use the provided function with_local_recorder function. In asynchronous contexts, especially where single-threaded runtimes are used, using locally scoped recorders can be cumbersome since with_local_recorder does not support async execution (unless the closure itself spawns a runtime).

In order to provide an API that overcomes this limitation, for users that want to make use of locally-scoped default recorders, we make it so that the guard can be returned through a newly introduced set_local_default function that takes in a trait object. The returned guard is the same type we are using for with_local_recorder and when dropped will reset the thread local variable.

The change does not introduce any breaking changes to the existing API. It does however make minimal changes to the internals in order to uphold the safety guarantees that were previously in place. Since the guard is no longer tied to a scope that also encloses the reference, we need to have an explicit lifetime on the guard to guarantee the recorder is not dropped before the guard is. We achieve this through a PhantomData type which shouldn't result in any runtime overhead.

Closes #502


Review notes

Thanks a lot in advance for the review! :) Just to preface this, it's not often that I have to reason with unsafe code, I'm happy to make any necessary changes around the code and poke into the internals some more if needed. Sorry if any of this is hard to grok.

  • Motivation: I find that Support thread-local default recorder #502 outlines the motivation for the change. I saw that a tracing-style set_default() was a welcome addition so I thought I'd go ahead and implement the change.

  • Safety: Based on the SAFETY comments, my understanding is that we guarantee the pointer we store in thread-local storage is going to be valid since it does not outlive the reference to the Recorder trait object by virtue of being enclosed in the function's scope. Naturally, if we return a guard, we can no longer make the same assumptions. Although the compiler will not complain, we're susceptible to footguns if users move or drop their recorders.

I found a lifetime to be easiest to use here to guarantee memory safety, I'm sure there are other alternatives, but I wanted to ensure we do not make any breaking changes, and that we keep the change as minimal as possible.

Unfortunately, I had no inspiration for an automated test, but I did test it manually. In metrics/src/recorder/mod.rs, I added the following test

+    #[test]
+    fn it_compiles() {
+        let local = NoopRecorder;
+        let _guard = crate::set_local_default(&local);
+        drop(local);
+    }
+

Without the changes, it successfully compiles and runs. Based on my understanding of the code, this could potentially cause undefined behaviour at runtime since we can hold a reference to a value in memory that has essentially been invalidated.

:; cargo test recorder::tests::it_compiles -- --nocapture
running 1 test
test recorder::tests::it_compiles ... ok

With the changes, we'll get a compiler error:

:; cargo test recorder::tests::it_compiles -- --nocapture
   Compiling metrics v0.23.0 (/Users/matei.david/workspace/prophecy/metrics/metrics)
error[E0505]: cannot move out of `local` because it is borrowed
   --> metrics/src/recorder/mod.rs:328:14
    |
326 |         let local = NoopRecorder;
    |             ----- binding `local` declared here
327 |         let _guard = crate::set_local_default(&local);
    |                                               ------ borrow of `local` occurs here
328 |         drop(local);
    |              ^^^^^ move out of `local` occurs here
329 |     }
    |     - borrow might be used here, when `_guard` is dropped and runs the `Drop` code for type `LocalRecorderGuard`
  • Tests: I wanted to still have some confidence that the change will not break anything and that it will work as we expect it to. To this end, I added a simple test that exercises a recorder that increments a counter. We start three different recorders and assert the values are expected at the end (i.e. we don't update the wrong thing and we don't get any segfaults).

I also moved the existing recorder impl. (tracking drops) and the new recorder impl. into a module at the bottom of the test module. I wanted to declutter the tests. Let me know if it doesn't look right, I can always revert it. Happy to do what makes most sense for people actively maintaining this.

@tobz
Copy link
Member

tobz commented Oct 1, 2024

@mateiidavid Thanks for this contribution!

Overall, I think the approach/intent is reasonable, but reusing the existing LocalRecorderGuard machinery introduces a problem: we would be clearing out any previously-set local records. This is true for any ordering of sequential calls to set_local_default and with_local_recorder.

What needs to happen is that we need to store the current local recorder in the guard, so that when it drops, it can be reset. This is precisely what tracing does, as seen in the Drop implementation for DefaultGuard.

metrics/src/recorder/mod.rs Outdated Show resolved Hide resolved
@mateiidavid
Copy link
Contributor Author

@tobz thank you so much for the initial review.

Overall, I think the approach/intent is reasonable, but reusing the existing LocalRecorderGuard machinery introduces a problem: we would be clearing out any previously-set local records. This is true for any ordering of sequential calls to set_local_default and with_local_recorder.

I think that's a great idea. I want to preface by saying that I have only just started poking around tracing's internals so my understanding might be a little bit off in certain places.

I have just pushed a change that allows us to restore the previously set local recorder. I have also added a test in to exercise this. It would've been great to do some type assertions using Any + downcasting but that was a bit too involved and I couldn't get it to work.

There is also a bit of an issue with the current implementation, for example, suppose you have something like this:

    #[test]
    fn it_will_segfault() {
        // Create a recorder and install it as the default local.
        let root_recorder = test_recorders::SimpleCounterRecorder::new();
        let root_guard = crate::set_default_local_recorder(&root_recorder);

        // Install a new recorder.
        let next_recorder = test_recorders::SimpleCounterRecorder::new();
        let _next_guard = crate::set_default_local_recorder(&next_recorder);
    
        // Courtesy of the lifetime, we must drop the guard first.
        drop(root_guard);
        // drop() will now restore the previous recorder (in this case, `None`)
        drop(root_recorder);
  
        // drop() will now restore the previous recorder (in this case `root_recorder`)
        drop(_next_guard);
       // segfaults since root_recorder has been dropped a while back
        crate::counter!("test_counter").increment(20);
    }

I haven't really found a great way to work around this. In the tracing crate, they move the subscriber when installing it locally. This allows them to actually Arc it. I couldn't manage to reproduce the issue we are having here. I'm a little bit lost on how to proceed, here are some ideas that I had:

  1. Come up with a type where we can specify a lifetime such that a guard cannot outlive the previously set recorder. This didn't work and was a bit awkward, we also cannot have arbitrary lifetimes stored in thread_local afaict, we'd have to give the raw pointer a static lifetime.
  2. Use Arc / RefCell, but this unfortunately would break backwards compatibility since with_local_recorder takes a trait object. We could have different guard types and handle this differently, for what it's worth.

e.g.

fn set_default_local_recorder(recorder: impl Recorder) {}
  1. Provide documentation. Results in a bit of a leaky abstraction and potential footguns.

I realise some of this may sound a bit off, so please let me know if I got anything wrong. Happy to provide more snippets or links to some of the tracing code I've been looking at.

Also, as a side note, my hunch for taking a raw pointer here is that we want to preserve ownership rules for the recorder? i.e. allow people to have mutable borrows (which would be impossible if we held on to a borrow ourselves). Not sure it's really important here, but I was curious.

Comment on lines 153 to 156
// NOTE: the lifetime on the argument passed into the constructor is not
// strictly needed. It does protect against accidentally passing in a
// reference to a reference to a recorder which would result in invalid
// memory being accessed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this comment is duplicative and potentially untrue.. or at least just very confusing. At face value, we obviously need a lifetime on the guard to tie to the lifetime of the reference for which the guard is protecting.

Were you trying to say something else here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did have some trouble with phrasing this, I'll admit. What I meant is that we can theoretically remove the lifetime position from the recorder reference, e.g.

fn new(recorder: &dyn Recorder)

The compiler elides the lifetime for us. However, that would enable us to pass in a reference to the trait object when creating the guard.

pub fn set_default_local_recorder(recorder: &dyn Recorder) -> LocalRecorderGuard {
    // We take a reference to the trait obj
   // we will now get a segfault if we try to use the recorder
    LocalRecorderGuard::new(&recorder)
}

If we keep the lifetime position in the constructor, then the compiler will complain if we have an incorrect argument passed to the guard constructor. Since this is a bit subtle, I was hoping a well placed comment could help people out in the future. Communicating this succinctly is not easy though.

I can remove the comment or try to reformulate it if you'd like.

Copy link
Member

@tobz tobz Oct 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I would say remove the NOTE comment you added, and alter the existing comment to read:

// SAFETY: While we take a lifetime-less pointer to the given reference, the reference we
// derive _from_ the pointer is given the same lifetime of the reference used to construct
// the guard -- captured in the guard type itself -- and so derived references never outlive
// the source reference.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the suggestion. That reads great!

Copy link
Member

@tobz tobz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few small changes I'd like to see to polish the documentation.

metrics/src/recorder/mod.rs Show resolved Hide resolved
@tobz tobz added C-core Component: core functionality such as traits, etc. E-intermediate Effort: intermediate. T-ergonomics Type: ergonomics. labels Oct 3, 2024
@mateiidavid mateiidavid requested a review from tobz October 4, 2024 09:58
Currently, the only way to install a thread-local recorder is to use the
provided function `with_local_recorder` function. In asynchronous
contexts, especially where single-threaded runtimes are used, using
locally scoped recorders can be cumbersome since `with_local_recorder`
does not support async execution (unless the closure itself spawns a
runtime).

In order to provide an API that overcomes this limitation, for users
that want to make use of locally-scoped default recorders, we make it so
that the guard can be returned through a newly introduced
`set_local_default` function that takes in a trait object. The returned
guard is the same type we are using for `with_local_recorder` and when
dropped will reset the thread local variable.

The change does not introduce any breaking changes to the existing API.
It does however make minimal changes to the internals in order to uphold
the safety guarantees that were previously in place. Since the guard is
no longer tied to a scope that also encloses the reference, we need to
have an explicit lifetime on the guard to guarantee the recorder is not
dropped before the guard is. We achieve this through a `PhantomData`
type which shouldn't result in any runtime overhead.

Closes metrics-rs#502

Signed-off-by: Matei <[email protected]>
@mateiidavid mateiidavid force-pushed the matei/add-support-for-tlocal-recorder branch from 5d7d870 to ee54925 Compare October 9, 2024 14:13
@mateiidavid
Copy link
Contributor Author

thanks for the review! quick update: merging was blocked since all commits had to be signed. I squashed all commits and force pushed thinking it'd help; it didn't.

@tobz
Copy link
Member

tobz commented Oct 9, 2024

Heh, no worries. :D

I'll merge this once tests pass again.

@tobz tobz merged commit b0d84ba into metrics-rs:main Oct 9, 2024
13 checks passed
@tobz tobz added the S-awaiting-release Status: awaiting a release to be considered fixed/implemented. label Oct 9, 2024
@tobz
Copy link
Member

tobz commented Oct 12, 2024

Released as part of [email protected].

Thanks again for your contribution!

@tobz tobz removed the S-awaiting-release Status: awaiting a release to be considered fixed/implemented. label Oct 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
C-core Component: core functionality such as traits, etc. E-intermediate Effort: intermediate. T-ergonomics Type: ergonomics.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support thread-local default recorder
2 participants