
[illumos] switch waker impl back to being pipe-based #1824

Closed

Conversation


@sunshowers sunshowers commented Aug 7, 2024

In 1133ed0, it looks like the waker implementation on illumos (perhaps inadvertently) got switched from being pipe-based to being eventfd-based. Unfortunately, the eventfd impl seems to break the waker_multiple_wakeups_different_thread test on illumos:

```
thread 'waker_multiple_wakeups_different_thread' panicked at tests/waker.rs:157:5:
assertion failed: !events.is_empty()
```

While we separately figure out what might be going wrong in the eventfd impl, it's worth switching back to the old pipe-based impl which is known to work, and which does fix this test.

See oxidecomputer/helios#169.
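
For anyone who wants to reproduce this outside of mio's test suite, the sketch below is a rough approximation of the failing scenario using mio's public API. It is not the exact code of `tests/waker.rs`; the token value, wake count, and timeout are arbitrary choices for illustration.

```rust
use std::sync::Arc;
use std::thread;
use std::time::Duration;

use mio::{Events, Poll, Token, Waker};

fn main() -> std::io::Result<()> {
    let mut poll = Poll::new()?;
    let waker = Arc::new(Waker::new(poll.registry(), Token(10))?);
    let mut events = Events::with_capacity(8);

    // Wake the poll instance several times from another thread, mirroring the
    // "multiple wakeups, different thread" scenario.
    let thread_waker = Arc::clone(&waker);
    let handle = thread::spawn(move || {
        for _ in 0..3 {
            thread_waker.wake().expect("unable to wake");
        }
    });
    handle.join().unwrap();

    // At least one wakeup event should now be pending. With the eventfd-based
    // waker on illumos, this poll reportedly returns no events, which is what
    // the failed assertion above corresponds to.
    poll.poll(&mut events, Some(Duration::from_secs(5)))?;
    assert!(!events.is_empty(), "expected a wakeup event");
    println!("got wakeup");
    Ok(())
}
```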

@sunshowers
Author

(Once this PR lands it would be great to get a release out. Thanks!)

@Thomasdezeeuw
Collaborator

I made this change on purpose since illumos supports eventfd (according to the manual). I would prefer to debug what is causing this test to fail, because the eventfd implementation is more efficient than the pipe-based one.

But before we focus on eventfd vs pipe, can you confirm the pipe implementation works by running with `RUSTFLAGS="--cfg mio_unsupported_force_waker_pipe"`?

@sunshowers
Author

sunshowers commented Aug 7, 2024

I made this change on purpose since illumos supports eventfd (according to the manual).

Thanks. Was there any validation involved as part of the change? I'm asking because the test fails quite consistently for me.

(Not intending to blame here! The lack of publicly-available illumos CI is definitely a major factor here, and at Oxide we're considering ways to address that.)

I would prefer to debug what is causing this test to fail because the eventfd implementation is more efficient than the pipe-based one.

To be clear I don't currently think the bug is in mio -- I think it's in illumos's eventfd impl. I read through waker.rs as well as the eventfd man pages on Linux and illumos, and as far as I understand mio's usage of eventfd is correct.

But before we focus on eventfd vs pipe, can you confirm the pipe implementation works by running with `RUSTFLAGS="--cfg mio_unsupported_force_waker_pipe"`?

Yes, on master:

```
% RUSTFLAGS="--cfg mio_unsupported_force_waker_pipe" cargo test --all-features --test waker -- waker_multiple_wakeups_different_thread
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.01s
     Running tests/waker.rs (target/debug/deps/waker-798eb41be27ee8c9)

running 1 test
test waker_multiple_wakeups_different_thread ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 6 filtered out; finished in 0.05s
```

Thank you!

@sunshowers
Author

I've also confirmed that switching back to the pipe impl makes the Tokio test suite pass.

In tokio-rs@1133ed0, it looks like the waker implementation on illumos (perhaps inadvertently) got switched from being pipe-based to being eventfd-based. Unfortunately, the eventfd impl seems to break the `waker_multiple_wakeups_different_thread` test on illumos:

```
thread 'waker_multiple_wakeups_different_thread' panicked at tests/waker.rs:157:5:
assertion failed: !events.is_empty()
```

While we separately figure out what might be going wrong in the eventfd impl, it's worth switching back to the old pipe-based impl, which is known to work and which does fix this test.
@Thomasdezeeuw
Collaborator

@sunshowers is there any (public) issue/ticket/etc. we can refer to and keep track of, so we know when we can go back to eventfd?

@sunshowers
Author

sunshowers commented Aug 8, 2024

Yes, we'll convert oxidecomputer/helios#169 over to tracking that, so feel free to refer to it.

@Thomasdezeeuw
Collaborator

I meant in the illumos bug tracker (if a public one exists).

@sunshowers
Author

sunshowers commented Aug 8, 2024

There is a public bug tracker, but there isn't a bug filed yet. We're planning to do some investigation later this week and file a bug once we have a better idea of what's going on. I'm happy to comment on this PR once that is done. In the meantime, the Helios issue I linked is authoritative.

But I don't think this PR should block on that bug existing. This PR merely puts things back to a known-working state, using a portable POSIX implementation.

Another thing to note: if there truly is a bug in the illumos eventfd implementation, and there is no possible workaround for it within mio, then mio will have to stay with the pipe implementation for many years, because not everyone runs an up-to-date illumos.

Hope that helps, thanks!

@Thomasdezeeuw
Collaborator

In the meantime, the Helios issue I linked is authoritative.

Except it's not. I will not implement a workaround for a suspected OS bug without linking to a bug in the OS's bug tracker.

@tokio-rs/illumos have any of you hit this bug before? Any known bugs we can link to?

@jclulow
Contributor

jclulow commented Aug 8, 2024

G'day! I'm a member of the illumos core team. I actually hit the regression here myself when upgrading a project from tokio 1.38.X to 1.39.X the other day. My colleague, @sunshowers, picked it up and has done a lot of leg work to isolate the change that introduced the regression and get fixes out there, as this basically prevents any software that uses tokio from working on illumos today.

We don't have an OS bug to point to right now because we don't currently know that there is an OS bug -- though we will obviously continue to investigate that in the background! What has become clear in the last couple of days is that the original pipe-based mechanism works reliably on illumos systems (we have years, now, of production experience using it) and the switch to using eventfd has broken basically any software that uses tokio for us. Even this simple program no longer reliably works, when built with tokio 1.39.X:

```rust
use std::time::{Duration, Instant};

#[tokio::main]
async fn main() {
    let now = Instant::now();
    // work() sleeps for 1 second, well inside the 5 second timeout, so this
    // should print Ok(()) after roughly 1000 msec.
    let res = tokio::time::timeout(Duration::from_secs(5), work()).await;
    let dur = Instant::now().checked_duration_since(now).unwrap().as_millis();
    println!("{res:?} after {dur} msec");
}

async fn work() {
    tokio::time::sleep(Duration::from_secs(1)).await;
}
```

The program either hangs forever, ostensibly due to a missed wakeup, or it runs for ~5 seconds instead of the expected 1 second. Downgrading tokio (and thus mio) makes the program work again.

We'd like to get this behavioural change, which as far as I can tell wasn't tested at all at the time, reverted promptly and a new release out there. That will allow tokio-based software to start working for illumos users once again. If we do eventually find an OS bug in our eventfd implementation, we still need to revert the change here: there is no crisp way to know at runtime if you're on a version of the OS that has that hypothetical fix or not, and depending on their risk profiles and processes, some users can take a long time to upgrade their base OS bits.

Thanks for looking at this! Please let us know how we can help to get this one over the line.

@jclulow
Contributor

jclulow commented Aug 9, 2024

G'day again! It's been about 24 hours and I just wanted to check in and see if we could get this landed today, and a new release of mio cut. It's a critical defect that prevents tokio 1.39.X working on illumos systems, and we'd really like to get that sorted out!

Please let us know if there's anything left to do before this can be integrated and released. Thanks!

@Thomasdezeeuw
Collaborator

Thomasdezeeuw commented Aug 9, 2024

@jclulow I work on Mio in my spare time; I'm not a help desk. I do not appreciate those kinds of hurry-up comments.

@Thomasdezeeuw
Collaborator

Maybe we can apply the same trick we used for the pipe-based waker to the eventfd implementation.

Looking at:

```rust
// The epoll emulation on some illumos systems currently requires
// the pipe buffer to be completely empty for an edge-triggered
// wakeup on the pipe read side.
#[cfg(target_os = "illumos")]
self.empty();
```

Can we apply that to:

```rust
pub(crate) fn wake(&self) -> io::Result<()> {
```

Can either of you try that and see if it works?
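
To make the suggestion concrete, here is a rough sketch of what that drain-before-write trick could look like. It is illustrative only and does not reflect mio's actual internals: the `EventfdWaker` struct, its `fd` field, and the `empty()` helper below are stand-ins for whatever the real eventfd waker uses.

```rust
use std::fs::File;
use std::io::{self, Read, Write};

/// Illustrative stand-in for an eventfd-backed waker; `fd` wraps the eventfd.
struct EventfdWaker {
    fd: File,
}

impl EventfdWaker {
    pub(crate) fn wake(&self) -> io::Result<()> {
        // Mirror the pipe waker's illumos workaround: drain the counter first
        // so the subsequent write reliably produces an edge-triggered wakeup.
        #[cfg(target_os = "illumos")]
        self.empty();

        let buf: [u8; 8] = 1u64.to_ne_bytes();
        match (&self.fd).write(&buf) {
            Ok(_) => Ok(()),
            // The counter hit its maximum: drain it and retry the wake.
            Err(ref err) if err.kind() == io::ErrorKind::WouldBlock => {
                self.empty();
                self.wake()
            }
            Err(err) => Err(err),
        }
    }

    /// Read (and discard) the current counter value, resetting it to zero.
    fn empty(&self) {
        let mut buf = [0; 8];
        let _ = (&self.fd).read(&mut buf);
    }
}
```

Whether draining before the write is actually sufficient on illumos is the part that would need to be verified.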

@jclulow
Contributor

jclulow commented Aug 9, 2024

@jclulow I work on Mio in my spare time; I'm not a help desk. I do not appreciate those kinds of hurry-up comments.

Totally appreciate that you're busy! That's why we've picked the lowest risk change here (reverting to the behaviour we know has worked for years in production) and done all the testing already.

We can, and will, investigate eventfd and other options in the future, obviously, including your suggested path above. To get the quickest possible resolution for a critical defect, though, with the lowest likelihood of us coming back to you with another urgent PR in a few days, we need to go back to the original behaviour first.

We're also looking at providing illumos CI resources in the future as well, so that these sorts of behavioural changes can be tested prior to integration next time. Plus, I just want to say that we're always happy to be pinged for a review on any changes that you need to make to mio which affect illumos -- we currently do that for, e.g., the libc crate, which has been a fruitful collaboration!

Can we please get this change integrated as-is today and a new release cut? If there are other tokio maintainers we should pull in here, we're happy to do that too! Just let us know.

Thanks!

@Thomasdezeeuw
Collaborator

Since neither of you wanted to write the code, I did: #1826. https://www.illumos.org/issues/16700 is a very clear indication to me that this isn't limited to eventfd, as we've also seen this with pipes. The workaround added in #1826 is copied from the pipe implementation, which also wouldn't work without it.

@sunshowers
Author

sunshowers commented Aug 9, 2024

Hi there! In the spirit of moving forward and being mindful of everyone's time, I'd like to provide an update on our progress:

We've identified this as an illumos bug. There's an open bug report with a proposed fix, which you can find here.

While a fix is in progress, it's important to note that it won't immediately resolve the problem for all systems, particularly those running older illumos versions. To ensure mio functions correctly on illumos, we'll need to maintain a workaround for the foreseeable future.

We've developed a potential workaround, which is available on the Helios issue I mentioned earlier. However, we're still in the process of thoroughly testing this solution. As you can imagine, we want to be as confident in this new approach as we are with our current implementation, which has a proven track record in production environments.

Another aspect we're exploring is the performance of eventfd on illumos. While eventfd offers performance benefits on Linux, we haven't yet confirmed similar improvements on illumos. We're committed to rigorously testing this to ensure it provides tangible benefits.

We're happy to continue validating eventfd, but this process will require some time. This includes testing and profiling the eventfd implementation on illumos. If our testing doesn't show significant performance improvements over pipes on illumos, we may need to reconsider the switch.
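
(As one possible way to measure this once the missed-wakeup issue itself is addressed, the sketch below runs a single-threaded wake/poll round-trip loop using mio's public API; building it once normally and once with `RUSTFLAGS="--cfg mio_unsupported_force_waker_pipe"` would compare the two waker backends on illumos. It is an illustrative micro-benchmark, not a rigorous measurement, and the iteration count is arbitrary.)

```rust
use std::time::Instant;

use mio::{Events, Poll, Token, Waker};

const ITERS: u32 = 100_000;

fn main() -> std::io::Result<()> {
    let mut poll = Poll::new()?;
    let waker = Waker::new(poll.registry(), Token(0))?;
    let mut events = Events::with_capacity(8);

    let start = Instant::now();
    for _ in 0..ITERS {
        // One wake followed by one poll that picks the wakeup back up.
        waker.wake()?;
        poll.poll(&mut events, None)?;
        assert!(!events.is_empty());
    }
    println!("{ITERS} wake/poll round trips in {:?}", start.elapsed());
    Ok(())
}
```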

Given these factors, we believe it would be prudent to temporarily revert to a pipe-based implementation while we continue our validation of eventfd. This approach would allow us to address the critical defect in mio without delay, especially considering:

- The defect affects all versions of mio 1.x.
- Tokio 1.39 depends on mio 1.x with no way to go back to mio 0.8.
- Third-party projects upgrading to a Tokio 1.39 minimum will encounter compatibility issues on illumos.

We believe this strategy will help minimize disruption for all parties involved. We appreciate your understanding as we work through this challenge. Thank you!

@Thomasdezeeuw
Collaborator

Closing as solved by #1826.
