Low: nfsserver: more appropriate default timeouts #607

Open
wants to merge 1 commit into base: main


davidvossel
Contributor

No description provided.

@dmuhamedagic
Contributor

dmuhamedagic commented May 5, 2015 via email

@davidvossel
Contributor Author

@dmuhamedagic In some environments that use clvmd and clustered volume groups for shared storage, we noticed that nfs can take longer than the default 40-second timeout to start for the first time. This is a common deployment, so we'd like the default timeout to "just work" for most people.

@dmuhamedagic
Contributor

dmuhamedagic commented May 6, 2015 via email

@davidvossel
Contributor Author

Default timeouts should err on the side of being too conservative. Using a 20s timeout for any nfs action is too aggressive. Monitoring nfs is low impact, so I used 30s (which I'd consider the lowest default timeout I'd feel comfortable imposing on people). Stop is a bit more involved, so I chose 60s.

My philosophy is that it is easier to tell a user "tighten up your timeout values if you want to achieve quicker failover". It is more difficult and time-consuming to field support questions that consist of "why is my resource timing out", only to realize that their specific setup needs a more conservative timeout period. I've had to deal with this more than I'd like over the last few years, so I'm beginning to think more conservative timeouts make better defaults.

@dmuhamedagic
Contributor

On Wed, May 06, 2015 at 08:55:21AM -0700, David Vossel wrote:

Default timeouts should err on the side of being too
conservative.

That makes two of us. I'm all for setting longer timeouts.

Using a 20s timeout for any nfs action is too
aggressive. monitor of nfs is low impact, so I used 30s (which
i'd consider to be the lowest default timeout i'd feel
comfortable imposing on people).

The default timeout in pacemaker is 20s. If there's nothing
special happening in the operation which justifies extending
that, then 20s should be the default. Can you please give the
reason why the default needs to be extended here?

Stop is a bit more involved,
so I chose 60s.

Again, this doesn't give us the reason.

My philosophy is that it is easier to tell a user "tighten up your timeout values if you want to achieve quicker failover".

Normally, tighter timeouts don't result in faster failover.

It is more difficult and time consuming to field support questions that consist of "why is my resource timing out" only to realize in their specific setup they need a more conservative timeout period.

Timeouts often need to be defined by the user. Note that these
defaults are actually not observed by pacemaker, but, optionally,
by various UIs.

I've had to deal with this more than I'd like over the last few years so I'm beginning to think more conservative timeouts make better defaults.

More conservative timeouts are IMO always better and I can
understand your sentiment here quite well. But we still had to
draw the line somewhere. If we need to cross that line, we should
give some justification which pertains to the nature of the RA
and the actual actions performed within that RA. Note also that
the default timeout should be the minimum advisable timeout for
that particular kind of resource (but never less than 20s).

@davidvossel
Contributor Author


On Wed, May 06, 2015 at 08:55:21AM -0700, David Vossel wrote:

Default timeouts should err on the side of being too
conservative.

That makes two of us. I'm all for setting longer timeouts.

Excellent. I still can't tell whether you're arguing for or against
this change, though.

Using a 20s timeout for any nfs action is too
aggressive. monitor of nfs is low impact, so I used 30s (which
i'd consider to be the lowest default timeout i'd feel
comfortable imposing on people).

The default timeout in pacemaker is 20s. If there's nothing
special happening in the operation which justifies extending
that, then 20s should be the default. Can you please give the
reason why the default needs to be extended here?

I disagree with pacemaker's default timeout of 20s.

Stop is a bit more involved,
so I chose 60s.

Again, this doesn't give us the reason.

Because 60s is more conservative than 40s. This is philosophical, not technical.

My philosophy is that it is easier to tell a user "tighten up your timeout
values if you want to achieve quicker failover".

Normally, tighter timeouts don't result in faster failover.

Yes, normally it is the monitor interval that determines how fast failover happens, but that's not always the case.

It is more difficult and time consuming to field support questions that
consist of "why is my resource timing out" only to realize in their
specific setup they need a more conservative timeout period.

Timeouts often need to be defined by the user. Note that these
defaults are actually not observed by pacemaker, but, optionally,
by various UIs.

right.

I've had to deal with this more than I'd like over the last few years so
I'm beginning to think more conservative timeouts make better defaults.

More conservative timeouts are IMO always better and I can
understand your sentiment here quite well. But we still had to
put the line somewhere. If we need to cross that line, we should
give some justification which pertains to the nature of the RA
and the actual actions performed within that RA. Note also that
the default timeout should be the minimum advisable timeout for
that particular kind of resource (but never less than 20s).

I'm really not interested in defending anything other than the start
timeout change. I was in the area and made the decision that I believed we
should be advertising more conservative timeout periods in the metadata
for other actions as well. honestly, if there's any push back here I
don't care enough (or feel strongly enough) about the non start default
timeout changes to discuss it further.



@dmuhamedagic
Contributor

On Mon, May 11, 2015 at 09:29:16AM -0700, David Vossel wrote:


On Wed, May 06, 2015 at 08:55:21AM -0700, David Vossel wrote:

Default timeouts should err on the side of being too
conservative.

That makes two of us. I'm all for setting longer timeouts.

excellent, I still can't tell if you're arguing for or against
this change though.

Well, both. We just need to justify the change.

Using a 20s timeout for any nfs action is too
aggressive. monitor of nfs is low impact, so I used 30s (which
i'd consider to be the lowest default timeout i'd feel
comfortable imposing on people).

The default timeout in pacemaker is 20s. If there's nothing
special happening in the operation which justifies extending
that, then 20s should be the default. Can you please give the
reason why the default needs to be extended here?

I disagree with pacemaker's default timeout of 20s.

Most of the time it's fine. So, I guess that it is fine as a
default too. But if you disagree, why not raise the issue on the ML?

Stop is a bit more involved,
so I chose 60s.

Again, this doesn't give us the reason.

because 60s is more conservative than 40s. this is philosophical, not technical.

I'd say that it is really technical. It's about knowing the RA
and what it does, and estimating how much time the commands
in a particular operation's path may take.

My philosophy is that it is easier to tell a user "tighten up your timeout
values if you want to achieve quicker failover".

Users should never set timeouts lower than the RA defaults.


I'm really not interested in defending anything other than the start
timeout change. I was in the area and made the decision that I believed we
should be advertising more conservative timeout periods in the metadata
for other actions as well. honestly, if there's any push back here I
don't care enough (or feel strongly enough) about the non start default
timeout changes to discuss it further.

The defaults should be conservative, but not more conservative
than necessary. And to stress again:

The default timeout is the _minimum_ advisable timeout for
that particular kind of resource (but never less than 20s).

Further, once we increase defaults for the existing RA, the
working configurations will suddenly produce warnings about
insufficient operation timeouts. That wouldn't make a good
impression.
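
As an illustration of the warning concern raised above, here is a minimal Python sketch of a check that compares configured operation timeouts with the values a resource agent advertises. It is not the code of crmsh, Pacemaker, or any other tool: the function names, the simplified unit parser, and the sample values are assumptions made for this example, with figures that only echo numbers mentioned in this thread.

```python
import re

# Simplified unit table (an assumption for this sketch, not a full parser).
_UNITS = {"": 1, "s": 1, "sec": 1, "m": 60, "min": 60, "h": 3600}


def parse_seconds(value):
    """Convert a timeout string such as '40s', '2min' or '90' to seconds."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", value.lower())
    if match is None or match.group(2) not in _UNITS:
        raise ValueError(f"unrecognised timeout: {value!r}")
    return int(match.group(1)) * _UNITS[match.group(2)]


def timeout_warnings(configured, advertised):
    """Return one message per action configured below the advertised value."""
    messages = []
    for action, value in sorted(configured.items()):
        if action not in advertised:
            continue  # nothing advertised for this action, nothing to compare
        if parse_seconds(value) < parse_seconds(advertised[action]):
            messages.append(
                f"{action}: timeout {value} is below the advertised "
                f"{advertised[action]}"
            )
    return messages


# Hypothetical configuration written against the older advertised values.
configured = {"stop": "40s", "monitor": "20s"}
old_defaults = {"stop": "40s", "monitor": "20s"}
new_defaults = {"stop": "60s", "monitor": "30s"}

for line in timeout_warnings(configured, new_defaults):
    print("WARNING:", line)
```

With the older values the same configuration produces no messages; raising the advertised values is what makes a previously clean configuration start to warn.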

@davidvossel
Contributor Author


The defaults should be conservative, but not more conservative
than necessary. And to stress again:

The default timeout is the minimum advisable timeout for
that particular kind of resource (but never less than 20s).

That's interesting. I didn't realize that's how we documented this in
the metadata, and I'm not sure I agree with it.

If we're advertising something as being an "okay" value to use why
would we give the absolute minimum value? The minimum value represents
the most aggressive timing we consider safe. In the case of nfsserver
there are far too many variables involved for us to advertise a
safe minimum value. What is safe for 90% of users might not be safe
for another 10% of users. If we raise the minimum value just to
account for the 10% use cases, then we're telling the other 90% of
people that they should never go below our advertised minimum value
even though in reality it would be safe.

The minimum values for some agents' actions vary so drastically between
deployments that it would be impractical for us to even attempt to recommend
a minimum.

Take the galera or redis agents for example. A galera promotion involves
syncing a galera instance with another active galera instance in the
cluster... How could I give a minimum value that makes any sense for
that? The timing period depends on network speed, how large the database
is, and potentially how loaded the donor galera instance is. The minimum
value for a small database could actually be 20 seconds... but in practice
we're seeing it can take nearly 300s in the real world. In this case, the
minimum timeout of 20s would work for probably 1% of users, the 300s timeout
would work for around 90% of users, and out of that 90% most of them could
tighten up the timeout value by entire minutes.

For galera I advertised promote timeout as 300s because I just want people
to be able to use these agents and for them to work.

Further, once we increase defaults for the existing RA, the
working configurations will suddenly produce warnings about
insufficient operation timeouts. That wouldn't make a good
impression.



@krig
Contributor

krig commented May 18, 2015

Take the galera or redis agents for example. A galera promotion involves
syncing a galera instance with another active galera instance in the
cluster... How could I give a minimum value that makes any sense for
that?

On this note, I'd argue for requiring explicit configuration of all timeouts. As a compromise, the defaults should be pessimistic rather than optimistic. (To sum up, I agree with both of you).

@dmuhamedagic
Contributor

On Fri, May 15, 2015 at 09:45:49AM -0700, David Vossel wrote:


That's interesting. I didn't realize that's how we documented this in
the metadata, and I'm not sure I agree with it.

If we're advertising something as being an "okay" value to use why
would we give the absolute minimum value?

To allow users to make better estimates for their installations.

The minimum value represents
the most aggressive timing we consider safe.

Yes. For some "typical" setup. What counts as a "typical" setup is up to
the RA author to decide. After all, they should have the
necessary expertise.

In the case of nfsserver
there are far too many variables involved for us to advertise a
safe minimum value. What is safe for 90% of users might not be safe
for another 10% of users. If we raise the minimum value just to
account for the 10% use cases, then we're telling the other 90% of
people that they should never go below our advertised minimum value
even though in reality it would be safe.

The minimum values for some agents' actions vary so drastically between
deployments that it would be impractical for us to even attempt to recommend
a minimum.

Take the galera or redis agents for example. A galera promotion involves
syncing a galera instance with another active galera instance in the
cluster... How could I give a minimum value that makes any sense for
that? The timing period depends on network speed, how large the database
is, and potentially how loaded the donor galera instance is. The minimum
value for a small database could actually be 20 seconds... but in practice
we're seeing it can take nearly 300s in the real world. In this case, the
minimum timeout of 20s would work for probably 1% of users, the 300s timeout
would work for around 90% of users, and out of that 90% most of them could
tighten up the timeout value by entire minutes.

For galera I advertised promote timeout as 300s because I just want people
to be able to use these agents and for them to work.

Yes, it is very difficult to make estimates for some agents.
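
The 1% versus 90% argument quoted above can be made concrete with a small Python sketch. It assumes a list of observed promote durations is available; the function names and the sample numbers are hypothetical, chosen only to mirror the proportions in the discussion, and are not measurements from any real cluster.

```python
def coverage(durations, timeout):
    """Fraction of observed durations that finish within `timeout` seconds."""
    if not durations:
        raise ValueError("need at least one observation")
    return sum(1 for d in durations if d <= timeout) / len(durations)


def smallest_timeout_covering(durations, target):
    """Smallest observed duration that covers at least `target` of the samples."""
    ordered = sorted(durations)
    for candidate in ordered:
        if coverage(ordered, candidate) >= target:
            return candidate
    return ordered[-1]  # only reached if target is greater than 1.0


# Hypothetical promote durations in seconds (illustrative, not measured).
sample_promote_seconds = [18, 45, 60, 90, 120, 150, 200, 240, 280, 600]

for timeout in (20, 300):
    share = coverage(sample_promote_seconds, timeout)
    print(f"{timeout}s covers {share:.0%} of the sample")
```

A "minimum advisable" value corresponds to the low end of such a distribution, while a "just works" default corresponds to a high coverage target; the two readings of the metadata differ in which end they advertise.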

@dmuhamedagic
Contributor

On Mon, May 18, 2015 at 12:26:47PM -0700, Kristoffer Grönlund wrote:

Take the galera or redis agents for example. A galera promotion involves
syncing a galera instance with another active galera instance in the
cluster... How could I give a minimum value that makes any sense for
that?

On this note, I'd argue for requiring explicit configuration of all timeouts.

I can see your point and that's certainly true for stuff such as
databases or resources which depend on the network. It is up to the
user to set those timeouts depending on their environment.
Otherwise, setting timeouts for everything would probably make
the configuration even more unreadable than it is.

Perhaps we need a special value for some defaults:
SET_THIS_ONE_YOURSELF.

As a compromise, the defaults should be pessimistic rather than optimistic. (To sum up, I agree with both of you).

:)

3 participants