feat(cosmic-swingset): Split inbound queue length metrics by queue name #10904
Conversation
Thinking more about these metrics, it's a little confusing that the add/remove values are counters but are relative to the restart of the node. I almost feel they should be reported with delta aggregation to minimize the effect of a node restart (as we do not, and probably cannot, persist these values across a restart).
Besides that, I'm still parsing through the changes.
/**
 * Return an array of Views defining explicit buckets for Histogram instruments
 * to which we record measurements.
 */
Note to self: switch to using an advice: { explicitBucketBoundaries: number[] } field when declaring the metric. Support for that was implemented in open-telemetry/opentelemetry-js#3876 and is documented at https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#instrument-advisory-parameters
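A minimal sketch of what that could look like, assuming a recent @opentelemetry/api with advisory-parameter support; the metric name and bucket boundaries here are illustrative only:

const histogram = metricMeter.createHistogram('example_processing_time', {
  description: 'Illustrative histogram declared with bucket advice',
  unit: 'ms',
  // Advisory boundaries replace the explicit-bucket View configuration.
  advice: { explicitBucketBoundaries: [1, 5, 10, 50, 100, 500, 1000] },
});
histogram.record(42);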
// But note that the Prometheus representation of the former will be a Gauge:
// https://prometheus.io/docs/concepts/metric_types/
That's surprising.
I think it's just an unfortunate vocabulary mismatch... Prometheus has monotonically increasing Counter vs. arbitrarily adjustable Gauge, while OpenTelemetry separates the latter into mutually additive up-down counter vs. non-mutually-additive Gauge.
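For illustration, a tiny sketch of the OpenTelemetry side of that mismatch (the instrument name is a placeholder): an UpDownCounter accepts signed deltas and so can only map onto a Prometheus gauge, while the monotonic Counter is what maps onto a Prometheus counter.

const inFlight = metricMeter.createUpDownCounter('example_in_flight');
inFlight.add(1);  // can go up...
inFlight.add(-1); // ...and back down, so Prometheus must expose it as a gauge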
const asyncInstrument = isMonotonic
  ? metricMeter.createObservableCounter(name, instrumentOptions)
  : metricMeter.createObservableUpDownCounter(name, instrumentOptions);
I was hoping we could get rid of the async nature of these instruments. I'm not sure why we had them async in the first place (well, except that a gauge in OTel is necessarily async, I think; but we're not using a gauge here anyway, and I'm not sure why not).
It looks like sync gauges were "recently" added too: open-telemetry/opentelemetry-js#4296
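A sketch of the difference, assuming a version of @opentelemetry/api that includes meter.createGauge; the instrument names and the queue attribute value are placeholders:

// Observable (async) style: the value is pulled via a callback at collection time.
const observable = metricMeter.createObservableUpDownCounter('example_queue_length_async');
observable.addCallback(result => result.observe(actionQueue.size(), { queue: 'inbound' }));

// Synchronous gauge style: the value is pushed wherever it changes.
const gauge = metricMeter.createGauge('example_queue_length_sync');
gauge.record(actionQueue.size(), { queue: 'inbound' });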
It's on my to-do list; this is intended as a quick win to be followed up with bigger changes.
LGTM, with some minor nits (feel free to ignore the ones you don't consider worth it)
// "cosmic_swingset_inbound_queue_{length,add,remove}" measurements carry a | ||
// "queue" attribute. |
Can you add a TODO that in future otel SDKs, expected dimension keys could be documented as an advice: { attributes: string[] } field?
if (newLength >= oldLength) {
  const map = counterData[QueueMetricAspect.IncrementCount];
  nudge(map, queueName, newLength - oldLength);
} else {
Let's make the comparison above strictly greater-than (newLength > oldLength), and this one strictly less-than:

- } else {
+ } else if (newLength < oldLength) {

Then we could verify in an else clause that the lengths are equal, or throw because we're getting unexpected values.
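A sketch of the suggested shape, using the names from the diff above; the error mechanism here is just a plain throw:

if (newLength > oldLength) {
  nudge(counterData[QueueMetricAspect.IncrementCount], queueName, newLength - oldLength);
} else if (newLength < oldLength) {
  nudge(counterData[QueueMetricAspect.DecrementCount], queueName, oldLength - newLength);
} else if (newLength !== oldLength) {
  // With strict comparisons above, only non-numeric weirdness (e.g. NaN) lands here.
  throw Error(`unexpected queue lengths: ${oldLength} -> ${newLength}`);
}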
- decStat: (delta = 1) => {
-   length -= delta;
-   remove += delta;
+ decStat: (queueName, delta = 1) => {
We probably should assert that delta is a positive integer.
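For example, a plain-JS guard at the top of decStat (and similarly incStat); the error wording is illustrative:

decStat: (queueName, delta = 1) => {
  if (!(Number.isSafeInteger(delta) && delta > 0)) {
    throw Error(`decStat delta must be a positive integer, not ${delta}`);
  }
  // ... proceed with decrementing the tracked length for queueName ...
},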
  [QueueMetricAspect.DecrementCount]: makeQueueCounts(zeroEntries),
};

const provideQueue = queueName => {
Why use this lazy provideQueue approach instead of asserting that the queue exists?
Because in the event of misconfigured reporting for an unknown queue, I'd rather accept the data than either ignore it or halt the chain. Comment added to that effect, but please let me know if you prefer one of the other options.
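Roughly, the lazy approach looks like the sketch below (assuming counterData maps each QueueMetricAspect to a plain object keyed by queue name): an unknown queueName gets zero-initialized entries on first use rather than being rejected.

const provideQueue = queueName => {
  for (const aspect of Object.values(QueueMetricAspect)) {
    // Accept data for unanticipated queue names rather than dropping it or throwing.
    counterData[aspect][queueName] ??= 0;
  }
};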
const initialQueueLengths = /** @type {Record<InboundQueueName, number>} */ ({
  queued: actionQueue.size(),
  'high-priority': highPriorityQueue.size(),
  forced: runThisBlock.size(),
});
I like passing current values instead of making metrics just to pass them back.
I wish the high-priority phase had a name that better related to the queued name.
These {'leftover' | 'forced' | 'high-priority' | 'timer' | 'queued' | 'cleanup'} phase names are more externally relevant than the internal details to which they correspond, and have been introduced only recently. If we want to change them, there's no better time than now. Do you have any ideas?
Renamed "high-priority" → "priority" and "queued" → "inbound" so that each phase is now a single word that looks good embedded in a metric name.
I like the CrankerPhase/InboundQueueName cleanup
Fixes #10900
Description
The first commit is pure refactoring. After that, kernel-stats.js is updated to document QUEUE_METRICS ({length,add,remove} triples sharing a common "cosmic_swingset_inbound_queue" prefix), and makeInboundQueueMetrics is updated to be an internal detail of that file (i.e., rather than constructing inboundQueueMetrics itself, launch-chain.js passes initial queue lengths to exportKernelStats and extracts the metrics object from the response).
This PR changes Prometheus GET /metrics output to add "queue" dimensional labels corresponding with the containing "forced"/"high-priority"/"queued" CrankerPhase ("forced"/"priority"/"inbound" as of this work), and changes slog entries to capture those same dimensions in object property name suffixes.
Security Considerations
None known.
Scaling Considerations
This does increase the size of cosmic-swingset-begin-block and cosmic-swingset-end-block-finish slog entries, but I don't think to an extent that should dissuade us from including the detail.
Documentation Considerations
This introduces implicit documentation of cosmic-swingset metrics.
Testing Considerations
Expected output was confirmed manually (see the diffs above), but automated verification of expected metric names would be better.
Upgrade Considerations
I believe that existing consumers of cosmic_swingset_inbound_queue_{length,add,remove} data should tolerate the new dimensionality: naïvely summing over all labels yields the same data as before. Verification in live networks should cover that in addition to the above slogfile/Prometheus scrape¹ checks.
Footnotes
1. As configured by environment variable OTEL_EXPORTER_PROMETHEUS_PORT.