Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(cosmic-swingset): Split inbound queue length metrics by queue name #10904

Merged
merged 3 commits into from
Jan 29, 2025

Conversation

gibson042
Copy link
Member

@gibson042 gibson042 commented Jan 28, 2025

Fixes #10900

Description

The first commit is pure refactoring. After that, kernel-stat.js is updated to document QUEUE_METRICS ({length,add,remove} triples sharing a common "cosmic_swingset_inbound_queue" prefix) and makeInboundQueueMetrics is updated to be an internal detail of that file (i.e., rather than constructing inboundQueueMetrics, launch-chain.js passes in initial queue lengths to exportKernelStats and extracts it from the response).

This PR changes Prometheus GET /metrics output to add "queue" dimensional labels corresponding with the containing "forced"/"high-priority"/"queued" CrankerPhase ("forced"/"priority"/"inbound" as of this work):

 # HELP cosmic_swingset_inbound_queue_length inbound queue length
 # TYPE cosmic_swingset_inbound_queue_length gauge
-cosmic_swingset_inbound_queue_length 0 1738018661949
+cosmic_swingset_inbound_queue_length{queue="forced"} 0 1738018661949
+cosmic_swingset_inbound_queue_length{queue="priority"} 0 1738018661949
+cosmic_swingset_inbound_queue_length{queue="inbound"} 0 1738018661949
 # HELP cosmic_swingset_inbound_queue_add inbound queue increments
 # TYPE cosmic_swingset_inbound_queue_add counter
-cosmic_swingset_inbound_queue_add 0 1738018661949
+cosmic_swingset_inbound_queue_add{queue="forced"} 0 1738018661949
+cosmic_swingset_inbound_queue_add{queue="priority"} 0 1738018661949
+cosmic_swingset_inbound_queue_add{queue="inbound"} 0 1738018661949
 # HELP cosmic_swingset_inbound_queue_remove inbound queue decrements
 # TYPE cosmic_swingset_inbound_queue_remove counter
-cosmic_swingset_inbound_queue_remove 0 1738018661949
+cosmic_swingset_inbound_queue_remove{queue="forced"} 0 1738018661949
+cosmic_swingset_inbound_queue_remove{queue="priority"} 0 1738018661949
+cosmic_swingset_inbound_queue_remove{queue="inbound"} 0 1738018661949

And slog entries to capture those same dimensions in object property name suffixes:

 {
   "type": "cosmic-swingset-end-block-finish",
   "blockHeight": 21,
   "blockTime": 1738022131,
   "inboundQueueStats": {
+    "cosmic_swingset_inbound_queue_length_forced": 0,
+    "cosmic_swingset_inbound_queue_length_priority": 0,
+    "cosmic_swingset_inbound_queue_length_inbound": 0,
     "cosmic_swingset_inbound_queue_length": 0,
+    "cosmic_swingset_inbound_queue_add_forced": 4,
+    "cosmic_swingset_inbound_queue_add_priority": 0,
+    "cosmic_swingset_inbound_queue_add_inbound": 0,
     "cosmic_swingset_inbound_queue_add": 4,
+    "cosmic_swingset_inbound_queue_remove_forced": 4,
+    "cosmic_swingset_inbound_queue_remove_priority": 0,
+    "cosmic_swingset_inbound_queue_remove_inbound": 0,
     "cosmic_swingset_inbound_queue_remove": 4
   },
   "time": 1738022137.072313,
   "monotime": 733.9035001497268
 }

Security Considerations

None known.

Scaling Considerations

This does increase the size of cosmic-swingset-begin-block and cosmic-swingset-end-block-finish slog entries, but I don't think to an extent that should dissuade us from including the detail.

Documentation Considerations

This introduces implicit documentation of cosmic-swingset metrics.

Testing Considerations

Expected output was confirmed manually (see the diffs above), but automated verification of expected instrument names, but automated verification of expected metric names would be better.

Upgrade Considerations

I believe that existing consumers of cosmic_swingset_inbound_queue_{length,add,remove} data should tolerate the new dimensionality and naïvely sum over all labels, giving the same data as before. Verification in live networks should cover that in addition to the above slogfile/Prometheus scrape1 checks.

Footnotes

  1. as configured by environment variable OTEL_EXPORTER_PROMETHEUS_PORT.

@gibson042 gibson042 requested a review from a team as a code owner January 28, 2025 00:17
Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking more about these metrics, it's a little confusing that the add/remove are counters, but relative to the restart of the node. I almost feel they should be reported as delta aggregation to minimize the effect of a node restart (as we do not and probably cannot persist these values across a restart).

Besides that I'm still parsing through the changes.

Comment on lines +92 to +138
/**
* Return an array of Views defining explicit buckets for Histogram instruments
* to which we record measurements.
*/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to self: switch to using an advice: { explicitBucketBoundaries: number[] } field when declaring the metric. Support for that was implemented in open-telemetry/opentelemetry-js#3876 and is documented at https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#instrument-advisory-parameters

Comment on lines +342 to +354
// But note that the Prometheus representation of the former will be a Gauge:
// https://prometheus.io/docs/concepts/metric_types/
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's surprising.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's just an unfortunate vocabulary mismatch... Prometheus has monotonically increasing Counter vs. arbitrarily adjustable Gauge, while OpenTelemetry separates the latter into mutually additive up-down counter vs. non-mutually-additive Gauge.

Comment on lines +347 to +360
const asyncInstrument = isMonotonic
? metricMeter.createObservableCounter(name, instrumentOptions)
: metricMeter.createObservableUpDownCounter(name, instrumentOptions);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was hoping we could get rid of the async nature of these instruments. I'm not sure why we had them async in the first place (well except for the fact that a gauge in otel is necessarily async I think, but we're not using a gauge here anyway, not sure why actually).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like sync gauges were "recently" added too: open-telemetry/opentelemetry-js#4296

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's on my to-do list; this is intended as a quick win to be followed up with bigger changes.

packages/cosmic-swingset/src/launch-chain.js Show resolved Hide resolved
@mhofman mhofman self-requested a review January 28, 2025 03:03
Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, with some minor nits (feel free to ignore the ones you don't consider worth it)

Comment on lines +109 to +111
// "cosmic_swingset_inbound_queue_{length,add,remove}" measurements carry a
// "queue" attribute.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a TODO that in future otel sdks, expected dimension keys could be documented as an advice: { attributes: string[] } field?

if (newLength >= oldLength) {
const map = counterData[QueueMetricAspect.IncrementCount];
nudge(map, queueName, newLength - oldLength);
} else {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make the comparison strict superior above (newLength > oldLength), and this one strictly inferior

Suggested change
} else {
} else if (newLength < oldLength) {

Then we could verify in an else close that the length are equal or throw that we're getting unexpected values.

decStat: (delta = 1) => {
length -= delta;
remove += delta;
decStat: (queueName, delta = 1) => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably should assert that delta is a positive integer

[QueueMetricAspect.DecrementCount]: makeQueueCounts(zeroEntries),
};

const provideQueue = queueName => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use this lazy provideQueue approach instead of asserting the queue exists ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because in the event of misconfigured reporting for an unknown queue, I'd rather accept the data than either ignore it or halt the chain. Comment added to that effect, but please let me know if you prefer one of the other options.

Comment on lines 455 to 490
const initialQueueLengths = /** @type {Record<InboundQueueName, number>} */ ({
queued: actionQueue.size(),
'high-priority': highPriorityQueue.size(),
forced: runThisBlock.size(),
});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like passing current values instead making metrics just to pass them back.

I wish the high-priority phase had a name that better related to the queued name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These {'leftover' | 'forced' | 'high-priority' | 'timer' | 'queued' | 'cleanup'} phase names are more externally relevant than the internal details to which they correspond, and have been introduced only recently. If we want to change them, there's no better time than now. Do you have any ideas?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed "high-priority" → "priority" and "queued" → "inbound" so that each phase is now a single word that looks good embedded in a metric name.

Copy link

cloudflare-workers-and-pages bot commented Jan 29, 2025

Deploying agoric-sdk with  Cloudflare Pages  Cloudflare Pages

Latest commit: 529d2f6
Status: ✅  Deploy successful!
Preview URL: https://17eac7ad.agoric-sdk.pages.dev
Branch Preview URL: https://gibson-10900-inbound-queue-m.agoric-sdk.pages.dev

View logs

* variable and parameter names
* imported OpenTelemetry types
* "TODO" comments for future architectural cleanup
* JSDoc descriptions
@gibson042 gibson042 force-pushed the gibson-10900-inbound-queue-metrics branch from 838e18d to 529d2f6 Compare January 29, 2025 21:40
@gibson042 gibson042 added the automerge:rebase Automatically rebase updates, then merge label Jan 29, 2025
Copy link
Member

@mhofman mhofman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the CrankerPhase/InboundQueueName cleanup

@mergify mergify bot merged commit 74fc45b into master Jan 29, 2025
93 checks passed
@mergify mergify bot deleted the gibson-10900-inbound-queue-metrics branch January 29, 2025 22:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
automerge:rebase Automatically rebase updates, then merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cosmic-swingset inbound queue length metrics should be split by queue priority
2 participants