OTEL bottleneck: ability to decide programmatically which spans we want to create #6542

Open
samuelAndalon opened this issue Jan 14, 2025 · 1 comment

Comments

@samuelAndalon
Contributor

samuelAndalon commented Jan 14, 2025

Describe the solution you'd like

The Apollo Router is capable of high throughput: a very small number of instances can handle very large workloads. However, we have identified a bottleneck in OpenTelemetry and its batch processor.

The query_planning span wraps the whole query planning service, which includes cache lookups. Those lookups take microseconds, and as a result the router exports hundreds of thousands of spans, so the OpenTelemetry channel cannot send all the spans that are actually created (a bummer that OTEL does not implement backpressure).
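To make the failure mode concrete, here is a minimal sketch of a bounded queue with no backpressure, using only the Rust standard library. It is a simplified model, not the actual OpenTelemetry batch processor: the function name `enqueue_spans`, the queue size of 2048, and the span count are assumptions chosen for illustration.

```rust
use std::sync::mpsc::{sync_channel, TrySendError};

/// Simplified model (assumption, not the real OTEL code): tries to enqueue
/// `total` spans into a bounded queue of size `capacity` while nothing
/// drains it. Returns `(queued, dropped)`.
fn enqueue_spans(capacity: usize, total: usize) -> (usize, usize) {
    // `_rx` is kept alive so the channel stays connected, but it never reads.
    let (tx, _rx) = sync_channel::<String>(capacity);
    let mut queued = 0;
    let mut dropped = 0;
    for i in 0..total {
        // `try_send` never blocks: when the queue is full the span is lost
        // instead of slowing the producer down (no backpressure).
        match tx.try_send(format!("query_planning-{i}")) {
            Ok(()) => queued += 1,
            Err(TrySendError::Full(_)) => dropped += 1,
            Err(TrySendError::Disconnected(_)) => break,
        }
    }
    (queued, dropped)
}

fn main() {
    // Hypothetical numbers: a 2048-slot queue hit by 100_000 short-lived spans.
    let (queued, dropped) = enqueue_spans(2048, 100_000);
    println!("queued={queued} dropped={dropped}");
}
```

In this model, every span created beyond the queue capacity is silently discarded, which is why reducing the number of spans created matters more than tuning the queue.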


For our use case, these spans are completely useless. They don't give us any useful insight; they just consume memory and CPU on the node where our router instance is running, and memory and CPU in the Datadog agent that sends the spans to the Datadog API.

It's not only query_planning; there are other spans that are useless for our use cases:

  1. execution
  2. http_request
  3. fetch
  4. parallel
  5. sequence

Pretty much all query plan nodes.
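As a rough sketch of what deciding programmatically could look like, the snippet below filters spans by name before they would reach the exporter. `SpanFilter` and `should_export` are hypothetical names invented for this illustration; they are not part of the router or of any OpenTelemetry crate, and the "router" and "supergraph" names are only used as examples of spans one might want to keep.

```rust
use std::collections::HashSet;

/// Hypothetical filter (not a router API): decides by span name
/// whether a span should be created and exported.
struct SpanFilter {
    suppressed: HashSet<&'static str>,
}

impl SpanFilter {
    fn new(names: &[&'static str]) -> Self {
        Self {
            suppressed: names.iter().copied().collect(),
        }
    }

    /// Returns `false` for span names on the suppression list.
    fn should_export(&self, span_name: &str) -> bool {
        !self.suppressed.contains(span_name)
    }
}

fn main() {
    // The span names listed in this issue.
    let filter = SpanFilter::new(&[
        "query_planning",
        "execution",
        "http_request",
        "fetch",
        "parallel",
        "sequence",
    ]);

    let created = ["router", "supergraph", "query_planning", "fetch"];
    let exported: Vec<&str> = created
        .iter()
        .copied()
        .filter(|name| filter.should_export(name))
        .collect();

    // Only the spans that are not on the suppression list remain.
    println!("{exported:?}");
}
```

The point of the design is that the decision is made per span name and independently of log levels, so suppressing a span does not have side effects elsewhere.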

Describe alternatives you've considered

We have tried multiple configurations for the batch_processor.

We have tried changing the log level of modules using the RUST_LOG env variable, which works for most use cases, but it breaks the propagation of headers when the log level of apollo_router::services::http::service is changed.

After some debugging, we found that the issue is here:

https://github.com/apollographql/router/blob/dev/apollo-router/src/services/http/service.rs#L276

The propagation of headers relies on the log level of modules, which IMHO should be decoupled.
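A minimal sketch of what decoupling could mean, assuming the trace context is tracked separately from the span itself: the propagation header is written whether or not the local HTTP span is enabled. `TraceContext`, `inject_traceparent`, and `build_subgraph_headers` are hypothetical names for this illustration and do not reflect the router's actual code; the header layout follows the W3C `traceparent` format.

```rust
use std::collections::HashMap;

/// Hypothetical trace context, kept independently of any recorded span.
struct TraceContext {
    trace_id: u128,
    parent_span_id: u64,
    sampled: bool,
}

/// Writes a W3C `traceparent` header: version, trace id, parent id, flags.
fn inject_traceparent(ctx: &TraceContext, headers: &mut HashMap<String, String>) {
    let flags: u8 = if ctx.sampled { 1 } else { 0 };
    headers.insert(
        "traceparent".to_string(),
        format!(
            "00-{:032x}-{:016x}-{:02x}",
            ctx.trace_id, ctx.parent_span_id, flags
        ),
    );
}

/// Builds outgoing headers. `http_span_enabled` only controls whether a
/// span would be recorded; it deliberately has no effect on propagation.
fn build_subgraph_headers(ctx: &TraceContext, http_span_enabled: bool) -> HashMap<String, String> {
    let mut headers = HashMap::new();
    if http_span_enabled {
        // A real implementation would create and record the span here.
    }
    // Propagation happens unconditionally.
    inject_traceparent(ctx, &mut headers);
    headers
}

fn main() {
    let ctx = TraceContext {
        trace_id: 0x4bf92f3577b34da6a3ce929d0e0e4736,
        parent_span_id: 0x00f067aa0ba902b7,
        sampled: true,
    };
    let with_span = build_subgraph_headers(&ctx, true);
    let without_span = build_subgraph_headers(&ctx, false);
    // Both requests carry the same propagation header.
    assert_eq!(with_span, without_span);
    println!("{}", without_span["traceparent"]);
}
```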

@BrynCooke
Contributor

I think it would be possible to fix the propagation even if the http span is suppressed.
