Polishing
ptodev committed Dec 8, 2023
1 parent 36b7098 commit d889c0e
Showing 3 changed files with 93 additions and 69 deletions.
@@ -143,9 +143,9 @@ Name | Type | Description | Default | Required

### kubernetes block

You can use the `kubernetes` block to load balance across the pods of a Kubernetes service. The Agent will be notified
by the Kubernetes API whenever a new pod is added or removed from the service. The `kubernetes` resolver has a
much faster response time than the `dns` resolver, because it doesn't require polling.
You can use the `kubernetes` block to load balance across the pods of a Kubernetes service.
The Agent will be notified by the Kubernetes API whenever a new pod is added or removed from the service.
The `kubernetes` resolver has a much faster response time than the `dns` resolver, because it doesn't require polling.

The following arguments are supported:

@@ -292,49 +292,44 @@ All spans for a given `service.name` must go to the same spanmetrics Agent.
* If this is not done, metrics generated from spans might be incorrect.
* For example, if similar spans for the same `service.name` end up on different Agent instances,
the two Agents will have identical metric series for calculating span latency, errors, and number of requests.
* When both Agents attempt to remote write the metrics to a database such as Mimir,
the series may clash with each other.
* When both Agents attempt to write the metrics to a database such as Mimir, the series may clash with each other.
* At best this will lead to an error in the Agent and a rejected write to the metrics database.
* At worst, it could lead to inaccurate data due to overlapping samples for the metric series.

However, there are ways to scale `otelcol.connector.spanmetrics` without the need for a load balancer:
1. Each Agent could add an attribute such as `agent.id` in order to make its series unique (see the sketch after this list).
* A `sum by` PromQL query could be used to aggregate the metrics from different Agents.
* An extra `agent.id` attribute has a downside that the metrics stored in the database will have higher cardinality.
* An extra `agent.id` attribute has the downside that the metrics stored in the database will have higher {{< term "cardinality" >}}cardinality{{< /term >}}.
2. Spanmetrics could be generated in the backend database instead of the Agent.
* For example, span metrics can be [generated][tempo-spanmetrics] in Grafana Cloud by the Tempo traces database.

[tempo-spanmetrics]: https://grafana.com/docs/tempo/latest/metrics-generator/span_metrics/
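
To illustrate the first approach, here is a minimal sketch of how an `agent.id` attribute could be attached in Flow. The use of `otelcol.processor.attributes`, the `HOSTNAME` environment variable as a unique per-Agent value, and the `otelcol.exporter.prometheus.default` target are all assumptions, not part of the original example:

```river
// Sketch: tag metrics from otelcol.connector.spanmetrics with a unique
// per-Agent attribute so that series from different Agents don't clash.
otelcol.processor.attributes "agent_id" {
  action {
    key    = "agent.id"
    action = "insert"
    // Assumption: HOSTNAME is unique per Agent pod.
    value  = env("HOSTNAME")
  }

  output {
    // Assumption: metrics are converted to Prometheus format and written elsewhere.
    metrics = [otelcol.exporter.prometheus.default.input]
  }
}
```

The per-Agent series can later be merged back together with a PromQL query such as `sum without (agent_id) (...)`.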

### otelcol.connector.servicegraph
Unfortunately, there is generally no reliable way to scale `otelcol.connector.servicegraph` over multiple Agent instance.
<!-- Or is there? Can we fix the issue below using a PromQL query? There may still be higher than normal cardinality though? -->
It is challenging to scale `otelcol.connector.servicegraph` over multiple Agent instances.
* For `otelcol.connector.servicegraph` to work correctly, each "client" span must be paired
with a "server" span in order to calculate metrics such as span duration.
* If a "client" span goes to one Agent, but a "server" span goes to another Agent,
then no single Agent will be able to pair the spans and a metric won't be generated.

`otelcol.exporter.loadbalancing` can solve this problem partially if it is configured with `routing_key = "traceID"`.
* Each Agent will then be able to calculate service graph for each client/server pair in a trace.
* Each Agent will then be able to calculate a service graph for each "client"/"server" pair in a trace.
* However, it is possible to have a span with similar "server"/"client" values
in a different trace, processed by another Agent.
* If two different Agents, process similar "server"/"client" spans,
* If two different Agents process similar "server"/"client" spans,
they will generate the same service graph metric series.
* If the series from two Agents are the same, this will lead to issues
when remote writing them in the backend database.

* Users could differentiate the series by adding a label such as `"agent_instance"`,
but there is currently no method in the Agent to aggregate those series from different Agents and merge them into one series.

There are ways to work around this:
1. Each Agent could add an attribute such as `agent.id` in order to make its series unique.
when writing them to the backend database.
* Users could differentiate the series by adding an attribute such as `"agent.id"`.
* Unfortunately, there is currently no method in the Agent to aggregate those series from different Agents and merge them into one series.
* A PromQL query could be used to aggregate the metrics from different Agents (a sketch follows at the end of this section).
* An extra `agent.id` attribute has the downside that the metrics stored in the database will have higher cardinality.
2. A simpler, more scalable alternative to generating service graph metrics
in the Agent is to do them in the backend database.
* For example service graphs can be [generated][tempo-servicegraphs] in Grafana Cloud by the Tempo traces database.
* If the metrics are stored in Grafana Mimir, cardinality issues due to `"agent.id"` labels can be solved using [Adaptive Metrics][adaptive-metrics].

A simpler, more scalable alternative to generating service graph metrics in the Agent is to generate them entirely in the backend database.
For example, service graphs can be [generated][tempo-servicegraphs] in Grafana Cloud by the Tempo traces database.

[tempo-servicegraphs]: https://grafana.com/docs/tempo/latest/metrics-generator/service_graphs/
[adaptive-metrics]: https://grafana.com/docs/grafana-cloud/cost-management-and-billing/reduce-costs/metrics-costs/control-metrics-usage-via-adaptive-metrics/
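
As a sketch of the PromQL aggregation mentioned above, a query along the following lines could merge the per-Agent series back into one. The metric name and the `agent_id` label are assumptions; they depend on how the attribute is mapped to a label when the metrics are written:

```promql
# Sketch: drop the per-Agent label and sum the service graph series.
sum without (agent_id) (
  rate(traces_service_graph_request_total[5m])
)
```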

### Mixing stateful components
<!-- TODO: Add a picture of the architecture? -->
@@ -343,7 +338,7 @@ Different Flow components may require a different `routing_key` for `otelcol.exp
whereas `otelcol.connector.spanmetrics` requires `routing_key = "service"`.
* Therefore, it is not possible to use the same load balancer for both tail sampling and span metrics.
* Two different sets of load balancers have to be set up:
* One set of `otelcol.exporter.loadbalancing` with `routing_key = "traceID"`, sending spans to Agents doing tail sampling and no span metrics.
* One set of `otelcol.exporter.loadbalancing` with `routing_key = "traceID"`, sending spans to Agents which do tail sampling and no span metrics.
* Another set of `otelcol.exporter.loadbalancing` with `routing_key = "service"`, sending spans to Agents doing span metrics and no tail sampling.
* Unfortunately, this can also lead to side effects. For example, if `otelcol.connector.spanmetrics` is configured to generate exemplars,
the tail sampling Agents might drop the trace which the exemplar points to. There is no coordination between the tail sampling Agents and
@@ -420,12 +415,12 @@ otelcol.exporter.loadbalancing "default" {
```
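
As a rough sketch, a DNS-resolver configuration for the load-balancer Agents could look like the following; the headless service name, namespace, and insecure TLS setting are assumptions:

```river
otelcol.exporter.loadbalancing "default" {
  resolver {
    dns {
      // Assumption: a headless service exposing one A record per sampling Agent.
      hostname = "agent-traces-headless.grafana-cloud-monitoring.svc.cluster.local"
      port     = "4317"
    }
  }

  protocol {
    otlp {
      client {
        tls {
          insecure = true
        }
      }
    }
  }

  // All spans of a trace are routed to the same sampling Agent.
  routing_key = "traceID"
}
```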

Below is an example Kubernetes configuration which configures two sets of Agents:
* A pool of "load balancer" Agents:
* A pool of load-balancer Agents:
* Spans are received from instrumented applications via `otelcol.receiver.otlp`
* Spans are then exported via `otelcol.exporter.loadbalancing`.
* A pool of "backing" Agents:
* The "backing" Agents run behind a headless service to enable the "load balancer" Agents to discover them.
* Spans are received from the "load balancer" Agents via `otelcol.receiver.otlp`
* A pool of sampling Agents:
* The sampling Agents run behind a headless service to enable the load-balancer Agents to discover them.
* Spans are received from the load-balancer Agents via `otelcol.receiver.otlp`
* Traces are then sampled via `otelcol.processor.tail_sampling`.
* The traces are exported via `otelcol.exporter.otlp` to an OTLP-compatible database such as Tempo.

@@ -649,7 +644,7 @@ data:
{{< /collapse >}}
You need to fill in correct OTLP credentials prior to running the above example.
The example above can be started by using k3d:
The example above can be started by using [k3d][]:
<!-- TODO: Link to the k3d page -->
```bash
k3d cluster create grafana-agent-lb-test
@@ -661,10 +656,13 @@ To delete the cluster, run:
k3d cluster delete grafana-agent-lb-test
```

[k3d]: https://k3d.io/v5.6.0/

### Kubernetes resolver

`otelcol.exporter.loadbalancing` can export to a statically configured
list of Agents using configuration like this:
When configured with a `kubernetes` resolver, `otelcol.exporter.loadbalancing` will be notified by
the Kubernetes API any time a pod has left or joined the `service`.
Spans will then be exported to the addresses from the Kubernetes API, combined with all the possible `ports`.

```river
otelcol.exporter.loadbalancing "default" {
@@ -683,14 +681,14 @@ otelcol.exporter.loadbalancing "default" {
```

Below is an example Kubernetes configuration which sets up two sets of Agents:
* A pool of "load balancer" Agents:
* A pool of load-balancer Agents:
* Spans are received from instrumented applications via `otelcol.receiver.otlp`
* Spans are then exported via `otelcol.exporter.loadbalancing`.
* The "load balancer" Agents will get notified by the Kubernetes API any time a pod
is added or removed from the pool of "backing" Agents.
* A pool of "backing" Agents:
* The "backing" Agents do not need to run behind a headless service.
* Spans are received from the "load balancer" Agents via `otelcol.receiver.otlp`
* The load-balancer Agents will get notified by the Kubernetes API any time a pod
is added or removed from the pool of sampling Agents.
* A pool of sampling Agents:
* The sampling Agents do not need to run behind a headless service.
* Spans are received from the load-balancer Agents via `otelcol.receiver.otlp`
* Traces are then sampled via `otelcol.processor.tail_sampling`.
* The traces are exported via `otelcol.exporter.otlp` to an OTLP-compatible database such as Tempo.

@@ -706,13 +704,13 @@ metadata:
apiVersion: v1
kind: ServiceAccount
metadata:
name: grafana-agent-traces
name: agent-traces
namespace: grafana-cloud-monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: grafana-agent-traces-role
name: agent-traces-role
namespace: grafana-cloud-monitoring
rules:
- apiGroups:
@@ -727,15 +725,15 @@ rules:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: grafana-agent-traces-rolebinding
name: agent-traces-rolebinding
namespace: grafana-cloud-monitoring
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: grafana-agent-traces-role
name: agent-traces-role
subjects:
- kind: ServiceAccount
name: grafana-agent-traces
name: agent-traces
namespace: grafana-cloud-monitoring
---
apiVersion: apps/v1
@@ -815,7 +813,7 @@ spec:
volumeMounts:
- mountPath: /etc/agent
name: agent-traces
serviceAccount: grafana-agent-traces
serviceAccount: agent-traces
volumes:
- configMap:
name: agent-traces
@@ -945,8 +943,7 @@ data:
{{< /collapse >}}
You need to fill in correct OTLP credentials prior to running the above example.
The example above can be started by using k3d:
<!-- TODO: Link to the k3d page -->
The example above can be started by using [k3d][]:
```bash
k3d cluster create grafana-agent-lb-test
kubectl apply -f kubernetes_config.yaml
28 changes: 14 additions & 14 deletions docs/sources/flow/setup/deploy-agent.md
@@ -13,14 +13,12 @@ weight: 900

{{< docs/shared source="agent" lookup="/deploy-agent.md" version="<AGENT_VERSION>" >}}

## Scaling Grafana Agent
## Processing different types of telemetry in different Agent instances

If the load on the Agents is small, it is recommended to process
all necessary telemetry signals in the same Agent process. For example,
a single Agent can process all of the incoming metrics, logs, traces, and profiles.
If the load on the Agents is small, it is recommended to process all necessary telemetry signals in the same Agent process.
For example, a single Agent can process all of the incoming metrics, logs, traces, and profiles.

However, if the load on the Agents is big, it may be beneficial to
process different telemetry signals in different deployments of Agents:
However, if the load on the Agents is big, it may be beneficial to process different telemetry signals in different deployments of Agents:
* This provides better stability due to the isolation between processes.
* For example, an overloaded Agent processing traces won't impact an Agent processing metrics.
* Different types of signal collection require different methods for scaling:
@@ -29,14 +27,17 @@ process different telemetry signals in different deployments of Agents:

### Traces

<!-- TODO: Link to https://opentelemetry.io/docs/collector/scaling/ ? -->
Scaling Agent instances for tracing is very similar to [scaling OpenTelemetry Collector][scaling-collector] instances.
This is because most Flow components used for tracing are based on components from the Collector.

[scaling-collector]: https://opentelemetry.io/docs/collector/scaling/

#### When to scale

<!--
TODO: Include information from https://opentelemetry.io/docs/collector/scaling/#when-to-scale
Unfortunately the Agent doesn't have many of the metrics they mention because they're instrumented with OpenCensus and not OpenTelemetry.
-->
To decide whether scaling is necessary, check metrics such as:
* `receiver_refused_spans_ratio_total` from receivers such as `otelcol.receiver.otlp`.
* `processor_refused_spans_ratio_total` from processors such as `otelcol.processor.batch`.
* `exporter_send_failed_spans_ratio_total` from exporters such as `otelcol.exporter.otlp` and `otelcol.exporter.loadbalancing`.
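
For example, a query along these lines could back an alert that fires when a receiver starts refusing spans; the `instance` label is an assumption and depends on how the Agent's own metrics are scraped:

```promql
# Sketch: a non-zero refusal rate over the last 5 minutes indicates backpressure.
sum by (instance) (
  rate(receiver_refused_spans_ratio_total[5m])
) > 0
```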

#### Stateful and stateless components

@@ -56,9 +57,8 @@ Examples of stateful components:

<!-- TODO: link to the otelcol.exporter.loadbalancing docs for more info -->

A "stateless component" does not need to aggregate specific spans in
order to work correctly - it can work correctly even if it only has
some of the spans of a trace.
A "stateless component" does not need to aggregate specific spans in order to work correctly -
it can work correctly even if it only has some of the spans of a trace.

Stateless Agents can be scaled without using `otelcol.exporter.loadbalancing`.
You could use an off-the-shelf load balancer to, for example, do round-robin load balancing.
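
For instance, a plain (non-headless) Kubernetes Service in front of stateless Agents provides simple distribution out of the box. This is only a sketch; the names and namespace are assumptions:

```yaml
# Sketch: a regular ClusterIP Service spreads incoming OTLP traffic across
# stateless Agent pods, with no otelcol.exporter.loadbalancing needed.
apiVersion: v1
kind: Service
metadata:
  name: agent-traces-stateless
  namespace: grafana-cloud-monitoring
spec:
  selector:
    name: agent-traces-stateless
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
```

Note that long-lived OTLP/gRPC connections can make the distribution uneven; an L7 load balancer may spread load more evenly than connection-level round robin.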
51 changes: 39 additions & 12 deletions docs/sources/static/set-up/deploy-agent.md
@@ -13,10 +13,8 @@ weight: 300

## For scalable ingestion of traces

For small workloads, it is normal to have just one Agent handle all incoming
spans with no need of load balancing. However, for large workloads there it
is desirable to spread out the load of ingesting spans over multiple Agent
instances.
For small workloads, it is normal to have just one Agent handle all incoming spans with no need for load balancing.
However, for large workloads it is desirable to spread out the load of processing spans over multiple Agent instances.

To scale the Agent for trace ingestion, do the following:
1. Set up the `load_balancing` section of the Agent's `traces` config.
@@ -30,15 +28,42 @@ To scale the Agent for trace ingestion, do the following:
<!-- For more detail, send people over to the load_balancing section in traces_config -->

### tail_sampling
Agents configured with `tail_sampling` must have all spans for
a given trace in order to work correctly. If some of the spans for a trace end up
in a different Agent, `tail_sampling` will not sample correctly.

If some of the spans for a trace end up in a different Agent, `tail_sampling` will not sample correctly.
Enabling `load_balancing` is necessary if `tail_sampling` is enabled and more than one Agent instance could be processing spans for the same trace.
`load_balancing` will make sure that all spans of a given trace will be processed by the same Agent instance.
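
A minimal sketch of such a configuration is shown below. The headless service name, namespace, port, and the `always_sample` policy are assumptions:

```yaml
traces:
  configs:
    - name: default
      receivers:
        otlp:
          protocols:
            grpc:
      # Route all spans of a given trace to the same Agent instance.
      load_balancing:
        routing_key: traceID
        exporter:
          insecure: true
        resolver:
          dns:
            # Assumption: sampling Agents receive load-balanced spans on this port.
            hostname: agent-traces-headless.grafana-cloud-monitoring.svc.cluster.local
            port: 4318
      tail_sampling:
        policies:
          - type: always_sample
```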

### spanmetrics
<!-- TODO: Also talk about span metrics -->

All spans for a given `service.name` must be processed by the same `spanmetrics` Agent.
To make sure that this is the case, set up `load_balancing` with `routing_key: service`.
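
A sketch of the relevant fragment; the `handler_endpoint` value is an assumption:

```yaml
traces:
  configs:
    - name: default
      spanmetrics:
        # Assumption: generated metrics are exposed here for scraping.
        handler_endpoint: 0.0.0.0:8889
      # All spans with the same service.name go to the same Agent instance.
      load_balancing:
        routing_key: service
```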

### service_graphs
<!-- TODO: Also talk about service_graphs -->

It is challenging to scale `service_graphs` over multiple Agent instances.
* For `service_graphs` to work correctly, each "client" span must be paired
with a "server" span in order to calculate metrics such as span duration.
* If a "client" span goes to one Agent, but a "server" span goes to another Agent,
then no single Agent will be able to pair the spans and a metric won't be generated.

`load_balancing` can solve this problem partially if it is configured with `routing_key: traceID`.
* Each Agent will then be able to calculate a service graph for each "client"/"server" pair in a trace.
* However, it is possible to have a span with similar "server"/"client" values
in a different trace, processed by another Agent.
* If two different Agents process similar "server"/"client" spans,
they will generate the same service graph metric series.
* If the series from two Agents are the same, this will lead to issues
when writing them to the backend database.
* Users could differentiate the series by adding a label such as `"agent_id"`.
* Unfortunately, there is currently no method in the Agent to aggregate those series from different Agents and merge them into one series.
* A PromQL query could be used to aggregate the metrics from different Agents (a sketch follows at the end of this section).
* If the metrics are stored in Grafana Mimir, cardinality issues due to `"agent_id"` labels can be solved using [Adaptive Metrics][adaptive-metrics].

A simpler, more scalable alternative to generating service graph metrics in the Agent is to generate them entirely in the backend database.
For example, service graphs can be [generated][tempo-servicegraphs] in Grafana Cloud by the Tempo traces database.

[tempo-servicegraphs]: https://grafana.com/docs/tempo/latest/metrics-generator/service_graphs/
[adaptive-metrics]: https://grafana.com/docs/grafana-cloud/cost-management-and-billing/reduce-costs/metrics-costs/control-metrics-usage-via-adaptive-metrics/
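
As with the Flow example earlier, a PromQL query along these lines could merge the per-Agent series; the metric name and the `agent_id` label are assumptions:

```promql
# Sketch: drop the per-Agent label and sum the service graph series.
sum without (agent_id) (
  rate(traces_service_graph_request_total[5m])
)
```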

### Example Kubernetes configuration
{{< collapse title="Example Kubernetes configuration with DNS load balancing" >}}
@@ -352,9 +377,9 @@ data:
{{< /collapse >}}
You need to fill in correct OTLP credentials prior to running the above example.
The example above can be started by using k3d:
<!-- TODO: Link to the k3d page -->
You need to fill in correct OTLP credentials prior to running the above examples.
The example above can be started by using [k3d][]:
```bash
k3d cluster create grafana-agent-lb-test
kubectl apply -f kubernetes_config.yaml
@@ -364,3 +389,5 @@ To delete the cluster, run:
```bash
k3d cluster delete grafana-agent-lb-test
```

[k3d]: https://k3d.io/v5.6.0/
