Polishing
ptodev committed Dec 8, 2023
1 parent 36b7098 commit d889c0e
Showing 3 changed files with 93 additions and 69 deletions.
@@ -143,9 +143,9 @@ Name | Type | Description | Default | Required

### kubernetes block

You can use the `kubernetes` block to load balance across the pods of a Kubernetes service. The Agent will be notified
by the Kubernetes API whenever a new pod is added or removed from the service. The `kubernetes` resolver has a
much faster response time than the `dns` resolver, because it doesn't require polling.
You can use the `kubernetes` block to load balance across the pods of a Kubernetes service.
The Agent will be notified by the Kubernetes API whenever a new pod is added or removed from the service.
The `kubernetes` resolver has a much faster response time than the `dns` resolver, because it doesn't require polling.

The following arguments are supported:

@@ -292,49 +292,44 @@ All spans for a given `service.name` must go to the same spanmetrics Agent.
* If this is not done, metrics generated from spans might be incorrect.
* For example, if similar spans for the same `service.name` end up on different Agent instances,
the two Agents will have identical metric series for calculating span latency, errors, and number of requests.
* When both Agents attempt to remote write the metrics to a database such as Mimir,
the series may clash with each other.
* When both Agents attempt to write the metrics to a database such as Mimir, the series may clash with each other.
* At best this will lead to an error in the Agent and a rejected write to the metrics database.
* At worst, it could lead to inaccurate data due to overlapping samples for the metric series.

However, there are ways to scale `otelcol.connector.spanmetrics` without the need for a load balancer:
1. Each Agent could add an attribute such as `agent.id` in order to make its series unique (see the sketch after this list).
* A `sum by` PromQL query could be used to aggregate the metrics from different Agents.
* An extra `agent.id` attribute has a downside that the metrics stored in the database will have higher cardinality.
* An extra `agent.id` attribute has the downside that the metrics stored in the database will have higher {{< term "cardinality" >}}cardinality{{< /term >}}.
2. Spanmetrics could be generated in the backend database instead of the Agent.
* For example, span metrics can be [generated][tempo-spanmetrics] in Grafana Cloud by the Tempo traces database.

[tempo-spanmetrics]: https://grafana.com/docs/tempo/latest/metrics-generator/span_metrics/
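
To illustrate the first approach, here is a minimal sketch of how an `agent.id` attribute could be attached in Flow. The use of `otelcol.processor.attributes`, the `HOSTNAME` environment variable as a unique per-Agent value, and the `otelcol.exporter.prometheus.default` target are all assumptions, not part of the original example:

```river
// Sketch: tag metrics from otelcol.connector.spanmetrics with a unique
// per-Agent attribute so that series from different Agents don't clash.
otelcol.processor.attributes "agent_id" {
  action {
    key    = "agent.id"
    action = "insert"
    // Assumption: HOSTNAME is unique per Agent pod.
    value  = env("HOSTNAME")
  }

  output {
    // Assumption: metrics are converted to Prometheus format and written elsewhere.
    metrics = [otelcol.exporter.prometheus.default.input]
  }
}
```

The per-Agent series can later be merged back together with a PromQL query such as `sum without (agent_id) (...)`.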

### otelcol.connector.servicegraph
Unfortunately, there is generally no reliable way to scale `otelcol.connector.servicegraph` over multiple Agent instance.
<!-- Or is there? Can we fix the issue below using a PromQL query? There may still be higher than normal cardinality though? -->
It is challenging to scale `otelcol.connector.servicegraph` over multiple Agent instances.
* For `otelcol.connector.servicegraph` to work correctly, each "client" span must be paired
with a "server" span in order to calculate metrics such as span duration.
* If a "client" span goes to one Agent, but a "server" span goes to another Agent,
then no single Agent will be able to pair the spans and a metric won't be generated.

`otelcol.exporter.loadbalancing` can solve this problem partially if it is configured with `routing_key = "traceID"`.
* Each Agent will then be able to calculate service graph for each client/server pair in a trace.
* Each Agent will then be able to calculate a service graph for each "client"/"server" pair in a trace.
* However, it is possible to have a span with similar "server"/"client" values
in a different trace, processed by another Agent.
* If two different Agents, process similar "server"/"client" spans,
* If two different Agents process similar "server"/"client" spans,
they will generate the same service graph metric series.
* If the series from two Agents are the same, this will lead to issues
when remote writing them in the backend database.

* Users could differentiate the series by adding a label such as `"agent_instance"`,
but there is currently no method in the Agent to aggregate those series from different Agents and merge them into one series.

There are ways to work around this:
1. Each Agent could add an attribute such as `agent.id` in order to make its series unique.
when writing them to the backend database.
* Users could differentiate the series by adding an attribute such as `"agent.id"`.
* Unfortunately, there is currently no method in the Agent to aggregate those series from different Agents and merge them into one series.
* A PromQL query could be used to aggregate the metrics from different Agents (a sketch follows at the end of this section).
* An extra `agent.id` attribute has the downside that the metrics stored in the database will have higher cardinality.
2. A simpler, more scalable alternative to generating service graph metrics
in the Agent is to do them in the backend database.
* For example service graphs can be [generated][tempo-servicegraphs] in Grafana Cloud by the Tempo traces database.
* If the metrics are stored in Grafana Mimir, cardinality issues due to `"agent.id"` labels can be solved using [Adaptive Metrics][adaptive-metrics].

A simpler, more scalable alternative to generating service graph metrics in the Agent is to generate them entirely in the backend database.
For example, service graphs can be [generated][tempo-servicegraphs] in Grafana Cloud by the Tempo traces database.

[tempo-servicegraphs]: https://grafana.com/docs/tempo/latest/metrics-generator/service_graphs/
[adaptive-metrics]: https://grafana.com/docs/grafana-cloud/cost-management-and-billing/reduce-costs/metrics-costs/control-metrics-usage-via-adaptive-metrics/
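
As a sketch of the PromQL aggregation mentioned above, a query along the following lines could merge the per-Agent series back into one. The metric name and the `agent_id` label are assumptions; they depend on how the attribute is mapped to a label when the metrics are written:

```promql
# Sketch: drop the per-Agent label and sum the service graph series.
sum without (agent_id) (
  rate(traces_service_graph_request_total[5m])
)
```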

### Mixing stateful components
<!-- TODO: Add a picture of the architecture? -->
@@ -343,7 +338,7 @@ Different Flow components may require a different `routing_key` for `otelcol.exp
whereas `otelcol.connector.spanmetrics` requires `routing_key = "service"`.
* Therefore, it is not possible to use the same load balancer for both tail sampling and span metrics.
* Two different sets of load balancers have to be set up:
* One set of `otelcol.exporter.loadbalancing` with `routing_key = "traceID"`, sending spans to Agents doing tail sampling and no span metrics.
* One set of `otelcol.exporter.loadbalancing` with `routing_key = "traceID"`, sending spans to Agents which do tail sampling and no span metrics.
* Another set of `otelcol.exporter.loadbalancing` with `routing_key = "service"`, sending spans to Agents doing span metrics and no tail sampling.
* Unfortunately, this can also lead to side effects. For example, if `otelcol.connector.spanmetrics` is configured to generate exemplars,
the tail sampling Agents might drop the trace which the exemplar points to. There is no coordination between the tail sampling Agents and
@@ -420,12 +415,12 @@ otelcol.exporter.loadbalancing "default" {
```
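
As a rough sketch, a DNS-resolver configuration for the load-balancer Agents could look like the following; the headless service name, namespace, and insecure TLS setting are assumptions:

```river
otelcol.exporter.loadbalancing "default" {
  resolver {
    dns {
      // Assumption: a headless service exposing one A record per sampling Agent.
      hostname = "agent-traces-headless.grafana-cloud-monitoring.svc.cluster.local"
      port     = "4317"
    }
  }

  protocol {
    otlp {
      client {
        tls {
          insecure = true
        }
      }
    }
  }

  // All spans of a trace are routed to the same sampling Agent.
  routing_key = "traceID"
}
```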

Below is an example Kubernetes configuration which configures two sets of Agents:
* A pool of "load balancer" Agents:
* A pool of load-balancer Agents:
* Spans are received from instrumented applications via `otelcol.receiver.otlp`
* Spans are then exported via `otelcol.exporter.loadbalancing`.
* A pool of "backing" Agents:
* The "backing" Agents run behind a headless service to enable the "load balancer" Agents to discover them.
* Spans are received from the "load balancer" Agents via `otelcol.receiver.otlp`
* A pool of sampling Agents:
* The sampling Agents run behind a headless service to enable the load-balancer Agents to discover them.
* Spans are received from the load-balancer Agents via `otelcol.receiver.otlp`
* Traces are then sampled via `otelcol.processor.tail_sampling`.
* The traces are exported via `otelcol.exporter.otlp` to an OTLP-compatible database such as Tempo.

@@ -649,7 +644,7 @@ data:
{{< /collapse >}}
You need to fill in correct OTLP credentials prior to running the above example.
The example above can be started by using k3d:
The example above can be started by using [k3d][]:
<!-- TODO: Link to the k3d page -->
```bash
k3d cluster create grafana-agent-lb-test
@@ -661,10 +656,13 @@ To delete the cluster, run:
k3d cluster delete grafana-agent-lb-test
```

[k3d]: https://k3d.io/v5.6.0/

### Kubernetes resolver

`otelcol.exporter.loadbalancing` can export to a statically configured
list of Agents using configuration like this:
When configured with a `kubernetes` resolver, `otelcol.exporter.loadbalancing` will be notified by
the Kubernetes API any time a pod has left or joined the `service`.
Spans will then be exported to the addresses from the Kubernetes API, combined with all the possible `ports`.

```river
otelcol.exporter.loadbalancing "default" {
@@ -683,14 +681,14 @@ otelcol.exporter.loadbalancing "default" {
```

Below is an example Kubernetes configuration which sets up two sets of Agents:
* A pool of "load balancer" Agents:
* A pool of load-balancer Agents:
* Spans are received from instrumented applications via `otelcol.receiver.otlp`
* Spans are then exported via `otelcol.exporter.loadbalancing`.
* The "load balancer" Agents will get notified by the Kubernetes API any time a pod
is added or removed from the pool of "backing" Agents.
* A pool of "backing" Agents:
* The "backing" Agents do not need to run behind a headless service.
* Spans are received from the "load balancer" Agents via `otelcol.receiver.otlp`
* The load-balancer Agents will get notified by the Kubernetes API any time a pod
is added or removed from the pool of sampling Agents.
* A pool of sampling Agents:
* The sampling Agents do not need to run behind a headless service.
* Spans are received from the load-balancer Agents via `otelcol.receiver.otlp`
* Traces are then sampled via `otelcol.processor.tail_sampling`.
* The traces are exported via `otelcol.exporter.otlp` to an OTLP-compatible database such as Tempo.

@@ -706,13 +704,13 @@ metadata:
apiVersion: v1
kind: ServiceAccount
metadata:
name: grafana-agent-traces
name: agent-traces
namespace: grafana-cloud-monitoring
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
name: grafana-agent-traces-role
name: agent-traces-role
namespace: grafana-cloud-monitoring
rules:
- apiGroups:
@@ -727,15 +725,15 @@ rules:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
name: grafana-agent-traces-rolebinding
name: agent-traces-rolebinding
namespace: grafana-cloud-monitoring
roleRef:
apiGroup: rbac.authorization.k8s.io
kind: Role
name: grafana-agent-traces-role
name: agent-traces-role
subjects:
- kind: ServiceAccount
name: grafana-agent-traces
name: agent-traces
namespace: grafana-cloud-monitoring
---
apiVersion: apps/v1
@@ -815,7 +813,7 @@ spec:
volumeMounts:
- mountPath: /etc/agent
name: agent-traces
serviceAccount: grafana-agent-traces
serviceAccount: agent-traces
volumes:
- configMap:
name: agent-traces
@@ -945,8 +943,7 @@ data:
{{< /collapse >}}
You need to fill in correct OTLP credentials prior to running the above example.
The example above can be started by using k3d:
<!-- TODO: Link to the k3d page -->
The example above can be started by using [k3d][]:
```bash
k3d cluster create grafana-agent-lb-test
kubectl apply -f kubernetes_config.yaml
28 changes: 14 additions & 14 deletions docs/sources/flow/setup/deploy-agent.md
@@ -13,14 +13,12 @@ weight: 900

{{< docs/shared source="agent" lookup="/deploy-agent.md" version="<AGENT_VERSION>" >}}

## Scaling Grafana Agent
## Processing different types of telemetry in different Agent instances

If the load on the Agents is small, it is recommended to process
all necessary telemetry signals in the same Agent process. For example,
a single Agent can process all of the incoming metrics, logs, traces, and profiles.
If the load on the Agents is small, it is recommended to process all necessary telemetry signals in the same Agent process.
For example, a single Agent can process all of the incoming metrics, logs, traces, and profiles.

However, if the load on the Agents is big, it may be beneficial to
process different telemetry signals in different deployments of Agents:
However, if the load on the Agents is big, it may be beneficial to process different telemetry signals in different deployments of Agents:
* This provides better stability due to the isolation between processes.
* For example, an overloaded Agent processing traces won't impact an Agent processing metrics.
* Different types of signal collection require different methods for scaling:
@@ -29,14 +27,17 @@ process different telemetry signals in different deployments of Agents:

### Traces

<!-- TODO: Link to https://opentelemetry.io/docs/collector/scaling/ ? -->
Scaling Agent instances for tracing is very similar to [scaling OpenTelemetry Collector][scaling-collector] instances.
This is because most Flow components used for tracing are based on components from the Collector.

[scaling-collector]: https://opentelemetry.io/docs/collector/scaling/

#### When to scale

<!--
TODO: Include information from https://opentelemetry.io/docs/collector/scaling/#when-to-scale
Unfortunately the Agent doesn't have many of the metrics they mention because they're instrumented with OpenCensus and not OpenTelemetry.
-->
To decide whether scaling is necessary, check metrics such as:
* `receiver_refused_spans_ratio_total` from receivers such as `otelcol.receiver.otlp`.
* `processor_refused_spans_ratio_total` from processors such as `otelcol.processor.batch`.
* `exporter_send_failed_spans_ratio_total` from exporters such as `otelcol.exporter.otlp` and `otelcol.exporter.loadbalancing`.
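
For example, a query along these lines could back an alert that fires when a receiver starts refusing spans; the `instance` label is an assumption and depends on how the Agent's own metrics are scraped:

```promql
# Sketch: a non-zero refusal rate over the last 5 minutes indicates backpressure.
sum by (instance) (
  rate(receiver_refused_spans_ratio_total[5m])
) > 0
```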

#### Stateful and stateless components

@@ -56,9 +57,8 @@ Examples of stateful components:

<!-- TODO: link to the otelcol.exporter.loadbalancing docs for more info -->

A "stateless component" does not need to aggregate specific spans in
order to work correctly - it can work correctly even if it only has
some of the spans of a trace.
A "stateless component" does not need to aggregate specific spans in order to work correctly -
it can work correctly even if it only has some of the spans of a trace.

Stateless Agents can be scaled without using `otelcol.exporter.loadbalancing`.
You could use an off-the-shelf load balancer to, for example, do round-robin load balancing.
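
For instance, a plain (non-headless) Kubernetes Service in front of stateless Agents provides simple distribution out of the box. This is only a sketch; the names and namespace are assumptions:

```yaml
# Sketch: a regular ClusterIP Service spreads incoming OTLP traffic across
# stateless Agent pods, with no otelcol.exporter.loadbalancing needed.
apiVersion: v1
kind: Service
metadata:
  name: agent-traces-stateless
  namespace: grafana-cloud-monitoring
spec:
  selector:
    name: agent-traces-stateless
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
```

Note that long-lived OTLP/gRPC connections can make the distribution uneven; an L7 load balancer may spread load more evenly than connection-level round robin.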
51 changes: 39 additions & 12 deletions docs/sources/static/set-up/deploy-agent.md
@@ -13,10 +13,8 @@ weight: 300

## For scalable ingestion of traces

For small workloads, it is normal to have just one Agent handle all incoming
spans with no need of load balancing. However, for large workloads there it
is desirable to spread out the load of ingesting spans over multiple Agent
instances.
For small workloads, it is normal to have just one Agent handle all incoming spans with no need for load balancing.
However, for large workloads it is desirable to spread out the load of processing spans over multiple Agent instances.

To scale the Agent for trace ingestion, do the following:
1. Set up the `load_balancing` section of the Agent's `traces` config.
@@ -30,15 +28,42 @@ To scale the Agent for trace ingestion, do the following:
<!-- For more detail, send people over to the load_balancing section in traces_config -->

### tail_sampling
Agents configured with `tail_sampling` must have all spans for
a given trace in order to work correctly. If some of the spans for a trace end up
in a different Agent, `tail_sampling` will not sample correctly.

If some of the spans for a trace end up in a different Agent, `tail_sampling` will not sample correctly.
Enabling `load_balancing` is necessary if `tail_sampling` is enabled and more than one Agent instance could be processing spans for the same trace.
`load_balancing` will make sure that all spans of a given trace will be processed by the same Agent instance.
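
A minimal sketch of such a configuration is shown below. The headless service name, namespace, port, and the `always_sample` policy are assumptions:

```yaml
traces:
  configs:
    - name: default
      receivers:
        otlp:
          protocols:
            grpc:
      # Route all spans of a given trace to the same Agent instance.
      load_balancing:
        routing_key: traceID
        exporter:
          insecure: true
        resolver:
          dns:
            # Assumption: sampling Agents receive load-balanced spans on this port.
            hostname: agent-traces-headless.grafana-cloud-monitoring.svc.cluster.local
            port: 4318
      tail_sampling:
        policies:
          - type: always_sample
```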

### spanmetrics
<!-- TODO: Also talk about span metrics -->

All spans for a given `service.name` must be processed by the same `spanmetrics` Agent.
To make sure that this is the case, set up `load_balancing` with `routing_key: service`.
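
A sketch of the relevant fragment; the `handler_endpoint` value is an assumption:

```yaml
traces:
  configs:
    - name: default
      spanmetrics:
        # Assumption: generated metrics are exposed here for scraping.
        handler_endpoint: 0.0.0.0:8889
      # All spans with the same service.name go to the same Agent instance.
      load_balancing:
        routing_key: service
```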

### service_graphs
<!-- TODO: Also talk about service_graphs -->

It is challenging to scale `service_graphs` over multiple Agent instances.
* For `service_graphs` to work correctly, each "client" span must be paired
with a "server" span in order to calculate metrics such as span duration.
* If a "client" span goes to one Agent, but a "server" span goes to another Agent,
then no single Agent will be able to pair the spans and a metric won't be generated.

`load_balancing` can solve this problem partially if it is configured with `routing_key: traceID`.
* Each Agent will then be able to calculate a service graph for each "client"/"server" pair in a trace.
* However, it is possible to have a span with similar "server"/"client" values
in a different trace, processed by another Agent.
* If two different Agents process similar "server"/"client" spans,
they will generate the same service graph metric series.
* If the series from two Agents are the same, this will lead to issues
when writing them to the backend database.
* Users could differentiate the series by adding a label such as `"agent_id"`.
* Unfortunately, there is currently no method in the Agent to aggregate those series from different Agents and merge them into one series.
* A PromQL query could be used to aggregate the metrics from different Agents (a sketch follows at the end of this section).
* If the metrics are stored in Grafana Mimir, cardinality issues due to `"agent_id"` labels can be solved using [Adaptive Metrics][adaptive-metrics].

A simpler, more scalable alternative to generating service graph metrics in the Agent is to generate them entirely in the backend database.
For example, service graphs can be [generated][tempo-servicegraphs] in Grafana Cloud by the Tempo traces database.

[tempo-servicegraphs]: https://grafana.com/docs/tempo/latest/metrics-generator/service_graphs/
[adaptive-metrics]: https://grafana.com/docs/grafana-cloud/cost-management-and-billing/reduce-costs/metrics-costs/control-metrics-usage-via-adaptive-metrics/
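
As with the Flow example earlier, a PromQL query along these lines could merge the per-Agent series; the metric name and the `agent_id` label are assumptions:

```promql
# Sketch: drop the per-Agent label and sum the service graph series.
sum without (agent_id) (
  rate(traces_service_graph_request_total[5m])
)
```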

### Example Kubernetes configuration
{{< collapse title="Example Kubernetes configuration with DNS load balancing" >}}
@@ -352,9 +377,9 @@ data:
{{< /collapse >}}
You need to fill in correct OTLP credentials prior to running the above example.
The example above can be started by using k3d:
<!-- TODO: Link to the k3d page -->
You need to fill in correct OTLP credentials prior to running the above examples.
The example above can be started by using [k3d][]:
```bash
k3d cluster create grafana-agent-lb-test
kubectl apply -f kubernetes_config.yaml
@@ -364,3 +389,5 @@ To delete the cluster, run:
```bash
k3d cluster delete grafana-agent-lb-test
```

[k3d]: https://k3d.io/v5.6.0/
