
Goroutine/Memory leak in Loki #311

Closed · nickelghost opened this issue Nov 3, 2023 · 9 comments · Fixed by #1431

Labels: bug (Something isn't working), frozen-due-to-age

Comments

@nickelghost

What's wrong?

We're seeing constant memory growth in our Grafana Agent flow deployment. After attaching a profiler, it looks like there is a goroutine leak in the Loki processing path, which also causes the CRI stage to leak memory. We're collecting Kubernetes pod logs from inside the cluster. We had a larger leak before, which turned out to be caused by trying to tail log files that didn't exist; scoping the Kubernetes discovery to the current node helped with that. The remaining leaks happen mostly on dev environments, so I suspect targets are not being cleaned up once pods are deleted.

Although this might be a configuration error that still resolves to log files that don't exist, it would be great if Grafana Agent handled that case gracefully.

[Screenshot 2023-11-03 at 12 44 39]

Attached profiles: goroutine-pprof.pb.gz, inuse_space-pprof.pb.gz

Steps to reproduce

Apply the config below in a dynamic Kubernetes environment and observe. It can take about 24 hours for the memory usage to accumulate significantly.

System information

GKE, amd64

Software version

0.37.2

Configuration

// grafana-agent-flow.river
logging {
	level  = "warn"
	format = "logfmt"
}

module.file "env" {
	filename = "/etc/agent/env.river"
}

module.file "all_pods" {
	filename = "/etc/agent/all-pods.river"
}

module.file "loki" {
	filename = "/etc/agent/loki.river"

	arguments {
		project             = module.file.env.exports.project
		project_id          = module.file.env.exports.project_id
		cluster             = module.file.env.exports.cluster
		all_pods_targets    = module.file.all_pods.exports.targets
		gcplog_subscription = module.file.env.exports.gcplog_subscription
	}
}

module.file "tempo" {
	filename = "/etc/agent/tempo.river"

	arguments {
		project = module.file.env.exports.project
		cluster = module.file.env.exports.cluster
	}
}

module.file "pyroscope" {
	filename = "/etc/agent/pyroscope.river"

	arguments {
		project = module.file.env.exports.project
		cluster = module.file.env.exports.cluster
	}
}

// all-pods.river
discovery.kubernetes "this" {
	role = "pod"

	selectors {
		role  = "pod"
		field = "spec.nodeName=" + env("KUBE_NODE_NAME")
	}
}

module.file "relabel" {
	filename = "/etc/agent/relabel.river"

	arguments {
		targets = discovery.kubernetes.this.targets
	}
}

export "targets" {
	value = module.file.relabel.exports.output
}

// loki.river
argument "project" { }

argument "project_id" { }

argument "cluster" { }

argument "all_pods_targets" { }

argument "gcplog_subscription" { }

loki.write "default" {
	endpoint {
		url = "REDACTED"
	}

	external_labels = {
		project = argument.project.value,
		cluster = argument.cluster.value,
	}
}

loki.process "this" {
	forward_to = [loki.write.default.receiver]

	stage.drop {
		older_than          = "3h"
		drop_counter_reason = "too old"
	}
}

module.file "pods_log" {
	filename = "/etc/agent/pods-log.river"

	arguments {
		targets   = argument.all_pods_targets.value
		receivers = [loki.process.this.receiver]
	}
}

module.file "kubernetes_events_log" {
	filename = "/etc/agent/kubernetes-events-log.river"

	arguments {
		cluster   = argument.cluster.value
		receivers = [loki.process.this.receiver]
	}
}

module.file "gcp_log" {
	filename = "/etc/agent/gcp-log.river"

	arguments {
		project      = argument.project.value
		project_id   = argument.project_id.value
		subscription = argument.gcplog_subscription.value
		receivers    = [loki.process.this.receiver]
	}
}

// pods-log.river
argument "targets" { }

argument "receivers" { }

discovery.relabel "this" {
	targets = argument.targets.value

	rule {
		source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
		target_label  = "__path__"
		separator     = "/"
		replacement   = "/var/log/pods/*$1/*.log"
	}

	rule {
		source_labels = [
			"__meta_kubernetes_pod_annotationpresent_kubernetes_io_config_hash",
			"__meta_kubernetes_pod_annotation_kubernetes_io_config_hash",
			"__meta_kubernetes_pod_container_name",
		]
		regex        = "true/(.*)"
		target_label = "__path__"
		separator    = "/"
		replacement  = "/var/log/pods/*$1/*.log"
	}
}

local.file_match "logs" {
	path_targets = discovery.relabel.this.output
}

loki.source.file "this" {
	targets    = local.file_match.logs.targets
	forward_to = [loki.process.this.receiver]
}

loki.process "this" {
	forward_to = argument.receivers.value

	stage.cri { }

	// log level detection
	stage.match {
		selector = "{kind=\"node-js\"}"

		stage.json {
			expressions = {level = "", type = ""}
		}

		stage.labels {
			values = {level = "", type = ""}
		}

		stage.template {
			source   = "level"
			template = "{{ if eq .Value \"10\" }}{{ Replace .Value \"10\" \"trace\" -1 }}{{ else if eq .Value \"20\" }}{{ Replace .Value \"20\" \"debug\" -1 }}{{ else if eq .Value \"30\" }}{{ Replace .Value \"30\" \"info\" -1 }}{{ else if eq .Value \"40\" }}{{ Replace .Value \"40\" \"warn\" -1 }}{{ else if eq .Value \"50\" }}{{ Replace .Value \"50\" \"error\" -1 }}{{ else if eq .Value \"60\" }}{{ Replace .Value \"60\" \"fatal\" -1 }}{{ else }}{{ .Value }}{{ end }}"
		}

		stage.labels {
			values = {level = ""}
		}
	}
}

Logs

No response

nickelghost added the bug label on Nov 3, 2023
@nickelghost (Author)

Another thing we've noticed is that the number of goroutines increases every 5 minutes, by roughly 100 each time.
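
For anyone who wants to track this themselves, one option is to poll the agent's pprof endpoint and log the goroutine total over time. Below is a minimal sketch of such a watcher, assuming the agent's HTTP server listens on the flow-mode default of localhost:12345 (adjust the URL for your deployment); the file name and polling interval are just illustrative.

// watch_goroutines.go: polls the agent's pprof endpoint every 5 minutes
// and prints the goroutine total, so growth over time is easy to spot.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Assumes the default flow-mode HTTP listen address; change if yours differs.
	const url = "http://localhost:12345/debug/pprof/goroutine?debug=1"

	for {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("fetch failed:", err)
		} else {
			scanner := bufio.NewScanner(resp.Body)
			if scanner.Scan() {
				// The first line of the debug=1 output reads like
				// "goroutine profile: total 1234".
				fmt.Println(time.Now().Format(time.RFC3339), scanner.Text())
			}
			resp.Body.Close()
		}
		time.Sleep(5 * time.Minute)
	}
}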

@sgrzemski commented Nov 26, 2023

Hola!

After lots of debugging, I narrowed it down to the loki.process block in the pods-log.river file. Disabling the label templating and label appending resolved the goroutine leak. I tested around 30 different configurations of the match stage and its substages, and finally got it fixed with the following:

loki.process "this" {
	forward_to = argument.receivers.value

	stage.cri { }

	// log level detection
	stage.match {
		selector      = "{kind=\"node-js\"}"
		pipeline_name = "node-js-log-level"

		stage.json {
			expressions = {
				level = "level",
			}
		}

		stage.template {
			source   = "level"
			template = "{{ if eq .Value \"10\" }}{{ Replace .Value \"10\" \"trace\" -1 }}{{ else if eq .Value \"20\" }}{{ Replace .Value \"20\" \"debug\" -1 }}{{ else if eq .Value \"30\" }}{{ Replace .Value \"30\" \"info\" -1 }}{{ else if eq .Value \"40\" }}{{ Replace .Value \"40\" \"warn\" -1 }}{{ else if eq .Value \"50\" }}{{ Replace .Value \"50\" \"error\" -1 }}{{ else if eq .Value \"60\" }}{{ Replace .Value \"60\" \"fatal\" -1 }}{{ else }}{{ .Value }}{{ end }}"
		}

		stage.labels {
			values = {
				level = "level",
			}
		}
	}
}

It looks like setting the mapping keys explicitly was the key factor here. Once I set level = "level" in both stage.labels and stage.json, it stopped producing thousands of goroutines. When I left only stage.json and stage.template enabled, without appending the level label, the leak was not present either. It's probably somewhere around here: https://github.com/grafana/agent/blob/main/component/loki/process/stages/inspector.go#L49
Hope this helps @rfratto @tpaschalis @mattdurham
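
For readers less familiar with Go, the failure mode described above usually looks like the generic sketch below: a pipeline-stage goroutine that is never drained or stopped sticks around forever, and every rebuild of the pipeline adds more of them. This is not the agent's actual code; the names and structure are made up purely for illustration.

// leak_sketch.go: a deliberately broken channel pipeline that leaks one
// goroutine per "rebuild". Illustrative only, not agent code.
package main

import (
	"fmt"
	"runtime"
	"time"
)

// startStage launches a goroutine that forwards entries from in to out.
// If out is never drained (or in is never closed), the goroutine can
// never exit and is leaked.
func startStage(in <-chan string) <-chan string {
	out := make(chan string) // unbuffered: every send blocks until a receive
	go func() {
		for entry := range in {
			out <- entry // blocks forever once the consumer goes away
		}
		close(out)
	}()
	return out
}

func main() {
	// Simulate a pipeline being rebuilt repeatedly (e.g. on discovery or
	// config updates) without the old stages ever being stopped.
	for i := 0; i < 100; i++ {
		in := make(chan string, 1)
		in <- "log line"
		_ = startStage(in) // output never read, input never closed: leak
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines now:", runtime.NumGoroutine())
}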

github-actions bot (Contributor)

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

@tpaschalis (Member)

Hey there @nickelghost, @sgrzemski, apologies for the belated response; most of the team was out during the holidays.

Just to make sure, have you two been working together on this? Is the experiment by @sgrzemski applicable to the use case and setup of the first report by @nickelghost, or entirely different?

@nickelghost (Author)

Hi @tpaschalis

No worries. Yes indeed, this is the same use case, we're working together.

Cheers

@rfratto (Member) commented Apr 11, 2024

Hi there 👋

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only receive bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)

rfratto transferred this issue from grafana/agent on Apr 11, 2024
@ptodev (Collaborator) commented Aug 7, 2024

Hi, @nickelghost and @sgrzemski!👋 Apologies for the late response!

I believe this memory leak is caused by two bugs. I opened two PRs to fix them - #1426 and #1431.

@elcomtik

Will this get into the grafana-agent as well?

@ptodev (Collaborator) commented Aug 20, 2024

Hi @elcomtik, yes I opened a PR for the Agent too.

github-actions bot locked as resolved and limited conversation to collaborators on Sep 20, 2024