
Goroutine/Memory leak in Loki #311

Closed · nickelghost opened this issue Nov 3, 2023 · 9 comments · Fixed by #1431

Labels: bug (Something isn't working), frozen-due-to-age

Comments

@nickelghost

What's wrong?

We're seeing constant memory growth in our Grafana Agent flow deployment. After attaching a profiler, it looks like there is a goroutine leak in the Loki processing path, which also causes the CRI stage to leak memory. We're collecting Kubernetes pod logs from inside the cluster. We had a larger leak before, which turned out to be caused by trying to tail log files that didn't exist; scoping the Kubernetes discovery to the current node helped with that. The remaining leaks happen mostly on dev environments, so I suspect targets are not being cleaned up once pods are deleted.

Although this might be a configuration error that still resolves to log files that don't exist, it would be great if Grafana Agent handled that case gracefully.

[Screenshot 2023-11-03 at 12 44 39]

Attached profiles: goroutine-pprof.pb.gz, inuse_space-pprof.pb.gz

Steps to reproduce

Apply the config below in a dynamic Kubernetes environment and observe. It can take about 24 hours for the memory usage to accumulate significantly.

System information

GKE, amd64

Software version

0.37.2

Configuration

// grafana-agent-flow.river
logging {
	level  = "warn"
	format = "logfmt"
}

module.file "env" {
	filename = "/etc/agent/env.river"
}

module.file "all_pods" {
	filename = "/etc/agent/all-pods.river"
}

module.file "loki" {
	filename = "/etc/agent/loki.river"

	arguments {
		project             = module.file.env.exports.project
		project_id          = module.file.env.exports.project_id
		cluster             = module.file.env.exports.cluster
		all_pods_targets    = module.file.all_pods.exports.targets
		gcplog_subscription = module.file.env.exports.gcplog_subscription
	}
}

module.file "tempo" {
	filename = "/etc/agent/tempo.river"

	arguments {
		project = module.file.env.exports.project
		cluster = module.file.env.exports.cluster
	}
}

module.file "pyroscope" {
	filename = "/etc/agent/pyroscope.river"

	arguments {
		project = module.file.env.exports.project
		cluster = module.file.env.exports.cluster
	}
}

// all-pods.river
discovery.kubernetes "this" {
	role = "pod"

	selectors {
		role  = "pod"
		field = "spec.nodeName=" + env("KUBE_NODE_NAME")
	}
}

module.file "relabel" {
	filename = "/etc/agent/relabel.river"

	arguments {
		targets = discovery.kubernetes.this.targets
	}
}

export "targets" {
	value = module.file.relabel.exports.output
}

// loki.river
argument "project" { }

argument "project_id" { }

argument "cluster" { }

argument "all_pods_targets" { }

argument "gcplog_subscription" { }

loki.write "default" {
	endpoint {
		url = "REDACTED"
	}

	external_labels = {
		project = argument.project.value,
		cluster = argument.cluster.value,
	}
}

loki.process "this" {
	forward_to = [loki.write.default.receiver]

	stage.drop {
		older_than          = "3h"
		drop_counter_reason = "too old"
	}
}

module.file "pods_log" {
	filename = "/etc/agent/pods-log.river"

	arguments {
		targets   = argument.all_pods_targets.value
		receivers = [loki.process.this.receiver]
	}
}

module.file "kubernetes_events_log" {
	filename = "/etc/agent/kubernetes-events-log.river"

	arguments {
		cluster   = argument.cluster.value
		receivers = [loki.process.this.receiver]
	}
}

module.file "gcp_log" {
	filename = "/etc/agent/gcp-log.river"

	arguments {
		project      = argument.project.value
		project_id   = argument.project_id.value
		subscription = argument.gcplog_subscription.value
		receivers    = [loki.process.this.receiver]
	}
}

// pods-log.river
argument "targets" { }

argument "receivers" { }

discovery.relabel "this" {
	targets = argument.targets.value

	rule {
		source_labels = ["__meta_kubernetes_pod_uid", "__meta_kubernetes_pod_container_name"]
		target_label  = "__path__"
		separator     = "/"
		replacement   = "/var/log/pods/*$1/*.log"
	}

	rule {
		source_labels = [
			"__meta_kubernetes_pod_annotationpresent_kubernetes_io_config_hash",
			"__meta_kubernetes_pod_annotation_kubernetes_io_config_hash",
			"__meta_kubernetes_pod_container_name",
		]
		regex        = "true/(.*)"
		target_label = "__path__"
		separator    = "/"
		replacement  = "/var/log/pods/*$1/*.log"
	}
}

local.file_match "logs" {
	path_targets = discovery.relabel.this.output
}

loki.source.file "this" {
	targets    = local.file_match.logs.targets
	forward_to = [loki.process.this.receiver]
}

loki.process "this" {
	forward_to = argument.receivers.value

	stage.cri { }

	// log level detection
	stage.match {
		selector = "{kind=\"node-js\"}"

		stage.json {
			expressions = {level = "", type = ""}
		}

		stage.labels {
			values = {level = "", type = ""}
		}

		stage.template {
			source   = "level"
			template = "{{ if eq .Value \"10\" }}{{ Replace .Value \"10\" \"trace\" -1 }}{{ else if eq .Value \"20\" }}{{ Replace .Value \"20\" \"debug\" -1 }}{{ else if eq .Value \"30\" }}{{ Replace .Value \"30\" \"info\" -1 }}{{ else if eq .Value \"40\" }}{{ Replace .Value \"40\" \"warn\" -1 }}{{ else if eq .Value \"50\" }}{{ Replace .Value \"50\" \"error\" -1 }}{{ else if eq .Value \"60\" }}{{ Replace .Value \"60\" \"fatal\" -1 }}{{ else }}{{ .Value }}{{ end }}"
		}

		stage.labels {
			values = {level = ""}
		}
	}
}

Logs

No response

nickelghost added the bug label on Nov 3, 2023
@nickelghost (Author)

Another thing we've noticed is that the number of goroutines increases every 5 minutes, by roughly 100 each time.
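
For anyone who wants to track this themselves, one option is to poll the agent's pprof endpoint and log the goroutine total over time. Below is a minimal sketch of such a watcher, assuming the agent's HTTP server listens on the flow-mode default of localhost:12345 (adjust the URL for your deployment); the file name and polling interval are just illustrative.

// watch_goroutines.go: polls the agent's pprof endpoint every 5 minutes
// and prints the goroutine total, so growth over time is easy to spot.
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"time"
)

func main() {
	// Assumes the default flow-mode HTTP listen address; change if yours differs.
	const url = "http://localhost:12345/debug/pprof/goroutine?debug=1"

	for {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Println("fetch failed:", err)
		} else {
			scanner := bufio.NewScanner(resp.Body)
			if scanner.Scan() {
				// The first line of the debug=1 output reads like
				// "goroutine profile: total 1234".
				fmt.Println(time.Now().Format(time.RFC3339), scanner.Text())
			}
			resp.Body.Close()
		}
		time.Sleep(5 * time.Minute)
	}
}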

@sgrzemski commented Nov 26, 2023

Hola!

After lots of debugging, I narrowed it down to the loki.process block in the pods-log.river file. Disabling the label templating and label appending resolved the goroutine leak. I tested around 30 different configurations of the match stage and its substages, and finally got it fixed with the following:

loki.process "this" {
	forward_to = argument.receivers.value

	stage.cri { }

	// log level detection
	stage.match {
		selector      = "{kind=\"node-js\"}"
		pipeline_name = "node-js-log-level"

		stage.json {
			expressions = {
				level = "level",
			}
		}

		stage.template {
			source   = "level"
			template = "{{ if eq .Value \"10\" }}{{ Replace .Value \"10\" \"trace\" -1 }}{{ else if eq .Value \"20\" }}{{ Replace .Value \"20\" \"debug\" -1 }}{{ else if eq .Value \"30\" }}{{ Replace .Value \"30\" \"info\" -1 }}{{ else if eq .Value \"40\" }}{{ Replace .Value \"40\" \"warn\" -1 }}{{ else if eq .Value \"50\" }}{{ Replace .Value \"50\" \"error\" -1 }}{{ else if eq .Value \"60\" }}{{ Replace .Value \"60\" \"fatal\" -1 }}{{ else }}{{ .Value }}{{ end }}"
		}

		stage.labels {
			values = {
				level = "level",
			}
		}
	}
}

It looks like setting the mapping keys explicitly was the key factor here. Once I set level = "level" in both stage.labels and stage.json, it stopped producing thousands of goroutines. When I left only stage.json and stage.template enabled, without appending the level label, the leak was not present either. It's probably somewhere around here: https://github.com/grafana/agent/blob/main/component/loki/process/stages/inspector.go#L49
Hope this helps @rfratto @tpaschalis @mattdurham
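
For readers less familiar with Go, the failure mode described above usually looks like the generic sketch below: a pipeline-stage goroutine that is never drained or stopped sticks around forever, and every rebuild of the pipeline adds more of them. This is not the agent's actual code; the names and structure are made up purely for illustration.

// leak_sketch.go: a deliberately broken channel pipeline that leaks one
// goroutine per "rebuild". Illustrative only, not agent code.
package main

import (
	"fmt"
	"runtime"
	"time"
)

// startStage launches a goroutine that forwards entries from in to out.
// If out is never drained (or in is never closed), the goroutine can
// never exit and is leaked.
func startStage(in <-chan string) <-chan string {
	out := make(chan string) // unbuffered: every send blocks until a receive
	go func() {
		for entry := range in {
			out <- entry // blocks forever once the consumer goes away
		}
		close(out)
	}()
	return out
}

func main() {
	// Simulate a pipeline being rebuilt repeatedly (e.g. on discovery or
	// config updates) without the old stages ever being stopped.
	for i := 0; i < 100; i++ {
		in := make(chan string, 1)
		in <- "log line"
		_ = startStage(in) // output never read, input never closed: leak
	}
	time.Sleep(100 * time.Millisecond)
	fmt.Println("goroutines now:", runtime.NumGoroutine())
}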

github-actions bot (Contributor)

This issue has not had any activity in the past 30 days, so the needs-attention label has been added to it.
If the opened issue is a bug, check to see if a newer release fixed your issue. If it is no longer relevant, please feel free to close this issue.
The needs-attention label signals to maintainers that something has fallen through the cracks. No action is needed by you; your issue will be kept open and you do not have to respond to this comment. The label will be removed the next time this job runs if there is new activity.
Thank you for your contributions!

@tpaschalis (Member)

Hey there @nickelghost, @sgrzemski, apologies for the belated response; most of the team was out during the holidays.

Just to make sure, have you two been working together on this? Is the experiment by @sgrzemski applicable to the use case and setup of the first report by @nickelghost, or entirely different?

@nickelghost (Author)

Hi @tpaschalis

No worries. Yes indeed, this is the same use case, we're working together.

Cheers

@rfratto (Member) commented Apr 11, 2024

Hi there 👋

On April 9, 2024, Grafana Labs announced Grafana Alloy, the spiritual successor to Grafana Agent and the final form of Grafana Agent flow mode. As a result, Grafana Agent has been deprecated and will only receive bug and security fixes until its end-of-life around November 1, 2025.

To make things easier for maintainers, we're in the process of migrating all issues tagged variant/flow to the Grafana Alloy repository to have a single home for tracking issues. This issue is likely something we'll want to address in both Grafana Alloy and Grafana Agent, so just because it's being moved doesn't mean we won't address the issue in Grafana Agent :)

rfratto transferred this issue from grafana/agent on Apr 11, 2024
@ptodev (Collaborator) commented Aug 7, 2024

Hi, @nickelghost and @sgrzemski!👋 Apologies for the late response!

I believe this memory leak is caused by two bugs. I opened two PRs to fix them - #1426 and #1431.

@elcomtik

Will this get into the grafana-agent as well?

@ptodev (Collaborator) commented Aug 20, 2024

Hi @elcomtik, yes I opened a PR for the Agent too.

github-actions bot locked as resolved and limited conversation to collaborators on Sep 20, 2024