Slow Synchronous WAL Plugins can lock system #26003

Open
jacksonrnewhouse opened this issue Feb 12, 2025 · 1 comment
@jacksonrnewhouse
Contributor

While working on asynchronous execution of plugins, I also explored the behavior of plugins running synchronously. There's an issue with WAL contents plugins: once the channel we send Arc<WalContents> along fills up, writing WAL files is blocked until the next run of the plugin.

For a concrete example, assume you're regularly writing data with the default WAL write interval of 1 second.

If the plugin is

from time import sleep


def process_writes(influxdb3_local, table_batches, args=None):
    # Simulate a slow plugin: every invocation takes 61 seconds.
    sleep(61)

then after 61 seconds you'll have written 61 WAL files, but the plugin will have processed only 1. This means the channel for PluginEvents is full, and the next WAL file's notify() call on ProcessingEngineManagerImpl will block for 61 seconds.
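A minimal sketch of the stall, modelling the bounded channel with Python's `queue.Queue`. This is not the server's Rust code: the capacity of 4, the `flush_wal_files` helper, and the short timeout (used only so the demo returns instead of hanging) are all assumptions made for illustration.

```python
import queue

# Hypothetical capacity; the real channel size is not stated in this issue.
CHANNEL_CAPACITY = 4


def flush_wal_files(channel, count, send_timeout=0.01):
    """Enqueue up to `count` WAL files while nothing drains the channel.

    Returns how many files were enqueued before a send would have blocked.
    """
    for sent in range(count):
        try:
            # A blocking put() stands in for send(); the timeout only exists
            # so this demo gives up instead of waiting for the plugin.
            channel.put(f"wal-{sent}", timeout=send_timeout)
        except queue.Full:
            return sent
    return count


channel = queue.Queue(maxsize=CHANNEL_CAPACITY)
enqueued = flush_wal_files(channel, count=61)
print(f"enqueued {enqueued} of 61 WAL files before the sender stalled")
```

With no plugin run completing in the meantime, the sender stalls as soon as the queue holds `CHANNEL_CAPACITY` items, which is the situation the 61-second plugin produces.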

There are a few options for us, including:

  1. Use an unbounded channel. However, this just pushes the problem off: the channel will fill up with Arc<WalContents>, effectively leaking memory for every WalContents that has been written.
  2. Replace the send() call with try_send(), which errors out when the channel is full. From there we could (A) skip sending the contents with a warning, (B) fail the plugin, or (C) fail the server.
  3. Institute some mandatory timeout on plugins, and then pick from a similar set of options as above.
  4. Make completion of Wal plugins a precondition of starting the next wal file. This would at least immediately expose the issue.
  5. Only have async execution.
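Option 2(A) could look roughly like the following. Again this is a Python stand-in rather than the actual Rust change: `put_nowait()` plays the role of `try_send()`, and the `notify` function, logger name, and capacity of 2 are hypothetical.

```python
import logging
import queue

logger = logging.getLogger("wal_plugin_sketch")  # hypothetical logger name


def notify(channel, wal_contents):
    """Non-blocking send: return True if queued, False if skipped."""
    try:
        channel.put_nowait(wal_contents)  # analogue of try_send()
        return True
    except queue.Full:
        # Option (A): skip this WAL file for the plugin and warn about it.
        logger.warning("plugin channel full; skipping %s", wal_contents)
        return False


channel = queue.Queue(maxsize=2)  # assumed capacity, for illustration only
results = [notify(channel, f"wal-{i}") for i in range(5)]
dropped = results.count(False)
print(f"queued {results.count(True)}, dropped {dropped}")
```

The WAL writer never blocks here, at the cost of the plugin silently missing some flushes. Options (B) and (C) would replace the warning branch with disabling the plugin or shutting down.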

Another issue this exposes is that we use the same events channel to disable plugins, so a disable request will be backed up behind any WAL contents already queued in it.
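To illustrate the ordering problem, here is a small sketch of a FIFO channel where a disable request is queued behind pending WAL contents. The `DISABLE` marker and `drain_until_disabled` are hypothetical names, and `queue.Queue` again stands in for the real channel.

```python
import queue

DISABLE = object()  # hypothetical marker standing in for a "disable plugin" event


def drain_until_disabled(channel):
    """Handle events in FIFO order; return how many WAL items ran first."""
    processed = 0
    while True:
        event = channel.get_nowait()
        if event is DISABLE:
            return processed
        processed += 1  # in the real system this would be a full plugin run


channel = queue.Queue(maxsize=8)  # assumed capacity
for i in range(5):
    channel.put_nowait(f"wal-{i}")
channel.put_nowait(DISABLE)  # the disable request lands at the back

runs_before_disable = drain_until_disabled(channel)
print(f"plugin ran {runs_before_disable} more times before seeing the disable")
```

With the 61-second plugin from the example, each of those queued runs would delay the disable by another 61 seconds.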

@jacksonrnewhouse jacksonrnewhouse self-assigned this Feb 12, 2025
@pauldix
Member

pauldix commented Feb 12, 2025

In #25947 I logged that WAL triggers should be configurable so that they either run in order (which, for the example you give, would only produce an error after some period of time) or run in parallel, meaning each WAL flush spawns a new trigger execution. So in your example, once you've been running for a minute, you'd have 61 of those running in parallel.

When the buffer fills up under in-order execution, the behavior should be either to stop accepting writes until the buffer has more room, or to log the error and move on (i.e. some WAL flush doesn't get processed).
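The difference between the two modes can be sketched with a thread pool, where one worker approximates in-order execution and one worker per flush approximates parallel execution. This is only an illustration: `run_triggers` is a hypothetical helper, the sleep is a stand-in for slow plugin work, and the executor's internal queue is unbounded, so the buffer-full case is not modelled here.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def run_triggers(flushes, max_workers, plugin_seconds=0.1):
    """Run one plugin call per WAL flush; return the peak number running at once."""
    lock = threading.Lock()
    running = 0
    peak = 0

    def plugin(_flush):
        nonlocal running, peak
        with lock:
            running += 1
            peak = max(peak, running)
        time.sleep(plugin_seconds)  # stand-in for slow plugin work
        with lock:
            running -= 1

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(plugin, flushes))  # wait for every run to finish
    return peak


in_order_peak = run_triggers(range(4), max_workers=1)  # one run at a time
parallel_peak = run_triggers(range(4), max_workers=4)  # one run per flush
print(f"in-order peak: {in_order_peak}, parallel peak: {parallel_peak}")
```

In-order mode never has more than one run in flight, so a slow plugin builds a backlog. Parallel mode keeps up with flushes but lets the number of concurrent runs grow with plugin latency.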
