Excessive job activations on remote agents due to missed ticks and PC time sync issues #416

Alex-stewart1 · 2024-10-23T13:50:10Z

Describe the bug
We are experiencing an issue where jobs scheduled to run at regular intervals (e.g., once per minute using cron syntax * * * * *) occasionally trigger far too many times, seemingly at random, on certain remote agents. This tends to happen shortly after a cold start of the Windows service running Coravel. Instead of running the job once per minute, the job can be triggered tens or even hundreds of times within a very short span, and then the behaviour corrects itself without requiring a code change or service restart.

The issue has been difficult to consistently reproduce, but we have observed a pattern where the agents experiencing this issue had incorrect local PC times (out of sync with UTC). After thoroughly stripping down our application to isolate the problem (reducing the job to a simple IInvocable that logs a message), we suspect the issue is related to Coravel's "missed ticks" functionality, potentially exacerbated by time synchronization problems. The problem disappears when Coravel is replaced with a simple BackgroundService and a Task.Delay()

We've identified this section of Coravel's code as a potential cause, which is responsible for catching up missed ticks:

private async void RunSchedulerPerSecondAsync(object state)
        {
            if (this._schedulerEnabled)
            {
                // This will get any missed ticks that might arise from the Timer triggering a little too late. 
                // If under CPU load or if the Timer is for some reason a little slow, then it's possible to
                // miss a tick - which we want to make sure the scheduler doesn't miss and catches up.
                var now = DateTime.UtcNow;
                DateTime[] ticks = null;
                lock (_tickLockObj)
                {
                    // This class isn't thread-safe.
                    ticks = this._ensureContinuousSecondTicks.GetTicksBetweenPreviousAndNext(now).ToArray();
                    this._ensureContinuousSecondTicks.SetNextTick(now);
                }

                if (ticks.Length > 0)
                {       
                    if(this._config.GetValue<bool?>("Coravel:Schedule:LogTickCatchUp") ?? false) 
                    {        
                        this._logger.LogInformation($"Coravel's scheduler is behind {ticks.Length} ticks and is catching-up to the current tick. Triggered at {now.ToString("o")}.");
                    }
                    
                    foreach (var tick in ticks)
                    {
                        await this._scheduler.RunAtAsync(tick);
                    }
                }

                // If we've processed any missed ticks, we also need to explicitly run the current tick.
                await this._scheduler.RunAtAsync(now);
            }
        }

Affected Coravel Feature
Scheduling

Expected behaviour
Jobs should run at their scheduled intervals without excessive activations, regardless of any temporary local time drift or sync issues.

Actual behaviour
On affected agents, jobs scheduled to run once per minute occasionally trigger tens or hundreds of times within the span of a few seconds. After this burst of activity, the job resumes its correct schedule.

Possible solutions
Adding a configure option similar to the logging ("Coravel:Schedule:LogTickCatchUp") to prevent Coravel from activating jobs while catching up with missed ticks. I would imagine other people have scheduling requirements that are time sensitive, i.e. if I miss 10 activation it still only make sense to trigger the job once now, not 10 times for the previously missed activations

The text was updated successfully, but these errors were encountered:

Alex-stewart1 · 2024-10-25T08:53:29Z

I forked the project and removed the suspected code, leaving in the logging and the issue is resolved. For reference the scheduler (running in a windows service) seemed to fall behind on certain PC's while they were logged out until they logged back in again, not sure why

I'm surprised no one else has run into this issue thus far, its not only one PC, seems to happen on any regardless of similarities between the PC's themselves.

Would it be possible to make the "catch up" configurable, as we NEVER want missed jobs to be ran multiple times

Logs:
Coravel's scheduler is behind 56137 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:06:42.3044403Z.
Coravel's scheduler is behind 55952 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:14:39.5139747Z.
Coravel's scheduler is behind 55947 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:07:01.2471406Z.
Coravel's scheduler is behind 55576 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:31:36.4646435Z.
Coravel's scheduler is behind 55513 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:27:35.4275124Z.
Coravel's scheduler is behind 55449 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:27:20.7338323Z.

jamesmh · 2024-10-29T12:51:36Z

Thanks for reaching out with the detailed issue.

This does seem like an "odd" behavior of any PC to have. Not only because Coravel would be affected, but this would also negatively affect many other things including:

Logging times
Data creation/modification timestamps
Any external cron jobs
Windows task scheduler jobs
Scheduled messages in messaging systems like rabbitMQ or a library on top of messaging like NServiceBus
Potentially cache system timeouts depending on the algorithms used (.g Redis, Memcached, etc.)
etc.

Typically for production servers, you would want the system to remain logged in as the same user (not only for date/time purposes but also for security reasons around permissions, reproducibility, auditability, etc.)

My first thought here is that this is not a Coravel issue but a bigger issue that could affect many other tools, etc.

We could look to adding new features/config to Coravel but this really sounds like a bigger issue that should be solved?

Thoughts?

Alex-stewart1 · 2024-10-29T14:37:07Z

Hi James, thanks for the response.

For a bit of context the software experiencing the issue is part of a remote monitoring & management system. Its the remote agent that is responsible for sending telemetry to our cloud infrastructure and so is installed on a very wide range of machines. Some of these are production servers running backup software, firewalls, virtual machines etc and I would agree if the issue was present on these machines its the machine that needs fixing not our agent. However, most of the machines this windows service is installed on are simple workstations that unfortunately may be on outdated versions of windows, missing updates, and strangely have out of sync system clocks. Our use case is such that the agent needs to be as flexible as possible and work on as many systems (including ones that are poorly maintained) as possible.

I understand it's not Coravel's problem if the system clock is unreliable, and it's understandable that you have to draw a line somewhere. However, we would like to continue using Coravel if possible. It seems like a minor configuration adjustment would accommodate these edge cases, allowing Coravel to remain operational even on less reliable machines.

Note:
The logs provided in the previous comment all came from different machines. That was copy and pasted from one central source. Reading it back it was not obvious if the logs provided we're from a single machine of multiple

Alex-stewart1 changed the title ~~Excessive Job Activations on Remote Agents Due to Missed Ticks and PC Time Sync Issues~~ Excessive job activations on remote agents due to missed ticks and PC time sync issues Oct 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Excessive job activations on remote agents due to missed ticks and PC time sync issues #416

Excessive job activations on remote agents due to missed ticks and PC time sync issues #416

Alex-stewart1 commented Oct 23, 2024

Alex-stewart1 commented Oct 25, 2024 •

edited

Loading

jamesmh commented Oct 29, 2024

Alex-stewart1 commented Oct 29, 2024 •

edited

Loading

Excessive job activations on remote agents due to missed ticks and PC time sync issues #416

Excessive job activations on remote agents due to missed ticks and PC time sync issues #416

Comments

Alex-stewart1 commented Oct 23, 2024

Alex-stewart1 commented Oct 25, 2024 • edited Loading

jamesmh commented Oct 29, 2024

Alex-stewart1 commented Oct 29, 2024 • edited Loading

Alex-stewart1 commented Oct 25, 2024 •

edited

Loading

Alex-stewart1 commented Oct 29, 2024 •

edited

Loading