Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excessive job activations on remote agents due to missed ticks and PC time sync issues #416

Open
Alex-stewart1 opened this issue Oct 23, 2024 · 3 comments

Comments

@Alex-stewart1
Copy link

Describe the bug
We are experiencing an issue where jobs scheduled to run at regular intervals (e.g., once per minute using cron syntax * * * * *) occasionally trigger far too many times, seemingly at random, on certain remote agents. This tends to happen shortly after a cold start of the Windows service running Coravel. Instead of running the job once per minute, the job can be triggered tens or even hundreds of times within a very short span, and then the behaviour corrects itself without requiring a code change or service restart.

The issue has been difficult to consistently reproduce, but we have observed a pattern where the agents experiencing this issue had incorrect local PC times (out of sync with UTC). After thoroughly stripping down our application to isolate the problem (reducing the job to a simple IInvocable that logs a message), we suspect the issue is related to Coravel's "missed ticks" functionality, potentially exacerbated by time synchronization problems. The problem disappears when Coravel is replaced with a simple BackgroundService and a Task.Delay()

We've identified this section of Coravel's code as a potential cause, which is responsible for catching up missed ticks:

private async void RunSchedulerPerSecondAsync(object state)
        {
            if (this._schedulerEnabled)
            {
                // This will get any missed ticks that might arise from the Timer triggering a little too late. 
                // If under CPU load or if the Timer is for some reason a little slow, then it's possible to
                // miss a tick - which we want to make sure the scheduler doesn't miss and catches up.
                var now = DateTime.UtcNow;
                DateTime[] ticks = null;
                lock (_tickLockObj)
                {
                    // This class isn't thread-safe.
                    ticks = this._ensureContinuousSecondTicks.GetTicksBetweenPreviousAndNext(now).ToArray();
                    this._ensureContinuousSecondTicks.SetNextTick(now);
                }

                if (ticks.Length > 0)
                {       
                    if(this._config.GetValue<bool?>("Coravel:Schedule:LogTickCatchUp") ?? false) 
                    {        
                        this._logger.LogInformation($"Coravel's scheduler is behind {ticks.Length} ticks and is catching-up to the current tick. Triggered at {now.ToString("o")}.");
                    }
                    
                    foreach (var tick in ticks)
                    {
                        await this._scheduler.RunAtAsync(tick);
                    }
                }

                // If we've processed any missed ticks, we also need to explicitly run the current tick.
                await this._scheduler.RunAtAsync(now);
            }
        }

Affected Coravel Feature
Scheduling

Expected behaviour
Jobs should run at their scheduled intervals without excessive activations, regardless of any temporary local time drift or sync issues.

Actual behaviour
On affected agents, jobs scheduled to run once per minute occasionally trigger tens or hundreds of times within the span of a few seconds. After this burst of activity, the job resumes its correct schedule.

Possible solutions
Adding a configure option similar to the logging ("Coravel:Schedule:LogTickCatchUp") to prevent Coravel from activating jobs while catching up with missed ticks. I would imagine other people have scheduling requirements that are time sensitive, i.e. if I miss 10 activation it still only make sense to trigger the job once now, not 10 times for the previously missed activations

@Alex-stewart1 Alex-stewart1 changed the title Excessive Job Activations on Remote Agents Due to Missed Ticks and PC Time Sync Issues Excessive job activations on remote agents due to missed ticks and PC time sync issues Oct 23, 2024
@Alex-stewart1
Copy link
Author

Alex-stewart1 commented Oct 25, 2024

I forked the project and removed the suspected code, leaving in the logging and the issue is resolved. For reference the scheduler (running in a windows service) seemed to fall behind on certain PC's while they were logged out until they logged back in again, not sure why

I'm surprised no one else has run into this issue thus far, its not only one PC, seems to happen on any regardless of similarities between the PC's themselves.

Would it be possible to make the "catch up" configurable, as we NEVER want missed jobs to be ran multiple times

Logs:
Coravel's scheduler is behind 56137 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:06:42.3044403Z.
Coravel's scheduler is behind 55952 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:14:39.5139747Z.
Coravel's scheduler is behind 55947 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:07:01.2471406Z.
Coravel's scheduler is behind 55576 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:31:36.4646435Z.
Coravel's scheduler is behind 55513 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:27:35.4275124Z.
Coravel's scheduler is behind 55449 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:27:20.7338323Z.

@jamesmh
Copy link
Owner

jamesmh commented Oct 29, 2024

Thanks for reaching out with the detailed issue.

This does seem like an "odd" behavior of any PC to have. Not only because Coravel would be affected, but this would also negatively affect many other things including:

  • Logging times
  • Data creation/modification timestamps
  • Any external cron jobs
  • Windows task scheduler jobs
  • Scheduled messages in messaging systems like rabbitMQ or a library on top of messaging like NServiceBus
  • Potentially cache system timeouts depending on the algorithms used (.g Redis, Memcached, etc.)
  • etc.

Typically for production servers, you would want the system to remain logged in as the same user (not only for date/time purposes but also for security reasons around permissions, reproducibility, auditability, etc.)

My first thought here is that this is not a Coravel issue but a bigger issue that could affect many other tools, etc.

We could look to adding new features/config to Coravel but this really sounds like a bigger issue that should be solved?

Thoughts?

@Alex-stewart1
Copy link
Author

Alex-stewart1 commented Oct 29, 2024

Hi James, thanks for the response.

For a bit of context the software experiencing the issue is part of a remote monitoring & management system. Its the remote agent that is responsible for sending telemetry to our cloud infrastructure and so is installed on a very wide range of machines. Some of these are production servers running backup software, firewalls, virtual machines etc and I would agree if the issue was present on these machines its the machine that needs fixing not our agent. However, most of the machines this windows service is installed on are simple workstations that unfortunately may be on outdated versions of windows, missing updates, and strangely have out of sync system clocks. Our use case is such that the agent needs to be as flexible as possible and work on as many systems (including ones that are poorly maintained) as possible.

I understand it's not Coravel's problem if the system clock is unreliable, and it's understandable that you have to draw a line somewhere. However, we would like to continue using Coravel if possible. It seems like a minor configuration adjustment would accommodate these edge cases, allowing Coravel to remain operational even on less reliable machines.

Note:
The logs provided in the previous comment all came from different machines. That was copy and pasted from one central source. Reading it back it was not obvious if the logs provided we're from a single machine of multiple

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants