Excessive job activations on remote agents due to missed ticks and PC time sync issues #416
Comments
I forked the project and removed the suspected code, leaving the logging in, and the issue is resolved. For reference, the scheduler (running in a Windows service) seemed to fall behind on certain PCs while they were logged out, until they were logged back in again. I'm not sure why, and I'm surprised no one else has run into this so far: it's not just one PC, and it seems to happen on any machine regardless of similarities between the PCs themselves. Would it be possible to make the "catch up" configurable? We NEVER want missed jobs to be run multiple times. Logs:
Thanks for reaching out with the detailed issue. This does seem like an "odd" behavior of any PC to have. Not only because Coravel would be affected, but this would also negatively affect many other things including:
Typically for production servers, you would want the system to remain logged in as the same user (not only for date/time purposes but also for security reasons around permissions, reproducibility, auditability, etc.). My first thought here is that this is not a Coravel issue but a bigger issue that could affect many other tools. We could look at adding new features/config to Coravel, but this really sounds like a bigger issue that should be solved? Thoughts?
Hi James, thanks for the response. For a bit of context, the software experiencing the issue is part of a remote monitoring & management system. It's the remote agent that is responsible for sending telemetry to our cloud infrastructure, so it is installed on a very wide range of machines. Some of these are production servers running backup software, firewalls, virtual machines, etc., and I would agree that if the issue were present on these machines, it's the machine that needs fixing, not our agent. However, most of the machines this Windows service is installed on are simple workstations that unfortunately may be on outdated versions of Windows, missing updates, and, strangely, have out-of-sync system clocks. Our use case is such that the agent needs to be as flexible as possible and work on as many systems as possible, including poorly maintained ones. I understand it's not Coravel's problem if the system clock is unreliable, and it's understandable that you have to draw a line somewhere. However, we would like to continue using Coravel if possible. It seems like a minor configuration adjustment would accommodate these edge cases, allowing Coravel to remain operational even on less reliable machines. Note:
Describe the bug
We are experiencing an issue where jobs scheduled to run at regular intervals (e.g., once per minute using cron syntax * * * * *) occasionally trigger far too many times, seemingly at random, on certain remote agents. This tends to happen shortly after a cold start of the Windows service running Coravel. Instead of running the job once per minute, the job can be triggered tens or even hundreds of times within a very short span, and then the behaviour corrects itself without requiring a code change or service restart.
The issue has been difficult to reproduce consistently, but we have observed a pattern: the agents experiencing this issue had incorrect local PC times (out of sync with UTC). After thoroughly stripping down our application to isolate the problem (reducing the job to a simple IInvocable that logs a message), we suspect the issue is related to Coravel's "missed ticks" functionality, potentially exacerbated by time synchronization problems. The problem disappears when Coravel is replaced with a simple BackgroundService and a Task.Delay().
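For readers unfamiliar with that workaround, a loop of this kind might look like the minimal sketch below. It is an illustration only, not the actual service code: `RunEveryAsync` is a hypothetical helper name, and in a real `BackgroundService` the loop body would sit inside `ExecuteAsync` and use the host's stopping token.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of Coravel or .NET): runs `job`, then waits
// `interval`, and repeats until the token is cancelled.
static async Task RunEveryAsync(TimeSpan interval, Func<Task> job, CancellationToken ct)
{
    while (!ct.IsCancellationRequested)
    {
        await job();
        try
        {
            // Relative wait measured from the end of the job; nothing here
            // tracks or replays intervals that were missed.
            await Task.Delay(interval, ct);
        }
        catch (OperationCanceledException)
        {
            break; // cancellation requested while waiting
        }
    }
}
```

Because each iteration only waits a relative delay, this loop has no concept of a missed tick, so it has nothing to catch up on. The trade-off is that activations drift by the job's own run time instead of staying aligned to the wall clock.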
We've identified this section of Coravel's code as a potential cause, which is responsible for catching up missed ticks:
Affected Coravel Feature
Scheduling
Expected behaviour
Jobs should run at their scheduled intervals without excessive activations, regardless of any temporary local time drift or sync issues.
Actual behaviour
On affected agents, jobs scheduled to run once per minute occasionally trigger tens or hundreds of times within the span of a few seconds. After this burst of activity, the job resumes its correct schedule.
Possible solutions
Adding a configuration option, similar to the logging one ("Coravel:Schedule:LogTickCatchUp"), to prevent Coravel from activating jobs while catching up with missed ticks. I would imagine other people have scheduling requirements that are time sensitive, i.e. if I miss 10 activations, it still only makes sense to trigger the job once now, not 10 times for the previously missed activations.
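To make the requested behaviour concrete, here is a minimal sketch of the two modes for a once-per-minute job. It is a hypothetical model and not Coravel's implementation: `DueActivations` and its `catchUp` flag are made-up names, and it assumes that a catching-up scheduler fires once for every whole minute between its last tick and the current clock reading.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical model: which per-minute activations are due when the scheduler
// last ticked at `lastTick` and the system clock now reads `now`.
static List<DateTime> DueActivations(DateTime lastTick, DateTime now, bool catchUp)
{
    var due = new List<DateTime>();

    // First whole-minute boundary strictly after the last tick.
    var next = new DateTime(lastTick.Year, lastTick.Month, lastTick.Day,
                            lastTick.Hour, lastTick.Minute, 0, lastTick.Kind)
               .AddMinutes(1);

    for (var t = next; t <= now; t = t.AddMinutes(1))
    {
        due.Add(t);
    }

    // Without catch-up, collapse everything that was missed into one activation.
    if (!catchUp && due.Count > 1)
    {
        due.RemoveRange(0, due.Count - 1);
    }
    return due;
}

// Usage: the clock appears to jump two hours ahead between two ticks.
var last = new DateTime(2024, 1, 1, 10, 0, 30, DateTimeKind.Utc);
var current = last.AddHours(2);
Console.WriteLine(DueActivations(last, current, catchUp: true).Count);  // 120
Console.WriteLine(DueActivations(last, current, catchUp: false).Count); // 1
```

Under this model, a two-hour forward correction of the clock yields 120 back-to-back activations with catch-up enabled, which is in line with the bursts described above, and a single activation with it disabled.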