Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Excessive job activations on remote agents due to missed ticks and PC time sync issues #416

Open
Alex-stewart1 opened this issue Oct 23, 2024 · 1 comment

Comments

@Alex-stewart1
Copy link

Describe the bug
We are experiencing an issue where jobs scheduled to run at regular intervals (e.g., once per minute using cron syntax * * * * *) occasionally trigger far too many times, seemingly at random, on certain remote agents. This tends to happen shortly after a cold start of the Windows service running Coravel. Instead of running the job once per minute, the job can be triggered tens or even hundreds of times within a very short span, and then the behaviour corrects itself without requiring a code change or service restart.

The issue has been difficult to consistently reproduce, but we have observed a pattern where the agents experiencing this issue had incorrect local PC times (out of sync with UTC). After thoroughly stripping down our application to isolate the problem (reducing the job to a simple IInvocable that logs a message), we suspect the issue is related to Coravel's "missed ticks" functionality, potentially exacerbated by time synchronization problems. The problem disappears when Coravel is replaced with a simple BackgroundService and a Task.Delay()

We've identified this section of Coravel's code as a potential cause, which is responsible for catching up missed ticks:

private async void RunSchedulerPerSecondAsync(object state)
        {
            if (this._schedulerEnabled)
            {
                // This will get any missed ticks that might arise from the Timer triggering a little too late. 
                // If under CPU load or if the Timer is for some reason a little slow, then it's possible to
                // miss a tick - which we want to make sure the scheduler doesn't miss and catches up.
                var now = DateTime.UtcNow;
                DateTime[] ticks = null;
                lock (_tickLockObj)
                {
                    // This class isn't thread-safe.
                    ticks = this._ensureContinuousSecondTicks.GetTicksBetweenPreviousAndNext(now).ToArray();
                    this._ensureContinuousSecondTicks.SetNextTick(now);
                }

                if (ticks.Length > 0)
                {       
                    if(this._config.GetValue<bool?>("Coravel:Schedule:LogTickCatchUp") ?? false) 
                    {        
                        this._logger.LogInformation($"Coravel's scheduler is behind {ticks.Length} ticks and is catching-up to the current tick. Triggered at {now.ToString("o")}.");
                    }
                    
                    foreach (var tick in ticks)
                    {
                        await this._scheduler.RunAtAsync(tick);
                    }
                }

                // If we've processed any missed ticks, we also need to explicitly run the current tick.
                await this._scheduler.RunAtAsync(now);
            }
        }

Affected Coravel Feature
Scheduling

Expected behaviour
Jobs should run at their scheduled intervals without excessive activations, regardless of any temporary local time drift or sync issues.

Actual behaviour
On affected agents, jobs scheduled to run once per minute occasionally trigger tens or hundreds of times within the span of a few seconds. After this burst of activity, the job resumes its correct schedule.

Possible solutions
Adding a configure option similar to the logging ("Coravel:Schedule:LogTickCatchUp") to prevent Coravel from activating jobs while catching up with missed ticks. I would imagine other people have scheduling requirements that are time sensitive, i.e. if I miss 10 activation it still only make sense to trigger the job once now, not 10 times for the previously missed activations

@Alex-stewart1 Alex-stewart1 changed the title Excessive Job Activations on Remote Agents Due to Missed Ticks and PC Time Sync Issues Excessive job activations on remote agents due to missed ticks and PC time sync issues Oct 23, 2024
@Alex-stewart1
Copy link
Author

Alex-stewart1 commented Oct 25, 2024

I forked the project and removed the suspected code, leaving in the logging and the issue is resolved. For reference the scheduler (running in a windows service) seemed to fall behind on certain PC's while they were logged out until they logged back in again, not sure why

I'm surprised no one else has run into this issue thus far, its not only one PC, seems to happen on any regardless of similarities between the PC's themselves.

Would it be possible to make the "catch up" configurable, as we NEVER want missed jobs to be ran multiple times

Logs:
Coravel's scheduler is behind 56137 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:06:42.3044403Z.
Coravel's scheduler is behind 55952 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:14:39.5139747Z.
Coravel's scheduler is behind 55947 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:07:01.2471406Z.
Coravel's scheduler is behind 55576 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:31:36.4646435Z.
Coravel's scheduler is behind 55513 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:27:35.4275124Z.
Coravel's scheduler is behind 55449 ticks and is catching-up to the current tick. Triggered at 2024-10-25T07:27:20.7338323Z.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant