Create a new maintenance window in 30 seconds or less #19352

Closed
2 of 6 tasks
rachaelshaw opened this issue May 29, 2024 · 7 comments
Labels

  • ~dogfood: Issue resulted from Fleet's product dogfooding.
  • #g-endpoint-ops: Endpoint ops product group
  • :product: Product Design department (shows up on 🦢 Drafting board)
  • story: A user story defining an entire feature

Comments

@rachaelshaw
Member

rachaelshaw commented May 29, 2024

Goal

User story
As an end user who deletes or moves my maintenance window to a time in the past,
I want to see a new, future maintenance window in 30 seconds or less
so that I know downtime is still going to happen.

Context

Changes

Product

  • Changes:
    • Subscribe to calendar event changes.
    • When a user moves a maintenance window calendar event to the past or deletes it, create a new event within ~30 seconds. See example of what Reclaim does here.
  • Documentation:
    • The Fleet server watches for potential changes for up to 1 week after the original event time. If the event is moved forward by more than 1 week, then once that week has passed the Fleet server checks for event changes once every 30 minutes.
    • These near real-time updates may add load to the Google Calendar API, so we recommend setting up API usage alerts or other monitoring. If the Google API is overloaded, calendar updates and/or webhooks may be delayed.

Engineering

  • Database schema migrations: TODO
  • Load testing: TODO

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

  • Requires load testing: Yes
  • Risk level: High
  • Risk description: Performance risk when many calendar events are updated simultaneously.

Manual testing steps

  1. Set up load test environment for calendar testing (with FLEET_GOOGLE_CALENDAR_PLUS_ADDRESSING).
    • Since we are using the same calendar for multiple events, the calendars will generate many callbacks for each event change. So, for 100 events that are all changed at the same time, the Fleet server will see 100*? <= 10,000 callbacks. This is more than in real life. For load testing this feature, 100 events on each calendar should be sufficient, but we can try pushing it up to 1000 if things are behaving OK.
  2. Make some unrelated simultaneous change to the calendars (like create/delete a random event). Monitor DB and Redis for any spikes. All events should remain the same.
  3. Move all calendar events to the past (use move-events.go) -- make sure they are recreated.
  4. Delete events (use delete-events.go) -- make sure they are recreated.
  5. Redo the move/delete steps. While that is happening, also trigger the calendar cron job.
    • To make cron also refetch all calendar events, set the event update times to >30 minutes but <1 day earlier, like this MySQL command: update calendar_events set updated_at = '2024-07-22 12:21:31';
  6. Move all calendar events to current time -- make sure webhooks fire in the next ~5 minutes. Use Dave's Tines instance to receive webhooks since it has more bandwidth.

Testing notes

We now have a new tools/calendar/move-events/move-events.go script that can be used to check calendar events for users, including catching duplicates.

When creating a bunch of events on the same calendar, you may see these warnings on the server:

msg="Received calendar callback, but did not find corresponding event in database" event_uuid=6782ffb0-4d4b-4110-b458-3f8962c53d85 channel_id=0ac2ef07-272d-47e4-a8df-ae3ec63cb166

This occurs because callbacks arrive before the event has been saved in our DB. This is harmless and should not happen when only 1 event is being created on 1 calendar.

Confirmation

  1. Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. QA (@____): Added comment to user story confirming successful completion of QA.
@rachaelshaw rachaelshaw added :product Product Design department (shows up on 🦢 Drafting board) ~feature fest Will be reviewed at next Feature Fest ~dogfood Issue resulted from Fleet's product dogfooding. labels May 29, 2024
@noahtalerman noahtalerman removed the :product Product Design department (shows up on 🦢 Drafting board) label May 30, 2024
@noahtalerman noahtalerman changed the title from "Increase frequency of checking for changes to calendar events" to "Create a new calendar event in 30 seconds or less" May 30, 2024
@noahtalerman noahtalerman changed the title from "Create a new calendar event in 30 seconds or less" to "Create a new maintenance window in 30 seconds or less" May 30, 2024
@noahtalerman noahtalerman added story A user story defining an entire feature :product Product Design department (shows up on 🦢 Drafting board) #g-endpoint-ops Endpoint ops product group and removed ~feature fest Will be reviewed at next Feature Fest labels May 31, 2024
@rachaelshaw
Member Author

@noahtalerman is this a duplicate of #19491?

@rachaelshaw rachaelshaw assigned sharon-fdm and unassigned rachaelshaw Jun 5, 2024
@noahtalerman
Member

Not a duplicate for now. #19491 is a hidden config. This story is about subscribing to calendar. Then we can remove the config.

@sharon-fdm sharon-fdm added :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. and removed :product Product Design department (shows up on 🦢 Drafting board) labels Jun 25, 2024
@sharon-fdm sharon-fdm assigned getvictor and unassigned sharon-fdm Jun 25, 2024
@lukeheath lukeheath added this to the 4.54.0-tentative milestone Jul 3, 2024
getvictor added a commit that referenced this issue Jul 8, 2024
#19352 

Video explaining code changes:
https://www.loom.com/share/370200a276b84aa388effd6ebd762e01?sid=038508c4-f3c2-40c0-baf6-6b6df682d1f0

In maintenance windows using Google Calendar, calendar event is now
recreated within 30 seconds if deleted or moved to the past.
- Added new endpoint for Google Calendar:
`/api/_version_/fleet/calendar/webhook/{event_uuid}`
- Added UUID to `calendar_events` table to make webhook lookup more
efficient
- webhook endpoint will only recreate event if needed -- it will not
fire webhook. Webhook is still done by the cron job.

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

<!-- Note that API documentation changes are now addressed by the
product design team. -->

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
See [Changes
files](https://fleetdm.com/docs/contributing/committing-changes#changes-files)
for more information.
- [x] Input data is properly validated, `SELECT *` is avoided, SQL
injection is prevented (using placeholders for values in statements)
- [x] Added/updated tests
- [x] If database migrations are included, checked table schema to
confirm autoupdate
- For database migrations:
- [x] Checked schema for all modified table for columns that will
auto-update timestamps during migration.
- [x] Confirmed that updating the timestamps is acceptable, and will not
cause unwanted side effects.
- [x] Ensured the correct collation is explicitly set for character
columns (`COLLATE utf8mb4_unicode_ci`).
- [x] Manual QA for all new/changed functionality
  - For Orbit and Fleet Desktop changes:
@lukeheath lukeheath modified the milestones: 4.54.0, 4.55.0-tentative Jul 9, 2024
@getvictor getvictor modified the milestones: 4.55.0-tentative, 4.54.0 Jul 9, 2024
getvictor added a commit that referenced this issue Jul 10, 2024
…20277)

#19352

Fix for code review comment:
#20156 (comment)

Also includes changes from #20252

# Checklist for submitter

If some of the following don't apply, delete the relevant line.

<!-- Note that API documentation changes are now addressed by the
product design team. -->

- [x] Added/updated tests
- [x] If database migrations are included, checked table schema to
confirm autoupdate
- For database migrations:
- [x] Checked schema for all modified table for columns that will
auto-update timestamps during migration.
- [x] Confirmed that updating the timestamps is acceptable, and will not
cause unwanted side effects.
- [x] Ensured the correct collation is explicitly set for character
columns (`COLLATE utf8mb4_unicode_ci`).
- [x] Manual QA for all new/changed functionality
getvictor added a commit that referenced this issue Jul 10, 2024
…20277)

#19352

(cherry picked from commit 7bcd61a)
getvictor added a commit that referenced this issue Jul 15, 2024
getvictor added a commit that referenced this issue Jul 15, 2024
Unreleased bug fix for #19352

(cherry picked from commit dc7a3cd)
@sharon-fdm sharon-fdm modified the milestones: 4.54.0, 4.55.0-tentative Jul 17, 2024
getvictor added a commit that referenced this issue Jul 24, 2024
#19352
Includes the following changes:
- Re-enable calendar callback
- Introduced a new Redis key that indicates event was updated by
calendar callback. In that case, we ignore subsequent callbacks for 10
seconds.
- This reduces the amount of Google API calls, including handling of the
unneeded callback generated by our own event change.
- Read event from DB after acquiring lock. This is critical since we get
the updated ETag of the Google Calendar event from our DB. Using the
previous ETag when fetching event sometimes returns stale data,
resulting in duplicate events.
- Fixed bug in getCalendarLock where calendar cron would always think it
got the lock
- Do not refetch timezone during calendar callback to reduce Google API
load
- Watch for calendar event changes for 1 week after event end (to
account for user moving event into the future)
- #20442: Speculative improvement for Google callback latency by keeping
the same notification channel (callback URL).
- processCalendarAsync now takes at least 1 sec to process all events,
to reduce CPU/Redis load
- Increased lock expiration time from 1 minute to 20 minutes to account
for potential Google API retries, fixing occasional duplicate events.
- Added `get-events.go` helper script that gets maintenance events from
user calendars, and checks for duplicates

# Checklist for submitter

- [x] Changes file added for user-visible changes in `changes/`,
`orbit/changes/` or `ee/fleetd-chrome/changes`.
- [x] Added/updated tests
- [x] Manual QA for all new/changed functionality
getvictor added a commit that referenced this issue Aug 1, 2024
# Checklist for submitter

Fixing unreleased bug for #19352

- [x] Manual QA for all new/changed functionality
mostlikelee pushed a commit that referenced this issue Aug 1, 2024
# Checklist for submitter

Fixing unreleased bug for #19352

- [x] Manual QA for all new/changed functionality
@noahtalerman
Member

Documentation

Fleet server watches for potential changes for up to 1 week after original event time. If event is moved forward more than 1 week, then after 1 week Fleet server will check for event changes once every 30 minutes.

These near real-time updates may add additional load to the Google Calendar API, so it is recommended to use API usage alerts or other monitoring methods. Otherwise, if the Google API is overloaded, calendar updates and/or webhooks may be delayed.

Hey @getvictor do you know if this is/will be documented in a guide?

@lukeheath lukeheath added :product Product Design department (shows up on 🦢 Drafting board) and removed :release Ready to write code. Scheduled in a release. See "Making changes" in handbook. labels Aug 9, 2024
@noahtalerman
Member

Documentation

Fleet server watches for potential changes for up to 1 week after original event time. If event is moved forward more than 1 week, then after 1 week Fleet server will check for event changes once every 30 minutes.

These near real-time updates may add additional load to the Google Calendar API, so it is recommended to use API usage alerts or other monitoring methods. Otherwise, if the Google API is overloaded, calendar updates and/or webhooks may be delayed.

Hey @getvictor just giving you another ping :)

Is this already documented in a guide? If not can you please help document it in one?

@getvictor
Member

@noahtalerman This is documented in this PR https://github.com/fleetdm/fleet/pull/20974/files

@noahtalerman
Member

Closing this issue even though the article hasn't been shipped. Product team is tracking shipping the article as part of a separate story here: #20763

@fleet-release
Contributor

Quick as the falcon,
New window forms in the cloud,
Uptime ensured, proud.
