fix: add events when pre- or post-upgrade check fails #3211

Merged

Conversation

james-munson
Contributor

@james-munson james-munson commented Oct 14, 2024

Which issue(s) this PR fixes:

longhorn/longhorn#9569

What this PR does / why we need it:

Add an event when the upgrade pre- or post-check job completes, with either an error or a success message.

Special notes for your reviewer:

Additional documentation or context

Summary by CodeRabbit

  • New Features

    • Enhanced event broadcasting and recording capabilities during the upgrade process.
    • Improved logging for pre-upgrade checks and outcomes.
  • Bug Fixes

    • Simplified event handling by streamlining the event broadcaster functionality.
  • Documentation

    • Updated comments for better clarity on upgrade path checks.

@james-munson james-munson requested review from PhanLe1010 and a team October 14, 2024 22:35
// hang around so logs can be collected.
// TODO - make this a --ttl argument.
time.Sleep(1 * time.Hour)
}
Contributor Author

Note that with the event captured, this is not as necessary, but it might be useful. I'm not sure what to use for the time to wait. The pre-upgrade job itself has spec.activeDeadlineSeconds: 900 so the pod will be killed after 15 minutes anyway, and perhaps that is a reasonable value to use.
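
The TODO in the snippet above could be handled with a duration flag. The following is a minimal sketch, assuming a hypothetical `--ttl` flag parsed with the standard library `flag` package; the real command may use a different CLI library, and `parseTTL` and the flag name are illustrative only. The 15-minute default mirrors the job's `spec.activeDeadlineSeconds: 900`.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// parseTTL reads a hypothetical --ttl flag from args. The default mirrors the
// job's activeDeadlineSeconds of 900 mentioned in the comment above.
func parseTTL(args []string) (time.Duration, error) {
	fs := flag.NewFlagSet("pre-upgrade", flag.ContinueOnError)
	ttl := fs.Duration("ttl", 15*time.Minute, "how long to linger after a failed check")
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	if *ttl < 0 {
		return 0, fmt.Errorf("ttl must not be negative, got %s", *ttl)
	}
	return *ttl, nil
}

func main() {
	ttl, err := parseTTL([]string{"--ttl=30s"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("would linger for", ttl)
	// A real command would call time.Sleep(ttl) here before exiting.
}
```

Keeping the default no longer than the job deadline avoids a sleep that the pod would never finish anyway.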

Member

If the event is emitted, does it need to sleep for minutes?

Contributor Author

Not really. The event will last for an hour, so if a support bundle is collected in that time, the event should be there.
It would be the way to accomplish the goal in longhorn/longhorn#9448.
I can certainly take it out if preferred.

Contributor Author

I removed it.

Contributor Author

I put it back. Without it, sometimes the panic from the "fatal" error means the event does not get created.

Member

> I put it back. Without it, sometimes the panic from the "fatal" error means the event does not get created.

Can you elaborate more on the statement?

Contributor

> I put it back. Without it, sometimes the panic from the "fatal" error means the event does not get created.

Looks like the AI suggestion can solve this one. WDYT @james-munson ?

https://github.com/longhorn/longhorn-manager/pull/3211/files#r1803865276

Contributor Author

The events are queued by Event/Eventf, but not necessarily propagated to the sink before the Event() call returns. They can be lost when the next thing that happens is an os.Exit as part of the log.Fatal.
However, the AI suggestion is a good one. I tested it, and it looks like eventBroadcaster.Shutdown() forces a flush of the queued events, so they show up even without a sleep to delay the exit. I have pushed up the change.
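
The race described above can be illustrated with a small standard-library model. This is a sketch only: the `broadcaster` type below is hypothetical and far simpler than the client-go event broadcaster, and the reason string is illustrative. It shows why an enqueue-only `Event` call can lose data on an immediate exit, and how a `Shutdown` that drains the queue avoids the need for a sleep.

```go
package main

import (
	"fmt"
	"sync"
)

// event is a stand-in for a Kubernetes event; the real type lives in client-go.
type event struct {
	reason  string
	message string
}

// broadcaster is a hypothetical, simplified model of an asynchronous event
// broadcaster: Event only enqueues, and a background goroutine delivers.
type broadcaster struct {
	queue chan event
	done  sync.WaitGroup
	mu    sync.Mutex
	sink  []event // events that actually reached the sink
}

func newBroadcaster() *broadcaster {
	b := &broadcaster{queue: make(chan event, 16)}
	b.done.Add(1)
	go func() {
		defer b.done.Done()
		for ev := range b.queue {
			b.mu.Lock()
			b.sink = append(b.sink, ev)
			b.mu.Unlock()
		}
	}()
	return b
}

// Event returns as soon as the event is queued, before it is delivered.
func (b *broadcaster) Event(reason, message string) {
	b.queue <- event{reason: reason, message: message}
}

// Shutdown closes the queue and blocks until every queued event is delivered.
func (b *broadcaster) Shutdown() {
	close(b.queue)
	b.done.Wait()
}

// delivered returns how many events have reached the sink so far.
func (b *broadcaster) delivered() int {
	b.mu.Lock()
	defer b.mu.Unlock()
	return len(b.sink)
}

func main() {
	b := newBroadcaster()
	b.Event("FailedUpgradePreCheck", "upgrade path is not supported")
	// Exiting the process here could lose the event, since delivery is asynchronous.
	b.Shutdown() // drain first; after this it is safe to exit
	fmt.Println("delivered:", b.delivered())
}
```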

coderabbitai bot commented Oct 16, 2024

📝 Walkthrough

The changes in this pull request focus on enhancing the upgrade process in the Longhorn application. Key modifications include the introduction of new constants for event handling, updates to the postUpgrade and preUpgrade functions to incorporate event broadcasting and recording, and the restructuring of related components to improve organization and clarity. Additionally, a new utility function for creating event broadcasters is added, while some functions are removed or modified for simplicity.

Changes

| File Path | Change Summary |
| --- | --- |
| app/post_upgrade.go | Introduced constant PostUpgradeEventer, modified postUpgrade for event handling, updated newPostUpgrader to accept eventRecorder, and modified Run method for event recording. |
| app/pre_upgrade.go | Defined constant PreUpgradeEventer, improved logging in PreUpgradeCmd, expanded preUpgrade for event handling, added newPreUpgrader method, and created preUpgrader struct with Run method. |
| app/recurring_job.go | Removed createEventBroadcaster function, modified eventCreate method for direct event handling. |
| app/util.go | Added createEventBroadcaster function to initialize event broadcasters. |
| constant/events.go | Added new constants for upgrade events and updated EventReasonUpgrade signature. |
| upgrade/upgrade.go | Updated comments in doResourceUpgrade function for clarity on upgrade path checks. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant PreUpgrader
    participant PostUpgrader
    participant EventRecorder

    User->>PreUpgrader: Initiate Pre-Upgrade
    PreUpgrader->>EventRecorder: Create Event
    PreUpgrader->>PreUpgrader: Run Pre-Upgrade Checks
    PreUpgrader->>EventRecorder: Record Outcome
    PreUpgrader-->>User: Pre-Upgrade Complete

    User->>PostUpgrader: Initiate Post-Upgrade
    PostUpgrader->>EventRecorder: Create Event
    PostUpgrader->>PostUpgrader: Run Post-Upgrade Checks
    PostUpgrader->>EventRecorder: Record Outcome
    PostUpgrader-->>User: Post-Upgrade Complete

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (4)
app/util.go (1)

22-23: Address the TODO comment regarding client usage.

There's a TODO comment indicating that the wrapper should be removed when all clients have moved to use the clientset. This suggests that there might be ongoing refactoring or migration work.

Consider creating a tracking issue for this TODO item to ensure it's not forgotten. Additionally, it would be helpful to provide more context about the timeline or conditions for removing this wrapper.

upgrade/upgrade.go (1)

Line range hint 290-315: LGTM: Resource status upgrades and final cleanup steps look good

The implementation correctly handles resource status upgrades for various version paths, consistent with the previous upgrade steps. The final calls to update resource statuses, delete removed settings, and update the Longhorn version setting are crucial for maintaining system consistency after the upgrade process.

One minor suggestion for improved readability:

Consider extracting the repeated semver comparison logic into a helper function to reduce code duplication. For example:

func shouldUpgrade(currentVersion, targetVersion string) bool {
    return semver.Compare(currentVersion, targetVersion) < 0
}

// Usage
if shouldUpgrade(lhVersionBeforeUpgrade, "v1.5.0") {
    // Upgrade logic here
}

This would make the code more concise and easier to maintain.

app/post_upgrade.go (2)

29-29: Consider renaming PostUpgradeEventer for clarity

The constant PostUpgradeEventer represents the event source component name. To enhance readability and alignment with naming conventions, consider renaming it to PostUpgradeEventComponent or PostUpgradeEventSource.


113-113: Consider adding context with timeout to waitManagerUpgradeComplete

The waitManagerUpgradeComplete method uses a fixed retry count and interval, potentially causing long waits. Incorporate a context with a timeout or deadline to allow for cancellation and better control over the waiting period.

Example modification:

func (u *postUpgrader) waitManagerUpgradeComplete(ctx context.Context) error {
	// Use ctx in API calls and add select statements to handle cancellation.
}

Ensure that when calling this function, you pass an appropriate context, possibly with a timeout.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 8387bcf and 31ffc3c.

⛔ Files ignored due to path filters (1)
  • coverage.out is excluded by !**/*.out
📒 Files selected for processing (6)
  • app/post_upgrade.go (3 hunks)
  • app/pre_upgrade.go (3 hunks)
  • app/recurring_job.go (0 hunks)
  • app/util.go (1 hunks)
  • constant/events.go (1 hunks)
  • upgrade/upgrade.go (4 hunks)
💤 Files with no reviewable changes (1)
  • app/recurring_job.go
🧰 Additional context used
🔇 Additional comments (15)
app/util.go (3)

3-12: LGTM: Imports are well-organized and relevant.

The imports are correctly structured, following good practices such as grouping and ordering. All imported packages are relevant to the function being implemented.


14-26: LGTM: Well-implemented event broadcaster creation.

The createEventBroadcaster function is well-structured and correctly implements the creation of a Kubernetes event broadcaster. It properly handles error cases, sets up logging, and configures event recording to a Kubernetes sink.


1-26: Summary: New utility function aligns with PR objectives.

The introduction of the createEventBroadcaster function in app/util.go aligns well with the PR objectives of enhancing the upgrade process by adding events. This utility function provides a centralized way to create event broadcasters, which can be used to log pre- and post-upgrade check results.

The implementation is solid, following good practices in error handling, resource initialization, and code structure. It sets a good foundation for improving the visibility and tracking of the upgrade process as intended by this PR.

constant/events.go (1)

65-68: LGTM! Consistent and clear event reason constants added.

The new constants for upgrade events (EventReasonFailedUpgradePreCheck, EventReasonFailedUpgradePostCheck, and EventReasonPassedUpgradeCheck) are well-named and consistent with the existing naming conventions. They effectively address the PR's objective of adding events for pre- and post-upgrade checks.

The reformatting of EventReasonUpgrade improves overall consistency. The naming convention now clearly distinguishes between failure and success scenarios, addressing the concerns raised in previous discussions.

upgrade/upgrade.go (5)

Line range hint 245-259: LGTM: Upgrade path for v1.4.x to v1.5.2 looks good

The implementation correctly handles the upgrade paths from v1.4.x to v1.5.0 and v1.5.1 to v1.5.2. The use of semver comparison ensures that upgrades are only performed when necessary.


259-266: LGTM: Upgrade path for v1.5.x to v1.6.0 is correctly implemented

The code segment properly handles the upgrade path from v1.5.x to v1.6.0, using semver comparison to determine if the upgrade is necessary. This implementation is consistent with the overall upgrade strategy.


Line range hint 266-279: LGTM: Upgrade paths for v1.6.x to v1.7.1 are properly implemented

The code correctly handles the upgrade paths from v1.6.x to v1.7.0 and v1.7.0 to v1.7.1. The use of semver comparisons ensures that upgrades are performed only when necessary, maintaining consistency with the overall upgrade strategy.


Line range hint 279-290: LGTM: Upgrade path for v1.7.x to v1.8.0 and final resource update look good

The code correctly implements the upgrade path from v1.7.x to v1.8.0, consistent with the previous upgrade steps. The final call to upgradeutil.UpdateResources ensures that all resources are updated after the version-specific upgrades, which is a good practice for maintaining system consistency.


Line range hint 1-315: Overall, the upgrade implementation looks solid and well-structured

The changes in this file successfully enhance the upgrade process for Longhorn, addressing various version paths and ensuring proper resource and status updates. The implementation is consistent, follows a clear pattern, and aligns well with the PR objectives.

Key points:

  1. Proper use of semver comparisons for version checks
  2. Consistent handling of upgrade paths for different versions
  3. Appropriate updating of resources and their statuses
  4. Final cleanup steps to maintain system consistency

The code quality is good, with only a minor suggestion for improving readability by extracting the repeated semver comparison logic into a helper function.

app/post_upgrade.go (6)

44-47: Ensure FlagNamespace environment variable is properly set

The FlagNamespace flag is now required and uses types.EnvPodNamespace as its environment variable. Verify that this environment variable is correctly set in all deployment environments to prevent potential issues with namespace resolution.


79-79: Confirm the event recorder is correctly initialized

The event recorder is initialized with the new scheme and event source. Ensure that the scheme includes all necessary types and that PostUpgradeEventer accurately represents the component emitting events.


86-88: Check error handling for newPostUpgrader().Run()

The error returned by newPostUpgrader(namespace, kubeClient, eventRecorder).Run() is assigned to err and returned. Ensure that any errors are appropriately logged or handled upstream to provide clear diagnostics in case of failures.


92-94: Addition of eventRecorder enhances event handling

Adding the eventRecorder to the postUpgrader struct allows the upgrade process to emit events, improving observability.


97-98: Update constructor to include eventRecorder

The newPostUpgrader function now accepts eventRecorder as a parameter, aligning with the updated struct definition. This change ensures that the recorder is properly passed and available within postUpgrader.


69-73: Ensure createEventBroadcaster function is defined and error handling is comprehensive

Verify that the createEventBroadcaster function exists and properly initializes the event broadcaster. Ensure comprehensive error handling within this function to prevent nil returns without errors.

If the function is not defined, you might need to implement it or import the correct package.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 31ffc3c and eaa24a8.

📒 Files selected for processing (6)
  • app/post_upgrade.go (3 hunks)
  • app/pre_upgrade.go (3 hunks)
  • app/recurring_job.go (0 hunks)
  • app/util.go (1 hunks)
  • constant/events.go (1 hunks)
  • upgrade/upgrade.go (4 hunks)
💤 Files with no reviewable changes (1)
  • app/recurring_job.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • app/post_upgrade.go
  • app/util.go
  • constant/events.go
  • upgrade/upgrade.go
🧰 Additional context used
🔇 Additional comments (1)
app/pre_upgrade.go (1)

103-108: Ensure the ObjectReference in events refers to a valid Kubernetes object

The ObjectReference used in eventRecorder.Event should refer to an existing Kubernetes object. Using Name: PreUpgradeEventer may not correspond to a valid object, which could affect event visibility and association. Consider referencing a relevant object, such as a Pod, Deployment, or a Longhorn custom resource.

To confirm the validity of the ObjectReference, run the following script:

If no such object exists, update the ObjectReference to point to an existing resource.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
app/pre_upgrade.go (2)

60-84: LGTM: Enhanced pre-upgrade process with event recording.

The changes to the preUpgrade function significantly improve the pre-upgrade process by adding event recording capabilities. The event broadcaster, scheme, and recorder are correctly set up and used.

However, consider improving the error handling in the newPreUpgrader().Run() call:

 err = newPreUpgrader(namespace, lhClient, eventRecorder).Run()
 if err != nil {
-    logrus.Warnf("Done with Run() ... err is %v", err)
+    logrus.Errorf("Pre-upgrade encountered an error: %v", err)
 }

This change provides more clarity about the nature of the log message.


96-116: LGTM: Well-implemented Run method with proper event recording.

The Run method effectively encapsulates the pre-upgrade process logic, including proper event recording for both success and failure scenarios. The removal of the sleep after checks is a good improvement.

Consider wrapping the error from upgradeutil.CheckUpgradePath for better context:

 if err = upgradeutil.CheckUpgradePath(u.namespace, u.lhClient, u.eventRecorder, true); err != nil {
-    return err
+    return errors.Wrap(err, "failed to check upgrade path")
 }

This change provides more context when the error is logged or returned.
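
The same idea can be sketched with the standard library alone. Note the swap: this example uses `fmt.Errorf` with `%w` instead of the `errors.Wrap` call shown in the diff, and `checkUpgradePath`, `runPreUpgrade`, and `errUnsupportedPath` are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnsupportedPath is a hypothetical sentinel error used only for this sketch.
var errUnsupportedPath = errors.New("unsupported upgrade path")

// checkUpgradePath is a stand-in for the real check; it always fails here.
func checkUpgradePath() error {
	return errUnsupportedPath
}

// runPreUpgrade wraps the underlying error with context while keeping it
// inspectable through errors.Is.
func runPreUpgrade() error {
	if err := checkUpgradePath(); err != nil {
		return fmt.Errorf("failed to check upgrade path: %w", err)
	}
	return nil
}

func main() {
	err := runPreUpgrade()
	fmt.Println(err)
	fmt.Println(errors.Is(err, errUnsupportedPath))
}
```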

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between eaa24a8 and e8cf7fb.

📒 Files selected for processing (6)
  • app/post_upgrade.go (3 hunks)
  • app/pre_upgrade.go (3 hunks)
  • app/recurring_job.go (0 hunks)
  • app/util.go (1 hunks)
  • constant/events.go (1 hunks)
  • upgrade/upgrade.go (4 hunks)
💤 Files with no reviewable changes (1)
  • app/recurring_job.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • app/post_upgrade.go
  • app/util.go
  • constant/events.go
  • upgrade/upgrade.go
🧰 Additional context used
🔇 Additional comments (5)
app/pre_upgrade.go (5)

8-8: LGTM: New imports are appropriate for the added functionality.

The new imports are necessary and correctly added to support the enhanced pre-upgrade process, including event recording and Longhorn-specific operations.

Also applies to: 10-10, 12-14, 17-17


22-24: LGTM: New constant for event recording.

The PreUpgradeEventer constant is well-named and appropriately used for identifying the component in event recording.


42-43: Consider moving the completion log after error handling.

The deferred log statement defer logrus.Info("Completed pre-upgrade.") may not execute if the program exits before returning from the function, such as when logrus.WithError(err).Fatalf is called. This is because deferred functions are not run when os.Exit() is called within Fatalf.

To verify this behavior, we can search for similar patterns in the codebase:

#!/bin/bash
# Search for deferred logs followed by Fatalf calls.
# -U enables multiline matching, which the \n in the pattern requires.
rg -U --type go 'defer\s+logrus\..*\n.*logrus\..*\.Fatalf'
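
A common way to sidestep the problem is to keep the exit in `main` and have the worker function return an error, so its deferred calls always run. The following is a minimal sketch of that pattern; `run`, `check`, and `trace` are hypothetical names, and a slice stands in for the logrus calls.

```go
package main

import (
	"errors"
	"fmt"
)

// trace records which steps ran, so the order can be inspected afterwards.
var trace []string

// check is a hypothetical stand-in for the pre-upgrade checks.
func check(fail bool) error {
	if fail {
		return errors.New("upgrade path check failed")
	}
	return nil
}

// run does the work and returns an error instead of exiting, so its deferred
// calls always execute before control returns to the caller.
func run(fail bool) (err error) {
	trace = append(trace, "started")
	defer func() {
		if err != nil {
			trace = append(trace, "failed")
			return
		}
		trace = append(trace, "completed")
	}()
	return check(fail)
}

func main() {
	err := run(true)
	fmt.Println(trace, err)
	// A real command would call os.Exit(1) here when err != nil; by this point
	// every deferred call in run has already executed.
}
```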

86-90: LGTM: Well-structured preUpgrader struct.

The preUpgrader struct is well-designed, containing all necessary fields for the pre-upgrade process. It encapsulates the required dependencies, promoting better organization and modularity of the code.


92-94: LGTM: Proper constructor for preUpgrader.

The newPreUpgrader function serves as a clean and concise constructor for the preUpgrader struct, correctly initializing all required fields.

Contributor

@PhanLe1010 PhanLe1010 left a comment

LGTM

Member

@derekbit derekbit left a comment

LGTM

@derekbit derekbit merged commit e6ce3f2 into longhorn:master Oct 18, 2024
9 checks passed
@derekbit
Member

@mergify backport v1.6.x v1.7.x

mergify bot commented Oct 18, 2024

backport v1.6.x v1.7.x

✅ Backports have been created
