Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

NAS-128368 / 24.10 / nvme-of: Remove/Add disk on discovery change event #13780

Merged
merged 1 commit into from
Jun 17, 2024

Conversation

ixhamza
Copy link
Contributor

@ixhamza ixhamza commented May 22, 2024

Jira Ticket: NAS-128368

NAS-128368 reports three different issues related to NVMe-oF hotplugging:

Problem:

NAS-128368 reports three different issues related to NVMe-oF hotplugging:

  1. When a drive is physically removed while part of the pool, the Linux/zfs still sees it as ONLINE.
  2. A drive is not visible to the OS upon insertion until the next drive is inserted, i.e., drive addition by OS is lagged by one event. For example, if a drive is inserted into Slot-1, it won't be added by Linux until another drive is inserted into a different slot. However, the addition of the next drive will then be lagged by the drive inserted after that.
  3. If fenced is holding a drive, the new drive still points to the previous NVMe subsystem, showing the previous serial number of the last drive inserted.

Solution:

  1. The discovery controller receives the udev event, and by default, the NVMe-oF service runs the connect-all command, which is insufficient to remove the drive. To address this, we check the drives that are part of the enclosure and disconnect those not matched with the discovery log page.
  2. It takes up to 3 seconds after receiving the discovery event for the drives to be ready for connection on the client side. To accommodate this, we added a timeout of five seconds to ensure the drive connects properly.
  3. This issue is fixed by a Linux kernel patch: NAS-128368 / None / nvme: acquire a new subsystem if no valid controllers assigned linux#178. We now create a new subsystem if the previous one does not have any valid controllers.

Additionally, by default, systemd does not queue the events and ignores the service start if the service is already running. To resolve this, we added a simple queue to handle at most one event and created a wrapper to unblock systemd for the next udev events.

@ixhamza ixhamza requested review from amotin and usaleem-ix May 22, 2024 19:14
@bugclerk bugclerk changed the title nvme-of: Remove/Add disk on discovery change event NAS-128368 / 24.10 / nvme-of: Remove/Add disk on discovery change event May 22, 2024
Copy link
Contributor

@yocalebo yocalebo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please explain what this does. The ticket explains what the issue is but this PR has no context on how it fixes the problem.

@yocalebo yocalebo requested a review from bmeagherix May 23, 2024 11:26
@ixhamza
Copy link
Contributor Author

ixhamza commented May 23, 2024

@yocalebo - I have added a detailed explanation.

@ixhamza ixhamza force-pushed the NAS-128368 branch 3 times, most recently from 0212f2d to 06ab594 Compare May 23, 2024 14:28
@yocalebo
Copy link
Contributor

@ixhamza thanks! I'll start review here soon.

@ixhamza ixhamza force-pushed the NAS-128368 branch 2 times, most recently from f97b949 to c6e26d3 Compare May 24, 2024 16:10
@ixhamza ixhamza requested a review from yocalebo June 3, 2024 12:42
@ixhamza
Copy link
Contributor Author

ixhamza commented Jun 7, 2024

@yocalebo - would be appreciated if you could review this, please.

@yocalebo yocalebo requested a review from bmeagherix June 7, 2024 18:02
@yocalebo yocalebo merged commit 9b76876 into master Jun 17, 2024
3 checks passed
@yocalebo yocalebo deleted the NAS-128368 branch June 17, 2024 14:34
@bugclerk
Copy link
Contributor

This PR has been merged and conversations have been locked.
If you would like to discuss more about this issue please use our forums or raise a Jira ticket.

@truenas truenas locked as resolved and limited conversation to collaborators Jun 17, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants