test(robot): add test case Test Longhorn components recovery #2143

Open · wants to merge 1 commit into base: master
6 changes: 6 additions & 0 deletions e2e/keywords/backing_image.resource
@@ -22,3 +22,9 @@ Clean up backing image ${backing_image_name} from a disk

Delete backing image ${backing_image_name}
    delete_backing_image    ${backing_image_name}

Delete backing image managers and wait for recreation
    delete_all_backing_image_managers_and_wait_for_recreation

Wait backing image managers running
    wait_all_backing_image_managers_running
16 changes: 16 additions & 0 deletions e2e/keywords/longhorn.resource
@@ -66,3 +66,19 @@ Check all Longhorn CRD removed

Install Longhorn
    install_longhorn_system

Delete instance-manager of volume ${volume_id}
    ${volume_name} =    generate_name_with_suffix    volume    ${volume_id}
    ${node_name} =    get_volume_node    ${volume_name}
    ${pod_name} =    get_instance_manager_on_node    ${node_name}
    delete_pod    ${pod_name}    longhorn-system
Comment on lines +70 to +74

🛠️ Refactor suggestion

Consider adding error handling and verification

While the implementation is logically correct, consider enhancing it with:

  1. Error handling for cases where the volume or instance manager pod doesn't exist
  2. Verification that the pod was successfully deleted

Example enhancement:

Delete instance-manager of volume ${volume_id}
    ${volume_name} =    generate_name_with_suffix    volume    ${volume_id}
+    ${volume_exists} =    Run Keyword And Return Status    get_volume_node    ${volume_name}
+    Run Keyword If    not ${volume_exists}    Fail    Volume ${volume_name} not found
    ${node_name} =    get_volume_node    ${volume_name}
    ${pod_name} =     get_instance_manager_on_node    ${node_name}
+    Should Not Be Equal    ${pod_name}    ${None}    Instance manager pod not found on node ${node_name}
    delete_pod    ${pod_name}    longhorn-system
+    Wait Until Keyword Succeeds    30s    5s    Should Not Exist    pod    ${pod_name}    longhorn-system


Delete instance-manager of deployment ${deployment_id} volume
    ${deployment_name} =    generate_name_with_suffix    deployment    ${deployment_id}
    ${volume_name} =    get_workload_volume_name    ${deployment_name}
    ${node_name} =    get_volume_node    ${volume_name}
    ${pod_name} =    get_instance_manager_on_node    ${node_name}
    delete_pod    ${pod_name}    longhorn-system
Comment on lines +76 to +81

🛠️ Refactor suggestion

Refactor to reduce code duplication and add error handling

This keyword shares similar logic with Delete instance-manager of volume. Consider:

  1. Creating a common helper keyword for the shared pod deletion logic
  2. Adding error handling and verification similar to the previous suggestion

Example refactor:

+Delete instance-manager pod
+    [Arguments]    ${node_name}
+    ${pod_name} =     get_instance_manager_on_node    ${node_name}
+    Should Not Be Equal    ${pod_name}    ${None}    Instance manager pod not found on node ${node_name}
+    delete_pod    ${pod_name}    longhorn-system
+    Wait Until Keyword Succeeds    30s    5s    Should Not Exist    pod    ${pod_name}    longhorn-system

Delete instance-manager of deployment ${deployment_id} volume
    ${deployment_name} =   generate_name_with_suffix    deployment    ${deployment_id}
+    ${deployment_exists} =    Run Keyword And Return Status    get_workload_volume_name    ${deployment_name}
+    Run Keyword If    not ${deployment_exists}    Fail    Deployment ${deployment_name} not found
    ${volume_name} =    get_workload_volume_name    ${deployment_name}
    ${node_name} =    get_volume_node    ${volume_name}
-    ${pod_name} =     get_instance_manager_on_node    ${node_name}
-    delete_pod    ${pod_name}    longhorn-system
+    Delete instance-manager pod    ${node_name}



Wait for Longhorn components all running
    wait_for_namespace_pods_running    longhorn-system
Comment on lines +83 to +84

🛠️ Refactor suggestion

Consider adding a configurable timeout parameter

The keyword looks good but consider adding a timeout parameter to handle environments where pod startup might take longer.

Example enhancement:

-Wait for Longhorn components all running
+Wait for Longhorn components all running
+    [Arguments]    ${timeout}=300s
-    wait_for_namespace_pods_running    longhorn-system
+    wait_for_namespace_pods_running    longhorn-system    timeout=${timeout}


10 changes: 10 additions & 0 deletions e2e/keywords/sharemanager.resource
@@ -20,3 +20,13 @@ Check sharemanager ${condition} using headless service

Wait for all sharemanager to be deleted
    wait_for_sharemanagers_deleted

Delete sharemanager of deployment ${deployment_id} and wait for recreation
    ${deployment_name} =    generate_name_with_suffix    deployment    ${deployment_id}
    ${volume_name} =    get_workload_volume_name    ${deployment_name}
    delete_sharemanager_and_wait_for_recreation    ${volume_name}

Wait for sharemanager of deployment ${deployment_id} running
    ${deployment_name} =    generate_name_with_suffix    deployment    ${deployment_id}
    ${volume_name} =    get_workload_volume_name    ${deployment_name}
    wait_for_share_manager_running    ${volume_name}
16 changes: 15 additions & 1 deletion e2e/keywords/workload.resource
@@ -187,5 +187,19 @@ Check ${workload_kind} ${workload_id} pod is ${expect_state} on another node
    Should Not Be Equal    ${node_name}    ${last_volume_node}

Delete Longhorn ${workload_kind} ${workload_name} pod on node ${node_id}

    ${node_name} =    get_node_by_index    ${node_id}
    delete_workload_pod_on_node    ${workload_name}    ${node_name}    longhorn-system

    IF    '${workload_name}' == 'engine-image'
        ${label_selector} =    Set Variable    longhorn.io/component=engine-image
    ELSE IF    '${workload_name}' == 'instance-manager'
        ${label_selector} =    Set Variable    longhorn.io/component=instance-manager
    ELSE
        ${label_selector} =    Set Variable    ${EMPTY}
    END
    delete_workload_pod_on_node    ${workload_name}    ${node_name}    longhorn-system    ${label_selector}

Delete Longhorn ${workload_kind} ${workload_name} pod
    ${pod_name} =    get_workload_pod_name    ${workload_name}    longhorn-system
    Log    ${pod_name}
    delete_pod    ${pod_name}    longhorn-system
18 changes: 17 additions & 1 deletion e2e/libs/backing_image/backing_image.py
@@ -1,6 +1,6 @@
from backing_image.base import Base
from backing_image.rest import Rest

from backing_image.crd import CRD
from strategy import LonghornOperationStrategy


@@ -30,3 +30,19 @@ def delete(self, bi_name):

    def cleanup_backing_images(self):
        return self.backing_image.cleanup_backing_images()

    def delete_backing_image_manager(self, name):
        self.backing_image = CRD()
        return self.backing_image.delete_backing_image_manager(name)

    def wait_all_backing_image_managers_running(self):
        self.backing_image = CRD()
        return self.backing_image.wait_all_backing_image_managers_running()

    def wait_backing_image_manager_restart(self, name, last_creation_time):
        self.backing_image = CRD()
        self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

    def list_backing_image_manager(self):
        self.backing_image = CRD()
        return self.backing_image.list_backing_image_manager()
Comment on lines +34 to +48

💡 Codebase verification

⚠️ Potential issue

The shell script results (see the analysis chain below) provide clear evidence about how the strategy pattern is implemented across the codebase.

Strategy pattern implementation needs revision to match codebase standards

The current implementation in backing_image.py deviates from the established pattern used consistently across other modules. All other modules:

  1. Initialize the strategy object only once in __init__
  2. Handle both REST and CRD cases properly with if-else blocks
  3. Use the initialized strategy object throughout their methods

Here's the correct implementation matching the codebase standards:

def __init__(self):
    if self._strategy == LonghornOperationStrategy.CRD:
        self.backing_image = CRD()
    else:
        self.backing_image = Rest()

Remove the redundant CRD instantiation from:

  • delete_backing_image_manager
  • wait_all_backing_image_managers_running
  • wait_backing_image_manager_restart
  • list_backing_image_manager
🔗 Analysis chain

Strategy pattern implementation needs revision

The current implementation has several architectural concerns:

  1. The new methods bypass the strategy pattern by directly instantiating CRD, which violates the pattern's purpose.
  2. Creating a new CRD instance per method call is inefficient and could lead to resource management issues.
  3. Overwriting self.backing_image could affect other methods expecting the REST implementation.

Consider refactoring to:

  1. Respect the strategy pattern:
 class BackingImage(Base):
     _strategy = LonghornOperationStrategy.REST
 
     def __init__(self):
         if self._strategy == LonghornOperationStrategy.REST:
             self.backing_image = Rest()
+        else:
+            self.backing_image = CRD()
 
     def delete_backing_image_manager(self, name):
-        self.backing_image = CRD()
         return self.backing_image.delete_backing_image_manager(name)
  2. If both REST and CRD implementations are needed, consider using separate attributes (a sketch of per-method routing under this option follows the snippet below):
def __init__(self):
    self.rest_client = Rest()
    self.crd_client = CRD()
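
A minimal sketch of what per-method routing could look like under this dual-client option, reusing the Rest and CRD classes and the method names from this PR; which method goes to which client is an assumption for illustration, not part of the change:

class BackingImage(Base):
    def __init__(self):
        # Keep both clients around; pick one per method instead of overwriting self.backing_image.
        self.rest_client = Rest()
        self.crd_client = CRD()

    def delete(self, bi_name):
        # Backing image lifecycle calls stay on the REST client (assumed routing).
        return self.rest_client.delete(bi_name)

    def delete_backing_image_manager(self, name):
        # Backing image manager operations are only implemented by the CRD client in this PR.
        return self.crd_client.delete_backing_image_manager(name)

    def wait_all_backing_image_managers_running(self):
        return self.crd_client.wait_all_backing_image_managers_running()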

The wait_backing_image_manager_restart method doesn't return the result from the delegate call:

 def wait_backing_image_manager_restart(self, name, last_creation_time):
     self.backing_image = CRD()
-    self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
+    return self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

Consider renaming list_backing_image_manager to list_backing_image_managers for consistency with other method names.

Let's verify the strategy pattern usage across the codebase:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if other classes follow the same pattern
rg -l "LonghornOperationStrategy" | xargs rg -A 5 "_strategy\s*="

Length of output: 5660

16 changes: 16 additions & 0 deletions e2e/libs/backing_image/base.py
@@ -30,3 +30,19 @@ def delete(self, bi_name):

    @abstractmethod
    def cleanup_backing_images(self):
        return NotImplemented

    @abstractmethod
    def wait_all_backing_image_managers_running(self):
        return NotImplemented

    @abstractmethod
    def list_backing_image_manager(self):
        return NotImplemented

    @abstractmethod
    def delete_backing_image_manager(self, name):
        return NotImplemented

    @abstractmethod
    def wait_backing_image_manager_restart(self, name, last_creation_time):
        return NotImplemented
91 changes: 91 additions & 0 deletions e2e/libs/backing_image/crd.py
@@ -0,0 +1,91 @@
from kubernetes import client
from datetime import datetime
from backing_image.base import Base

from utility.utility import logging
from utility.utility import get_retry_count_and_interval
import time

class CRD(Base):
    def __init__(self):
        self.obj_api = client.CustomObjectsApi()
        self.retry_count, self.retry_interval = get_retry_count_and_interval()

    def create(self, bi_name, source_type, url, expected_checksum):
        return NotImplemented

    def get(self, bi_name):
        return NotImplemented

    def all_disk_file_status_are_ready(self, bi_name):
        return NotImplemented

    def clean_up_backing_image_from_a_random_disk(self, bi_name):
        return NotImplemented

    def delete(self, bi_name):
        return NotImplemented

    def wait_for_backing_image_disk_cleanup(self, bi_name, disk_id):
        return NotImplemented

    def wait_for_backing_image_delete(self, bi_name):
        return NotImplemented

    def cleanup_backing_images(self):
        return NotImplemented

    def list_backing_image_manager(self):
        label_selector = 'longhorn.io/component=backing-image-manager'
        return self.obj_api.list_namespaced_custom_object(
            group="longhorn.io",
            version="v1beta2",
            namespace="longhorn-system",
            plural="backingimagemanagers",
            label_selector=label_selector)

    def delete_backing_image_manager(self, name):
        logging(f"deleting backing image manager {name} ...")
        self.obj_api.delete_namespaced_custom_object(
            group="longhorn.io",
            version="v1beta2",
            namespace="longhorn-system",
            plural="backingimagemanagers",
            name=name
        )

    def wait_all_backing_image_managers_running(self):
        for i in range(self.retry_count):

🛠️ Refactor suggestion

Rename unused loop variable 'i' to '_'

The loop variable i is not used within the loop body. Renaming it to '_' improves code readability by indicating that the variable is intentionally unused.

Apply this diff to rename the unused variable:

-        for i in range(self.retry_count):
+        for _ in range(self.retry_count):
🪛 Ruff 57-57: Loop control variable i not used within loop body. Rename unused i to _i (B007)

            all_running = True
            backing_image_managers = self.list_backing_image_manager()
            for backing_image_manager in backing_image_managers["items"]:
                current_state = backing_image_manager["status"]["currentState"]
                name = backing_image_manager["metadata"]["name"]
                logging(f"backing image manager {name} currently in {current_state} state")
                if current_state != "running":
                    all_running = False
Comment on lines +61 to +65

⚠️ Potential issue

Add error handling for missing status or currentState keys

There is a potential KeyError if status or currentState is missing from backing_image_manager. To make the code more robust, add checks or use the get method with default values.

Apply this diff to safely access dictionary keys:

-                current_state = backing_image_manager["status"]["currentState"]
+                current_state = backing_image_manager.get("status", {}).get("currentState", "unknown")

Optionally, add a check for the 'unknown' state:

if current_state == "unknown":
    logging(f"Unable to determine the current state of backing image manager {name}")
    all_running = False

            if all_running is True:
                return
            time.sleep(self.retry_interval)
        assert False, f"Waiting all backing image manager in running state timeout"

⚠️ Potential issue

Replace assert False with raising an exception and remove unnecessary f prefix

Using assert False can be problematic because assertions are removed when Python is run with optimizations (-O flag). Instead, raise an explicit AssertionError. Additionally, the f prefix is unnecessary since there are no placeholders in the string.

Apply this diff to raise an exception and fix the string:

-        assert False, f"Waiting all backing image manager in running state timeout"
+        raise AssertionError("Timeout while waiting for all backing image managers to be in running state")
🪛 Ruff 69-69: Do not assert False (python -O removes these calls), raise AssertionError() (B011)
🪛 Ruff 69-69: f-string without any placeholders. Remove extraneous f prefix (F541)


    def wait_backing_image_manager_restart(self, name, last_creation_time):
        for i in range(self.retry_count):

🛠️ Refactor suggestion

Rename unused loop variable 'i' to '_'

Similar to the previous loop, the variable i is not used within the loop body. Renaming it to '_' clarifies that it is intentionally unused.

Apply this diff to rename the unused variable:

-        for i in range(self.retry_count):
+        for _ in range(self.retry_count):
🪛 Ruff 72-72: Loop control variable i not used within loop body. Rename unused i to _i (B007)

            time.sleep(self.retry_interval)
            try:
                backing_image_manager = self.obj_api.get_namespaced_custom_object(
                    group="longhorn.io",
                    version="v1beta2",
                    namespace="longhorn-system",
                    plural="backingimagemanagers",
                    name=name
                )
            except Exception as e:
                logging(f"Finding backing image manager {name} failed with error {e}")
                continue
Comment on lines +75 to +84

🛠️ Refactor suggestion

Handle exceptions more specifically and improve logging

Catching all exceptions with a bare except can obscure unexpected errors. It's better to catch specific exceptions. Additionally, consider logging the stack trace for better debugging.

Apply this diff to catch specific exceptions and log the stack trace:

-            except Exception as e:
-                logging(f"Finding backing image manager {name} failed with error {e}")
+            except client.exceptions.ApiException as e:
+                logging(f"Failed to find backing image manager {name}: {e}")
+            except Exception as e:
+                logging(f"An unexpected error occurred while finding backing image manager {name}: {e}", exc_info=True)


            creation_time = backing_image_manager["metadata"]["creationTimestamp"]
            fmt = "%Y-%m-%dT%H:%M:%SZ"
            if datetime.strptime(creation_time, fmt) > datetime.strptime(last_creation_time, fmt):
                return

        assert False, f"Wait backing image manager {name} restart failed ..."

⚠️ Potential issue

Replace assert False with raising an exception

Again, replace assert False with an explicit AssertionError so the check is not removed when Python runs with optimizations (-O).

Apply this diff to correct the assertion:

-        assert False, f"Wait backing image manager {name} restart failed ..."
+        raise AssertionError(f"Waiting for backing image manager '{name}' to restart failed")
🪛 Ruff 91-91: Do not assert False (python -O removes these calls), raise AssertionError() (B011)

12 changes: 12 additions & 0 deletions e2e/libs/backing_image/rest.py
@@ -110,3 +110,15 @@ def cleanup_backing_images(self):
                break
            time.sleep(self.retry_interval)
        assert len(get_longhorn_client().list_backing_image()) == 0

    def delete_backing_image_manager(self, name):
        return NotImplemented

    def wait_all_backing_image_managers_running(self):
        return NotImplemented

    def wait_backing_image_manager_restart(self, name, last_creation_time):
        return NotImplemented

    def list_backing_image_manager(self):
        return NotImplemented
27 changes: 25 additions & 2 deletions e2e/libs/k8s/k8s.py
@@ -1,12 +1,12 @@
import time
import subprocess
import asyncio
import os
from kubernetes import client
from kubernetes.client.rest import ApiException
from workload.pod import create_pod
from workload.pod import delete_pod
from workload.pod import new_pod_manifest
from workload.pod import wait_for_pod_status
from workload.pod import get_pod
Comment on lines +8 to +9

⚠️ Potential issue

Remove unused imports.

The following imports are not used in the code:

  • wait_for_pod_status
  • get_pod

Apply this diff to remove the unused imports:

-from workload.pod import wait_for_pod_status
-from workload.pod import get_pod
🪛 Ruff 8-8: workload.pod.wait_for_pod_status imported but unused. Remove unused import (F401)
🪛 Ruff 9-9: workload.pod.get_pod imported but unused. Remove unused import (F401)

from workload.constant import IMAGE_UBUNTU
from utility.utility import subprocess_exec_cmd
from utility.utility import logging
@@ -95,6 +95,7 @@ def check_instance_manager_pdb_not_exist(instance_manager):
    exec_cmd = ["kubectl", "get", "pdb", "-n", "longhorn-system"]
    res = subprocess_exec_cmd(exec_cmd)
    assert instance_manager not in res.decode('utf-8')

def wait_namespaced_job_complete(job_label, namespace):
    retry_count, retry_interval = get_retry_count_and_interval()
    api = client.BatchV1Api()

@@ -170,3 +171,25 @@ def delete_namespace(namespace):
        api.delete_namespace(name=namespace)
    except ApiException as e:
        assert e.status == 404

def wait_for_namespace_pods_running(namespace):
    retry_count, retry_interval = get_retry_count_and_interval()

    for i in range(retry_count):
        time.sleep(retry_interval)
        pod_list = list_namespace_pods(namespace)
        all_running = True

        for pod in pod_list.items:
            pod_name = pod.metadata.name
            pod_status = pod.status.phase

            if pod_status != "Running":
                logging(f"Pod {pod_name} is in {pod_status} state, waiting...")
                all_running = False

        if all_running:
            logging(f"All pods in namespace {namespace} are in Running state!")
            return

    assert False, f"wait all pod in namespace {namespace} running failed"
Comment on lines +175 to +195

🛠️ Refactor suggestion

Consider enhancing pod status verification for Longhorn components.

Since this function is part of Longhorn components recovery testing, consider enhancing the verification to include:

  1. Container readiness checks in addition to pod phase
  2. Specific checks for critical Longhorn component pods

Here's a suggested enhancement:

 def wait_for_namespace_pods_running(namespace):    
     retry_count, retry_interval = get_retry_count_and_interval()
 
     for i in range(retry_count):
         pod_list = list_namespace_pods(namespace)        
         all_running = True
 
         for pod in pod_list.items:
             pod_name = pod.metadata.name
             pod_status = pod.status.phase
+            is_longhorn_component = pod.metadata.labels.get("longhorn.io/component")
 
-            if pod_status != "Running":
+            # Check both pod phase and container readiness
+            if pod_status != "Running" or not all(
+                container.ready for container in pod.status.container_statuses
+            ):
                 logging(f"Pod {pod_name} is in {pod_status} state, waiting... (attempt {i + 1}/{retry_count})")
+                if is_longhorn_component:
+                    logging(f"Critical Longhorn component {pod_name} not ready")
                 all_running = False
 
         if all_running:
             logging(f"All pods in namespace {namespace} are in Running state!")
             return
 
         time.sleep(retry_interval)
 
     raise AssertionError(f"Timed out waiting for all pods in namespace {namespace} to reach Running state")


🪛 Ruff 178-178: Loop control variable i not used within loop body. Rename unused i to _i (B007)
🪛 Ruff 195-195: Do not assert False (python -O removes these calls), raise AssertionError() (B011)


⚠️ Potential issue

Optimize the wait_for_namespace_pods_running implementation.

The function logic is sound, but there are a few improvements to be made:

  1. Move the sleep to the end of the loop to avoid delaying the first check
  2. Include the retry count in logs for better debugging
  3. Replace assert False with a proper exception

Apply these improvements:

 def wait_for_namespace_pods_running(namespace):    
     retry_count, retry_interval = get_retry_count_and_interval()
 
-    for i in range(retry_count):        
-        time.sleep(retry_interval)
+    for i in range(retry_count):
         pod_list = list_namespace_pods(namespace)        
         all_running = True
 
         for pod in pod_list.items:
             pod_name = pod.metadata.name
             pod_status = pod.status.phase
 
             if pod_status != "Running":
-                logging(f"Pod {pod_name} is in {pod_status} state, waiting...")
+                logging(f"Pod {pod_name} is in {pod_status} state, waiting... (attempt {i + 1}/{retry_count})")
                 all_running = False
 
         if all_running:
             logging(f"All pods in namespace {namespace} are in Running state!")
             return
 
+        time.sleep(retry_interval)
 
-    assert False, f"wait all pod in namespace {namespace} running failed"
+    raise AssertionError(f"Timed out waiting for all pods in namespace {namespace} to reach Running state")

20 changes: 20 additions & 0 deletions e2e/libs/keywords/backing_image_keywords.py
@@ -20,3 +20,23 @@ def delete_backing_image(self, bi_name):

    def cleanup_backing_images(self):
        self.backing_image.cleanup_backing_images()

    def delete_backing_image_manager(self, name):
        self.backing_image.delete_backing_image_manager(name)

    def wait_all_backing_image_managers_running(self):
        self.backing_image.wait_all_backing_image_managers_running()

    def wait_backing_image_manager_restart(self, name, last_creation_time):
        self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)

    def list_backing_image_manager(self):
        return self.backing_image.list_backing_image_manager()
Comment on lines +24 to +34

🛠️ Refactor suggestion

Add docstrings and error handling to the new methods.

These test automation keywords would benefit from:

  1. Docstrings explaining the purpose, parameters, and expected behavior
  2. Timeout parameters for the wait methods to prevent hanging tests
  3. Basic input validation and error handling

Here's a suggested improvement for one of the methods as an example:

     def wait_backing_image_manager_restart(self, name, last_creation_time):
+        """Wait for a backing image manager to restart after deletion.
+        
+        Args:
+            name (str): Name of the backing image manager
+            last_creation_time (str): Previous creation timestamp
+            
+        Raises:
+            TimeoutError: If manager doesn't restart within timeout
+            ValueError: If name or timestamp is invalid
+        """
+        if not name or not last_creation_time:
+            raise ValueError("Name and last_creation_time must be provided")
         self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)



    def delete_all_backing_image_managers_and_wait_for_recreation(self):
        backing_image_managers = self.backing_image.list_backing_image_manager()
        for backing_image in backing_image_managers["items"]:
            name = backing_image["metadata"]["name"]
            last_creation_time = backing_image["metadata"]["creationTimestamp"]
            self.backing_image.delete_backing_image_manager(name)
            self.backing_image.wait_backing_image_manager_restart(name, last_creation_time)
4 changes: 4 additions & 0 deletions e2e/libs/keywords/k8s_keywords.py
@@ -9,6 +9,7 @@
from k8s.k8s import check_node_cordoned
from k8s.k8s import get_instance_manager_on_node
from k8s.k8s import check_instance_manager_pdb_not_exist
from k8s.k8s import wait_for_namespace_pods_running
from utility.utility import logging
from node import Node

@@ -78,3 +79,6 @@ def get_instance_manager_on_node(self, node_name):

    def check_instance_manager_pdb_not_exist(self, instance_manager):
        return check_instance_manager_pdb_not_exist(instance_manager)

    def wait_for_namespace_pods_running(self, namespace):
        return wait_for_namespace_pods_running(namespace)
12 changes: 12 additions & 0 deletions e2e/libs/keywords/sharemanager_keywords.py
@@ -47,3 +47,15 @@ def wait_for_sharemanagers_deleted(self, name=[]):
            time.sleep(retry_interval)

        assert AssertionError, f"Failed to wait for all sharemanagers to be deleted"

    def delete_sharemanager(self, name):
        return self.sharemanager.delete(name)

    def delete_sharemanager_and_wait_for_recreation(self, name):
        sharemanager = self.sharemanager.get(name)
        last_creation_time = sharemanager["metadata"]["creationTimestamp"]
        self.sharemanager.delete(name)
        self.sharemanager.wait_for_restart(name, last_creation_time)
Comment on lines +54 to +58

⚠️ Potential issue

Enhance robustness and documentation of recreation verification.

The method needs documentation and additional error handling to improve reliability:

 def delete_sharemanager_and_wait_for_recreation(self, name):
+    """Delete a share manager and verify it gets recreated.
+    
+    Args:
+        name (str): Name of the share manager
+    
+    Raises:
+        ValueError: If name is empty or share manager not found
+        TimeoutError: If recreation takes too long
+    """
+    if not name:
+        raise ValueError("Share manager name cannot be empty")
+
     sharemanager = self.sharemanager.get(name)
+    if not sharemanager:
+        raise ValueError(f"Share manager {name} not found")
+
     last_creation_time = sharemanager["metadata"]["creationTimestamp"]
     self.sharemanager.delete(name)
-    self.sharemanager.wait_for_restart(name, last_creation_time)
+    try:
+        self.sharemanager.wait_for_restart(name, last_creation_time)
+    except Exception as e:
+        logging(f"Failed waiting for share manager {name} recreation: {str(e)}")
+        raise



    def wait_for_share_manager_running(self, name):
        return self.sharemanager.wait_for_running(name)
12 changes: 6 additions & 6 deletions e2e/libs/keywords/workload_keywords.py
@@ -46,9 +46,9 @@ def create_pod(self, pod_name, claim_name):
         logging(f'Creating pod {pod_name} using pvc {claim_name}')
         create_pod(new_busybox_manifest(pod_name, claim_name))

-    def delete_pod(self, pod_name):
+    def delete_pod(self, pod_name, namespace='default'):
         logging(f'Deleting pod {pod_name}')
-        delete_pod(pod_name)
+        delete_pod(pod_name, namespace)

     def cleanup_pods(self):
         cleanup_pods()

@@ -61,15 +61,15 @@ def check_pod_data_checksum(self, expected_checksum, pod_name, file_name):
         logging(f'Checking checksum for file {file_name} in pod {pod_name}')
         check_pod_data_checksum(expected_checksum, pod_name, file_name)

-    def delete_workload_pod_on_node(self, workload_name, node_name, namespace="default"):
-        pods = get_workload_pods(workload_name, namespace=namespace)
+    def delete_workload_pod_on_node(self, workload_name, node_name, namespace="default", label_selector=""):
+        pods = get_workload_pods(workload_name, namespace=namespace, label_selector=label_selector)
         for pod in pods:
             if pod.spec.node_name == node_name:
                 logging(f'Deleting pod {pod.metadata.name} on node {node_name}')
                 delete_pod(pod.metadata.name, namespace=namespace)

-    def get_workload_pod_name(self, workload_name):
-        return get_workload_pod_names(workload_name)[0]
+    def get_workload_pod_name(self, workload_name, namespace="default"):
+        return get_workload_pod_names(workload_name, namespace)[0]

     def get_workload_persistent_volume_claim_name(self, workload_name):
         return get_workload_persistent_volume_claim_name(workload_name)
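
The label_selector plumbing above assumes the underlying get_workload_pods helper can filter on a label; that helper is not part of this diff. A hypothetical sketch using the standard Kubernetes Python client, where the app=<workload> fallback selector is an assumption rather than the repo's actual behavior:

from kubernetes import client

def get_workload_pods(workload_name, namespace="default", label_selector=""):
    # Hypothetical helper: list the workload's pods, optionally narrowed by an explicit label selector.
    core_api = client.CoreV1Api()
    selector = label_selector if label_selector else f"app={workload_name}"
    return core_api.list_namespaced_pod(namespace=namespace, label_selector=selector).items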
16 changes: 16 additions & 0 deletions e2e/libs/sharemanager/base.py
@@ -5,3 +5,19 @@ class Base(ABC):

    @abstractmethod
    def list(self):
        return NotImplemented

    @abstractmethod
    def get(self, name):
        return NotImplemented

    @abstractmethod
    def delete(self, name):
        return NotImplemented

    @abstractmethod
    def wait_for_running(self, name):
        return NotImplemented
Comment on lines +17 to +19

🛠️ Refactor suggestion

Consider adding timeout parameter to wait_for_running.

For robustness in test scenarios, consider adding an optional timeout parameter to control how long the method should wait before giving up.

     @abstractmethod
-    def wait_for_running(self, name):
+    def wait_for_running(self, name: str, timeout: int = 300) -> bool:
+        """Wait for a share manager to reach running state.
+
+        Args:
+            name: Name of the share manager
+            timeout: Maximum time to wait in seconds (default: 300)
+
+        Returns:
+            bool: True if running state is reached, False if timeout occurs
+        """
         return NotImplemented


    @abstractmethod
    def wait_for_restart(self, name, last_creation_time):
        return NotImplemented
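
For reference, a CRD-backed implementation of the wait_for_running contract above would likely mirror the backing image manager polling shown earlier in this diff. A hedged sketch, assuming the sharemanagers.longhorn.io custom resource exposes status.state and reusing the same retry utilities; the PR's actual e2e/libs/sharemanager/crd.py is not shown here:

import time
from kubernetes import client
from utility.utility import logging
from utility.utility import get_retry_count_and_interval

def wait_for_sharemanager_running(name, namespace="longhorn-system"):
    # Poll the sharemanager CR until its reported state is "running" (field name assumed).
    obj_api = client.CustomObjectsApi()
    retry_count, retry_interval = get_retry_count_and_interval()
    for _ in range(retry_count):
        sm = obj_api.get_namespaced_custom_object(
            group="longhorn.io", version="v1beta2", namespace=namespace,
            plural="sharemanagers", name=name)
        state = sm.get("status", {}).get("state", "")
        logging(f"share manager {name} currently in {state} state")
        if state == "running":
            return
        time.sleep(retry_interval)
    raise AssertionError(f"Timed out waiting for share manager {name} to reach running state")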