ci: fix test failures caused by unstable harvester cluster #2144
Conversation
Signed-off-by: Yang Chiu <[email protected]>
Walkthrough

The changes in this pull request enhance the error handling and resilience of the VM management process in the Harvester class. Specifically, retry logic has been added to the VM stop and start operations.
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (2)
e2e/libs/host/harvester.py (2)
Lines 74-80: Ensure robust handling of VM status checks

When checking the VM's status, consider verifying that the response JSON contains the expected keys to prevent potential KeyError exceptions if the response structure changes. Apply this diff to safely access the JSON data:
 resp = requests.get(url, cookies=self.cookies, verify=False)
-if "Stopped" in resp.json()['metadata']['fields']:
+resp_json = resp.json()
+metadata = resp_json.get('metadata', {})
+fields = metadata.get('fields', [])
+if "Stopped" in fields:
     stopped = True
     break
Lines 102-108: Ensure robust handling of VM status checks

When checking the VM's status, consider verifying that the response JSON contains the expected keys to prevent potential KeyError exceptions if the response structure changes. Apply this diff to safely access the JSON data:
 resp = requests.get(url, cookies=self.cookies, verify=False)
-if "Running" in resp.json()['metadata']['fields']:
+resp_json = resp.json()
+metadata = resp_json.get('metadata', {})
+fields = metadata.get('fields', [])
+if "Running" in fields:
     started = True
     break
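A small shared helper could consolidate this defensive access for both the "Stopped" and "Running" checks. This is only a sketch that assumes the class keeps its current requests/cookies usage; the _vm_status_fields name is hypothetical, not an existing method:

import requests

def _vm_status_fields(self, vm_id):
    # Fetch the VM resource and defensively extract the status fields list.
    # Returns an empty list if the body is not JSON or lacks the expected
    # keys, so callers can test membership without risking KeyError.
    url = f"{self.url}/{vm_id}"
    resp = requests.get(url, cookies=self.cookies, verify=False)
    try:
        data = resp.json()
    except ValueError:
        return []
    return data.get('metadata', {}).get('fields', [])

# Hypothetical usage inside the existing polling loops:
# if "Stopped" in self._vm_status_fields(vm_id): ...
# if "Running" in self._vm_status_fields(vm_id): ...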
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- e2e/libs/host/harvester.py (1 hunks)
- pipelines/utilities/terraform_setup.sh (1 hunks)
🧰 Additional context used
until [ "$(KUBECONFIG=${PWD}/test_framework/kube_config.yaml kubectl get nodes -o jsonpath='{.items[*].status.conditions}' | jq '.[] | select(.type == "Ready").status' | grep -ci true)" -eq 4 ]; do
  echo "waiting for harvester cluster nodes to be running"
  sleep 2
done
Ensure the script correctly handles any errors from kubectl commands

If the kubectl command fails (e.g., due to configuration issues or cluster unavailability), the script may produce misleading output or errors. Add error handling to catch failures of the kubectl command:
until [ "$(KUBECONFIG=${PWD}/test_framework/kube_config.yaml kubectl get nodes --no-headers 2>/dev/null | grep -cw 'Ready')" -eq $EXPECTED_NODE_COUNT ]; do
echo "waiting for harvester cluster nodes to be running"
+ if ! KUBECONFIG=${PWD}/test_framework/kube_config.yaml kubectl get nodes >/dev/null 2>&1; then
+ echo "kubectl command failed. Please check your Kubernetes configuration."
+ exit 1
+ fi
sleep $WAIT_INTERVAL
TIME_WAITED=$((TIME_WAITED + WAIT_INTERVAL))
if [ $TIME_WAITED -ge $MAX_WAIT_TIME ]; then
echo "Timeout reached while waiting for nodes to be ready"
exit 1
fi
done
This addition:
- Redirects stderr to /dev/null in the condition check to prevent error messages from breaking the loop logic.
- Checks the success of the kubectl get nodes command separately.
- Exits with an error message if kubectl fails.
Committable suggestion was skipped due to low confidence.
🛠️ Refactor suggestion
Parameterize the expected number of nodes and add a timeout to prevent indefinite waiting
The until loop is hardcoded to wait for 4 nodes to be ready. If the cluster size changes, this script will need to be updated manually each time. Additionally, without a timeout, the script could hang indefinitely if the nodes fail to become ready.
Consider introducing variables for the expected node count and a maximum wait time. This makes the script more flexible and prevents indefinite execution.
Apply the following changes:
+EXPECTED_NODE_COUNT=4
+MAX_WAIT_TIME=300 # in seconds
+WAIT_INTERVAL=2 # in seconds
+TIME_WAITED=0
until [ "$(KUBECONFIG=${PWD}/test_framework/kube_config.yaml kubectl get nodes -o jsonpath='{.items[*].status.conditions}' | jq '.[] | select(.type == "Ready").status' | grep -ci true)" -eq $EXPECTED_NODE_COUNT ]; do
echo "waiting for harvester cluster nodes to be running"
sleep $WAIT_INTERVAL
+ TIME_WAITED=$((TIME_WAITED + WAIT_INTERVAL))
+ if [ $TIME_WAITED -ge $MAX_WAIT_TIME ]; then
+ echo "Timeout reached while waiting for nodes to be ready"
+ exit 1
+ fi
done
This modification:
- Introduces EXPECTED_NODE_COUNT to specify how many nodes should be ready.
- Adds a timeout mechanism with MAX_WAIT_TIME to prevent infinite loops.
- Uses TIME_WAITED to track the total time spent waiting.
- Exits the script with an error if the timeout is reached.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-until [ "$(KUBECONFIG=${PWD}/test_framework/kube_config.yaml kubectl get nodes -o jsonpath='{.items[*].status.conditions}' | jq '.[] | select(.type == "Ready").status' | grep -ci true)" -eq 4 ]; do
-  echo "waiting for harvester cluster nodes to be running"
-  sleep 2
-done
+EXPECTED_NODE_COUNT=4
+MAX_WAIT_TIME=300 # in seconds
+WAIT_INTERVAL=2 # in seconds
+TIME_WAITED=0
+until [ "$(KUBECONFIG=${PWD}/test_framework/kube_config.yaml kubectl get nodes -o jsonpath='{.items[*].status.conditions}' | jq '.[] | select(.type == "Ready").status' | grep -ci true)" -eq $EXPECTED_NODE_COUNT ]; do
+  echo "waiting for harvester cluster nodes to be running"
+  sleep $WAIT_INTERVAL
+  TIME_WAITED=$((TIME_WAITED + WAIT_INTERVAL))
+  if [ $TIME_WAITED -ge $MAX_WAIT_TIME ]; then
+    echo "Timeout reached while waiting for nodes to be ready"
+    exit 1
+  fi
+done
🛠️ Refactor suggestion
Simplify the node readiness check for better readability
The current command to check node readiness is complex and could be simplified for clarity.
You can achieve the same result with a more straightforward command:
-until [ "$(KUBECONFIG=${PWD}/test_framework/kube_config.yaml kubectl get nodes -o jsonpath='{.items[*].status.conditions}' | jq '.[] | select(.type == "Ready").status' | grep -ci true)" -eq $EXPECTED_NODE_COUNT ]; do
+until [ "$(KUBECONFIG=${PWD}/test_framework/kube_config.yaml kubectl get nodes --no-headers | grep -cw 'Ready')" -eq $EXPECTED_NODE_COUNT ]; do
echo "waiting for harvester cluster nodes to be running"
sleep $WAIT_INTERVAL
TIME_WAITED=$((TIME_WAITED + WAIT_INTERVAL))
if [ $TIME_WAITED -ge $MAX_WAIT_TIME ]; then
echo "Timeout reached while waiting for nodes to be ready"
exit 1
fi
done
This change:
- Uses kubectl get nodes --no-headers for cleaner output.
- Utilizes grep -cw 'Ready' to count the lines containing the whole word "Ready", ensuring accurate counting.
- Improves readability and maintainability of the script.
Committable suggestion was skipped due to low confidence.
for i in range(self.retry_count):
    logging(f"Trying to stop vm {vm_id} ... ({i})")
    try:
        resp = requests.post(f"{url}?action=stop", cookies=self.cookies, verify=False)
        logging(f"resp = {resp}")
        assert resp.status_code == 204, f"Failed to stop vm {vm_id} response: {resp.status_code} {resp.reason}, request: {resp.request.url} {resp.request.headers}"
        break
    except Exception as e:
        logging(f"Stopping vm failed with error {e}")
🛠️ Refactor suggestion

Refactor duplicate retry logic into a helper method

The retry logic in the power_off_node and power_on_node methods is very similar. Consider refactoring this code into a separate helper method to reduce duplication and improve maintainability. Here's a suggested approach:
def _retry_vm_action(self, action, vm_id):
    url = f"{self.url}/{vm_id}"
    for i in range(self.retry_count):
        logging(f"Trying to {action} vm {vm_id} ... ({i})")
        try:
            resp = requests.post(f"{url}?action={action}", cookies=self.cookies, verify=False)
            logging(f"resp = {resp}")
            if resp.status_code != 204:
                logging(f"Failed to {action} vm {vm_id}. Response: {resp.status_code} {resp.reason}, request: {resp.request.url} {resp.request.headers}")
                continue
            break
        except Exception as e:
            logging(f"Failed to {action} vm {vm_id} with error {e}")

def power_off_node(self, node_name):
    vm_id = self.mapping[node_name]
    self._retry_vm_action('stop', vm_id)
    # Existing code for checking VM stopped status

def power_on_node(self, node_name):
    vm_id = self.mapping[node_name]
    self._retry_vm_action('start', vm_id)
    # Existing code for checking VM started status
Also applies to: 88-96
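In the same spirit, the status-polling loops that follow each action could also be shared. The sketch below is only an illustration; _wait_for_vm_status, retry_count, retry_interval, and the logging helper are assumed to match how the existing class is written, not confirmed APIs:

import time
import requests

def _wait_for_vm_status(self, vm_id, expected):
    # Poll the VM resource until the expected status string (e.g. "Stopped"
    # or "Running") appears in metadata.fields, or retries run out.
    url = f"{self.url}/{vm_id}"
    for i in range(self.retry_count):
        try:
            resp = requests.get(url, cookies=self.cookies, verify=False)
            fields = resp.json().get('metadata', {}).get('fields', [])
            if expected in fields:
                return True
        except Exception as e:
            logging(f"Checking vm {vm_id} status failed with error {e}")
        time.sleep(self.retry_interval)
    return False

A caller such as power_off_node could then reduce to self._retry_vm_action('stop', vm_id) followed by self._wait_for_vm_status(vm_id, "Stopped").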
Replace assert statements with explicit error handling

Using assert statements for control flow is not recommended in production code because assertions can be disabled with optimization flags (-O, -OO). Instead, consider checking the response status code explicitly and handling errors accordingly. Apply this diff to replace the assertion:
resp = requests.post(f"{url}?action=stop", cookies=self.cookies, verify=False)
logging(f"resp = {resp}")
- assert resp.status_code == 204, f"Failed to stop vm {vm_id} response: {resp.status_code} {resp.reason}, request: {resp.request.url} {resp.request.headers}"
+ if resp.status_code != 204:
+ logging(f"Failed to stop vm {vm_id}. Response: {resp.status_code} {resp.reason}, request: {resp.request.url} {resp.request.headers}")
+ continue
break
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
 for i in range(self.retry_count):
     logging(f"Trying to stop vm {vm_id} ... ({i})")
     try:
         resp = requests.post(f"{url}?action=stop", cookies=self.cookies, verify=False)
         logging(f"resp = {resp}")
-        assert resp.status_code == 204, f"Failed to stop vm {vm_id} response: {resp.status_code} {resp.reason}, request: {resp.request.url} {resp.request.headers}"
+        if resp.status_code != 204:
+            logging(f"Failed to stop vm {vm_id}. Response: {resp.status_code} {resp.reason}, request: {resp.request.url} {resp.request.headers}")
+            continue
         break
     except Exception as e:
         logging(f"Stopping vm failed with error {e}")
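If the assertion is removed, one option is to fail loudly once every retry has been exhausted rather than falling through silently. This for/else sketch is just one way to do that, reusing self.retry_count and the logging helper as they appear in the existing code; the final RuntimeError is an added assumption, not something the PR raises today:

for i in range(self.retry_count):
    logging(f"Trying to stop vm {vm_id} ... ({i})")
    try:
        resp = requests.post(f"{url}?action=stop", cookies=self.cookies, verify=False)
        if resp.status_code == 204:
            break
        logging(f"Failed to stop vm {vm_id}: {resp.status_code} {resp.reason}")
    except Exception as e:
        logging(f"Stopping vm failed with error {e}")
else:
    # The else branch of a for loop runs only when no break occurred,
    # i.e. every attempt failed, so surface an explicit error here.
    raise RuntimeError(f"Failed to stop vm {vm_id} after {self.retry_count} attempts")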
for i in range(self.retry_count):
    logging(f"Trying to start vm {vm_id} ... ({i})")
    try:
        resp = requests.post(f"{url}?action=start", cookies=self.cookies, verify=False)
        logging(f"resp = {resp}")
        assert resp.status_code == 204, f"Failed to start vm {vm_id} response: {resp.status_code} {resp.reason}, request: {resp.request.url} {resp.request.headers}"
        break
    except Exception as e:
        logging(f"Starting vm failed with error {e}")
Replace assert statements with explicit error handling

Using assert statements for control flow is not recommended in production code because assertions can be disabled with optimization flags (-O, -OO). Instead, consider checking the response status code explicitly and handling errors accordingly. Apply this diff to replace the assertion:
resp = requests.post(f"{url}?action=start", cookies=self.cookies, verify=False)
logging(f"resp = {resp}")
- assert resp.status_code == 204, f"Failed to start vm {vm_id} response: {resp.status_code} {resp.reason}, request: {resp.request.url} {resp.request.headers}"
+ if resp.status_code != 204:
+ logging(f"Failed to start vm {vm_id}. Response: {resp.status_code} {resp.reason}, request: {resp.request.url} {resp.request.headers}")
+ continue
break
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
 for i in range(self.retry_count):
     logging(f"Trying to start vm {vm_id} ... ({i})")
     try:
         resp = requests.post(f"{url}?action=start", cookies=self.cookies, verify=False)
         logging(f"resp = {resp}")
-        assert resp.status_code == 204, f"Failed to start vm {vm_id} response: {resp.status_code} {resp.reason}, request: {resp.request.url} {resp.request.headers}"
+        if resp.status_code != 204:
+            logging(f"Failed to start vm {vm_id}. Response: {resp.status_code} {resp.reason}, request: {resp.request.url} {resp.request.headers}")
+            continue
         break
     except Exception as e:
         logging(f"Starting vm failed with error {e}")
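As an alternative to hand-rolled retry loops, transport-level retries can be configured once on a requests Session using urllib3's Retry. This is a sketch of a common pattern rather than what the PR does; the retry counts, backoff, and status codes below are illustrative values:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
retries = Retry(
    total=5,                                 # retry each request up to 5 times
    backoff_factor=1,                        # exponential backoff between attempts
    status_forcelist=[500, 502, 503, 504],   # retry on transient server errors
    allowed_methods=["GET", "POST"],         # POST is not retried by default
)
session.mount("https://", HTTPAdapter(max_retries=retries))

# Hypothetical usage, mirroring the existing calls:
# resp = session.post(f"{url}?action=start", cookies=cookies, verify=False)

Note that this only covers connection errors and retryable HTTP status codes; application-level conditions (such as waiting for the VM to report "Running") would still need an explicit polling loop.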
Which issue(s) this PR fixes:
Issue longhorn/longhorn#9670
What this PR does / why we need it:
Fix test failures caused by an unstable Harvester cluster.
Special notes for your reviewer:
Additional documentation or context
Summary by CodeRabbit
New Features
- The terraform_setup.sh script now waits for Kubernetes nodes to reach a "Ready" state before proceeding, ensuring smoother deployments.

Bug Fixes