
image awx-ee:latest broken for use with awx-operator #258

Open
sgreinerCNS opened this issue Sep 25, 2024 · 5 comments · May be fixed by ansible/awx-operator#1985

@sgreinerCNS

The awx-web and awx-task Kubernetes pods stop working and end up in Init:CrashLoopBackOff.

The cause is the init container's image, quay.io/ansible/awx-ee:latest, which fails with:

ln: failed to create symbolic link '/etc/pki/ca-trust/extracted/pem/directory-hash/ca-certificates.crt': Permission denied

I manually edited the deployments to use quay.io/ansible/awx-ee:24.6.1 instead, and the pods came up again.
Unfortunately, the awx-operator keeps changing the image back to the broken latest tag.

@sgreinerCNS changed the title from "image broken for use with awx-operator awx-ee:latest" to "image awx-ee:latest broken for use with awx-operator" on Sep 25, 2024
@Jed-Giblin

Encountering the same issue. It can be reproduced by draining the node the pods are running on: the failure occurs on first boot on the new node. Recreating the pod on the new node restores functionality.

k8s info:

Client Version: v1.28.10
Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
Server Version: v1.28.10

AWX Resource Details

Labels:       app.kubernetes.io/component=awx
              app.kubernetes.io/managed-by=awx-operator
              app.kubernetes.io/operator-version=2.12.2
              app.kubernetes.io/part-of=awx-prod
Annotations:  <none>
API Version:  awx.ansible.com/v1beta1
Kind:         AWX
Metadata:
  Creation Timestamp:  2024-04-08T18:06:54Z
  Generation:          2
  Resource Version:    79405586

Some extra configuration that might be relevant:

  web_extra_env:
    - name: LDAPTLS_CACERT
      value: /etc/pki/ca-trust/source/anchors/bundle-ca.crt

The file above, inside the container, is the CA for a local LDAP domain.

Status:
  Admin Password Secret:       <redact>
  Admin User:                  <redact>
  Broadcast Websocket Secret:  <redact>
  Conditions:
    Last Transition Time:         2024-10-14T12:51:29Z
    Reason:
    Status:                       False
    Type:                         Failure
    Last Transition Time:         2024-10-14T12:50:18Z
    Reason:                       Successful
    Status:                       True
    Type:                         Running
    Last Transition Time:         2024-10-14T13:16:05Z
    Reason:                       Successful
    Status:                       True
    Type:                         Successful
  Image:                          quay.io/ansible/awx:23.9.0
  Postgres Configuration Secret:  <redact>
  Secret Key Secret:              <redact>
  Version:                        23.9.0

@sgreinerCNS

Our situation was similar, also involving an LDAPS CA and a CA bundle (required because of TLS deep inspection by security appliances).

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: cns-awx
  namespace: awx
spec:
  image_pull_policy: Always
  control_plane_ee_image: quay.io/ansible/awx-ee:23.3.0
  init_container_image: quay.io/ansible/awx-ee
  init_container_image_version: 24.6.1
  ingress_type: Ingress
  hostname: <redact>
  ingress_annotations: ""
  ingress_tls_secret: <redact>
  admin_user: <redact>
  admin_email: <redact>
  admin_password_secret: <redact>
  web_resource_requirements:
    requests:
      cpu: 200m
      memory: 500Mi
  task_resource_requirements:
    requests:
      cpu: 200m
      memory: 500Mi
  ldap_cacert_secret: <redact>
  bundle_cacert_secret: <redact>
  secret_key_secret: <redact>
  projects_persistence: true
  projects_existing_claim: cns-awx-storage-projects-claim
  postgres_storage_requirements:
    requests:
      storage: 4Gi
  postgres_storage_class: postgres

The ldap_cacert_secret provides the file ldap-ca.crt and the bundle_cacert_secret provides the file bundle-ca.crt, each via a secret.

By setting init_container_image and pinning init_container_image_version to 24.6.1, I was able to avoid the buggy awx-ee:latest, which for some reason cannot create ca-certificates.crt.
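
For an AWX resource that already exists, the same pin could be applied as a merge patch on the custom resource rather than on the deployments, which the operator reverts. The following is a small Python sketch that only builds the patch and the matching kubectl command line. The two spec fields come from the manifest above; the helper names and the use of `awx` as the kubectl resource type are assumptions for illustration.

```python
import json
import shlex


def build_init_image_patch(image: str, version: str) -> dict:
    """Build a merge patch that pins the init container image on the AWX spec."""
    return {
        "spec": {
            "init_container_image": image,
            "init_container_image_version": version,
        }
    }


def kubectl_patch_command(name: str, namespace: str, patch: dict) -> str:
    """Render a shell-quoted kubectl command that applies the patch.

    Assumption: the AWX custom resource is addressable as "awx" in kubectl.
    """
    args = [
        "kubectl", "patch", "awx", name,
        "--namespace", namespace,
        "--type", "merge",
        "--patch", json.dumps(patch),
    ]
    return shlex.join(args)


if __name__ == "__main__":
    patch = build_init_image_patch("quay.io/ansible/awx-ee", "24.6.1")
    # Print the command for review instead of running it.
    print(kubectl_patch_command("cns-awx", "awx", patch))
```

Printing the command instead of executing it keeps the sketch free of side effects; editing the manifest and re-applying it achieves the same result.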

@ppmathis

ppmathis commented Oct 26, 2024

After digging around for a while, as I've been facing the same problem in a custom EE built from Rocky Linux 9, I found out that the issue is related to changes in the ca-certificates system package. The version 24.6.1 that still works has 2023.2.60_v7.0.306 installed, whereas the current latest is running 2024.2.69_v8.0.303.

After going through the RPM changelog, I noticed that not only have CA certificates been updated, but the update-ca-trust script itself has been greatly changed, as can be seen in the commit history: https://gitlab.com/redhat/centos-stream/rpms/ca-certificates/-/commits/c9s/update-ca-trust

The old script, which is also part of 24.6.1, is very trivial and can still be found here.

The new script, on the other hand, which was introduced here and whose latest version can be found here, is much more complex and does more than the old script.

One key change is that, in addition to simply calling /usr/bin/trust extract a couple of times, it now also tries to execute /usr/bin/ln to create symlinks, specifically those in the directory-hash directory, which causes the issue here due to a lack of permissions. As the script itself explains, p11-kit makes the directory-hash directory unwritable, and because we run as non-root, we do not have the benefit of CAP_DAC_OVERRIDE.

I was able to verify that the current EE runs if the awx-task and awx-web deployments call update-ca-trust extract --output /etc/pki/ca-trust/extracted in the init-bundle-ca-trust init container. This internally sets USER_DEST in the script, which triggers the extra code branch that runs /usr/bin/chmod u+w and fixes up the permissions of the directory-hash directory.
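
The failure and the fix described above can be reproduced outside the container. The following is a minimal Python sketch, not the update-ca-trust script itself: it creates a scratch directory named directory-hash, removes write permission from it, attempts to create a symlink, then restores u+w and retries. The scratch location, the placeholder link target, and the function names are assumptions for illustration.

```python
import os
import stat
import tempfile
from typing import Tuple


def try_symlink(link_path: str, target: str) -> bool:
    """Return True if the symlink was created, False on a permission error."""
    try:
        os.symlink(target, link_path)
        return True
    except PermissionError:
        return False


def demo() -> Tuple[bool, bool]:
    """Return (created_before_chmod, created_after_chmod)."""
    with tempfile.TemporaryDirectory() as root:
        hash_dir = os.path.join(root, "directory-hash")
        os.mkdir(hash_dir)
        link = os.path.join(hash_dir, "ca-certificates.crt")
        target = "placeholder-bundle.pem"  # hypothetical target, may dangle

        # Mimic the directory being left without write permission.
        os.chmod(hash_dir, 0o555)
        before = try_symlink(link, target)

        # Equivalent of "chmod u+w": the owner restores write permission.
        mode = os.stat(hash_dir).st_mode
        os.chmod(hash_dir, mode | stat.S_IWUSR)

        # When running as root the first attempt succeeds, so clean up first.
        if os.path.lexists(link):
            os.remove(link)
        after = try_symlink(link, target)
        return before, after


if __name__ == "__main__":
    print(demo())
```

As an unprivileged user the first attempt is expected to fail and the second to succeed. As root, the first attempt typically succeeds as well, because root bypasses the directory permission check, which is the CAP_DAC_OVERRIDE behaviour the non-root init container lacks.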

Unfortunately I currently lack the time to submit this as a PR to awx-operator, as I'm unsure about the potential impact when considering other EEs with different script versions, but it might be an easy fix. As a workaround, which has been good enough for me, I'm now copying the old script into my AWX EE:

additional_build_files:
  - src: files/update-ca-trust
    dest: files

additional_build_steps:
  append_base:
    # Copy legacy update-ca-trust script for compatibility with AWX Operator
    - COPY --chmod=755 _build/files/update-ca-trust /usr/bin/update-ca-trust

This might also be of interest to @JoelKle who introduced this init container as part of PR #1846 in the awx-operator project. The initial idea was to run this as root, but due to OpenShift compatibility a non-privileged approach was taken, which worked fine - until the update-ca-trust script changed and broke this previously working solution.

@zendritic

zendritic commented Oct 29, 2024

> Unfortunately I currently lack the time to submit this as a PR to awx-operator, as I'm unsure about the potential impact when considering other EEs with different script versions, but it might be an easy fix. As a workaround, which has been good enough for me, I'm now copying the old script into my AWX EE:
>
> additional_build_files:
>   - src: files/update-ca-trust
>     dest: files
>
> additional_build_steps:
>   append_base:
>     # Copy legacy update-ca-trust script for compatibility with AWX Operator
>     - COPY --chmod=755 _build/files/update-ca-trust /usr/bin/update-ca-trust

If ansible-builder is not an option for you, you can also copy update-ca-trust from the 24.6.1 image into a custom init container image built from latest, in its Dockerfile (or Containerfile, for you Podman folks).

JoelKle added a commit to JoelKle/awx-operator that referenced this issue Oct 30, 2024
@JoelKle

JoelKle commented Oct 30, 2024

Thank you @ppmathis for your great analysis of this problem.
I've opened a PR with the solution you proposed: ansible/awx-operator#1985
