
Introduce hardened Flatcar Images for CAPZ #1659

Closed · 50 tasks done · Tracked by #426 ...
Rotfuks opened this issue Nov 21, 2022 · 23 comments
Labels: area/kaas Mission: Cloud Native Platform - Self-driving Kubernetes as a Service
Rotfuks commented Nov 21, 2022

Motivation

Currently we use Ubuntu images for our cluster nodes, but those are not specially hardened and thus less secure than they could be. We have a more secure alternative in the hardened Flatcar images, so we need to replace the Ubuntu images with Flatcar ones.

Todo

Open Upstream Issues

Outcome

  • Our systems are significantly more secure thanks to hardened base images for the nodes.

Technical Hint

Rotfuks added the area/kaas (Mission: Cloud Native Platform - Self-driving Kubernetes as a Service) and team/clippy labels Nov 21, 2022

Rotfuks commented Nov 28, 2022


primeroz commented Dec 2, 2022

Something to keep an eye on: kubernetes-sigs/cluster-api-provider-azure#2890 is adding a template for using Flatcar on CAPZ.


Rotfuks commented Jan 24, 2023

Flatcar now officially in the docs: https://capz.sigs.k8s.io/topics/flatcar.html

teemow changed the title from "Introduce hardened Flatcar Images" to "Introduce hardened Flatcar Images for CAPZ" Feb 9, 2023
primeroz self-assigned this Feb 20, 2023

primeroz commented Feb 20, 2023

Flatcar now officially in the docs: https://capz.sigs.k8s.io/topics/flatcar.html

I am getting a 404 here now 🤷 :)

For reference, the file still exists here: https://github.com/kinvolk/cluster-api-provider-azure/blob/8dad8f074688f1790b08a185ed0a33a6bcf3fd4b/docs/book/src/topics/flatcar.md


Rotfuks commented Feb 20, 2023

Ah yeah, sorry, that's because of the recent change to point the documentation to the main branch of the CAPZ book rather than the newest release branch.
It will be there again once the new release is done, or once we introduce multi-version documentation in CAPZ upstream :)

@primeroz

First test: at least the nodes joined 👍

NAME                                                                             READY  SEVERITY  REASON  SINCE  MESSAGE                              
Cluster/fctest1                                                                  True                     6m25s                                                                                                    
├─ClusterInfrastructure - AzureCluster/fctest1                                   True                     8m53s                                                           
├─ControlPlane - KubeadmControlPlane/fctest1                                     True                     6m25s                                                                                    
│ └─Machine/fctest1-zzrhf                                                        True                     6m26s                                                                                    
│   ├─BootstrapConfig - KubeadmConfig/fctest1-jnd99                              True                     8m49s                                                                                    
│   └─MachineInfrastructure - AzureMachine/fctest1-control-plane-c17c01d5-zzxbh  True                     6m26s                                                                                                    
└─Workers                                                                                                                                                                                          
  ├─MachineDeployment/fctest1-bastion                                            True                     10m                                                                                      
  │ └─Machine/fctest1-bastion-868b7dcb67-tz7rc                                   True                     4s                                                                                                       
  │   ├─BootstrapConfig - KubeadmConfig/fctest1-bastion-973fd873-fkttq           True                     6m22s                                                                                                    
  │   └─MachineInfrastructure - AzureMachine/fctest1-bastion-836b66f0-hf7kl      True                     4s                                                                                                       
  └─MachineDeployment/fctest1-md00                                               True                     19s                                                                                                      
    ├─Machine/fctest1-md00-77c9d6f645-68l7m                                      True                     114s                                                                                                     
    │ ├─BootstrapConfig - KubeadmConfig/fctest1-md00-ad5e9669-qlxdd              True                     6m22s                                                                                                    
    │ └─MachineInfrastructure - AzureMachine/fctest1-md00-bcb876fb-l69vx         True                     114s                                                                                                     
    ├─Machine/fctest1-md00-77c9d6f645-8pn6k                                      True                     4m12s                                                                                                    
    │ ├─BootstrapConfig - KubeadmConfig/fctest1-md00-ad5e9669-v6m92              True                     6m21s                                                                                                    
    │ └─MachineInfrastructure - AzureMachine/fctest1-md00-bcb876fb-r2fc9         True                     4m12s                                                                                                    
    └─Machine/fctest1-md00-77c9d6f645-dcdlx                                      True                     3m25s                                                                                                    
      ├─BootstrapConfig - KubeadmConfig/fctest1-md00-ad5e9669-4jbmz              True                     6m21s                                                                                                    
      └─MachineInfrastructure - AzureMachine/fctest1-md00-bcb876fb-xlm6q         True                     3m25s                                                                                                    
                                                                                                                    
NAME                                   STATUS   ROLES           AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                             KERNEL-VERSION      CONTAINER-RUNTIME
fctest1-control-plane-c17c01d5-zzxbh   Ready    control-plane   7m19s   v1.24.9   10.0.0.4      <none>        Ubuntu 20.04.5 LTS                                   5.15.0-1029-azure   containerd://1.6.2
fctest1-md00-bcb876fb-l69vx            Ready    <none>          2m46s   v1.24.9   10.0.16.4     <none>        Flatcar Container Linux by Kinvolk 3374.2.1 (Oklo)   5.15.77-flatcar     containerd://1.6.14
fctest1-md00-bcb876fb-r2fc9            Ready    <none>          4m59s   v1.24.9   10.0.16.6     <none>        Flatcar Container Linux by Kinvolk 3374.2.1 (Oklo)   5.15.77-flatcar     containerd://1.6.14
fctest1-md00-bcb876fb-xlm6q            Ready    <none>          4m29s   v1.24.9   10.0.16.5     <none>        Flatcar Container Linux by Kinvolk 3374.2.1 (Oklo)   5.15.77-flatcar     containerd://1.6.14


primeroz commented Feb 20, 2023

Machine Review

  • The hostname placeholder for the joinConfiguration could be similar to OpenStack's, but it works with the template.
  • WARNING : files: createResultFile: Ignition has already run on this system. Unexpected behavior may occur. Ignition is not designed to run more than once per system. - incomplete cleanup by image-builder?
  • This Flatcar version is supposed to ship containerd 1.6.8 according to the Flatcar release notes, but it seems we are running 1.6.14 - is image-builder upgrading it?
  • SELinux is running in permissive mode.
  • etcd-tuning[986]: Setting etcd network tuning parameters for interface: fd149e91-82e0-4a7d-afa6-2a4166cbd7c0 - /opt/bin/etcd-network-tuning.sh
    • It looks good (it sets priority for the etcd ports using tc) but is actually failing: Error setting etcd network tuning parameters for interface: 2dd1ce17-079e-403c-b352-a1921ee207ee
  • NTP: not using chrony with hardware clock sync like the Ubuntu image; instead timesyncd.conf with default settings is used (quick checks for these points are sketched below).
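A minimal sketch of how these points can be verified on a Flatcar node (assumes the standard tools are present in the image; not part of the original review):

getenforce                                # SELinux mode: expect "Permissive"
systemctl is-active systemd-timesyncd     # NTP via timesyncd instead of chrony
cat /etc/systemd/timesyncd.conf           # default settings
containerd --version                      # 1.6.14 rather than the 1.6.8 from the release notes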


primeroz commented Feb 21, 2023

The license for Flatcar is Apache License 2.0, which, to my understanding, leaves us free to modify and redistribute the images as long as:

  • we don't change the license
  • we keep the attribution to the Flatcar project
  • we might need to add a notice of our changes to the image

https://github.com/flatcar/flatcar-docs/blob/main/LICENSE


primeroz commented Feb 21, 2023

Cilium looks OK, but 4 tests from the connectivity test suite are failing.

TLDR: I think the issue is with the tests themselves.

I tried with the latest version of cilium-cli and I am still getting the error for some tests - do we also need Cilium 1.13?


📋 Test Report
❌ 4/29 tests failed (6/230 actions), 2 tests skipped, 1 scenarios skipped:
Test [to-entities-world]:
  ❌ to-entities-world/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-755fb678bd-4r6pg (192.168.2.121) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-entities-world/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-5b97d7bc66-nxl76 (192.168.2.210) -> one-one-one-one-http (one.one.one.one:80)
Test [client-egress-l7]:
  ❌ client-egress-l7/pod-to-world/http-to-one-one-one-one-0: cilium-test/client2-5b97d7bc66-nxl76 (192.168.2.210) -> one-one-one-one-http (one.one.one.one:80)
Test [client-egress-l7-named-port]:
  ❌ client-egress-l7-named-port/pod-to-world/http-to-one-one-one-one-0: cilium-test/client2-5b97d7bc66-nxl76 (192.168.2.210) -> one-one-one-one-http (one.one.one.one:80)
Test [to-fqdns]:
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-755fb678bd-4r6pg (192.168.2.121) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-5b97d7bc66-nxl76 (192.168.2.210) -> one-one-one-one-http (one.one.one.one:80)
connectivity test failed: 4 tests failed
[=] Test [to-entities-world]                                                                                                                                                                                       
.                                                                                                                                                                                                                  
  ℹ️  📜 Applying CiliumNetworkPolicy 'client-egress-to-entities-world' to namespace 'cilium-test'..                                                                                                                
  [-] Scenario [to-entities-world/pod-to-world]                                                                                                                                                                    
  [.] Action [to-entities-world/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-755fb678bd-4r6pg (192.168.2.121) -> one-one-one-one-http (one.one.one.one:80)]                                          
  ❌ command "curl -w %{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code} --silent --fail --show-error --connect-timeout 5 --output /dev/null http://one.one.one.one:80" failed: command terminated with exit code 28
  ℹ️  curl output:
  curl: (28) Resolving timed out after 5000 milliseconds
:0 -> :0 = 000

DNS issues ?

/ # ping one.one.one.one
PING one.one.one.one (1.0.0.1) 56(84) bytes of data.
/ # dig @192.168.1.99 -p 1053 one.one.one.one +short
1.1.1.1
1.0.0.1
/ # dig @192.168.1.181 -p 1053 one.one.one.one +short
1.1.1.1
1.0.0.1
/ # dig @192.168.0.228 -p 1053 one.one.one.one +short
1.1.1.1
1.0.0.1

/ # dig @172.31.0.10 -p 53 one.one.one.one +short
1.0.0.1
1.1.1.1

Running the command myself from the pod works just fine:

/ # while true                                                                                            
> do                                                                                                      
> curl -w "%{local_ip}:%{local_port} -> %{remote_ip}:%{remote_port} = %{response_code}" --silent --fail --show-error --connect-timeout 5 --output /dev/null http://one.one.one.one:80; echo " - $?"
> done                                     
192.168.2.121:40672 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:55990 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:40680 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:40688 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:55998 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:40692 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:56000 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:40694 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:40700 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:40702 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:40716 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:56004 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:56018 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:40726 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:40736 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:40740 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:56034 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:40754 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:40762 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:40764 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:56048 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:40776 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:56052 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:56058 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:56068 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:56076 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:40792 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:40802 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:40808 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:56080 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:40820 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:56094 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:40826 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:56108 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:40838 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:40846 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:56122 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:40852 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:40860 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:56138 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:56154 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:40862 -> 1.1.1.1:80 = 301 - 0
192.168.2.121:56158 -> 1.0.0.1:80 = 301 - 0
192.168.2.121:56170 -> 1.0.0.1:80 = 301 - 0


The same failures occur on an Ubuntu CAPZ cluster, so this is not related to the Flatcar change.

📋 Test Report
❌ 4/29 tests failed (6/230 actions), 2 tests skipped, 1 scenarios skipped:
Test [to-entities-world]:
  ❌ to-entities-world/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-755fb678bd-wpkfj (192.168.2.51) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-entities-world/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-5b97d7bc66-xq6x9 (192.168.2.65) -> one-one-one-one-http (one.one.one.one:80)
Test [client-egress-l7]:
  ❌ client-egress-l7/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-5b97d7bc66-xq6x9 (192.168.2.65) -> one-one-one-one-http (one.one.one.one:80)
Test [client-egress-l7-named-port]:
  ❌ client-egress-l7-named-port/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-5b97d7bc66-xq6x9 (192.168.2.65) -> one-one-one-one-http (one.one.one.one:80)
Test [to-fqdns]:
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-0: cilium-test/client-755fb678bd-wpkfj (192.168.2.51) -> one-one-one-one-http (one.one.one.one:80)
  ❌ to-fqdns/pod-to-world/http-to-one-one-one-one-1: cilium-test/client2-5b97d7bc66-xq6x9 (192.168.2.65) -> one-one-one-one-http (one.one.one.one:80)
connectivity test failed: 4 tests failed

I will create a follow-up issue to investigate this.


primeroz commented Feb 21, 2023

Control plane nodes review

  • The etcd disk did not get mounted on /var/lib/etcddisk (a manual mount sketch follows the output below):
fctest1-control-plane-e95df458-bgcbw / # df -h /var/lib/etcddisk/
Filesystem      Size  Used Avail Use% Mounted on
/dev/sda9        47G  4.0G   40G   9% /
fctest1-control-plane-e95df458-bgcbw / # blkid | grep sdc
/dev/sdc: LABEL="etcd_disk" UUID="eb5653fc-d429-4b05-9b05-780c9005b725" BLOCK_SIZE="4096" TYPE="ext4"
fctest1-control-plane-e95df458-bgcbw / # lsblk | grep sdc
sdc       8:32   0   10G  0 disk  
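A minimal sketch of mounting it by hand, assuming the ext4 filesystem labeled etcd_disk from the blkid output above is the one that belongs on /var/lib/etcddisk (the real fix belongs in the bootstrap/Ignition config):

mkdir -p /var/lib/etcddisk            # mount point expected by the control plane
mount -L etcd_disk /var/lib/etcddisk  # mount by filesystem label
df -h /var/lib/etcddisk               # verify /dev/sdc is now mounted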

@primeroz

To build images from the Flatcar offer in Azure I had to accept the following license:

License: Flatcar Container Linux is a 100% open source product and licensed under the applicable licenses of its constituent components, as described here: https://kinvolk.io/legal/open-source/
Warranty: Kinvolk provides this software "as is", without warranty or support of any kind. Support subscriptions are available separately from Kinvolk - please contact us for information at https://www.kinvolk.io/contact-us

by running

  • az vm image accept-terms --publisher kinvolk --offer flatcar-container-linux-free --plan stable-gen2
  • az vm image accept-terms --publisher kinvolk --offer flatcar-container-linux-free --plan stable
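The acceptance state can be checked afterwards; a sketch using the current az CLI syntax (newer releases expose these subcommands under az vm image terms):

az vm image terms show \
  --publisher kinvolk \
  --offer flatcar-container-linux-free \
  --plan stable-gen2 \
  --query accepted    # true once the legal terms have been accepted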


primeroz commented Feb 27, 2023

TODO

Gallery ToDos

  • Define sku, offer and such for our Image Definition.
  • Gallery regions: best practice recommends replicating to at least 2 regions, since the SIG is not a global service.
    • Do I need to create 2 different galleries even when replicating to other regions?
    • Can I use images from a different region, or do we need replication in all regions where we plan to use these images? Nope - we need to replicate to all regions (see the replication sketch after this list):
      • Message=\"\\\"The gallery image /CommunityGalleries/gsCAPITest1-5cb24dcf-a2d0-4aba-820f-b52ca78f96e6/Images/capi-flatcar-stable-1.24.10-gen2/Versions/latest is not available in GermanyWestCentral region.
  • Only latest is available through the Community Gallery.
  • Document how to create the gallery.
  • Set the Subscription owner to the Admin Group.
  • When using a VM Marketplace image as our parent image we need to accept terms:
    • az vm image accept-terms --publisher kinvolk --offer flatcar-container-linux-free --plan stable-gen2
    • I switched to using a (still official) community gallery as the source, which should remove this requirement. We should test it though ...
  • Review best practices - https://learn.microsoft.com/en-us/azure/virtual-machines/azure-compute-gallery#best-practices
  • Gallery limits
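A minimal sketch of the replication step mentioned above (resource group, gallery and version values are illustrative, not our actual ones):

# Replicate an existing image version to every region where clusters run;
# the SIG is not a global service, so each target region needs a replica.
az sig image-version update \
  --resource-group capz-images \
  --gallery-name gsCAPITest1 \
  --gallery-image-definition capi-flatcar-stable-1.24.10-gen2 \
  --gallery-image-version 3374.2.3 \
  --target-regions westeurope germanywestcentral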

Images and Pipelines ToDos


primeroz commented Feb 27, 2023

After lots of trial and error I think I got the right spec to use our images:

      image:
        computeGallery:
          gallery: gsCAPITest1-5cb24dcf-a2d0-4aba-820f-b52ca78f96e6
          name: capi-flatcar-stable-1.24.9-gen2
          plan:
            offer: flatcar-container-linux-free
            publisher: kinvolk
            sku: stable-gen2
          version: latest

BUT, since Azure keeps a link between our built images and the parent Flatcar one, we are getting this error:

capz-controller-manager-68c6664879-lmzfc manager I0227 15:45:36.440634       1 recorder.go:103] events "msg"="Warning"  "message"="failed to reconcile AzureMachine: failed to reconcile AzureMachine service virtualmachine: failed to create resource fctest1/fctest1-control-plane-cdd30d8e-lq5wk (service: virtualmachine): compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=400 -- Original Error: Code=\"ResourcePurchaseValidationFailed\" Message=\"User failed validation to purchase resources. Error message: 'You have not accepted the legal terms on this subscription: '6b1f6e4a-6d0e-4aa4-9a5a-fbaca65a23b3' for this plan. Before the subscription can be used, you need to accept the legal terms of the image. To read and accept legal terms, use the Azure CLI commands described at https://go.microsoft.com/fwlink/?linkid=2110637 or the PowerShell commands available at https://go.microsoft.com/fwlink/?linkid=862451. Alternatively, deploying via the Azure portal provides a UI experience for reading and accepting the legal terms. Offer details: publisher='kinvolk' offer = 'flatcar-container-linux-free', sku = 'stable-gen2', Correlation Id: '0b436d96-21c6-4e41-9ed9-daac49507cde'.'\"" "object"={"kind":"AzureMachine","namespace":"org-multi-project","name":"fctest1-control-plane-cdd30d8e-lq5wk","uid":"ae16afdd-7c1f-430d-89d1-37540c38f074","apiVersion":"infrastructure.cluster.x-k8s.io/v1beta1","resourceVersion":"6315646"} "reason"="ReconcileError"

I will accept the terms in the ghost subscription, but this means every customer will also need to do that in every subscription where we want to use these images.

I can't explain how we are able to use the flatcar4capi images without having accepted the same terms ... ?

From Upstream

Hello. Images in flatcar4capi are built from Flatcar VHDs imported into a SIG, so their advantage is that they don't require plan information. That's the big part of it.

Sample script used by upstream to build the image (a rough sketch follows): https://gist.github.com/primeroz/702e6bec5fcee2986adbefeb633bffb4
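A rough sketch of that VHD-import approach (the gist above has the actual script; the storage account, container and version here are illustrative):

# Download a Flatcar Azure VHD and publish it as a SIG image version,
# so the resulting image carries no marketplace plan information.
RELEASE=3374.2.4
wget "https://stable.release.flatcar-linux.net/amd64-usr/${RELEASE}/flatcar_production_azure_image.vhd.bz2"
bunzip2 flatcar_production_azure_image.vhd.bz2
az storage blob upload --account-name capiimages --container-name vhds \
  --name "flatcar-${RELEASE}.vhd" --file flatcar_production_azure_image.vhd
az sig image-version create \
  --resource-group capz-images --gallery-name gsCAPITest1 \
  --gallery-image-definition capi-flatcar-stable-1.24.10-gen2 \
  --gallery-image-version "${RELEASE}" \
  --os-vhd-uri "https://capiimages.blob.core.windows.net/vhds/flatcar-${RELEASE}.vhd" \
  --os-vhd-storage-account capiimages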


primeroz commented Mar 1, 2023

Apparently it is not true that only latest is available from an image definition:

➜ kubectl get azuremachinetemplate fctest1-control-plane-9e46fb4a -o yaml | yq .spec.template.spec.image
computeGallery:
  gallery: gsCAPITest1-5cb24dcf-a2d0-4aba-820f-b52ca78f96e6
  name: capi-flatcar-stable-1.24.10-gen2
  version: 3374.2.3

➜ kubectl get azuremachinetemplate fctest1-md00-4e69b84e-2 -o yaml | yq .spec.template.spec.image       
computeGallery:
  gallery: gsCAPITest1-5cb24dcf-a2d0-4aba-820f-b52ca78f96e6
  name: capi-flatcar-stable-1.24.10-gen2
  version: latest
➜ kubectl --kubeconfig /dev/shm/fctest1.kubeconfig get node -o wide      
NAME                                   STATUS   ROLES           AGE     VERSION    INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                             KERNEL-VERSION    CONTAINER-RUNTIME
fctest1-control-plane-9e46fb4a-8zrtk   Ready    control-plane   14m     v1.24.10   10.0.0.4      <none>        Flatcar Container Linux by Kinvolk 3374.2.3 (Oklo)   5.15.86-flatcar   containerd://1.6.15
fctest1-md00-4e69b84e-2-68tt6          Ready    <none>          2m26s   v1.24.10   10.0.16.6     <none>        Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)   5.15.89-flatcar   containerd://1.6.15
fctest1-md00-4e69b84e-2-j9g97          Ready    <none>          6m3s    v1.24.10   10.0.16.7     <none>        Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)   5.15.89-flatcar   containerd://1.6.15
fctest1-md00-4e69b84e-zcg6l            Ready    <none>          10m     v1.24.10   10.0.16.5     <none>        Flatcar Container Linux by Kinvolk 3374.2.3 (Oklo)   5.15.86-flatcar   containerd://1.6.15


Rotfuks commented Mar 1, 2023

We can use the following information for the legal statement in the Azure Image Gallery:

Community gallery prefix: giantswarm-
Publisher support email: [email protected]
Publisher URL: giantswarm.io
Legal agreement URL: https://www.giantswarm.io/privacy-policy
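A sketch of how these values plug into a community gallery definition (resource group and gallery name are illustrative; flags per recent az CLI releases):

# Create a community gallery carrying the legal/publisher metadata above
az sig create \
  --resource-group capz-images \
  --gallery-name gsCAPITest1 \
  --permissions Community \
  --public-name-prefix giantswarm- \
  --publisher-uri https://giantswarm.io \
  --publisher-email "[email protected]" \  # email placeholder as given above
  --eula https://www.giantswarm.io/privacy-policy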


primeroz commented Mar 2, 2023

Since the last upgrade I noticed something strange:

build-capz-image-1.24.11-6xb7532313faaf96cac2bcaa780286a09f-pod step-build-image ==> azure-arm.sig-{{user `build_name`}}: + [[ flatcar-gen2 != \f\l\a\t\c\a\r* ]]                                                                      
build-capz-image-1.24.11-6xb7532313faaf96cac2bcaa780286a09f-pod step-build-image ==> azure-arm.sig-{{user `build_name`}}: + sudo bash -c '/usr/share/oem/python/bin/python /usr/share/oem/bin/waagent -force -deprovision+user && sync'        

The name is azure-arm.sig-{{user `build_name`}} - why is build_name not rendering? Is the actual build_name working in the rest of the Ansible run?

. /home/imagebuilder/packer/azure/scripts/init-sig.sh flatcar-gen2 && packer build -var-file="/home/imagebuilder/packer/config/kubernetes.json"  -var-file="/home/imagebuilder/packer/config/cni.json"  -var-file="/home/imagebuilder/packer/config/containerd.json"  -var-file="/home/imagebuilder/packer/config/wasm-shims.json"  -var-file="/home/imagebuilder/packer/config/ansible-args.json"  -var-file="/home/imagebuilder/packer/config/goss-args.json"  -var-file="/home/imagebuilder/packer/config/common.json"  -var-file="/home/imagebuilder/packer/config/additional_components.json"  -color=true -var-file="/home/imagebuilder/packer/azure/azure-config.json" -var-file="/home/imagebuilder/packer/azure/azure-sig-gen2.json" -var-file="/home/imagebuilder/packer/azure/flatcar-gen2.json" -only="sig-flatcar-gen2" -var-file="/workspace/vars/vars.json"  packer/azure/packer.json

Executing Ansible: ansible-playbook -e packer_build_name="sig-flatcar-gen2"


UPDATE:

Everything is OK; the printing of the name was added in Packer 1.8.6 and is buggy there, already fixed in 1.8.7 (hashicorp/packer#12281).

I checked the whole provisioning and it is working as expected; all the Flatcar bits run properly.


primeroz commented Mar 2, 2023

Review of Hardening and other tuning

protect-kernel-defaults

Outcome: Enable

ARP Settings

net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2

There is no history or reference I could find on why we are setting those values; I will try to reach out to Phoenix.

Outcome: TBD

local ports reserved

# Reserved to avoid conflicts with kube-apiserver, which allocates within this range
net.ipv4.ip_local_reserved_ports=30000-32767

Not sure what this conflict is and I can't find any history for it; I will try to reach out to Phoenix.

Outcome: TBD

max_map_count

# Increased mmapfs because some applications, like ES, need higher limit to store data properly
vm.max_map_count = 262144

Self-explanatory.

Outcome: Add to worker node pools

ipv6

net.ipv6.conf.all.accept_redirects = 0
net.ipv6.conf.default.accept_redirects = 0

Since we do not disable IPv6 (CAPI sets net.ipv6.conf.all.disable_ipv6 to 0), we should set those.

Outcome: add, unless we want to disable IPv6?

ipv4

net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.default.log_martians = 1
net.ipv4.tcp_timestamps = 0

They are all reasonable.

Outcome: add

inotify

fs.inotify.max_user_watches = 16384
# Default is 128, doubling for nodes with many pods
# See https://github.com/giantswarm/giantswarm/issues/7711
fs.inotify.max_user_instances = 8192

reasonable

Outcome: add

kernel settings

kernel.kptr_restrict = 2
kernel.sysrq = 0

They both seem reasonable to me.

Outcome: add
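A minimal sketch collecting the settings marked "add" above into a single sysctl.d drop-in (file name and grouping are illustrative; vm.max_map_count is left out since it only targets worker node pools):

# Write the hardening sysctls to a drop-in and reload all fragments
cat <<'EOF' >/etc/sysctl.d/90-hardening.conf
net.ipv6.conf.all.accept_redirects = 0
net.ipv6.conf.default.accept_redirects = 0
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.default.log_martians = 1
net.ipv4.tcp_timestamps = 0
fs.inotify.max_user_watches = 16384
fs.inotify.max_user_instances = 8192
kernel.kptr_restrict = 2
kernel.sysrq = 0
EOF
sysctl --system   # reload all sysctl.d fragments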


primeroz commented Mar 6, 2023

Comparing containerd config.toml

  • oom_score = -999 - the default is 0; we don't set it on Flatcar CAPZ (but I thought I saw it in the Ansible code?)
    • This is set in the systemd service unit instead: OOMScoreAdjust=-999
  • subreaper = true - we don't set it, and I can't see it in the docs.
  • [plugins."containerd.runtime.v1.linux"] - we don't have it set in the CAPZ config.
  • Registry mirror and credentials - we don't have them, but we can add them as a snippet via the /etc/containerd/conf.d/*.toml import (sketched below).
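A minimal sketch of such a drop-in (the mirror endpoint is illustrative; assumes config.toml already imports /etc/containerd/conf.d/*.toml as mentioned above):

# Add a Docker Hub mirror as a containerd drop-in, then reload containerd
cat <<'EOF' >/etc/containerd/conf.d/registry-mirror.toml
[plugins."io.containerd.grpc.v1.cri".registry.mirrors."docker.io"]
  endpoint = ["https://mirror.example.com", "https://registry-1.docker.io"]
EOF
systemctl restart containerd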


primeroz commented Mar 6, 2023

Reservations

In vintage we do:

On master nodes:

kubeReserved:
  cpu: 350m
  memory: 1280Mi
  ephemeral-storage: 1024Mi
kubeReservedCgroup: /kubereserved.slice
protectKernelDefaults: true
systemReserved:
  cpu: 250m
  memory: 384Mi
systemReservedCgroup: /system.slice

On worker nodes:

kubeReserved:
  cpu: 250m
  memory: 768Mi
  ephemeral-storage: 1024Mi
kubeReservedCgroup: /kubereserved.slice
protectKernelDefaults: true
systemReserved:
  cpu: 250m
  memory: 384Mi
systemReservedCgroup: /system.slice

On CAPZ we:

  • calculate kubeReserved based on instance size (and, especially for CPU, we reserve much less)
  • do not set dedicated slices
  • do not set any systemReserved
  • do not set protectKernelDefaults

I will:

  • enable protectKernelDefaults (sketched below)
  • create a follow-up issue for much bigger reservations, systemReserved and dedicated slices
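A minimal sketch of one way to enable it through the kubeadm bootstrap config (kubeletExtraArgs maps straight to kubelet flags; surrounding resource fields omitted, file name illustrative):

# Fragment of a KubeadmConfigSpec enabling kubelet's --protect-kernel-defaults
cat <<'EOF' > kubeadmconfig-fragment.yaml
initConfiguration:
  nodeRegistration:
    kubeletExtraArgs:
      protect-kernel-defaults: "true"
EOF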


primeroz commented Mar 6, 2023

Upgrading from Ubuntu 0.13 to Flatcar currently fails with:

 reason: 'Upgrade "fctest2" failed: cannot patch "fctest2" with kind KubeadmControlPlane:
    admission webhook "validation.kubeadmcontrolplane.controlplane.cluster.x-k8s.io"
    denied the request: KubeadmControlPlane.controlplane.cluster.x-k8s.io "fctest2"
    is invalid: [spec.kubeadmConfigSpec.format: Forbidden: cannot be modified, spec.kubeadmConfigSpec.mounts:
    Forbidden: cannot be modified]'
  • Should we hash the KubeadmControlPlane name as well?
    • Otherwise we won't be able to change specs for the CP anyway, like mounts and such.


primeroz commented Mar 7, 2023

Changing the control-plane name and object does not seem to work.

During rollout it gets stuck with:

org-multi-project  ├─KubeadmControlPlane/fctest2                                False  Deleting                                             45m  
org-multi-project  │ ├─Machine/fctest2-95sxz                                    True                                                        41m  
org-multi-project  │ │ ├─AzureMachine/fctest2-control-plane-c17c01d5-gd6m4      True                                                        41m  
org-multi-project  │ │ └─KubeadmConfig/fctest2-7ps9h                            True                                                        41m  
org-multi-project  │ │   └─Secret/fctest2-7ps9h                                 -                                                           40m  
org-multi-project  │ ├─Machine/fctest2-d8xxd                                    True                                                        44m  
org-multi-project  │ │ ├─AzureMachine/fctest2-control-plane-c17c01d5-qxwn5      True                                                        44m  
org-multi-project  │ │ └─KubeadmConfig/fctest2-zz6vq                            True                                                        44m  
org-multi-project  │ │   └─Secret/fctest2-zz6vq                                 -                                                           44m  
org-multi-project  │ ├─Machine/fctest2-hlh7h                                    True                                                        38m  
org-multi-project  │ │ ├─AzureMachine/fctest2-control-plane-c17c01d5-brfwp      True                                                        38m  
org-multi-project  │ │ └─KubeadmConfig/fctest2-8dzwh                            True                                                        38m  
org-multi-project  │ │   └─Secret/fctest2-8dzwh                                 -                                                           38m  
org-multi-project  │ └─Secret/fctest2-kubeconfig                                -                                                           44m  
org-multi-project  ├─KubeadmControlPlane/fctest2-changed                        False  ScalingUp                                            8m10s
org-multi-project  │ ├─Secret/fctest2-ca                                        -                                                           44m  
org-multi-project  │ ├─Secret/fctest2-etcd                                      -                                                           44m  
org-multi-project  │ ├─Secret/fctest2-proxy                                     -                                                           44m  
org-multi-project  │ └─Secret/fctest2-sa                                        -                                                           44m  
Cluster/fctest2                                                                  False  Warning   ScalingUp  8m25s  Scaling up control plane to 3 replicas (actual 0)                         
├─ClusterInfrastructure - AzureCluster/fctest2                                   True                        44m                                                               
├─ControlPlane - KubeadmControlPlane/fctest2-changed                             False  Warning   ScalingUp  8m25s  Scaling up control plane to 3 replicas (actual 0)                      
│ ├─Machine/fctest2-95sxz                                                        True                        39m                                          
│ │ ├─BootstrapConfig - KubeadmConfig/fctest2-7ps9h                              True                        41m                         
│ │ └─MachineInfrastructure - AzureMachine/fctest2-control-plane-c17c01d5-gd6m4  True                        39m                                                             
│ ├─Machine/fctest2-d8xxd                                                        True                        42m                         
│ │ ├─BootstrapConfig - KubeadmConfig/fctest2-zz6vq                              True                        44m                         
│ │ └─MachineInfrastructure - AzureMachine/fctest2-control-plane-c17c01d5-qxwn5  True                        42m                                                                                   
│ └─Machine/fctest2-hlh7h                                                        True                        37m                                                                                                   
│   ├─BootstrapConfig - KubeadmConfig/fctest2-8dzwh                              True                        39m                                                         
│   └─MachineInfrastructure - AzureMachine/fctest2-control-plane-c17c01d5-brfwp  True                        37m       

I will reach out upstream to see what they think, since most fields can be modified and I can't see why those two cannot (https://github.com/kubernetes-sigs/cluster-api/blob/main/controlplane/kubeadm/api/v1beta1/kubeadm_control_plane_webhook.go#L137), but right now we can't update the CP from Ubuntu to Flatcar.


primeroz commented Mar 8, 2023

glippy is now converted to Flatcar:

➜ k get node -o wide
NAME                                  STATUS   ROLES           AGE     VERSION    INTERNAL-IP    EXTERNAL-IP   OS-IMAGE                                             KERNEL-VERSION    CONTAINER-RUNTIME
glippy-control-plane-aae7f116-jqtcd   Ready    control-plane   23m     v1.24.11   10.223.0.132   <none>        Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)   5.15.89-flatcar   containerd://1.6.18
glippy-control-plane-aae7f116-vpk8s   Ready    control-plane   30m     v1.24.11   10.223.0.137   <none>        Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)   5.15.89-flatcar   containerd://1.6.18
glippy-control-plane-aae7f116-wclks   Ready    control-plane   16m     v1.24.11   10.223.0.133   <none>        Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)   5.15.89-flatcar   containerd://1.6.18
glippy-md00-e6ebd75a-9br9p            Ready    <none>          21m     v1.24.11   10.223.0.4     <none>        Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)   5.15.89-flatcar   containerd://1.6.18
glippy-md00-e6ebd75a-fvjtj            Ready    <none>          31m     v1.24.11   10.223.0.10    <none>        Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)   5.15.89-flatcar   containerd://1.6.18
glippy-md00-e6ebd75a-lt6zc            Ready    <none>          15m     v1.24.11   10.223.0.7     <none>        Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)   5.15.89-flatcar   containerd://1.6.18
glippy-md00-e6ebd75a-q28jz            Ready    <none>          4m37s   v1.24.11   10.223.0.8     <none>        Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)   5.15.89-flatcar   containerd://1.6.18
glippy-md00-e6ebd75a-vbrzq            Ready    <none>          25m     v1.24.11   10.223.0.9     <none>        Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)   5.15.89-flatcar   containerd://1.6.18
glippy-md00-e6ebd75a-xnnq7            Ready    <none>          9m22s   v1.24.11   10.223.0.6     <none>        Flatcar Container Linux by Kinvolk 3374.2.4 (Oklo)   5.15.89-flatcar   containerd://1.6.18


primeroz commented Mar 8, 2023

This is now done.

primeroz closed this as completed Mar 8, 2023
This was referenced Aug 21, 2023