zstd:chunked issues #509

Open

ckyrouac opened this issue May 3, 2024 · 27 comments
Labels
area/osintegration Relates to an external OS/distro base image

Comments

@ckyrouac
Contributor

ckyrouac commented May 3, 2024

This took a while to track down. I'm going to continue investigating, but I wanted to document what I've found so far.

The failure happens when attempting a bootc install to-disk using an image built from a base image with at least one extra layer, e.g.

FROM quay.io/centos-bootc/centos-bootc-dev:stream9
RUN dnf install -y tmux

If the image is built locally, bootc install to-disk works correctly. The failure happens when pushing the image to a repo (only tested with quay.io), clearing the image from local storage via podman system prune --all, and then running bootc install to-disk. Here's example output of the failure:

[test@fedora-39 ~]$ sudo podman run --pid=host --network=host --privileged --security-opt label=type:unconfined_t -v /var/lib/containers:/var/lib/containers -v .:/output -v /dev:/dev -e RUST_LOG=debug quay.io/ckyrouac/bootc-lldb bootc install to-disk --via-loopback --generic-image --skip-fetch-check /output/test.raw
Trying to pull quay.io/ckyrouac/bootc-lldb:latest...
Getting image source signatures
...
ERROR Installing to disk: Creating ostree deployment: Performing deployment: Importing: Parsing layer blob sha256:5d35bfe747b2c76a01310e69a14daa90baf352a11f990f73d4ce3e1917668719: Failed to invoke skopeo proxy method FinishPipe: remote error: corrupted blob, expecting sha256:dede69b8181937a545b87707fbe4ace2ee9936459ffd43b7ba957324861992a0

So, the OpenImage call to the skopeo proxy is failing.
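
For reference, the reproduction described above boils down to roughly the following (a sketch only; the image name is a placeholder):

# build the derived image locally
podman build -t quay.io/example/bootc-derived:latest .
# push it to a registry, then drop the local copy so the install has to re-pull it
podman push quay.io/example/bootc-derived:latest
sudo podman system prune --all
# run the install from the pushed image (same invocation as in the output above)
sudo podman run --pid=host --network=host --privileged --security-opt label=type:unconfined_t \
    -v /var/lib/containers:/var/lib/containers -v .:/output -v /dev:/dev \
    quay.io/example/bootc-derived:latest \
    bootc install to-disk --via-loopback --generic-image --skip-fetch-check /output/test.raw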

The latest version of containers-common found in Fedora39/40 repos sets pull_options.enable_partial_images=true in /usr/share/containers/storage.conf. This is the change that started causing this error. Toggling enable_partial_images to false resolves the error. I'm not familiar enough with this stack to know the root cause of this yet. I'll continue digging but I'm sure someone else would be able to track this down a lot quicker if you think it's urgent.
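
A minimal sketch of the workaround mentioned above, assuming the storage.conf layout described in this thread:

# check how the packaged default is set
grep enable_partial_images /usr/share/containers/storage.conf
# disable partial (zstd:chunked) pulls as a temporary workaround
sudo sed -i 's/enable_partial_images = "true"/enable_partial_images = "false"/' /usr/share/containers/storage.conf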

@cgwalters
Collaborator

The latest version of containers-common found in Fedora39/40 repos sets pull_options.enable_partial_images=true in /usr/share/containers/storage.conf. This is the change that started causing this error. Toggling enable_partial_images to false resolves the error.

Ugh. Fun...thanks for finding and debugging this.

@cgwalters
Collaborator

It's actually really embarrassing that this wasn't caught by our CI, needs fixing

@cgwalters
Collaborator

Actually wait this is the -dev image which is intentionally tracking git main, I don't think this has hit f40 or stream9 proper yet. I see a lot of activity in https://src.fedoraproject.org/rpms/containers-common/commits/rawhide and what I bet is happening here is those spec files are being pulled into the copr.

cc @rhatdan @lsm5

@cgwalters
Collaborator

And yes, we need to add bootc test gating to containers-common and skopeo pretty soon.

@ckyrouac
Contributor Author

ckyrouac commented May 3, 2024

Hmm, interesting: earlier in the week this was happening regardless of which base image I used. I just went to verify that, and now this bug only happens with the -dev base image.

@lsm5
Member

lsm5 commented May 3, 2024

Actually wait this is the -dev image which is intentionally tracking git main, I don't think this has hit f40 or stream9 proper yet. I see a lot of activity in https://src.fedoraproject.org/rpms/containers-common/commits/rawhide and what I bet is happening here is those spec files are being pulled into the copr.

cc @rhatdan @lsm5

The last build of containers-common on the podman-next copr was an automatic rebuild of the rawhide sources from sometime back. I disabled this automatic rebuild after we got rawhide to a sane-enough state.

Let me know if you need an update to the fedora or copr rpm. I can do a one-off build.

We're currently working on a packit workflow from upstream c/common to downstream containers-common rpm, like we have for podman and the rest, with automatic builds going to podman-next right after every upstream commit to main. I'm hoping that change will land early next week.

@cgwalters cgwalters added the area/osintegration Relates to an external OS/distro base image label May 4, 2024
@ckyrouac
Contributor Author

ckyrouac commented May 6, 2024

so this works now using any base image. I'm not sure what changed. I guess something in the base images or in quay.io?

@cgwalters
Collaborator

@vrothberg

@henrywang
Contributor

Hi @cgwalters. We've encountered this issue in the QE CI environment many times over the past two days.
All bootc install to-existing-root tests failed due to this error. https://artifacts.osci.redhat.com/testing-farm/10a21cd4-b029-48fa-9c23-9848288a7065/
The issue shows up in fedora-bootc:40 image testing, but not in CS9 or RHEL 9.4/9.5 bootc image testing.

ERROR: Installing to filesystem: Creating ostree deployment: Performing deployment: Importing: Parsing layer blob sha256:f34e1c1a6f0f3ac0450db999825e519b67ac7c36697ad80ecfa3672ff285dbbc: Failed to invoke skopeo proxy method FinishPipe: remote error: expected 69427364 bytes in blob, got 72333312

@cgwalters
Collaborator

cgwalters commented Jun 9, 2024

Ugh man, I think this is fallout from https://bodhi.fedoraproject.org/updates/FEDORA-2024-ab42dd0ffb, which rolls in containers/storage@23ff5f8 and somehow breaks things...

EDIT: No, I was wrong; enable_partial_images is also enabled in containers-common-0.58.0-2.fc40.


And yes, the fact that there is no gating CI covering the ostree-container path in any of

  • containers/storage upstream
  • dist-git merge requests
  • bodhi

let this all sail right through.

@henrywang
Contributor

henrywang commented Jun 9, 2024

OH!!! Since yesterday (Saturday), I can't run a container inside quay.io/fedora/fedora:40. It reports Error: configure storage: 'overlay' is not supported over overlayfs, a mount_program is required: backing file system is unsupported for this graph driver. I have to use the fedora:39 image.
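
(A generic workaround for that nested-overlay error, unrelated to the zstd:chunked problem itself, is to point the inner podman at fuse-overlayfs, assuming it is installed in the container:)

# inside the fedora:40 container
podman --storage-opt overlay.mount_program=/usr/bin/fuse-overlayfs run --rm registry.fedoraproject.org/fedora:40 true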

@cgwalters
Collaborator

cgwalters commented Jun 9, 2024

Hmm, at this very moment quay.io/fedora/fedora-bootc:40 with version=40.20240606.0 has containers-common-0.58.0-2.fc40.noarch which predates that change (as the date stamp implies).

So... hmm, this must somehow relate to the host environment version. Ah yes, if we look at the logs from that test run, I can see that inside the Fedora cloud AMI we have 'Installed: containers-common-5:0.59.1-1.fc40.noarch'.

@henrywang can you try patching the tests to do something like this as a sanity test:

$ sed -ie 's/enable_partial_images = "true"/enable_partial_images = "false"/' /usr/share/containers/storage.conf

EDIT: See above, I'm no longer confident the relevant change here was in containers-common.

@cgwalters
Collaborator

I'm trying to reproduce this locally initially by hacking up my podman-machine environment, but no luck yet.

Another thing that actually changed pretty recently too is there's a new podman: https://bodhi.fedoraproject.org/updates/FEDORA-2024-ab42dd0ffb
And we're also now getting that in the host environment. Can you play with downgrading that in the host environment too?

@henrywang
Contributor

And we're also now getting that in the host environment. Can you play with downgrading that in the host environment too?

Sure. I'll run that tomorrow.

@henrywang
Contributor

henrywang commented Jun 10, 2024

@cgwalters I re-ran the test with an old Fedora 40 runner and the tests passed. I checked the logs and found that the difference is that the latest containers-common-5:0.59.1-1.fc40 added composefs. That might be the root cause.

@cgwalters
Collaborator

I've added comments to https://bodhi.fedoraproject.org/updates/FEDORA-2024-ab42dd0ffb, and I think the root cause is that image builds started defaulting to zstd:chunked. I still need to dig in and confirm that's what's causing the "remote error: expected 69427364 bytes in blob, got 72333312", but I'd bet so.

@hanulec

hanulec commented Jun 14, 2024

I think this is a two-fold issue, but the end-user impact is only seen if you have a btrfs containers-storage. My testing was on a DigitalOcean f39 system, which uses btrfs.

  1. The default image from quay.io/fedora/fedora-bootc:41 doesn't hit this problem when performing a 'bootc install' (but the base image is missing cloud-init).

  2. A system that is using btrfs appears impacted / not able to bootc install when using personally built bootc images. This issue doesn't occur on an f40/rawhide system using xfs for containers-storage (podman graphDriverName: overlay).

Furthermore, the simple act of pulling a personally built bootc image on an f39 (or f40 or rawhide) system that uses btrfs for containers-storage will cause the machine to wedge/freeze when it has small resources (1 CPU / 1 GB RAM). Adding swap to the system prevented the freezing, but didn't produce more reliable / predictable 'podman pull' behavior; if you run the podman pull in a loop 10 times it keeps attempting to re-sync data.

Making the suggested change to /usr/share/containers/storage.conf of enable_partial_images = "false" allowed both a predictable 'podman pull' and a bootc install to-existing-root to succeed when graphDriverName: btrfs.

Once bootc is running, the underlying containers-storage reverts to overlay.

@cgwalters
Collaborator

@hanulec Is your input image in zstd:chunked format? Try podman inspect and look at the layers (you'll see zstd instead of gzip).

@hanulec

hanulec commented Jun 17, 2024

@hanulec Is your input image in zstd:chunked format? Try podman inspect and look at the layers (you'll see zstd instead of gzip).

The image I built had the newest items from my Containerfile added in zstd format; I needed to use skopeo inspect to see this. The image was built with the default config on a fresh rawhide image (version: 41.20240530.0).

root@bootc:~/240617-bootc# skopeo inspect docker://quay.io/fedora/fedora-bootc:41 |grep MIMEType|sort |uniq -c
65 "MIMEType": "application/vnd.oci.image.layer.v1.tar+gzip",
root@bootc:~/240617-bootc# skopeo inspect docker:///f40jump05:240615-0058 | grep MIMEType|sort |uniq -c
65 "MIMEType": "application/vnd.oci.image.layer.v1.tar+gzip",
4 "MIMEType": "application/vnd.oci.image.layer.v1.tar+zstd",
root@bootc:~/240617-bootc#

@hanulec

hanulec commented Jun 17, 2024

And the more I look / re-test, it's the podman push action that is changing the MIMEType from "application/vnd.oci.image.layer.v1.tar" to either "+gzip" or "+zstd".
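
For what it's worth, the compression used at push time doesn't have to be left to the defaults; podman push can pin it explicitly (image name is a placeholder):

# force classic gzip layers
podman push --compression-format gzip quay.io/example/myimage:latest
# or explicitly produce zstd:chunked layers
podman push --compression-format zstd:chunked quay.io/example/myimage:latest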

@cgwalters cgwalters changed the title from "Install to-disk fails with latest containers-common" to "zstd:chunked issues" Jun 19, 2024
@shi2wei3

shi2wei3 commented Sep 10, 2024

@cgwalters The centos-bootc c10s bootc install test started to fail over the past week. The error output is the same as in this issue, and only the derived image is affected. It could be related to the host containers-common being updated from 0.57.3-4.el10 to 0.60.2-3.el10 (containers/common#2048). How can I work around this issue, and is this a bug we need to fix?

@cgwalters
Collaborator

I only recently realized on this issue why this may be happening. When I was testing ostreedev/ostree-rs-ext#622 I did it via a registry.

But this bug is about "bootc install", where we're pulling from containers-storage: (an unpacked representation), and as part of that we ask it to regenerate a tarball from the unpacked files; by design, today that tarball must be bit-for-bit compatible with the descriptor. It would not surprise me at all if there were corner cases where that breaks today. Inherently, this "copy from c/storage" model goes through a different codepath than the one podman and skopeo use today, where they drive the copying.

The whole "synthesize a temporary tarball" approach is really lame of course; what we want instead is containers/storage#1849

@shi2wei3

I probably hit containers/podman#22813, and I've modified the runtime containers.conf as a workaround.

@cgwalters
Collaborator

Based on some recent discussion it sounds like this one should gain some more priority. I think a thing we need to do here is add an integration test that covers a zstd:chunked image (and for good measure we should probably also include LBIs built with zstd:chunked).
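
A rough sketch of what such a test could exercise, with placeholder image and output names:

# build a derived image and publish it with zstd:chunked layers
podman build -t quay.io/example/bootc-zstd-test:latest .
podman push --compression-format zstd:chunked quay.io/example/bootc-zstd-test:latest
# clear local storage so the install path has to fetch the chunked layers from the registry
sudo podman system prune --all --force
# exercise the bootc install path against the pushed image
sudo podman run --pid=host --network=host --privileged --security-opt label=type:unconfined_t \
    -v /var/lib/containers:/var/lib/containers -v /dev:/dev -v .:/output \
    quay.io/example/bootc-zstd-test:latest \
    bootc install to-disk --via-loopback --generic-image --skip-fetch-check /output/test.raw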

@shi2wei3

I'll add LBI testing with zstd:chunked soon. Meanwhile, we need to fix the error described in this issue, because the test will fail when I revert from gzip back to the default zstd:chunked on c10s: https://artifacts.osci.redhat.com/testing-farm/8d024c92-874a-4228-bf35-be69080b6fde/

@cgwalters
Collaborator

Right for

Installing to filesystem: Creating ostree deployment: Pulling: Importing: Unencapsulating base: Failed to invoke skopeo proxy method FinishPipe: remote error: expected 45 bytes in blob, got 139264

There are a few very recent fixes for zstd:chunked issues in c/storage and c/image; one that I think might be related is containers/storage#2130

cgwalters added a commit to cgwalters/common that referenced this issue Oct 22, 2024
xref commit 1ad44cb
"rpm/update-config-files: zstd:chunked not enabled in Fedora yet"

Basically it doesn't make sense to keep this enabled in RHEL10 but
not in Fedora, that *seriously* undermines the testing story.

My immediate practical issue is that zstd:chunked in RHEL10
as of right now still breaks the bootc path, xref:
containers/bootc#509 (comment)

Signed-off-by: Colin Walters <[email protected]>
@cgwalters
Collaborator

containers/common#2213 will back out the zstd:chunked default for rhel10, and it was just backed out in all fedora versions.

That said, we still want to do background work to test with it, both:

  • With LBIs
  • Verifying that the bootc host at least doesn't barf on them (but it won't have optimized fetches)

cgwalters pushed a commit to cgwalters/bootc that referenced this issue Nov 5, 2024

build(deps): bump serde from 1.0.179 to 1.0.182