Avoid multiple downloads for same image #2732

Open
nirs opened this issue Oct 14, 2024 · 5 comments
Labels
area/performance enhancement New feature or request

Comments

@nirs
Contributor

nirs commented Oct 14, 2024

Description

Currently, when starting multiple clusters that use the same image before the image is cached, we download the same image in parallel. The first download to complete stores the file in the cache, and the other downloaded files are dropped.

Possible flow

  1. Create the cache directory
  2. Take the cache directory lock
  3. If we can use the image, release the lock and use the image.
    We can use it in these cases:
    • cached digest matches wanted digest
    • cached time matches server last-modified time
    • no wanted digest or cached time
  4. Write the metadata files
  5. Download to temporary file
  6. Rename temporary file to data file
  7. Release the cache directory lock
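
A minimal sketch of the flow above in Go, assuming a `lockutil.WithDirLock(dir, fn)` helper that runs `fn` while holding the directory lock (its real signature may differ) and a hypothetical `usable()` check; metadata writing and error handling are simplified:

```go
package cache

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"path/filepath"

	"github.com/lima-vm/lima/pkg/lockutil" // assumed import path
)

// ensureImage sketches the flow: one download per cache directory,
// serialized by the directory lock. Names and signatures are illustrative.
func ensureImage(cacheDir, url, wantDigest string) (string, error) {
	// 1. Create the cache directory.
	if err := os.MkdirAll(cacheDir, 0o755); err != nil {
		return "", err
	}
	dataPath := filepath.Join(cacheDir, "data")

	// 2. Take the cache directory lock; 7. it is released when fn returns.
	err := lockutil.WithDirLock(cacheDir, func() error {
		// 3. If the cached image is usable, use it and return early.
		if usable(dataPath, wantDigest) {
			return nil
		}
		// 4. Write the metadata files (digest, last-modified time) - elided in this sketch.
		// 5. Download to a temporary file in the same directory.
		tmp, err := os.CreateTemp(cacheDir, "download-*")
		if err != nil {
			return err
		}
		defer os.Remove(tmp.Name()) // no-op after a successful rename
		defer tmp.Close()

		resp, err := http.Get(url)
		if err != nil {
			return err
		}
		defer resp.Body.Close()
		if resp.StatusCode != http.StatusOK {
			return fmt.Errorf("unexpected status %s", resp.Status)
		}
		if _, err := io.Copy(tmp, resp.Body); err != nil {
			return err
		}
		if err := tmp.Close(); err != nil {
			return err
		}
		// 6. Rename the temporary file to the data file (atomic on the same filesystem).
		return os.Rename(tmp.Name(), dataPath)
	})
	return dataPath, err
}

// usable reports whether the cached file exists; the real check would also
// compare the digest and the last-modified time (elided in this sketch).
func usable(path, wantDigest string) bool {
	_, err := os.Stat(path)
	return err == nil
}
```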

Possible issues

  • stale locks - should not be possible, since flock (and the Windows locks) are released when the process terminates
  • a bug where we forget to release the lock after the download - should not be possible with lockutil.WithDirLock.
  • download takes too much time, maybe hangs on an inaccessible network - should not happen because of socket timeouts (see the sketch below). Terminating limactl will abort the download or the wait on the cache directory lock.
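
As an illustration of the kind of socket timeouts meant in the last point (not Lima's actual downloader configuration; the fields and values are just an example): a hung connect or an unresponsive server fails fast, while a slow but progressing download is not interrupted because there is no overall request timeout.

```go
package main

import (
	"net"
	"net/http"
	"time"
)

// newDownloadClient builds an HTTP client with socket-level timeouts only:
// connecting and waiting for the response headers time out, but reading a
// large body can take as long as it needs (no Client.Timeout).
func newDownloadClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout:   30 * time.Second, // TCP connect
				KeepAlive: 30 * time.Second,
			}).DialContext,
			TLSHandshakeTimeout:   10 * time.Second,
			ResponseHeaderTimeout: 30 * time.Second, // first byte of the response
		},
	}
}
```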

@jandubois you were concerned about issues, anything to add?

@AkihiroSuda added enhancement New feature or request area/performance labels Oct 15, 2024
@jay7x

jay7x commented Oct 21, 2024

Somewhat related: #1354
I took a quick look at the sources, but my Go-fu is not good enough :(

@jandubois
Member

@jandubois you were concerned about issues, anything to add?

No, just that we should not time out waiting for the lock; we need to wait indefinitely and rely on the process holding the lock to eventually release it.

Otherwise you will have to start checking for this in scripts running limactl create, and then retry at the script level, and I really would like to avoid that.
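
A minimal illustration of that behavior (Unix-only, using flock(2); this is not Lima's actual lockutil code, just a sketch of why no timeout is needed): LOCK_EX without LOCK_NB blocks until the holder releases the lock or its process exits.

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// withDirLock runs fn while holding an exclusive lock on dir.
// unix.LOCK_EX without LOCK_NB blocks indefinitely until the current
// holder releases the lock or its process exits; the kernel drops the
// lock automatically on exit, so there are no stale locks to time out on.
func withDirLock(dir string, fn func() error) error {
	f, err := os.Open(dir)
	if err != nil {
		return err
	}
	defer f.Close() // closing the descriptor also releases the flock

	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil {
		return err
	}
	defer unix.Flock(int(f.Fd()), unix.LOCK_UN)

	return fn()
}
```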

@jay7x

jay7x commented Oct 21, 2024

IIRC the behavior above (attempting to download the snapshot) happens because another process has the release file already created, but the checksum is wrong (because it's still downloading). That's why the 2nd lima process tries to use the snapshot instead. Not 100% sure though... I roughly remember something like that.

@nirs
Contributor Author

nirs commented Oct 21, 2024

another process has the release file already created, but the checksum is wrong (because it's still downloading).

This should not be possible with the current code. The data file is created only after the download completes.

But the checksum file is created only after the data file, so there is a tiny window when the checksum file is missing.
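
In miniature, the ordering described here (file names are illustrative, not Lima's actual cache layout): the data file only ever appears via a rename of the fully downloaded temporary file, and the digest file is written after it, which is exactly the tiny window mentioned above.

```go
package main

import (
	"os"
	"path/filepath"
)

// commitDownload publishes a finished download: the data file becomes
// visible atomically via rename, and only then is the digest file written,
// so "data" can briefly exist without "data.digest".
func commitDownload(cacheDir, tmpPath, digest string) error {
	if err := os.Rename(tmpPath, filepath.Join(cacheDir, "data")); err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(cacheDir, "data.digest"), []byte(digest), 0o644)
}
```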

nirs added a commit to nirs/lima that referenced this issue Nov 14, 2024
To solve the races during concurrent downloads and avoid unneeded work
and bandwidth, we allow only one download of the same image at a time.

When a limactl process tries to access the cache, it takes a lock on the
file cache directory. If multiple processes try to get the lock at the
same time, only one will take the lock and the others will block.

The process that took the lock tries to get the file from the cache.
This is the fast path and the common case. This can fail if the file is
not in the cache, the digest does not match, or the cached last-modified
time does not match the last-modified time returned by the server.

If the process cannot get the file from the cache, it downloads the file
from the remote server and updates the cached data and metadata files.

Finally, the process releases the lock on the cache directory. Other
limactl processes waiting on the lock wake up and take the lock. In the
common case they will find the image in the cache and will release the
lock quickly.

Since there is at most one download at a time, updating the metadata
files is trivial and we don't need the writeFirst() helper.

Fixes: lima-vm#2902
Fixes: lima-vm#2732
Signed-off-by: Nir Soffer <[email protected]>