Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix races during parallel downloads #2903

Open
wants to merge 5 commits into
base: master
Choose a base branch
from
Open

Conversation

nirs
Copy link
Contributor

@nirs nirs commented Nov 13, 2024

Fix races during parallel downloads by holding a per file lock while checking or updating the cache.

Fixes #2902, #2732

@nirs nirs changed the title Update cached last modified time during re-download Fix races during parallel downloads Nov 14, 2024
@nirs nirs marked this pull request as ready for review November 14, 2024 20:32
@nirs
Copy link
Contributor Author

nirs commented Nov 14, 2024

Tested using a script creating 10 instances in parallel.

% grep 'downloading' out/*.log
out/vm-09.log:time="2024-11-14T22:34:40+02:00" level=debug msg="downloading \"https://cloud-images.ubuntu.com/releases/24.10/release-20241023/ubuntu-24.10-server-cloudimg-arm64.img\" into \"/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/2422cebd3cfd3b6733dd03b3724b14701a1dcfbb11115572c1eb98bb01c23731/data\""

% grep 'is cached' out/*.log
out/vm-01.log:time="2024-11-14T22:34:51+02:00" level=debug msg="file \"/Users/nsoffer/.lima/vm-01/basedisk\" is cached as \"/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/2422cebd3cfd3b6733dd03b3724b14701a1dcfbb11115572c1eb98bb01c23731/data\""
out/vm-02.log:time="2024-11-14T22:34:51+02:00" level=debug msg="file \"/Users/nsoffer/.lima/vm-02/basedisk\" is cached as \"/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/2422cebd3cfd3b6733dd03b3724b14701a1dcfbb11115572c1eb98bb01c23731/data\""
out/vm-03.log:time="2024-11-14T22:34:51+02:00" level=debug msg="file \"/Users/nsoffer/.lima/vm-03/basedisk\" is cached as \"/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/2422cebd3cfd3b6733dd03b3724b14701a1dcfbb11115572c1eb98bb01c23731/data\""
out/vm-04.log:time="2024-11-14T22:34:51+02:00" level=debug msg="file \"/Users/nsoffer/.lima/vm-04/basedisk\" is cached as \"/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/2422cebd3cfd3b6733dd03b3724b14701a1dcfbb11115572c1eb98bb01c23731/data\""
out/vm-05.log:time="2024-11-14T22:34:51+02:00" level=debug msg="file \"/Users/nsoffer/.lima/vm-05/basedisk\" is cached as \"/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/2422cebd3cfd3b6733dd03b3724b14701a1dcfbb11115572c1eb98bb01c23731/data\""
out/vm-06.log:time="2024-11-14T22:34:51+02:00" level=debug msg="file \"/Users/nsoffer/.lima/vm-06/basedisk\" is cached as \"/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/2422cebd3cfd3b6733dd03b3724b14701a1dcfbb11115572c1eb98bb01c23731/data\""
out/vm-07.log:time="2024-11-14T22:34:51+02:00" level=debug msg="file \"/Users/nsoffer/.lima/vm-07/basedisk\" is cached as \"/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/2422cebd3cfd3b6733dd03b3724b14701a1dcfbb11115572c1eb98bb01c23731/data\""
out/vm-08.log:time="2024-11-14T22:34:51+02:00" level=debug msg="file \"/Users/nsoffer/.lima/vm-08/basedisk\" is cached as \"/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/2422cebd3cfd3b6733dd03b3724b14701a1dcfbb11115572c1eb98bb01c23731/data\""
out/vm-10.log:time="2024-11-14T22:34:51+02:00" level=debug msg="file \"/Users/nsoffer/.lima/vm-10/basedisk\" is cached as \"/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/2422cebd3cfd3b6733dd03b3724b14701a1dcfbb11115572c1eb98bb01c23731/data\""```

Test scripts and logs:
test-parallel-download.tar.gz

@nirs
Copy link
Contributor Author

nirs commented Nov 14, 2024

Another test, creating and starting 3 instances in parallel:

% for i in 1 2 3; do (_output/bin/limactl start --tty=0 --name vm-$i &); done
INFO[0000] Terminal is not available, proceeding without opening an editor                                                                           
INFO[0000] Terminal is not available, proceeding without opening an editor 
INFO[0000] Terminal is not available, proceeding without opening an editor 
INFO[0000] Starting the instance "vm-3" with VM driver "vz" 
INFO[0000] Attempting to download the image              arch=aarch64 digest="sha256:d71df0bcca6c3d2e7530517d3885f1d007fd9210d40ce2054db36af2a2176c38" location="https://cloud-images.ubuntu.com/releases/24.10/release-20241023/ubuntu-24.10-server-cloudimg-arm64.img"
INFO[0000] Starting the instance "vm-1" with VM driver "vz" 
INFO[0000] Attempting to download the image              arch=aarch64 digest="sha256:d71df0bcca6c3d2e7530517d3885f1d007fd9210d40ce2054db36af2a2176c38" location="https://cloud-images.ubuntu.com/releases/24.10/release-20241023/ubuntu-24.10-server-cloudimg-arm64.img"
INFO[0000] Starting the instance "vm-2" with VM driver "vz" 
INFO[0000] Attempting to download the image              arch=aarch64 digest="sha256:d71df0bcca6c3d2e7530517d3885f1d007fd9210d40ce2054db36af2a2176c38" location="https://cloud-images.ubuntu.com/releases/24.10/release-20241023/ubuntu-24.10-server-cloudimg-arm64.img"
Downloading the image (ubuntu-24.10-server-cloudimg-arm64.img)
592.71 MiB / 592.71 MiB [----------------------------------] 100.00% 56.41 MiB/s
INFO[0011] Downloaded the image from "https://cloud-images.ubuntu.com/releases/24.10/release-20241023/ubuntu-24.10-server-cloudimg-arm64.img" 
INFO[0011] Using cache "/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/2422cebd3cfd3b6733dd03b3724b14701a1dcfbb11115572c1eb98bb01c23731/data" 
INFO[0011] Using cache "/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/2422cebd3cfd3b6733dd03b3724b14701a1dcfbb11115572c1eb98bb01c23731/data" 
INFO[0011] Converting "/Users/nsoffer/.lima/vm-3/basedisk" (qcow2) to a raw disk "/Users/nsoffer/.lima/vm-3/diffdisk" 
INFO[0011] Converting "/Users/nsoffer/.lima/vm-1/basedisk" (qcow2) to a raw disk "/Users/nsoffer/.lima/vm-1/diffdisk" 
INFO[0011] Converting "/Users/nsoffer/.lima/vm-2/basedisk" (qcow2) to a raw disk "/Users/nsoffer/.lima/vm-2/diffdisk" 
3.50 GiB / 3.50 GiB [-------------------------------------] 100.00% 702.62 MiB/s
INFO[0016] Expanding to 100GiB                          
3.50 GiB / 3.50 GiB [-------------------------------------] 100.00% 691.12 MiB/s
INFO[0016] Expanding to 100GiB                          
3.50 GiB / 3.50 GiB [-------------------------------------] 100.00% 680.00 MiB/s
INFO[0016] Expanding to 100GiB                          
INFO[0017] Attempting to download the nerdctl archive    arch=aarch64 digest="sha256:fe085381a09aa240ae5d1e0bbef1beccfb7c1d6dbb98bdc55bd416581d46ebc8" location="https://github.com/containerd/nerdctl/releases/download/v2.0.0/nerdctl-full-2.0.0-linux-arm64.tar.gz"
INFO[0017] Attempting to download the nerdctl archive    arch=aarch64 digest="sha256:fe085381a09aa240ae5d1e0bbef1beccfb7c1d6dbb98bdc55bd416581d46ebc8" location="https://github.com/containerd/nerdctl/releases/download/v2.0.0/nerdctl-full-2.0.0-linux-arm64.tar.gz"
INFO[0017] Attempting to download the nerdctl archive    arch=aarch64 digest="sha256:fe085381a09aa240ae5d1e0bbef1beccfb7c1d6dbb98bdc55bd416581d46ebc8" location="https://github.com/containerd/nerdctl/releases/download/v2.0.0/nerdctl-full-2.0.0-linux-arm64.tar.gz"
Downloading the nerdctl archive (nerdctl-full-2.0.0-linux-arm64.tar.gz)
195.48 MiB / 195.48 MiB [----------------------------------] 100.00% 37.47 MiB/s
INFO[0023] Downloaded the nerdctl archive from "https://github.com/containerd/nerdctl/releases/download/v2.0.0/nerdctl-full-2.0.0-linux-arm64.tar.gz" 
INFO[0023] Using cache "/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/1699e54a52757df863155fca76f8a77b50f05d993edca23421798af6635156f0/data" 
INFO[0023] Using cache "/Users/nsoffer/Library/Caches/lima/download/by-url-sha256/1699e54a52757df863155fca76f8a77b50f05d993edca23421798af6635156f0/data" 
INFO[0024] [hostagent] hostagent socket created at /Users/nsoffer/.lima/vm-3/ha.sock 
INFO[0024] [hostagent] Starting VZ (hint: to watch the boot progress, see "/Users/nsoffer/.lima/vm-3/serial*.log") 
INFO[0025] [hostagent] hostagent socket created at /Users/nsoffer/.lima/vm-1/ha.sock 
INFO[0025] [hostagent] Starting VZ (hint: to watch the boot progress, see "/Users/nsoffer/.lima/vm-1/serial*.log") 
INFO[0025] [hostagent] hostagent socket created at /Users/nsoffer/.lima/vm-2/ha.sock 
INFO[0025] [hostagent] Starting VZ (hint: to watch the boot progress, see "/Users/nsoffer/.lima/vm-2/serial*.log") 
INFO[0025] SSH Local Port: 64510                        
INFO[0025] [hostagent] [VZ] - vm state change: running  
INFO[0025] [hostagent] Waiting for the essential requirement 1 of 2: "ssh" 
INFO[0025] SSH Local Port: 64508                        
INFO[0025] SSH Local Port: 64506                        
INFO[0025] [hostagent] [VZ] - vm state change: running  
INFO[0025] [hostagent] Waiting for the essential requirement 1 of 2: "ssh" 
INFO[0025] [hostagent] Waiting for the essential requirement 1 of 2: "ssh" 
INFO[0025] [hostagent] [VZ] - vm state change: running  
INFO[0035] [hostagent] Waiting for the essential requirement 1 of 2: "ssh" 
INFO[0035] [hostagent] Waiting for the essential requirement 1 of 2: "ssh" 
INFO[0035] [hostagent] Waiting for the essential requirement 1 of 2: "ssh" 
INFO[0035] [hostagent] The essential requirement 1 of 2 is satisfied 
INFO[0035] [hostagent] Waiting for the essential requirement 2 of 2: "user session is ready for ssh" 
INFO[0035] [hostagent] The essential requirement 2 of 2 is satisfied 
INFO[0035] [hostagent] Waiting for the optional requirement 1 of 2: "systemd must be available" 
INFO[0035] [hostagent] Guest agent is running           
INFO[0035] [hostagent] Not forwarding TCP 127.0.0.54:53 
INFO[0035] [hostagent] Not forwarding TCP 127.0.0.53:53 
INFO[0035] [hostagent] Not forwarding TCP 0.0.0.0:22    
INFO[0035] [hostagent] The essential requirement 1 of 2 is satisfied 
INFO[0035] [hostagent] The optional requirement 1 of 2 is satisfied 
INFO[0035] [hostagent] Waiting for the essential requirement 2 of 2: "user session is ready for ssh" 
INFO[0035] [hostagent] Not forwarding TCP [::]:22       
INFO[0035] [hostagent] Waiting for the optional requirement 2 of 2: "containerd binaries to be installed" 
INFO[0035] [hostagent] The essential requirement 1 of 2 is satisfied 
INFO[0035] [hostagent] Waiting for the essential requirement 2 of 2: "user session is ready for ssh" 
INFO[0035] [hostagent] The essential requirement 2 of 2 is satisfied 
INFO[0035] [hostagent] Waiting for the optional requirement 1 of 2: "systemd must be available" 
INFO[0035] [hostagent] The essential requirement 2 of 2 is satisfied 
INFO[0035] [hostagent] Waiting for the optional requirement 1 of 2: "systemd must be available" 
INFO[0035] [hostagent] Guest agent is running           
INFO[0035] [hostagent] The optional requirement 1 of 2 is satisfied 
INFO[0035] [hostagent] Guest agent is running           
INFO[0035] [hostagent] Not forwarding TCP 127.0.0.53:53 
INFO[0035] [hostagent] Waiting for the optional requirement 2 of 2: "containerd binaries to be installed" 
INFO[0035] [hostagent] Not forwarding TCP 127.0.0.53:53 
INFO[0035] [hostagent] Not forwarding TCP 127.0.0.54:53 
INFO[0035] [hostagent] Not forwarding TCP 127.0.0.54:53 
INFO[0035] [hostagent] Not forwarding TCP 0.0.0.0:22    
INFO[0035] [hostagent] The optional requirement 1 of 2 is satisfied 
INFO[0035] [hostagent] Not forwarding TCP 0.0.0.0:22    
INFO[0035] [hostagent] Not forwarding TCP [::]:22       
INFO[0035] [hostagent] Not forwarding TCP [::]:22       
INFO[0035] [hostagent] Waiting for the optional requirement 2 of 2: "containerd binaries to be installed" 
INFO[0050] [hostagent] The optional requirement 2 of 2 is satisfied 
INFO[0050] [hostagent] Waiting for the guest agent to be running 
INFO[0050] [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished" 
INFO[0051] [hostagent] The optional requirement 2 of 2 is satisfied 
INFO[0051] [hostagent] Waiting for the guest agent to be running 
INFO[0051] [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished" 
INFO[0053] [hostagent] Forwarding TCP from 127.0.0.1:39523 to 127.0.0.1:39523 
INFO[0053] [hostagent] The optional requirement 2 of 2 is satisfied 
INFO[0053] [hostagent] Waiting for the guest agent to be running 
INFO[0053] [hostagent] Waiting for the final requirement 1 of 1: "boot scripts must have finished" 
INFO[0053] [hostagent] Forwarding TCP from 127.0.0.1:35733 to 127.0.0.1:35733 
INFO[0053] [hostagent] Forwarding TCP from 127.0.0.1:33171 to 127.0.0.1:33171 
INFO[0062] [hostagent] The final requirement 1 of 1 is satisfied 
INFO[0063] READY. Run `limactl shell vm-3` to open the shell. 
INFO[0063] [hostagent] The final requirement 1 of 1 is satisfied 
INFO[0063] READY. Run `limactl shell vm-1` to open the shell. 
INFO[0063] [hostagent] The final requirement 1 of 1 is satisfied 
INFO[0063] READY. Run `limactl shell vm-2` to open the shell. 

We tested that modified last modified time on the server cause a
redownload, but we did not test that after a redownload we update the
cache, so the next attempt will used the cached download.

Add a failing test verifying the issue, and improve test comments and
configuration to make it more clear.

Part-of: lima-vm#2902
Signed-off-by: Nir Soffer <[email protected]>
Extract getCache() to handle getting and file from the cache, and
fetch() for fetching a file from the remote to the cache. This will
allow locking around the code checking and updating the cache, and much
easier to maintain and understand.

Signed-off-by: Nir Soffer <[email protected]>
To solve the races during concurrent downloads and avoid unneeded work
and bandwidth, we allow one concurrent download of the same image.

When a limactl process try to access the cache, it takes a lock on the
file cache directory. If multiple processes try to get the lock in the
same time, only one will take the lock, and the other will block.

The process that took the lock tries to get the file from the cache.
This is the fast path and the common case. This can fail if the file is
not in the cache, the digest does not match, or the cached last modified
time does not match the last modified returned from the server.

If the process cannot get the file from the cache, it downloads the file
from the remote server, and update the cached data and metadata files.

Finally the process release the lock on the cache directory. Other
limactl processes waiting on the lock wake up and take the lock. In the
common case they will find the image in the cache and will release the
lock quickly.

Since we have exactly one concurrent download, updating the metadata
files is trivial and we don't need the writeFirst() helper.

Fixes: lima-vm#2902
Fixes: lima-vm#2732
Signed-off-by: Nir Soffer <[email protected]>
We can assert now that only one thread downloaded the file, and all
other threads used the cache.

Signed-off-by: Nir Soffer <[email protected]>
Checking if an image is cached races with parallel downloads. Take the
lock when validating the digest or the data file to ensure that we
validate the cached when it is in consistent state.

If an image is being downloaded, the check will block until the download
completes.

Signed-off-by: Nir Soffer <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Digest-less images always re-download after last-modified time changes on server
1 participant