Performance regression with lvm (non-thin) on latest/edge snap #14341

simondeziel opened this issue on Oct 25, 2024 · 5 comments · Fixed by canonical/lxd-pkg-snap#589
Status: Open · Labels: Bug (Confirmed to be a bug)

@simondeziel (Member)

We noticed a while ago that the tests/storage-vm lvm test from lxd-ci is much slower with latest/edge than with 5.21/edge.

Here are some logs taken from CI runs comparing the two snap channels.

storage-vm lvm (latest/edge - 24.04) taking ~44 minutes to complete:

==> Checking VM can be migrated with snapshots (different storage pool)
+ lxc copy v1 localhost:v2 -s vmpool-lvm-33922 --stateless
Transferring instance: v2/snap0: 82.51MB (82.50MB/s)
Transferring instance: v2/snap0: 163.06MB (81.51MB/s)
Transferring instance: v2/snap0: 243.90MB (81.29MB/s)
...
Transferring instance: v2/snap0: 3.53GB (233.38MB/s)                                                    
Transferring instance: v2: 785.26MB (785.25MB/s)
...
Transferring instance: v2: 3.60GB (449.27MB/s)

storage-vm lvm (5.21/edge - 24.04) taking ~24 minutes to complete:

==> Checking VM can be migrated with snapshots (different storage pool)
+ lxc copy v1 localhost:v2 -s vmpool-lvm-34072 --stateless
Transferring instance: v2/snap0: 315.72MB (315.72MB/s)
Transferring instance: v2/snap0: 631.64MB (315.82MB/s)
...
Transferring instance: v2/snap0: 3.68GB (73.25MB/s)                                                   
Transferring instance: v2: 745.92MB (745.91MB/s)
...
Transferring instance: v2: 3.55GB (317.22MB/s)

In both cases, we see that transferring v2/snap0 is much slower than transferring v2 itself. However, in the latest/edge case, the snapshot transfer is noticeably slower than that of 5.21/edge.

In those two CI runs, the GHA runners use the exact same 24.04 image, which means only the LXD snap and its core2X base differ: 5.21/edge uses core22 while latest/edge uses core24. To rule out an lvm2 version issue, I used lvm.external=true in canonical/lxd-ci#328 and got identical results, which seems to indicate a potential regression introduced in LXD between stable-5.21 and main.
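
For reference, switching to the host's lvm2 tools goes through the LXD snap's lvm.external option (the CI change itself lives in canonical/lxd-ci#328); a minimal sketch of the manual equivalent:

# Use the host's lvm2 tools instead of the ones bundled in the snap,
# then reload the daemon so the option takes effect (a reload is
# typically needed for LXD snap options).
snap set lxd lvm.external=true
systemctl reload snap.lxd.daemon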

One way to compare CI logs is to download raw logs from storage-vm lvm (latest/edge - 24.04) and storage-vm lvm (5.21/edge - 24.04). Once downloaded, they can be stripped of their datestamp prefix with sed:

$ sed 's/^[^Z]\+Z //' lvm-latest.raw > lvm-latest.txt
$ sed 's/^[^Z]\+Z //' lvm-521.raw > lvm-521.txt

meld can then be used to compare them line by line (meld lvm-latest.txt lvm-521.txt); this is the method I used to extract the log snippets above. Both the .raw and .txt files are included in the attached tarball. Due to GH policy, .tgz files cannot be attached, so I added a .txt placeholder.
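
As a quicker, non-graphical alternative (purely illustrative; file names are the ones produced by the sed commands above), the transfer-rate lines can also be pulled out and diffed directly:

# Keep only the transfer progress lines and compare them side by side.
grep 'Transferring instance' lvm-latest.txt > lvm-latest.xfer
grep 'Transferring instance' lvm-521.txt > lvm-521.xfer
diff -y --suppress-common-lines lvm-latest.xfer lvm-521.xfer | less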

@tomponline (Member)

@simondeziel please can you try building LXD from main (on an Ubuntu 22.04 system), sideloading it into the 5.21/edge snap, and repeating the tests? This will help us rule out LXD itself and hopefully narrow it down to something in the latest/edge snap packaging (the most likely candidate being the core24 base snap, I expect).
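
For reference, the requested workflow looks roughly like this (a sketch based on the usual LXD sideloading approach; exact paths and Makefile targets may differ, and the build deps are assumed to already be in place):

# Build LXD from main on an Ubuntu 22.04 machine (matching the core22 base
# used by the 5.21/edge snap), then sideload the resulting binary into the
# installed snap and reload the daemon.
git clone https://github.com/canonical/lxd && cd lxd
make
cp "$(go env GOPATH)/bin/lxd" /var/snap/lxd/common/lxd.debug
systemctl reload snap.lxd.daemon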

tomponline added this to the lxd-6.2 milestone on Oct 25, 2024
simondeziel self-assigned this on Oct 25, 2024
@simondeziel (Member, Author)

daemon.start does some hot patching of lvm.conf:

sed \
    -e "s#obtain_device_list_from_udev = 1#obtain_device_list_from_udev = 0#g" \
    -e "s#cache_file_prefix = \"\"#cache_file_prefix = \"lxd\"#g" \
    -e "s#udev_sync = 1#udev_sync = 0#g" \
    -e "s#udev_rules = 1#udev_rules = 0#g" \
    -e "s#use_lvmetad = 1#use_lvmetad = 0#g" \
    -e "s#monitoring = 1#monitoring = 0#g" \
    -e "s%# executable = \"/sbin/dmeventd\"%executable = \"${SNAP}/bin/dmeventd\"%g" \
    -e "/# .*_executable =/s/# //g" \
    -e "s#/usr/sbin/#${SNAP}/bin/#g" \
    "${SNAP}/etc/lvm/lvm.conf" > /etc/lvm/lvm.conf

However, this mangling likely needs to be updated: many of the settings it is supposed to alter remain commented out in the resulting lvm.conf on latest/edge, so they are effectively left at their defaults.
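
A rough, untested sketch of what updated expressions might look like, matching the commented-out defaults ("# udev_sync = 1", etc.) shipped by the newer lvm.conf template as well as the uncommented form (the rewritten lines lose their indentation, which lvm.conf does not care about):

# Uncomment-and-set the relevant options regardless of whether the template
# ships them commented out; everything else stays as in the current script.
sed \
    -e 's|^[[:space:]#]*udev_sync = 1|udev_sync = 0|' \
    -e 's|^[[:space:]#]*udev_rules = 1|udev_rules = 0|' \
    -e 's|^[[:space:]#]*monitoring = 1|monitoring = 0|' \
    -e "s|^[[:space:]#]*executable = \"/sbin/dmeventd\"|executable = \"${SNAP}/bin/dmeventd\"|" \
    "${SNAP}/etc/lvm/lvm.conf" > /etc/lvm/lvm.conf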

@simondeziel (Member, Author) commented on Nov 6, 2024

The performance is still wildly different between lvm and lvm-thin on latest/edge as can be seen here: https://github.com/canonical/lxd-ci/actions/runs/11708961560/job/32611939047

I found another interesting difference between 5.21/stable and latest/edge:

root@jupiter:~# snap list lxd
Name  Version         Rev    Tracking     Publisher   Notes
lxd   5.21.2-2f4ba6b  30131  5.21/stable  canonical✓  -

root@jupiter:~# LD_LIBRARY_PATH=/snap/lxd/current/lib/:/snap/lxd/current/lib/x86_64-linux-gnu/:/snap/lxd/current/zfs-2.2/lib PATH=/snap/lxd/current/zfs-2.2/bin:/snap/lxd/current/bin:$PATH nsenter --mount=/run/snapd/ns/lxd.mnt -- lvmconfig --typeconfig diff
dmeventd {
	executable="/snap/lxd/30131/bin/dmeventd"
}
activation {
	udev_sync=0
	udev_rules=0
	monitoring=0
}
global {
	thin_check_executable="/snap/lxd/30131/bin/thin_check"
	thin_dump_executable="/snap/lxd/30131/bin/thin_dump"
	thin_repair_executable="/snap/lxd/30131/bin/thin_repair"
	cache_check_executable="/snap/lxd/30131/bin/cache_check"
	cache_dump_executable="/snap/lxd/30131/bin/cache_dump"
	cache_repair_executable="/snap/lxd/30131/bin/cache_repair"
}
devices {
	obtain_device_list_from_udev=0
	issue_discards=1
}
root@sdeziel-lemur:~# snap list lxd
Name  Version      Rev    Tracking     Publisher   Notes
lxd   git-9ac2433  31013  latest/edge  canonical✓  -

root@sdeziel-lemur:~# LD_LIBRARY_PATH=/snap/lxd/current/lib/:/snap/lxd/current/lib/x86_64-linux-gnu/:/snap/lxd/current/zfs-2.2/lib PATH=/snap/lxd/current/zfs-2.2/bin:/snap/lxd/current/bin:$PATH nsenter --mount=/run/snapd/ns/lxd.mnt -- lvmconfig --typeconfig diff
activation {
	udev_sync=0
	udev_rules=0
	monitoring=0
}
global {
	thin_check_executable="/snap/lxd/31013/bin/thin_check"
	thin_dump_executable="/snap/lxd/31013/bin/thin_dump"
	thin_repair_executable="/snap/lxd/31013/bin/thin_repair"
	cache_check_executable="/snap/lxd/31013/bin/cache_check"
	cache_dump_executable="/snap/lxd/31013/bin/cache_dump"
	cache_repair_executable="/snap/lxd/31013/bin/cache_repair"
	vdo_format_executable="/snap/lxd/31013/bin/vdoformat"
	fsadm_executable="/snap/lxd/31013/bin/fsadm"
}
devices {
	issue_discards=1
}

From the above, obtain_device_list_from_udev is not relevant as it now defaults to 0. That leaves the missing dmeventd section as the only effective config difference (ignoring the fsadm and vdo_format additions).
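
An illustrative way to confirm that on an affected system (same mount namespace trick as above) is to look at the dmeventd lines of the lvm.conf generated by daemon.start:

# Show every dmeventd-related line, commented or not; on 5.21/edge the
# executable line appears uncommented and points at the snap's dmeventd.
nsenter --mount=/run/snapd/ns/lxd.mnt -- grep -n 'dmeventd' /etc/lvm/lvm.conf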

@simondeziel (Member, Author) commented on Nov 13, 2024

5.21/edge sideloaded with "itself":

root@v1:~/lxd-ci# time PURGE_LXD=1 LXD_SIDELOAD_PATH=~/lxd.521 ./bin/local-run tests/storage-vm 5.21/edge lvm
+ echo 'Test passed'
Test passed
+ exit 0

real	31m34.850s
user	0m15.943s
sys	0m35.073s

5.21/edge + main LXD/lxc sideloaded (DQLITE=lts-1.17.x):

root@v1:~/lxd-ci# time PURGE_LXD=1 LXC_SIDELOAD_PATH=~/lxc LXD_SIDELOAD_PATH=~/lxd.main ./bin/local-run tests/storage-vm 5.21/edge lvm
+ echo 'Test passed'
Test passed
+ exit 0

real	31m25.747s
user	0m14.969s
sys	0m31.677s

So that's apparently not a regression in LXD itself.

@simondeziel (Member, Author)

It's unclear if the lxd-pkg-snap PR will fix it, as I couldn't reproduce the slowness with latest/edge locally:

root@v1:~/lxd-ci# time PURGE_LXD=1 ./bin/local-run tests/storage-vm latest/edge lvm
+ '[' 0 = 1 ']'
+ echo 'Test passed'
Test passed
+ exit 0

real	32m31.590s
user	0m15.224s
sys	0m34.292s

But at least the lxd-pkg-snap PR didn't cause a regression either:

root@v1:~/lxd-ci# time PURGE_LXD=1 LXD_SNAP_PATH=/dev/shm/lxd_0+git.89550582_amd64.snap ./bin/local-run tests/storage-vm latest/edge lvm
+ '[' 0 = 1 ']'
+ echo 'Test passed'
Test passed
+ exit 0

real	32m52.209s
user	0m14.686s
sys	0m33.072s

I'll check the next scheduled lxd-ci run tomorrow.
