
cpuset not working with present arch linux (maybe cgroupv2?) #40

Open
ebennett1980 opened this issue Apr 19, 2021 · 17 comments
@ebennett1980

The problem seems to afflict Arch Linux on two different kernels. I think it has to do with them using control group v2 while cpuset expects v1, perhaps? Arch switched over in a recent update.

The symptom looks like this:
root@monolith:~# cset shield --cpu=0-7
mount: /cpusets: none already mounted on /sys/fs/bpf.
cset: **> mount of cpuset filesystem failed, do you have permission?

Two kernels confirmed affected:

Linux monolith 5.4.85-1-vfio-lts #1 SMP Wed, 23 Dec 2020 06:46:51 +0000 x86_64 GNU/Linux
Linux magister 5.9.11-xanmod1-1 #1 SMP PREEMPT Tue, 01 Dec 2020 12:38:55 +0000 x86_64 GNU/Linux

@Werkov
Member

Werkov commented Apr 20, 2021

Hello.
What does the following say on your system:

grep -E "cpuset|cgroup2" /proc/mounts
cat /sys/fs/cgroup/cgroup.controllers  # edit: or wherever your cgroup2 tree is mounted

?

Also, what systemd version is this?

(I'm just checking, but I'd generally not expect cpuset to work with the v2 cpuset controller.)

@ebennett1980
Author

Results on the metal
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
cpuset cpu io memory hugetlb pids rdma

Results in a VM
cgroup2 /sys/fs/cgroup cgroup2 rw,nosuid,nodev,noexec,relatime 0 0
cpuset cpu io memory hugetlb pids rdma

systemd is 248-5 on both

@Werkov
Member

Werkov commented Apr 21, 2021

Thanks. So the cpuset controller is bound to the v2 hierarchy, and it also shows that systemd runs in unified mode (that was perhaps the critical change between updates). I'm afraid the cset utility can't serve you in such a setup.

(As a workaround, you may switch back to the hybrid setup via the kernel cmdline (systemd.unified_cgroup_hierarchy=0) and use cset as before, or migrate your configuration to systemd's cpuset implementation (I haven't tested it).)
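
A sketch of the first workaround on a GRUB-based setup (GRUB, the file path and the quiet parameter standing in for your existing cmdline are assumptions; other bootloaders take kernel parameters in their own config):

# /etc/default/grub: keep your existing parameters and add the new one
GRUB_CMDLINE_LINUX_DEFAULT="quiet systemd.unified_cgroup_hierarchy=0"

# regenerate the GRUB config and reboot
grub-mkconfig -o /boot/grub/grub.cfg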

@Hubro

Hubro commented May 5, 2021

@Werkov Do you have any advice on how to shield CPU cores now, with the new unified cgroup hierarchy? Does systemd have any functionality for this? I am unable to find any information about this.

@Werkov
Member

Werkov commented May 6, 2021

@Hubro With a recent systemd version you should be able to use the AllowedCPUs= directive. You can't set it directly on the root (-.slice), but you can set it on all 1st-level children instead (init.scope, system.slice and user.slice in default setups). That way you can move the userspace tasks out of the way. (Note that kernel threads can still run on "shielded" CPUs, but that is no different from cset shield --kthread=off, the default.)
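
A minimal persistent sketch of this, assuming drop-in files (the file names and the 0-3 housekeeping range are illustrative, not from this thread):

# /etc/systemd/system/system.slice.d/allowed-cpus.conf and
# /etc/systemd/system/user.slice.d/allowed-cpus.conf, each containing:
[Slice]
AllowedCPUs=0-3

# init.scope takes the same directive under a [Scope] section instead.
# Reload so systemd picks up the drop-ins:
systemctl daemon-reload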

@Hubro

Hubro commented May 6, 2021

@Werkov Is it possible to set AllowedCPUs for already running slices without restarting them? Also, is there any way to inform the kernel which cores I want it to keep its threads on? In this case a kernel command line argument would be fine.

EDIT: I figured out how to set AllowedCPUs at runtime:

sudo systemctl set-property --runtime user.slice AllowedCPUs=0-3
sudo systemctl set-property --runtime system.slice AllowedCPUs=0-3

This didn't do anything for me, but I assume that's because I disabled the unified cgroup hierarchy. I'll test this out later with the unified hierarchy enabled.

I still have no idea how to keep kernel threads off my virtualization cores though.

@Werkov
Member

Werkov commented May 7, 2021

Setting the cgroup attributes should work at runtime exactly as you did (alternatively, you can edit the slice unit or drop-in files and call systemctl daemon-reload); restarting the slices is not necessary. (One possible catch with the runtime update is that NUMA memory won't be migrated with the change.) And you need the unified hierarchy for this to work with systemd (otherwise you'd have used cset, right?).

If you need "silence" on the CPU, then see the isolcpus kernel cmdline. I'm just curious why you need to keep kernel threads off your selected cores (is that due to RT constraints?).

@Hubro

Hubro commented May 7, 2021

@Werkov My use case is a high-performance virtual machine doing realtime tasks, so I'm doing everything I can to reduce latency and stutters on those cores.

I'm not entirely sure how isolcpus works. Will this keep kernel threads off those cores, or will it keep everything off them? I want to be able to do compilation and encoding tasks using all my host cores when my VMs are not running, so any kernel cmdline parameter that prevents that is not an option for me.


I just noticed the docs you linked say that isolcpus is deprecated:

isolcpus=       [KNL,SMP,ISOL] Isolate a given set of CPUs from disturbance.
                        [Deprecated - use cpusets instead]
                        Format: [flag-list,]<cpu-list>

Does that mean that cpusets can do all the things that isolcpus can do? 🤔

@fweisbec

fweisbec commented May 7, 2021

Does that mean that cpusets can do all the things that isolcpus can do? 🤔

Not exactly. isolcpus is often used to disable scheduler load balancing on a CPU, and that's the only part where cpusets can help in a similar fashion (through cpuset.sched_load_balance), plus telling which tasks are allowed to run on a given set of CPUs. But that's where the similarity ends.

In fact isolcpus does much more, as it also isolates from unbound kernel threads, workqueues, timers, etc.

I understand you can't afford to use boot parameters, but it's worth being aware of "nohz_full=". It will isolate your CPU pretty much as well as isolcpus does, and it will also deactivate the timer tick on the host CPU, avoiding interrupting the guest vCPUs. If you combine that with cpusets to move all unrelated tasks off the CPUs running the guests, you might get good results. Oh, and don't forget to re-affine interrupts away from the CPUs running guests as well (https://www.kernel.org/doc/Documentation/IRQ-affinity.txt).
I'm writing a series of articles about that, if that can help: https://www.suse.com/c/cpu-isolation-introduction-part-1/
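
A rough sketch of the interrupt re-affinity step (splitting CPUs 0-3 for housekeeping and 4-7 for guests on an 8-CPU host is an assumption; the masks are hex CPU bitmaps):

# Route newly requested IRQs to CPUs 0-3 (mask 0x0f)...
echo 0f > /proc/irq/default_smp_affinity
# ...and move existing IRQs there too; some (e.g. per-CPU IRQs) refuse the
# write, hence the error suppression.
for irq in /proc/irq/[0-9]*; do
    echo 0f > "$irq/smp_affinity" 2>/dev/null || true
done
# Corresponding boot parameter mentioned above (range illustrative): nohz_full=4-7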

@lpechacek
Contributor

Thanks, @fweisbec, for the comment and the recommendation about nohz_full. I'd recommend the same for ensuring an undisturbed VM run.

Regarding the cpuset v2 cgroup controller in general, I'll dump my current thoughts here. It's not going to be a neatly formulated message, but I'll be grateful for alternative views and opinions.

  1. (AFAIK) The cpuset utility was created in the pre-systemd era as part of the Novell/SUSE SLERT offering. It is quite a nice utility with reasonably good code.
  2. When the cpuset author left the company, I took over maintenance of the utility because patches started to pile up in the package. I didn't know there were that many users outside the company, specifically because I don't recall receiving any feedback from other distro maintainers about the changes upstream.
  3. With the introduction of systemd in SUSE products I heard horror stories about how systemd freezes when external programs manipulate its cgroup settings. It was LTP at that time taking systemd down in product testing.
  4. I noticed the introduction of the cgroup v2 hierarchy, briefly discussed it with our cgroup expert, and put v2 hierarchy support on my "look into it when there's spare time" list.
  5. The introduction of the cpuset controller in the v2 hierarchy made me recall that dormant task. Given my beliefs about the incompatibility with systemd, I thought that cpuset might be helpful for inspecting the hierarchy but perhaps should not alter the system settings. The process CPU scheduling options should perhaps be controlled with systemd-run or something like that (see the sketch at the end of this comment). I haven't tried it myself yet, but that's where I would start my search.

At this point, I'd like to know your opinion(s) about whether it is safe to manipulate systemd's cgroup settings without the daemon's consent. If you have any further comments, feel free to drop them here as well. Thanks!
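
As an illustration of the systemd-run idea from point 5, an untested sketch (the command name and CPU range are placeholders; AllowedCPUs= is the property discussed earlier in this thread):

# Run a one-off process in its own scope, restricted to the shielded CPUs
systemd-run --scope -p AllowedCPUs=4-7 ./my_realtime_app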

@Werkov
Member

Werkov commented May 13, 2021

3. With the introduction of systemd in SUSE products I heard horror stories about how systemd freezes when external programs manipulate its cgroup settings. It was LTP at that time taking systemd down in product testing.

Fortunately, this is irrelevant for cgroup hierarchies that are not managed by systemd. In practice, the cpuset utility could be used safely with systemd prior to v244 (which introduced cpuset support in systemd; edit: therefore the cpuset hierarchy was unmanaged by systemd in older versions).

At this point, I'd like to know your opinion(s) about whether it is safe to manipulate systems cgroup settings without the daemon's consent.

Nowadays with v2, it is safe when the operations are carried out in a dedicated subtree only. The full description is in the systemd document about cgroup delegation.
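
And a minimal sketch of asking for such a dedicated subtree (the unit name is hypothetical):

# /etc/systemd/system/myshield.service.d/delegate.conf
# Delegate= hands the service its own cgroup subtree to manage, so writes
# below that point don't conflict with systemd.
[Service]
Delegate=yes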

@haelix888

haelix888 commented Jun 9, 2021

Has anyone managed to set isolcpus on Arch using EFISTUB (efibootmgr) by any chance?
The kernel parameter is not getting picked up.
(linux-lts 5.10.40)

Edit: possibly related: https://lore.kernel.org/lkml/20200414215715.GB182757@xz-x1/T/#u

@fweisbec

fweisbec commented Jun 9, 2021

Has anyone managed to set isolcpus on Arch using EFISTUB (efibootmgr) by any chance?
The kernel parameter is not getting picked up.
(linux-lts 5.10.40)

Edit: possibly related: https://lore.kernel.org/lkml/20200414215715.GB182757@xz-x1/T/#u

No idea, but you can still hardcode kernel boot options with CONFIG_CMDLINE_BOOL + CONFIG_CMDLINE.
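
In sketch form (the parameter values are placeholders for your topology):

# Kernel .config: bake the boot parameters into the image
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE="isolcpus=4-7 nohz_full=4-7"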

@haelix888

My issue was with the BIOS. It somehow deduplicates boot entries that have the same image but different parameters. In my case I can get it to work by entering the BIOS and modifying the boot priority order (even though efibootmgr correctly reports the order).

@joshuaboniface

Just chiming in here: I'm looking to use cset on Debian 11 which, by default, leverages the unified cgroup hierarchy. While disabling the unified hierarchy is of course feasible and did work for me, I'd be concerned about the long-term implications of this, especially when Debian 12 drops with who-knows-what other changes in systemd, cgroups, etc.

Is there currently any plan to support the unified hierarchy?

I ask because, while the systemd unit option might be useful in some cases, in my case I'm using cset to fully isolate one process to its own set of CPUs. By leveraging cset and its automated moving of processes into another set, this is pretty trivial: I move everything into a new cset with cset proc --move --force, and then use cset proc to put my new processes into their own cset. But trying to update every systemd unit to exclude them from executing on those CPUs would not be trivial. I'd be curious if anyone else has an alternative for this if indeed cset isn't going to support the unified hierarchy long-term. No rush of course; I have at least 2 years until it could potentially become a problem, but I wanted to get ahead of it ;-)

joshuaboniface added a commit to parallelvirtualcluster/pvc-ansible that referenced this issue Oct 10, 2021
This is required on Debian 11 to use the cset tool, since the newer
systemd implementation of a unified cgroup hierarchy is not compatible
with the cset tool.

Ref for future use:
  SUSE/cpuset#40
@Werkov
Member

Werkov commented Oct 11, 2021

But trying to update every systemd unit to exclude them from executing on those CPUs would not be trivial.

Actually, you should just be able to leverage the hierarchy and apply cpuset systemd settings on the top-level units only (system.slice, user.slice, machine.slice, init.scope by default).
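
A runtime sketch of that, reusing the set-property command shown earlier in this thread (the 0-3 housekeeping range is illustrative, and machine.slice/init.scope may reject the property on older systemd versions):

# Confine all top-level units to CPUs 0-3, leaving the rest for the isolated workload
for unit in init.scope system.slice user.slice machine.slice; do
    sudo systemctl set-property --runtime "$unit" AllowedCPUs=0-3
done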

@joshuaboniface

Actually, you should just be able to leverage the hierarchy and apply cpuset systemd settings on the top-level units only (system.slice, user.slice, machine.slice, init.scope by default).

Interesting, that would definitely suit my needs; I'll give it a shot. I hadn't considered setting it at the slice level!

matta added a commit to matta/lower_bound_benchmark that referenced this issue Sep 16, 2022
VoodaGod added a commit to VoodaGod/rokups.github.io that referenced this issue Mar 25, 2023
Just some suggestions because cset is no longer usable for many: SUSE/cpuset#40

I tried to replicate what `cset` does in your original script with the suggestions from the linked issue.

I am unsure if setting the `/sys/bus/workqueue/devices/writeback/cpumask` is superfluous if `nohz_full` is configured.
joshuaboniface added a commit to parallelvirtualcluster/pvc-ansible that referenced this issue Sep 1, 2023
This is required on Debian 11 to use the cset tool, since the newer
systemd implementation of a unified cgroup hierarchy is not compatible
with the cset tool.

Ref for future use:
  SUSE/cpuset#40