Add racct limit option #11

Open · wants to merge 2 commits into main

Conversation

@cyrilzhangfreebsd

Issue number:
#8

Description of changes:
This adds a FreeBSD-specific struct to the configuration, populated with an array of racct limit structs that define rctl rules to apply to the jail. It also adds a function that installs the rules via rctl when the jail is created and removes them when the jail is deleted.

Testing done:
I tested by creating some jails with various racct rules, as well as with an empty "freebsd" configuration. The unit tests also pass.

Terms of contribution:

By submitting this pull request, I agree that this contribution is licensed under the terms found in the LICENSE file.

@davidchisnall

I'm a little bit nervous that this seems to leak the RACCT resource names directly into the config file format. This means that if FreeBSD 14 adds a new RACCT resource that isn't in FreeBSD 13, it would be supported by runj on 14 but not on 13. It's not clear to me that all of the RACCT resources make sense in the context of a container. For example, should coredumpsize be a property of a container? Setting a wallclock limit seems highly dubious for a container. Similarly, some of the actions probably don't make sense: if you want the signal behaviour, you probably want to configure this inside the container. The log and devctl rules are interesting corner cases because they might be interesting for a container orchestration layer that does FreeBSD-specific monitoring things but probably aren't useful for containerd.

I think it would make sense to start by defining a subset of RACCT that definitely makes sense and proposing that for inclusion in the FreeBSD-specific bit of the spec. It's easy to add other actions and resources later if they make sense; it's much harder to remove things.

@samuelkarp
Owner

I'm a little bit nervous that this seems to leak the RACCT resource names directly into the config file format.

It doesn't look like that to me; @cyrilzhangfreebsd is proposing adding the following structs:

// FreeBSD contains platform-specific configuration for FreeBSD based containers.
type FreeBSD struct {
	// RacctLimits specifies racct rules to apply to this jail.
	RacctLimits []RacctLimit `json:"racct,omitempty"`
}

// RacctLimit is a racct rule to apply to a jail.
type RacctLimit struct {
	// Resource is the resource to set a limit on.
	Resource string `json:"resource"`
	// Action is what will happen if a process exceeds the allowed amount.
	Action string `json:"action"`
	// Amount is the allowed amount of the resource.
	Amount string `json:"amount"`
	// Per defines the entity that the amount applies to.
	Per string `json:"per,omitempty"`
}

This doesn't encode any of the resource names; the names are represented as the Resource field (which is a string).
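For illustration (this is just my reading of the proposed format), a config fragment using these structs might look something like the following; the "freebsd" key matches what was used in the testing note above, and vmemoryuse, deny, and jail are just one possible rctl(8) resource, action, and per value, not names baked into the structs:

{
  "freebsd": {
    "racct": [
      {
        "resource": "vmemoryuse",
        "action": "deny",
        "amount": "1g",
        "per": "jail"
      }
    ]
  }
}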

Am I reading this incorrectly?

@samuelkarp
Owner

I spent some time reading rctl(8), rctl(4), and rctl.conf(5) as well as the Linux portions of the runtime spec.

I think it would make sense to start by defining a subset of RACCT that definitely makes sense and proposing that for inclusion in the FreeBSD-specific bit of the spec.

@davidchisnall I can see a bit of what you're saying here. On the Linux side, the spec breaks down individual limits like this in the LinuxResources struct. Similarly, Windows and Solaris have resource-specific structs and fields.

I'm not as familiar with Windows or Solaris, but on Linux this definitely makes sense. While the resource limits are expressed through cgroups, each cgroup controller has its own set of configuration pseudofiles and has to be treated separately. The OCI spec reflects this even though it's not necessarily requiring that the resource limits be implemented with cgroups. FreeBSD does seem to be a bit different; the rctl(8) configuration format is consistent for all of the resource types (even if they don't all support the same actions).
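For context, every rctl(8) rule follows the same general shape, subject:subject-id:resource:action=amount/per, regardless of which resource it targets; for example (the jail name here is a placeholder):

jail:myjail:vmemoryuse:deny=1g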

An open format like this PR would allow the full set of rctl(8) rules to be specified. A constrained format with first-class fields is less flexible, but it better matches the rest of the platform-specific resource configuration in the OCI spec and decouples the rctl(8) resource names from the intent expressed in the configuration.

@samuelkarp
Owner

Here's a potential alternative sketch:

type FreeBSD struct {
	Resources *FreeBSDResources `json:"resources,omitempty"`
}

type FreeBSDResources struct {
	// Memory is the memory restriction configuration
	Memory *FreeBSDMemory `json:"memory,omitempty"`
	// FSIO is the filesystem IO restriction configuration
	FSIO *FreeBSDFSIO `json:"fsIO,omitempty"`
}

type FreeBSDMemory struct {
	// Limit is the memory limit (in bytes)
	Limit *int64 `json:"limit,omitempty"`
	// Warning is the amount of memory (in bytes) where a warning is sent to devd(8)
	Warning *int64 `json:"warning,omitempty"`
	// Swap is the amount of swap that may be used (in bytes)
	Swap *int64 `json:"swap,omitempty"`
	// SwapWarning is the amount of swap (in bytes) where a warning is sent to devd(8)
	SwapWarning *int64 `json:"swapWarning,omitempty"`
	// etc...
}

type FreeBSDFSIO struct {
	// ReadBPS is the number of filesystem reads (in bytes per second) before throttling occurs
	ReadBPS *int64 `json:"readbps,omitempty"`
	// WriteBPS is the number of filesystem writes (in bytes per second) before throttling occurs
	WriteBPS *int64 `json:"writebps,omitempty"`
	// etc...
}
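For comparison, a config fragment under this sketch might look roughly like the following (field names taken from the json tags above; the values are arbitrary, in bytes and bytes per second):

{
  "freebsd": {
    "resources": {
      "memory": {
        "limit": 134217728,
        "swap": 268435456
      },
      "fsIO": {
        "readbps": 10485760
      }
    }
  }
}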

I think I like this better, but I haven't tested it and reserve the right to change my mind again 😄

@davidchisnall, some questions:

The log and devctl rules are interesting corner cases because they might be interesting for a container orchestration layer that does FreeBSD-specific monitoring things but probably aren't useful for containerd.

For devctl this makes sense to me; an orchestrator may want to pass that configuration through containerd and register actions with devd.conf(5). I'm not sure I understand the use-case for log as well; would a program expect to parse console output or is that primarily for human operators?

Similarly, some of the actions probably don't make sense: if you want the signal behaviour, you probably want to configure this inside the container.

Why would the signal behavior be configured inside the jail? If we registered a rule like jail:$id:vmemoryuse:sigterm=100m would a process inside the jail receive SIGTERM if the jail's aggregate virtual memory was greater than 100m or if the process's virtual memory was greater than 100m?

@davidchisnall

Here's a potential alternative sketch:

This is more what I had in mind. This way, if a future version of FreeBSD changes the names of the resources or actions, the format can be stable, and we can guarantee that every option in the format is fully supported on every version of FreeBSD that runj may run on. We also don't get into a situation where a new FreeBSD release adds more controls and putting them in the file enforces them on the newer kernel but silently fails to enforce them on older kernels. If we add support for such extensions then they can be guarded by a check on the OS minimum version in the container config. We can also report errors at parse time if a restriction is specified that we can't enforce, rather than having to propagate the error back from the syscall and try to explain to the user why the syscall failed and what they need to do to fix it.

For devctl this makes sense to me; an orchestrator may want to pass that configuration through containerd and register actions with devd.conf(5). I'm not sure I understand the use-case for log as well; would a program expect to parse console output or is that primarily for human operators?

I don't really know how this plugs into higher-level orchestration frameworks on Linux. At the moment, devd is quite annoying for things like this to use because the set of actions is statically defined by configuration files. The other interface to devd is a socket that provides the raw messages that you can parse yourself, so I can imagine an orchestration framework plugging into that.

I'm not quite sure about the separation of concerns here though: Should this be something that's configured per container, or would the orchestration framework want to provide a global setting of the form 'this is the mechanism that you should use to notify me if the resource limits (whatever they are) for this container are exhausted'?

The log output is purely human-readable. It seems an odd thing to put in the container configuration file, but I can imagine having a global setting that says 'I am trying to debug a thing, please log all of these things for me to look at later'.

Why would the signal behavior be configured inside the jail? If we registered a rule like jail:$id:vmemoryuse:sigterm=100m would a process inside the jail receive SIGTERM if the jail's aggregate virtual memory was greater than 100m or if the process's virtual memory was greater than 100m?

I believe this will signal every process in the jail, but I'm not 100% sure. My original comment was on the assumption that you register for signals if you want to handle them and recover: if you don't handle them, the process is killed anyway and so you may as well just kill the process / jail. If a process wants to gracefully handle hitting resource limits then I would expect it to register for the notification at the same time it installs the signal handler.

The SIGTERM case is interesting though. I can imagine, for example, setting a resource limit of 100m for sending SIGTERM and of 120m (for example) to kill the jail: this allows the processes to gracefully shut down if they reach 100 MiB of memory usage and forcibly kills them if they go over 120 MiB. Hopefully the graceful shutdown shouldn't consume more than an extra 20 MiB of RAM.
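In rctl(8) rule terms, that two-tier setup might look something like the sketch below (the jail name is a placeholder; sigkill, or alternatively deny, is one way to express the hard limit):

jail:myjail:memoryuse:sigterm=100m
jail:myjail:memoryuse:sigkill=120m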

@samuelkarp
Owner

if a future version of FreeBSD changes the names of the resources or actions, the format can be stable

Is this likely? Have the resources or actions changed names previously?

The other interface to devd is a socket that provides the raw messages that you can parse yourself, so I can imagine an orchestration framework plugging into that.

I didn't realize devd had a socket to process the messages, but yes that sounds like it'd likely be easier.

Should this be something that's configured per container, or would the orchestration framework want to provide a global setting of the form 'this is the mechanism that you should use to notify me if the resource limits (whatever they are) for this container are exhausted'?

Since we're speaking in super theoretical terms: I'd anticipate it being a more flexible design to specify this per-container. That would enable multiple distinct orchestrators to function at the same time, even if one of them wants to use the same mechanism to be notified if any container exceeds its limits.

If a process wants to gracefully handle hitting resource limits then I would expect it to register for the notification at the same time it installs the signal handler.

Can a non-root process request limits/notifications for itself through rctl?

The SIGTERM case is interesting though. I can imagine, for example, setting a resource limit of 100m for sending SIGTERM and of 120m (for example) to kill the jail: this allows the processes to gracefully shut down if they reach 100 MiB of memory usage and forcibly kills them if they go over 120 MiB.

Yep, this is the kind of use-case that I'd like to support.

@davidchisnall

if a future version of FreeBSD changes the names of the resources or actions, the format can be stable

Is this likely? Have the resources or actions changed names previously?

I don't believe so, but there is no guarantee that they will remain stable between major releases.

The other interface to devd is a socket that provides the raw messages that you can parse yourself, so I can imagine an orchestration framework plugging into that.

I didn't realize devd had a socket to process the messages, but yes that sounds like it'd likely be easier.

The socket just forwards all of the events from the raw device to other processes, so the consumer needs to parse them all and ignore the ones that it doesn't want.

Should this be something that's configured per container, or would the orchestration framework want to provide a global setting of the form 'this is the mechanism that you should use to notify me if the resource limits (whatever they are) for this container are exhausted'?

Since we're speaking in super theoretical terms: I'd anticipate it being a more flexible design to specify this per-container. That would enable multiple distinct orchestrators to function at the same time, even if one of them wants to use the same mechanism to be notified if any container exceeds its limits.

Sorry, I meant: what is responsible for handling the event? If it's configured in the container config, then the person writing the container configuration also needs to set up whatever handles it. If the container configuration describes only the resource limit, then the orchestration framework is responsible for handling the notification and can make global decisions about what to do.

If a process wants to gracefully handle hitting resource limits then I would expect it to register for the notification at the same time it installs the signal handler.

Can a non-root process request limits/notifications for itself through rctl?

I'm not sure, but don't forget that each jail has a root user, so processes within the jail can start as root and set the resource limits if they want to.

@cyrilzhangfreebsd
Author

Thank you, a lot of this feedback is useful and I think many good questions are being asked.

I initially designed the configuration to mimic the rlimits configuration option, which is defined as "For each entry in rlimits, a getrlimit(3) on type MUST succeed", so there is precedent for just supporting everything that the system supports. However, this approach runs into some trouble on FreeBSD, because we have some limits that don't make sense in a jail, particularly those defined for users.
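For reference, an rlimits entry in the runtime spec looks roughly like this (values arbitrary):

"rlimits": [
  {
    "type": "RLIMIT_NOFILE",
    "hard": 1024,
    "soft": 1024
  }
]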

So with this in mind, it probably makes sense to define specific resource limits in an implementation-agnostic way so that we can omit those racct limits that don't make sense.

@samuelkarp, if you are still happy with your alternative sketch, I could make a new commit using that configuration style.

@samuelkarp
Owner

@samuelkarp, if you are still happy with your alternative sketch, I could make a new commit using that configuration style.

I am, though I'd love to hear from others as well. I've talked to @tianon (one of the runtime-spec maintainers) a bit and hopefully he doesn't mind me asking for his thoughts publicly here: Tianon, what do you think?

@cyrilzhangfreebsd
Author

I'm in the process of making a new commit with the new data structure. In doing so, some things came to mind.

For instance, I noticed that in the sketch you provided, there is no longer an option to send a signal to a process in a jail if it attempts to go over the memory limit. Is this something we would still like to support, or is it perhaps nonsensical? If we want to support it, one way might be to have a Signal field similar to Limit and Warning, which would specify the amount of memory at which a signal (also named in the configuration) is sent to the process.

Another feature that has been lost is the "per" option, which would allow a limit to be defined per process in the jail rather than for the jail as a whole. For example, we could limit the number of threads per process, stack size per process, CPU usage per process, etc. (see the sketch below). I am not sure whether this would be useful, however, so perhaps it is worth omitting?
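As a sketch of what that looked like under the original format in this PR (nthr, the number of threads, is one example rctl(8) resource; values arbitrary):

{
  "resource": "nthr",
  "action": "deny",
  "amount": "256",
  "per": "process"
}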

@cyrilzhangfreebsd
Author

Hey everyone, I've added a second commit that uses the new format, so we can discuss if we like how this works out. Going this route, we'd have to separately discuss which resource limits make sense to add. For now I've additionally selected shared memory objects, CPU usage, and process count.

@tianon

tianon commented Jul 6, 2021

I don't mind at all -- I'm honestly very much on the fence here. Linux support in the spec takes a very intentionally "raw" approach to make sure we're not limiting what's possible without spec changes too much, but it really punts that complexity upstream instead. This seems like a really powerful low-level interface, but I have to imagine this structure (and the associated "FreeBSD version to correct controls" translation table) is inevitably going to live somewhere and I guess it makes as much sense for it to live at this level as it does anywhere else.

The only argument I have for pushing this all the way up to the users instead is that if some future version of FreeBSD happens to change one of these, users can "self remediate" by updating their deployments vs updating installed components to get the new translation tables automatically. As long as the tables get updated in a timely manner during the prep for the new release such that the new release's runj package already has them, I think the latter is very attractive (and then requires no changes from the users -- things "just work").

So the TLDR of that braindump is that I'm still a bit unsure, but leaning more towards the struct.

If there are use cases that aren't covered by the struct, it might make sense to add an escape hatch for users to specify them manually, but that would also end up discouraging updates to the struct to make those more "officially" supported in the same way.
