forked from lxc/cgmanager
=== Intro ===
This is a motivation, description and explanation of the cgmanager
design. The original design RFC was described here:
http://lwn.net/Articles/575672/
http://lwn.net/Articles/575683/
And much of it still holds (and is cut-pasted, though edited, here).
=== Cgmanager Design ===
One of the driving goals is to enable nested lxc as simply and safely as
possible. If this project is a success, then a large chunk of code can
be removed from lxc. I'm considering this project a part of the larger
lxc project, but given how central it is to systems management, that
doesn't mean that I'll consider anyone else's needs less important
than our own.
This document consists of two parts. The first describes how I
intend the daemon (cgmanager) to be structured and how it will
enforce the safety requirements. The second describes the commands
which clients will be able to send to the manager. The list of
controller keys which can be set is very incomplete at this point,
serving mainly to show the approach I was thinking of taking.
=== Summary ===
Each 'host' (identified by a separate instance of the linux kernel) has
exactly one running daemon to manage control groups. This daemon
answers cgroup management requests over a D-Bus socket, located at
/sys/fs/cgroup/cgmanager/sock. The /sys/fs/cgroup/cgmanager directory
can be bind-mounted into various containers, so that one daemon can support the
whole system. (Bind-mounting the directory rather than the socket itself
allows a container to proceed if the cgmanager is restarted, creating a
new socket.)
Outline:
. A single manager, cgmanager, is started on the host, very early
during boot. It has very few dependencies, and requires only
/proc, /run, and /sys to be mounted, with /etc ro. It mounts
the cgroup hierarchies in a private namespace and sets defaults for
clone_children and use_hierarchy. It opens a Unix socket at
/sys/fs/cgroup/cgmanager/sock.
. A client (requestor 'r') can make cgroup requests over
/sys/fs/cgroup/cgmanager/sock using D-Bus calls. Detailed privilege
requirements for r are listed below.
. The client request will pertain to an existing or new cgroup A. r's
privilege over the cgroup must be checked. r is said to have
privilege over A if A is owned by r's uid, or if A's owner is mapped
into r's user namespace, and r is root in that user namespace.
. The client request may pertain to a victim task v, which may be moved
to a new cgroup. In that case r's privilege over both the cgroup
and v must be checked. r is said to have privilege over v if v
is mapped in r's pid namespace, v's uid is mapped into r's user ns,
and r is root in its userns. Or if r and v have the same uid
and v is mapped in r's pid namespace.
. r's credentials will be taken from the socket's peercred, ensuring that
pid and uid are translated.
. A request to chown a cgroup requires a uid U and gid G.
. If r is in the same pid and user namespaces as the cgmanager, then
v, U and G can be passed as integer arguments over the D-Bus requests.
. If r is not in the same namespaces as the cgmanager, then v, U and G
must be passed as SCM_CREDENTIALs so that the cgmanager receives the
translated global pid/uid/gid. Since D-Bus does not support
sending SCM_CREDENTIALs as part of a D-Bus message, the D-Bus arguments
include a file descriptor. The SCM_CREDENTIALs are sent over the
file descriptor after the D-Bus transaction completes, and the final
result is sent over the same file descriptor.
. It is desirable that all transactions can be accomplished with simple
D-Bus transactions. Therefore a cgroup manager proxy (cgproxy) is
provided. This will move /sys/fs/cgroup/cgmanager to
/sys/fs/cgroup/cgmanager.lower, then serve as a proxy translating
D-Bus requests received on /sys/fs/cgroup/cgmanager/sock into
SCM-enhanced D-Bus requests on /sys/fs/cgroup/cgmanager.lower/sock.
. In plain D-Bus transactions, the requestor r's credentials are read
from the socket.
. In SCM-enhanced D-Bus transactions, the proxy p's credentials are read
from the socket. The requestor's credential is sent as an SCM_CREDENTIAL.
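The two credential paths in the outline rest on two Linux socket
mechanisms: SO_PEERCRED (credentials fixed when the socket is connected)
and SCM_CREDENTIALS (credentials sent as ancillary data and translated by
the kernel). The following Python sketch shows only those kernel
mechanisms on a socketpair. It is not the cgmanager wire protocol; the
function names are hypothetical and chosen for illustration.

```python
import os
import socket
import struct

CRED_FMT = "3i"  # layout of struct ucred: pid, uid, gid
CRED_SIZE = struct.calcsize(CRED_FMT)


def peer_credentials(sock):
    """Return (pid, uid, gid) of the peer as recorded by the kernel."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_PEERCRED, CRED_SIZE)
    return struct.unpack(CRED_FMT, raw)


def enable_passcred(sock):
    """Ask the kernel to deliver SCM_CREDENTIALS on this socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_PASSCRED, 1)


def send_credentials(sock, pid, uid, gid):
    """Send one byte carrying (pid, uid, gid) as SCM_CREDENTIALS.

    The kernel rejects values the sender is not entitled to claim.
    """
    ancillary = [(socket.SOL_SOCKET, socket.SCM_CREDENTIALS,
                  struct.pack(CRED_FMT, pid, uid, gid))]
    sock.sendmsg([b"\0"], ancillary)


def recv_credentials(sock):
    """Receive one byte and return the attached (pid, uid, gid)."""
    _data, ancdata, _flags, _addr = sock.recvmsg(
        1, socket.CMSG_SPACE(CRED_SIZE))
    for level, ctype, cdata in ancdata:
        if level == socket.SOL_SOCKET and ctype == socket.SCM_CREDENTIALS:
            return struct.unpack(CRED_FMT, cdata[:CRED_SIZE])
    raise RuntimeError("no SCM_CREDENTIALS received")


if __name__ == "__main__":
    a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    enable_passcred(b)
    print("peercred:", peer_credentials(a))
    send_credentials(a, os.getpid(), os.getuid(), os.getgid())
    print("scm cred:", recv_credentials(b))
```

When sender and receiver are in different pid or user namespaces, the
receiver sees the ids translated into its own namespaces, which is the
property the SCM-enhanced requests rely on.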
Privilege requirements by action:
* Requestor of an action (r) over a socket may only make
changes to cgroups over which it has privilege.
* Requestors may be limited to a certain #/depth of cgroups
(to limit memory usage). This is not yet implemented.
* Cgroup hierarchy is responsible for resource limits. To this end,
a request to chown cgroup A to uid U will only chown the directory
itself (allowing child cgroup creation) and the tasks and cgroup.procs
files.
* A requestor must either be uid 0 in its userns with victim mapped
into its userns, or the same uid and in same/ancestor pidns as the
victim.
* If r requests creation of cgroup '/x', /x will be interpreted
as relative to r's cgroup. r cannot make changes to cgroups not
under its own current cgroup.
* Root in the cgmanager's pid namespace may 'escape' to the cgmanager's
cgroup with a special MovePidAbs command.
* A proxy may move a task over which it has privilege to the proxy's
own cgroup. This allows the proxy to mimic the cgmanager's special
root-may-escape semantics in its own container.
* If r requests creation of cgroup '/x', it must have write access
to its own cgroup.
* If r requests setting a limit under /x, then
. either r must be root in its own userns, and UID(/x) be mapped
into its userns, or else UID(r) == UID(/x)
. /x must not be / (not strictly necessary, all users know to
ensure an extra cgroup layer above '/')
. setns(UIDNS(r)) would not work, due to in-kernel capable() checks
which won't be satisfied. Therefore we'll need to do privilege
checks ourselves, then perform the write as the host root user.
(see devices.allow/deny). Further we need to support older kernels
which don't support setns for pid.
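Because the manager has to perform these privilege checks itself, the
uid rule above can be modelled as a small pure function. The sketch below
is an illustration in Python, not the daemon's code. It assumes the
manager runs in the initial user namespace and reads r's
/proc/<pid>/uid_map, so the "outside" column of the map holds host ids.
All function names are hypothetical.

```python
def parse_id_map(text):
    """Parse uid_map/gid_map content into (inside, outside, count) tuples."""
    ranges = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3:
            ranges.append(tuple(int(f) for f in fields))
    return ranges


def host_to_ns(id_map, host_id):
    """Return the id as seen inside the namespace, or None if unmapped."""
    for inside, outside, count in id_map:
        if outside <= host_id < outside + count:
            return inside + (host_id - outside)
    return None


def has_privilege_over_cgroup(r_host_uid, r_uid_map, owner_host_uid):
    """r has privilege if it owns the cgroup, or if r is root in its
    userns and the cgroup's owner is mapped into that userns."""
    if r_host_uid == owner_host_uid:
        return True
    r_is_ns_root = host_to_ns(r_uid_map, r_host_uid) == 0
    owner_is_mapped = host_to_ns(r_uid_map, owner_host_uid) is not None
    return r_is_ns_root and owner_is_mapped


if __name__ == "__main__":
    container_map = parse_id_map("0 100000 65536\n")
    # Container root (host uid 100000) over a cgroup owned by host uid 100500.
    print(has_privilege_over_cgroup(100000, container_map, 100500))
```

The same mapping test applies to a victim task's uid when a request
moves a task.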
Types of requests:
* r requests creating cgroup A'/A
. lmctfy/cli/commands/create.cc
. Verify that UID(r) mapped to 0 in r's userns
. R=cgroup_of(r)
. Verify that UID(R) is mapped into r's userns
. Create R/A'/A
. chown R/A'/A to UID(r)
* r requests to move task x to cgroup A.
. lmctfy/cli/commands/enter.cc
. r must send PID(x) as ancillary message
. Verify that UID(r) mapped to 0 in r's userns, and UID(x) is mapped into
that userns
(is it safe to allow if UID(x) == UID(r)?)
. R=cgroup_of(r)
. Verify that R/A is owned by UID(r) or UID(x)? (not sure that's needed)
. echo PID(x) >> /R/A/tasks
* r requests chown of cgroup A to uid X
. X is passed in ancillary message
* ensures it is valid in r's userns
* maps the userid to host for us
. Verify that UID(r) mapped to 0 in r's userns
. R=cgroup_of(r)
. Chown R/A to X
* r requests setting cgroup A's 'property=value'
. Verify that either
* A != ''
* UID(r) == 0 on host
In other words, r in a userns may not set root cgroup settings.
. Verify that UID(r) mapped to 0 in r's userns
. R=cgroup_of(r)
. Set property=value for R/A
* Expect kernel to guarantee hierarchical constraints
* r requests deletion of cgroup A
. lmctfy/cli/commands/destroy.cc (without -f)
. same requirements as setting 'property=value'
* r requests purge of cgroup A
. lmctfy/cli/commands/destroy.cc (with -f)
. same requirements as setting 'property=value'
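Every request type above starts from R=cgroup_of(r) and then acts on a
path below R. A minimal Python sketch of that resolution step follows. It
assumes the cgroup v1 format of /proc/<pid>/cgroup
("id:controllers:path"), and the helper names are hypothetical rather
than taken from the cgmanager source.

```python
import posixpath


def cgroup_of(proc_cgroup_text, controller):
    """Return the cgroup path for `controller` from /proc/<pid>/cgroup text."""
    for line in proc_cgroup_text.splitlines():
        parts = line.split(":", 2)
        if len(parts) == 3 and controller in parts[1].split(","):
            return parts[2]
    return None


def resolve_under(requestor_cgroup, requested):
    """Interpret `requested` relative to the requestor's cgroup.

    A leading '/' is treated as relative to the requestor's cgroup, and
    any path that would land outside that cgroup is refused.
    """
    base = posixpath.normpath("/" + requestor_cgroup.strip("/"))
    target = posixpath.normpath(posixpath.join(base, requested.lstrip("/")))
    prefix = base.rstrip("/") + "/"
    if target != base and not target.startswith(prefix):
        raise PermissionError("%r escapes %r" % (requested, base))
    return target


if __name__ == "__main__":
    text = "4:memory:/lxc/c1\n3:cpu,cpuacct:/\n"
    r_cgroup = cgroup_of(text, "memory")
    print(resolve_under(r_cgroup, "/x"))
```

Normalizing before the prefix check is what stops a request such as
'../other' from reaching a sibling cgroup.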
Long-term we will want the cgroup manager to become more intelligent -
to place its own limits on clients, to address cpu and device hotplug,
etc. Since we will not be doing that in the first prototype, the daemon
will not keep any state about the clients.
=== Another look at the safety of requests ===
Notes:
1. In a plain D-Bus call, the proxy is the requestor.
2. If a client does an SCM call to the cgmanager socket,
then the proxy is the requestor.
3. In any call over a proxy, the proxy won't be able to
make changes outside its own cgroups. If it misbehaves,
damage is contained so it only damages itself.
4. Chained proxying is not supported. If a proxy gets a
request where proxy != requestor, the call is rejected.
5. The identity of the proxy (which may be the requestor) cannot
be forged; it is taken from the socket credential. A more
privileged user must not allow a less privileged task to
have access to the opened D-Bus socket, as the credential used
will be the one captured at the time of connect().
On newer kernels, cgmanager can tell whether a proxy or requestor
is in the same namespace as itself. On older kernels, it cannot.
. for Create, this is ok. We have the proxy's real pid and
can constrain create under its cgroup.
. for getPidCgroup, we can ensure that only results under the
parent's cgroup are returned.
We cannot ensure that results will make sense for plain
D-Bus calls, as we cannot guarantee that the proxy is in the same
ns as cgmanager. However, this is not unsafe.
When we can and do detect that p is in a different pid namespace,
then we reject the call, because the result cannot be sensible.
. for chmod: We constrain under proxy's cgroup, so this is safe.
. for chown: on older kernels we cannot guarantee that the
uid/gid make sense on the host. However:
. root on the root host - no translation necessary
. root in a non-user-ns container: no translation necessary
. root in an unprivileged container: won't have privilege
to do any chown without going through a proxy.
Therefore rejecting calls from another namespace is not
necessary. The worst it will do is to return -EPERM for calls
which root in an unprivileged container would otherwise be
allowed to make.
. movepid:
. root on root host - fine
. root in a non-user-ns container: we can only ensure that
the victim be under the proxy's cgroup. If that is the
case, then root (which is also root on the host) is allowed
to move the task.
When we can and do detect a different pid namespace, then we
reject the call because the results cannot make sense.
. MovePidAbs: On an older kernel, or if the task is in a different
namespace, then this requires a proxy. The cgmanager will only
allow escaping up to the level of the proxy.
. root on root host - allowed to escape.
. root in a non-user-ns container: allowed to escape up to the
proxy's level. If the host misconfigures the container so
that the host's proxy is in the container, then root can
escape completely.
. if root tries to mimic a proxy, then it can only escape to
the proxy's level - its own. So it cannot escape at all.
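Several of the cases above depend on whether cgmanager can tell that a
caller is in a different namespace. One way to make that comparison,
shown here as a Python sketch and not necessarily how cgmanager does it,
is to compare the /proc/<pid>/ns/<kind> links of two tasks. The function
names and the proc_root parameter are hypothetical. A result of None
stands for the case where the kernel does not expose the link or it
cannot be read, so the question cannot be answered.

```python
import os


def ns_id(pid, kind="pid", proc_root="/proc"):
    """Return the namespace link text (e.g. 'pid:[4026531836]') or None."""
    try:
        return os.readlink(os.path.join(proc_root, str(pid), "ns", kind))
    except OSError:
        # Link missing (no support for it) or not readable by us.
        return None


def same_namespace(pid_a, pid_b, kind="pid", proc_root="/proc"):
    """True or False when both links are readable, None when unknown."""
    a = ns_id(pid_a, kind, proc_root)
    b = ns_id(pid_b, kind, proc_root)
    if a is None or b is None:
        return None
    return a == b


if __name__ == "__main__":
    me = os.getpid()
    print(same_namespace(me, os.getppid(), "pid"))
    print(same_namespace(me, os.getppid(), "user"))
```

A caller of such a check would reject the request on False and fall back
to the weaker guarantees listed above on None.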