qmemman: refresh domain list directly instead of scheduling
... for the next watcher loop iteration.

If two VMs are started in parallel, there may be no watcher loop
iteration between handling their requests. This means the memory request
for the second VM will operate on an outdated list of VMs and may not
account for some allocations (assuming memory is free while in fact it is
already allocated to another VM). If that happens, the second VM may fail
to start due to an out-of-memory error.

This is a problem very similar to the one described in
QubesOS/qubes-issues#1389, but it affects the actual VM startup, not its
auxiliary processes.

Fixes QubesOS/qubes-issues#9431
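To make the timing concrete, below is a small self-contained sketch (hypothetical names, not actual qmemman code) of the failure mode: a decision taken against a periodically refreshed snapshot can miss an allocation made since the last refresh, while refreshing the snapshot directly in the request path always sees the current state.

# Toy illustration of the race described above (hypothetical names, not
# qmemman code).
import threading


class ToyBalancer:
    def __init__(self, total_memory):
        self.total_memory = total_memory
        self.live_allocations = {}    # authoritative state (like XenStore)
        self.known_allocations = {}   # snapshot normally updated by a watcher loop
        self.lock = threading.Lock()

    def refresh_domain_list(self):
        self.known_allocations = dict(self.live_allocations)

    def handle_memory_request(self, vm, amount, refresh_first):
        with self.lock:
            if refresh_first:
                # the fix: refresh before deciding, instead of waiting for
                # the next watcher loop iteration
                self.refresh_domain_list()
            free = self.total_memory - sum(self.known_allocations.values())
            if free < amount:
                raise MemoryError(f"{vm}: only {free} MiB free")
            self.live_allocations[vm] = amount


balancer = ToyBalancer(total_memory=1000)
balancer.handle_memory_request("vm-a", 600, refresh_first=False)
# Without the refresh, vm-b is granted memory that vm-a already holds:
balancer.handle_memory_request("vm-b", 600, refresh_first=False)   # silently overcommits
# With the refresh, the stale snapshot is updated first and the conflict is seen:
# balancer.handle_memory_request("vm-b", 600, refresh_first=True)  # raises MemoryError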
marmarek committed Oct 22, 2024
1 parent 7eea85e commit 20d32f7
Showing 1 changed file with 15 additions and 17 deletions.
qubes/tools/qmemmand.py: 15 additions & 17 deletions
@@ -20,6 +20,7 @@
 # pylint: disable=global-statement

 import configparser
+import functools
 import socketserver
 import logging
 import logging.handlers
@@ -41,15 +42,6 @@

 system_state = qubes.qmemman.systemstate.SystemState()
 global_lock = threading.Lock()
-# If XSWatcher will
-# handle meminfo event before @introduceDomain, it will use
-# incomplete domain list for that and may redistribute memory
-# allocated to some VM, but not yet used (see #1389).
-# To fix that, system_state should be updated (refresh domain
-# list) before processing other changes, every time some process requested
-# memory for a new VM, before releasing the lock. Then XS_Watcher will check
-# this flag before processing other event.
-force_refresh_domain_list = False

 def only_in_first_list(list1, list2):
     ret = []
@@ -153,10 +145,6 @@ def meminfo_changed(self, domain_id):
         with global_lock:
             self.log.debug('global_lock acquired')
             try:
-                global force_refresh_domain_list
-                if force_refresh_domain_list:
-                    self.domain_list_changed(refresh_only=True)
-                    force_refresh_domain_list = False
                 if domain_id not in self.watch_token_dict:
                     # domain just destroyed
                     return
@@ -184,6 +172,10 @@ class QMemmanReqHandler(socketserver.BaseRequestHandler):
     client.
     """

+    def __init__(self, watcher: XSWatcher, request, client_address, server):
+        super().__init__(request, client_address, server)
+        self.watcher = watcher
+
     def handle(self):
         self.log = logging.getLogger('qmemman.daemon.reqhandler')

@@ -196,8 +188,10 @@ def handle(self):
             if len(self.data) == 0:
                 self.log.info('client disconnected, resuming membalance')
                 if got_lock:
-                    global force_refresh_domain_list
-                    force_refresh_domain_list = True
+                    # got_lock = True means some VM may have been just created
+                    # ensure next watch or client request handler will use
+                    # updated domain list
+                    self.watcher.domain_list_changed(refresh_only=True)
                 return

             # XXX something is wrong here: return without release?
@@ -290,7 +284,11 @@ def main():
     # Initialize the connection to Xen and to XenStore
     system_state.init()

-    server = socketserver.UnixStreamServer(SOCK_PATH, QMemmanReqHandler)
+    watcher = XSWatcher()
+    server = socketserver.UnixStreamServer(
+        SOCK_PATH,
+        functools.partial(QMemmanReqHandler, watcher)
+    )
     os.umask(0o077)

     # notify systemd
@@ -305,4 +303,4 @@ def main():
         sock.close()

     threading.Thread(target=server.serve_forever).start()
-    XSWatcher().watch_loop()
+    watcher.watch_loop()
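The functools.partial call in main() is the standard way to hand an extra constructor argument to a socketserver handler class, because the server instantiates the handler itself as RequestHandlerClass(request, client_address, server). Below is a minimal standalone sketch of the same pattern (hypothetical EchoHandler over TCP, not qmemman code):

# Minimal sketch of passing shared state to a socketserver handler via
# functools.partial (hypothetical EchoHandler, not qmemman code).
import functools
import socketserver


class EchoHandler(socketserver.StreamRequestHandler):
    def __init__(self, shared_state, *args, **kwargs):
        # Bind the extra argument before calling the parent constructor:
        # BaseRequestHandler.__init__() runs handle() itself, so any
        # attribute handle() relies on must already be set here.
        self.shared_state = shared_state
        super().__init__(*args, **kwargs)

    def handle(self):
        self.shared_state["connections"] += 1
        self.wfile.write(self.rfile.readline())


if __name__ == "__main__":
    state = {"connections": 0}
    # The server calls handler_factory(request, client_address, server);
    # functools.partial pre-binds the shared state as the first argument.
    handler_factory = functools.partial(EchoHandler, state)
    with socketserver.TCPServer(("127.0.0.1", 0), handler_factory) as server:
        server.serve_forever()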
