
Profile RucioConMon memory #12089

Open · wants to merge 4 commits into base: master

Conversation

amaltaro
Contributor

Fixes #<GH_Issue_Number>

Status

<In development | not-tested | on hold | ready>

Description

Is it backward compatible (if not, which system does it affect)?

<YES | NO | MAYBE>

Related PRs

<If it's a follow up work; or porting a fix from a different branch, please mention them here.>

External dependencies / deployment changes

<Does it require deployment changes? Does it rely on third-party libraries?>

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failure
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 1 warning
    • 23 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 7 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15179/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: failed
    • 1 new failure
    • 7 tests deleted
  • Python3 Pylint check: failed
    • 5 warnings and errors that must be fixed
    • 21 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 12 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15180/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests deleted
    • 1 change in unstable tests
  • Python3 Pylint check: failed
    • 6 warnings and errors that must be fixed
    • 26 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 11 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15197/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot

Jenkins results:

  • Python3 Unit tests: succeeded
    • 7 tests deleted
  • Python3 Pylint check: failed
    • 7 warnings and errors that must be fixed
    • 21 comments to review
  • Pylint py3k check: succeeded
  • Pycodestyle check: succeeded
    • 10 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/15198/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented Sep 6, 2024

I finally managed to do memory measurements with the code currently provided in this PR. It compares the current RucioConMon implementation in the WMCore stack under 2 scenarios:

  1. fetching compressed data from RucioConMon (format=raw, yielding data line by line)
  2. fetching uncompressed data from RucioConMon (format=json, loading the whole payload into memory)

I ran these tests on vocms0259, so that I could measure memory usage on the node with the Grafana host monitor. See the screenshot below:
[Screenshot: memory_rucio_conmon]

Some observations are:

  • even though the application memory barely changes with format=raw (compressed), we can see that about 2.5GB of memory gets allocated in the "Cache" category. My guess is that it is related to the file written to the filesystem (when retrieving data through RucioConMon -> Services -> pycurl_manager)
  • the same 2.5GB "Cache" allocation happens with format=json as well, but this time the application memory footprint also jumps by almost 4GB.

As I was also running memory_profiler, here is a breakdown of `format=raw` vs `format=json`, where we can see that the application memory barely changes in the raw/compressed implementation, but grows by GBs of memory footprint in the json one (no generator).

format=raw (also faster!)

(WMAgent-2.3.4.3) [xxx@xxx:install]$ python testRucioConMonMem.py 
2024-09-06 19:49:28,773:INFO:testRucioConMonMem: Fetching unmerged dump for RSE: T1_US_FNAL_Disk with compressed data: True
2024-09-06 19:49:28,788:INFO:testRucioConMonMem: Fetching data from Rucio ConMon for RSE: T1_US_FNAL_Disk.
2024-09-06 19:49:28,802:INFO:RucioConMon: Size of rseUnmerged object: 11888

2024-09-06 20:48:44,553:INFO:testRucioConMonMem: Total files received: 10877227, unique dirs: 12885
Filename: testRucioConMonMem.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    27     36.0 MiB     36.0 MiB           1   @profile
    28                                         def getUnmergedFiles(rucioConMon, logger, compressed=False):
    29     36.0 MiB      0.0 MiB           1       dirs = set()
    30     36.0 MiB      0.0 MiB           1       counter = 0
    31     36.0 MiB      0.0 MiB           1       logger.info("Fetching data from Rucio ConMon for RSE: %s.", RSE_NAME)
    32     68.6 MiB     30.6 MiB    10877228       for lfn in rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed):
    33     68.6 MiB      2.0 MiB    10877227           dirPath = _cutPath(lfn)
    34     68.6 MiB      0.0 MiB    10877227           dirs.add(dirPath)
    35                                                 #logger.info(f"Size of dirs object: {asizeof.asizeof(dirs)}")
    36     68.6 MiB      0.0 MiB    10877227           counter += 1
    37     68.6 MiB      0.0 MiB           1       logger.info(f"Total files received: {counter}, unique dirs: {len(dirs)}")
    38     68.6 MiB      0.0 MiB           1       return dirs

2024-09-06 20:48:44,555:INFO:testRucioConMonMem: Done!

format=json:

(WMAgent-2.3.4.3) [xxx@xxx:install]$ python testRucioConMonMem.py 
2024-09-06 21:12:16,810:INFO:testRucioConMonMem: Fetching unmerged dump for RSE: T1_US_FNAL_Disk with compressed data: False
2024-09-06 21:12:16,825:INFO:testRucioConMonMem: Fetching data from Rucio ConMon for RSE: T1_US_FNAL_Disk.
2024-09-06 21:20:38,011:INFO:RucioConMon: Size of rseUnmerged object: 2812956952
2024-09-06 22:18:50,841:INFO:testRucioConMonMem: Total files received: 10877227, unique dirs: 12885
Filename: testRucioConMonMem.py

Line #    Mem usage    Increment  Occurrences   Line Contents
=============================================================
    27     35.8 MiB     35.8 MiB           1   @profile
    28                                         def getUnmergedFiles(rucioConMon, logger, compressed=False):
    29     35.8 MiB      0.0 MiB           1       dirs = set()
    30     35.8 MiB      0.0 MiB           1       counter = 0
    31     35.8 MiB      0.0 MiB           1       logger.info("Fetching data from Rucio ConMon for RSE: %s.", RSE_NAME)
    32   2946.0 MiB     24.6 MiB    10877228       for lfn in rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed):
    33   2946.0 MiB      3.0 MiB    10877227           dirPath = _cutPath(lfn)
    34   2946.0 MiB      0.2 MiB    10877227           dirs.add(dirPath)
    35                                                 #logger.info(f"Size of dirs object: {asizeof.asizeof(dirs)}")
    36   2946.0 MiB      0.0 MiB    10877227           counter += 1
    37     63.6 MiB  -2882.4 MiB           1       logger.info(f"Total files received: {counter}, unique dirs: {len(dirs)}")
    38     63.6 MiB      0.0 MiB           1       return dirs

2024-09-06 22:18:50,842:INFO:testRucioConMonMem: Done!

I will make sure these changes are reflected in #12059 and proceed with this development over there.

@vkuznet
Contributor

vkuznet commented Sep 7, 2024

Alan, it seems to me that the actual issue is the accumulation of results in this for loop: for lfn in rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed):. How about converting the code to a generator and letting the client process it? You may measure the size of the returned dirs object; it is likely the constant amount you observe in Grafana, which is unavoidable. The JSON format adds the overhead of loading the entire JSON data into RAM.
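
For illustration, a minimal sketch of the consumption pattern described above (hypothetical helper names; `asizeof` comes from the third-party pympler package, the same call that is commented out in the profile output):

```python
# Rough sketch, not the actual WMCore code: consume LFNs one at a time from a
# generator and only keep the set of unique directories, then measure its size.
from pympler import asizeof  # object-size measurement, as hinted in the profile


def processLfns(lfnGenerator, cutPathFunc, logger):
    """Consume LFNs one by one; only the set of unique directories is retained."""
    dirs = set()
    for lfn in lfnGenerator:
        dirs.add(cutPathFunc(lfn))
    logger.info("Size of dirs object: %s bytes", asizeof.asizeof(dirs))
    return dirs
```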

@amaltaro
Contributor Author

amaltaro commented Sep 8, 2024

The function rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed) already returns a generator to the client, and in this example the client is testRucioConMonMem.py (I am making a test similar to what MSUnmerged does here).
With that said, this line:

for lfn in rucioConMon.getRSEUnmerged(RSE_NAME, zipped=compressed):

is in the correct place, as this is where each lfn gets parsed and where the client decides what to do with it.
Please let me know if I misunderstood your comment though.

@vkuznet
Contributor

vkuznet commented Sep 8, 2024

Alan, the issue is in the RucioConMon and Service modules. Here is my insight into their behavior:

My proposal is to stream data from makeRequest, which should be converted to a generator; to preserve backward compatibility you would need a new API for that. This way the code does not read the full data from the remote server, but rather reads it line by line and yields it back to the client. Also, to avoid converting the data from a list to a set (a unique set of LFNs) you have two options: either enforce a uniqueness constraint on the back-end and avoid it on the client, or, if you use a cache file, let an external process handle uniqueness in the file, e.g. `cat lfns.txt | sort | uniq > newfile.txt`, and switch your cache to the new file.
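
A minimal sketch of that streaming idea (hypothetical names; it uses urllib instead of WMCore's pycurl-based machinery, purely to illustrate the generator approach):

```python
# Illustrative sketch only: a hypothetical streaming counterpart to makeRequest
# that yields the response body line by line instead of loading it all in memory.
import urllib.request


def makeRequestStream(url, headers=None):
    """Yield decoded, non-empty lines from the HTTP response one at a time."""
    req = urllib.request.Request(url, headers=headers or {})
    with urllib.request.urlopen(req) as resp:
        for rawLine in resp:
            line = rawLine.decode("utf-8").strip()
            if line:
                yield line


def getRSEUnmergedStream(baseUrl, rseName):
    """Hypothetical new API, leaving the existing getRSEUnmerged untouched."""
    url = f"{baseUrl}/files?rse={rseName}&format=raw"
    yield from makeRequestStream(url)
```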

From my observation, the current implementation of fetching data has an unavoidable memory footprint due to loading the data into Python after the HTTP call; the larger the HTTP response, the larger the memory footprint the code will have to deal with.

@amaltaro
Contributor Author

amaltaro commented Sep 9, 2024

Valentin, you seem to have captured the flow of an HTTP call through the Service parent class well.

To add to what you described above, I think the actual data is loaded into memory in the lowest-level base class (pycurl_manager), at these lines:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Services/pycurl_manager.py#L342-L343

which can then be automatically decompressed as well, if the content is in gzip format. In that scenario, the cache file in the filesystem will not be binary either, but will instead be in the content-type of the response object (json, text, etc).

To really minimize the memory footprint, we would have to stream data from server to client, fetching one row at a time. I don't know the exact details, but I guess the server would have to support newline-delimited (or similar) data streaming, and the connection between client and server would have to remain open until the client exhausts all the data in the response object.

This data streaming somewhat conflicts with the custom data caching we have implemented in the Services python package, so it needs to be carefully thought through and implemented.
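
As a hedged sketch only, one way streaming and the Services-level file cache could coexist is to "tee" each streamed line into a local cache file while still yielding it to the caller (the file path and names below are hypothetical):

```python
# Sketch: yield each line to the caller while also persisting it to a cache file,
# so a streaming client and the custom file cache do not exclude each other.
def streamAndCache(lineGenerator, cacheFile="/tmp/rse_unmerged.cache"):
    with open(cacheFile, "w", encoding="utf-8") as fd:
        for line in lineGenerator:
            fd.write(line + "\n")
            yield line
```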

@vkuznet
Contributor

vkuznet commented Sep 9, 2024

What you are looking for is the NDJSON data format, which the server must implement; basically it is a list of JSON records separated by '\n'. This way the client can request such data (it can be in zipped format as well) and read one record at a time, similar to how the CSV data format is processed. The total amount of memory required for the entire set of records is then reduced to the size of a single record.
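
A minimal sketch of consuming such an NDJSON stream (the endpoint URL is hypothetical), where memory stays bounded by the size of a single record:

```python
# Sketch only: read one JSON record per line from a (possibly gzip-compressed)
# NDJSON response, decompressing and parsing on the fly.
import gzip
import json
import urllib.request


def iterNdjson(url):
    """Yield one decoded JSON record at a time from an NDJSON stream."""
    resp = urllib.request.urlopen(url)
    stream = resp
    if resp.headers.get("Content-Encoding") == "gzip" or url.endswith(".gz"):
        stream = gzip.GzipFile(fileobj=resp)
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


# usage sketch:
# for record in iterNdjson("https://example.cern.ch/dumps/T1_US_FNAL_Disk.ndjson.gz"):
#     handleRecord(record)
```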

@cmsdmwmbot

Can one of the admins verify this patch?
