Huge RAM usage when using multiprocessing - PyCall and OGGM suspected #96
How are you doing this? Opening files does not lead to memory usage unless they are compressed, since xarray reads only the selected data. In OGGM, however, we process the data independently for each gdir and add the time series as a single-point netCDF to the gdir.
OK, I haven't properly benchmarked this, but I have an intuition about where this comes from. Since we are working in 2D, I'm retrieving the climate data in a very similar fashion to what you do in the MB_Sandbox, but I am downscaling the temperature to the DEM of the glacier with the lapse rate from W5E5. This means I have thousands more data points compared to a flowline approach. I'm then putting everything together again in an xarray. @fmaussion I'm sure you have thought about these issues. I remember thinking my implementation was pretty naive, but since we are still not modelling MB per se, I didn't care much at the time. We'd need to come up with a way to make this both memory efficient and fast.
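The downscaling step described here can be sketched with numpy broadcasting; the function name, signature, and units below are hypothetical illustrations, not ODINN's actual code:

```python
import numpy as np

def downscale_temp(t_ref, z_ref, dem, lapse_rate):
    """Shift a reference temperature series onto a 2D DEM.

    t_ref:      (n_t,) temperatures at reference elevation z_ref [degC]
    dem:        (ny, nx) glacier surface elevation [m]
    lapse_rate: scalar or (n_t,) lapse rate [degC per m], negative aloft
    Returns a (n_t, ny, nx) array: one 2D temperature field per time step.
    """
    t_ref = np.asarray(t_ref, dtype=float)[:, None, None]
    lapse = np.broadcast_to(np.asarray(lapse_rate, dtype=float),
                            (t_ref.shape[0],))[:, None, None]
    return t_ref + lapse * (np.asarray(dem, dtype=float) - z_ref)
```

Note that the output is `n_pixels` times larger than the input series, which is where the 2D approach multiplies memory use compared to a flowline model.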
The trivial fix that comes to mind is to not store climate data on a 2D grid but to compute it on the fly (as OGGM does). Yes, in 2D you have many more grid points, but the actual info you need for long-term storage is still one single climate time series, and from this you can recompute everything. For a model run, you compute what you need for the time step you need (in OGGM we do 12 months or 365 days to cover 1 year), and store only the yearly or monthly diagnostic output (e.g. mass balance). If you don't want to keep all time slices in memory, you can write each time step to a pre-formatted netCDF file, i.e. dumping memory to disk when possible. It would be interesting to know how the ice sheet people deal with that.
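A minimal sketch of this pattern, assuming made-up sizes and using a numpy memmap as a dependency-free stand-in for a pre-formatted netCDF file (the "diagnostic" computed per step is a placeholder, not OGGM's actual MB code):

```python
import numpy as np
import os
import tempfile

# Hypothetical sizes: 10 years of monthly steps on a small 2D grid.
n_steps, ny, nx = 120, 50, 40
rng = np.random.default_rng(0)
temp_series = rng.normal(0.0, 5.0, n_steps)   # the one stored 1D climate series
dem = rng.uniform(2500.0, 3500.0, (ny, nx))   # hypothetical glacier DEM

# Pre-format an on-disk output array and fill it one time step at a
# time, so only a single 2D slice lives in RAM at any moment.
path = os.path.join(tempfile.mkdtemp(), "mb_diag.npy")
out = np.lib.format.open_memmap(path, mode="w+", dtype="f4",
                                shape=(n_steps, ny, nx))
for t in range(n_steps):
    # Recompute the 2D field for this step from the 1D series + DEM
    # instead of keeping a full 2D climate cube in memory.
    temp_2d = temp_series[t] - 0.0065 * (dem - dem.mean())
    out[t] = np.maximum(temp_2d, 0.0)  # placeholder "diagnostic" output
out.flush()
```

With a real netCDF file the same loop would write one time slice per step (e.g. via netCDF4's unlimited dimension), with identical memory behaviour.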
If I only compute the MB for the timestep of interest (e.g. for a given month), this would indeed reduce the amount of data loaded, even when it's parallelized for many glaciers. However, I'm afraid the data manipulation might be quite computationally expensive; I'll have to benchmark this. Is this what you are doing in OGGM? Do you retrieve and downscale the climate series for a given glacier only for the time interval of interest? I know that the bottleneck in simulations is often file read/write, so I'm wary of doing this. Just to be clear, by memory usage I mean RAM; there's no problem in reading and writing big files on disk right now. This also makes me think that there is either a ton of redundancy or something wrong with how I computed the 2D climate data: 18 glaciers and their 2D monthly climate series for 50 years take up to 140GB of RAM, which is not normal 🥲
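For scale, a back-of-the-envelope estimate supports the redundancy suspicion. The grid size and dtype below are assumptions (not stated in the thread), but even so the raw data should be orders of magnitude smaller than 140GB:

```python
# Rough RAM estimate for the stored 2D climate cubes.
n_glaciers = 18
n_steps = 50 * 12        # 50 years of monthly data
n_pixels = 200 * 200     # assumed 2D grid per glacier (hypothetical)
bytes_per_value = 8      # assumed float64
total_gb = n_glaciers * n_steps * n_pixels * bytes_per_value / 1e9
print(f"expected ~{total_gb:.1f} GB")  # vs ~140 GB observed
```

Even with several variables or a larger grid, the gap to 140GB points at duplicated copies (or leaked process memory) rather than the data itself.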
We retrieve the entire climate data (this is only timeseries) for the entire period of available climate, but we compute distributed MB only when asked to (usually for one year). We store the distributed MB only on demand (e.g. when people want to do hydrology).
Yes. We read the climate data only once, and write only once. If memory becomes an issue I think that dumping to disk every now and then should be fine though.
You should be able to compute the size of your stored arrays and estimate the RAM they would need. One trick to save memory would be to store 1D arrays of glacier pixels only, instead of the 2D ones which have loads of NaNs in them.
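The NaN-trimming trick can be sketched with a boolean mask; the grid size and glacier fraction below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(42)
ny, nx, n_t = 200, 200, 600                 # hypothetical grid, monthly steps
glacier_mask = rng.random((ny, nx)) < 0.1   # ~10% of the grid is glacier

full_2d = np.full((n_t, ny, nx), np.nan, dtype="f8")
full_2d[:, glacier_mask] = 1.0              # dummy data on glacier pixels

# Keep only the glacier pixels: shape (n_t, n_glacier_pixels).
compact_1d = full_2d[:, glacier_mask]

# The 2D field is recoverable at any time from the mask + compact array.
restored = np.full_like(full_2d, np.nan)
restored[:, glacier_mask] = compact_1d

print(f"{full_2d.nbytes / 1e6:.0f} MB -> {compact_1d.nbytes / 1e6:.0f} MB")
```

The saving is proportional to the fraction of non-glacier pixels, so for small glaciers on large grids it is close to the whole array.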
Some progress on this side. I implemented the retrieval, processing and computation of the MB in the way mentioned above. I have a raw climate file in .nc in the folder of each gdir, which is loaded into memory at the beginning of a simulation. Then, whenever the solver is interrupted with a callback at a given timestep for the computation of the mass balance (e.g. monthly), the climate is downscaled to the current surface elevation of the glacier in 2D, and the MB is then computed with a given MB model. However, this didn't seem to fix the memory issue. After quite a lot of debugging, I've isolated the issue to OGGM's multiprocessing. If I stop using multiprocessing for OGGM (even if I use it for ODINN), the memory goes down to reasonable levels. I think the issue must be related to the inability of PyCall.jl to correctly call the Python garbage collector. Moreover, I've noticed by checking all the processes using
Thanks, and sorry this is causing you such trouble.
This is how multiprocessing works in Python: we spawn N processes for N glaciers, but once a process is done it closes... It sounds like a PyCall thing, and that's bad, but the multiprocessing part of OGGM is extremely trivial; it should be pretty straightforward to port it to Julia.
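The pool-per-glacier pattern being described is essentially a plain pool map; a minimal sketch (`process_gdir` is a hypothetical placeholder for an entity task, and the `fork` start method is assumed, i.e. a Unix host):

```python
import multiprocessing as mp

def process_gdir(gdir_id):
    # Hypothetical stand-in for an OGGM entity task on one glacier dir.
    return gdir_id, len(gdir_id)

def run_all(gdir_ids, n_workers=4):
    # One task per glacier; worker processes exit and are reaped when
    # the context manager closes the pool, releasing their memory.
    with mp.get_context("fork").Pool(n_workers) as pool:
        return dict(pool.map(process_gdir, gdir_ids))

if __name__ == "__main__":
    print(run_all(["RGI60-11.00897", "RGI60-11.00892"]))
```

If the workers are torn down like this, their memory should be returned to the OS; memory that keeps growing suggests the parent (or the PyCall bridge) is holding references alive.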
Yeah, I just need to figure out how to properly close "old" Python processes from Julia. For now I'll stay off multiprocessing with OGGM; it's a small price to pay for reasonable memory usage. I'll update this once I have more information.

Temporarily fixed with #105
I suspect the huge memory usage right now comes from xarray and the opening of huge .nc files with the climate data for each gdir. We should find a way to make this more memory efficient and avoid having too much stuff open at the same time. This might be worth a discussion with @fmaussion to see how it's done in OGGM.