Notebooks and scripts for vorticity-strain-divergence histogram calculations on LLC4320 data.
LLC4320 is a very high-resolution simulation of global ocean circulation, run by Dimitris Menemenlis et al. on the NASA Pleiades supercomputer using the MIT general circulation model (MITgcm).
See these slides for more details about the simulation dataset.
Vorticity-strain-divergence histograms are a way of diagnosing general features of turbulent ocean flow. In particular, Balwada et al. (2021) showed that they can give information about ocean ventilation.
They generally look like this:
We aim to compute the full vorticity-strain-divergence histogram for all locations and all times in the LLC4320 dataset. We will divide the global ocean into many small regions (similar to the approach used for spectral analysis in Torres et al. 2018) and create one histogram for each region, averaged over an intermediate time period.
Because this analysis requires processing 16 terabytes of velocity data (and that's just for the surface - the full dataset with all vertical depth levels would be petabytes!), it is a significant computational challenge. Doing this in Python with the Pangeo tools would be a powerful demonstration of their utility on a real science problem, which is why we have been referring to this analysis task as the "hero calculation".
For this calculation we have to perform several steps:
1. Compute the vorticity, strain, and divergence from the x- and y-components of the flow velocity, correctly accounting for the fact that the variables lie on an Arakawa grid, staggered relative to one another;
2. Split the resulting spatial fields up into a number of smaller regions;
3. Compute the 3-dimensional vorticity-strain-divergence histogram for each region, without also counting over the time dimension;
4. Average the resulting histograms over some time period (e.g. a few days);
5. Save the new regional histogram dataset to an intermediate Zarr store;
6. Extract useful information from the resulting histograms.
For (1) we used xGCM - see also these AMS talk slides. To compute arbitrary operations like vorticity we refactored xGCM to be able to apply "grid ufuncs". The documentation on grid ufuncs is here, but a blog post is also planned.
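As a rough sketch, the vorticity and divergence calculations with xGCM look something like the following. The dataset path, dimension names, and grid metric names (dxC, dyC, dxG, dyG, rA, rAz) are assumptions based on standard MITgcm conventions, not the exact code we ran; the grid ufunc machinery generalises grid.diff-style operations like these.

```python
import xarray as xr
from xgcm import Grid

# Open the LLC4320 surface velocities (path and variable names are illustrative)
ds = xr.open_zarr("gs://<llc4320-bucket>/surface_velocities.zarr")

# Describe the staggered Arakawa C-grid so xGCM knows where each variable lives
grid = Grid(
    ds,
    coords={
        "X": {"center": "i", "left": "i_g"},
        "Y": {"center": "j", "left": "j_g"},
    },
)

# Relative vorticity zeta = dv/dx - du/dy, evaluated on the corner points,
# following the standard MITgcm finite-volume discretisation
zeta = (grid.diff(ds.V * ds.dyC, "X") - grid.diff(ds.U * ds.dxC, "Y")) / ds.rAz

# Horizontal divergence lives on the cell centres
div = (grid.diff(ds.U * ds.dyG, "X") + grid.diff(ds.V * ds.dxG, "Y")) / ds.rA

# The normal and shear strain components are built from analogous difference
# terms, with strain magnitude = sqrt(normal**2 + shear**2)
```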
For (3) we used xhistogram - see also this slide (and the next) from my SciPy 2022 talk.
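In sketch form, this is the kind of call involved (the bin edges, dimension names, and the assumption that all three fields have been interpolated onto common grid points are illustrative):

```python
import numpy as np
from xhistogram.xarray import histogram

# Bin edges are illustrative; in practice they cover the expected dynamic
# range of each field (often nondimensionalised by the Coriolis parameter f)
vort_bins = np.linspace(-2, 2, 101)
strain_bins = np.linspace(0, 2, 101)
div_bins = np.linspace(-2, 2, 101)

# Joint 3D histogram over the horizontal dims only, so time is preserved;
# assumes zeta, strain and div share the dims ("time", "j", "i")
hist = histogram(
    zeta, strain, div,
    bins=[vort_bins, strain_bins, div_bins],
    dim=["j", "i"],
)
```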
For (2) and (4) we use xarray's built-in .coarsen feature.
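Roughly like this (the window sizes are illustrative, and must divide the domain and time axis evenly):

```python
# (2) Split each 2D field into regional blocks of 128 x 128 fine-grid points;
# .construct reshapes each horizontal dim into (region, within-region) pairs
zeta_regions = zeta.coarsen(i=128, j=128).construct(
    i=("i_region", "i_fine"),
    j=("j_region", "j_fine"),
)
# ...then histogram over ("j_fine", "i_fine") to get one histogram per region...

# (4) Average the per-region histograms over blocks of 24 hourly snapshots
hist_mean = hist.coarsen(time=24).mean()
```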
All these steps have to be done on a very large amount of data, which we handled using Dask. We found that Dask's distributed scheduler needed to be improved before it could handle this workload - see this Pangeo blog post and this Coiled blog post.
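On a Pangeo-style hub this kind of workload typically runs on a Dask Gateway cluster, along these lines (the worker count is illustrative, and this is not necessarily the exact setup we used):

```python
from dask_gateway import Gateway

# Launch a cluster of Dask workers and connect a client to it
gateway = Gateway()
cluster = gateway.new_cluster()
cluster.scale(100)  # number of workers is illustrative
client = cluster.get_client()
```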
For (5) you can see the code used for the computation in the compute directory, specifically this notebook. The actual dataset was written out to a Zarr store in this bucket.
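Writing the result out is essentially a one-liner in xarray; a sketch, with a placeholder bucket URL rather than the actual store:

```python
# Persist the time-averaged regional histograms to an intermediate Zarr store
# (the bucket URL and variable name are placeholders; writing to gs:// assumes
# gcsfs is installed)
hist_mean.to_dataset(name="vsd_histogram").to_zarr(
    "gs://<your-bucket>/vorticity_strain_divergence_histograms.zarr",
    consolidated=True,
)
```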
We are in the process of doing (6). See notebooks in this repository in the analysis directory.
The output is currently available as a Zarr store in a persistent Google Cloud Storage bucket on the LEAP-Pangeo hub, but we plan to move the histogram data to a permanent public bucket for other researchers to use.