📓 Add README and example without pyqg #15

Merged 2 commits on Nov 7, 2024
35 changes: 28 additions & 7 deletions README.md
@@ -1,20 +1,41 @@
# M<sup>2</sup>LInES Equation discovery package

This package implements a method for equation discovery over 2D periodic spatial fields using interleaved linear regression and symbolic regression with spatial differential operators on residuals. This approach allows us to discover parsimonious symbolic relationships between spatial fields using high-order differential terms -- as well as weighted sums of such terms, even if they have significantly different orders of magnitude.

The library is closely integrated with [pyqg](https://github.com/pyqg/pyqg), a simulator of quasigeostrophic fluid dynamics, though the symbolic regression method [can be run independently](./notebooks/example_without_pyqg.ipynb).

## Usage

Below we show a basic example of how to run symbolic regression on a dataset, then interpret the learned expression as a parameterization within pyqg, and use it to run a parameterized fluid dynamics simulation:

```python
import eqn_disco

terms_by_iter, parameterizations_by_iter = eqn_disco.hybrid_symbolic.hybrid_symbolic_regression(
    dataset,
    max_iters=3,
    target='fancy_subgrid_forcing',
    base_features=['velocity', 'potential_vorticity'],
    base_functions=['mul', 'add'],
    spatial_functions=['ddx', 'ddy', 'laplacian', 'advected'],
    **gplearn_arguments,
)

parameterization = parameterizations_by_iter[-1]

pyqg_run = parameterization.run_online(**pyqg_arguments)
```
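The interleaved residual-fitting loop described above can be sketched in plain `numpy`. This is a hypothetical toy illustration, not the package's API: the genetic-programming search is replaced here by simply picking the candidate feature most correlated with the current residual, and `fit_residual_iterations` is an invented name.

```python
import numpy as np

def fit_residual_iterations(features, target, max_iters=2):
    """Greedy residual fitting: at each step pick the candidate feature most
    correlated with the current residual (a stand-in for the genetic-programming
    search), then refit a least-squares model over ALL selected features."""
    selected = []
    residual = target.copy()
    coefs = np.zeros(0)
    for _ in range(max_iters):
        # "symbolic" step: choose the candidate most correlated with the residual
        corrs = [abs(np.corrcoef(f, residual)[0, 1]) for f in features]
        best = int(np.argmax(corrs))
        if best not in selected:
            selected.append(best)
        # linear-regression step: jointly refit the constants of all selected terms
        design = np.stack([features[i] for i in selected], axis=1)
        coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coefs
    return selected, coefs

# toy data: the target mixes two terms with very different coefficients
rng = np.random.default_rng(0)
f0, f1 = rng.normal(size=200), rng.normal(size=200)
features = [f0, f1]
target = 3.0 * f0 + 0.001 * f1
selected, coefs = fit_residual_iterations(features, target)
```

Even though the two coefficients differ by orders of magnitude, the loop recovers both terms, because after the first term is fit and subtracted, the residual is dominated by the remaining small term.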

For more details, see notebooks showcasing this approach on datasets [generated from pyqg](./notebooks/hybrid_symbolic.ipynb) or [constructed independently](./notebooks/example_without_pyqg.ipynb).
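The spatial functions passed to the regression above (`ddx`, `ddy`, `laplacian`) act on periodic fields, so they can be applied spectrally. Below is a minimal sketch of a spectral x-derivative with `numpy`, illustrating the idea under the periodic-boundary assumption; `ddx_spectral` is a hypothetical name, not the package's actual implementation.

```python
import numpy as np

def ddx_spectral(field, length=1.0):
    """Spectral x-derivative of a periodic 1D field sampled on [0, length)."""
    n = field.shape[-1]
    # angular wavenumbers matching numpy's FFT ordering
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)
    # differentiate by multiplying each Fourier mode by i*k
    return np.fft.ifft(1j * k * np.fft.fft(field)).real

x = np.linspace(0, 1, 64, endpoint=False)
field = np.sin(2 * np.pi * x)
deriv = ddx_spectral(field)  # approximately 2*pi*cos(2*pi*x)
```

For a pure sine resolved on a uniform periodic grid, the spectral derivative is exact up to floating-point round-off, which is why the method assumes periodic boundaries and spectral dimensions.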

To install, clone the repository, and run `pip install -e .`.

## Authors and contributors

The code in this repository was originally developed by [Andrew Ross](https://github.com/asross) and [Pavel Perezhogin](https://github.com/Pperezhogin), in [this repository](https://github.com/m2lines/pyqg_parameterization_benchmarks), for use in the publication: Benchmarking of machine learning ocean parameterizations in an idealized model (published in JAMES, <https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2022MS003258>).

It is currently being worked on by [Jim Denholm](https://github.com/jdenholm) and [Andrew Ross](https://github.com/asross), along with other members of the [ICCS](https://github.com/orgs/Cambridge-ICCS/dashboard) and [M<sup>2</sup>LInES](https://github.com/m2lines).

## License

The work is available under an MIT License; see [the license](LICENSE).
4 changes: 4 additions & 0 deletions eqn_disco/__init__.py
@@ -1 +1,5 @@
"""Equation discovery package init."""

import eqn_disco.utils
import eqn_disco.hybrid_symbolic
__all__ = ["utils", "hybrid_symbolic"]
286 changes: 286 additions & 0 deletions notebooks/example_without_pyqg.ipynb
@@ -0,0 +1,286 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "ba5de6d9",
"metadata": {},
"source": [
"# Example of hybrid symbolic regression without `pyqg`\n",
"\n",
"In this example, we'll show how to invoke `hybrid_symbolic` on something other than a dataset generated from `pyqg` runs.\n",
"\n",
"## Import dependencies"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "01305a36",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import xarray as xr\n",
"import matplotlib.pyplot as plt\n",
"import eqn_disco"
]
},
{
"cell_type": "markdown",
"id": "f9e65500",
"metadata": {},
"source": [
"## Construct the dataset\n",
"\n",
"We'll still need to generate an `xarray.Dataset`, which is going to need to have a specific set of dimensions:\n",
"- `x`, zonal (east-west) position in real space\n",
"- `y`, meridional (north-south) position in real space\n",
"- `k`, zonal position in spectral space (we assume periodic boundary conditions<sup>1</sup>)\n",
"- `l`, meridional position in spectral space\n",
"- `lev` (optional), vertical position in real space, ideally only over a small number of values\n",
"- `batch` (can have any name), index of the spatial data sample in the dataset\n",
"\n",
"Data instance shapes must be `(batch, lev, y, x)` or `(batch, y, x)`.\n",
"\n",
"Below is an example of how to initialize such a dataset.\n",
"\n",
"\n",
"<sup><sup>1</sup>Note that in the future, we hope to extend this library to drop the requirement for periodic boundaries and spectral dimensions, using finite differences to take derivatives instead.</sup>"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "91b3453b",
"metadata": {},
"outputs": [],
"source": [
"grid_length = 32\n",
"num_samples = 100\n",
"num_zlayers = 1\n",
"\n",
"grid = np.linspace(0, 1, grid_length)\n",
"inputs = 0.01 * np.random.normal(size=(num_samples, num_zlayers, grid_length, grid_length))\n",
"\n",
"zonal_wavenumbers = np.arange(0.0, grid_length / 2 + 1)\n",
"meridional_wavenumbers = np.append(np.arange(0.0, grid_length / 2), np.arange(-grid_length / 2, 0.0))\n",
"\n",
"data_set = xr.Dataset(\n",
" data_vars={\n",
" \"inputs\": ((\"batch\", \"lev\", \"y\", \"x\"), inputs),\n",
" },\n",
" coords={\n",
" \"lev\": np.arange(num_zlayers),\n",
" \"x\": grid,\n",
" \"y\": grid,\n",
" \"l\": meridional_wavenumbers * 2 * np.pi,\n",
" \"k\": zonal_wavenumbers * 2 * np.pi,\n",
" \"batch\": np.arange(num_samples),\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5db2f3b8",
"metadata": {},
"source": [
"## Set up the prediction problem\n",
"\n",
"As a demonstration, we're going to set up a problem where the $\\mathtt{target}$ expression we're trying to predict consists of two additive terms, one proportional to $\\partial_x \\nabla^2 [\\mathtt{inputs}]$ and the other proportional to $\\mathtt{inputs}^2$.\n",
"\n",
"Because of the relative magnitudes of $\\mathtt{inputs}$ vs. the grid spacing, these additive terms have extremely different orders of magnitude ($\\mathtt{inputs}^2$ is much smaller), so to create an expression that meaningfully includes both of them, we'll need to scale it up quite a bit:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "fbb4efd4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"mul(mul(inputs,inputs),10000000)\n",
"\tstddev = 1416.7524144874462\n",
"ddx(laplacian(inputs))\n",
"\tstddev = 5413.163709940415\n"
]
}
],
"source": [
"additive_features = ['mul(mul(inputs,inputs),10000000)', 'ddx(laplacian(inputs))']\n",
"true_expr = f'add({\",\".join(additive_features)})'\n",
"extractor = eqn_disco.utils.FeatureExtractor(data_set)\n",
"data_set['target'] = extractor.extract_feature(true_expr)\n",
"\n",
"for feat in additive_features:\n",
" print(f\"{feat}\\n\\tstddev = {extractor.extract_feature(feat).std().data}\")"
]
},
{
"cell_type": "markdown",
"id": "322ca6e7",
"metadata": {},
"source": [
"Although our $\\mathtt{target}$ expression is a sum of two relatively simple terms (which you'd think would be easy for any symbolic regression method to learn), there are two difficulties:\n",
"1. It contains spatial differential operators, which most symbolic regression libraries can't model.\n",
"1. It is a very unequally weighted sum, and many symbolic regression libraries have trouble learning very large or very small constants.\n",
"\n",
"Our approach should be able to handle these difficulties.\n",
"\n",
"## Run hybrid symbolic regression\n",
"\n",
"Our approach to symbolic regression is a hybrid between genetic programming and linear regression, interleaved in sequential steps:\n",
"1. We run genetic programming to find a symbolic term which is correlated with the target, ignoring the exact proportionality constant.\n",
"1. We use linear regression to fit that constant.\n",
"1. We subtract our best-fit prediction from the target, leaving behind a residual.\n",
"1. We repeat the process on the residual, refitting the constants of both terms.\n",
"\n",
"Let's try that on this dataset:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "56119963",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" | Population Average | Best Individual |\n",
"---- ------------------------- ------------------------------------------ ----------\n",
" Gen Length Fitness Length Fitness OOB Fitness Time Left\n",
" 0 8.82 0.0635013 3 0.967759 0.964835 116.11m\n",
"Iteration 1\n",
"Discovered terms: ['ddx(laplacian(inputs))']\n",
"Train correlations: [0.9674676]\n",
" | Population Average | Best Individual |\n",
"---- ------------------------- ------------------------------------------ ----------\n",
" Gen Length Fitness Length Fitness OOB Fitness Time Left\n",
" 0 8.84 0.0133863 7 0.999995 0.999995 115.53m\n",
"Iteration 2\n",
"Discovered terms: ['ddx(laplacian(inputs))', 'mul(mul(0.096, -1.661), mul(inputs, inputs))']\n",
"Train correlations: [1.]\n"
]
}
],
"source": [
"terms_by_iter, regressions_by_iter = eqn_disco.hybrid_symbolic.hybrid_symbolic_regression(\n",
" data_set,\n",
" target='target',\n",
" max_iters=2,\n",
" base_features=['inputs'],\n",
" base_functions=['mul'],\n",
" spatial_functions=['ddx', 'ddy', 'laplacian'],\n",
" parsimony_coefficient=0.1\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "389f1cb2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['ddx(laplacian(inputs))', 'mul(mul(0.096, -1.661), mul(inputs, inputs))']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"terms_by_iter"
]
},
{
"cell_type": "markdown",
"id": "636b2529",
"metadata": {},
"source": [
"From the output, we can see that in the first iteration, we discovered $\\partial_x \\nabla^2 [\\mathtt{inputs}]$, which got us to pretty high (>95%) training correlation, though not all the way to 100%. In the second iteration, we discovered a term corresponding to $\\mathtt{inputs}^2$ which got us all the way to 100%, though it included multiplication by a spurious constant (symbolic regression is a bit random and noisy).\n",
"\n",
"Let's look at the coefficients:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "bab56a64",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 1.0000000e+00, -6.2713225e+07])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regressions_by_iter[-1].models[0].coef_"
]
},
{
"cell_type": "markdown",
"id": "d9cc05eb",
"metadata": {},
"source": [
"The first coefficient is exactly 1, and the second becomes 1 if we rescale it to remove the effects of the spurious constants, and divide by the large factor we included earlier:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "e4fcf64f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.999999999999998"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regressions_by_iter[-1].models[0].coef_[1] * (0.096 * -1.661) / 10000000"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}