📓 Add README and example without pyqg #15

Merged 2 commits on Nov 7, 2024
35 changes: 28 additions & 7 deletions README.md
@@ -1,20 +1,41 @@
# M<sup>2</sup>LInES Equation discovery package

This package implements a method for equation discovery over 2D periodic spatial fields using interleaved linear regression and symbolic regression with spatial differential operators on residuals. This approach allows us to discover parsimonious symbolic relationships between spatial fields using high-order differential terms -- as well as weighted sums of such terms, even if they have significantly different orders of magnitude.

The library is closely integrated with [pyqg](https://github.com/pyqg/pyqg), a simulator of quasigeostrophic fluid dynamics, though the symbolic regression method [can be run independently](./notebooks/example_without_pyqg.ipynb).

## Usage

Below we show a basic example of how to run symbolic regression on a dataset, then interpret the learned expression as a parameterization within pyqg, and use it to run a parameterized fluid dynamics simulation:

```python
import eqn_disco

terms_by_iter, parameterizations_by_iter = eqn_disco.hybrid_symbolic.hybrid_symbolic_regression(
    dataset,
    max_iters=3,
    target='fancy_subgrid_forcing',
    base_features=['velocity', 'potential_vorticity'],
    base_functions=['mul', 'add'],
    spatial_functions=['ddx', 'ddy', 'laplacian', 'advected'],
    **gplearn_arguments,
)

parameterization = parameterizations_by_iter[-1]

pyqg_run = parameterization.run_online(**pyqg_arguments)
```
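The interleaved residual-fitting loop described above can be sketched in plain `numpy`. This is a hypothetical toy illustration, not the package's API: the genetic-programming search is replaced here by simply picking the candidate feature most correlated with the current residual, and `fit_residual_iterations` is an invented name.

```python
import numpy as np

def fit_residual_iterations(features, target, max_iters=2):
    """Greedy residual fitting: at each step pick the candidate feature most
    correlated with the current residual (a stand-in for the genetic-programming
    search), then refit a least-squares model over ALL selected features."""
    selected = []
    residual = target.copy()
    coefs = np.zeros(0)
    for _ in range(max_iters):
        # "symbolic" step: choose the candidate most correlated with the residual
        corrs = [abs(np.corrcoef(f, residual)[0, 1]) for f in features]
        best = int(np.argmax(corrs))
        if best not in selected:
            selected.append(best)
        # linear-regression step: jointly refit the constants of all selected terms
        design = np.stack([features[i] for i in selected], axis=1)
        coefs, *_ = np.linalg.lstsq(design, target, rcond=None)
        residual = target - design @ coefs
    return selected, coefs

# toy data: the target mixes two terms with very different coefficients
rng = np.random.default_rng(0)
f0, f1 = rng.normal(size=200), rng.normal(size=200)
features = [f0, f1]
target = 3.0 * f0 + 0.001 * f1
selected, coefs = fit_residual_iterations(features, target)
```

Even though the two coefficients differ by orders of magnitude, the loop recovers both terms, because after the first term is fit and subtracted, the residual is dominated by the remaining small term.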

For more details, see notebooks showcasing this approach on datasets [generated from pyqg](./notebooks/hybrid_symbolic.ipynb) or [constructed independently](./notebooks/example_without_pyqg.ipynb).
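The spatial functions passed to the regression above (`ddx`, `ddy`, `laplacian`) act on periodic fields, so they can be applied spectrally. Below is a minimal sketch of a spectral x-derivative with `numpy`, illustrating the idea under the periodic-boundary assumption; `ddx_spectral` is a hypothetical name, not the package's actual implementation.

```python
import numpy as np

def ddx_spectral(field, length=1.0):
    """Spectral x-derivative of a periodic 1D field sampled on [0, length)."""
    n = field.shape[-1]
    # angular wavenumbers matching numpy's FFT ordering
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)
    # differentiate by multiplying each Fourier mode by i*k
    return np.fft.ifft(1j * k * np.fft.fft(field)).real

x = np.linspace(0, 1, 64, endpoint=False)
field = np.sin(2 * np.pi * x)
deriv = ddx_spectral(field)  # approximately 2*pi*cos(2*pi*x)
```

For a pure sine resolved on a uniform periodic grid, the spectral derivative is exact up to floating-point round-off, which is why the method assumes periodic boundaries and spectral dimensions.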

To install, clone the repository, and run `pip install -e .`.

## Authors and contributors

The code in this repository was originally developed by [Andrew Ross](https://github.com/asross) and [Pavel Perezhogin](https://github.com/Pperezhogin), in [this repository](https://github.com/m2lines/pyqg_parameterization_benchmarks), for use in the publication: Benchmarking of machine learning ocean parameterizations in an idealized model (published in JAMES, <https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2022MS003258>).

It is currently being worked on by [Jim Denholm](https://github.com/jdenholm) and [Andrew Ross](https://github.com/asross), along with other members of the [ICCS](https://github.com/orgs/Cambridge-ICCS/dashboard) and [M<sup>2</sup>LInES](https://github.com/m2lines).

## License

The work is available under an MIT License; see [the license](LICENSE).
4 changes: 4 additions & 0 deletions eqn_disco/__init__.py
@@ -1 +1,5 @@
"""Equation discovery package init."""

import eqn_disco.utils
import eqn_disco.hybrid_symbolic
__all__ = ["utils", "hybrid_symbolic"]
286 changes: 286 additions & 0 deletions notebooks/example_without_pyqg.ipynb
@@ -0,0 +1,286 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "ba5de6d9",
"metadata": {},
"source": [
"# Example of hybrid symbolic regression without `pyqg`\n",
"\n",
"In this example, we'll show how to invoke `hybrid_symbolic` on something other than a dataset generated from `pyqg` runs.\n",
"\n",
"## Import dependencies"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "01305a36",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import xarray as xr\n",
"import matplotlib.pyplot as plt\n",
"import eqn_disco"
]
},
{
"cell_type": "markdown",
"id": "f9e65500",
"metadata": {},
"source": [
"## Construct the dataset\n",
"\n",
"We'll still need to generate an `xarray.Dataset`, which is going to need to have a specific set of dimensions:\n",
"- `x`, zonal (east-west) position in real space\n",
"- `y`, meridional (north-south) position in real space\n",
"- `k`, zonal position in spectral space (we assume periodic boundary conditions<sup>1</sup>)\n",
"- `l`, meridional position in spectral space\n",
"- `lev` (optional), vertical position in real space, ideally only over a small number of values\n",
"- `batch` (can have any name), index of the spatial data sample in the dataset\n",
"\n",
"Data instance shapes must be `(batch, lev, y, x)` or `(batch, y, x)`.\n",
"\n",
"Below is an example of how to initialize such a dataset.\n",
"\n",
"\n",
"<sup><sup>1</sup>Note that in the future, we hope to extend this library to drop the requirement for periodic boundaries and spectral dimensions, using finite differences to take derivatives instead.</sup>"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "91b3453b",
"metadata": {},
"outputs": [],
"source": [
"grid_length = 32\n",
"num_samples = 100\n",
"num_zlayers = 1\n",
"\n",
"grid = np.linspace(0, 1, grid_length)\n",
"inputs = 0.01 * np.random.normal(size=(num_samples, num_zlayers, grid_length, grid_length))\n",
"\n",
"zonal_wavenumbers = np.arange(0.0, grid_length / 2 + 1)\n",
"meridional_wavenumbers = np.append(np.arange(0.0, grid_length / 2), np.arange(-grid_length / 2, 0.0))\n",
"\n",
"data_set = xr.Dataset(\n",
" data_vars={\n",
" \"inputs\": ((\"batch\", \"lev\", \"y\", \"x\"), inputs),\n",
" },\n",
" coords={\n",
" \"lev\": np.arange(num_zlayers),\n",
" \"x\": grid,\n",
" \"y\": grid,\n",
" \"l\": meridional_wavenumbers * 2 * np.pi,\n",
" \"k\": zonal_wavenumbers * 2 * np.pi,\n",
" \"batch\": np.arange(num_samples),\n",
" },\n",
")"
]
},
{
"cell_type": "markdown",
"id": "5db2f3b8",
"metadata": {},
"source": [
"## Set up the prediction problem\n",
"\n",
"As a demonstration, we're going to set up a problem where the $\\mathtt{target}$ expression we're trying to predict consists of two additive terms, one proportional to $\\partial_x \\nabla^2 [\\mathtt{inputs}]$ and the other proportional to $\\mathtt{inputs}^2$.\n",
"\n",
"Because of the relative magnitudes of $\\mathtt{inputs}$ vs. the grid spacing, these additive terms have extremely different orders of magnitude ($\\mathtt{inputs}^2$ is much smaller), so to create an expression that meaningfully includes both of them, we'll need to scale it up quite a bit:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "fbb4efd4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"mul(mul(inputs,inputs),10000000)\n",
"\tstddev = 1416.7524144874462\n",
"ddx(laplacian(inputs))\n",
"\tstddev = 5413.163709940415\n"
]
}
],
"source": [
"additive_features = ['mul(mul(inputs,inputs),10000000)', 'ddx(laplacian(inputs))']\n",
"true_expr = f'add({\",\".join(additive_features)})'\n",
"extractor = eqn_disco.utils.FeatureExtractor(data_set)\n",
"data_set['target'] = extractor.extract_feature(true_expr)\n",
"\n",
"for feat in additive_features:\n",
" print(f\"{feat}\\n\\tstddev = {extractor.extract_feature(feat).std().data}\")"
]
},
{
"cell_type": "markdown",
"id": "322ca6e7",
"metadata": {},
"source": [
"Although our $\\mathtt{target}$ expression is a sum of two relatively simple terms (which you'd think would be easy for any symbolic regression method to learn), there are two difficulties:\n",
"1. It contains spatial differential operators, which most symbolic regression libraries can't model.\n",
"1. It is a very unequally weighted sum, and many symbolic regression libraries have trouble learning very large or very small constants.\n",
"\n",
"Our approach should be able to handle these difficulties.\n",
"\n",
"## Run hybrid symbolic regression\n",
"\n",
"Our approach to symbolic regression is a hybrid between genetic programming and linear regression, interleaved in sequential steps:\n",
"1. We run genetic programming to find a symbolic term which is correlated with the target, ignoring the exact proportionality constant.\n",
"1. We use linear regression to fit that constant.\n",
"1. We subtract our best-fit prediction from the target, leaving behind a residual.\n",
"1. We repeat the process on the residual, refitting the constants of both terms.\n",
"\n",
"Let's try that on this dataset:"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "56119963",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" | Population Average | Best Individual |\n",
"---- ------------------------- ------------------------------------------ ----------\n",
" Gen Length Fitness Length Fitness OOB Fitness Time Left\n",
" 0 8.82 0.0635013 3 0.967759 0.964835 116.11m\n",
"Iteration 1\n",
"Discovered terms: ['ddx(laplacian(inputs))']\n",
"Train correlations: [0.9674676]\n",
" | Population Average | Best Individual |\n",
"---- ------------------------- ------------------------------------------ ----------\n",
" Gen Length Fitness Length Fitness OOB Fitness Time Left\n",
" 0 8.84 0.0133863 7 0.999995 0.999995 115.53m\n",
"Iteration 2\n",
"Discovered terms: ['ddx(laplacian(inputs))', 'mul(mul(0.096, -1.661), mul(inputs, inputs))']\n",
"Train correlations: [1.]\n"
]
}
],
"source": [
"terms_by_iter, regressions_by_iter = eqn_disco.hybrid_symbolic.hybrid_symbolic_regression(\n",
" data_set,\n",
" target='target',\n",
" max_iters=2,\n",
" base_features=['inputs'],\n",
" base_functions=['mul'],\n",
" spatial_functions=['ddx', 'ddy', 'laplacian'],\n",
" parsimony_coefficient=0.1\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "389f1cb2",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['ddx(laplacian(inputs))', 'mul(mul(0.096, -1.661), mul(inputs, inputs))']"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"terms_by_iter"
]
},
{
"cell_type": "markdown",
"id": "636b2529",
"metadata": {},
"source": [
"From the output, we can see that in the first iteration, we discovered $\\partial_x \\nabla^2 [\\mathtt{inputs}]$, which got us to pretty high (>95%) training correlation, though not all the way to 100%. In the second iteration, we discovered a term corresponding to $\\mathtt{inputs}^2$ which got us all the way to 100%, though it included multiplication by a spurious constant (symbolic regression is a bit random and noisy).\n",
"\n",
"Let's look at the coefficients:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "bab56a64",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([ 1.0000000e+00, -6.2713225e+07])"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regressions_by_iter[-1].models[0].coef_"
]
},
{
"cell_type": "markdown",
"id": "d9cc05eb",
"metadata": {},
"source": [
"The first coefficient is exactly 1, and the second becomes 1 if we rescale it to remove the effects of the spurious constants, and divide by the large factor we included earlier:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "e4fcf64f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.999999999999998"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"regressions_by_iter[-1].models[0].coef_[1] * (0.096 * -1.661) / 10000000"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}