diff --git a/02-spatial-data.ipynb b/02-spatial-data.ipynb new file mode 100644 index 00000000..3a19aae4 --- /dev/null +++ b/02-spatial-data.ipynb @@ -0,0 +1,1679 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Geographic data in Python {#sec-spatial-class}\n", + "\n", + "## Introduction\n", + "\n", + "This chapter introduces key Python packages and data structures for working with the two major types of spatial data, namely:\n", + "\n", + "* **shapely** and **geopandas** --- for working with vector layers\n", + "* **rasterio** and **xarray** --- for working with rasters\n", + "\n", + "As we will see in the example code presented later in this chapter, **shapely** and **geopandas** are related:\n", + "\n", + "* **shapely** is a \"low-level\" package for working with individual vector geometry objects\n", + "* **geopandas** is a \"high-level\" package for working with geometry columns (`GeoSeries` objects), which internally contain **shapely** geometries, and vector layers (`GeoDataFrame` objects)\n", + "\n", + "The **geopandas** ecosystem provides a comprehensive approach for working with vector layers in Python, with many packages building on it.\n", + "This is not the case for raster data, however: there are several partially overlapping packages for working with raster data, each with its own advantages and disadvantages. \n", + "In this book we focus on the most prominent one:\n", + "\n", + "* **rasterio** --- a spatial-oriented package, focused on \"simple\" raster formats (such as GeoTIFF), representing a raster using a combination of a `numpy` array, and a metadata object (`dict`) specifying the spatial referencing of the array\n", + "\n", + "Another raster-related package worth mentioning is **xarray**. It is a general-purpose package for working with labeled arrays, thus advantageous for processing \"complex\" raster format (such as NetCDF), representing a raster using its own native classes, namely `xarray.Dataset` and `xarray.DataArray`\n", + "\n", + "This chapter will briefly explain the fundamental geographic data models: vector and raster. \n", + "Before demonstrating their implementation in Python, we will introduce the theory behind each data model and the disciplines in which they predominate.\n", + "\n", + "The vector data model represents the world using points, lines, and polygons. \n", + "These have discrete, well-defined borders, meaning that vector datasets usually have a high level of precision (but not necessarily accuracy).\n", + "\n", + "The raster data model divides the surface up into cells of constant size.\n", + "Raster datasets are the basis of background images used in web-mapping and have been a vital source of geographic data since the origins of aerial photography and satellite-based remote sensing devices.\n", + "Rasters aggregate spatially specific features to a given resolution, meaning that they are consistent over space and scalable (many worldwide raster datasets are available).\n", + "\n", + "Which to use? The answer likely depends on your domain of application, and the datasets you have access to:\n", + "\n", + "* Vector datasets and methods dominate the social sciences because human settlements and and processes (e.g. transport infrastructure) tend to have discrete borders\n", + "* Raster datasets and methods dominate many environmental sciences because of the reliance on remote sensing data\n", + "\n", + "There is much overlap in some fields and raster and vector datasets can be used together: ecologists and demographers, for example, commonly use both vector and raster data. \n", + "Furthermore, it is possible to convert between the two forms\n", + "\n", + "Whether your work involves more use of vector or raster datasets, it is worth understanding the underlying data models before using them, as discussed in subsequent chapters. \n", + "This book focusses on approaches that build on **geopandas** and **rasterio** packages to work with vector data and raster datasets, respectively.\n", + "\n", + "## Vector data {#sec-vector-data}\n", + "\n", + "The geographic vector data model is based on points located within a coordinate reference system (CRS). \n", + "Points can represent self-standing features (e.g., the location of a bus stop), or they can be linked together to form more complex geometries such as lines and polygons. \n", + "Most point geometries contain only two dimensions (3-dimensional CRSs contain an additional $z$ value, typically representing height above sea level).\n", + "\n", + "In this system, London, for example, can be represented by the coordinates `(-0.1, 51.5)`. \n", + "This means that its location is -0.1 degrees east and 51.5 degrees north of the origin.\n", + "The origin, in this case, is at 0 degrees longitude (the Prime Meridian) and 0 degree latitude (the Equator) in a geographic ('lon/lat') CRS.\n", + "\n", + "The same point could also be approximated in a projected CRS with 'Easting/Northing' values of `(530000, 180000)` in the British National Grid, meaning that London is located 530 $km$ East and 180 $km$ North of the origin of the CRS. \n", + "\n", + "\n", + "The location of National Grid's origin, in the sea beyond South West Peninsular, ensures that most locations in the UK have positive Easting and Northing values.\n", + "\n", + "\n", + "\n", + "**geopandas** provides classes for geographic vector data and a consistent command-line interface for reproducible geographic data analysis in Python.\n", + "**geopandas** provides an interface to three mature libraries for geocomputation which, in combination, represent a strong foundation on which many geographic applications (including QGIS and R's spatial ecosystem) builds:\n", + "\n", + "* GDAL, for reading, writing, and manipulating a wide range of geographic data formats, covered in Chapter 8\n", + "* PROJ, a powerful library for coordinate system transformations, which underlies the content covered in Chapter 7\n", + "* GEOS, a planar geometry engine for operations such as calculating buffers and centroids on data with a projected CRS, covered in Chapter 5\n", + "\n", + "Tight integration with these geographic libraries makes reproducible geocomputation possible: an advantage of using a higher level language such as Python to access these libraries is that you do not need to know the intricacies of the low level components, enabling focus on the methods rather than the implementation.\n", + "This section introduces **geopandas** classes in preparation for subsequent chapters (Chapters 5 and 8 cover the GEOS and GDAL interface, respectively).\n", + "\n", + "### Vector data classes\n", + "\n", + "The main classes for working with geographic vector data in Python are hierarchical, meaning the highest level 'vector layer' class is composed of simpler 'geometry column' and individual 'geometry' components.\n", + "This section introduces them in order, starting with the highest level class.\n", + "For many applications, the high level vector layer class, which are essentially a data frame with geometry columns, are all that's needed.\n", + "However, it's important to understand the structure of vector geographic objects and their component pieces for more advanced applications.\n", + "The three main vector geographic data classes in Python are:\n", + "\n", + "* `GeoDataFrame`, a class representing vector layers, with a geometry column (class `GeoSeries`) as one of the columns\n", + "* `GeoSeries`, a class that is used to represent the geometry column in `GeoDataFrame` objects\n", + "* `shapely` geometry objects which represent individual geometries, such as a point or a polygon\n", + "\n", + "The first two classes (`GeoDataFrame` and `GeoSeries`) are defined in **geopandas**.\n", + "The third class is defined in the **shapely** package, which deals with individual geometries, and is a main dependency of the **geopandas** package.\n", + "\n", + "### Vector layers {#sec-vector-layers}\n", + "\n", + "The most commonly used geographic vector data structure is the vector layer.\n", + "There are several approaches for working with vector layers in Python, ranging from low-level packages (e.g., **osgeo**, **fiona**) to the relatively high-level **geopandas** package that is the focus of this section.\n", + "Before writing and running code for creating and working with geographic vector objects, we therefore import **geopandas** (by convention as `gpd` for more concise code) and **shapely**:\n" + ], + "id": "1770d00b" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import matplotlib.pyplot as plt\n", + "import pandas as pd\n", + "import geopandas as gpd\n", + "import shapely.geometry\n", + "import shapely.wkt" + ], + "id": "a1025385", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We also limit the maximum number of printed rows to four, to save space, using the `'display.max_rows'` option of **pandas**:\n" + ], + "id": "06bbdbda" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "pd.set_option('display.max_rows', 4)" + ], + "id": "2489daba", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Projects often start by importing an existing vector layer saved as an ESRI Shapefile (`.shp`), a GeoPackage (`.gpkg`) file, or other geographic file format.\n", + "The function `read_file()` in the following line of code imports a GeoPackage file named `world.gpkg` located in the `data` directory of Python's working directory into a `GeoDataFrame` named `gdf`:\n" + ], + "id": "67edd0dc" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| echo: false\n", + "#| label: getdata\n", + "from pathlib import Path\n", + "import os\n", + "import shutil\n", + "data_path = Path('data')\n", + "\n", + "if data_path.is_dir():\n", + " pass\n", + " # print('path exists') # directory exists\n", + "else:\n", + " print('Attempting to get and unzip the data')\n", + " import requests, zipfile, io\n", + " r = requests.get('https://github.com/geocompx/geocompy/releases/download/0.1/data.zip')\n", + " z = zipfile.ZipFile(io.BytesIO(r.content))\n", + " z.extractall('.')\n", + "\n", + "data_path = Path('data/cycle_hire_osm.gpkg')\n", + "\n", + "if data_path.is_file():\n", + " pass\n", + " # print('path exists') # directory exists\n", + "else:\n", + " print('Attempting to move data')\n", + " r = requests.get('https://github.com/geocompx/geocompy/archive/refs/heads/main.zip')\n", + " z = zipfile.ZipFile(io.BytesIO(r.content))\n", + " z.extractall('.')\n", + " shutil.copytree('py-main/data', 'data', dirs_exist_ok=True) \n", + "\n", + "data_path = Path('output')\n", + "\n", + "if data_path.is_dir():\n", + " pass\n", + " # print('path exists') # directory exists\n", + "else:\n", + " print('Attempting to move data')\n", + " shutil.copytree('py-main/output', 'output', dirs_exist_ok=True) " + ], + "id": "getdata", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf = gpd.read_file('data/world.gpkg')" + ], + "id": "0212a37c", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As result is an object of type (class) `GeoDataFrame` with 177 rows (features) and 11 columns, as shown in the output of the following code:\n" + ], + "id": "5e954ecc" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| label: typegdf\n", + "type(gdf)\n", + "gdf.shape" + ], + "id": "typegdf", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `GeoDataFrame` class is an extension of the `DataFrame` class from the popular **pandas** package. \n", + "This means we can treat a vector layer as a table, and process it using the ordinary, i.e., non-spatial, established function methods.\n", + "For example, standard data frame subsetting methods can be used.\n", + "The code below creates a subset of the `gdf` dataset containing only the country name and the geometry:\n" + ], + "id": "d762670d" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf = gdf[['name_long', 'geometry']]\n", + "gdf" + ], + "id": "aa201eb1", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following expression creates a subset based on a condition, such as equality of the value in the `'name_long'` column to the string `'Egypt'`:\n" + ], + "id": "cad49eab" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf[gdf['name_long'] == 'Egypt']" + ], + "id": "4ea1e260", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, to get a sense of the spatial component of the vector layer, it can be plotted using the `.plot` method, as follows:\n" + ], + "id": "0905dafd" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf.plot();" + ], + "id": "fd142363", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "or using `.explore` to get an interactive plot:\n" + ], + "id": "a998abff" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf.explore()" + ], + "id": "2e0b28f4", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And consequently, a subset can be plotted using: \n" + ], + "id": "cbc5da60" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf[gdf['name_long'] == 'Egypt'].explore()" + ], + "id": "38bde31a", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| echo: false\n", + "# (Alternative)\n", + "# import hvplot.pandas\n", + "# gdf.hvplot(title='Hello world', geo=True, hover_cols=['name_long'], legend=False).opts(bgcolor='lightgray', active_tools=['wheel_zoom']) \n", + "#This way, we can also add background tiles:\n", + "# gdf.hvplot(tiles='OSM', alpha=0.5, geo=True, title='Hello world', hover_cols=['name_long'], legend=False).opts(active_tools=['wheel_zoom']) " + ], + "id": "974879a7", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Geometry columns {#sec-geometry-columns}\n", + "\n", + "A vital column in a `GeoDataFrame` is the geometry column, of class `GeoSeries`.\n", + "The geometry column contains the geometric part of the vector layer.\n", + "In the case of the `gdf` object, the geometry column contains `'MultiPolygon'`s associated with each country:\n" + ], + "id": "045c19e0" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf['geometry']" + ], + "id": "04445ec8", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The geometry column also contains the spatial reference information, if any:\n" + ], + "id": "c485c556" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf['geometry'].crs" + ], + "id": "7aef15f2", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Many geometry operations, such as calculating the centroid, buffer, or bounding box of each feature involve just the geometry.\n", + "Applying this type of operation on a `GeoDataFrame` is therefore basically a shortcut to applying it on the `GeoSeries` object in the geometry column.\n", + "The two following commands therefore return exactly the same result, a `GeoSeries` with country centroids:\n" + ], + "id": "3e908c1d" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf.centroid" + ], + "id": "109759f2", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf['geometry'].centroid" + ], + "id": "3a2b9003", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that `.centroid`, and other similar operators in `geopandas` such as `.buffer` (@sec-buffers) or `.convex_hull`, return only the geometry (i.e., a `GeoSeries`), not a `GeoDataFrame` with the original attribute data. In case we want the latter, we can create a copy of the `GeoDataFrame` and then \"overwrite\" its geometry (or, we can overwrite the geometries directly in case we don't need the original ones, as in `gdf['geometry']=gdf.centroid`):\n" + ], + "id": "ceb1a1e5" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf2 = gdf.copy()\n", + "gdf2['geometry'] = gdf.centroid\n", + "gdf2" + ], + "id": "19f138d2", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another useful property of the geometry column is the geometry type, as shown in the following code. \n", + "Note that the types of geometries contained in a geometry column (and, thus, a vector layer) are not necessarily the same for every row.\n", + "Accordingly, the `.type` property returns a `Series` (of type `string`), rather than a single value:\n" + ], + "id": "0bb02cce" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf['geometry'].type" + ], + "id": "46f85cdc", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To summarize the occurrence of different geometry types in a geometry column, we can use the **pandas** method called `value_counts`:\n" + ], + "id": "d42fb27b" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf['geometry'].type.value_counts()" + ], + "id": "62e1e48b", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this case, we see that the `gdf` layer contains only `'MultiPolygon'` geometries.\n", + "It is possible to have multiple geometry types in a single `GeoSeries` and a `GeoDataFrame` can have multiple `GeoSeries`:\n" + ], + "id": "ac7c5223" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf['centroids'] = gdf.centroid\n", + "gdf['polygons'] = gdf.geometry\n", + "gdf" + ], + "id": "5833a23a", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To switch the geometry column from one ``GeoSeries` column to another, we use `set_geometry`:\n" + ], + "id": "2e5e0213" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf.set_geometry('centroids', inplace=True)\n", + "gdf.explore()" + ], + "id": "6e9eb0ad", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf.set_geometry('polygons', inplace=True)\n", + "gdf.explore()" + ], + "id": "4fe42d49", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### The Simple Features standard {#sec-simple-features}\n", + "\n", + "Geometries are the basic building blocks of vector layers. \n", + "Although the Simple Features standard defines about 20 types of geometries, we will focus on the seven most commonly used types: `POINT`, `LINESTRING`, `POLYGON`, `MULTIPOINT`, `MULTILINESTRING`, `MULTIPOLYGON` and `GEOMETRYCOLLECTION`. \n", + "Find the whole list of possible feature types in the PostGIS manual.\n", + "\n", + "\n", + "Well-known binary (WKB) and well-known text (WKT) are the standard encodings for simple feature geometries.\n", + "\n", + "WKB representations are usually hexadecimal strings easily readable for computers. \n", + "This is why GIS and spatial databases use WKB to transfer and store geometry objects. \n", + "WKT, on the other hand, is a human-readable text markup description of simple features.\n", + "Both formats are exchangeable, and if we present one, we will naturally choose the WKT representation.\n", + "\n", + "The foundation of each geometry type is the point. \n", + "A point is simply a coordinate in 2D, 3D, or 4D space such as:\n", + "\n", + "\n", + "```text\n", + "POINT (5 2)\n", + "```\n", + "\n", + "A linestring is a sequence of points with a straight line connecting the points, for example:\n", + "\n", + "\n", + "```text\n", + "LINESTRING (1 5, 4 4, 4 1, 2 2, 3 2)\n", + "```\n", + "\n", + "A polygon is a sequence of points that form a closed, non-intersecting ring. Closed means that the first and the last point of a polygon have the same coordinates (see right panel in Figure ...).\n", + "\n", + "```text\n", + "POLYGON ((1 5, 2 2, 4 1, 4 4, 1 5))\n", + "```\n", + "\n", + "\n", + "\n", + "So far we have created geometries with only one geometric entity per feature. \n", + "However, the Simple Features standard allows multiple geometries to exist within a single feature, using \"multi\" versions of each geometry type, as illustrated below:\n", + "\n", + "```text\n", + "MULTIPOINT (5 2, 1 3, 3 4, 3 2)\n", + "MULTILINESTRING ((1 5, 4 4, 4 1, 2 2, 3 2), (1 2, 2 4))\n", + "MULTIPOLYGON (((1 5, 2 2, 4 1, 4 4, 1 5), (0 2, 1 2, 1 3, 0 3, 0 2)))\n", + "```\n", + "\n", + "\n", + "\n", + "Finally, a geometry collection can contain any combination of geometries including (multi)points and linestrings:\n", + "\n", + "\n", + "```text\n", + "GEOMETRYCOLLECTION (MULTIPOINT (5 2, 1 3, 3 4, 3 2), LINESTRING (1 5, 4 4, 4 1, 2 2, 3 2))\n", + "```\n", + "\n", + "### Geometries {#sec-geometries}\n", + "\n", + "Each element in the geometry column is a geometry object, of class `shapely`.\n", + "For example, here is one specific geometry selected by implicit index (that of Canada):\n" + ], + "id": "4cd77127" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf['geometry'].iloc[3]" + ], + "id": "8772911d", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also select a specific geometry based on the `'name_long'` attribute:\n" + ], + "id": "b447859f" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf[gdf['name_long'] == 'Egypt']['geometry'].iloc[0]" + ], + "id": "8a6d21f4", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The **shapely** package is compatible with the Simple Features standard.\n", + "Accordingly, seven types of geometries are supported.\n", + "The following section demonstrates creating a `shapely` geometry of each type from scratch. In the first example (a `'Point'`) we demonstrate two types of inputs for geometry creation:\n", + "\n", + "* a list of coordinates\n", + "* a `string` in the WKT format and \n", + "\n", + "In the examples for the remaining geometries we use the former approach.\n", + "\n", + "Creating a `'Point'` geometry from a list of coordinates uses the `shapely.geometry.Point` function:\n" + ], + "id": "631a8934" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "point = shapely.geometry.Point([5, 2])\n", + "point" + ], + "id": "cc3c2d9a", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Alternatively, we can use the `shapely.wkt.loads` (stands for \"load a WKT *s*tring\") to transform a WKT string to a `shapely` geometry object. \n", + "Here is an example of creating the same `'Point'` geometry from WKT:\n" + ], + "id": "bf035c13" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "point = shapely.wkt.loads('POINT (5 2)')\n", + "point" + ], + "id": "b8d2fa3f", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is an example of a `'MultiPoint'` geometry from a list of coordinate tuples:\n" + ], + "id": "147de7f0" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "multipoint = shapely.geometry.MultiPoint([(5,2), (1,3), (3,4), (3,2)])\n", + "multipoint" + ], + "id": "f2c71317", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is an example of a `'LineString'` geometry from a list of coordinate tuples:\n" + ], + "id": "3b7a781b" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "linestring = shapely.geometry.LineString([(1,5), (4,4), (4,1), (2,2), (3,2)])\n", + "linestring" + ], + "id": "f9060d41", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is an example of a `'MultiLineString'` geometry. Note that there is one list of coordinates for each line in the `MultiLineString`:\n" + ], + "id": "261f3d26" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "linestring = shapely.geometry.MultiLineString([[(1,5), (4,4), (4,1), (2,2), (3,2)], [(1,2), (2,4)]])\n", + "linestring" + ], + "id": "d5a9ff1f", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is an example of a `'Polygon'` geometry. Note that there is one list of coordinates that defines the exterior outer hull of the polygon, followed by a list of lists of coordinates that define the potential holes in the polygon:\n" + ], + "id": "8719792b" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "polygon = shapely.geometry.Polygon([(1,5), (2,2), (4,1), (4,4), (1,5)], [[(2,4), (3,4), (3,3), (2,3), (2,4)]])\n", + "polygon" + ], + "id": "5700fadb", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here is an example of a `'MultiPolygon'` geometry:\n" + ], + "id": "2344e1d7" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "multipolygon = shapely.geometry.MultiPolygon([\n", + " shapely.geometry.Polygon([(1,5), (2,2), (4,1), (4,4), (1,5)]), \n", + " shapely.geometry.Polygon([(0,2), (1,2), (1,3), (0,3), (0,2)])\n", + "])\n", + "multipolygon" + ], + "id": "de2d8163", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And, finally, here is an example of a `'GeometryCollection'` geometry:\n" + ], + "id": "b4a47141" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "multipoint = shapely.geometry.GeometryCollection([\n", + " shapely.geometry.MultiPoint([(5,2), (1,3), (3,4), (3,2)]),\n", + " shapely.geometry.MultiLineString([[(1,5), (4,4), (4,1), (2,2), (3,2)], [(1,2), (2,4)]])\n", + "])\n", + "multipoint" + ], + "id": "72ca84ed", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`shapely` geometries act as atomic units of vector data, as spatial operations on a geometry return a single new geometry.\n", + "For example, the following expression calculates the difference between the buffered `multipolygon` (using distance of `0.2`) and itself:\n" + ], + "id": "f80fb6c6" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "multipolygon.buffer(0.2).difference(multipolygon)" + ], + "id": "640b882c", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As demonstrated above, a `shapely` geometry object is automatically evaluated to a small image of the geometry (when using an interface capable of displaying it, such as a Jupyter Notebook). \n", + "To print the WKT string instead, we can use the `print` function:\n" + ], + "id": "a5097c9a" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "print(linestring)" + ], + "id": "3a8f395c", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can determine the geometry type using the `.geom_type` property, which returns a `string`:\n" + ], + "id": "6b7ebf9c" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "linestring.geom_type" + ], + "id": "df2bd867", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, it is important to note that raw coordinates of `shapely` geometries are accessible through a combination of the `.coords`, `.geoms`, `.exterior`, and `.interiors`, properties (depending on the geometry type). \n", + "These access methods are helpful when we need to develop our own spatial operators for specific tasks. \n", + "For example, the following expression returns the `list` of all coordinates of the `polygon` geometry exterior:\n" + ], + "id": "091f51db" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "list(polygon.exterior.coords)" + ], + "id": "13960411", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Vector layer from scratch {#sec-vector-layer-from-scratch}\n", + "\n", + "In the previous sections we started with a vector layer (`GeoDataFrame`), from an existing Shapefile, and \"decomposed\" it to extract the geometry column (`GeoSeries`, @sec-geometry-columns) and separate geometries (`shapely`, see @sec-geometries).\n", + "In this section, we will demonstrate the opposite process, constructing a `GeoDataFrame` from `shapely` geometries, combined into a `GeoSeries`.\n", + "This will:\n", + "\n", + "* Help you better understand the structure of a `GeoDataFrame`, and\n", + "* May come in handy when you need to programmatically construct simple vector layers, such as a line between two given points, etc.\n", + "\n", + "Vector layers consist of two main parts: geometries and non-geographic attributes. @fig-gdf-flow shows how a `GeoDataFrame` object is created---geometries come from a `GeoSeries` object (which consists of `shapely` geometries), while attributes are taken from `Series` objects. \n", + "\n", + "![Creating a `GeoDataFrame` from scratch](images/gdf-flow.svg){#fig-gdf-flow}\n", + "\n", + "The final result, a vector layer (`GeoDataFrame`) is therefore a hierarchical structure (@fig-gdf-structure), containing the geometry column (`GeoSeries`), which in turn contains geometries (`shapely`). Each of the \"internal\" components can be accessed, or \"extracted\", which is sometimes necessary, as we will see later on.\n", + "\n", + "![Structure of a `GeoDataFrame`](images/gdf-structure.svg){#fig-gdf-structure}\n", + "\n", + "Non-geographic attributes represent the name of the feature or other attributes such as measured values, groups, and other things. To illustrate attributes, we will represent a temperature of 25°C in London on June 21st, 2017. This example contains a geometry (the coordinates), and three attributes with three different classes (place name, temperature and date). Objects of class `GeoDataFrame` represent such data by combining the attributes (`Series`) with the simple feature geometry column (`GeoSeries`). First, we create a point geometry, which we know how to do from @sec-geometries:\n" + ], + "id": "424e71e0" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "lnd_point = shapely.geometry.Point(0.1, 51.5)\n", + "lnd_point" + ], + "id": "21ee5498", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we create a `GeoSeries` (of length 1), containing the point. Note that a `GeoSeries` stores a CRS definition, in this case WGS84 (defined using its EPSG code `4326`). Also note that the `shapely` geometries go into a `list`, to illustrate that there can be more than one (unlike in this example):\n" + ], + "id": "9d0391c1" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "lnd_geom = gpd.GeoSeries([lnd_point], crs=4326)\n", + "lnd_geom" + ], + "id": "d0763e10", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next, we combine the `GeoSeries` with other attributes into a `dict`. The geometry column is a `GeoSeries`, named `geometry`. The other attributes (if any) may be defined using `list` or `Series` objects. Here, for simplicity, we use the `list` option for defining the three attributes `name`, `temperature`, and `date`. Again, note that the `list` can be of length >1, in case we are creating a layer with more than one feature:\n" + ], + "id": "188f9050" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "d = {\n", + " 'name': ['London'],\n", + " 'temperature': [25],\n", + " 'date': ['2017-06-21'],\n", + " 'geometry': lnd_geom\n", + "}" + ], + "id": "924e6814", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Finally, the `dict` can be coverted to a `GeoDataFrame`:\n" + ], + "id": "a97a44f4" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "lnd_layer = gpd.GeoDataFrame(d)\n", + "lnd_layer" + ], + "id": "f1aad3cf", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "What just happened? First, the coordinates were used to create the simple feature geometry (`shapely`). Second, the geometry was converted into a simple feature geometry column (`GeoSeries`), with a CRS. Third, attributes were combined with `GeoSeries`. This results in an `GeoDataFrame` object, named `lnd_layer`.\n", + "\n", + "Just to illustrate how does creating a layer with more than one feature looks like, here is an example where we create a layer with two points, London and Paris:\n" + ], + "id": "b1834202" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "lnd_point = shapely.geometry.Point(0.1, 51.5)\n", + "paris_point = shapely.geometry.Point(2.3, 48.9)\n", + "towns_geom = gpd.GeoSeries([lnd_point, paris_point], crs=4326)\n", + "d = {\n", + " 'name': ['London', 'Paris'],\n", + " 'temperature': [25, 27],\n", + " 'date': ['2017-06-21', '2017-06-21'],\n", + " 'geometry': towns_geom\n", + "}\n", + "towns_layer = gpd.GeoDataFrame(d)\n", + "towns_layer" + ], + "id": "3b862042", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The following expression creates an interactive map with the result:\n" + ], + "id": "d9f23359" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "towns_layer.explore()" + ], + "id": "dbaf95e3", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Alternatively, we can first create a `pandas.DataFrame` and turn it into a `GeoDataFrame` like this:\n" + ], + "id": "a842bb91" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "towns_table = pd.DataFrame({\n", + " 'name': ['London', 'Paris'],\n", + " 'temperature': [25, 27],\n", + " 'date': ['2017-06-21', '2017-06-21'],\n", + " 'x': [0.1, 2.3],\n", + " 'y': [51.5, 48.9]\n", + "})\n", + "towns_geom = gpd.points_from_xy(towns_table['x'], towns_table['y'])\n", + "towns_layer = gpd.GeoDataFrame(towns_table, geometry=towns_geom, crs=4326)" + ], + "id": "546c4038", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "which gives the same result `towns_layer`:\n" + ], + "id": "3398df8e" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "towns_layer" + ], + "id": "2acabecf", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This approach is particularly useful when we need to read data from a CSV file, e.g., using `pandas.read_csv`, and want to turn the resulting `DataFrame` into a `GeoDataFrame`.\n", + "\n", + "### Derived numeric properties {#sec-area-length}\n", + "\n", + "Vector layers are characterized by two essential derived numeric properties: \n", + "\n", + "* Length (`.length`)---Applicable to lines\n", + "* Area (`.area`)---Applicable to polygons\n", + "\n", + "Area and length can be calculated for any data structures discussed above, either a `shapely` geometry, in which case the returned value is a number:\n" + ], + "id": "0581e462" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "linestring.length" + ], + "id": "76e37def", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "multipolygon.area" + ], + "id": "a62ca83e", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "or for `GeoSeries` or `DataFrame`, in which case the returned value is a numeric `Series`:\n" + ], + "id": "0ee47b4e" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf.area" + ], + "id": "26ad52a1", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Like all numeric calculations in `geopandas`, the results assume a planar CRS and are returned in its native units. This means that length and area measurements for geometries in WGS84 (`crs=4326`) are returned in decimal degrees and essentially meaningless, thus the warning in the above command. \n", + "\n", + "To obtain true length and area measurements, the geometries first need to be transformed to a projected CRS (see @sec-reprojecting-vector-geometries) applicable to the area of interest. For example, the area of Slovenia can be calculated in the UTM zone 33N CRS (`crs=32633`). The result is in the CRS units, namely $m^2$:\n" + ], + "id": "8095a719" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "gdf[gdf['name_long'] == 'Slovenia'].to_crs(32633).area" + ], + "id": "7e602b7e", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Raster data\n", + "\n", + "### Introduction\n", + "\n", + "As mentioned above, working with rasters in Python is less organized around one comprehensive package (compared to the case for vector layers and **geopandas**).\n", + "Instead, several packages provide alternative subsets of method for working with raster data. \n", + "\n", + "The two most notable approaches for working with rasters in Python are provided by the **rasterio** and **xarray** packages.\n", + "As we will see shortly, they differ in their scope and underlying data models. \n", + "Specifically, **rasterio** represents rasters as **numpy** arrays associated with a separate object holding the spatial metadata.\n", + "The **xarray** package, however, represents rasters with the native `DataArray` object, which is an extension of **numpy** array designed to hold axis labels and attributes, in the same object, together with the array of raster values.\n", + "\n", + "Both packages are not exhaustive in the same way **geopandas** is. \n", + "For example, when working with **rasterio**, on the one hand, more packages may be needed to accomplish common tasks such as zonal statistics (package **rasterstats**) or calculating topographic indices (package **richdem**). \n", + "On the other hand, **xarray** was extended to accommodate spatial operators missing from the core package itself, with the **rioxarray** and **xarray-spatial** packages.\n", + "\n", + "In the following two sections, we introduce **rasterio**, which is the raster-related package we are going to work with through the rest of the book. \n", + "\n", + "### Using **rasterio** {#sec-using-rasterio}\n", + "\n", + "To work with the **rasterio** package, we first need to import it.\n", + "We also import **numpy**, since the underlying raster data are stored in **numpy** arrays.\n", + "To effectively work with those, we expose all **numpy** functions.\n", + "Finally, we import the `rasterio.plot` sub-module for its `show` function for quick visualization of rasters.\n" + ], + "id": "d0d7ce44" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import numpy as np\n", + "import rasterio\n", + "import rasterio.plot\n", + "import subprocess" + ], + "id": "717aed7f", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Rasters are typically imported from existing files.\n", + "When working with **rasterio**, importing a raster is actually a two-step process:\n", + "\n", + "* First, we open a raster file \"connection\" using `rasterio.open`\n", + "* Second, we read raster values from the connection using the `.read` method\n", + "\n", + "This separation is analogous to basic Python functions for reading from files, such as `open` and `.readline` to read from a text file.\n", + "The rationale is that we do not always want to read all information from the file into memory, which is particularly important as rasters size can be larger than RAM size. \n", + "Accordingly, the second step (`.read`) is selective. For example, we may want to read just one raster band rather than reading all bands.\n", + "\n", + "In the first step, we pass a file path to the `rasterio.open` function to create a `DatasetReader` file connection. \n", + "For this example, we use a single-band raster representing elevation in Zion National Park:\n" + ], + "id": "28cfbcd5" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "src = rasterio.open('data/srtm.tif')\n", + "src" + ], + "id": "3f360fe4", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To get a first impression of the raster values, we can plot the raster using the `show` function:\n" + ], + "id": "9f7b1edc" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "rasterio.plot.show(src);" + ], + "id": "5807ad3c", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `DatasetReader` contains the raster metadata, that is, all of the information other than the raster values.\n", + "Let us examine it:\n" + ], + "id": "5c423816" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "src.meta" + ], + "id": "f5b5d59e", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Importantly, we can see: \n", + "\n", + "* The raster data type (`dtype`)\n", + "* Raster dimensions (`width`, `height`, and `count`, i.e., number of layers)\n", + "* Raster Coordinate Reference System (`crs`)\n", + "* The raster affine transformation matrix (`transform`)\n", + "\n", + "The last item (i.e., `transform`) deserves more attention. \n", + "To position a raster in geographical space, in addition to the CRS we must specify the raster *origin* ($x_{min}$, $y_{max}$) and resolution ($delta_{x}$, $delta_{y}$). \n", + "In the transform matrix notation, these data items are stored as follows:\n", + "\n", + "\n", + "```{text}\n", + "Affine(delta_x, 0.0, x_min,\n", + " 0.0, delta_y, y_max)\n", + "```\n", + "\n", + "\n", + "Note that, by convention, raster y-axis origin is set to the maximum value ($y_{max}$) rather than the minimum, and, accordingly, the y-axis resolution ($delta_{y}$) is negative. \n", + "\n", + "Finally, the `.read` method of the `DatasetReader` is used to read the actual raster values.\n", + "Importantly, we can read:\n", + "\n", + "* A particular layer, passing a numeric index (as in `.read(1)`)\n", + "* A subset of layers, passing a `list` of indices (as in `.read([1,2])`)\n", + "* All layers (as in `.read()`)\n", + "\n", + "Note that the layer indices start from `1`, contrary to the Python convention of the first index being `0`. \n", + "\n", + "The resulting object is a **numpy** array, with either two or three dimensions:\n", + "\n", + "* *Three* dimensions, when reading more than one layer (e.g., `.read()` or `.read([1,2])`). \n", + "In such case, the dimensions pattern is `(layers, rows, columns)`\n", + "* *Two* dimensions, when reading one specific layer (e.g., `.read(1)`)\n", + "\n", + "Let's read the first (and only) layer from the `srtm.tif` raster, using the file connection object `src`, for example:\n" + ], + "id": "b660c4b5" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "src.read(1)" + ], + "id": "3d2016c7", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The result is a two-dimensional `numpy` array. \n", + "\n", + "The relation between a `rasterio` file connection and the derived properties is summarized in @fig-rasterio-structure. The file connection (created with `rasterio.open`) gives access to the two components of raster data: the metadata (via the `.meta` property) and the values (via the `.read` method).\n", + "\n", + "![Creating a `GeoDataFrame` from scratch](images/rasterio-structure.svg){#fig-rasterio-structure}\n", + "\n", + "### Raster from scratch {#sec-raster-from-scratch}\n", + "\n", + "In this section, we are going to demonstrate creation of rasters from scratch. We are going to create two small rasters, `elev` and `grain`, which we are going to use in examples later on in the book. Unlike creating a vector layer, creating a raster from scratch is rarely needed in practive because aligning a raster with the right spatial extent is difficult to do programmatically (GIS software is a better fit for the job). Nevertheless, the examples will be useful to become more familiar with the `rasterio` data structures.\n", + "\n", + "A raster is basically an array combined with georeferencing information, namely:\n", + "\n", + "* A transformation matrix (linking pixel indices with coordinates)\n", + "* A CRS definition\n", + "\n", + "Therefore, to create a raster, we first need to have an array with the values, then supplement it with the georeferencing information. Let's create the arrays `elev` and `grain`. The `elev` array is a 6 by 6 array with sequential values from 1 to 36. It can be created as follows:\n" + ], + "id": "917fac08" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "elev = np.arange(1, 37, dtype=np.uint8).reshape(6, 6)\n", + "elev" + ], + "id": "f1aa64e0", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `grain` array represents a categorical raster with values `0`, `1`, `2`, corresponding to categories \"clay\", \"silt\", \"sand\", respectively. We will create if from a specific arrangement of pixel values, as follows:\n" + ], + "id": "b9675858" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "v = [\n", + " 1, 0, 1, 2, 2, 2, \n", + " 0, 2, 0, 0, 2, 1, \n", + " 0, 2, 2, 0, 0, 2, \n", + " 0, 0, 1, 1, 1, 1, \n", + " 1, 1, 1, 2, 1, 1, \n", + " 2, 1, 2, 2, 0, 2\n", + "]\n", + "grain = np.array(v, dtype=np.uint8).reshape(6, 6)\n", + "grain" + ], + "id": "4e520c56", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that in both cases we are using the `uint8` (unsigned integer in 8 bits, i.e., `0-255`) data type, which is minimally sufficient to represent all possible values of the given rasters. This is the recommended approach for a minimal memory footprint.\n", + "\n", + "What is missing is the raster transform (see @sec-using-rasterio). In this case, since the rasters are arbitrary, we also set up an arbitrary transformation matrix, where:\n", + "\n", + "* the origin ($x_{min}$, $y_{max}$) is at `-1.5,1.5`, and\n", + "* and resolution ($delta_{x}$, $delta_{y}$) is `0.5,-0.5`.\n", + "\n", + "In terms of code, we do that as follows, using `rasterio.transform.from_origin`:\n" + ], + "id": "b4c2f750" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "new_transform = rasterio.transform.from_origin(\n", + " west=-1.5, \n", + " north=1.5, \n", + " xsize=0.5, \n", + " ysize=0.5\n", + ")\n", + "new_transform" + ], + "id": "ea2cac2c", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that, confusingly, $delta_{y}$ is defined in `rasterio.transform.from_origin` using a positive value (`0.5`), even though it is eventuially negative (`-0.5`)! \n", + "\n", + "The raster can now be plotted in its coordinate system, passing the array along with the transformation matrix to `show`:\n" + ], + "id": "2fcc762e" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "rasterio.plot.show(elev, transform=new_transform);" + ], + "id": "e3c30c85", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The `grain` raster can be plotted the same way, as we are going to use the same transformation matrix for it as well:\n" + ], + "id": "1542c807" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "rasterio.plot.show(grain, transform=new_transform);" + ], + "id": "d3410df6", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "At this point, we can work with the raster using `rasterio`:\n", + "\n", + "* Passing the transformation matrix wherever true raster pixel coordinates are important (such as in function `show` above)\n", + "* Keeping in mind that any other layer we use in the analysis is in the same CRS of those coordinates\n", + "\n", + "Finally, to export the raster for permanent storage, along with the CRS definition, we need to go through the following steps:\n", + "\n", + "* Create a raster file connection (where we set the transform and the CRS, among other settings)\n", + "* Write the array with raster values into the connection\n", + "* Close the connection\n", + "\n", + "In the case of `elev`, we do it as follows:\n" + ], + "id": "25afe9ac" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "new_dataset = rasterio.open(\n", + " 'output/elev.tif', 'w', \n", + " driver='GTiff',\n", + " height=elev.shape[0],\n", + " width=elev.shape[1],\n", + " count=1,\n", + " dtype=elev.dtype,\n", + " crs=4326,\n", + " transform=new_transform\n", + ")\n", + "new_dataset.write(elev, 1)\n", + "new_dataset.close()" + ], + "id": "245a2391", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Note that the CRS we (arbitrarily) set for the `elev` raster is WGS84, defined using `crs=4326` according to the EPSG code.\n", + "\n", + "Here is how we export the `grain` raster as well, using almost the exact same code just with `elev` replaced with `grain`:\n" + ], + "id": "a0707b34" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "new_dataset = rasterio.open(\n", + " 'output/grain.tif', 'w', \n", + " driver='GTiff',\n", + " height=grain.shape[0],\n", + " width=grain.shape[1],\n", + " count=1,\n", + " dtype=grain.dtype,\n", + " crs=4326,\n", + " transform=new_transform\n", + ")\n", + "new_dataset.write(grain, 1)\n", + "new_dataset.close()" + ], + "id": "33bd4912", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Don't worry if the raster export code is unclear. We will elaborate on the details of raster output in @sec-data-output-raster. \n", + "\n", + "As a result, the files `elev.tif` and `grain.tif` are written into the `output` directory. These are identical to the `elev.tif` and `grain.tif` files in the `data` directory which we use later on in the examples (for example, @sec-raster-subsetting).\n", + "\n", + "Note that the transform matrices and dimensions of `elev` and `grain` are identical. This means that the rasters are overlapping, and can be combined into one two-band raster, combined in raster algebra operations (@sec-map-algebra), etc.\n", + "\n", + "## Coordinate Reference Systems {#sec-coordinate-reference-systems}\n", + "\n", + "Vector and raster spatial data types share concepts intrinsic to spatial data. \n", + "Perhaps the most fundamental of these is the Coordinate Reference System (CRS), which defines how the spatial elements of the data relate to the surface of the Earth (or other bodies).\n", + "CRSs are either geographic or projected, as introduced at the beginning of this chapter (see ...).\n", + "This section explains each type, laying the foundations for @sec-reproj-geo-data, which provides a deep dive into setting, transforming, and querying CRSs.\n", + "\n", + "### Geographic coordinate systems\n", + "\n", + "Geographic coordinate systems identify any location on the Earth's surface using two values---longitude and latitude (see left panel of @fig-zion-crs). Longitude is a location in the East-West direction in angular distance from the Prime Meridian plane, while latitude is an angular distance North or South of the equatorial plane.\n", + "Distances in geographic CRSs are therefore not measured in meters. \n", + "This has important consequences, as demonstrated in Section 7.\n", + "\n", + "A spherical or ellipsoidal surface represents the surface of the Earth in geographic coordinate systems. \n", + "Spherical models assume that the Earth is a perfect sphere of a given radius---they have the advantage of simplicity, but, at the same time, they are inaccurate: the Earth is not a sphere!\n", + "Ellipsoidal models are defined by two parameters: the equatorial radius and the polar radius. \n", + "These are suitable because the Earth is compressed: the equatorial radius is around 11.5 km longer than the polar radius.\n", + "\n", + "\n", + "Ellipsoids are part of a broader component of CRSs: the datum. \n", + "This contains information on what ellipsoid to use and the precise relationship between the Cartesian coordinates and location on the Earth's surface.\n", + "There are two types of datum---geocentric (such as WGS84) and local (such as NAD83). \n", + "You can see examples of these two types of datums in Figure .... \n", + "Black lines represent a geocentric datum, whose center is located in the Earth's center of gravity and is not optimized for a specific location.\n", + "In a local datum, shown as a purple dashed line, the ellipsoidal surface is shifted to align with the surface at a particular location. \n", + "These allow local variations on Earth's surface, such as large mountain ranges, to be accounted for in a local CRS. \n", + "\n", + "\n", + "### Projected coordinate reference systems {#sec-projected-coordinate-reference-systems}\n", + "\n", + "All projected CRSs are based on a geographic CRS, described in the previous section, and rely on map projections to convert the three-dimensional surface of the Earth into Easting and Northing (x and y) values in a projected CRS.\n", + "Projected CRSs are based on Cartesian coordinates on an implicitly flat surface (see right panel of @fig-zion-crs).\n", + " \n", + "They have an origin, x and y axes, and a linear unit of measurement such as meters.\n", + "\n", + "This transition cannot be done without adding some deformations. \n", + "Therefore, some properties of the Earth's surface are distorted in this process, such as area, direction, distance, and shape. \n", + "A projected coordinate system can preserve only one or two of those properties.\n", + "Projections are often named based on a property they preserve: equal-area preserves area, azimuthal preserves direction, equidistant preserves distance, and conformal preserves local shape.\n", + "\n", + "There are three main groups of projection types: conic, cylindrical, and planar (azimuthal). \n", + "In a conic projection, the Earth's surface is projected onto a cone along a single line of tangency or two lines of tangency.\n", + "Distortions are minimized along the tangency lines and rise with the distance from those lines in this projection.\n", + "Therefore, it is best suited for maps of mid-latitude areas. \n", + "A cylindrical projection maps the surface onto a cylinder.\n", + "This projection could also be created by touching the Earth's surface along a single line of tangency or two lines of tangency.\n", + "Cylindrical projections are used most often when mapping the entire world. \n", + "A planar projection projects data onto a flat surface touching the globe at a point or along a line of tangency.\n", + "It is typically used in mapping polar regions. \n", + "\n", + "Like most open-source geospatial software, the **geopandas** and **rasterio** packages use the PROJ software for CRS definition and calculations. \n", + "The **pyproj** package is a low-level interface to PROJ. \n", + "Using its functions, we can examine the list of supported projections:\n" + ], + "id": "de45a24f" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "import pyproj\n", + "epsg_codes = pyproj.get_codes('EPSG', 'CRS') ## List of supported EPSG codes\n", + "epsg_codes[:5] ## print first five" + ], + "id": "e4c8da3a", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "pyproj.CRS.from_epsg(4326) ## Printout of WGS84 CRS (EPSG:4326)" + ], + "id": "f2fae17c", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A quick summary of different projections, their types, properties, and suitability can be found in \"Map Projections\" (1993) and at .\n", + "We will expand on CRSs and explain how to project from one CRS to another in Chapter 7. \n", + "But, for now, it is sufficient to know:\n", + "\n", + "* That coordinate systems are a key component of geographic objects\n", + "* Knowing which CRS your data is in, and whether it is in geographic (lon/lat) or projected (typically meters), is important and has consequences for how Python handles spatial and geometry operations\n", + "* CRSs of **geopandas** (vector layer or geometry column) and **rasterio** (raster) objects can be queried with the `.crs` property\n", + "\n", + "Here is a demonstration of the last bullet point, where we import a vector layer and figure out its CRS (in this case, a projected CRS, namely UTM Zone 12):\n" + ], + "id": "2e31a05c" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "zion = gpd.read_file('data/zion.gpkg')\n", + "zion.crs" + ], + "id": "1999c679", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "And here is an illustration of the layer in the original projected CRS and in a geographic CRS (@fig-zion-crs):\n" + ], + "id": "d5c6f81e" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "#| label: fig-zion-crs\n", + "#| fig-cap: Examples of geographic (WGS 84; left) and projected (NAD83 / UTM zone 12N; right) coordinate systems for a vector data type.\n", + "\n", + "fig, axes = plt.subplots(ncols=2, figsize=(9,4))\n", + "zion.to_crs(4326).plot(ax=axes[0], edgecolor='black', color='lightgrey')\n", + "zion.plot(ax=axes[1], edgecolor='black', color='lightgrey')\n", + "axes[0].set_axisbelow(True) ## Plot grid below other elements\n", + "axes[1].set_axisbelow(True)\n", + "axes[0].grid() ## Add grid\n", + "axes[1].grid()\n", + "axes[0].set_title('WGS 84')\n", + "axes[1].set_title('UTM zone 12N');" + ], + "id": "fig-zion-crs", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We are going to elaborate on reprojection from one CRS to another (`.to_crs` in the above code section) in @sec-reproj-geo-data.\n", + "\n", + "## Units\n", + "\n", + "An essential feature of CRSs is that they contain information about spatial units.\n", + "Clearly, it is vital to know whether a house's measurements are in feet or meters, and the same applies to maps.\n", + "It is a good cartographic practice to add a scale bar or some other distance indicator onto maps to demonstrate the relationship between distances on the page or screen and distances on the ground. \n", + "Likewise, it is important for the user to be aware of the units in which the geometry coordinates are, to ensure that subsequent calculations are done in the right context.\n", + "\n", + "Python spatial data structures in **geopandas** and **rasterio** do not natively support the concept of measurement.\n", + "The coordinates of a vector layer or a raster are plain numbers, referring to an arbitrary plane.\n", + "For example, according to the `.transform` matrix of `srtm.tif` we can see that the raster resolution is `0.000833` and that its CRS is WGS84 (EPSG: `4326`). \n", + "We may know (or can find out) that the units of WGS84 are decimal degrees. \n", + "However, that information is not encoded in any numeric calculation.\n" + ], + "id": "9c3646dd" + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "src.meta" + ], + "id": "bb7613da", + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Consequently, we need to be aware of the CRS units we are working with. Typically, these are decimal degrees, in a geographic CRS, or $m$, in a projected CRS, although there are exceptions.\n", + "Geometric calculations such as length, area, or distance, return plain numbers in the same units of the CRS (such as $m$ or $m^2$). \n", + "It is up to the user to determine which units the result is given in and treat the result accordingly. \n", + "For example, if the area output was in $m^2$ and we need the result in $km^2$, then we need to divide the result by $1000^2$. \n", + "\n", + "## Exercises\n", + "\n", + "...\n" + ], + "id": "78a38ea1" + } + ], + "metadata": { + "kernelspec": { + "name": "python3", + "language": "python", + "display_name": "Python 3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} \ No newline at end of file diff --git a/_quarto.yml b/_quarto.yml index 8d9dbf2f..75749f2d 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -33,4 +33,4 @@ format: toc-title: "On this page" code-overflow: wrap toc-depth: 4 -# jupyter: python3 +jupyter: python3