4 Spatial feature engineering (part II)
In this second part of Spatial Feature Engineering, we turn to Map Synthesis. There is only one reading to complete for this block, from the GDS Book.
4.1 Packages and modules
import pandas, geopandas
import numpy as np
import contextily
import tobler
import matplotlib.pyplot as plt
4.2 Data
If you want to read more about the data sources behind this dataset, head to the Datasets section.
Assuming you have the file locally on the path ../data/:
pts = geopandas.read_file("../data/madrid_abb.gpkg")
We will be working with a modified version of pts:
- Since we will require distance calculations, we will switch to the Spanish official projection
- To make calculations in the illustration near-instantaneous, we will work with a smaller (random) sample of Airbnb properties (10% of the total)
db = pts.sample(
    frac=0.1, random_state=123
).to_crs(epsg=25830)
As you can see in the description, the new CRS is expressed in metres:
db.crs
<Projected CRS: EPSG:25830>
Name: ETRS89 / UTM zone 30N
Axis Info [cartesian]:
- E[east]: Easting (metre)
- N[north]: Northing (metre)
Area of Use:
- name: Europe between 6°W and 0°W: Faroe Islands offshore; Ireland - offshore; Jan Mayen - offshore; Norway including Svalbard - offshore; Spain - onshore and offshore.
- bounds: (-6.0, 35.26, 0.01, 80.49)
Coordinate Operation:
- name: UTM zone 30N
- method: Transverse Mercator
Datum: European Terrestrial Reference System 1989 ensemble
- Ellipsoid: GRS 1980
- Prime Meridian: Greenwich
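Since the coordinates are now expressed in metres, the distance thresholds used in the sections below can be passed directly as metres. As a quick sanity check, a minimal sketch confirming the bounds are no longer in degrees:
# Sketch: bounding box in metres (eastings/northings), not degrees.
db.total_bounds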
4.3 Distance buffers
How many Airbnbs are within 500m of each Airbnb?
from pysal.lib import weights
Using DistanceBand, we can build a spatial weights matrix that assigns 1 to each observation within 500m, and 0 otherwise.
w500m = weights.DistanceBand.from_dataframe(
    db, threshold=500, binary=True
)
UserWarning:
The weights matrix is not fully connected:
There are 86 disconnected components.
There are 47 islands with ids: 6878, 16772, 15006, 1336, 3168, 15193, 1043, 5257, 4943, 12849, 10609, 11309, 10854, 10123, 3388, 9380, 10288, 13071, 3523, 15316, 3856, 205, 7720, 10454, 18307, 3611, 12405, 10716, 14813, 15467, 1878, 16597, 14329, 7933, 16215, 13525, 13722, 11932, 14456, 8848, 15197, 8277, 9922, 13072, 13852, 5922, 17151.
The number of neighbors can be accessed through the cardinalities attribute:
n_neis = pandas.Series(w500m.cardinalities)
n_neis.head()
11297 213
2659 5
16242 21
15565 9
14707 159
dtype: int64
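Because the cardinalities now live in a pandas.Series, summary statistics are one method call away. A minimal sketch of the distribution of neighbor counts:
# Sketch: summary of the number of neighbours within 500m.
n_neis.describe()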
fig, ax = plt.subplots()
db.assign(
    n_neis=n_neis
).plot("n_neis", markersize=0.1, ax=ax)
plt.show()
Challenge: Calculate the number of Airbnb properties within 250m of each other property. What is the average? One possible starting point is sketched below.
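A possible approach to the challenge, reusing DistanceBand as above (the w250m name is our own):
# Sketch: binary weights at 250m, then the average neighbour count.
w250m = weights.DistanceBand.from_dataframe(
    db, threshold=250, binary=True
)
pandas.Series(w250m.cardinalities).mean()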
4.4 Distance rings
How many Airbnbs are between 500m and 1km of each Airbnb?
w1km = weights.DistanceBand.from_dataframe(
    db, threshold=1000, binary=True
)
UserWarning:
The weights matrix is not fully connected:
There are 20 disconnected components.
There are 5 islands with ids: 4943, 12849, 15467, 13525, 11932.
Now, we could simply do a subtraction:
n_ring_neis = pandas.Series(w1km.cardinalities) - n_neis
Or, if we need to know which is which, we can use set operations on weights:
w_ring = weights.w_difference(w1km, w500m, constrained=False)
UserWarning:
The weights matrix is not fully connected:
There are 34 disconnected components.
There are 23 islands with ids: 3744, 4143, 4857, 4943, 6986, 8345, 8399, 9062, 10592, 10865, 11574, 11613, 11785, 11840, 11932, 12015, 12635, 12714, 12849, 13091, 13317, 13525, 15467.
And we can confirm they’re both the same:
(pandas.Series(w_ring.cardinalities) - n_ring_neis).sum()
0
Challenge: Can you create a plot with the following two lines?
- One depicting the average number of properties within a range of 50m, 100m, 250m, 500m, 750m
- Another depicting the increase in the average number of neighbors between consecutive distances
One possible approach is sketched after this list.
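A sketch of one way to tackle the challenge (the distances and avg_neis names are our own), reusing DistanceBand as above:
# Sketch: average neighbours at each distance, and the increase between
# consecutive distances.
distances = [50, 100, 250, 500, 750]
avg_neis = pandas.Series(
    [
        pandas.Series(
            weights.DistanceBand.from_dataframe(
                db, threshold=d, binary=True
            ).cardinalities
        ).mean()
        for d in distances
    ],
    index=distances,
)
fig, ax = plt.subplots()
avg_neis.plot(ax=ax, label="Average neighbours")
avg_neis.diff().plot(ax=ax, label="Increase over previous distance")
ax.legend()
plt.show()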
4.5 Cluster membership (points)
We can use the spatial configuration of observations to classify them as part of clusters or not, which can then be encoded, for example, as dummy variables in a model.
We will learn a method to identify clusters of points based on their density across space. To do this, we will use the widely used DBSCAN algorithm. For this method, a cluster is a concentration of at least min_pts points, each of them within a distance of eps of at least one other point in the cluster. Points in the dataset are then divided into three categories:
- Noise, for those points outside a cluster.
- Cores, for those points inside a cluster with at least min_pts points in the cluster within distance eps.
- Borders, for those points inside a cluster with fewer than min_pts other points in the cluster within distance eps.
Both min_pts and eps need to be prespecified by the user before running DBSCAN. This is critical, as the value of these parameters can significantly influence the final result. For more on this, check here. Before exploring this in greater depth, let us get a first run at computing DBSCAN in Python, with eps = 500 and a minimum of min_pct = 2% of points for a candidate cluster to be considered one.
from sklearn.cluster import DBSCAN
min_pct = 2
min_pts = len(db) * min_pct // 100
eps = 500
We will illustrate it with a minimum number of points of min_pct% of the sample and a maximum radius of eps metres.
model = DBSCAN(min_samples=min_pts, eps=eps)
Once ready, we “fit” it to our data, but note that we first need to express the longitude and latitude of our points in metres.
db['x'] = db.geometry.x
db['y'] = db.geometry.y

model.fit(
    db[['x', 'y']]
);
We will attach the labels to db for easy access:
db["labels"] = model.labels_
We can define boundaries to turn point clusters into polygons if that fits our needs better:
The code in the next cell is a bit more advanced than expected for this course, but is used here as an illustration.
from pysal.lib import cg
boundaries = []
cl_ids = [i for i in db["labels"].unique() if i != -1]
for cl_id in cl_ids:
    sub = db.query(f"labels == {cl_id}")
    cluster_boundaries = cg.alpha_shape_auto(
        np.array(
            [sub.geometry.x, sub.geometry.y]
        ).T,
    )
    boundaries.append(cluster_boundaries)
boundaries = geopandas.GeoSeries(
    boundaries, index=cl_ids, crs=db.crs
)
And we can see what the clusters look like:
fig, ax = plt.subplots()
db.to_crs(
    epsg=3857
).plot(
    markersize=0.1, color="lime", ax=ax
)
boundaries.to_crs(
    epsg=3857
).plot(
    ax=ax, edgecolor="red", facecolor="none"
)
contextily.add_basemap(
    ax,
    source=contextily.providers.CartoDB.DarkMatterNoLabels
)
plt.show()
Challenge: How does the map above change if you require 5% of points instead of 2% for a candidate cluster to be considered one? A sketch of the first step follows.
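A sketch of the first step for the challenge (the min_pts_5, model5 and labels_5 names are our own); the boundary and mapping code above can then be re-run on the new labels:
# Sketch: refit DBSCAN requiring 5% of the sample per cluster.
min_pts_5 = len(db) * 5 // 100
model5 = DBSCAN(min_samples=min_pts_5, eps=eps)
model5.fit(db[['x', 'y']])
db["labels_5"] = model5.labels_
db["labels_5"].value_counts()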
4.6 Cluster membership (polygons)
We can take a similar approach as above if we have polygon geographies instead of points. Rather than using DBSCAN, here we can rely on local indicators of spatial association (LISAs) to pick up spatial concentrations of high or low values.
For the illustration, we will aggregate the location of Airbnb properties to a regular hexagonal grid, similar to how we generated it when transferring from polygons to polygons. First we create a polygon covering the extent of points:
one = geopandas.GeoSeries(
    [cg.alpha_shape_auto(
        np.array(
            [db.geometry.x, db.geometry.y]
        ).T,
    )],
    crs=db.crs
)
Then we can tessellate:
abb_hex = tobler.util.h3fy(
    one, resolution=8
)
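The resolution parameter controls the size of the hexagons (higher values mean smaller cells). A minimal sketch, assuming the one extent from above, to compare how many cells two resolutions produce:
# Sketch: number of H3 cells covering the same extent at two resolutions.
for res in (7, 8):
    print(res, len(tobler.util.h3fy(one, resolution=res)))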
And obtain a count of points in each polygon:
counts = geopandas.sjoin(
    db, abb_hex
).groupby(
    "index_right"
).size()

abb_hex["count"] = counts
abb_hex["count"] = abb_hex["count"].fillna(0)
fig, ax = plt.subplots()
abb_hex.plot("count", scheme="fisherjenks", ax=ax)
plt.show()
To identify spatial clusters, we rely on esda:
from pysal.explore import esda
And compute the LISA statistics:
w = weights.Queen.from_dataframe(abb_hex)
lisa = esda.Moran_Local(abb_hex["count"], w)
For a visual inspection of the clusters, we can use splot:
from pysal.viz import splot
from splot.esda import lisa_cluster
lisa_cluster(lisa, abb_hex, p=0.01)
plt.show()
And, if we want to extract the labels for each polygon, we can do so from the lisa object:
lisa.q * (lisa.p_sim < 0.01)
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 3, 0, 0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 0, 3, 0, 0, 3, 0, 0, 3,
0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
3, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0,
0, 3, 0, 3, 0, 0, 0, 0, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
3, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0,
0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3, 0, 0,
0, 0, 3, 0, 0, 0, 0, 1, 0, 0, 0, 3, 0, 0, 0, 0, 0, 3, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 3,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 3, 0, 0, 1, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 3, 0, 0, 3, 0, 0, 0, 3, 0, 3, 0, 0, 0, 0, 0, 3, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 3, 3, 0, 0])
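The quadrant codes follow the Moran scatterplot convention (1 = high-high, 2 = low-high, 3 = low-low, 4 = high-low), with 0 standing here for non-significant polygons. A minimal sketch to attach readable labels (the names mapping and cluster_type column are our own):
# Sketch: readable labels for the significant LISA quadrants.
sig = lisa.q * (lisa.p_sim < 0.01)
names = {0: "Non-significant", 1: "HH", 2: "LH", 3: "LL", 4: "HL"}
abb_hex["cluster_type"] = pandas.Series(sig, index=abb_hex.index).map(names)
abb_hex["cluster_type"].value_counts()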
4.7 Next steps
If you want a bit more background into some of the techniques reviewed in this block, the following might be of interest:
- This block of the GDS Course (Pietrostefani and Cabrera-Arnau 2024) will introduce you to more techniques like the LISAs seen above to explore the spatial dimension of the statistical properties of your data. If you want a more detailed read, this Chapter of the GDS Book (Rey, Arribas-Bel, and Wolf forthcoming) will do just that.
- This block of the GDS Course (Pietrostefani and Cabrera-Arnau 2024) will introduce you to more techniques for exploring point patterns. If you want a more comprehensive read, this Chapter of the GDS Book (Rey, Arribas-Bel, and Wolf forthcoming) will do just that.