Tutorial 3: Spatial Regression in Python#

Attention

Finnish university students are encouraged to use the CSC Noppe platform.

Attribution

Standing on the shoulders of giants: This tutorial is based on excellent open source materials developed by Daniel Arribas-Bel (Uni. Liverpool), available here (licensed under Creative Commons BY-NC-SA). Inspiration is also drawn from Chapter 11 of the forthcoming book “Geographic Data Science with Python” by Rey, Arribas-Bel & Wolf, which you can access here (licensed under Creative Commons BY-NC-ND).

This notebook offers a brief and gentle introduction to spatial econometrics in Python. To do that, we will use a set of Austin properties listed on Airbnb.

The core idea of spatial econometrics is to introduce a formal representation of space into the statistical framework for regression. This can be done in many ways: by including predictors based on space (e.g. distance to relevant features), by splitting the datasets into subsets that map into different geographical regions (e.g. spatial regimes), by exploiting close distance to other observations to borrow information in the estimation (e.g. kriging), or by introducing variables that put in relation their value at a given location with those in nearby locations, to give a few examples. Some of these approaches can be implemented with standard non-spatial techniques, while others require bespoke models that can deal with the issues introduced. In this short tutorial, we will focus on the latter group. In particular, we will introduce some of the most commonly used methods in the field of spatial econometrics.

The example we will use to demonstrate this draws on hedonic house price modelling. This is a well-established methodology, developed by Rosen (1974), that is capable of recovering the marginal willingness to pay for goods or services that are not traded in the market. In other words, it allows us to put an implicit price on things such as living close to a park or in a neighborhood with good air quality. In addition, since hedonic models are based on linear regression, the technique can also be used to obtain predictions of house prices.

Prepare data#

Before anything, let us load up the libraries we will use:

from pysal.model import spreg
from pysal.lib import weights
from scipy import stats
import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sns
import osmnx as ox
sns.set(style="whitegrid")

Let’s read the Airbnb data and OSM data for Austin, Texas:

# Read listings
fp = "data/listings.csv"
data = pd.read_csv(fp)
data.columns

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
       'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet',
       'price', 'weekly_price', 'monthly_price', 'security_deposit',
       'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
       'maximum_nights', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'first_review', 'last_review', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'requires_license',
       'license', 'jurisdiction_names', 'instant_bookable',
       'cancellation_policy', 'require_guest_profile_picture',
       'require_guest_phone_verification', 'calculated_host_listings_count',
       'reviews_per_month'],
      dtype='object')

data.head(2)

	id	listing_url	scrape_id	last_scraped	name	summary	space	description	experiences_offered	neighborhood_overview	...	review_scores_value	requires_license	license	jurisdiction_names	instant_bookable	cancellation_policy	require_guest_profile_picture	require_guest_phone_verification	calculated_host_listings_count	reviews_per_month
0	72635	https://www.airbnb.com/rooms/72635	20151107173015	2015-11-08	3 Private Bedrooms, SW Austin	Conveniently located 10-15 from downtown in SW...	We have three spare bedrooms, each with a quee...	Conveniently located 10-15 from downtown in SW...	none	Location and convenience are key. Easy access...	...	10.0	f	NaN	NaN	f	moderate	f	f	1	0.02
1	5386323	https://www.airbnb.com/rooms/5386323	20151107173015	2015-11-07	Cricket Trailer	Rent this cool concept trailer that has everyt...	Rental arrangements for this trailer allows yo...	Rent this cool concept trailer that has everyt...	none	We're talking about wherever you'd like in the...	...	NaN	f	NaN	NaN	f	moderate	f	f	1	NaN

2 rows × 92 columns

# Read OSM data - get administrative boundaries

# define the place query
query = {'city': 'Austin'}

# get the boundaries of the place (add additional buffer around the query)
boundaries = ox.geocode_to_gdf(query)

# Add a bit of buffer around (in decimal degrees)
# 0.05 is approximately 5 kilometers
boundaries["geometry"] = boundaries.buffer(0.05)

# Let's check the boundaries on a map
boundaries.explore()

/var/folders/f7/rhmqxfmx40s4yv9bhh7skq4m0000gp/T/ipykernel_80495/243181281.py:11: UserWarning: Geometry is in a geographic CRS. Results from 'buffer' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.

  boundaries["geometry"] = boundaries.buffer(0.05)
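The warning arises because the buffer of 0.05 is measured in degrees, and a degree of longitude covers less ground than a degree of latitude away from the equator. As a rough sketch (assuming a spherical Earth with a radius of 6371 km and a latitude of about 30.27° N for Austin, both values introduced here for illustration), the buffer spans different distances in the two directions:

```python
import math

EARTH_RADIUS_KM = 6371.0   # assumed mean Earth radius (spherical approximation)
AUSTIN_LAT_DEG = 30.27     # assumed approximate latitude of Austin
BUFFER_DEG = 0.05          # buffer size used in the tutorial

# Length of one degree of latitude on a sphere
km_per_deg_lat = 2 * math.pi * EARTH_RADIUS_KM / 360
# A degree of longitude shrinks with the cosine of the latitude
km_per_deg_lon = km_per_deg_lat * math.cos(math.radians(AUSTIN_LAT_DEG))

buffer_north_south_km = BUFFER_DEG * km_per_deg_lat
buffer_east_west_km = BUFFER_DEG * km_per_deg_lon

print(f"North-south: {buffer_north_south_km:.2f} km")
print(f"East-west:   {buffer_east_west_km:.2f} km")
```

For a rough filter like the one used here the difference is harmless. When the buffer has to be the same distance in every direction, the warning's own suggestion applies: re-project to a projected CRS with to_crs() before buffering.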


Let’s convert the Airbnb data into a GeoDataFrame based on the longitude and latitude columns and filter the data geographically based on the Austin boundaries:

# Create a GeoDataFrame
data["geometry"] = gpd.points_from_xy(data["longitude"], data["latitude"])
data = gpd.GeoDataFrame(data, crs="epsg:4326")

# Filter data geographically
data = data.sjoin(boundaries[["geometry"]])
data = data.reset_index(drop=True)

# Check the first rows
data.head()

	id	listing_url	scrape_id	last_scraped	name	summary	space	description	experiences_offered	neighborhood_overview	...	license	jurisdiction_names	instant_bookable	cancellation_policy	require_guest_profile_picture	require_guest_phone_verification	calculated_host_listings_count	reviews_per_month	geometry
0	72635	https://www.airbnb.com/rooms/72635	20151107173015	2015-11-08	3 Private Bedrooms, SW Austin	Conveniently located 10-15 from downtown in SW...	We have three spare bedrooms, each with a quee...	Conveniently located 10-15 from downtown in SW...	none	Location and convenience are key. Easy access...	...	NaN	NaN	f	moderate	f	f	1	0.02	POINT (-97.88431 30.20282)
1	5386323	https://www.airbnb.com/rooms/5386323	20151107173015	2015-11-07	Cricket Trailer	Rent this cool concept trailer that has everyt...	Rental arrangements for this trailer allows yo...	Rent this cool concept trailer that has everyt...	none	We're talking about wherever you'd like in the...	...	NaN	NaN	f	moderate	f	f	1	NaN	POINT (-97.90068 30.19941)
2	8826517	https://www.airbnb.com/rooms/8826517	20151107173015	2015-11-07	Private room 1 in South Austin	Upstairs, private, 12ft x 13 1/2ft room. Priv...	NaN	Upstairs, private, 12ft x 13 1/2ft room. Priv...	none	NaN	...	NaN	NaN	f	flexible	f	f	2	NaN	POINT (-97.86448 30.1685)
3	8828616	https://www.airbnb.com/rooms/8828616	20151107173015	2015-11-08	Private room 2 in South Austin	Upstairs, private, 11ft x 13 1/2ft room. Priv...	NaN	Upstairs, private, 11ft x 13 1/2ft room. Priv...	none	NaN	...	NaN	NaN	f	flexible	f	f	2	NaN	POINT (-97.86487 30.16862)
4	8536913	https://www.airbnb.com/rooms/8536913	20151107173015	2015-11-08	Brand-New 3BR Austin Home	Brand-new 3BR/2BA Austin home with landscaped ...	Feel instantly at home at our brand new 3BR/2B...	Brand-new 3BR/2BA Austin home with landscaped ...	none	Entertainment and activities are plentiful her...	...	NaN	NaN	f	strict	f	f	2	NaN	POINT (-97.88832 30.16943)

5 rows × 94 columns

One of the most interesting attributes in our data is naturally price:

data["price"].head()

  $300.00
   $99.00
  $100.00
  $100.00
  $599.00
Name: price, dtype: object

As we can see, our values are represented as strings with a dollar sign. Before we can take their logarithm, we need to remove the dollar sign and convert the values to floats:

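# Remove the dollar sign and the thousands separator (comma, e.g. $1,000.00) and convert to float
# regex=False treats "$" as a literal character rather than a regex end-of-string anchor
data["price"] = data["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)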
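One subtlety when stripping the dollar sign: in a regular expression, "$" anchors the end of the string instead of matching a dollar character, so the replacement only works when the pattern is treated literally. A minimal illustration with Python's standard re module (the sample string is made up):

```python
import re

raw = "$1,000.00"

# As a regex, "$" matches the (empty) end of the string, so nothing is removed
as_regex = re.sub("$", "", raw)

# As a literal, the dollar sign is removed; the comma is handled the same way
as_literal = raw.replace("$", "").replace(",", "")

price = float(as_literal)
print(as_regex, as_literal, price)
```

This is why it is safest to be explicit about literal matching when cleaning currency strings.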

# Here the tooltip parameter specifies which attributes are shown when hovering on top of the points
# The vmax parameter specifies the maximum value for the colormap (here, all 1000 dollars and above are combined)
# To save RAM, we only visualize a random sample of 200 observations
data.sample(n=200).explore(column="price", cmap="Reds", scheme="quantiles", k=4, tooltip=["name", "price"], vmax=1000, tiles="CartoDB Positron")


Baseline (nonspatial) regression#

Before introducing explicitly spatial methods, we will run a simple linear regression model. On the one hand, this will allow us to set out the main principles of hedonic modeling and how to interpret the coefficients, which is useful because the spatial models build on them; on the other hand, it will provide a baseline model that we can use to evaluate how meaningful the spatial extensions are.

Essentially, the core of a linear regression is to explain a given variable, here the price of a listing \(i\) on Airbnb (\(P_i\)), as a linear function of a set of other characteristics we will collectively call \(X_i\):

\[ \ln(P_i) = \alpha + \beta X_i + \epsilon_i \]

For several reasons, it is common practice to introduce the price in logarithms, so we will do so here. Additionally, since this is a probabilistic model, we add an error term \(\epsilon_i\) that is assumed to be well-behaved (i.i.d. as a normal).
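To make the equation concrete, here is a minimal NumPy sketch that recovers \(\alpha\) and \(\beta\) by least squares. The four listings and the "true" coefficients are made up for illustration, and the data are noise-free so that the fit is exact; this is not the routine spreg uses internally.

```python
import numpy as np

# Made-up listings: columns are bedrooms and bathrooms
X = np.array([[1, 1],
              [2, 1],
              [3, 2],
              [4, 3]], dtype=float)

# Noise-free log prices generated from known coefficients (assumed for the demo)
alpha_true, beta_true = 4.0, np.array([0.3, 0.2])
log_price = alpha_true + X @ beta_true

# Prepend a column of ones so the first coefficient is the intercept alpha
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares solution of X_design @ coef = log_price
coef, *_ = np.linalg.lstsq(X_design, log_price, rcond=None)
print(coef)
```

With real data the error term is not zero, so the estimates only approximate the underlying coefficients and come with standard errors, which is what the regression summary reports.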

For our example, we will consider the following set of explanatory features of each listed property:

explanatory_vars = ['host_listings_count', 'bathrooms', 'bedrooms', 'beds', 'guests_included']

Additionally, we are going to derive a new feature of a listing from the amenities variable. Let us construct a variable that takes 1 if the listed property has a pool and 0 otherwise:

def has_pool(a):
    if 'Pool' in a:
        return 1
    else:
        return 0
    
data['pool'] = data['amenities'].apply(has_pool)

Let’s then calculate the logarithmic value of the price:

data["log_price"] = np.log(data["price"] + 0.000001)

Do we have any missing values in our dependent or explanatory variables?

all_model_attributes = ["price"] + explanatory_vars
has_nans = False
for attr in all_model_attributes:
    if data[attr].hasnans:
        has_nans = True
print("Has missing values:", has_nans)

Has missing values: True

As we can see, there are missing values, so let’s remove them before continuing:

# Drop NaN values from model attributes
data = data.dropna(subset=all_model_attributes).copy()

# Check again that there are no NaNs
has_nans = False
for attr in all_model_attributes:
    if data[attr].hasnans:
        has_nans = True
print("Has missing values:", has_nans)

Has missing values: False

To run the model, we can use the spreg module in PySAL, which implements a standard OLS routine but is particularly well suited for regressions on spatial data. Also, although we do not need it for the initial model, let us build a spatial weights matrix that connects every observation to its 8 nearest neighbors. When passed to the regression, it lets us obtain extra spatial diagnostics from the baseline model.

w = weights.KNN.from_dataframe(data, k=8, silence_warnings=True)
w.transform = 'R'
w

<libpysal.weights.distance.KNN at 0x30cd51f10>
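To see what such an object encodes, the following NumPy sketch builds a dense, row-standardized k-nearest-neighbor matrix for four made-up points. It is an illustration only: the function name knn_weights and the toy coordinates are hypothetical, it uses k=2 instead of 8, and PySAL stores weights sparsely rather than as a dense array.

```python
import numpy as np

def knn_weights(coords, k):
    """Dense row-standardized k-nearest-neighbor weights (illustrative only)."""
    coords = np.asarray(coords, dtype=float)
    n = coords.shape[0]
    # Pairwise Euclidean distances between all points
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # an observation is not its own neighbor
    # Column indices of the k closest points for every row
    nearest = np.argsort(dist, axis=1)[:, :k]
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    # Row-standardize so every row sums to one (the 'R' transform)
    return W / W.sum(axis=1, keepdims=True)

toy_coords = [(0, 0), (1, 0), (2, 0), (10, 0)]
W_toy = knn_weights(toy_coords, k=2)
print(W_toy)
```

Note that nearest-neighbor relations are not symmetric: the isolated fourth point has the third point as a neighbor, but not the other way around.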

At this point, we are ready to fit the regression:

m1 = spreg.OLS(data[['log_price']].values, data[explanatory_vars].values, 
                  name_y = 'log_price', name_x = explanatory_vars)

To get a quick glimpse of the results, we can print its summary:

print(m1.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :   log_price                Number of Observations:        5760
Mean dependent var  :      5.1955                Number of Variables   :           6
S.D. dependent var  :      0.9457                Degrees of Freedom    :        5754
R-squared           :      0.4022
Adjusted R-squared  :      0.4016
Sum squared residual:     3079.25                F-statistic           :    774.1034
Sigma-square        :       0.535                Prob(F-statistic)     :           0
S.E. of regression  :       0.732                Log likelihood        :   -6369.483
Sigma-square ML     :       0.535                Akaike info criterion :   12750.967
S.E of regression ML:      0.7312                Schwarz criterion     :   12790.919

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT         4.12252         0.02153       191.44355         0.00000
 host_listings_count        -0.00003         0.00018        -0.14170         0.88732
           bathrooms         0.30297         0.01943        15.59639         0.00000
            bedrooms         0.32200         0.01595        20.18397         0.00000
                beds         0.02355         0.00976         2.41428         0.01580
     guests_included         0.00642         0.00606         1.05975         0.28930
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           9.000

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2    1302841.040           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                5       1079.446           0.0000
Koenker-Bassett test              5         28.614           0.0000
================================ END OF REPORT =====================================

Results are largely unsurprising, but nonetheless reassuring. Both an extra bedroom and an extra bathroom are associated with a price increase of around 0.3 log points, that is, roughly 35-38% (reading the coefficient directly as a percentage, about 30%, understates effects of this size). Holding those constant, an extra bed raises the price by about 2%. Neither the number of guests included nor the total number of listings the host has shows a statistically significant effect on the price.
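In a log-linear model, a coefficient b translates into a proportional price change of exp(b) - 1, which is close to b itself only when b is small. A quick check using coefficients copied from the summary above:

```python
import math

# Coefficients copied from the OLS summary above
coefficients = {"bathrooms": 0.30297, "bedrooms": 0.32200, "beds": 0.02355}

# Exact proportional change in price for a one-unit increase in each variable
pct_change = {name: math.exp(b) - 1 for name, b in coefficients.items()}

for name, pct in pct_change.items():
    print(f"{name:>10}: coefficient {coefficients[name]:.3f} -> {pct:.1%} change in price")
```

For the small coefficient on beds the shortcut and the exact value nearly coincide, whereas for bedrooms and bathrooms the exact effect is several percentage points larger than the raw coefficient suggests.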

Passing a spatial weights object to the regression buys you an extra bit: the summary then reports diagnostics for spatial dependence. In spreg.OLS this means supplying w=w and spat_diag=True; the call above does neither, which is why its summary shows "Weights matrix: None" and contains no spatial block. These diagnostics are a series of statistics that test whether the residuals of the regression are spatially correlated, against the null of a random distribution over space. If the null is rejected, a key assumption of OLS, independently distributed error terms, is violated. Depending on the structure of the spatial pattern, different strategies have been defined within the spatial econometrics literature to deal with it. The main takeaway from the diagnostics for spatial dependence is that there is clear evidence to reject the null of spatial randomness in the residuals, hence an explicitly spatial approach is warranted.
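To give a feel for what "spatially correlated residuals" means, the sketch below computes Moran's I, a common measure of spatial autocorrelation, for two made-up residual patterns on six locations arranged along a line. The function morans_i and the toy data are hypothetical, and the code only computes the statistic; it does not reproduce the inference behind spreg's diagnostics.

```python
import numpy as np

def morans_i(values, W):
    """Moran's I for a vector of values and a weights matrix W."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()                 # deviations from the mean
    n = z.size
    s0 = W.sum()                     # sum of all weights
    return float((n / s0) * (z @ W @ z) / (z @ z))

# Six locations on a line; each is a neighbor of the adjacent ones
n = 6
W_line = np.zeros((n, n))
for i in range(n - 1):
    W_line[i, i + 1] = W_line[i + 1, i] = 1.0
W_line = W_line / W_line.sum(axis=1, keepdims=True)  # row-standardize

clustered = np.array([1, 1, 1, -1, -1, -1])    # similar residuals sit together
alternating = np.array([1, -1, 1, -1, 1, -1])  # neighbors have opposite signs

i_clustered = morans_i(clustered, W_line)
i_alternating = morans_i(alternating, W_line)
print(i_clustered, i_alternating)
```

Clustered residuals give a clearly positive value and alternating residuals a negative one, while spatially random residuals would yield a value close to zero.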

Spatially lagged exogenous regressors (`WX`)#

The first and most straightforward way to introduce space is by “spatially lagging” one of the explanatory variables. Mathematically, this can be expressed as follows:

\[ \ln(P_i) = \alpha + \beta X_i + \delta \sum_j w_{ij} X'_j + \epsilon_i \]

where \(\ln(P_i)\) is our dependent variable (logarithmic price), \(X'\) is a subset of the explanatory variables \(X\) (although it could encompass all of them), and \(w_{ij}\) is the \(ij\)-th cell of a spatial weights matrix \(W\). Because \(W\) assigns non-zero values only to spatial neighbors, if \(W\) is row-standardized (customary in this context), then \(\sum_j w_{ij} X'_j\) captures the average value of \(X'\) in the surroundings of location \(i\). This is what we call the spatial lag of \(X'\). Also, since it is a spatial transformation of an explanatory variable, the standard estimation approach (OLS) is sufficient: spatially lagging the variables does not violate any of the assumptions on which OLS relies.

Usually, we will want to spatially lag variables that we think may affect the price of a house in a given location. For example, one could think that pools represent a visual amenity. If that is the case, then listed properties surrounded by other properties with pools might, everything else equal, be more expensive. To calculate the number of pools surrounding each property, we can build an alternative weights matrix that we do not row-standardize:

# Create weights (binary, not row-standardized)
w_pool = weights.KNN.from_dataframe(data, k=8, silence_warnings=True)
# Assign spatial lag based on the pool values
lagged = data.assign(w_pool=weights.spatial_lag.lag_spatial(w_pool, data['pool'].values))
lagged.head()

	id	listing_url	scrape_id	last_scraped	name	summary	space	description	experiences_offered	neighborhood_overview	...	cancellation_policy	require_guest_profile_picture	require_guest_phone_verification	calculated_host_listings_count	reviews_per_month	geometry	pool	log_price	w_pool
0	72635	https://www.airbnb.com/rooms/72635	20151107173015	2015-11-08	3 Private Bedrooms, SW Austin	Conveniently located 10-15 from downtown in SW...	We have three spare bedrooms, each with a quee...	Conveniently located 10-15 from downtown in SW...	none	Location and convenience are key. Easy access...	...	moderate	f	f	1	0.02	POINT (-97.88431 30.20282)	0	5.703782	2.0
1	5386323	https://www.airbnb.com/rooms/5386323	20151107173015	2015-11-07	Cricket Trailer	Rent this cool concept trailer that has everyt...	Rental arrangements for this trailer allows yo...	Rent this cool concept trailer that has everyt...	none	We're talking about wherever you'd like in the...	...	moderate	f	f	1	NaN	POINT (-97.90068 30.19941)	0	4.595120	1.0
2	8826517	https://www.airbnb.com/rooms/8826517	20151107173015	2015-11-07	Private room 1 in South Austin	Upstairs, private, 12ft x 13 1/2ft room. Priv...	NaN	Upstairs, private, 12ft x 13 1/2ft room. Priv...	none	NaN	...	flexible	f	f	2	NaN	POINT (-97.86448 30.1685)	1	4.605170	3.0
3	8828616	https://www.airbnb.com/rooms/8828616	20151107173015	2015-11-08	Private room 2 in South Austin	Upstairs, private, 11ft x 13 1/2ft room. Priv...	NaN	Upstairs, private, 11ft x 13 1/2ft room. Priv...	none	NaN	...	flexible	f	f	2	NaN	POINT (-97.86487 30.16862)	1	4.605170	3.0
4	8536913	https://www.airbnb.com/rooms/8536913	20151107173015	2015-11-08	Brand-New 3BR Austin Home	Brand-new 3BR/2BA Austin home with landscaped ...	Feel instantly at home at our brand new 3BR/2B...	Brand-new 3BR/2BA Austin home with landscaped ...	none	Entertainment and activities are plentiful her...	...	strict	f	f	2	NaN	POINT (-97.88832 30.16943)	0	6.395262	2.0

5 rows × 97 columns
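The difference between a binary and a row-standardized weights matrix is easy to see on a toy example. In the sketch below (four made-up listings with a hand-written neighbor matrix), multiplying the binary matrix by the pool indicator counts the neighbors with a pool, while the row-standardized version gives the share of neighbors with a pool:

```python
import numpy as np

# Binary neighbor matrix for four made-up listings:
# a 1 in row i, column j means that j is a neighbor of i
W_binary = np.array([[0, 1, 1, 0],
                     [1, 0, 1, 0],
                     [0, 1, 0, 1],
                     [0, 1, 1, 0]], dtype=float)

pool = np.array([1, 0, 1, 0], dtype=float)  # 1 if the listing has a pool

# Binary weights: the spatial lag is the NUMBER of neighbors with a pool
count_lag = W_binary @ pool

# Row-standardized weights: the spatial lag is the SHARE of neighbors with a pool
W_row = W_binary / W_binary.sum(axis=1, keepdims=True)
share_lag = W_row @ pool

print(count_lag)
print(share_lag)
```

With k nearest neighbors every row has the same number of neighbors, so the two versions differ only by the constant factor k.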

And now we can run the model, which has the same setup as m1, with the exception that it includes the pool indicator and the number of Airbnb properties with pools surrounding each house:

# Add pool to the explanatory variables
extended_vars = explanatory_vars + ["pool", "w_pool"]

m2 = spreg.OLS(lagged[['log_price']].values, lagged[extended_vars].values, 
               name_y = 'log_price', name_x = extended_vars)

print(m2.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set            :     unknown
Weights matrix      :        None
Dependent Variable  :   log_price                Number of Observations:        5760
Mean dependent var  :      5.1955                Number of Variables   :           8
S.D. dependent var  :      0.9457                Degrees of Freedom    :        5752
R-squared           :      0.4049
Adjusted R-squared  :      0.4042
Sum squared residual:     3065.23                F-statistic           :    559.0276
Sigma-square        :       0.533                Prob(F-statistic)     :           0
S.E. of regression  :       0.730                Log likelihood        :   -6356.336
Sigma-square ML     :       0.532                Akaike info criterion :   12728.672
S.E of regression ML:      0.7295                Schwarz criterion     :   12781.941

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     t-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT         4.07296         0.02366       172.11254         0.00000
 host_listings_count        -0.00001         0.00018        -0.03278         0.97385
           bathrooms         0.29411         0.01950        15.08594         0.00000
            bedrooms         0.32937         0.01599        20.59915         0.00000
                beds         0.02490         0.00974         2.55633         0.01060
     guests_included         0.00798         0.00606         1.31753         0.18771
                pool         0.04048         0.02691         1.50405         0.13262
              w_pool         0.01597         0.00500         3.19322         0.00141
------------------------------------------------------------------------------------

REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER           9.697

TEST ON NORMALITY OF ERRORS
TEST                             DF        VALUE           PROB
Jarque-Bera                       2    1397854.351           0.0000

DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST                             DF        VALUE           PROB
Breusch-Pagan test                7       1827.305           0.0000
Koenker-Bassett test              7         46.806           0.0000
================================ END OF REPORT =====================================

Results are largely consistent with the original model. The number of pools surrounding a property has a small but statistically significant positive association with its price (a coefficient of about 0.016 per neighboring pool, with a p-value of about 0.001), whereas having a pool in the property itself is not significant in this specification (p-value of about 0.13). The neighborhood estimate should be read with caution: our dataset only allows us to capture the number of pools in other Airbnb properties, which is not necessarily a good proxy for the number of pools in the immediate surroundings of a given property, and the variable may also pick up other characteristics of areas where pools are common.

Spatially lagged endogenous regressors (`WY`)#

In a similar way to how we have included the spatial lag, one could think the prices of houses surrounding a given property also enter its own price function. In math terms, this implies the following:

\[ \ln(P_i) = \alpha + \lambda \sum_j w_{ij} \ln(P_j) + \beta X_i + \epsilon_i \]

This is essentially what we call a spatial lag model in spatial econometrics. Two calls for caution:

Unlike before, this specification does violate some of the assumptions on which OLS relies. In particular, it includes an endogenous variable, the spatial lag of the dependent variable \(\sum_j w_{ij} \ln(P_j)\), on the right-hand side. This means we need a different estimation method to obtain reliable coefficients. The technical details go well beyond the scope of this tutorial, but we can offload them to PySAL and use the GM_Lag class, which estimates this model by spatial two-stage least squares, using spatial lags of the explanatory variables as instruments.
A more conceptual gotcha: you might be tempted to read the equation above as the effect of the price in neighboring locations \(j\) on that of location \(i\). This is not quite the right interpretation. Instead, we need to realize this is all assumed to be a “joint decision”: rather than some houses setting their price first and that having a subsequent effect on others, what the equation models is an interdependent process by which each owner sets her own price taking into account the prices that will be set in neighboring locations. This might read like a technical subtlety and, to some extent, it is; but it is important to keep it in mind when interpreting the results.
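The “joint decision” reading can be made concrete by solving the lag equation for all prices at once. In matrix form, \(y = \lambda W y + X\beta + \epsilon\) rearranges to \(y = (I - \lambda W)^{-1}(X\beta + \epsilon)\). The sketch below does this for three made-up locations on a line; the values of \(\lambda\), \(\beta\) and \(x\) are arbitrary assumptions, and the error term is set to zero to keep the arithmetic visible.

```python
import numpy as np

lam = 0.5    # assumed spatial lag coefficient (lambda)
beta = 2.0   # assumed coefficient on the single explanatory variable

# Row-standardized weights for three locations on a line (0 - 1 - 2)
W = np.array([[0.0, 1.0, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 1.0, 0.0]])

# Only the first location has a non-zero value of x
x = np.array([1.0, 0.0, 0.0])

# Solve (I - lam * W) y = beta * x for all locations simultaneously
y = np.linalg.solve(np.eye(3) - lam * W, beta * x)
print(y)
```

Although only the first location has a non-zero \(x\), all three outcomes are non-zero, and the first one exceeds \(\beta\) because of the feedback through its neighbors. This is the sense in which outcomes are determined jointly rather than sequentially.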

Let us see how you would run this using PySAL:

variables = explanatory_vars + ["pool"]
m3 = spreg.GM_Lag(data[['log_price']].values, data[variables].values, 
                  w=w,
                  name_y = 'ln(price)', name_x = variables)

print(m3.summary)

REGRESSION RESULTS
------------------

SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set            :     unknown
Weights matrix      :     unknown
Dependent Variable  :   ln(price)                Number of Observations:        5760
Mean dependent var  :      5.1955                Number of Variables   :           8
S.D. dependent var  :      0.9457                Degrees of Freedom    :        5752
Pseudo R-squared    :      0.4332
Spatial Pseudo R-squared:  0.4046

------------------------------------------------------------------------------------
            Variable     Coefficient       Std.Error     z-Statistic     Probability
------------------------------------------------------------------------------------
            CONSTANT         3.39010         0.15810        21.44315         0.00000
 host_listings_count        -0.00008         0.00018        -0.46182         0.64421
           bathrooms         0.28200         0.01922        14.66960         0.00000
            bedrooms         0.32801         0.01559        21.04132         0.00000
                beds         0.02237         0.00951         2.35201         0.01867
     guests_included         0.00563         0.00592         0.95062         0.34180
                pool         0.09087         0.02165         4.19670         0.00003
         W_ln(price)         0.14158         0.03133         4.51898         0.00001
------------------------------------------------------------------------------------
Instrumented: W_ln(price)
Instruments: W_bathrooms, W_bedrooms, W_beds, W_guests_included,
             W_host_listings_count, W_pool

DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST                              DF         VALUE           PROB
Anselin-Kelejian Test             1         53.311           0.0000

SPATIAL LAG MODEL IMPACTS
Impacts computed using the 'simple' method.
            Variable         Direct        Indirect          Total
 host_listings_count        -0.0001         -0.0000         -0.0001
           bathrooms         0.2820          0.0465          0.3285
            bedrooms         0.3280          0.0541          0.3821
                beds         0.0224          0.0037          0.0261
     guests_included         0.0056          0.0009          0.0066
                pool         0.0909          0.0150          0.1059
================================ END OF REPORT =====================================

As we can see, results are again very similar for all the other variables. The estimate of the spatial lag of price is also clearly statistically significant (its p-value, reported in the “Probability” column, is well below 0.05). This is evidence of spatial interaction between property owners when they set their prices.
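The “Spatial lag model impacts” block of the summary can be checked by hand. The reported numbers are consistent with treating the coefficient as the direct impact, \(\beta / (1 - \lambda)\) as the total impact, and the difference as the indirect (spillover) impact. The sketch below reproduces that arithmetic from the printed coefficients; it is a consistency check against the report, not a description of spreg's internal code.

```python
# Coefficients copied from the GM_Lag summary above
lam = 0.14158  # coefficient on W_ln(price)
betas = {"bathrooms": 0.28200, "bedrooms": 0.32801, "beds": 0.02237, "pool": 0.09087}

impacts = {}
for name, beta in betas.items():
    direct = beta                  # effect of the listing's own characteristic
    total = beta / (1 - lam)       # effect once spatial feedback is included
    indirect = total - direct      # part attributable to spillovers
    impacts[name] = (direct, indirect, total)
    print(f"{name:>10}  {direct:.4f}  {indirect:.4f}  {total:.4f}")
```

The printed values match the Direct, Indirect and Total columns of the report to four decimals, which shows that with a lag coefficient of about 0.14 the spillover adds roughly one sixth to each direct effect.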

Prediction performance of spatial models#

Even if we are not interested in the interpretation of the model to learn more about how alternative factors determine the price of an Airbnb property, spatial econometrics can be useful. In a purely predictive setting, the use of explicitly spatial models is likely to improve accuracy in cases where space plays a key role in the data generating process. To have a quick look at this issue, we can use the mean squared error (MSE), a standard metric of accuracy in the machine learning literature, to evaluate whether explicitly spatial models are better than traditional, non-spatial ones (the smaller the value, the better):

from sklearn.metrics import mean_squared_error as mse

mses = pd.Series({
    'OLS': mse(data["log_price"], m1.predy.flatten()),
    'OLS+W': mse(data["log_price"], m2.predy.flatten()),
    'Lag': mse(data["log_price"], m3.predy_e.flatten()),
})
mses.sort_values()

OLS+W    0.532157
Lag      0.532403
OLS      0.534592
dtype: float64

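We can see that both spatial specifications reduce the MSE only slightly relative to the baseline OLS (0.5346): including the number of surrounding pools gives 0.5322, and including the spatial lag of price gives 0.5324. In this comparison, the lag model therefore does not improve on the model with surrounding pools. Note also that these errors are computed on the same data used to fit the models, so they measure in-sample fit rather than out-of-sample predictive accuracy.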
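For reference, the MSE is simply the average of the squared differences between observed and predicted values. The sketch below computes it by hand on made-up numbers (the function name mse_manual is hypothetical). It also flattens both inputs first: predictions stored as an (n, 1) column would otherwise broadcast against an (n,) vector into an (n, n) array when subtracted in NumPy.

```python
import numpy as np

def mse_manual(y_true, y_pred):
    """Mean squared error after flattening both inputs to one dimension."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    return float(np.mean((y_true - y_pred) ** 2))

# Made-up observed log prices and predictions stored as a column vector
observed = np.array([5.0, 4.6, 6.4])
predicted = np.array([[5.2], [4.6], [6.0]])

error = mse_manual(observed, predicted)
print(error)

# Without flattening, broadcasting silently produces a 3 x 3 array of differences
unflattened_shape = (observed - predicted).shape
print(unflattened_shape)
```

This is the reason for calling flatten() on the prediction arrays before comparing them with the observed values.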

Where to go next?#

If you are interested in learning more about spatial regression, we recommend reading Chapter 11 of the book “Geographic Data Science with Python” by Sergio J. Rey, Dani Arribas-Bel and Levi J. Wolf.
