Tutorial 3: Spatial Regression in Python#
Attribution
Standing on the shoulders of giants: This tutorial is based on excellent open source materials developed by Daniel Arribas-Bel (Uni. Liverpool), available here (licensed under Creative Commons BY-NC-SA). Inspiration is also drawn from Chapter 11 of the forthcoming book “Geographic Data Science with Python” by Rey, Arribas-Bel & Wolf, which you can access from here (licensed under Creative Commons BY-NC-ND).
This notebook offers a brief and gentle introduction to spatial econometrics in Python. To do that, we will use a set of Austin properties listed on Airbnb.
The core idea of spatial econometrics is to introduce a formal representation of space into the statistical framework for regression. This can be done in many ways: by including predictors based on space (e.g. distance to relevant features), by splitting the datasets into subsets that map into different geographical regions (e.g. spatial regimes), by exploiting close distance to other observations to borrow information in the estimation (e.g. kriging), or by introducing variables that put in relation their value at a given location with those in nearby locations, to give a few examples. Some of these approaches can be implemented with standard non-spatial techniques, while others require bespoke models that can deal with the issues introduced. In this short tutorial, we will focus on the latter group. In particular, we will introduce some of the most commonly used methods in the field of spatial econometrics.
The example we will use to demonstrate this draws on hedonic house price modelling. This is a well-established methodology, developed by Rosen (1974), that is capable of recovering the marginal willingness to pay for goods or services that are not traded in the market. In other words, it allows us to put an implicit price on things such as living close to a park or in a neighborhood with good air quality. In addition, since hedonic models are based on linear regression, the technique can also be used to obtain predictions of house prices.
Prepare data#
Before anything, let us load up the libraries we will use:
from pysal.model import spreg
from pysal.lib import weights
from scipy import stats
import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sns
import osmnx as ox
sns.set(style="whitegrid")
Let’s read the Airbnb data and OSM data for Austin, Texas:
# Read listings
fp = "data/listings.csv"
data = pd.read_csv(fp)
data.columns
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
'space', 'description', 'experiences_offered', 'neighborhood_overview',
'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
'host_location', 'host_about', 'host_response_time',
'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
'host_listings_count', 'host_total_listings_count',
'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
'street', 'neighbourhood', 'neighbourhood_cleansed',
'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
'smart_location', 'country_code', 'country', 'latitude', 'longitude',
'is_location_exact', 'property_type', 'room_type', 'accommodates',
'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet',
'price', 'weekly_price', 'monthly_price', 'security_deposit',
'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
'maximum_nights', 'calendar_updated', 'has_availability',
'availability_30', 'availability_60', 'availability_90',
'availability_365', 'calendar_last_scraped', 'number_of_reviews',
'first_review', 'last_review', 'review_scores_rating',
'review_scores_accuracy', 'review_scores_cleanliness',
'review_scores_checkin', 'review_scores_communication',
'review_scores_location', 'review_scores_value', 'requires_license',
'license', 'jurisdiction_names', 'instant_bookable',
'cancellation_policy', 'require_guest_profile_picture',
'require_guest_phone_verification', 'calculated_host_listings_count',
'reviews_per_month'],
dtype='object')
data.head(2)
id | listing_url | scrape_id | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | ... | review_scores_value | requires_license | license | jurisdiction_names | instant_bookable | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | calculated_host_listings_count | reviews_per_month | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 72635 | https://www.airbnb.com/rooms/72635 | 20151107173015 | 2015-11-08 | 3 Private Bedrooms, SW Austin | Conveniently located 10-15 from downtown in SW... | We have three spare bedrooms, each with a quee... | Conveniently located 10-15 from downtown in SW... | none | Location and convenience are key. Easy access... | ... | 10.0 | f | NaN | NaN | f | moderate | f | f | 1 | 0.02 |
1 | 5386323 | https://www.airbnb.com/rooms/5386323 | 20151107173015 | 2015-11-07 | Cricket Trailer | Rent this cool concept trailer that has everyt... | Rental arrangements for this trailer allows yo... | Rent this cool concept trailer that has everyt... | none | We're talking about wherever you'd like in the... | ... | NaN | f | NaN | NaN | f | moderate | f | f | 1 | NaN |
2 rows × 92 columns
# Read OSM data - get administrative boundaries
# define the place query
query = {'city': 'Austin'}
# get the boundaries of the place (add additional buffer around the query)
boundaries = ox.geocode_to_gdf(query)
# Add a bit of buffer around (in decimal degrees)
# 0.05 is approximately 5 kilometers
boundaries["geometry"] = boundaries.buffer(0.05)
# Let's check the boundaries on a map
boundaries.explore()
/var/folders/f7/rhmqxfmx40s4yv9bhh7skq4m0000gp/T/ipykernel_80495/243181281.py:11: UserWarning: Geometry is in a geographic CRS. Results from 'buffer' are likely incorrect. Use 'GeoSeries.to_crs()' to re-project geometries to a projected CRS before this operation.
boundaries["geometry"] = boundaries.buffer(0.05)
Let’s convert the Airbnb data into a GeoDataFrame based on the longitude and latitude columns and filter the data geographically based on the Austin boundaries:
# Create a GeoDataFrame
data["geometry"] = gpd.points_from_xy(data["longitude"], data["latitude"])
data = gpd.GeoDataFrame(data, crs="epsg:4326")
# Filter data geographically
data = data.sjoin(boundaries[["geometry"]])
data = data.reset_index(drop=True)
# Check the first rows
data.head()
id | listing_url | scrape_id | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | ... | license | jurisdiction_names | instant_bookable | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | calculated_host_listings_count | reviews_per_month | geometry | index_right | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 72635 | https://www.airbnb.com/rooms/72635 | 20151107173015 | 2015-11-08 | 3 Private Bedrooms, SW Austin | Conveniently located 10-15 from downtown in SW... | We have three spare bedrooms, each with a quee... | Conveniently located 10-15 from downtown in SW... | none | Location and convenience are key. Easy access... | ... | NaN | NaN | f | moderate | f | f | 1 | 0.02 | POINT (-97.88431 30.20282) | 0 |
1 | 5386323 | https://www.airbnb.com/rooms/5386323 | 20151107173015 | 2015-11-07 | Cricket Trailer | Rent this cool concept trailer that has everyt... | Rental arrangements for this trailer allows yo... | Rent this cool concept trailer that has everyt... | none | We're talking about wherever you'd like in the... | ... | NaN | NaN | f | moderate | f | f | 1 | NaN | POINT (-97.90068 30.19941) | 0 |
2 | 8826517 | https://www.airbnb.com/rooms/8826517 | 20151107173015 | 2015-11-07 | Private room 1 in South Austin | Upstairs, private, 12ft x 13 1/2ft room. Priv... | NaN | Upstairs, private, 12ft x 13 1/2ft room. Priv... | none | NaN | ... | NaN | NaN | f | flexible | f | f | 2 | NaN | POINT (-97.86448 30.1685) | 0 |
3 | 8828616 | https://www.airbnb.com/rooms/8828616 | 20151107173015 | 2015-11-08 | Private room 2 in South Austin | Upstairs, private, 11ft x 13 1/2ft room. Priv... | NaN | Upstairs, private, 11ft x 13 1/2ft room. Priv... | none | NaN | ... | NaN | NaN | f | flexible | f | f | 2 | NaN | POINT (-97.86487 30.16862) | 0 |
4 | 8536913 | https://www.airbnb.com/rooms/8536913 | 20151107173015 | 2015-11-08 | Brand-New 3BR Austin Home | Brand-new 3BR/2BA Austin home with landscaped ... | Feel instantly at home at our brand new 3BR/2B... | Brand-new 3BR/2BA Austin home with landscaped ... | none | Entertainment and activities are plentiful her... | ... | NaN | NaN | f | strict | f | f | 2 | NaN | POINT (-97.88832 30.16943) | 0 |
5 rows × 94 columns
One of the most interesting attributes in our data is naturally price:
data["price"].head()
0 $300.00
1 $99.00
2 $100.00
3 $100.00
4 $599.00
Name: price, dtype: object
As we can see, our values are represented as strings with a dollar sign. Before we can take a logarithmic value out of them, we need to remove the dollar sign and convert the values to floats:
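# Remove the dollar sign and the thousands separator (comma, e.g. 1,000.00) and convert to float
# regex=False makes sure "$" is treated as a literal character, not as a regex anchor
data["price"] = data["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)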
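As a quick sanity check of the cleaning step, here is a minimal sketch on a toy Series. The price strings are made up to mimic the format of the listings file; only the two `str.replace` calls mirror the cell above.

```python
import pandas as pd

# Toy price strings in the same format as the listings file (values are made up)
raw = pd.Series(["$300.00", "$1,250.00", "$99.00"])

# regex=False treats "$" and "," as literal characters
clean = (
    raw.str.replace("$", "", regex=False)
       .str.replace(",", "", regex=False)
       .astype(float)
)
print(clean.tolist())
```

Passing `regex=False` explicitly keeps the behaviour the same across pandas versions, since `$` has a special meaning in regular expressions.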
# Here the tooltip parameter specifies which attributes are shown when hovering on top of the points
# The vmax parameter specifies the maximum value for the colormap (here, all 1000 dollars and above are combined)
# To save RAM, we only visualize a random sample of 200 observations
data.sample(n=200).explore(column="price", cmap="Reds", scheme="quantiles", k=4, tooltip=["name", "price"], vmax=1000, tiles="CartoDB Positron")
Baseline (nonspatial) regression#
Before introducing explicitly spatial methods, we will run a simple linear regression model. On the one hand, this will allow us to set out the main principles of hedonic modeling and how to interpret the coefficients, which is useful because the spatial models build on them; on the other hand, it will provide a baseline model that we can use to evaluate how meaningful the spatial extensions are.
Essentially, the core of a linear regression is to explain a given variable -the price of a listing \(i\) on AirBnb (\(P_i\))- as a linear function of a set of other characteristics we will collectively call \(X_i\):
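Written out with the log price and the error term \(\epsilon_i\) discussed next, one form consistent with this description is the following, where \(\alpha\) (intercept) and \(\beta\) (coefficient vector) are notation assumed here:

\[
\ln(P_i) = \alpha + \beta X_i + \epsilon_i
\]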
For several reasons, it is common practice to introduce the price in logarithms, so we will do so here. Additionally, since this is a probabilistic model, we add an error term \(\epsilon_i\) that is assumed to be well-behaved (i.i.d. as a normal).
For our example, we will consider the following set of explanatory features of each listed property:
explanatory_vars = ['host_listings_count', 'bathrooms', 'bedrooms', 'beds', 'guests_included']
Additionally, we are going to derive a new feature of a listing from the amenities variable. Let us construct a variable that takes 1 if the listed property has a pool and 0 otherwise:
def has_pool(a):
    # Return 1 if the amenities string mentions a pool, 0 otherwise
    if 'Pool' in a:
        return 1
    else:
        return 0
data['pool'] = data['amenities'].apply(has_pool)
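The same flag can also be derived without a helper function. The sketch below uses made-up amenities strings (including a missing value, which the function above would not handle) and pandas' vectorised string matching. It is an alternative for illustration, not the approach used in the rest of the tutorial.

```python
import pandas as pd

# Made-up amenities strings, including a missing value
amenities = pd.Series(["{TV,Pool,Kitchen}", "{Wifi}", None])

# na=False maps missing entries to False instead of propagating them
pool = amenities.str.contains("Pool", regex=False, na=False).astype(int)
print(pool.tolist())
```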
Let’s then calculate the logarithmic value of the price:
data["log_price"] = np.log(data["price"] + 0.000001)
Do we have any missing values in our dependent or explanatory variables?
all_model_attributes = ["price"] + explanatory_vars
has_nans = False
for attr in all_model_attributes:
    if data[attr].hasnans:
        has_nans = True
print("Has missing values:", has_nans)
Has missing values: True
Okay, as we can see there are missing values, hence, let’s remove them before continuing:
# Drop NaN values from model attributes
data = data.dropna(subset=all_model_attributes).copy()
# Check again that there are no NaNs
has_nans = False
for attr in all_model_attributes:
    if data[attr].hasnans:
        has_nans = True
print("Has missing values:", has_nans)
Has missing values: False
To run the model, we can use the spreg module in PySAL, which implements a standard OLS routine, but is particularly well suited for regressions on spatial data. Also, although for the initial model we do not need it, let us build a spatial weights matrix that connects every observation to its 8 nearest neighbors. This will allow us to get extra diagnostics from the baseline model.
w = weights.KNN.from_dataframe(data, k=8, silence_warnings=True)
w.transform = 'R'
w
<libpysal.weights.distance.KNN at 0x30cd51f10>
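To make concrete what a row-standardized k-nearest-neighbor matrix contains, here is a small NumPy-only sketch on made-up coordinates. It is not how libpysal builds or stores its weights, and the helper name `knn_weights` is hypothetical; it only illustrates the idea behind `KNN` followed by `w.transform = 'R'`.

```python
import numpy as np

def knn_weights(coords, k):
    """Dense n x n matrix with 1 for each of the k nearest neighbours of every point."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)             # a point is not its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]  # indices of the k closest points
    W = np.zeros(dist.shape)
    rows = np.repeat(np.arange(len(coords)), k)
    W[rows, nearest.ravel()] = 1.0
    return W

# Five made-up points: three close together, two far away
coords = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
W = knn_weights(coords, k=2)

# Row-standardise: every row now sums to 1, so W_row @ x is a neighbourhood average
W_row = W / W.sum(axis=1, keepdims=True)
print(W_row)
```

Note that k-nearest-neighbor relations are not necessarily symmetric: a point can be among the neighbors of another without the reverse being true.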
At this point, we are ready to fit the regression:
m1 = spreg.OLS(data[['log_price']].values, data[explanatory_vars].values,
name_y = 'log_price', name_x = explanatory_vars)
To get a quick glimpse of the results, we can print its summary:
print(m1.summary)
REGRESSION RESULTS
------------------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set : unknown
Weights matrix : None
Dependent Variable : log_price Number of Observations: 5760
Mean dependent var : 5.1955 Number of Variables : 6
S.D. dependent var : 0.9457 Degrees of Freedom : 5754
R-squared : 0.4022
Adjusted R-squared : 0.4016
Sum squared residual: 3079.25 F-statistic : 774.1034
Sigma-square : 0.535 Prob(F-statistic) : 0
S.E. of regression : 0.732 Log likelihood : -6369.483
Sigma-square ML : 0.535 Akaike info criterion : 12750.967
S.E of regression ML: 0.7312 Schwarz criterion : 12790.919
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
CONSTANT 4.12252 0.02153 191.44355 0.00000
host_listings_count -0.00003 0.00018 -0.14170 0.88732
bathrooms 0.30297 0.01943 15.59639 0.00000
bedrooms 0.32200 0.01595 20.18397 0.00000
beds 0.02355 0.00976 2.41428 0.01580
guests_included 0.00642 0.00606 1.05975 0.28930
------------------------------------------------------------------------------------
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 9.000
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 1302841.040 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 5 1079.446 0.0000
Koenker-Bassett test 5 28.614 0.0000
================================ END OF REPORT =====================================
Results are largely unsurprising, but nonetheless reassuring. Both an extra bedroom and an extra bathroom raise the price by around 0.3 log points, i.e. roughly 35-38% once the coefficients are converted with \(e^{\beta} - 1\). Accounting for those, an extra bed adds about 2%. Neither the number of guests included nor the total number of listings the host has shows a statistically significant effect on the final price.
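Because the dependent variable is in logarithms, coefficients are only approximately percentage changes. A minimal sketch of the exact conversion, using coefficients copied from the m1 summary above:

```python
import numpy as np

# Coefficients copied from the m1 summary above
coefs = {"bathrooms": 0.30297, "bedrooms": 0.32200, "beds": 0.02355}

# In a log-linear model a one-unit increase multiplies the price by exp(beta)
pct_change = {name: (np.exp(b) - 1) * 100 for name, b in coefs.items()}
for name, pct in pct_change.items():
    print(f"{name}: {pct:.1f}%")
```

For small coefficients such as the one on beds, the approximation and the exact figure nearly coincide; the gap grows with the size of the coefficient.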
Including a spatial weights object in the regression buys you an extra bit: the summary then provides results on the diagnostics for spatial dependence. Note that the call above does not pass the weights (the summary reports `Weights matrix: None`); to obtain these diagnostics, `spreg.OLS` needs to receive `w=w` together with `spat_diag=True`. The diagnostics are a series of statistics that test whether the residuals of the regression are spatially correlated, against the null of a random distribution over space. If the null is rejected, a key assumption of OLS, independently distributed error terms, is violated. Depending on the structure of the spatial pattern, different strategies have been defined within the spatial econometrics literature to deal with it. The main summary from the diagnostics for spatial dependence is that there is clear evidence to reject the null of spatial randomness in the residuals, hence an explicitly spatial approach is warranted.
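To give a feel for what such a statistic measures, below is a NumPy-only sketch of Moran's I, a common measure of spatial autocorrelation. The data and the helper `morans_i` are made up for illustration; this is not spreg's implementation and it skips the inference step (no p-values).

```python
import numpy as np

def morans_i(values, W):
    """Moran's I for a 1-D array and a dense spatial weights matrix W."""
    z = values - values.mean()
    return (len(values) / W.sum()) * (z @ W @ z) / (z @ z)

# Made-up example: six locations on a line, each linked to its immediate neighbours
n = 6
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0

clustered = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])    # similar values side by side
alternating = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])  # neighbours always differ

print(morans_i(clustered, W), morans_i(alternating, W))
```

The clustered pattern yields a positive value and the alternating one a negative value. Applied to regression residuals, a clearly positive value would suggest that nearby observations share similar over- or under-predictions.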
Spatially lagged exogenous regressors (WX)#
The first and most straightforward way to introduce space is by “spatially lagging” one of the explanatory variables. Mathematically, this can be expressed as follows:
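A form consistent with the definitions that follow is given here; \(\alpha\), \(\beta\) and the lag coefficient \(\delta\) are assumed notation, and the sum runs over the neighbors \(j\) of location \(i\):

\[
\ln(P_i) = \alpha + \beta X_i + \delta \sum_j w_{ij} X'_j + \epsilon_i
\]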
where \(\ln(P_i)\) is our dependent variable (logarithmic price), \(X'_i\) is a subset of \(X_i\), although it could encompass all of the explanatory variables, and \(w_{ij}\) is the \(ij\)-th cell of a spatial weights matrix \(W\). Because \(W\) assigns non-zero values only to spatial neighbors, if \(W\) is row-standardized (customary in this context), then \(\sum_j w_{ij} X'_j\) captures the average value of \(X'\) in the surroundings of location \(i\). This is what we call the spatial lag of \(X'_i\). Also, since it is a spatial transformation of an explanatory variable, the standard estimation approach, OLS, is sufficient: spatially lagging the variables does not violate any of the assumptions on which OLS relies.
Usually, we will want to spatially lag variables that we think may affect the price of a house in a given location. For example, one could think that pools represent a visual amenity. If that is the case, then listed properties surrounded by other properties with pools might, everything else equal, be more expensive. To calculate the number of pools surrounding each property, we can build an alternative weights matrix that we do not row-standardize:
# Create weights (binary, not row-standardized)
w_pool = weights.KNN.from_dataframe(data, k=8, silence_warnings=True)
# Assign spatial lag based on the pool values
lagged = data.assign(w_pool=weights.spatial_lag.lag_spatial(w_pool, data['pool'].values))
lagged.head()
id | listing_url | scrape_id | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | ... | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | calculated_host_listings_count | reviews_per_month | geometry | index_right | pool | log_price | w_pool | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 72635 | https://www.airbnb.com/rooms/72635 | 20151107173015 | 2015-11-08 | 3 Private Bedrooms, SW Austin | Conveniently located 10-15 from downtown in SW... | We have three spare bedrooms, each with a quee... | Conveniently located 10-15 from downtown in SW... | none | Location and convenience are key. Easy access... | ... | moderate | f | f | 1 | 0.02 | POINT (-97.88431 30.20282) | 0 | 0 | 5.703782 | 2.0 |
1 | 5386323 | https://www.airbnb.com/rooms/5386323 | 20151107173015 | 2015-11-07 | Cricket Trailer | Rent this cool concept trailer that has everyt... | Rental arrangements for this trailer allows yo... | Rent this cool concept trailer that has everyt... | none | We're talking about wherever you'd like in the... | ... | moderate | f | f | 1 | NaN | POINT (-97.90068 30.19941) | 0 | 0 | 4.595120 | 1.0 |
2 | 8826517 | https://www.airbnb.com/rooms/8826517 | 20151107173015 | 2015-11-07 | Private room 1 in South Austin | Upstairs, private, 12ft x 13 1/2ft room. Priv... | NaN | Upstairs, private, 12ft x 13 1/2ft room. Priv... | none | NaN | ... | flexible | f | f | 2 | NaN | POINT (-97.86448 30.1685) | 0 | 1 | 4.605170 | 3.0 |
3 | 8828616 | https://www.airbnb.com/rooms/8828616 | 20151107173015 | 2015-11-08 | Private room 2 in South Austin | Upstairs, private, 11ft x 13 1/2ft room. Priv... | NaN | Upstairs, private, 11ft x 13 1/2ft room. Priv... | none | NaN | ... | flexible | f | f | 2 | NaN | POINT (-97.86487 30.16862) | 0 | 1 | 4.605170 | 3.0 |
4 | 8536913 | https://www.airbnb.com/rooms/8536913 | 20151107173015 | 2015-11-08 | Brand-New 3BR Austin Home | Brand-new 3BR/2BA Austin home with landscaped ... | Feel instantly at home at our brand new 3BR/2B... | Brand-new 3BR/2BA Austin home with landscaped ... | none | Entertainment and activities are plentiful her... | ... | strict | f | f | 2 | NaN | POINT (-97.88832 30.16943) | 0 | 0 | 6.395262 | 2.0 |
5 rows × 97 columns
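The difference between binary and row-standardized weights is easy to see on a toy example. The sketch below uses a made-up four-location neighbor matrix and pool indicator: with binary weights the spatial lag is a count of neighbors with a pool (as in `w_pool` above), while with row-standardized weights it becomes the share of neighbors with a pool.

```python
import numpy as np

# Made-up binary weights for four locations (1 = neighbour) and a pool indicator
W = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
pool = np.array([1, 0, 1, 0], dtype=float)

count_lag = W @ pool                                  # number of neighbours with a pool
mean_lag = (W / W.sum(axis=1, keepdims=True)) @ pool  # share of neighbours with a pool

print(count_lag)
print(mean_lag)
```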
And now we can run the model, which has the same setup as m1, with the exception that it includes the number of Airbnb properties with pools surrounding each house:
# Add pool to the explanatory variables
extended_vars = explanatory_vars + ["pool", "w_pool"]
m2 = spreg.OLS(lagged[['log_price']].values, lagged[extended_vars].values,
name_y = 'log_price', name_x = extended_vars)
print(m2.summary)
REGRESSION RESULTS
------------------
SUMMARY OF OUTPUT: ORDINARY LEAST SQUARES
-----------------------------------------
Data set : unknown
Weights matrix : None
Dependent Variable : log_price Number of Observations: 5760
Mean dependent var : 5.1955 Number of Variables : 8
S.D. dependent var : 0.9457 Degrees of Freedom : 5752
R-squared : 0.4049
Adjusted R-squared : 0.4042
Sum squared residual: 3065.23 F-statistic : 559.0276
Sigma-square : 0.533 Prob(F-statistic) : 0
S.E. of regression : 0.730 Log likelihood : -6356.336
Sigma-square ML : 0.532 Akaike info criterion : 12728.672
S.E of regression ML: 0.7295 Schwarz criterion : 12781.941
------------------------------------------------------------------------------------
Variable Coefficient Std.Error t-Statistic Probability
------------------------------------------------------------------------------------
CONSTANT 4.07296 0.02366 172.11254 0.00000
host_listings_count -0.00001 0.00018 -0.03278 0.97385
bathrooms 0.29411 0.01950 15.08594 0.00000
bedrooms 0.32937 0.01599 20.59915 0.00000
beds 0.02490 0.00974 2.55633 0.01060
guests_included 0.00798 0.00606 1.31753 0.18771
pool 0.04048 0.02691 1.50405 0.13262
w_pool 0.01597 0.00500 3.19322 0.00141
------------------------------------------------------------------------------------
REGRESSION DIAGNOSTICS
MULTICOLLINEARITY CONDITION NUMBER 9.697
TEST ON NORMALITY OF ERRORS
TEST DF VALUE PROB
Jarque-Bera 2 1397854.351 0.0000
DIAGNOSTICS FOR HETEROSKEDASTICITY
RANDOM COEFFICIENTS
TEST DF VALUE PROB
Breusch-Pagan test 7 1827.305 0.0000
Koenker-Bassett test 7 46.806 0.0000
================================ END OF REPORT =====================================
Results are largely consistent with the original model. Having a pool does not show a statistically significant effect in this specification (p ≈ 0.13), whereas the number of pools in the eight surrounding Airbnb properties does (p ≈ 0.001), although the effect is small: each additional neighboring pool is associated with a price about 1.6% higher. This should be read with caution: our dataset only allows us to capture pools in other Airbnb properties, which is not necessarily a good proxy for the number of pools in the immediate surroundings of a given property.
Spatially lagged endogenous regressors (WY)#
In a similar way to how we have included the spatial lag, one could think the prices of houses surrounding a given property also enter its own price function. In math terms, this implies the following:
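One way to write this, keeping the notation used so far and taking \(\lambda\) as an assumed symbol for the coefficient on the spatially lagged price (reported as W_ln(price) in the output below), is:

\[
\ln(P_i) = \alpha + \lambda \sum_j w_{ij} \ln(P_j) + \beta X_i + \epsilon_i
\]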
This is essentially what we call a spatial lag model in spatial econometrics. Two calls for caution:
Unlike before, this specification does violate some of the assumptions on which OLS relies. In particular, it includes an endogenous variable, the spatial lag of \(\ln(P_i)\), on the right-hand side. This means we need a new estimation method to obtain reliable coefficients. The technical details go well beyond the scope of this tutorial, but we can offload them to PySAL and use the GM_Lag class, which estimates this model by spatial two-stage least squares.
A more conceptual gotcha: you might be tempted to read the equation above as the effect of the price in neighboring locations \(j\) on that of location \(i\). This is not quite the right interpretation. Instead, we need to realize this is all assumed to be a “joint decision”: rather than some houses setting their price first and that having a subsequent effect on others, what the equation models is an interdependent process by which each owner sets her own price taking into account the price that will be set in neighboring locations. This might read a bit like a technical subtlety and, to some extent, it is; but it is important to keep it in mind when you are interpreting the results.
Let us see how you would run this using PySAL:
variables = explanatory_vars + ["pool"]
m3 = spreg.GM_Lag(data[['log_price']].values, data[variables].values,
w=w,
name_y = 'ln(price)', name_x = variables)
print(m3.summary)
REGRESSION RESULTS
------------------
SUMMARY OF OUTPUT: SPATIAL TWO STAGE LEAST SQUARES
--------------------------------------------------
Data set : unknown
Weights matrix : unknown
Dependent Variable : ln(price) Number of Observations: 5760
Mean dependent var : 5.1955 Number of Variables : 8
S.D. dependent var : 0.9457 Degrees of Freedom : 5752
Pseudo R-squared : 0.4332
Spatial Pseudo R-squared: 0.4046
------------------------------------------------------------------------------------
Variable Coefficient Std.Error z-Statistic Probability
------------------------------------------------------------------------------------
CONSTANT 3.39010 0.15810 21.44315 0.00000
host_listings_count -0.00008 0.00018 -0.46182 0.64421
bathrooms 0.28200 0.01922 14.66960 0.00000
bedrooms 0.32801 0.01559 21.04132 0.00000
beds 0.02237 0.00951 2.35201 0.01867
guests_included 0.00563 0.00592 0.95062 0.34180
pool 0.09087 0.02165 4.19670 0.00003
W_ln(price) 0.14158 0.03133 4.51898 0.00001
------------------------------------------------------------------------------------
Instrumented: W_ln(price)
Instruments: W_bathrooms, W_bedrooms, W_beds, W_guests_included,
W_host_listings_count, W_pool
DIAGNOSTICS FOR SPATIAL DEPENDENCE
TEST DF VALUE PROB
Anselin-Kelejian Test 1 53.311 0.0000
SPATIAL LAG MODEL IMPACTS
Impacts computed using the 'simple' method.
Variable Direct Indirect Total
host_listings_count -0.0001 -0.0000 -0.0001
bathrooms 0.2820 0.0465 0.3285
bedrooms 0.3280 0.0541 0.3821
beds 0.0224 0.0037 0.0261
guests_included 0.0056 0.0009 0.0066
pool 0.0909 0.0150 0.1059
================================ END OF REPORT =====================================
As we can see, results are again very similar for most of the other variables, with the exception of pool, whose coefficient is now larger and statistically significant. The estimate of the spatial lag of price is also clearly significant (its p-value, reported in the “Probability” column, is well below 0.05). This points to evidence that there are processes of spatial interaction between property owners when they set their price.
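The impacts table in the summary can be reproduced by hand. As far as the printed numbers show, the 'simple' method is consistent with dividing each coefficient by one minus the spatial lag coefficient to get the total impact, with the indirect impact as the remainder. The sketch below checks this with values copied from the m3 summary; the variable name `rho` is an assumption used here for the W_ln(price) estimate.

```python
# Coefficients copied from the m3 summary above
rho = 0.14158  # estimate on W_ln(price)
betas = {"bathrooms": 0.28200, "bedrooms": 0.32801, "beds": 0.02237, "pool": 0.09087}

impacts = {}
for name, beta in betas.items():
    total = beta / (1 - rho)  # direct effect plus feedback through neighbours
    impacts[name] = {"direct": beta, "indirect": total - beta, "total": total}
    print(f"{name:10s} {beta:8.4f} {total - beta:8.4f} {total:8.4f}")
```

Read this way, an extra bathroom is associated with a direct effect of about 0.28 log points on the property's own price, plus a smaller indirect effect of about 0.05 that arises through the interdependence with neighboring prices.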
Prediction performance of spatial models#
Even if we are not interested in the interpretation of the model to learn more about how alternative factors determine the price of an AirBnb property, spatial econometrics can be useful. In a purely predictive setting, the use of explicitly spatial models is likely to improve accuracy in cases where space plays a key role in the data generating process. To have a quick look at this issue, we can use the mean squared error (MSE), a standard metric of accuracy in the machine learning literature, to evaluate whether explicitly spatial models are better than traditional, non-spatial ones (the smaller the value, the better):
from sklearn.metrics import mean_squared_error as mse
mses = pd.Series({
    'OLS': mse(data["log_price"], m1.predy.flatten()),
    'OLS+W': mse(data["log_price"], m2.predy.flatten()),
    'Lag': mse(data["log_price"], m3.predy_e.flatten()),
})
mses.sort_values()
OLS+W 0.532157
Lag 0.532403
OLS 0.534592
dtype: float64
We can see that both spatial specifications reduce the MSE slightly relative to plain OLS. In this run, the model with the number of surrounding pools (OLS+W) attains the lowest value, marginally ahead of the spatial lag model, and all three differ only in the third decimal place.
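For reference, the MSE is simply the average of the squared differences between observed and predicted values. A minimal sketch with made-up numbers (not taken from the models above):

```python
import numpy as np

y_true = np.array([5.0, 4.6, 6.4, 4.6])  # made-up log prices
y_pred = np.array([5.2, 4.5, 6.0, 4.9])  # made-up model predictions

# Mean of squared errors; smaller values mean predictions are closer to the data
mse_value = np.mean((y_true - y_pred) ** 2)
print(mse_value)
```

Keep in mind that the comparison above is computed on the same observations used to fit the models, so it measures in-sample fit rather than performance on unseen listings.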
Where to go next?#
If you are interested in learning more about spatial regression, we recommend reading Chapter 11 of the “Geographic Data Science with Python” book by Sergio J. Rey, Dani Arribas-Bel and Levi J. Wolf.