Tutorial 3 - Spatial Regression in Python

Attention

Finnish university students are encouraged to use the CSC Notebooks platform.

Others can follow the lesson interactively using Binder. Check the rocket icon at the top of this page.

Attribution

Standing on the shoulders of giants: this tutorial is based on excellent open-source materials developed by Daniel Arribas-Bel (Uni. Liverpool), available here (licensed under Creative Commons BY-NC-SA). It also draws inspiration from Chapter 11 of the forthcoming book “Geographic Data Science with Python” by Rey, Arribas-Bel & Wolf, which you can access here (licensed under Creative Commons BY-NC-ND).

This notebook offers a brief and gentle introduction to spatial econometrics in Python. To do that, we will use a set of Austin properties listed on Airbnb.

The core idea of spatial econometrics is to introduce a formal representation of space into the statistical framework for regression. This can be done in many ways: by including predictors based on space (e.g. distance to relevant features), by splitting the dataset into subsets that map onto different geographical regions (e.g. spatial regimes), by exploiting proximity to other observations to borrow information in the estimation (e.g. kriging), or by introducing variables that relate the value at a given location to the values at nearby locations, to give a few examples. Some of these approaches can be implemented with standard non-spatial techniques, while others require bespoke models that can deal with the issues introduced. In this short tutorial, we will focus on the latter group. In particular, we will introduce some of the most commonly used methods in the field of spatial econometrics.
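
To make the last idea concrete, here is a minimal sketch of a spatial lag, i.e. the average value of a variable among each observation's neighbours. The values and neighbour lists are invented for illustration, and the helper name `spatial_lag` is hypothetical. In practice the neighbour structure would typically come from a spatial weights library such as `pysal.lib.weights`.

```python
# Toy example: four locations on a line, each neighbouring the adjacent ones.
# The values and the neighbour lists are invented for illustration only.
values = [10.0, 20.0, 30.0, 40.0]
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}


def spatial_lag(values, neighbors):
    """Return the mean of the neighbouring values for each location.

    This is equivalent to multiplying the values by a row-standardised
    spatial weights matrix, where each of k neighbours gets weight 1/k.
    """
    lag = []
    for i in range(len(values)):
        neigh = neighbors[i]
        lag.append(sum(values[j] for j in neigh) / len(neigh))
    return lag


lagged = spatial_lag(values, neighbors)
print(lagged)  # [20.0, 20.0, 30.0, 30.0]
```

A variable like `lagged` can then enter a regression as an additional predictor, which is one way of letting nearby observations inform the model.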

The example we will use to demonstrate this draws on hedonic house price modelling. This is a well-established methodology, developed by Rosen (1974), that is capable of recovering the marginal willingness to pay for goods or services that are not traded in the market. In other words, it allows us to put an implicit price on things such as living close to a park or in a neighborhood with good air quality. In addition, since hedonic models are based on linear regression, the technique can also be used to obtain predictions of house prices.
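
As a rough illustration of how a hedonic regression recovers an implicit price, the sketch below fits a one-variable linear regression of log price on the number of bedrooms using plain Python. The data are synthetic and noise-free, constructed so that the true coefficient (0.3) is known in advance, and the helper name `ols_slope_intercept` is hypothetical. Real hedonic models use many predictors and a proper regression library.

```python
import math

# Synthetic, noise-free toy data (an assumption for illustration):
# prices are generated so that log(price) = log(50) + 0.3 * bedrooms.
bedrooms = [1, 2, 3, 4, 5]
prices = [50.0 * math.exp(0.3 * b) for b in bedrooms]
log_prices = [math.log(p) for p in prices]


def ols_slope_intercept(x, y):
    """Ordinary least squares for a single predictor (closed form)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = ols_slope_intercept(bedrooms, log_prices)

# With a log-price outcome, exp(slope) - 1 is the implied proportional
# change in price associated with one additional bedroom.
pct_change = math.exp(slope) - 1
print(round(slope, 3), round(pct_change, 3))
```

Here the estimated slope is the implicit (log) price of an extra bedroom. In the same way, a coefficient on a variable such as distance to a park would put an implicit price on that amenity.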

Prepare data

Before anything, let us load up the libraries we will use:

from pysal.model import spreg
from pysal.lib import weights
from scipy import stats
import numpy as np
import pandas as pd
import geopandas as gpd
import seaborn as sns
import osmnx as ox
sns.set(style="whitegrid")

Let’s read the Airbnb data and OSM data for Austin, Texas:

# Read listings
fp = "data/listings.csv"
data = pd.read_csv(fp)
data.columns
Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'summary',
       'space', 'description', 'experiences_offered', 'neighborhood_overview',
       'notes', 'transit', 'thumbnail_url', 'medium_url', 'picture_url',
       'xl_picture_url', 'host_id', 'host_url', 'host_name', 'host_since',
       'host_location', 'host_about', 'host_response_time',
       'host_response_rate', 'host_acceptance_rate', 'host_is_superhost',
       'host_thumbnail_url', 'host_picture_url', 'host_neighbourhood',
       'host_listings_count', 'host_total_listings_count',
       'host_verifications', 'host_has_profile_pic', 'host_identity_verified',
       'street', 'neighbourhood', 'neighbourhood_cleansed',
       'neighbourhood_group_cleansed', 'city', 'state', 'zipcode', 'market',
       'smart_location', 'country_code', 'country', 'latitude', 'longitude',
       'is_location_exact', 'property_type', 'room_type', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'bed_type', 'amenities', 'square_feet',
       'price', 'weekly_price', 'monthly_price', 'security_deposit',
       'cleaning_fee', 'guests_included', 'extra_people', 'minimum_nights',
       'maximum_nights', 'calendar_updated', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'calendar_last_scraped', 'number_of_reviews',
       'first_review', 'last_review', 'review_scores_rating',
       'review_scores_accuracy', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'requires_license',
       'license', 'jurisdiction_names', 'instant_bookable',
       'cancellation_policy', 'require_guest_profile_picture',
       'require_guest_phone_verification', 'calculated_host_listings_count',
       'reviews_per_month'],
      dtype='object')
data.head(2)
 | id | listing_url | scrape_id | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | ... | review_scores_value | requires_license | license | jurisdiction_names | instant_bookable | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | calculated_host_listings_count | reviews_per_month
0 | 72635 | https://www.airbnb.com/rooms/72635 | 20151107173015 | 2015-11-08 | 3 Private Bedrooms, SW Austin | Conveniently located 10-15 from downtown in SW... | We have three spare bedrooms, each with a quee... | Conveniently located 10-15 from downtown in SW... | none | Location and convenience are key. Easy access... | ... | 10.0 | f | NaN | NaN | f | moderate | f | f | 1 | 0.02
1 | 5386323 | https://www.airbnb.com/rooms/5386323 | 20151107173015 | 2015-11-07 | Cricket Trailer | Rent this cool concept trailer that has everyt... | Rental arrangements for this trailer allows yo... | Rent this cool concept trailer that has everyt... | none | We're talking about wherever you'd like in the... | ... | NaN | f | NaN | NaN | f | moderate | f | f | 1 | NaN

2 rows × 92 columns

# Read OSM data - get administrative boundaries

# define the place query
query = {'city': 'Austin'}

# get the boundaries of the place
boundaries = ox.geocode_to_gdf(query)

# add a 5 km buffer around the boundaries: buffer in a projected (metric) CRS,
# then convert back to the original CRS (the buffer_dist argument is deprecated in osmnx)
utm_crs = boundaries.estimate_utm_crs()
boundaries["geometry"] = boundaries.to_crs(utm_crs).buffer(5000).to_crs(boundaries.crs)

# Let's check the boundaries on a map
boundaries.explore()

Let’s convert the Airbnb data into a GeoDataFrame based on the longitude and latitude columns and filter the data geographically based on the Austin boundaries:

# Create a GeoDataFrame
data["geometry"] = gpd.points_from_xy(data["longitude"], data["latitude"])
data = gpd.GeoDataFrame(data, crs="epsg:4326")

# Filter data geographically
data = gpd.sjoin(data, boundaries[["geometry"]])
data = data.reset_index(drop=True)

# Check the first rows
data.head()
 | id | listing_url | scrape_id | last_scraped | name | summary | space | description | experiences_offered | neighborhood_overview | ... | license | jurisdiction_names | instant_bookable | cancellation_policy | require_guest_profile_picture | require_guest_phone_verification | calculated_host_listings_count | reviews_per_month | geometry | index_right
0 | 72635 | https://www.airbnb.com/rooms/72635 | 20151107173015 | 2015-11-08 | 3 Private Bedrooms, SW Austin | Conveniently located 10-15 from downtown in SW... | We have three spare bedrooms, each with a quee... | Conveniently located 10-15 from downtown in SW... | none | Location and convenience are key. Easy access... | ... | NaN | NaN | f | moderate | f | f | 1 | 0.02 | POINT (-97.88431 30.20282) | 0
1 | 5386323 | https://www.airbnb.com/rooms/5386323 | 20151107173015 | 2015-11-07 | Cricket Trailer | Rent this cool concept trailer that has everyt... | Rental arrangements for this trailer allows yo... | Rent this cool concept trailer that has everyt... | none | We're talking about wherever you'd like in the... | ... | NaN | NaN | f | moderate | f | f | 1 | NaN | POINT (-97.90068 30.19941) | 0
2 | 8826517 | https://www.airbnb.com/rooms/8826517 | 20151107173015 | 2015-11-07 | Private room 1 in South Austin | Upstairs, private, 12ft x 13 1/2ft room. Priv... | NaN | Upstairs, private, 12ft x 13 1/2ft room. Priv... | none | NaN | ... | NaN | NaN | f | flexible | f | f | 2 | NaN | POINT (-97.86448 30.16850) | 0
3 | 8828616 | https://www.airbnb.com/rooms/8828616 | 20151107173015 | 2015-11-08 | Private room 2 in South Austin | Upstairs, private, 11ft x 13 1/2ft room. Priv... | NaN | Upstairs, private, 11ft x 13 1/2ft room. Priv... | none | NaN | ... | NaN | NaN | f | flexible | f | f | 2 | NaN | POINT (-97.86487 30.16862) | 0
4 | 8536913 | https://www.airbnb.com/rooms/8536913 | 20151107173015 | 2015-11-08 | Brand-New 3BR Austin Home | Brand-new 3BR/2BA Austin home with landscaped ... | Feel instantly at home at our brand new 3BR/2B... | Brand-new 3BR/2BA Austin home with landscaped ... | none | Entertainment and activities are plentiful her... | ... | NaN | NaN | f | strict | f | f | 2 | NaN | POINT (-97.88832 30.16943) | 0

5 rows × 94 columns
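
The geographic filter above keeps only the listings whose point falls within the boundary polygon. As a rough illustration of the underlying test (not of how geopandas performs it internally), here is a minimal ray-casting sketch in plain Python. The function name `point_in_polygon` and the rectangle coordinates are hypothetical; the rectangle is not the real Austin boundary.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a horizontal ray
    starting at (x, y) and heading right crosses. An odd count means
    the point lies inside the polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical rectangle (lon, lat) loosely around Austin, for illustration
rectangle = [(-98.0, 30.1), (-97.5, 30.1), (-97.5, 30.6), (-98.0, 30.6)]

# The first point is the first listing shown above; the second is made up
points = [(-97.88431, 30.20282), (-96.0, 30.3)]
kept = [p for p in points if point_in_polygon(p[0], p[1], rectangle)]
print(kept)  # [(-97.88431, 30.20282)]
```

The spatial join applies the same kind of inside/outside decision to every listing at once, which is why rows outside the buffered boundary disappear from the result.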

One of the most interesting attributes in our data is naturally price:

data["price"].head()
0    $300.00
1     $99.00
2    $100.00
3    $100.00
4    $599.00
Name: price, dtype: object

As we can see, the values are stored as strings with a dollar sign. Before we can take their logarithm, we need to remove the dollar sign and convert the values to floats:

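# Remove the dollar sign and the thousands separator (comma, e.g. 1,000.00) and convert to float
# regex=False treats "$" as a literal character rather than a regular-expression anchor
data["price"] = data["price"].str.replace("$", "", regex=False).str.replace(",", "", regex=False).astype(float)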
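
To see what the cleaning step does to individual values, here is a minimal plain-Python sketch that mirrors the same two replacements and then takes the natural logarithm. The helper name `parse_price` is hypothetical, and "$1,250.00" is a made-up value added to show the thousands separator; the other values appear in the output above.

```python
import math


def parse_price(text):
    """Convert a price string such as "$1,250.00" to a float."""
    return float(text.replace("$", "").replace(",", ""))


# "$1,250.00" is invented for illustration; the rest come from the data above
raw = ["$300.00", "$99.00", "$599.00", "$1,250.00"]
prices = [parse_price(p) for p in raw]
log_prices = [math.log(p) for p in prices]

print(prices)  # [300.0, 99.0, 599.0, 1250.0]
```
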
# Here the tooltip parameter specifies which attributes are shown when hovering on top of the points
# The vmax parameter specifies the maximum value for the colormap (here, all 1000 dollars and above are combined)
# To save RAM, we only visualize a random sample of 200 observations
data.sample(n=200).explore(column="price", cmap="Reds", scheme="quantiles", k=4, tooltip=["name", "price"], vmax=1000, tiles="CartoDB Positron")