Airbnb Property Data from Texas | Nicola Kollmann Data Portfolio & Blog

Analyzing Airbnb Property Data from Texas

This is an R Markdown document that documents part of my work for the term paper in the “Regression Analysis for Spatial Data” course. The course was taught by Professor Dr. Roland Füss and Dr. Zeno Adams as part of the GSERM Summer School 2020. The main textbook for the course was written by LeSage and Page (2009)1. Thus, the methods used in this term paper are also in close resemblance to their work.

Load the data set

The first step in any data analysis is to set the working directory and load the necessary data into the dataframe dat. The data set at hand contains more than 18’000 Airbnb property listings from Texas, United States, and can be found online on Kaggle.

Cleaning the data set

Since we are particularly interested in the spatial dependence of the Airbnb rates, we first have to check for missing values in the longitude and latitude columns. If these values are missing, the observation is not considered in the analysis. We only loose 42 observations due to this restriction. Another step in the cleaning process requires the removal of the dollar sign for the average rate per night, since this should be a numerical variable. Also, the variables could be named more nicely to produce comprehensible tables and illustrations. The code below takes care of all of this and some more cleaning steps.

# drop missing values
dat <- dat %>% drop_na()

# remove the dollar sign string
dat$average_rate_per_night <- dat$average_rate_per_night %>% str_remove("\\$")

# rename variables
dat <- rename(dat, bedrooms = bedrooms_count)
dat <- rename(dat, rate = average_rate_per_night)

# extract the year and month the property was listed on Airbnb
dat$date_of_listing <- mdy(dat$date_of_listing)
dat$year <- year(dat$date_of_listing)
dat$month <- month(dat$date_of_listing)
dat <- dat[ , -4] 

# code the 'Studio' string as a number
dat$bedrooms <- dat$bedrooms %>% str_replace("Studio", "1")

# convert all variables to the appropriate type
dat <- type_convert(dat)

Explorative Data Analysis

The first part of the term paper will focus on a explorative data analysis. To get a first idea about the data set, the table below provides a short glimpse into the first 10 observations and the available variables. It also allows for more interaction with the data by searching and filtering the entire data set.

The table above lets us observe that we have some characteristics for each property listing on Airbnb. The first column rate contains the average rate per night in US Dollars for each listing . Also, we have information on the number of bedrooms (where a value of 1 corresponds to either a Studio apartment or a one-bedroom property) and the year and month, when the property was first listed on Airbnb. Moreover, we see that the longitude and latitude for each property is provided, giving us the exact point on the map, where the property is located. The plot below shows how our property listings are distributed across the state of Texas.

One thing that becomes quite obvious when you look at the map is that most of the properties cluster in and around the larger cities in Texas (e.g., Dallas, Austin, San Antonio and Houston). Zooming into the map allows for a more granular point of view.

Visually exploring the data a bit more lets us find more interesting facts. For instance, one possible question that we would like to address is if there is any spatial dependence in the average rate per night. Our intuition would suggest that Airbnb’s pricing algorithm takes the spatial location of a property into account. By dividing the properties into 10 categories based on their rate and color-coding them, it is possible to check this intuition with the following map.

Even though it seems like there is no general pattern to be observed at first, the spatial dependence becomes apparent when zooming into urban areas around the major cities. For instance, focusing on the area in and around Houston (click the crosshairs icon), clearly shows that properties in the city center tend to be cheaper compared to the surrounding area. This is an interesting first observation that we will keep in mind in the following analysis.

Some more questions that can be explored visually are answered below.

What is the distribution with respect to the number of bedrooms?

The majority of properties either is listed as a studio or has one bedroom. The maximum number of bedrooms is 13.

Is there a visual, linear trend between average rate per night and the number of bedrooms?

Looking at the plot, one can observe a linear trend by focusing on the lower and upper end of the data clouds for each category. Even though every number of bedrooms has properties across the whole range of rates, the bulk of the data slightly shifts upwards moving from less bedrooms on the left hand side, to more bedrooms to the right. However, it is important to note that the number of bedrooms is not a very strong predictor of the rate, as long as it is not above 4.

Are more properties being listed over time?
plot <- dat %>%
  ggplot(aes(x = year)) +
    geom_bar() +
    theme(axis.title.y = element_blank()) +
    labs(x = "year of listing on Airbnb") +
    scale_x_continuous(breaks = round(seq(min(dat$year), max(dat$year), by = 1),1))

Yes, the number of listings has grown over the past decade. The data has been collected in the second half of 2017, which explains the drop of new listings in the graph.

Comparison Across Major Cities

Let us dive a little deeper into the data and look at the distribution of the average rate per night across cities. The nine cities, where we have most observations in our data set are in order: Houston, Austin, San Antonio, Dallas, Fort Worth, Corpus Christi, College Station, Irving and Bryan. First, lets look at some boxplots to compare the average rate per night across these major cities.

Looking at the data from this perspective shows that Corpus Christi has the highest median rate per night, while Irving seems to be on the lower end of the spectrum. As one can see, every city has a fair amount of extremely high priced properties, with rates of more than 750$ per night. This lets us infer that the distribution of rates is positively skewed, with a large tail to the right-hand side of the distribution. The following figure confirms this for most of the cities.


Regression Analysis

The exploratory part of this term paper left us with some interesting insights into the distribution of nightly Airbnb rates in Texas. In a next step, we will now restrict the analysis to the city, where most of our observations are coming from - Houston. Also the geographical location of Houston inside the state of Texas is interesting, since the city is in close proximity to the sea.

houston <- dat %>% filter(city == "Houston")
OLS

Our benchmark hedonic OLS regression model only has one predictor - bedrooms. It would, of course, be very interesting to have more information about the properties (e.g., the total size of the property, amenities, reviews and an overall rating).

# benchmark hedonic OLS regression
fit.ols <- lm(rate ~ bedrooms, data = houston)
summary(fit.ols)
## 
## Call:
## lm(formula = rate ~ bedrooms, data = houston)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1928.4   -26.6     0.4    32.4  8398.1 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -330.91      28.36  -11.67   <2e-16 ***
## bedrooms      386.56      16.54   23.37   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 578.5 on 1501 degrees of freedom
## Multiple R-squared:  0.2667, Adjusted R-squared:  0.2662 
## F-statistic:   546 on 1 and 1501 DF,  p-value: < 2.2e-16

This allows us, for instance, to predict the average rate per night for a property located in Houston with 2 bedrooms.

# counterfactual
cf <- data.frame(bedrooms = 2)
predict(fit.ols, newdata = cf, interval = "prediction", level = 0.25)
##        fit      lwr      upr
## 1 442.1988 257.7352 626.6624
Spatial Model (MESS)

To incorporate the spatial model we need a weight matrix that incorporates the spatial distance between the properties. The distance can be found with the so-called haversine function.

# create a weight matrix 
dist <- geodist(x = cbind(houston$longitude, houston$latitude), measure = "haversine")/1000

My starting point is to generate a 10-nearest neighbor matrix for point data using the distances we just calculated with this haversine function:

n <- nrow(houston)
W <- matrix(0, n, n)
k <- 10
for (i in 1:n) {
  dist_i <- dist[i, ]
  near <- sort(dist_i)[-1][1:k]
  W[i, which(dist_i %in% near)] <- 1/k
}

This procedure puts an entry of 1 into each combination of houses that are 10-nearest neighbors, while all other entries remain 0, making the weight matrix sparse. Moreover the non-zero entries are divided by 10 to make the weight matrix row-stochastic. In a next step we can estimate a MESS model for the average rate per night. This allows us to also take into account the spatial location of the property.

# MESS model
y <- houston$rate
q <- 12
y.tilde <- sapply(0:(q-1), function(x) (W%^%x)%*%y) # (9.5)
G <- matrix(0,q,q)
diag(G) <- sapply(0:(q-1), function(x) 1/factorial(x)) # (9.6)
X <- as.matrix(cbind(1, houston$bedrooms))
colnames(X) <- c("intercept", "bedrooms")
n <- nrow(X)
In <- diag(n)
M <- In - X%*%solve(t(X)%*%X)%*%t(X) # Residual Maker Matrix
Q <- G%*%(t(y.tilde)%*%M%*%y.tilde)%*%G

iter <- 100
alpha <- seq(-0.99, -0.01, length.out = iter)
logl <- numeric(iter)
for (i in 1:iter) {
  v <- sapply(0:(q-1), function(x) alpha[i]^x)
  ee <- t(v)%*%Q%*%v
  logl[i] <- -n/2*log(ee)
}
alpha_hat <- alpha[which.max(logl)] ; alpha_hat
## [1] -0.3267677

We find the value for \(\alpha\) by maximizing the log-likelihood function, as it is illustrated in the plot below.

# link between alpha and rho: 
rho <- 1 - exp(alpha)
rho_hat <- 1 - exp(alpha_hat) ; rho_hat
## [1] 0.2787487
S <- lapply(0:(q-1), function(x) alpha_hat^x*(W%^%x)/factorial(x)) # Eq.(9.1)
S <- Reduce("+", S)
Sy <- S%*%y

After all the preparatory calculations we have done above, the final model can be easily estimated as a linear model.

fit.mess <- lm(Sy ~ 0 + X)
summary(fit.mess)
## 
## Call:
## lm(formula = Sy ~ 0 + X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1774.1  -102.6    19.4    51.6  8558.6 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## Xintercept  -359.64      27.96  -12.86   <2e-16 ***
## Xbedrooms    362.76      16.31   22.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 570.4 on 1501 degrees of freedom
## Multiple R-squared:  0.2946, Adjusted R-squared:  0.2937 
## F-statistic: 313.5 on 2 and 1501 DF,  p-value: < 2.2e-16
Spatial Model (SAR)

To compare our results, we also estimate a SAR model that takes into account a spatial lag on the dependent variable.

W_listw <- mat2listw(W)
fit_sar <- lagsarlm(rate ~ bedrooms, W_listw, data = houston, method = "eigen", quiet = TRUE)
summary(fit_sar)
## 
## Call:spatialreg::lagsarlm(formula = formula, data = data, listw = listw, 
##     na.action = na.action, Durbin = Durbin, type = type, method = method, 
##     quiet = quiet, zero.policy = zero.policy, interval = interval, 
##     tol.solve = tol.solve, trs = trs, control = control)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -1870.962   -35.827     2.995    35.348  8442.655 
## 
## Type: lag 
## Coefficients: (numerical Hessian approximate standard errors) 
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -341.619     28.410 -12.024 < 2.2e-16
## bedrooms     377.378     16.809  22.451 < 2.2e-16
## 
## Rho: 0.10507, LR test value: 5.8164, p-value: 0.015877
## Approximate (numerical Hessian) standard error: 0.042844
##     z-value: 2.4524, p-value: 0.01419
## Wald statistic: 6.0144, p-value: 0.01419
## 
## Log likelihood: -11688.6 for lag model
## ML residual variance (sigma squared): 328450, (sigma: 573.1)
## Number of observations: 1503 
## Number of parameters estimated: 4 
## AIC: 23385, (AIC for lm: 23389)
Comparison

In the three sections above we constructed three different models to explain the average rate per night of Airbnb listings in Houston. The first one was a simple OLS type model, whereas the MESS and the SAR model take the spatial dependence between Airbnb listings into account. Let us have a look at their results in the table below.

As one could already suspect the models we look at are far away from perfect. Nevertheless, looking at the \(R^2\) values, even the simple OLS model with only the number of bedrooms as predictor is able to explain more than \(1/4\) of the variation in the average rate per night. The SAR model, taking into account a spatial lag parameter (\(\rho\)) does not improve the explanatory power by much and the value of \(\rho\) is relatively small. A little bit of improvement can be achieved with the MESS model, which shows a \(R^2\) of almost 0.3 and a \(\rho\) of 0.279.

##                OLS      SAR     MESS
## intercept -330.912 -341.619 -359.639
## bedrooms   386.556  377.378  362.757
## rho             NA    0.105    0.279
## alpha           NA       NA   -0.327
## R-squared    0.266    0.270    0.294

As expected, the spatial models show smaller coefficients on the explanatory variable. This, however, is due to the fact that we only observe the direct effect in this coefficient. Spillover and feedback effects are not incorporated here.

Distance Based Matrix

In a final step, I will construct a different weight matrix to see how this choice affects the models. Instead of applying the k-nearest method, I will now use a distance based weight matrix. Compared to the sparse neighborhood based matrix, this approach will result in a full matrix. However, due to the fact that we construct the weight matrix based on the inverse distance, as the distances between the properties get larger, the weights get smaller. Taking advantage of the already calculated distances, this matrix can be constructed as follows.

# distance based weight matrix

W <- 1/dist # inverse of the distance
W[!is.finite(W)] <- 0 # 0 on the main diagonal

# make the weight matrix row-stochastic
rtot <- rowSums(W)
W <- W / rtot 

Now I estimate the two spatial models again to analyze the impact of the new weight matrix. The comparison is summarized in the table below.

##           SAR (10-nearest) MESS (10-nearest) SAR (distance) MESS (distance)
## intercept         -341.619          -359.639       -413.657        -359.639
## bedrooms           377.378           362.757        376.414         362.757
## rho                  0.105             0.279          0.483           0.279
## alpha                   NA            -0.327             NA          -0.327
## R-squared            0.270             0.294          0.271           0.294

The two MESS models are exactly the same, whereas the SAR model shows different values. However, the explanatory power is almost identical and the overall interpretation does not change. It is, nevertheless, interesting to see that the coefficient for \(\rho\) is a lot larger with the weight matrix based on the inverse distances.


  1. LeSage, J., and R.K. Pace (2009): Introduction to Spatial Econometrics, London and New York: Taylor & Francis Group.↩︎