One of the coolest geospatial things to learn is spatial interaction models. I’m going to give a quick overview of them here and how I used a doubly constrained gravity model to model for hire vehicle movement in New York City. The classical model is the gravity model, borrowed from classical physics. Let’s take a look at that.

Let’s first clear up some terms. In the gravity model, \(T_ij\) is an \(n\)x\(m\) matrix of flows between \(n\) origins (subscripted by \(i\)) to \(m\) destinations (subscripted by \(j\)). These are the individual trips between an origin and a destination. \(V\) is an \(n\)x\(p\) vector of \(p\) origin attributes describing the emissiveness of \(i\). Meanwhile, \(W\) is an \(m\)x\(p\) vector of \(p\) destination attributes describing the attractiveness of \(j\). Those are attributes of origins and destinations that make them attractive or repulsive. The trips between origins and destinations are slowed by \(d\), an \(n\)x\(m\) matrix of the costs to overcome from physical separation between \(i\) and \(j\). While this is usually distance or time, it could possibly be other things such as monetary cost. \(k\) is a scaling figure that makes sure the total observed and predicted flows are consistant, that the number of predicted equals the number that actually took place. \(k\) must be estimated and is not a given. \(\mu\) is a \(p\)x\(1\) vector of parameters representing the effect of \(p\) origin attributes on flows while \(\alpha\) is another \(p\)x\(1\) vector of parameters representing the effect of \(p\) destination attributes on flows. Finally \(\beta\) is a parameter representing the effect of movement costs on flows. If we have \(T\),\(V\), \(W\), and \(d\), then we can estimate the model parameters in what we call model calibration. This summarizes the effect that each parameter and component explains the known flows \(T\).

In 1971, Alan Wilson published his “family” of four spatial interaction models:

Unconstrained: \[ T_{ij} = V_i^\mu W_j^\alpha f(d_{ij})\]

Production Constrained: \[T_{ij} = A_i O_i W_j^\alpha f(d_{ij})\] \[A_i = \dfrac{1}{\sum\limits_j W_j^\alpha f(d_ij)}\]

Attraction Constrained: \[T_{ij} = B_j D_j V_i^\mu f(d_{ij})\] \[B_j = \dfrac{1}{\sum\limits_i V_i^\mu f(d_{ij})}\] Doubly Constrained: \[T_{ij} = A_i B_j O_i D_j f(d_{ij})\] \[A_i = \dfrac{1}{\sum\limits_j B_j D_j f(d_{ij})}\] \[B_j = \dfrac{1}{\sum\limits_i A_i O_i f(d_{ij})}\]

For the doubly constrained model, \(O_i\) is an \(n\)x1 vector of the total number of flows emanating from origin \(i\). Likewise, \(D_j\) is an \(m\)x1 vector of the total number of flows terminating at destination \(j\). \(A_i\) and \(B_j\) are both balancing factors, with \(A_i\) for the origin ensuring that total out-flows are preserved in the predicted flows and \(B_j\) doing the same for destination in-flows. Finally, $f(d_{ij}) is a function of cost or distance referred to as the distance decay function. Most commonly, it is either an exponential or power function.

Power: \[f(d_{ij}) = d^\beta_{ij}\] Exponential: \[f(d_{ij}) = exp(\beta d_{ij})\]

\(\beta\) is expected to take a negative value; this only makes sense as the greater the distance, the less the flow. What does it mean to be constrained? What this means is we actually know before hand either the origin out-flows, the destination in-flows, or both. And thus, any model is constrained by that information. Calibration nowadays utilizes linear regression, particularly Poisson; this makes sense because we’re counting trips. To get the model equation into a form that a regression can be applied to, we take the natural log creating a log-linear equation.

Another way to look at the model is this way: \[T_{ij} = kV_i^\mu W_j^\alpha d_{ij}^{-\beta}\]

Where k, a constant, is: \[k = \dfrac{T}{\sum_i \sum_j V_i^\mu W_j^\alpha d_{ij}^{-\beta}}\] \[T = \sum\limits_i\sum\limits_jT_{ij}\]

Now, back to calibration. To create a linear function we take the natural log. For the power function:

\[ln T_{ij} = k + \mu ln V_i + \alpha ln W_j - \beta ln d_{ij} + \epsilon\] And for the exponential function: \[ln T_{ij} = k + \mu ln V_i + \alpha ln W_j - \beta d_{ij} + \epsilon\]

Now, again we use the Poisson distribution, with \(\lambda_{ij} = T_{ij}\). This leads to:

\[ln \lambda_{ij} = k + \mu ln V_i + \alpha ln W_j - \beta ln d_{ij}\]

Exponentiating, an unconstrained model is: \[T_{ij} = exp(k + \mu ln V_i + \alpha lnW_j - \beta lnd_{ij})\] A production constrained: \[T_{ij} = exp(k + \mu_i + \alpha lnW_j - \beta lnd_{ij})\] An attraction constrained: \[T_{ij} = exp(k + \mu ln V_i + \alpha_j - \beta lnd_{ij})\] And the doubly constrained: \[T_{ij} = exp(k + \mu_i + \alpha_j - \beta lnd_{ij})\]

\(\mu_i\) and \(\alpha_j\) are both fixed effects that are essentially balancing factors. Here \(k\) is the intercept and has to be kept in log-linear models. to make sure total number of flows is conserveed. Poisson regression can take place with a generalized linear model (GLM) employing iteratively weighted least squares (IWSL).

Now, this family of models is actually predicated not on the gravity model but on a concept of entropy maximization. The idea is there is a macro state, a meso state, and a micro state. Think of micro states as individuals making an individual trip. The meso state is the pattern of those individuals becoming trip origin and destination pairs. And the macro state is perhaps the sum of all these flows. In a sense, each meso state is the sum of all its micro states. Where we can show that a meso state is:

\[W\{T_{ij}\} = \dfrac{T!}{\prod\limits_{ij} T_{ij}!}\]

Entropy is the number of ways microstates can be organized to make up a mesostate. To maximize we take:

\[\max \Bigg[ln(T!) - \sum\limits_{ij} ln(T_{ij}!)\Bigg]\]

#Set working directory
setwd("~/Dropbox/development/geospatial-projects/transit-projects/nyc-tlc-data/")
#Read in libraries
library(tidyverse)
library(rgdal)
library(sf)
library(sp)
library(reshape2)
library(stplanr)
library(leaflet)
library(broom)
library(dplyr)
library(MASS)
#Read in taxi zones
zones <- readOGR("taxi_zones/taxi_zones.shp")
zones <- spTransform(zones, CRS("+init=epsg:4326"))
zones <- zones[order(zones$LocationID),]

#Create distance matrix
distance <- spDists(zones)

#Melt distance matrix to create pairs
distPair <- melt(distance)

#Read in flows data
flows <- read.csv("fhv_flows.csv")
flows <- flows[order(flows$D),]

#Match up flows with distances
flows$distance <- distPair$value

#Turn origins and destinations into factors
flows[,'O'] <- factor(flows[,'O'])
flows[,'D'] <- factor(flows[,'D'])

#Instead of removing zero values, add one to everything
flows$Tij <- flows$Tij + 1

#Remove airports
flows <- subset(flows, O!=1 & D!=1)
flows <- subset(flows, O!=132 & D!=132)
flows <- subset(flows, O!=138 & O!=138)

#Remove intrazonal flows
flows <- flows[flows$O != flows$D,]
#Take sample of data so it is not clogged up
visualizationData <- sample_n(flows, 50)

#Create flow lines
travel_network <- od2line(flow=visualizationData, zones=zones)
travel_networkwgs <- spTransform(travel_network, "+init=epsg:4326")

#Show flow lines
plot(travel_network)

#Show travel zones
plot(zones, add=T)