library(sp); library(terra); library(gstat)
# unconditional Gaussian simulation on a grid
<- expand.grid(1:140, 1:100)
xy names(xy) <- c("x","y")
gridded(xy) = ~x+y
<- vgm(nugget=1, model = 'Sph', psill = 0, range = 0.0001)
v.m plot(variogramLine(v.m, maxdist = 50), type = "l", ylim = c(0,1.1))
<- gstat(formula = z~1, dummy = TRUE, beta = 0, model = v.m, nmax = 100)
g.sim <- predict(g.sim, xy, nsim = 1)
r.sim plot(rast(r.sim))
Spatial Statistics
Session 2
Download slides PDF
1 Geostatistics: theory, variogram properties, maximum likelihood, prediction
Overview
Stochastic process
Realization of stochastic process
Variogram functions
Sample variogram and fitting of variogram function
Maximum likelihood estimation of model parameters
Model selection/inference
Predictions for ordinary and universal/external drift kriging
2 Stochastic process
2.1 Terminology and model notation (session 1)
Model for data: \(Y_{i} = S(\mathbf{x}_{i}) + Z_{i}\)
where
\(Y_{i}\) : \(i^\mathrm{th}\) datum
\(S(\mathbf{x}_{i})\) : “signal” (= true quantity) at location \(\mathbf{x}_{i}\)
\(Z_{i}\) : iid. random measurement error
Decomposition of signal into trend \(\mu(\mathbf{x}_{i})\) and stochastic fluctuation:
\[ S(\mathbf{x}_{i}) = \mu(\mathbf{x}_{i}) + E(\mathbf{x}_{i}) \]
where commonly a linear model is used for \(\mu(\mathbf{x}_{i})\)
\[ \mu(\mathbf{x}_{i}) = \sum_{k} d_{k}(\mathbf{x}_{i}) \beta_{k} = \mathbf{d}(\mathbf{x}_{i})^\mathrm{T} \mathbf{\beta} \] with \(d_{k}(\mathbf{x}_{i})\) denoting (spatial) covariates and \(\{E(\mathbf{x}_{i})\}\) a zero mean stochastic process (random field).
2.2 Realization of stochastic process
Spatial phenomena obey laws of physics ⇒ are deterministic, have physical causes
Numerous processes and interactions thereof produce current complex outcome
Variation appears random ⇒ in geostatistical terms viewed as a random process
e.g. rainfall pattern, soil properties, air pollution as a realization of a random process
Each location \(x\) is associated with a suite of values with a known distribution
Actual value observed at \(x\) is viewed as one value of this distribution, allocated at random
Random function has no mathematical description, but “structure” in the sense of correlation in space (or time) and describes relation of random values at different locations to each other
\(\{S(\mathbf{x})\}\): Collection (= set) of random variables \(S(\mathbf{x}): \mathbf{x} \in D \subset \mathbb{R}^d\), at location \(x\) in area \(D\), with a well defined joint distribution
Stationary and isotropic stochastic processes
Stationarity: Assumption that allows to treat data with same degree of variation over region of interest.
Strictly stationary process: Joint distributions of arbitrary collections of random variables \(\{S(\mathbf{x}_{1}), \ldots, S(\mathbf{x}_{n})\}\) are invariant to translations by vector \(\mathbf{h} \in \mathbb{R}^d\)
\(\{S(\mathbf{x}_{1}), \ldots, S(\mathbf{x}_{n})\}\) and \(\{S(\mathbf{x}_{1}+\mathbf{h}), \ldots, S(\mathbf{x}_{n}+\mathbf{h})\}\) have the same joint distribution:
\[ F(s_{1}, \ldots, s_{n}; \mathbf{x}_{1}, \ldots, \mathbf{x}_{n}) = F(s_{1}, \ldots, s_{n}; \mathbf{x}_{1}+\mathbf{h}, \ldots, \mathbf{x}_{n}+\mathbf{h}) \]
Isotropic: Weakly stationary process that is invariant to rotations (opposite: anisotropic).
Gaussian stochastic process: All joint and conditional distributions are normal.
Second-order stationary stochastic processes
Weakly or second-order stationary process:
Distributions of arbitrary pairs of random variables \((S(\mathbf{x}), S(\mathbf{x}+\mathbf{h}))\) satisfy:
\(\mathbb{E}[S(\mathbf{x})] = \text{constant}\) (independent of \(\mathbf{x}\))
\(\text{Cov}(S(\mathbf{x}+\mathbf{h}), S(\mathbf{x})) = \gamma(\mathbf{h})\) (independent of \(\mathbf{x}\))
\(\text{Var}(S(\mathbf{x})) = \text{constant}\) (independent of \(\mathbf{x}\))
⇒ Covariance depends on \(h\) and only on \(h\), the separation between samples in both distance and direction
⇒ Strict stationarity implies weak stationarity.
⇒ Stationarity is required for estimation/prediction with a single realization of the stochastic process.
2.3 Covariance function and variogram
Definition of variogram \(V(\mathbf{h})\) and covariance function \(\gamma(\mathbf{h})\):
\[ V(\mathbf{h}) = \frac{1}{2} \text{Var}\left(S(\mathbf{x}+\mathbf{h}) - S(\mathbf{x})\right) \]
\[ \gamma(\mathbf{h}) = \text{Cov}\left(S(\mathbf{x}+\mathbf{h}), S(\mathbf{x})\right) \]
Relation between variogram and covariance function: \[ V(\mathbf{h}) = \gamma(0) - \gamma(\mathbf{h}), \quad \text{with} \quad \gamma(0) = \text{Var}(S(\mathbf{x})) \]
⇒ Variogram is preferred.
Variogram is based on differences only (no distribution mean as in covariance). Allows to relax stationarity assumptions even further, instead of \(\mathbb{E}[S(\mathbf{x})] = \text{constant}\), we can assume (intrinsic stationarity):
\[ \mathbb{E}[S(\mathbf{x}+\mathbf{h}) - S(\mathbf{x})] =0 \]
Variogram and correlogram
Relation between correlogram and covariance function for weakly stationary process: \[ \rho(\mathbf{h}) = \frac{\gamma(\mathbf{h})}{\gamma(0)} \]
Relation between variogram and correlogram:
\[ V(\mathbf{h}) = \gamma(0)(1 - \rho(\mathbf{h})) \]
Symmetry:
\[ V(\mathbf{h}) = V(-\mathbf{h}), \quad \gamma(\mathbf{h}) = \gamma(-\mathbf{h}), \quad \rho(\mathbf{h}) = \rho(-\mathbf{h}) \]
Covariance and Variogram Plot
Summary stochastic process
Stochastic process: generalisation of multidimensional random variable
Stationarity assumption required for estimation from single realisation of stochastic process
In practice assumption of weak stationarity:
constant mean
constant variance
covariance and semivariance depends only on lag distance but not on location
Often additional assumption of isotropic auto-correlation
3 Variogram functions and their properties
Principle features of the variogram
Increase in variance with increasing lag
Function must guarantee non-negative variances
Sill variance \(\mathbf{c_0 + c}\), i.e. an upper bound
Range of spatial correlation \(\mathbf{a}\), where auto-correlation becomes 0
Nugget variance \(\mathbf{c_0}\), i.e. a positive intercept
with \(\mathbf{c}\) often called partial sill, i.e. part of variance with spatial structure up to \(\mathbf{a}\)
Special cases
Anisotropy, i.e. directional variation depending on angle
Unbounded variogram (not second-order stationary)
Pure Nugget variogram
Pure nugget variogram
Absence of auto-correlation (nugget effect covariance)
\[ V(h) = \left\{\begin{array}{ll} 0 & \mbox{if $h=0$}\\ c_0 & \mbox{otherwise} \end{array}\right. \]
Mechanism: measurement error and/or small-scale spatial variation
Pure nugget variogram – realization
Spherical variogram function
\[ V(h) = \left\{ \begin{array}{lll} c_0 + c \left\lbrace \frac{3h}{2a} - \frac{1}{2} \left(\frac{h}{a}\right)^3 \right\rbrace & \mbox{for $0 < h \leq a$}\\ c_0 + c & \mbox{for $h > a$} \\ c_0 & \mbox{for $h = 0$} \end{array}\right. \]
where \(c_0\) is the nugget variance, \(c\) the variance of spatially correlated component and \(a\) is the range of spatial dependence.
Spherical variogram function – realizations
<- vgm(nugget=0.3, model = 'Sph', psill = 0.7, range = 20)
A <- vgm(nugget=0.05, model = 'Sph', psill = 0.95, range = 20)
B <- vgm(nugget=0.3, model = 'Sph', psill = 0.7, range = 5) C
Solution
variogramm model – variogram curve – raster with realization
- A – 2 – top right
- B – 3 – bottom right
- C – 1 – bottom left
Exponential variogram function
\[ V(h) = \left\{ \begin{array}{ll} c_0 + c \left\lbrace 1 - exp\left(\frac{h}{a}\right) \right\rbrace & \mbox{for $0 < h$}\\ c_0 & \mbox{for $h = 0$} \end{array}\right. \]
with \(a\) here being a distance parameter.
The function approaches the sill asymtotically and does not have a finite range. For practical purposes usually an effective range \(a'\) is used which is approximately \(3a\).
Exponential variogram function – realizations
<- vgm(nugget=0.3, model = 'Exp', psill = 0.7, range = 7)
A <- vgm(nugget=0.0001, model = 'Exp', psill = 0.999, range = 20)
B <- vgm(nugget=0.1, model = 'Exp', psill = 0.9, range = 20) C
Solution
variogramm model – variogram curve – raster with realization
- A – 1 – bottom right
- B – 3 – bottom left
- C – 2 – top right
Matérn (Gaussian) variogram function
\[ V(h) = c_0 + c \left\lbrace 1 - \frac{1}{2^{v-1}\Gamma(v)} \left( -\frac{h}{a}\right)^v K_v\left(\frac{h}{a}\right) \right\rbrace \]
with \(c_0\), \(c\) and \(a\) being nugget, sill and range parameters as before.
The equation embodies the gamma function \(\Gamma\) with parameter \(v\) and the Bessel function of the second kind, \(K_v\) for parameter \(v\).
Parameter \(v\) describes the smoothness of variation and can vary from 0 (very rough) to infinity (very smooth).
With \(v=0.5\) it becomes the exponential variogram.
Matérn function – Variogram smoothness
Shape of variogram close to origin controls smoothness of realizations of stochastic processes:
Variogram with nugget: realizations non-continuous
Variogram grows linearly at origin: realizations continuous, but not everywhere differentiable
Variogram grows at at least quadratically at origin: realizations everywhere at least once differentiable
Matérn variogram function – realizations
<- vgm(nugget=0, model = 'Mat', psill = 1, range = 6, kappa = 0.5)
A <- vgm(nugget=0, model = 'Mat', psill = 1, range = 3.5, kappa = 1.5)
B <- vgm(nugget=0, model = 'Mat', psill = 1, range = 2.5, kappa = 3) C
Solution
variogramm model – variogram curve – raster with realization
- A – 1 – top right
- B – 2 – bottom left
- C – 3 – bottom right
4 Model for Gaussian spatial data
4.1 Model for Gaussian spatial data
Model for data: \(Y_{i} = S(\mathbf{x}_{i}) + Z_{i} = \mu(\mathbf{x}_{i}) + E(\mathbf{x}_{i}) + Z_{i}\)
with \(Y_{i}\): \(i^{\text{th}}\) datum; \(S(\mathbf{x}_{i})\): “signal” (true quantity) at location \(\mathbf{x}_{i}\); \(\mu(\mathbf{x}_{i})\): trend
\(\{E(\mathbf{x}_{i})\}\): a zero-mean Gaussian process, parametrized by covariance function \(\gamma(\mathbf{h}; \mathbf{\theta})\) or variogram \(V(\mathbf{h}; \mathbf{\theta})\)
\(Z_{i}\): iid Gaussian measurement error with variance \(\tau^2\)
Trend \(\mu(\mathbf{x}_{i})\) modeled by linear regression model with spatial covariates \(d_{k}(\mathbf{x}_{i})\)
\[ \mu(\mathbf{x}_{i}) = \sum_{k} d_{k}(\mathbf{x}_{i}) \beta_{k} = \mathbf{d}(\mathbf{x}_{i})^{T} \mathbf{\beta} \]
Unknown elements of the model:
Structure and parameters \(\mathbf{\beta}\) of the trend model
Covariance (or variogram) parameters \(\mathbf{\theta}\)
Nugget variance \(\tau^2\)
4.2 Trend modelling
Ordinary least squares (OLS) trend estimation
Gaussian model in vector notation: \(\mathbf{Y} = \mathbf{X} \mathbf{\beta} + \mathbf{E} + \mathbf{Z}\)
Estimation of trend parameters \(\mathbf{\beta}\) by ordinary least squares:
\[ \hat{\mathbf{\beta}}_{\text{OLS}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]
For spatially uncorrelated data (\(\mathbf{E} = \mathbf{0}\); \(\text{Cov}(\mathbf{Y}, \mathbf{Y}^\top) = \tau^2 \mathbf{I}\)):
\[ \hat{\mathbf{\beta}}_{\text{OLS}} \sim \mathcal{N}\left(\mathbf{\beta}, \tau^2 (\mathbf{X}^\top \mathbf{X})^{-1} \right) \]
For spatially auto-correlated data \(\text{Cov}(\mathbf{Y}, \mathbf{Y}^\top) = \text{Cov}(\mathbf{Z}, \mathbf{Z}^\top) + \text{Cov}(\mathbf{E}, \mathbf{E}^\top) = \mathbf{\Gamma}_{\theta} = \tau^2 \mathbf{I} + \mathbf{\Sigma}_{\theta}\)
\[ \hat{\mathbf{\beta}}_{\text{OLS}} \sim \mathcal{N}\left(\mathbf{\beta}, \tau^2(\mathbf{X}^\top \mathbf{X})^{-1} + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{\Sigma}_{\theta} \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \right) \]
Conclusion: Ignoring auto-correlation: \(\hat{\mathbf{\beta}}_{\text{OLS}}\) unbiased, but the standard errors are too small. Tests based on OLS fit are biased!
Generalized least squares (GLS) trend estimation
Generalized least squares estimates:
\[ \hat{\mathbf{\beta}}_{\text{GLS}} = (\mathbf{X}^\top \mathbf{\Gamma}_{\theta}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{\Gamma}_{\theta}^{-1} \mathbf{Y} \]
GLS = OLS with “orthogonalized” response and design matrix.
Sampling distribution:
\[ \hat{\mathbf{\beta}}_{\text{GLS}} \sim \mathcal{N}\left(\mathbf{\beta}, (\mathbf{X}^\top \mathbf{\Gamma}_{\theta}^{-1} \mathbf{X})^{-1} \right) \]
For spatially uncorrelated data (\(\mathbf{\Gamma}_{\theta} = \tau^2 \mathbf{I}\)):
\[ \hat{\mathbf{\beta}}_{\text{GLS}} = \hat{\mathbf{\beta}}_{\text{OLS}} \]
\(\hat{\mathbf{\beta}}_{\text{GLS}}\) has the smallest standard errors among all linear estimators (Gauss-Markov theorem), hence it’s the BLUE (Best Linear Unbiased Estimator).
\(\hat{\mathbf{\beta}}_{\text{GLS}}\) is the maximum likelihood estimate for Gaussian \(\mathbf{Y}\).
Generalised least squares (GLS) is the method of choice for estimating coefficients of the trend model.
4.3 Computing the sample variogram
Sample variogram (also experimental variogram) consists of semivariances at a finite set of discrete lags \(h\).
Underlying function is continuous for all \(h\).
Variogram modelling:
Compute sample variogram
Fit smooth function that describes principal features of semivariance sequence
4.4 Computing sample variogram of residuals
Extract residuals \(\mathbf{R} = \mathbf{Y} - \mathbf{X} \hat{\mathbf{\beta}}\) of the fitted linear model (or use data \(\mathbf{Y}\) if the model has constant \(\mu(\mathbf{x})\)).
Choose bin width \(dh\) (and width of angular classes \(d\phi\)) to define the \((k,l)^{\text{th}}\) lag class, \(\mathbf{h}_{kl}\), characterized by:
Distance: \((h_k - dh, h_k + dh]\)
Angular class: \(\phi_l - d\phi, \phi_l + d\phi\)
Computing sample variogram – formally
Form all \(N_{kl}\) pairs \((i,j)\) with \(\mathbf{x}_{i} - \mathbf{x}_{j} \approx \mathbf{h}_{kl}\).
Compute for each lag class \(\mathbf{h}_{kl}\) the semivariance: \[ \hat{V}(\mathbf{h}_{kl}) = \frac{1}{2 N_{kl}} \sum_{(i,j) \in \mathbf{h}_{kl}} \left[R(\mathbf{x}_{i}) - R(\mathbf{x}_{j})\right]^2 \]
Sample variogram: Plot of \(\hat{V}(\mathbf{h}_{kl})\) vs. \(\mathbf{h}_{kl}\).
Rules of thumb:
Ensure enough data pairs per lag class. Choose \(dh\) (and \(d\phi\)) such that \(N_{kl} > 30-50\).
Only compute the variogram for half the size of the study area.
Largest \(\mathbf{h}_{kl} \leq 0.5 \max(\mathbf{x}_{i} - \mathbf{x}_{j})\).
Examples of different lags (Wolfcamp data)
4.5 Fitting variogram model to sample variogram
Semivariance is required for arbitrary lag distances when computing predictions.
Smoothing the sample variogram by fitting a parametric variogram function \(V(\mathbf{h}, \mathbf{\theta})\).
Choose a variogram function that approximates the shape of the sample variogram well, particularly close to the origin.
Fit parameters \(\mathbf{\theta}\) by (weighted) least squares: \[ \hat{\mathbf{\theta}} = \underset{\mathbf{\theta}}{\text{argmin}} \sum_{kl} w(\mathbf{h}_{kl}) \left( V(\mathbf{h}_{kl}) - \hat{V}(\mathbf{h}_{kl}, \mathbf{\theta}) \right)^2 \]
Options for weighing:
Equal weights: \(w(\mathbf{h}_{kl}) = 1\)
By number of pairs: \(w(\mathbf{h}_{kl}) = N_{kl}\)
Cressie’s weights: \(w(\mathbf{h}_{kl}) = \frac{N_{kl}}{V(\mathbf{h}_{kl}, \mathbf{\theta})^2}\)
Fits with different weights (Wolfcamp data)
Fits with different lags (Wolfcamp data)
4.6 Problems with ad-hoc model estimation
Subjective choice of lag class width and weighting method for model fitting.
Estimates of semivariance for different lag classes are mutually correlated; the choice of variogram function based on the sample variogram is problematic.
The sample variogram is susceptible to outliers, hence robust estimators are preferred.
Fitting a model function to the sample variogram requires further subjective choices.
An ad-hoc approach provides biased estimates of the variogram of the underlying stochastic process if the trend is modeled.
Therefore, the estimate of the variogram based on the sample variogram of OLS residuals is biased.
Thus, estimate trend and variogram parameters simultaneously using maximum likelihood.
5 Maximum likelihood estimation
5.1 Maximum likelihood (ML) estimation of parameters of Gaussian model for spatial data
Principle of maximum likelihood estimation: find parameters that maximize joint probability for observed data
Properties of maximum likelihood estimates: asymptotically unbiased and fully efficient; asymptotically normally distributed
profile likelihood useful for exploring shape of likelihood surface and for computing confidence intervals based on likelihood ratio test
Maximum likelihood estimation
Consider a Gaussian stochastic process \(\{Y(\mathbf{x})\}\) with a linear trend function.
Any arbitrary set of random variables \(\mathbf{Y} = (Y(\mathbf{x}_{1}), \ldots, Y(\mathbf{x}_{n}))\) has a multivariate Gaussian distribution with expectation: \[
\mathbb{E}[\mathbf{Y}] = \mathbf{X} \mathbf{\beta}
\]
and covariance matrix: \[
\text{Cov}(\mathbf{Y}, \mathbf{Y}^\mathrm{T}) = \mathbf{\Gamma}_{\mathbf{\theta}}
\]
The joint probability density for \(\mathbf{Y}\) is given by:
\[ f(\mathbf{y};\mathbf{\beta},\mathbf{\theta}) = (2\pi)^{-\frac{n}{2}} |\mathbf{\Gamma}_{\mathbf{\theta}}|^{-\frac{1}{2}} \exp\left(-\frac{1}{2} \{\mathbf{y} - \mathbf{X}\mathbf{\beta}\}^\mathrm{T} \mathbf{\Gamma}_{\mathbf{\theta}}^{-1} \{\mathbf{y} - \mathbf{X}\mathbf{\beta}\}\right) \]
Maximum likelihood estimation (cont.)
Unknown model parameters:
Regression coefficients \(\mathbf{\beta}\)
Covariance (or variogram) parameters \(\mathbf{\theta}\)
The log-likelihood function (up to a constant) is given by: \[ L(\mathbf{\beta}, \mathbf{\theta};\mathbf{y}) = -\frac{1}{2} \log(|\mathbf{\Gamma}_{\mathbf{\theta}}|) -\frac{1}{2} \{\mathbf{y} - \mathbf{X}\mathbf{\beta}\}^\mathrm{T} \mathbf{\Gamma}_{\mathbf{\theta}}^{-1} \{\mathbf{y} - \mathbf{X}\mathbf{\beta}\} \]
For known variogram parameters \(\mathbf{\theta}\) maximum likelihood estimate for \(\mathbf{\beta}\) equals GLS estimator: \[ \hat{\mathbf{\beta}}_{\mathrm{GLS}} = (\mathbf{X}^\mathrm{T} \mathbf{\Gamma}_{\mathbf{\theta}}^{-1} \mathbf{X})^{-1} \mathbf{X}^\mathrm{T} \mathbf{\Gamma}_{\mathbf{\theta}}^{-1} \mathbf{Y} \]
Plugging \(\hat{\mathbf{\beta}}_{\mathrm{GLS}}\) into \(L(\mathbf{\beta}, \mathbf{\theta};\mathbf{y})\) gives the profile likelihood function for \(\mathbf{\theta}\).
No closed form expression for maximum likelihood estimate
Maximize \(L\) numerically by a non-linear optimization method to find \(\hat{\theta}\)
Numerical optimization requires initial values of \(\theta\)
Make sure numerical optimization converged!
Example: MaxLik estimates Wolfcamp data
library(gstat); library(georob)
data("wolfcamp"); d.w <- wolfcamp
coordinates(d.w) <- c("x", "y")
<- georob(pressure~x+y, d.w,
r.georob.ml locations=~x+y, variogram.model="RMspheric",
param=c(variance=3000, nugget=1000, scale=100),
tuning.psi=1000, control=control.georob(ml.method="ML"))
summary(r.georob.ml)
Call:georob(formula = pressure ~ x + y, data = d.w, locations = ~x +
y, variogram.model = "RMspheric", param = c(variance = 3000,
nugget = 1000, scale = 100), tuning.psi = 1000, control = control.georob(ml.method = "ML"))
Tuning constant: 1000
Convergence in 12 function and 7 Jacobian/gradient evaluations
Estimating equations (gradient)
eta scale
Gradient : -8.880367e-05 7.654524e-04
Maximized log-likelihood: -458.3671
Predicted latent variable (B):
Min 1Q Median 3Q Max
-89.37 -54.98 -16.64 24.43 111.41
Residuals (epsilon):
Min 1Q Median 3Q Max
-62.789 -19.775 6.311 18.032 62.776
Standardized residuals:
Min 1Q Median 3Q Max
-2.3778 -0.7413 0.2249 0.6534 3.2793
Gaussian ML estimates
Variogram: RMspheric
Estimate Lower Upper
variance 3328.90 1453.95 7621.7
snugget(fixed) 0.00 NA NA
nugget 1236.27 616.01 2481.1
scale 122.95 95.13 158.9
Fixed effects coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 620.3550 17.0641 36.354 < 2e-16 ***
x -1.3256 0.1360 -9.750 2.33e-15 ***
y -1.2061 0.1793 -6.727 2.16e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error (sqrt(nugget)): 35.16
Robustness weights:
All 85 weights are ~= 1.
Wolfcamp: MaxLik fitted variogram
plot(r.georob.ml, lag.dist.def=20, max.lag=200)
5.2 Restricted maximum likelihood estimation
Equivalent number of independent observations of a sample of spatial data often much smaller than nominal sample size ⇒ bias of ML estimates of variance parameters important
The bias of MLEs of variogram parameters \(\mathbf{\theta}\) can be reduced by restricted maximum likelihood estimation (REML).
Principle of REML:
- Form linear combinations \(\mathbf{Z} = \mathbf{A} \mathbf{Y}\) of the data \(\mathbf{Y}\) that have zero expectation (no longer depend on \(\beta\)): \[ \mathbb{E}[\mathbf{Z}] = \mathbf{A} \mathbf{X} \mathbf{\beta} = \mathbf{0} \]
The matrix \(\mathbf{A}\) must satisfy: \(\mathbf{A} \mathbf{X} = \mathbf{0}\)
\(\mathbf{A}\) is non-unique with many possibilities
- Estimate \(\mathbf{\theta}\) by maximizing the likelihood function for \(n - p\) elements of \(\mathbf{Z}\).
REML estimates Wolfcamp data
<- georob(pressure~x+y, d.w,
r.georob.reml locations=~x+y, variogram.model="RMspheric",
param=c(variance=3000, nugget=1000, scale=100),
tuning.psi=1000)
summary(r.georob.reml)
Call:georob(formula = pressure ~ x + y, data = d.w, locations = ~x +
y, variogram.model = "RMspheric", param = c(variance = 3000,
nugget = 1000, scale = 100), tuning.psi = 1000)
Tuning constant: 1000
Convergence in 6 function and 5 Jacobian/gradient evaluations
Estimating equations (gradient)
eta scale
Gradient : -2.248651e-04 -1.070402e-01
Maximized restricted log-likelihood: -456.3802
Predicted latent variable (B):
Min 1Q Median 3Q Max
-94.58 -60.99 -17.59 23.10 115.72
Residuals (epsilon):
Min 1Q Median 3Q Max
-59.148 -18.009 6.251 15.982 54.620
Standardized residuals:
Min 1Q Median 3Q Max
-2.4030 -0.7131 0.2282 0.6937 3.1932
Gaussian REML estimates
Variogram: RMspheric
Estimate Lower Upper
variance 4358.84 1810.40 10494.6
snugget(fixed) 0.00 NA NA
nugget 1151.28 540.41 2452.7
scale 138.91 82.62 233.6
Fixed effects coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 624.3287 20.7961 30.021 < 2e-16 ***
x -1.3291 0.1611 -8.248 2.25e-12 ***
y -1.1804 0.2115 -5.581 3.00e-07 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error (sqrt(nugget)): 33.93
Robustness weights:
All 85 weights are ~= 1.
6 Model inference
6.1 Inference, model building and assessment
Data analysis often leads to a set of equally plausible candidate models that use different set of covariates and different variograms
compare fit of candidate models by hypothesis tests taking auto-correlation properly into account
use established goodness-of-fit criteria (AIC, BIC) to select a “best” model, again taking auto-correlation into account
use cross-validation to compare the power of candidate models to predict new data
Testing hypotheses about trend coefficients
Likelihood ratio (LRT) test can only be used to test hypotheses and build confidence regions for \(\mathbf{\theta}\)
LRT for regression for \(\mathbf{\beta}\) in general biased (too small \(p\)-values)
Use conditional \(F\)-tests for testing hypotheses about \(\mathbf{\beta}\):
Fit covariance parameters of “largest” regression model
⇒ \(\hat{\mathbf{\theta}}\)Compute covariance matrix ⇒ \(\mathbf{\Gamma}_{\hat{\theta}}\)
Orthogonalize response vector and design matrix (using Cholesky decomposition)
Conventional \(F\)-test with orthogonalized items \(\tilde{\mathbf{Y}}\) and \(\tilde{\mathbf{X}}\)
Fit quadratic trend surface model for Wolfcamp
<- update(r.georob.reml, .~.+I(x^2)+I(y^2)+x:y)
r.georob.full summary(r.georob.full)
Call:georob(formula = pressure ~ x + y + I(x^2) + I(y^2) + x:y, data = d.w,
locations = ~x + y, variogram.model = "RMspheric", param = c(variance = 3000,
nugget = 1000, scale = 100), tuning.psi = 1000)
Tuning constant: 1000
Convergence in 10 function and 8 Jacobian/gradient evaluations
Estimating equations (gradient)
eta scale
Gradient : 3.590344e-04 -4.553394e-03
Maximized restricted log-likelihood: -470.3894
Predicted latent variable (B):
Min 1Q Median 3Q Max
-89.22 -46.81 -11.06 20.80 94.07
Residuals (epsilon):
Min 1Q Median 3Q Max
-59.664 -18.086 6.783 16.245 49.986
Standardized residuals:
Min 1Q Median 3Q Max
-2.4096 -0.7213 0.2500 0.6830 3.1042
Gaussian REML estimates
Variogram: RMspheric
Estimate Lower Upper
variance 3740.02 1469.60 9518.1
snugget(fixed) 0.00 NA NA
nugget 1152.79 532.29 2496.6
scale 123.93 93.96 163.5
Fixed effects coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 608.033837 30.543683 19.907 < 2e-16 ***
x -1.180485 0.177576 -6.648 3.49e-09 ***
y -1.212317 0.211601 -5.729 1.76e-07 ***
I(x^2) 0.001592 0.001230 1.294 0.199
I(y^2) -0.001731 0.002133 -0.811 0.420
x:y 0.002067 0.001968 1.050 0.297
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error (sqrt(nugget)): 33.95
Robustness weights:
All 85 weights are ~= 1.
Conditional F-test on interaction and higher-order polynomials
waldtest(r.georob.full, .~.-x:y, test="F")
Wald test
Model 1: pressure ~ x + y + I(x^2) + I(y^2) + x:y
Model 2: pressure ~ x + y + I(x^2) + I(y^2)
Res.Df Df F Pr(>F)
1 79
2 80 -1 1.1032 0.2968
waldtest(r.georob.full, .~.-I(x^2)-I(y^2)-x:y, test="F")
Wald test
Model 1: pressure ~ x + y + I(x^2) + I(y^2) + x:y
Model 2: pressure ~ x + y
Res.Df Df F Pr(>F)
1 79
2 82 -3 1.6284 0.1895
6.2 Model selection with stepwise
Given estimates of covariance parameters \(\hat{\theta}\) and keeping them fixed, the usual stepwise procedures for selecting covariates can be used.
Selecting models based on AIC and BIC.
Stepwise selection with AIC (defaults to both directions)
step(r.georob.full)
Start: AIC=922.16
pressure ~ x + y + I(x^2) + I(y^2) + x:y
Df AIC Converged
- I(x^2) 1 922.05 1
- I(y^2) 1 922.13 1
<none> 922.16
- x:y 1 922.49 1
Step: AIC=922.05
pressure ~ x + y + I(y^2) + x:y
Df AIC Converged
<none> 922.05
+ I(x^2) 1 922.16 1
- I(y^2) 1 922.54 1
- x:y 1 924.61 1
Tuning constant: 1000
Fixed effects coefficients:
(Intercept) x y I(y^2) x:y
627.526464 -1.148338 -1.347008 -0.002587 0.003005
Variogram: RMspheric
variance(fixed) snugget(fixed) nugget(fixed) scale(fixed)
2060.4 0.0 1402.1 103.8
Stepwise selection (defaults to both directions) with BIC
step(r.georob.full, k=log(nrow(d.w)))
Start: AIC=936.81
pressure ~ x + y + I(x^2) + I(y^2) + x:y
Df AIC Converged
- I(x^2) 1 934.27 1
- I(y^2) 1 934.34 1
- x:y 1 934.70 1
<none> 936.81
Step: AIC=934.27
pressure ~ x + y + I(y^2) + x:y
Df AIC Converged
- I(y^2) 1 932.31 1
<none> 934.27
- x:y 1 934.38 1
+ I(x^2) 1 936.81 1
Step: AIC=932.31
pressure ~ x + y + x:y
Df AIC Converged
- x:y 1 931.79 1
<none> 932.31
+ I(y^2) 1 934.27 1
+ I(x^2) 1 934.34 1
Step: AIC=931.79
pressure ~ x + y
Df AIC Converged
+ I(x^2) 1 931.70 1
<none> 931.79
+ x:y 1 932.31 1
+ I(y^2) 1 934.38 1
- y 1 1006.98 1
- x 1 1085.42 1
Step: AIC=931.7
pressure ~ x + y + I(x^2)
Df AIC Converged
<none> 931.70
- I(x^2) 1 931.79 1
+ x:y 1 934.34 1
+ I(y^2) 1 934.70 1
- y 1 991.37 1
- x 1 1075.89 1
Tuning constant: 1000
Fixed effects coefficients:
(Intercept) x y I(x^2)
592.266784 -1.285190 -1.143500 0.001875
Variogram: RMspheric
variance(fixed) snugget(fixed) nugget(fixed) scale(fixed)
2060.4 0.0 1402.1 103.8
7 Kriging predictions
7.1 Prediction problem formulation
Observations \(\mathbf{y}^\mathrm{T} = (y_{1}, \ldots, y_{n})\) available for a set of \(n\) locations \(\mathbf{x}_{i}\)
Consider \(\mathbf{y}\) as a realization of the random variable \(\mathbf{Y}^\mathrm{T} = (Y_{1}, \ldots, Y_{n})\)
Model: \(Y_{i} = S(\mathbf{x}_{i}) + Z_{i}\) with:
- \(Y_{i}\): the \(i^\mathrm{th}\) datum
- \(S(\mathbf{x}_{i})\): “signal” (the true quantity) at location \(\mathbf{x}_{i}\)
- \(\{S(\mathbf{x}_{i})\}\): Gaussian process, parametrized by:
- Trend: \(\mu(\mathbf{x}_{i}) = \sum_{k} d_{k}(\mathbf{x}_{i}) \beta_{k} = \mathbf{d}(\mathbf{x}_{i})^\mathrm{T} \mathbf{\beta}\)
- Covariance function \(\gamma(\mathbf{h}; \mathbf{\theta})\) or variogram \(V(\mathbf{h}; \mathbf{\theta})\)
- \(Z_{i}\): independent, identically distributed (iid) Gaussian measurement error with variance \(\tau^2\)
Predictions: Let’s say \(\hat{\mathbf{S}}\) is the prediction of \(\mathbf{S}^\mathrm{T} = (S(\mathbf{x}_{1}^\prime), \ldots, S(\mathbf{x}_{m}^\prime))\) for a set of \(m\) locations \(\mathbf{x}_{j}^\prime\) without data.
\(\hat{\mathbf{S}}\) is computed from \(\mathbf{Y}\), therefore \(\hat{\mathbf{S}} = \hat{\mathbf{S}}(\mathbf{Y})\).
7.2 Kriging prediction at new point location
Ordinary punctual kriging
For same spatial support – prediction entities are assumed to have same extension as observations (i.e. sampling area, sensor size)
Prediction of signal \(S(\mathbf{x}_{0})\) at location \(\mathbf{x}_{0}\) without measurement
\[ \hat{S}(\mathbf{x}_{0}) = \sum_{i=1}^n \kappa_{i}(\mathbf{x}_{0})\, y(\mathbf{x}_{i}) \]
with kriging weights \(\kappa_{i}(\mathbf{x}_{0})\). To ensure unbiased estimates, weights are made to sum to 1:
\[ \sum_{i=i}^N \kappa_i(x_0) = 1 \]
Expected difference \(\mathbb{E}[S(\mathbf{x}_{0}) - \hat{S}(\mathbf{x}_{0})] = 0\).
Estimate of kriging variances
Estimate of the variance of the prediction error:
\[ var[\hat{S}(\mathbf{x}_{0})] = \mathbb{E}\left[\left\lbrace S(\mathbf{x}_{0}) - \hat{S}(\mathbf{x}_{0}) \right\rbrace\right] \]
\[ = 2\sum_{i=1}^N \kappa_{i}(\mathbf{x}_{0}) V(\mathbf{x_i}, \mathbf{x_0}) - \sum_{i=1}^N \sum_{j=1}^N \kappa_{i}(\mathbf{x}_{0}) \kappa_{j}(\mathbf{x}_{0}) V(\mathbf{x_i}, \mathbf{x_j}) \] with \(V(\mathbf{x_i}, \mathbf{x_0})\) being the semivariance of \(S\) between sampling point \(x_i\) and the target prediction point \(\mathbf{x_0}\) and \(V(\mathbf{x_i}, \mathbf{x_j})\) the semivariance between the \(i\)th and the \(j\)th sampling points.
Find kriging weights
Kriging weights are calculated by solving the system of equations that minimizes the prediction error variance subject to the constraints of the variogram model.
Find weights \(\kappa_{i}(\mathbf{x}_{0})\) that minimize kriging variances and sum to 1 by solving \(N + 1\) equations:
\[ \sum_{i=1}^N \kappa_{i}(\mathbf{x}_{0}) V(\mathbf{x_i}, \mathbf{x_j}) + \psi(\mathbf{x_0}) = V(\mathbf{x_j}, \mathbf{x_0}) \quad \text{for all} ~~j \]
\[ \sum_{i=i}^N \kappa_i(x_0) = 1 \]
The quantity \(\psi(\mathbf{x_0})\) is a Lagrange multiplier introduced to achieve minimization. Computed as the inverse of the semivariance matrix multiplied by the vector of semivariances to the target point.
Properties of kriging prediction
Shape of variogram function close to origin determine shape of prediction surface near data locations
Continuity and diffentiability of variogram at origin control geometrical properties of simple kriging prediction surface
7.3 Universal/external drift kriging predictions
Universal kriging: often referred if trend is modelled by coordinates
External-drift kriging: trend is modelled by spatial covariates
Evaluating \(\hat{\mathbf{S}}_{\text{opt}}\) requires a fully specified weakly stationary model:
Structure of trend function is known
Regression coefficients \(\mathbf{\beta}\) are known
Type of parametric covariance (variogram) function is known
Parameters \(\mathbf{\theta}\) and \(\tau^2\) of the covariance function are known
Relaxed assumptions: Only 1, 3, and 4 are assumed to be known, while \(\mathbf{\beta}\) is implicitly estimated from the data using generalized least squares (GLS).
UK/EDK predictions
Therefore, we use the universal kriging (UK) or external drift kriging (EDK) plug-in predictor: \[
\hat{\mathbf{S}}_{\text{k}} = \mathbf{X}_{\mathbf{S}} \hat{\mathbf{\beta}}_{\text{GLS}} + \mathbf{\Lambda} \left( \mathbf{y} - \mathbf{X}_{\mathbf{Y}} \hat{\mathbf{\beta}}_{\text{GLS}} \right)
\]
with: \[
\mathbf{\Lambda} = \mathbf{\Sigma}_{\mathbf{S}\mathbf{Y}} \mathbf{\Gamma}_{\mathbf{Y}\mathbf{Y}}^{-1}
\]
Computing universal kriging predictor requires:
known structure of trend function
known structure and parameters of variogram
“Plug-in” predictor: uncertainty of variogram is ignored when computing predictions