Spatial Statistics

Session 2

Authors

Affiliations

Madlene Nussbaum

Department of Physical Geography, Utrecht University

Andreas Papritz

(ehem. ETHZ)

Published

15 Oct 2024

Download slides PDF

1 Geostatistics: theory, variogram properties, maximum likelihood, prediction

Overview

Stochastic process
Realization of stochastic process
Variogram functions
Sample variogram and fitting of variogram function
Maximum likelihood estimation of model parameters
Model selection/inference
Predictions for ordinary and universal/external drift kriging

2 Stochastic process

2.1 Terminology and model notation (session 1)

Model for data: $Y_{i} = S(\mathbf{x}_{i}) + Z_{i}$

where

$Y_{i}$ : $i^\mathrm{th}$ datum

$S(\mathbf{x}_{i})$ : “signal” (= true quantity) at location $\mathbf{x}_{i}$

$Z_{i}$ : iid. random measurement error

Decomposition of signal into trend $\mu(\mathbf{x}_{i})$ and stochastic fluctuation:

\[ S(\mathbf{x}_{i}) = \mu(\mathbf{x}_{i}) + E(\mathbf{x}_{i}) \]

where commonly a linear model is used for $\mu(\mathbf{x}_{i})$

\[ \mu(\mathbf{x}_{i}) = \sum_{k} d_{k}(\mathbf{x}_{i}) \beta_{k} = \mathbf{d}(\mathbf{x}_{i})^\mathrm{T} \mathbf{\beta} \] with $d_{k}(\mathbf{x}_{i})$ denoting (spatial) covariates and $\{E(\mathbf{x}_{i})\}$ a zero mean stochastic process (random field).

2.2 Realization of stochastic process

Spatial phenomena obey laws of physics ⇒ are deterministic, have physical causes
Numerous processes and interactions thereof produce current complex outcome
Variation appears random ⇒ in geostatistical terms viewed as a random process
e.g. rainfall pattern, soil properties, air pollution as a realization of a random process
Each location $x$ is associated with a suite of values with a known distribution
Actual value observed at $x$ is viewed as one value of this distribution, allocated at random
Random function has no mathematical description, but “structure” in the sense of correlation in space (or time) and describes relation of random values at different locations to each other

Spatial stochastic process (random process)

$\{S(\mathbf{x})\}$: Collection (= set) of random variables $S(\mathbf{x}): \mathbf{x} \in D \subset \mathbb{R}^d$, at location $x$ in area $D$, with a well defined joint distribution

Stationary and isotropic stochastic processes

Stationarity: Assumption that allows to treat data with same degree of variation over region of interest.

Strictly stationary process: Joint distributions of arbitrary collections of random variables $\{S(\mathbf{x}_{1}), \ldots, S(\mathbf{x}_{n})\}$ are invariant to translations by vector $\mathbf{h} \in \mathbb{R}^d$

$\{S(\mathbf{x}_{1}), \ldots, S(\mathbf{x}_{n})\}$ and $\{S(\mathbf{x}_{1}+\mathbf{h}), \ldots, S(\mathbf{x}_{n}+\mathbf{h})\}$ have the same joint distribution:

\[ F(s_{1}, \ldots, s_{n}; \mathbf{x}_{1}, \ldots, \mathbf{x}_{n}) = F(s_{1}, \ldots, s_{n}; \mathbf{x}_{1}+\mathbf{h}, \ldots, \mathbf{x}_{n}+\mathbf{h}) \]

Isotropic: Weakly stationary process that is invariant to rotations (opposite: anisotropic).

Gaussian stochastic process: All joint and conditional distributions are normal.

Second-order stationary stochastic processes

Weakly or second-order stationary process:

Distributions of arbitrary pairs of random variables $(S(\mathbf{x}), S(\mathbf{x}+\mathbf{h}))$ satisfy:

$\mathbb{E}[S(\mathbf{x})] = \text{constant}$ (independent of $\mathbf{x}$)
$\text{Cov}(S(\mathbf{x}+\mathbf{h}), S(\mathbf{x})) = \gamma(\mathbf{h})$ (independent of $\mathbf{x}$)
$\text{Var}(S(\mathbf{x})) = \text{constant}$ (independent of $\mathbf{x}$)

⇒ Covariance depends on $h$ and only on $h$, the separation between samples in both distance and direction

⇒ Strict stationarity implies weak stationarity.

⇒ Stationarity is required for estimation/prediction with a single realization of the stochastic process.

2.3 Covariance function and variogram

Definition of variogram $V(\mathbf{h})$ and covariance function $\gamma(\mathbf{h})$:

\[ V(\mathbf{h}) = \frac{1}{2} \text{Var}\left(S(\mathbf{x}+\mathbf{h}) - S(\mathbf{x})\right) \]

\[ \gamma(\mathbf{h}) = \text{Cov}\left(S(\mathbf{x}+\mathbf{h}), S(\mathbf{x})\right) \]

Relation between variogram and covariance function: \[ V(\mathbf{h}) = \gamma(0) - \gamma(\mathbf{h}), \quad \text{with} \quad \gamma(0) = \text{Var}(S(\mathbf{x})) \]

⇒ Variogram is preferred.

Variogram is based on differences only (no distribution mean as in covariance). Allows to relax stationarity assumptions even further, instead of $\mathbb{E}[S(\mathbf{x})] = \text{constant}$, we can assume (intrinsic stationarity):

\[ \mathbb{E}[S(\mathbf{x}+\mathbf{h}) - S(\mathbf{x})] =0 \]

Variogram and correlogram

Relation between correlogram and covariance function for weakly stationary process: \[ \rho(\mathbf{h}) = \frac{\gamma(\mathbf{h})}{\gamma(0)} \]

Relation between variogram and correlogram:

\[ V(\mathbf{h}) = \gamma(0)(1 - \rho(\mathbf{h})) \]

Symmetry:

\[ V(\mathbf{h}) = V(-\mathbf{h}), \quad \gamma(\mathbf{h}) = \gamma(-\mathbf{h}), \quad \rho(\mathbf{h}) = \rho(-\mathbf{h}) \]

Covariance and Variogram Plot

Summary stochastic process

Stochastic process: generalisation of multidimensional random variable
Stationarity assumption required for estimation from single realisation of stochastic process
In practice assumption of weak stationarity:
1. constant mean
2. constant variance
3. covariance and semivariance depends only on lag distance but not on location
Often additional assumption of isotropic auto-correlation

3 Variogram functions and their properties

Principle features of the variogram

Increase in variance with increasing lag
Function must guarantee non-negative variances
Sill variance $\mathbf{c_0 + c}$, i.e. an upper bound
Range of spatial correlation $\mathbf{a}$, where auto-correlation becomes 0
Nugget variance $\mathbf{c_0}$, i.e. a positive intercept
with $\mathbf{c}$ often called partial sill, i.e. part of variance with spatial structure up to $\mathbf{a}$

Special cases

Anisotropy, i.e. directional variation depending on angle
Unbounded variogram (not second-order stationary)
Pure Nugget variogram

Pure nugget variogram

Absence of auto-correlation (nugget effect covariance)

\[ V(h) = \left\{\begin{array}{ll} 0 & \mbox{if $h=0$}\\ c_0 & \mbox{otherwise} \end{array}\right. \]

Mechanism: measurement error and/or small-scale spatial variation

Pure nugget variogram – realization

library(sp); library(terra); library(gstat)
# unconditional Gaussian simulation on a grid
xy <- expand.grid(1:140, 1:100)
names(xy) <- c("x","y")
gridded(xy) = ~x+y
v.m <- vgm(nugget=1, model = 'Sph', psill = 0, range = 0.0001)
plot(variogramLine(v.m, maxdist = 50), type = "l", ylim = c(0,1.1))
g.sim <- gstat(formula = z~1, dummy = TRUE, beta = 0,   model = v.m, nmax = 100) 
r.sim <- predict(g.sim, xy, nsim = 1)
plot(rast(r.sim))

Spherical variogram function

\[ V(h) = \left\{ \begin{array}{lll} c_0 + c \left\lbrace \frac{3h}{2a} - \frac{1}{2} \left(\frac{h}{a}\right)^3 \right\rbrace & \mbox{for $0 < h \leq a$}\\ c_0 + c & \mbox{for $h > a$} \\ c_0 & \mbox{for $h = 0$} \end{array}\right. \]

where $c_0$ is the nugget variance, $c$ the variance of spatially correlated component and $a$ is the range of spatial dependence.

Spherical variogram function – realizations

A <- vgm(nugget=0.3, model = 'Sph', psill = 0.7, range = 20)
B <- vgm(nugget=0.05, model = 'Sph', psill = 0.95, range = 20)
C <- vgm(nugget=0.3, model = 'Sph', psill = 0.7, range = 5)

Solution

variogramm model – variogram curve – raster with realization

A – 2 – top right
B – 3 – bottom right
C – 1 – bottom left

Exponential variogram function

\[ V(h) = \left\{ \begin{array}{ll} c_0 + c \left\lbrace 1 - exp\left(\frac{h}{a}\right) \right\rbrace & \mbox{for $0 < h$}\\ c_0 & \mbox{for $h = 0$} \end{array}\right. \]

with $a$ here being a distance parameter.

The function approaches the sill asymtotically and does not have a finite range. For practical purposes usually an effective range $a'$ is used which is approximately $3a$.

Exponential variogram function – realizations

A <- vgm(nugget=0.3, model = 'Exp', psill = 0.7, range = 7)
B <- vgm(nugget=0.0001, model = 'Exp', psill = 0.999, range = 20)
C <- vgm(nugget=0.1, model = 'Exp', psill = 0.9, range = 20)

Solution

variogramm model – variogram curve – raster with realization

A – 1 – bottom right
B – 3 – bottom left
C – 2 – top right

Matérn (Gaussian) variogram function

\[ V(h) = c_0 + c \left\lbrace 1 - \frac{1}{2^{v-1}\Gamma(v)} \left( -\frac{h}{a}\right)^v K_v\left(\frac{h}{a}\right) \right\rbrace \]

with $c_0$, $c$ and $a$ being nugget, sill and range parameters as before.

The equation embodies the gamma function $\Gamma$ with parameter $v$ and the Bessel function of the second kind, $K_v$ for parameter $v$.

Parameter $v$ describes the smoothness of variation and can vary from 0 (very rough) to infinity (very smooth).

With $v=0.5$ it becomes the exponential variogram.

Matérn function – Variogram smoothness

Shape of variogram close to origin controls smoothness of realizations of stochastic processes:

Variogram with nugget: realizations non-continuous
Variogram grows linearly at origin: realizations continuous, but not everywhere differentiable
Variogram grows at at least quadratically at origin: realizations everywhere at least once differentiable

Matérn variogram function – realizations

A <- vgm(nugget=0, model = 'Mat', psill = 1, range = 6, kappa = 0.5)
B <- vgm(nugget=0, model = 'Mat', psill = 1, range = 3.5, kappa = 1.5)
C <- vgm(nugget=0, model = 'Mat', psill = 1, range = 2.5, kappa = 3)

Solution

variogramm model – variogram curve – raster with realization

A – 1 – top right
B – 2 – bottom left
C – 3 – bottom right

4 Model for Gaussian spatial data

4.1 Model for Gaussian spatial data

Model for data: $Y_{i} = S(\mathbf{x}_{i}) + Z_{i} = \mu(\mathbf{x}_{i}) + E(\mathbf{x}_{i}) + Z_{i}$

with $Y_{i}$: $i^{\text{th}}$ datum; $S(\mathbf{x}_{i})$: “signal” (true quantity) at location $\mathbf{x}_{i}$; $\mu(\mathbf{x}_{i})$: trend
$\{E(\mathbf{x}_{i})\}$: a zero-mean Gaussian process, parametrized by covariance function $\gamma(\mathbf{h}; \mathbf{\theta})$ or variogram $V(\mathbf{h}; \mathbf{\theta})$
$Z_{i}$: iid Gaussian measurement error with variance $\tau^2$

Trend $\mu(\mathbf{x}_{i})$ modeled by linear regression model with spatial covariates $d_{k}(\mathbf{x}_{i})$

\[ \mu(\mathbf{x}_{i}) = \sum_{k} d_{k}(\mathbf{x}_{i}) \beta_{k} = \mathbf{d}(\mathbf{x}_{i})^{T} \mathbf{\beta} \]

Unknown elements of the model:

Structure and parameters $\mathbf{\beta}$ of the trend model
Covariance (or variogram) parameters $\mathbf{\theta}$
Nugget variance $\tau^2$

4.2 Trend modelling

Ordinary least squares (OLS) trend estimation

Gaussian model in vector notation: $\mathbf{Y} = \mathbf{X} \mathbf{\beta} + \mathbf{E} + \mathbf{Z}$

Estimation of trend parameters $\mathbf{\beta}$ by ordinary least squares:

\[ \hat{\mathbf{\beta}}_{\text{OLS}} = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{Y} \]

For spatially uncorrelated data ($\mathbf{E} = \mathbf{0}$; $\text{Cov}(\mathbf{Y}, \mathbf{Y}^\top) = \tau^2 \mathbf{I}$):

\[ \hat{\mathbf{\beta}}_{\text{OLS}} \sim \mathcal{N}\left(\mathbf{\beta}, \tau^2 (\mathbf{X}^\top \mathbf{X})^{-1} \right) \]

For spatially auto-correlated data $\text{Cov}(\mathbf{Y}, \mathbf{Y}^\top) = \text{Cov}(\mathbf{Z}, \mathbf{Z}^\top) + \text{Cov}(\mathbf{E}, \mathbf{E}^\top) = \mathbf{\Gamma}_{\theta} = \tau^2 \mathbf{I} + \mathbf{\Sigma}_{\theta}$

\[ \hat{\mathbf{\beta}}_{\text{OLS}} \sim \mathcal{N}\left(\mathbf{\beta}, \tau^2(\mathbf{X}^\top \mathbf{X})^{-1} + (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{\Sigma}_{\theta} \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1} \right) \]

Conclusion: Ignoring auto-correlation: $\hat{\mathbf{\beta}}_{\text{OLS}}$ unbiased, but the standard errors are too small. Tests based on OLS fit are biased!

Generalized least squares (GLS) trend estimation

Generalized least squares estimates:

\[ \hat{\mathbf{\beta}}_{\text{GLS}} = (\mathbf{X}^\top \mathbf{\Gamma}_{\theta}^{-1} \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{\Gamma}_{\theta}^{-1} \mathbf{Y} \]

GLS = OLS with “orthogonalized” response and design matrix.

Sampling distribution:

\[ \hat{\mathbf{\beta}}_{\text{GLS}} \sim \mathcal{N}\left(\mathbf{\beta}, (\mathbf{X}^\top \mathbf{\Gamma}_{\theta}^{-1} \mathbf{X})^{-1} \right) \]

For spatially uncorrelated data ($\mathbf{\Gamma}_{\theta} = \tau^2 \mathbf{I}$):

\[ \hat{\mathbf{\beta}}_{\text{GLS}} = \hat{\mathbf{\beta}}_{\text{OLS}} \]

$\hat{\mathbf{\beta}}_{\text{GLS}}$ has the smallest standard errors among all linear estimators (Gauss-Markov theorem), hence it’s the BLUE (Best Linear Unbiased Estimator).

$\hat{\mathbf{\beta}}_{\text{GLS}}$ is the maximum likelihood estimate for Gaussian $\mathbf{Y}$.

Generalised least squares (GLS) is the method of choice for estimating coefficients of the trend model.

4.3 Computing the sample variogram

Sample variogram (also experimental variogram) consists of semivariances at a ﬁnite set of discrete lags $h$.
Underlying function is continuous for all $h$.
Variogram modelling:
1. Compute sample variogram
2. Fit smooth function that describes principal features of semivariance sequence

4.4 Computing sample variogram of residuals

Extract residuals $\mathbf{R} = \mathbf{Y} - \mathbf{X} \hat{\mathbf{\beta}}$ of the fitted linear model (or use data $\mathbf{Y}$ if the model has constant $\mu(\mathbf{x})$).

Choose bin width $dh$ (and width of angular classes $d\phi$) to define the $(k,l)^{\text{th}}$ lag class, $\mathbf{h}_{kl}$, characterized by:

Distance: $(h_k - dh, h_k + dh]$
Angular class: $\phi_l - d\phi, \phi_l + d\phi$

Computing sample variogram – formally

Form all $N_{kl}$ pairs $(i,j)$ with $\mathbf{x}_{i} - \mathbf{x}_{j} \approx \mathbf{h}_{kl}$.

Compute for each lag class $\mathbf{h}_{kl}$ the semivariance: \[ \hat{V}(\mathbf{h}_{kl}) = \frac{1}{2 N_{kl}} \sum_{(i,j) \in \mathbf{h}_{kl}} \left[R(\mathbf{x}_{i}) - R(\mathbf{x}_{j})\right]^2 \]

Sample variogram: Plot of $\hat{V}(\mathbf{h}_{kl})$ vs. $\mathbf{h}_{kl}$.

Rules of thumb:

Ensure enough data pairs per lag class. Choose $dh$ (and $d\phi$) such that $N_{kl} > 30-50$.
Only compute the variogram for half the size of the study area.
Largest $\mathbf{h}_{kl} \leq 0.5 \max(\mathbf{x}_{i} - \mathbf{x}_{j})$.

Examples of different lags (Wolfcamp data)

4.5 Fitting variogram model to sample variogram

Semivariance is required for arbitrary lag distances when computing predictions.
Smoothing the sample variogram by fitting a parametric variogram function $V(\mathbf{h}, \mathbf{\theta})$.
Choose a variogram function that approximates the shape of the sample variogram well, particularly close to the origin.
Fit parameters $\mathbf{\theta}$ by (weighted) least squares: \[ \hat{\mathbf{\theta}} = \underset{\mathbf{\theta}}{\text{argmin}} \sum_{kl} w(\mathbf{h}_{kl}) \left( V(\mathbf{h}_{kl}) - \hat{V}(\mathbf{h}_{kl}, \mathbf{\theta}) \right)^2 \]

Options for weighing:

Equal weights: $w(\mathbf{h}_{kl}) = 1$
By number of pairs: $w(\mathbf{h}_{kl}) = N_{kl}$
Cressie’s weights: $w(\mathbf{h}_{kl}) = \frac{N_{kl}}{V(\mathbf{h}_{kl}, \mathbf{\theta})^2}$

Fits with different weights (Wolfcamp data)

Fits with different lags (Wolfcamp data)

4.6 Problems with ad-hoc model estimation

Subjective choice of lag class width and weighting method for model fitting.
Estimates of semivariance for different lag classes are mutually correlated; the choice of variogram function based on the sample variogram is problematic.
The sample variogram is susceptible to outliers, hence robust estimators are preferred.
Fitting a model function to the sample variogram requires further subjective choices.
An ad-hoc approach provides biased estimates of the variogram of the underlying stochastic process if the trend is modeled.
Therefore, the estimate of the variogram based on the sample variogram of OLS residuals is biased.
Thus, estimate trend and variogram parameters simultaneously using maximum likelihood.

5 Maximum likelihood estimation

5.1 Maximum likelihood (ML) estimation of parameters of Gaussian model for spatial data

Principle of maximum likelihood estimation: find parameters that maximize joint probability for observed data
Properties of maximum likelihood estimates: asymptotically unbiased and fully efficient; asymptotically normally distributed
profile likelihood useful for exploring shape of likelihood surface and for computing confidence intervals based on likelihood ratio test

Maximum likelihood estimation

Consider a Gaussian stochastic process $\{Y(\mathbf{x})\}$ with a linear trend function.

Any arbitrary set of random variables $\mathbf{Y} = (Y(\mathbf{x}_{1}), \ldots, Y(\mathbf{x}_{n}))$ has a multivariate Gaussian distribution with expectation: \[ \mathbb{E}[\mathbf{Y}] = \mathbf{X} \mathbf{\beta} \]
and covariance matrix: \[ \text{Cov}(\mathbf{Y}, \mathbf{Y}^\mathrm{T}) = \mathbf{\Gamma}_{\mathbf{\theta}} \]

The joint probability density for $\mathbf{Y}$ is given by:

\[ f(\mathbf{y};\mathbf{\beta},\mathbf{\theta}) = (2\pi)^{-\frac{n}{2}} |\mathbf{\Gamma}_{\mathbf{\theta}}|^{-\frac{1}{2}} \exp\left(-\frac{1}{2} \{\mathbf{y} - \mathbf{X}\mathbf{\beta}\}^\mathrm{T} \mathbf{\Gamma}_{\mathbf{\theta}}^{-1} \{\mathbf{y} - \mathbf{X}\mathbf{\beta}\}\right) \]

Maximum likelihood estimation (cont.)

Unknown model parameters:

Regression coefficients $\mathbf{\beta}$
Covariance (or variogram) parameters $\mathbf{\theta}$

The log-likelihood function (up to a constant) is given by: \[ L(\mathbf{\beta}, \mathbf{\theta};\mathbf{y}) = -\frac{1}{2} \log(|\mathbf{\Gamma}_{\mathbf{\theta}}|) -\frac{1}{2} \{\mathbf{y} - \mathbf{X}\mathbf{\beta}\}^\mathrm{T} \mathbf{\Gamma}_{\mathbf{\theta}}^{-1} \{\mathbf{y} - \mathbf{X}\mathbf{\beta}\} \]

For known variogram parameters $\mathbf{\theta}$ maximum likelihood estimate for $\mathbf{\beta}$ equals GLS estimator: \[ \hat{\mathbf{\beta}}_{\mathrm{GLS}} = (\mathbf{X}^\mathrm{T} \mathbf{\Gamma}_{\mathbf{\theta}}^{-1} \mathbf{X})^{-1} \mathbf{X}^\mathrm{T} \mathbf{\Gamma}_{\mathbf{\theta}}^{-1} \mathbf{Y} \]

Plugging $\hat{\mathbf{\beta}}_{\mathrm{GLS}}$ into $L(\mathbf{\beta}, \mathbf{\theta};\mathbf{y})$ gives the profile likelihood function for $\mathbf{\theta}$.
No closed form expression for maximum likelihood estimate
Maximize $L$ numerically by a non-linear optimization method to find $\hat{\theta}$
Numerical optimization requires initial values of $\theta$
Make sure numerical optimization converged!

Example: MaxLik estimates Wolfcamp data

library(gstat); library(georob)
data("wolfcamp"); d.w <- wolfcamp
coordinates(d.w) <- c("x", "y")
r.georob.ml <- georob(pressure~x+y, d.w, 
  locations=~x+y, variogram.model="RMspheric",
  param=c(variance=3000, nugget=1000, scale=100),
  tuning.psi=1000, control=control.georob(ml.method="ML"))
summary(r.georob.ml)


Call:georob(formula = pressure ~ x + y, data = d.w, locations = ~x + 
    y, variogram.model = "RMspheric", param = c(variance = 3000, 
    nugget = 1000, scale = 100), tuning.psi = 1000, control = control.georob(ml.method = "ML"))

Tuning constant:  1000 

Convergence in 12 function and 7 Jacobian/gradient evaluations

Estimating equations (gradient)

                                 eta         scale
  Gradient           : -8.880367e-05  7.654524e-04

Maximized log-likelihood: -458.3671 

Predicted latent variable (B):
   Min     1Q Median     3Q    Max 
-89.37 -54.98 -16.64  24.43 111.41 

Residuals (epsilon):
    Min      1Q  Median      3Q     Max 
-62.789 -19.775   6.311  18.032  62.776 

Standardized residuals:
    Min      1Q  Median      3Q     Max 
-2.3778 -0.7413  0.2249  0.6534  3.2793 


Gaussian ML estimates

Variogram:  RMspheric 
               Estimate   Lower  Upper
variance        3328.90 1453.95 7621.7
snugget(fixed)     0.00      NA     NA
nugget          1236.27  616.01 2481.1
scale            122.95   95.13  158.9


Fixed effects coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 620.3550    17.0641  36.354  < 2e-16 ***
x            -1.3256     0.1360  -9.750 2.33e-15 ***
y            -1.2061     0.1793  -6.727 2.16e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error (sqrt(nugget)): 35.16 

Robustness weights: 
 All 85 weights are ~= 1.

Wolfcamp: MaxLik fitted variogram

plot(r.georob.ml, lag.dist.def=20, max.lag=200)

5.2 Restricted maximum likelihood estimation

Equivalent number of independent observations of a sample of spatial data often much smaller than nominal sample size ⇒ bias of ML estimates of variance parameters important

The bias of MLEs of variogram parameters $\mathbf{\theta}$ can be reduced by restricted maximum likelihood estimation (REML).

Principle of REML:

Form linear combinations $\mathbf{Z} = \mathbf{A} \mathbf{Y}$ of the data $\mathbf{Y}$ that have zero expectation (no longer depend on $\beta$): \[ \mathbb{E}[\mathbf{Z}] = \mathbf{A} \mathbf{X} \mathbf{\beta} = \mathbf{0} \]

The matrix $\mathbf{A}$ must satisfy: $\mathbf{A} \mathbf{X} = \mathbf{0}$

$\mathbf{A}$ is non-unique with many possibilities

Estimate $\mathbf{\theta}$ by maximizing the likelihood function for $n - p$ elements of $\mathbf{Z}$.

REML estimates Wolfcamp data

r.georob.reml <- georob(pressure~x+y, d.w, 
  locations=~x+y, variogram.model="RMspheric",
  param=c(variance=3000, nugget=1000, scale=100),
  tuning.psi=1000)
summary(r.georob.reml)


Call:georob(formula = pressure ~ x + y, data = d.w, locations = ~x + 
    y, variogram.model = "RMspheric", param = c(variance = 3000, 
    nugget = 1000, scale = 100), tuning.psi = 1000)

Tuning constant:  1000 

Convergence in 6 function and 5 Jacobian/gradient evaluations

Estimating equations (gradient)

                                 eta         scale
  Gradient           : -2.248651e-04 -1.070402e-01

Maximized restricted log-likelihood: -456.3802 

Predicted latent variable (B):
   Min     1Q Median     3Q    Max 
-94.58 -60.99 -17.59  23.10 115.72 

Residuals (epsilon):
    Min      1Q  Median      3Q     Max 
-59.148 -18.009   6.251  15.982  54.620 

Standardized residuals:
    Min      1Q  Median      3Q     Max 
-2.4030 -0.7131  0.2282  0.6937  3.1932 


Gaussian REML estimates

Variogram:  RMspheric 
               Estimate   Lower   Upper
variance        4358.84 1810.40 10494.6
snugget(fixed)     0.00      NA      NA
nugget          1151.28  540.41  2452.7
scale            138.91   82.62   233.6


Fixed effects coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 624.3287    20.7961  30.021  < 2e-16 ***
x            -1.3291     0.1611  -8.248 2.25e-12 ***
y            -1.1804     0.2115  -5.581 3.00e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error (sqrt(nugget)): 33.93 

Robustness weights: 
 All 85 weights are ~= 1.

6 Model inference

6.1 Inference, model building and assessment

Data analysis often leads to a set of equally plausible candidate models that use different set of covariates and different variograms

compare fit of candidate models by hypothesis tests taking auto-correlation properly into account
use established goodness-of-fit criteria (AIC, BIC) to select a “best” model, again taking auto-correlation into account
use cross-validation to compare the power of candidate models to predict new data

Testing hypotheses about trend coefficients

Likelihood ratio (LRT) test can only be used to test hypotheses and build confidence regions for $\mathbf{\theta}$
LRT for regression for $\mathbf{\beta}$ in general biased (too small $p$-values)
Use conditional $F$-tests for testing hypotheses about $\mathbf{\beta}$:
1. Fit covariance parameters of “largest” regression model
  ⇒ $\hat{\mathbf{\theta}}$
2. Compute covariance matrix ⇒ $\mathbf{\Gamma}_{\hat{\theta}}$
3. Orthogonalize response vector and design matrix (using Cholesky decomposition)
4. Conventional $F$-test with orthogonalized items $\tilde{\mathbf{Y}}$ and $\tilde{\mathbf{X}}$

Fit quadratic trend surface model for Wolfcamp

r.georob.full <- update(r.georob.reml, .~.+I(x^2)+I(y^2)+x:y)
summary(r.georob.full)


Call:georob(formula = pressure ~ x + y + I(x^2) + I(y^2) + x:y, data = d.w, 
    locations = ~x + y, variogram.model = "RMspheric", param = c(variance = 3000, 
        nugget = 1000, scale = 100), tuning.psi = 1000)

Tuning constant:  1000 

Convergence in 10 function and 8 Jacobian/gradient evaluations

Estimating equations (gradient)

                                 eta         scale
  Gradient           :  3.590344e-04 -4.553394e-03

Maximized restricted log-likelihood: -470.3894 

Predicted latent variable (B):
   Min     1Q Median     3Q    Max 
-89.22 -46.81 -11.06  20.80  94.07 

Residuals (epsilon):
    Min      1Q  Median      3Q     Max 
-59.664 -18.086   6.783  16.245  49.986 

Standardized residuals:
    Min      1Q  Median      3Q     Max 
-2.4096 -0.7213  0.2500  0.6830  3.1042 


Gaussian REML estimates

Variogram:  RMspheric 
               Estimate   Lower  Upper
variance        3740.02 1469.60 9518.1
snugget(fixed)     0.00      NA     NA
nugget          1152.79  532.29 2496.6
scale            123.93   93.96  163.5


Fixed effects coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) 608.033837  30.543683  19.907  < 2e-16 ***
x            -1.180485   0.177576  -6.648 3.49e-09 ***
y            -1.212317   0.211601  -5.729 1.76e-07 ***
I(x^2)        0.001592   0.001230   1.294    0.199    
I(y^2)       -0.001731   0.002133  -0.811    0.420    
x:y           0.002067   0.001968   1.050    0.297    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error (sqrt(nugget)): 33.95 

Robustness weights: 
 All 85 weights are ~= 1.

Conditional F-test on interaction and higher-order polynomials

waldtest(r.georob.full, .~.-x:y, test="F")

Wald test

Model 1: pressure ~ x + y + I(x^2) + I(y^2) + x:y
Model 2: pressure ~ x + y + I(x^2) + I(y^2)
  Res.Df Df      F Pr(>F)
1     79                 
2     80 -1 1.1032 0.2968

waldtest(r.georob.full, .~.-I(x^2)-I(y^2)-x:y, test="F")

Wald test

Model 1: pressure ~ x + y + I(x^2) + I(y^2) + x:y
Model 2: pressure ~ x + y
  Res.Df Df      F Pr(>F)
1     79                 
2     82 -3 1.6284 0.1895

6.2 Model selection with stepwise

Given estimates of covariance parameters $\hat{\theta}$ and keeping them ﬁxed, the usual stepwise procedures for selecting covariates can be used.

Selecting models based on AIC and BIC.

Stepwise selection with AIC (defaults to both directions)

step(r.georob.full)

Start:  AIC=922.16
pressure ~ x + y + I(x^2) + I(y^2) + x:y

         Df    AIC Converged
- I(x^2)  1 922.05         1
- I(y^2)  1 922.13         1
<none>      922.16          
- x:y     1 922.49         1

Step:  AIC=922.05
pressure ~ x + y + I(y^2) + x:y

         Df    AIC Converged
<none>      922.05          
+ I(x^2)  1 922.16         1
- I(y^2)  1 922.54         1
- x:y     1 924.61         1


Tuning constant:  1000 

Fixed effects coefficients:
(Intercept)            x            y       I(y^2)          x:y  
 627.526464    -1.148338    -1.347008    -0.002587     0.003005  

Variogram:  RMspheric 
 variance(fixed)    snugget(fixed)     nugget(fixed)      scale(fixed)  
          2060.4               0.0            1402.1             103.8

Stepwise selection (defaults to both directions) with BIC

step(r.georob.full, k=log(nrow(d.w)))

Start:  AIC=936.81
pressure ~ x + y + I(x^2) + I(y^2) + x:y

         Df    AIC Converged
- I(x^2)  1 934.27         1
- I(y^2)  1 934.34         1
- x:y     1 934.70         1
<none>      936.81          

Step:  AIC=934.27
pressure ~ x + y + I(y^2) + x:y

         Df    AIC Converged
- I(y^2)  1 932.31         1
<none>      934.27          
- x:y     1 934.38         1
+ I(x^2)  1 936.81         1

Step:  AIC=932.31
pressure ~ x + y + x:y

         Df    AIC Converged
- x:y     1 931.79         1
<none>      932.31          
+ I(y^2)  1 934.27         1
+ I(x^2)  1 934.34         1

Step:  AIC=931.79
pressure ~ x + y

         Df     AIC Converged
+ I(x^2)  1  931.70         1
<none>       931.79          
+ x:y     1  932.31         1
+ I(y^2)  1  934.38         1
- y       1 1006.98         1
- x       1 1085.42         1

Step:  AIC=931.7
pressure ~ x + y + I(x^2)

         Df     AIC Converged
<none>       931.70          
- I(x^2)  1  931.79         1
+ x:y     1  934.34         1
+ I(y^2)  1  934.70         1
- y       1  991.37         1
- x       1 1075.89         1


Tuning constant:  1000 

Fixed effects coefficients:
(Intercept)            x            y       I(x^2)  
 592.266784    -1.285190    -1.143500     0.001875  

Variogram:  RMspheric 
 variance(fixed)    snugget(fixed)     nugget(fixed)      scale(fixed)  
          2060.4               0.0            1402.1             103.8

7 Kriging predictions

7.1 Prediction problem formulation

Observations $\mathbf{y}^\mathrm{T} = (y_{1}, \ldots, y_{n})$ available for a set of $n$ locations $\mathbf{x}_{i}$

Consider $\mathbf{y}$ as a realization of the random variable $\mathbf{Y}^\mathrm{T} = (Y_{1}, \ldots, Y_{n})$

Model: $Y_{i} = S(\mathbf{x}_{i}) + Z_{i}$ with:

$Y_{i}$: the $i^\mathrm{th}$ datum
$S(\mathbf{x}_{i})$: “signal” (the true quantity) at location $\mathbf{x}_{i}$
$\{S(\mathbf{x}_{i})\}$: Gaussian process, parametrized by:
- Trend: $\mu(\mathbf{x}_{i}) = \sum_{k} d_{k}(\mathbf{x}_{i}) \beta_{k} = \mathbf{d}(\mathbf{x}_{i})^\mathrm{T} \mathbf{\beta}$
- Covariance function $\gamma(\mathbf{h}; \mathbf{\theta})$ or variogram $V(\mathbf{h}; \mathbf{\theta})$
$Z_{i}$: independent, identically distributed (iid) Gaussian measurement error with variance $\tau^2$

Predictions: Let’s say $\hat{\mathbf{S}}$ is the prediction of $\mathbf{S}^\mathrm{T} = (S(\mathbf{x}_{1}^\prime), \ldots, S(\mathbf{x}_{m}^\prime))$ for a set of $m$ locations $\mathbf{x}_{j}^\prime$ without data.

$\hat{\mathbf{S}}$ is computed from $\mathbf{Y}$, therefore $\hat{\mathbf{S}} = \hat{\mathbf{S}}(\mathbf{Y})$.

7.2 Kriging prediction at new point location

Ordinary punctual kriging

For same spatial support – prediction entities are assumed to have same extension as observations (i.e. sampling area, sensor size)

Prediction of signal $S(\mathbf{x}_{0})$ at location $\mathbf{x}_{0}$ without measurement

\[ \hat{S}(\mathbf{x}_{0}) = \sum_{i=1}^n \kappa_{i}(\mathbf{x}_{0})\, y(\mathbf{x}_{i}) \]

with kriging weights $\kappa_{i}(\mathbf{x}_{0})$. To ensure unbiased estimates, weights are made to sum to 1:

\[ \sum_{i=i}^N \kappa_i(x_0) = 1 \]

Expected difference $\mathbb{E}[S(\mathbf{x}_{0}) - \hat{S}(\mathbf{x}_{0})] = 0$.

Estimate of kriging variances

Estimate of the variance of the prediction error:

\[ var[\hat{S}(\mathbf{x}_{0})] = \mathbb{E}\left[\left\lbrace S(\mathbf{x}_{0}) - \hat{S}(\mathbf{x}_{0}) \right\rbrace\right] \]

\[ = 2\sum_{i=1}^N \kappa_{i}(\mathbf{x}_{0}) V(\mathbf{x_i}, \mathbf{x_0}) - \sum_{i=1}^N \sum_{j=1}^N \kappa_{i}(\mathbf{x}_{0}) \kappa_{j}(\mathbf{x}_{0}) V(\mathbf{x_i}, \mathbf{x_j}) \] with $V(\mathbf{x_i}, \mathbf{x_0})$ being the semivariance of $S$ between sampling point $x_i$ and the target prediction point $\mathbf{x_0}$ and $V(\mathbf{x_i}, \mathbf{x_j})$ the semivariance between the $i$th and the $j$th sampling points.

Find kriging weights

Kriging weights are calculated by solving the system of equations that minimizes the prediction error variance subject to the constraints of the variogram model.

Find weights $\kappa_{i}(\mathbf{x}_{0})$ that minimize kriging variances and sum to 1 by solving $N + 1$ equations:

\[ \sum_{i=1}^N \kappa_{i}(\mathbf{x}_{0}) V(\mathbf{x_i}, \mathbf{x_j}) + \psi(\mathbf{x_0}) = V(\mathbf{x_j}, \mathbf{x_0}) \quad \text{for all} ~~j \]

\[ \sum_{i=i}^N \kappa_i(x_0) = 1 \]

The quantity $\psi(\mathbf{x_0})$ is a Lagrange multiplier introduced to achieve minimization. Computed as the inverse of the semivariance matrix multiplied by the vector of semivariances to the target point.

Properties of kriging prediction

Shape of variogram function close to origin determine shape of prediction surface near data locations
Continuity and diffentiability of variogram at origin control geometrical properties of simple kriging prediction surface

7.3 Universal/external drift kriging predictions

Universal kriging: often referred if trend is modelled by coordinates

External-drift kriging: trend is modelled by spatial covariates

Evaluating $\hat{\mathbf{S}}_{\text{opt}}$ requires a fully specified weakly stationary model:

Structure of trend function is known
Regression coefficients $\mathbf{\beta}$ are known
Type of parametric covariance (variogram) function is known
Parameters $\mathbf{\theta}$ and $\tau^2$ of the covariance function are known

Relaxed assumptions: Only 1, 3, and 4 are assumed to be known, while $\mathbf{\beta}$ is implicitly estimated from the data using generalized least squares (GLS).

UK/EDK predictions

Therefore, we use the universal kriging (UK) or external drift kriging (EDK) plug-in predictor: \[ \hat{\mathbf{S}}_{\text{k}} = \mathbf{X}_{\mathbf{S}} \hat{\mathbf{\beta}}_{\text{GLS}} + \mathbf{\Lambda} \left( \mathbf{y} - \mathbf{X}_{\mathbf{Y}} \hat{\mathbf{\beta}}_{\text{GLS}} \right) \]
with: \[ \mathbf{\Lambda} = \mathbf{\Sigma}_{\mathbf{S}\mathbf{Y}} \mathbf{\Gamma}_{\mathbf{Y}\mathbf{Y}}^{-1} \]

Computing universal kriging predictor requires:

known structure of trend function
known structure and parameters of variogram

“Plug-in” predictor: uncertainty of variogram is ignored when computing predictions