library(spDataLarge)
library(data.table)
data(lsl, package = "spDataLarge")
table(lsl$lslpts)/nrow(lsl)*100
FALSE TRUE
50 50
Computer lab – Session 3
Download PDFs
We will spatially predict landslide incidents in dataset lsl
from R package spDataLarge
(see excercieses session 1). The response is a binary variable lslpts
. Further, five covariates derived from the elevation model are available.
Investigate the dataset: Compute the percentage of each class. Is the dataset balanced?
Create boxplots of response vs. covariates. Which relationships do you see?
library(spDataLarge)
library(data.table)
data(lsl, package = "spDataLarge")
table(lsl$lslpts)/nrow(lsl)*100
FALSE TRUE
50 50
The response is exactly balanced. No steps have to be taken to account for uneven distribution of the response classes.
library(ggplot2)
# transform from wide to long table format
<- melt(data.table(lsl), id.vars = "lslpts")
dt.lsl
par(mfrow = c(2,3))
ggplot(dt.lsl, aes(x = lslpts, y = value))+
geom_boxplot()+
facet_wrap(. ~variable, scales = "free")+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The value distributions for each class overlap for all of the covariates, but we do observe clear differences in some plots. While cprof (profile curvature of terrain) is about the same for both response classes, cplan (planar curvature) shows different median. Slope seems also to differentiate across the response classes well with landslides occurring in steeper terrains, also at higher elevations. Along the coordinate axis (x, y) we do not see too much structure useful for prediction.
Fit a random forest model using the terrain information as covariates (without spatial coordinates x
and y
).
You can for example use package ranger
. Use arguments importance = "permutation"
to obtain permuted covariate importance and probability = T
to obtain probability predictions for the next tasks.
Evaluate covariate importance using the function importance()
. Create a barplot for better visibility. Is the performance of the covariates as expected from the exploratory analysis?
library(ranger)
.1 <- ranger(y = lsl$lslpts,
rfx = lsl[4:8],
importance = "permutation",
probability = T, seed = 13)
.1 rf
Ranger result
Call:
ranger(y = lsl$lslpts, x = lsl[4:8], importance = "permutation", probability = T, seed = 13)
Type: Probability estimation
Number of trees: 500
Sample size: 350
Number of independent variables: 5
Mtry: 2
Target node size: 10
Variable importance mode: permutation
Splitrule: gini
OOB prediction error (Brier s.): 0.1553703
<- importance(rf.1)
t.i <- t.i[order(t.i)]
t.i par(mar=c(3,6,2,2))
barplot(t.i, horiz = T, las = 2)
Largest importance is reported for slope, cplan and elev. This alings with the different distribution per response class observed in the boxplots above. However, log10_carea (log of flow accumulation area) and cprof still contribute to the model fit, most likely in tails of their distributions or in by (non-linear) interactions with the other covariates.
Load the area to be predicted containing the covariates:
library(terra)
<- rast(system.file("raster/ta.tif", package = "spDataLarge")) ta
Create predictions for the entire extent of the ta.tif
. Use predict()
with the random forest object from the section above. Use as.data.frame()
to transform the raster values into a data.frame
. What information/values does the function predict()
return?
Create a SpatRaster
object with the function rast()
containing the predictions for the class TRUE
. Plot the resulting map.
x
and y
coordinates for each pixel using as.data.frame(, xy = TRUE)
.
The predictions were now created for the entire extent. We only have observations in the central part. Therefore, we need crop the predictions by the study area to avoid spatial extrapolation.
Note: this could also have been done before predicting to increase computational efficiency.
Load the study area:
data("study_mask", package = "spDataLarge")
Set the pixels outside of the study area to NA
using the function mask()
from terra
.
<- predict(object = rf.1, data = as.data.frame(ta))
d.pred str(d.pred)
List of 5
$ predictions : num [1:158326, 1:2] 0.375 0.344 0.3 0.16 0.251 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:2] "FALSE" "TRUE"
$ num.trees : num 500
$ num.independent.variables: num 5
$ num.samples : int 158326
$ treetype : chr "Probability estimation"
- attr(*, "class")= chr "ranger.prediction"
A list is returned with some general information. The first slot contains a matrix of the predictions with probability for each class. The order of the classes is given in the attributes.
<- as.data.frame(ta, xy = TRUE)
d.xy $predTRUE <- d.pred$predictions[,2]
d.xy<- rast(d.xy, type = "xyz", crs = crs(ta))
r.pred plot(r.pred[["predTRUE"]])
<- mask(r.pred, study_mask)
r.pred plot(r.pred[["predTRUE"]])
Inform yourself on accumulated local effects plots: Chapter 8.2, Molnar, 2024.
Print accumulated local effects plots for all covariates using the function ALEPlot()
from the package ALEPlot
. What is the interpretation of these plots?
Note: You will need to define a prediction function (yhat
on the help page of ALEPlot()
). Make sure this function returns a vector of the probabilities just for one class, see output of predict()
above.
library(ALEPlot)
<- function(X.model, newdata){ as.numeric(predict(object = X.model, data = newdata, )$predictions[,2]) }
yhat
par(mfrow = c(2,3))
for( ii in 1:5){
.1=ALEPlot(lsl[,4:8], rf.1, pred.fun = yhat, J = ii, K = 50, NA.plot = TRUE)
ALE }
Slope shows a nearly linear relationship starting from value of 25 degrees. cplan seems to have an inflection point at a value of 0 (no curvature), hence the covariate nearly operates as a binary class covariate. There are still some tree nodes split at other values than 0, hence it is still adisiable to use cplan as a continious covariate. elev seems to relate in a nearly quadratic way.
Note also the different range of the y-axis of the plots. The more important covariates according to the importance plot above show larger range compared to the less important ones.
<- 30 # rotation angle
a *cos(a/180*pi) - y*sin(a/180*pi)
x*sin(a/180*pi) + y*cos(a/180*pi) x
Add spatial coordinate axis including rotation by e.g. 30 and 60 degrees as additional covariates. Fit the above random forest model again and compute predictions and create the map.
Compare the models by comparing the out-of-bag (OOB) model fit, covariate importance and the output maps. Create a difference map with new.raster <- raster1 - raster2
.
If you have done so above, also compute accumulated local effects plots.
$x30 <- lsl$x*cos(30/180*pi) - lsl$y*sin(30/180*pi)
lsl$y30 <- lsl$x*sin(30/180*pi) + lsl$y*cos(30/180*pi)
lsl$x60 <- lsl$x*cos(60/180*pi) - lsl$y*sin(60/180*pi)
lsl$y60 <- lsl$x*sin(60/180*pi) + lsl$y*cos(60/180*pi)
lsl
.2 <- ranger(y = lsl$lslpts,
rfx = lsl[c(1:2,4:12)],
importance = "permutation",
probability = T, seed = 13)
.2 rf
Ranger result
Call:
ranger(y = lsl$lslpts, x = lsl[c(1:2, 4:12)], importance = "permutation", probability = T, seed = 13)
Type: Probability estimation
Number of trees: 500
Sample size: 350
Number of independent variables: 11
Mtry: 3
Target node size: 10
Variable importance mode: permutation
Splitrule: gini
OOB prediction error (Brier s.): 0.1500099
# add rotated coordinate axis to prediction data.frame
$x30 <- d.xy$x*cos(30/180*pi) - d.xy$y*sin(30/180*pi)
d.xy$y30 <- d.xy$x*sin(30/180*pi) + d.xy$y*cos(30/180*pi)
d.xy$x60 <- d.xy$x*cos(60/180*pi) - d.xy$y*sin(60/180*pi)
d.xy$y60 <- d.xy$x*sin(60/180*pi) + d.xy$y*cos(60/180*pi)
d.xy
# predict
<- predict(object = rf.2, data = d.xy)
d.pred2
# extract predictions and create raster
$predTRUE_xy <- d.pred2$predictions[,2]
d.xy
<- rast(d.xy, type = "xyz", crs = crs(ta))
r.pred <- mask(r.pred, study_mask)
r.pred plot(r.pred, c(6, 11)) # plot multiple bands
.1 rf
Ranger result
Call:
ranger(y = lsl$lslpts, x = lsl[4:8], importance = "permutation", probability = T, seed = 13)
Type: Probability estimation
Number of trees: 500
Sample size: 350
Number of independent variables: 5
Mtry: 2
Target node size: 10
Variable importance mode: permutation
Splitrule: gini
OOB prediction error (Brier s.): 0.1553703
.2 rf
Ranger result
Call:
ranger(y = lsl$lslpts, x = lsl[c(1:2, 4:12)], importance = "permutation", probability = T, seed = 13)
Type: Probability estimation
Number of trees: 500
Sample size: 350
Number of independent variables: 11
Mtry: 3
Target node size: 10
Variable importance mode: permutation
Splitrule: gini
OOB prediction error (Brier s.): 0.1500099
Brier score (similar to a RMSE, but for probabilistic predictions) is somewhat lower for the larger model including coordinate axis.
<- importance(rf.2)
t.i2 <- t.i2[order(t.i2)]
t.i2 par(mar=c(3,6,2,2), mfrow = c(1,2))
barplot(t.i, horiz = T, las = 2)
barplot(t.i2, horiz = T, las = 2)
slope, cplan and elev still rank high in the importance plot. But, the covariates of lower importance (log10_carea, cprof) are outperformed by some of the rotated coordinate axis. This indicates there is some spatial trend not accounted for by the thematic covariates.
# summary of the difference of predicted probabilities
summary(d.xy$predTRUE - d.xy$predTRUE_xy)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.572753 -0.075642 -0.003978 -0.010429 0.066360 0.527910
library(sf)
Linking to GEOS 3.11.1, GDAL 3.6.2, PROJ 9.1.1; sf_use_s2() is TRUE
$diff <- r.pred[["predTRUE"]] - r.pred[["predTRUE_xy"]]
r.predplot(r.pred[["diff"]], main = "Difference map")
# add observed locations
<- st_as_sf(lsl, coords = c("x", "y"), crs = 32717)
v.lsl points(v.lsl)
On average, the model with x
and y
coordinates predicts slightly larger landslide probabilities. By inspecting the maps, we see the pattern becoming more pronounced. In the North of the study are the probabilities become lower (bluer), in the Southeast the probabilities have become larger (yellower in the probability prediction map). The overall pattern remains the same, it seems including the x
and y
will lead to a clearer decision to assign the final class (discrete prediction).
par(mfrow = c(2,3))
for( ii in 1:(ncol(lsl)-1)){
.1=ALEPlot(lsl[,-c(3)], rf.2, pred.fun = yhat, J = ii, K = 50, NA.plot = TRUE)
ALE }
The modelled relationships of the thematic covariates remain roughly the same. The rotated coordinate axis show highly non-linear behaviour. This indicates overall non-linear trends, but most likely more modelling of local fluctuations. To know more about the contribution to local prediction we would need to investiate the interactions of the coordinate axis to see how the study area gets segmeted by the axis.
We will use the berne
soil mapping dataset from the R package geoGAM
. Due to size limitations of CRAN berne.grid
for predictions only covers a very small part of the study area. We can still use it, to exemplify the prediction process.
Use topsoil pH ph.0.10
as response. Starting from column cl_mt_etap_pe
you find a large number of potential covariates in the dataset.
library(geoGAM)
data(berne)
data(berne.grid)
Plot a histogram of the response. Investigate the value range and the completeness of the data.
Screen the covariates. What data types are there?
Create an sf
object and display the data points with the response using mapview
. The coordinate reference system information can be found on the help page of the dataset.
hist(berne$ph.0.10, breaks = 20, main = "Topsoil pH", xlab = "")
summary(berne$ph.0.10)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
4.50 5.80 6.30 6.33 6.90 8.50 182
The distribution of the response is about normal. There are 182 missing values in the response.
# ?berne # for more information
table( unlist( lapply(berne[,c(14:ncol(berne))], class) ) )
factor numeric
11 214
There are mostly numeric covariates in the data, but also 11 factors. They originate from geological maps or overview soil maps.
table(complete.cases(berne[,c(14:ncol(berne))]))
FALSE TRUE
13 1039
table(NAcovars = !complete.cases(berne[,c(14:ncol(berne))]),
NAresponse = is.na(berne$ph.0.10))
NAresponse
NAcovars FALSE TRUE
FALSE 862 177
TRUE 8 5
There are some additional rows with missing values in the covariates. berne
is a legacy soil dataset with soil data sampled in the 1970ties. Not all of the observed soil properties are available for all locations. We will remove the rows with missing values for the subsequent analysis.
table(complete.cases(berne.grid))
FALSE
4594
<- function(x){sum(is.na(x))}
f.na.sum <- unlist(lapply(berne.grid, f.na.sum))
l
>0] l[l
ge_geo500h1id tr_se_twi2m
4594 39
$ge_geo500h1id <- NULL
berne$tr_se_twi2m <- NULL berne
The prediction grid seems also to have missing values. Since we do not have any further knowledge about the dataset, we for now remove these columns from the analysis.
library(mapview)
<- st_as_sf(berne, coords = c("x", "y"), crs = 21781)
v.berne mapview(v.berne, zcol = "ph.0.10")
Fit a random forest model using ranger
. Remove rows with missing values in the response or the covariates. We use the predefined calibration and validation datasets. Hence, only use the rows labeled “calibration” in row dataset
.
Run the Boruta model selection algorithm with function Boruta()
from the R package Boruta
. Remove covariates that are considered unimportant according to the results.
Tune the number of covariates randomly chosen for testing at each tree split (mtry
). You can either implement this yourself (apply
or for
over the possible mtry
values which are 1:number.covariates
, then select mtry
of the model with the lowest OOB mean squared error) or use a meta-package like caret
.
Here some input how it can be done with caret
:
# create a matrix with all settings to test
# we limit ourself to mtry
<- expand.grid(
options.to.test .splitrule = c("variance"),
.min.node.size = c(5),
.mtry = c(5, 15, 20, ... )
)# model training with different settings
<- train(
rf.caret x = d.ph[, names .. of covariates still in the game ],
y = d.ph$ph.0.10,
tuneGrid = options.to.test,
method = "ranger",
trControl = trainControl(method = "cv", number = 10)
)
<- berne[berne$dataset == "calibration" & !is.na(berne$ph.0.10), ]
d.ph <- d.ph[complete.cases(d.ph[14:ncol(d.ph)]), ]
d.ph
<- ranger(x = d.ph[, 14:ncol(d.ph) ],
rf.ph y = d.ph$ph.0.10,
seed = 13)
rf.ph
Ranger result
Call:
ranger(x = d.ph[, 14:ncol(d.ph)], y = d.ph$ph.0.10, seed = 13)
Type: Regression
Number of trees: 500
Sample size: 668
Number of independent variables: 223
Mtry: 14
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 0.2863646
R squared (OOB): 0.4860319
library(Boruta)
set.seed(13)
<- Boruta(x = d.ph[, 14:ncol(d.ph) ],
boruta.model y = d.ph$ph.0.10,
maxRuns = 30)
# Check ranking compared to shadow variables
par(oma = c(8,3,2,2))
plot(boruta.model, las = 2, xlab = "", cex.lab = 0.8)
# use for example "Confirmed" (green) and "Tentative" (yellow) > Max.Shadow.
<- names(boruta.model$finalDecision[ boruta.model$finalDecision %in% c("Tentative", "Confirmed")[2] ])
names.cov.selected names.cov.selected
[1] "cl_mt_etap_ro" "cl_mt_gh_2" "cl_mt_gh_3" "cl_mt_gh_4"
[5] "cl_mt_gh_5" "cl_mt_gh_6" "cl_mt_gh_8" "cl_mt_gh_9"
[9] "cl_mt_pet_pe" "cl_mt_pet_ro" "cl_mt_rr_1" "cl_mt_rr_10"
[13] "cl_mt_rr_11" "cl_mt_rr_12" "cl_mt_rr_2" "cl_mt_rr_3"
[17] "cl_mt_rr_4" "cl_mt_rr_5" "cl_mt_rr_7" "cl_mt_rr_8"
[21] "cl_mt_rr_9" "cl_mt_rr_y" "cl_mt_swb_pe" "cl_mt_swb_ro"
[25] "sl_physio_neu" "tr_cindx50_25" "tr_nego" "tr_ns25_145mn"
[29] "tr_ns25_75sd" "tr_se_twi2m_s30" "tr_se_twi2m_s7" "tr_tsc25_40"
<- ranger(x = d.ph[, names.cov.selected],
rf.ph.sel y = d.ph$ph.0.10,
seed = 13)
rf.ph.sel
Ranger result
Call:
ranger(x = d.ph[, names.cov.selected], y = d.ph$ph.0.10, seed = 13)
Type: Regression
Number of trees: 500
Sample size: 668
Number of independent variables: 32
Mtry: 5
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 0.2921226
R squared (OOB): 0.4756975
Own implementation based on OOB MSE
library(parallel)
# Define function to use below
<- function(test.mtry, # mtry to test
f.tune.randomforest # calibration data.frame
d.cal, # list of covariates
l.covariates ){ # fit random forest with mtry = test.mtry
<- ranger(x = d.cal[, l.covariates ],
rf.tune y = d.cal[, "ph.0.10"],
mtry = test.mtry,
seed = 13)
# return the mean squared error (mse) of this model fit
return( rf.tune$prediction.error )
}
# Vector of mtry to test
<- 1:(length(names.cov.selected) - 1)
seq.mtry # Only take every fifth mtry to speed up tuning
# seq.mtry <- seq.mtry[ seq.mtry %% 5 == 0 ]
# Apply function to sequence.
# (without parallel computing this takes a while)
<- mclapply(seq.mtry, # give sequence
t.OOBe FUN = f.tune.randomforest, # give function name
mc.cores = 1, ## number of CPUs
mc.set.seed = FALSE, # do not use new seed each time
# now here giv the arguments to the function:
d.cal = d.ph,
l.covariates = names.cov.selected )
# Hint: Who is not comfortable with "mclapply"
# the same could be achieved with
# for(test.mtry in seq.mtry){
# .. content of function + vector to collect result... }
# create a dataframe of the results
<- data.frame(mtry.n = seq.mtry, mtry.OOBe = unlist(t.OOBe))
mtry.oob
# get the mtry with the minimum MSE
<- mtry.oob$mtry.n[ which.min(mtry.oob$mtry.OOBe) ]
s.mtry # Optimal mtry
s.mtry
[1] 2
# plot result
plot( mtry.oob$mtry.n, mtry.oob$mtry.OOBe, pch = 4, las = 2,
ylab = "out-of-bag MSE error", xlab = "mtry")
abline(v = s.mtry, lty = "dashed", col = "darkgrey")
lines( lowess( mtry.oob$mtry.n, mtry.oob$mtry.OOBe ), lwd = 1.5, col = "darkgrey")
Use caret
with cross-validation
library(caret)
# create a matrix with all settings to test
# we limit ourself to mtry
<- expand.grid(
options.to.test .splitrule = c("variance"),
.min.node.size = c(5),
.mtry = 1:(length(names.cov.selected) - 1)
)# model training with different settings
<- train(
rf.caret x = d.ph[, names.cov.selected ],
y = d.ph$ph.0.10,
tuneGrid = options.to.test,
method = "ranger",
trControl = trainControl(method = "cv", number = 10)
)# print best settings
$bestTune rf.caret
mtry splitrule min.node.size
1 1 variance 5
Fit random forest with optimal mtry
<- ranger(x = d.ph[, names.cov.selected ],
rf.ph.sel.tuned y = d.ph$ph.0.10,
mtry = rf.caret$bestTune$mtry,
seed = 13)
rf.ph.sel.tuned
Ranger result
Call:
ranger(x = d.ph[, names.cov.selected], y = d.ph$ph.0.10, mtry = rf.caret$bestTune$mtry, seed = 13)
Type: Regression
Number of trees: 500
Sample size: 668
Number of independent variables: 32
Mtry: 1
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 0.2895098
R squared (OOB): 0.4803869
Optimal mtry
is very low. Depending on how it is precisely evaluated optimal mtry
is 1 or 2. Low mtry
means a large degree of randomization is used to fit the regression trees as only 1 or 2 randomly chosen covariates are tried at each tree node to split the data.
Compute the random forest residuals (use predict()
on the calibration data). Investigate the residuals for spatial structure. Plot the residuals against x
and y
axis and create a sample variogram.
Test if the model can be improved by adding rotated coordinate axis. Compare the OOB mean squared error of the model with and without the coordinates.
<- predict(rf.ph.sel.tuned, data = d.ph[, names.cov.selected ])
fitted $residuals <- d.ph$ph.0.10 - fitted$predictions
d.ph
ggplot(d.ph, aes(x = x, y = residuals)) +
geom_point() + geom_smooth(se=F)
ggplot(d.ph, aes(x = y, y = residuals)) +
geom_point() + geom_smooth(se=F)
library(georob)
plot(sample.variogram(d.ph$residuals, locations = d.ph[, c("x", "y")],
estimator = "matheron", lag.dist.def = 200, max.lag = 9000))
According to the lowess scatterplot smoother we do not observe any remaining trend along the coordinate axis.
The sample variogram shows a “pure nugget” variogram, i.e. there seems to be not short range auto-correlation in the residuals.
From this result, we do not expect that the predictions improve when including spatial coordinates.
# add rotated coordinate axis to training data.frame
$x30 <- d.ph$x*cos(30/180*pi) - d.ph$y*sin(30/180*pi)
d.ph$y30 <- d.ph$x*sin(30/180*pi) + d.ph$y*cos(30/180*pi)
d.ph$x60 <- d.ph$x*cos(60/180*pi) - d.ph$y*sin(60/180*pi)
d.ph$y60 <- d.ph$x*sin(60/180*pi) + d.ph$y*cos(60/180*pi)
d.ph
<- c(names.cov.selected, paste0( rep(c("x", "y"), 3), c("", "30", "60")) )
names.xy
<- ranger(x = d.ph[, names.xy],
rf.ph.sel.tuned.xy y = d.ph$ph.0.10,
mtry = rf.caret$bestTune$mtry,
seed = 13)
rf.ph.sel.tuned.xy
Ranger result
Call:
ranger(x = d.ph[, names.xy], y = d.ph$ph.0.10, mtry = rf.caret$bestTune$mtry, seed = 13)
Type: Regression
Number of trees: 500
Sample size: 668
Number of independent variables: 38
Mtry: 1
Target node size: 5
Variable importance mode: none
Splitrule: variance
OOB prediction error (MSE): 0.2873347
R squared (OOB): 0.4842908
According to OOB metrics the model fit improves slightly. We observe an R2 of 0.480 and a MSE of 0.29 without and a slightly larger R2 of 0.484 with a slightly lower MSE of 0.287 with the rotated coordinate axis.
This is opposed to the plots (scatterplots and variogram) above. Therefore, rotated x
and y
coordinate axis likely form interactions with other covariates. Due to the hierarchical structure of regression trees, they designed to model interactions. We would need to investigate the modeled interaction structure by for example using second-order ALE plots, see Chapter 8.2, Molnar, 2024.
This shows that classical “diagnostics” used in the linear model context may not be reliable indicators for machine learning models.
Compute predictions for berne.grid
and plot the result. Are you happy with the visual result?
Compute predictions for the rows labeled with “validation” in the column dataset
in the berne
dataset. Compute the R2 and compare it with the OOB R2.
Save the predictions for the “validation” rows together with the observed values and coordinates. We will inspect them in the next lab session.
library(terra)
<- berne.grid[ complete.cases(berne.grid[names.cov.selected]), ]
berne.grid
# add rotated coordinate axis to prediction data.frame
$x30 <- berne.grid$x*cos(30/180*pi) - berne.grid$y*sin(30/180*pi)
berne.grid$y30 <- berne.grid$x*sin(30/180*pi) + berne.grid$y*cos(30/180*pi)
berne.grid$x60 <- berne.grid$x*cos(60/180*pi) - berne.grid$y*sin(60/180*pi)
berne.grid$y60 <- berne.grid$x*sin(60/180*pi) + berne.grid$y*cos(60/180*pi)
berne.grid
$pred.ph <- predict(rf.ph.sel.tuned.xy, berne.grid)$predictions
berne.grid<- rast(berne.grid[, c("x", "y", "pred.ph")], type = "xyz")
r.ph plot(r.ph)
The predictions created for the small section of the study area available in berne.grid
seem to be rather “ugly”. There are two types of artefacts:
sort of small scale flow lines, mostly in the Northwest of the map section,
large vertical lines and large rectangular areas, mainly in the east of the map section.
The latter is certainly not typical for the spatial variation of soil pH. The small scale pattern most likely originates from a derivative of the elevation model. Since the color scale only covers a part of the value range of our response (min: 4.5, max: 8.8) these patterns get exaggerated by the current display.
The large rectangular structures originate from the covariates starting with “cl_mt_..”. Those are climatic maps (e.g. average yearly rainfall) that were only available at a large pixel resolution. The pixels are now directly visible in the prediction map. To avoid this outcome we could use data with higher resolution (if available) or smooth the raster with a filter before using it as covariate (e.g. function terra::focal()
). Smoothing is a practical solution, but the uncertainty in the covariates is not accounted for.
Closely investigating the resulting maps is a required step of spatial prediction to avoid publishing data sets with non-feasible content.
<- berne[berne$dataset == "validation" & !is.na(berne$ph.0.10), ]
d.ph.val <- d.ph.val[complete.cases(d.ph.val[13:ncol(d.ph.val)]), ]
d.ph.val
# add rotated coordinate axis to validation data.frame
$x30 <- d.ph.val$x*cos(30/180*pi) - d.ph.val$y*sin(30/180*pi)
d.ph.val$y30 <- d.ph.val$x*sin(30/180*pi) + d.ph.val$y*cos(30/180*pi)
d.ph.val$x60 <- d.ph.val$x*cos(60/180*pi) - d.ph.val$y*sin(60/180*pi)
d.ph.val$y60 <- d.ph.val$x*sin(60/180*pi) + d.ph.val$y*cos(60/180*pi)
d.ph.val
$pred.ph.0.10 <- predict(rf.ph.sel.tuned.xy, d.ph.val)$predictions
d.ph.val
# R2 = 1 - RSS / TSS (RSS: residual sum of squares, TSS: total sum of squares)
1 - sum( (d.ph.val$pred.ph.0.10 - d.ph.val$ph.0.10)^2) / sum((d.ph.val$ph.0.10 - mean(d.ph.val$ph.0.10))^2)
[1] 0.4519854
$r.squared rf.ph.sel.tuned.xy
[1] 0.4842908
The R2 computed on the validation set is only slightly lower compared to the out-of-bag error. Therefore, it seems the prediction model is generalizing well to predict new locations unseen to the model.
However, we would need to investigate the locations of the validation data and if those allow for a valid conclusion regarding the accuracy of the resulting map.
<- d.ph.val[, c("site_id_unique", "x", "y", "ph.0.10","pred.ph.0.10")]
d.ph.val write.csv(d.ph.val, file = "lab3-validation-set-ph-predictions.csv")