Analysing the Seeds data in INLA with a GAM
Haakon Bakka
BTopic102 updated 1/22/2017
About
This is a detailed example of a good model for a simple dataset called Seeds. The goal is to show a simple example of INLA. The unique identifier is BTopic102.For an overview of all the related topics see [todo] http://r-inla.org/nourl. For an overview of the words and concepts see http://r-inla.org/nourl.
0. Initialise R session
We load any packages and set global options.library(INLA); rm(list=ls())
options(width=70, digits=2)
1. Load data, rename, rescale
data(Seeds); dat = Seeds
This dat
is the orginial dataframe. We will keep dat unchanged, and create another object (df
) later. Next we explain the data.?Seeds
head(dat)
## r n x1 x2 plate
## 1 10 39 0 0 1
## 2 23 62 0 0 2
## 3 23 81 0 0 3
## 4 26 51 0 0 4
## 5 17 39 0 0 5
## 6 5 6 0 1 6
# - r is the number of seed germinated (successes)
# - n is the number of seeds attempted (trials)
# - x1 is the type of seed
# - x2 is the type of root extract
# - plate is the numbering of the plates/experiments
All the covariates are factors in this case, as the numbering of the plates are arbitrary. Therefore we do not re-scale any covariates. The observations are integers, and we do not rescale these either.df = data.frame(y = dat$r, Ntrials = dat$n, dat[, 3:5])
I always name the dataframe that is going to be used in the inference to ‘df’, keeping ‘dat’ as the original dataframe. The observations are always named ‘y’.2. Explore data
summary(df)
## y Ntrials x1 x2 plate
## Min. : 0 Min. : 4 Min. :0.00 Min. :0.00 Min. : 1
## 1st Qu.: 8 1st Qu.:16 1st Qu.:0.00 1st Qu.:0.00 1st Qu.: 6
## Median :17 Median :39 Median :0.00 Median :1.00 Median :11
## Mean :20 Mean :40 Mean :0.48 Mean :0.52 Mean :11
## 3rd Qu.:26 3rd Qu.:51 3rd Qu.:1.00 3rd Qu.:1.00 3rd Qu.:16
## Max. :55 Max. :81 Max. :1.00 Max. :1.00 Max. :21
table(df$x1)
##
## 0 1
## 11 10
table(df$x2)
##
## 0 1
## 10 11
This checks whether any covariates or factors are still out of scale, or senseless.plot(df)
This looks good. This scatterplot of the data can highlight outliers and other issues.
3. Observation Likelihood
family1 = "binomial"
control.family1 = list(control.link=list(model="logit"))
This specifies which likelihood we are going to use for the observations. The binomial distribution is defined with a certain number of trials, in INLA known as Ntrials
. In control.family
we would also specify the priors for any hyper-parameters.4. Formula
hyper1 = list(theta = list(prior="pc.prec", param=c(1,0.01)))
formula1 = y ~ x1 + x2 + f(plate, model="iid", hyper=hyper1)
This specifies the formula, and the priors for any hyper-parameters in the random effects.5. Call INLA
res1 = inla(formula=formula1, data=df,
family=family1, Ntrials=Ntrials,
control.family=control.family1)
This is the inla-call, where we just collect all the variables as defined above.6. Look at results
summary(res1)
##
## Call:
## c("inla(formula = formula1, family = family1, data = df, Ntrials = Ntrials, ", " control.family = control.family1)")
##
## Time used:
## Pre-processing Running inla Post-processing Total
## 0.989 0.146 0.065 1.200
##
## Fixed effects:
## mean sd 0.025quant 0.5quant 0.97quant mode kld
## (Intercept) -0.39 0.18 -0.73 -0.39 -0.046 -0.40 0
## x1 -0.35 0.23 -0.82 -0.35 0.054 -0.33 0
## x2 1.03 0.22 0.60 1.04 1.440 1.04 0
##
## Random effects:
## Name Model
## plate IID model
##
## Model hyperparameters:
## mean sd 0.025quant 0.5quant 0.97quant mode
## Precision for plate 19.62 30.37 3.085 10.69 94.01 6.048
##
## Expected number of effective parameters(std dev): 9.895(2.915)
## Number of equivalent replicates : 2.122
##
## Marginal log-Likelihood: -68.44
This summary shows many of the results, notably not including the random effects (the parameters are not shown).#str(res1, 1)
This command, if run, shows one step in the list-of-lists hierarchy that the inla result is.7. Look at the random effect
res1$summary.random$plate
## ID mean sd 0.025quant 0.5quant 0.97quant mode kld
## 1 1 -0.282 0.27 -0.894 -0.251 0.151 -0.1531 3.6e-09
## 2 2 -0.079 0.22 -0.557 -0.066 0.330 -0.0309 2.0e-12
## 3 3 -0.311 0.24 -0.842 -0.289 0.083 -0.2450 2.6e-07
## 4 4 0.213 0.24 -0.213 0.193 0.706 0.1199 3.8e-10
## 5 5 0.054 0.24 -0.418 0.047 0.523 0.0236 1.3e-12
## 6 6 0.097 0.31 -0.486 0.071 0.748 0.0211 2.3e-12
## 7 7 0.155 0.23 -0.262 0.138 0.616 0.0786 2.5e-11
## 8 8 0.279 0.24 -0.136 0.257 0.785 0.2037 9.6e-08
## 9 9 -0.060 0.23 -0.544 -0.050 0.370 -0.0239 1.0e-12
## 10 10 -0.184 0.22 -0.665 -0.166 0.202 -0.1032 9.6e-10
## 11 11 0.113 0.29 -0.425 0.087 0.719 0.0294 2.5e-12
## 12 12 0.202 0.30 -0.309 0.165 0.843 0.0557 3.9e-09
## 13 13 0.020 0.26 -0.491 0.015 0.526 0.0043 3.2e-12
## 14 14 -0.058 0.26 -0.605 -0.048 0.429 -0.0208 2.2e-12
## 15 15 0.380 0.29 -0.089 0.350 0.990 0.2895 7.3e-07
## 16 16 -0.124 0.32 -0.860 -0.091 0.438 -0.0257 1.1e-11
## 17 17 -0.286 0.32 -1.028 -0.240 0.213 -0.0844 3.4e-09
## 18 18 -0.059 0.24 -0.553 -0.052 0.400 -0.0282 1.5e-12
## 19 19 -0.108 0.25 -0.645 -0.092 0.355 -0.0428 1.3e-12
## 20 20 0.123 0.24 -0.322 0.102 0.625 0.0434 2.8e-10
## 21 21 -0.083 0.30 -0.736 -0.063 0.467 -0.0208 4.1e-12
More details on df
This dataframe contains only numerical values, and all factors have been expanded. All covariates have a mean near zero and a standard deviation near 1 (i.e. 0.1<sd<10). They are also transformed to the appropriate scale, often with a log
.