Sunday, January 22, 2017

BTopic102

Analysing the Seeds data in INLA with a GAM

About

This is a detailed example of a good model for a simple dataset called Seeds. The goal is to show a simple example of INLA. The unique identifier is BTopic102.
For an overview of all the related topics see [todo] http://r-inla.org/nourl. For an overview of the words and concepts see http://r-inla.org/nourl.

0. Initialise R session

We load any packages and set global options.
library(INLA); rm(list=ls())
options(width=70, digits=2)

1. Load data, rename, rescale

data(Seeds); dat = Seeds
This dat is the orginial dataframe. We will keep dat unchanged, and create another object (df) later. Next we explain the data.
?Seeds
head(dat)
##    r  n x1 x2 plate
## 1 10 39  0  0     1
## 2 23 62  0  0     2
## 3 23 81  0  0     3
## 4 26 51  0  0     4
## 5 17 39  0  0     5
## 6  5  6  0  1     6
# - r is the number of seed germinated (successes)
# - n is the number of seeds attempted (trials)
# - x1 is the type of seed
# - x2 is the type of root extract
# - plate is the numbering of the plates/experiments
All the covariates are factors in this case, as the numbering of the plates are arbitrary. Therefore we do not re-scale any covariates. The observations are integers, and we do not rescale these either.
df = data.frame(y = dat$r, Ntrials = dat$n, dat[, 3:5])
I always name the dataframe that is going to be used in the inference to ‘df’, keeping ‘dat’ as the original dataframe. The observations are always named ‘y’.

2. Explore data

summary(df)
##        y         Ntrials         x1             x2           plate   
##  Min.   : 0   Min.   : 4   Min.   :0.00   Min.   :0.00   Min.   : 1  
##  1st Qu.: 8   1st Qu.:16   1st Qu.:0.00   1st Qu.:0.00   1st Qu.: 6  
##  Median :17   Median :39   Median :0.00   Median :1.00   Median :11  
##  Mean   :20   Mean   :40   Mean   :0.48   Mean   :0.52   Mean   :11  
##  3rd Qu.:26   3rd Qu.:51   3rd Qu.:1.00   3rd Qu.:1.00   3rd Qu.:16  
##  Max.   :55   Max.   :81   Max.   :1.00   Max.   :1.00   Max.   :21
table(df$x1)
## 
##  0  1 
## 11 10
table(df$x2)
## 
##  0  1 
## 10 11
This checks whether any covariates or factors are still out of scale, or senseless.
plot(df)

This looks good. This scatterplot of the data can highlight outliers and other issues.

3. Observation Likelihood

family1 = "binomial"
control.family1 = list(control.link=list(model="logit"))
This specifies which likelihood we are going to use for the observations. The binomial distribution is defined with a certain number of trials, in INLA known as Ntrials. In control.family we would also specify the priors for any hyper-parameters.

4. Formula

hyper1 = list(theta = list(prior="pc.prec", param=c(1,0.01)))
formula1 = y ~ x1 + x2 + f(plate, model="iid", hyper=hyper1)
This specifies the formula, and the priors for any hyper-parameters in the random effects.

5. Call INLA

res1 = inla(formula=formula1, data=df, 
            family=family1, Ntrials=Ntrials, 
            control.family=control.family1)
This is the inla-call, where we just collect all the variables as defined above.

6. Look at results

summary(res1)
## 
## Call:
## c("inla(formula = formula1, family = family1, data = df, Ntrials = Ntrials, ",  "    control.family = control.family1)")
## 
## Time used:
##  Pre-processing    Running inla Post-processing           Total 
##           0.989           0.146           0.065           1.200 
## 
## Fixed effects:
##              mean   sd 0.025quant 0.5quant 0.97quant  mode kld
## (Intercept) -0.39 0.18      -0.73    -0.39    -0.046 -0.40   0
## x1          -0.35 0.23      -0.82    -0.35     0.054 -0.33   0
## x2           1.03 0.22       0.60     1.04     1.440  1.04   0
## 
## Random effects:
## Name   Model
##  plate   IID model 
## 
## Model hyperparameters:
##                      mean    sd 0.025quant 0.5quant 0.97quant  mode
## Precision for plate 19.62 30.37      3.085    10.69     94.01 6.048
## 
## Expected number of effective parameters(std dev): 9.895(2.915)
## Number of equivalent replicates : 2.122 
## 
## Marginal log-Likelihood:  -68.44
This summary shows many of the results, notably not including the random effects (the parameters are not shown).
#str(res1, 1)
This command, if run, shows one step in the list-of-lists hierarchy that the inla result is.

7. Look at the random effect

res1$summary.random$plate
##    ID   mean   sd 0.025quant 0.5quant 0.97quant    mode     kld
## 1   1 -0.282 0.27     -0.894   -0.251     0.151 -0.1531 3.6e-09
## 2   2 -0.079 0.22     -0.557   -0.066     0.330 -0.0309 2.0e-12
## 3   3 -0.311 0.24     -0.842   -0.289     0.083 -0.2450 2.6e-07
## 4   4  0.213 0.24     -0.213    0.193     0.706  0.1199 3.8e-10
## 5   5  0.054 0.24     -0.418    0.047     0.523  0.0236 1.3e-12
## 6   6  0.097 0.31     -0.486    0.071     0.748  0.0211 2.3e-12
## 7   7  0.155 0.23     -0.262    0.138     0.616  0.0786 2.5e-11
## 8   8  0.279 0.24     -0.136    0.257     0.785  0.2037 9.6e-08
## 9   9 -0.060 0.23     -0.544   -0.050     0.370 -0.0239 1.0e-12
## 10 10 -0.184 0.22     -0.665   -0.166     0.202 -0.1032 9.6e-10
## 11 11  0.113 0.29     -0.425    0.087     0.719  0.0294 2.5e-12
## 12 12  0.202 0.30     -0.309    0.165     0.843  0.0557 3.9e-09
## 13 13  0.020 0.26     -0.491    0.015     0.526  0.0043 3.2e-12
## 14 14 -0.058 0.26     -0.605   -0.048     0.429 -0.0208 2.2e-12
## 15 15  0.380 0.29     -0.089    0.350     0.990  0.2895 7.3e-07
## 16 16 -0.124 0.32     -0.860   -0.091     0.438 -0.0257 1.1e-11
## 17 17 -0.286 0.32     -1.028   -0.240     0.213 -0.0844 3.4e-09
## 18 18 -0.059 0.24     -0.553   -0.052     0.400 -0.0282 1.5e-12
## 19 19 -0.108 0.25     -0.645   -0.092     0.355 -0.0428 1.3e-12
## 20 20  0.123 0.24     -0.322    0.102     0.625  0.0434 2.8e-10
## 21 21 -0.083 0.30     -0.736   -0.063     0.467 -0.0208 4.1e-12

More details on df

This dataframe contains only numerical values, and all factors have been expanded. All covariates have a mean near zero and a standard deviation near 1 (i.e. 0.1<sd<10). They are also transformed to the appropriate scale, often with a log.

Old code:

The remaining is outdated! This is a test using R Markdown see http://rmarkdown.rstudio.com. And \(x = 1+1\)