library(dplyr)
library(gt)
library(broom)
Tobit regression
Libraries
General
Methodology specific
library(emmeans)
library(censReg)
library(survival)
library(VGAM)
Tobit model
Censoring occurs when data on the dependent variable is only partially known. For example, in virology, sample results could be below the lower limit of detection (eg, 100 copies/mL) and in such a case we only know that the sample result is <100 copies/mL, but we don’t know the exact value.
Let \(y^{*}\) be the the true underlying latent variable, and \(y\) the observed variable. We discuss here censoring on the left:
\[ y = \begin{cases} y^{*}, & y^{*} > \tau \\ \tau, & y^{*} \leq \tau \end{cases} \] We consider tobit regression with a censored normal distribution. The model equation is \[ y_{i}^{*} = X_{i}\beta + \epsilon_{i} \] with \(\epsilon_{i} \sim N(0,\sigma^2)\). But we only observe \(y = max(\tau, y^{*})\). The tobit model uses maximum likelihood estimation (for details see for example Breen, 1996). It is important to note that \(\beta\) estimates the effect of \(x\) on the latent variable \(y^{*}\), and not on the observed value \(y\).
Data used
We assume two equally sized groups (n=10 in each group). The data is censored on the left at a value of \(\tau=8.0\). In group A 4/10 records are censored, and 1/10 in group B.
= tribble(
dat_used ~ID, ~ARM, ~Y, ~CENS,
"001", "A", 8.0, 1,
"002", "A", 8.0, 1,
"003", "A", 8.0, 1,
"004", "A", 8.0, 1,
"005", "A", 8.9, 0,
"006", "A", 9.5, 0,
"007", "A", 9.9, 0,
"008", "A", 10.3, 0,
"009", "A", 11.0, 0,
"010", "A", 11.2, 0,
"011", "B", 8.0, 1,
"012", "B", 9.2, 0,
"013", "B", 9.9, 0,
"014", "B", 10.0, 0,
"015", "B", 10.6, 0,
"016", "B", 10.6, 0,
"017", "B", 11.3, 0,
"018", "B", 11.8, 0,
"019", "B", 12.9, 0,
"020", "B", 13.0, 0,
)
gt(dat_used)
ID | ARM | Y | CENS |
---|---|---|---|
001 | A | 8.0 | 1 |
002 | A | 8.0 | 1 |
003 | A | 8.0 | 1 |
004 | A | 8.0 | 1 |
005 | A | 8.9 | 0 |
006 | A | 9.5 | 0 |
007 | A | 9.9 | 0 |
008 | A | 10.3 | 0 |
009 | A | 11.0 | 0 |
010 | A | 11.2 | 0 |
011 | B | 8.0 | 1 |
012 | B | 9.2 | 0 |
013 | B | 9.9 | 0 |
014 | B | 10.0 | 0 |
015 | B | 10.6 | 0 |
016 | B | 10.6 | 0 |
017 | B | 11.3 | 0 |
018 | B | 11.8 | 0 |
019 | B | 12.9 | 0 |
020 | B | 13.0 | 0 |
Example Code using R
The analysis will be based on a Tobit analysis of variance with \(Y\), rounded to 1 decimal places, as dependent variable and study group as a fixed covariate. A normally distributed error term will be used. Values will be left censored at the value 8.0.
Several R functions and packages are presented.
censReg
The censReg
function from the censReg
package performs tobit models for left and right censored. The model is estimated by Maximum Likelihood (ML) assuming a Gaussian (normal) distribution of the error term. The maximization of the likelihood function is done by function maxLik
of the maxLik
package. The optimization method can be changed.
= censReg(Y ~ ARM,
res_censreg left = 8.0,
data = dat_used)
summary(res_censreg)
Call:
censReg(formula = Y ~ ARM, left = 8, data = dat_used)
Observations:
Total Left-censored Uncensored Right-censored
20 5 15 0
Coefficients:
Estimate Std. error t value Pr(> t)
(Intercept) 8.8323 0.5918 14.925 < 2e-16 ***
ARMB 1.8225 0.8061 2.261 0.02376 *
logSigma 0.5491 0.1947 2.819 0.00481 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Newton-Raphson maximisation, 4 iterations
Return code 1: gradient close to zero (gradtol)
Log-likelihood: -34.3154 on 3 Df
# Difference between groups (Wald CIs)
round(res_censreg$estimate[2], 3)
ARMB
1.823
round(stats::confint(res_censreg, level = 0.95)[2,], 3)
2.5 % 97.5 %
0.243 3.402
The output provides an estimate of difference between groups A and B (B-A), namely 1.8225 (se=0.8061). The presented p-value is a two-sided p-value based on the Z-test. The output also provides an estimate for \(log(\sigma) = 0.5491\). Wald based confidence intervals can be obtained by the stats::confint
function.
survreg
Using the survreg
function from the survival
package a tobit model can be fit. For more information, refer to the survival package.
= survreg(Surv(Y, 1-CENS, type="left") ~ ARM,
res_survreg dist = "gaussian",
data = dat_used)
summary(res_survreg)
Call:
survreg(formula = Surv(Y, 1 - CENS, type = "left") ~ ARM, data = dat_used,
dist = "gaussian")
Value Std. Error z p
(Intercept) 8.832 0.592 14.92 <2e-16
ARMB 1.823 0.806 2.26 0.0238
Log(scale) 0.549 0.195 2.82 0.0048
Scale= 1.73
Gaussian distribution
Loglik(model)= -34.3 Loglik(intercept only)= -36.7
Chisq= 4.72 on 1 degrees of freedom, p= 0.03
Number of Newton-Raphson Iterations: 4
n= 20
# Least square means by group
= emmeans(res_survreg, specs = trt.vs.ctrl ~ ARM)
lsm $emmeans lsm
ARM emmean SE df lower.CL upper.CL
A 8.83 0.592 17 7.58 10.1
B 10.65 0.552 17 9.49 11.8
Confidence level used: 0.95
# Difference between groups (Wald CIs)
= broom::tidy(lsm$contrasts, conf.int=TRUE, conf.level=0.95)
lsm_contrast gt(lsm_contrast) %>%
fmt_number(decimals = 3)
term | contrast | null.value | estimate | std.error | df | conf.low | conf.high | statistic | p.value |
---|---|---|---|---|---|---|---|---|---|
ARM | B - A | 0.000 | 1.823 | 0.806 | 17.000 | 0.122 | 3.523 | 2.261 | 0.037 |
The output provides an estimate of difference between groups A and B (B-A), namely 1.823 (se=0.806). The presented p-value is a two-sided p-value based on the Z-test. The output also provides an estimate for \(log(\sigma) = 0.549\). Using the emmeans
package/function least square means and contrast can be easily obtained. The confidence intervals and p-value is based on the t-test.
vglm
The VGAM
package provides functions for fitting vector generalized linear and additive models (VGLMs and VGAMs). This package centers on the iteratively reweighted least squares (IRLS) algorithm. The vglm
function offers the possibility to fit a tobit model.
= vglm(Y ~ ARM,
res_vglm tobit(Lower = 8.0),
data = dat_used)
summary(res_vglm)
Call:
vglm(formula = Y ~ ARM, family = tobit(Lower = 8), data = dat_used)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept):1 8.8323 0.5727 15.422 < 2e-16 ***
(Intercept):2 0.5491 0.1807 3.039 0.00238 **
ARMB 1.8226 0.7942 2.295 0.02173 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Names of linear predictors: mu, loglink(sd)
Log-likelihood: -34.3154 on 37 degrees of freedom
Number of Fisher scoring iterations: 7
No Hauck-Donner effect found in any of the estimates
# Difference between groups
round(res_vglm@coefficients[3], 3)
ARMB
1.823
round(confintvglm(res_vglm, level = 0.95)[3,], 3)
2.5 % 97.5 %
0.266 3.379
The output provides an estimate of difference between groups A and B (B-A), namely 1.823 (se=0.794). The presented p-value is a two-sided p-value based on the Z-test. Note that point estimate for the difference (and associated SE) are slightly different from the results obtained by censReg
and tobit
due to the difference in estimation procedure used. Wald based confidence intervals can be obtained by the confintvglm
function. The \((Intercept):2\) in the model output is an estimate for \(log(\sigma)\).
Discussion
The results from the censReg::censReg
and survival::survreg
are similar. The survival::survreg
allows for easy incorporation with the emmeans
package (note: be aware that the standard approach with emmeans is based on the t-test and not the Z-test).
The VGAM::vglm
approach provides slightly different results. This difference comes from the fact that a iteratively reweighted least squares (IRLS) algorithm is used for estimation.
Reference
Breen, R. (1996). Regression models. SAGE Publications, Inc., https://doi.org/10.4135/9781412985611
Tobin, James (1958). “Estimation of Relationships for Limited Dependent Variables”. Econometrica. 26 (1): 24-36. doi:10.2307/1907382
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.4.2 (2024-10-31)
os Ubuntu 24.04.1 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz UTC
date 2024-12-16
pandoc 3.4 @ /opt/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P backports 1.5.0 2024-05-23 [?] RSPM (R 4.4.0)
P bdsmatrix 1.3-7 2024-03-02 [?] RSPM (R 4.4.0)
P broom * 1.0.7 2024-09-26 [?] RSPM (R 4.4.0)
P censReg * 0.5-38 2024-05-20 [?] RSPM (R 4.4.0)
P cli 3.6.3 2024-06-21 [?] RSPM (R 4.4.0)
P coda 0.19-4.1 2024-01-31 [?] RSPM (R 4.4.0)
codetools 0.2-20 2024-03-31 [2] CRAN (R 4.4.2)
P collapse 2.0.18 2024-11-23 [?] RSPM (R 4.4.0)
P digest 0.6.37 2024-08-19 [?] RSPM (R 4.4.0)
P dplyr * 1.1.4 2023-11-17 [?] RSPM (R 4.4.0)
P emmeans * 1.10.4 2024-08-21 [?] RSPM (R 4.4.0)
P estimability 1.5.1 2024-05-12 [?] RSPM (R 4.4.0)
P evaluate 1.0.0 2024-09-17 [?] RSPM (R 4.4.0)
P fansi 1.0.6 2023-12-08 [?] RSPM (R 4.4.0)
P fastmap 1.2.0 2024-05-15 [?] RSPM (R 4.4.0)
P Formula 1.2-5 2023-02-24 [?] RSPM (R 4.4.0)
P generics 0.1.3 2022-07-05 [?] RSPM (R 4.4.0)
P glmmML 1.1.7 2024-09-20 [?] RSPM (R 4.4.0)
P glue 1.8.0 2024-09-30 [?] RSPM (R 4.4.0)
P gt * 0.11.1 2024-10-04 [?] RSPM (R 4.4.0)
P htmltools 0.5.8.1 2024-04-04 [?] RSPM (R 4.4.0)
P htmlwidgets 1.6.4 2023-12-06 [?] RSPM (R 4.4.0)
P jsonlite 1.8.9 2024-09-20 [?] RSPM (R 4.4.0)
P knitr 1.48 2024-07-07 [?] RSPM (R 4.4.0)
lattice 0.22-6 2024-03-20 [2] CRAN (R 4.4.2)
P lifecycle 1.0.4 2023-11-07 [?] RSPM (R 4.4.0)
P lmtest 0.9-40 2022-03-21 [?] RSPM (R 4.4.0)
P magrittr 2.0.3 2022-03-30 [?] RSPM (R 4.4.0)
MASS 7.3-61 2024-06-13 [2] CRAN (R 4.4.2)
Matrix 1.7-1 2024-10-18 [2] CRAN (R 4.4.2)
P maxLik * 1.5-2.1 2024-03-24 [?] RSPM (R 4.4.0)
P miscTools * 0.6-28 2023-05-03 [?] RSPM (R 4.4.0)
P multcomp 1.4-26 2024-07-18 [?] RSPM (R 4.4.0)
P mvtnorm 1.3-1 2024-09-03 [?] RSPM (R 4.4.0)
nlme 3.1-166 2024-08-14 [2] CRAN (R 4.4.2)
P pillar 1.9.0 2023-03-22 [?] RSPM (R 4.4.0)
P pkgconfig 2.0.3 2019-09-22 [?] RSPM (R 4.4.0)
P plm 2.6-4 2024-04-01 [?] RSPM (R 4.4.0)
P purrr 1.0.2 2023-08-10 [?] RSPM (R 4.4.0)
P R6 2.5.1 2021-08-19 [?] RSPM (R 4.4.0)
P rbibutils 2.3 2024-10-04 [?] RSPM (R 4.4.0)
P Rcpp 1.0.13 2024-07-17 [?] RSPM (R 4.4.0)
P Rdpack 2.6.1 2024-08-06 [?] RSPM (R 4.4.0)
renv 1.0.10 2024-10-05 [1] RSPM (R 4.4.0)
P rlang 1.1.4 2024-06-04 [?] RSPM (R 4.4.0)
P rmarkdown 2.28 2024-08-17 [?] RSPM (R 4.4.0)
P sandwich 3.1-1 2024-09-15 [?] RSPM (R 4.4.0)
P sass 0.4.9 2024-03-15 [?] RSPM (R 4.4.0)
P sessioninfo 1.2.2 2021-12-06 [?] RSPM (R 4.4.0)
survival * 3.7-0 2024-06-05 [2] CRAN (R 4.4.2)
P TH.data 1.1-2 2023-04-17 [?] RSPM (R 4.4.0)
P tibble 3.2.1 2023-03-20 [?] RSPM (R 4.4.0)
P tidyr 1.3.1 2024-01-24 [?] RSPM (R 4.4.0)
P tidyselect 1.2.1 2024-03-11 [?] RSPM (R 4.4.0)
P utf8 1.2.4 2023-10-22 [?] RSPM (R 4.4.0)
P vctrs 0.6.5 2023-12-01 [?] RSPM (R 4.4.0)
P VGAM * 1.1-12 2024-09-18 [?] RSPM (R 4.4.0)
P withr 3.0.1 2024-07-31 [?] RSPM (R 4.4.0)
P xfun 0.48 2024-10-03 [?] RSPM (R 4.4.0)
P xml2 1.3.6 2023-12-04 [?] RSPM (R 4.4.0)
P xtable 1.8-4 2019-04-21 [?] RSPM (R 4.4.0)
P yaml 2.3.10 2024-07-26 [?] RSPM (R 4.4.0)
P zoo 1.8-12 2023-04-13 [?] RSPM (R 4.4.0)
[1] /home/runner/work/CAMIS/CAMIS/renv/library/linux-ubuntu-noble/R-4.4/x86_64-pc-linux-gnu
[2] /opt/R/4.4.2/lib/R/library
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────