Tobit regression

Libraries

General

library(dplyr)
library(gt)
library(broom)

Methodology specific

library(emmeans)
library(censReg)
library(survival)
library(VGAM)

Tobit model

Censoring occurs when data on the dependent variable is only partially known. For example, in virology, sample results could be below the lower limit of detection (eg, 100 copies/mL) and in such a case we only know that the sample result is <100 copies/mL, but we don’t know the exact value.

Let \(y^{*}\) be the the true underlying latent variable, and \(y\) the observed variable. We discuss here censoring on the left:

\[ y = \begin{cases} y^{*}, & y^{*} > \tau \\ \tau, & y^{*} \leq \tau \end{cases} \] We consider tobit regression with a censored normal distribution. The model equation is \[ y_{i}^{*} = X_{i}\beta + \epsilon_{i} \] with \(\epsilon_{i} \sim N(0,\sigma^2)\). But we only observe \(y = max(\tau, y^{*})\). The tobit model uses maximum likelihood estimation (for details see for example Breen, 1996). It is important to note that \(\beta\) estimates the effect of \(x\) on the latent variable \(y^{*}\), and not on the observed value \(y\).

Data used

We assume two equally sized groups (n=10 in each group). The data is censored on the left at a value of \(\tau=8.0\). In group A 4/10 records are censored, and 1/10 in group B.

dat_used = tribble(
  ~ID, ~ARM, ~Y, ~CENS,
  "001", "A", 8.0, 1, 
  "002", "A", 8.0, 1,
  "003", "A", 8.0, 1,
  "004", "A", 8.0, 1,
  "005", "A", 8.9, 0,
  "006", "A", 9.5, 0,
  "007", "A", 9.9, 0,
  "008", "A", 10.3, 0,
  "009", "A", 11.0, 0,
  "010", "A", 11.2, 0,
  "011", "B", 8.0, 1, 
  "012", "B", 9.2, 0,
  "013", "B", 9.9, 0,
  "014", "B", 10.0, 0,
  "015", "B", 10.6, 0,
  "016", "B", 10.6, 0,
  "017", "B", 11.3, 0,
  "018", "B", 11.8, 0,
  "019", "B", 12.9, 0,
  "020", "B", 13.0, 0,
)

gt(dat_used)

ID	ARM	Y	CENS
001	A	8.0	1
002	A	8.0	1
003	A	8.0	1
004	A	8.0	1
005	A	8.9	0
006	A	9.5	0
007	A	9.9	0
008	A	10.3	0
009	A	11.0	0
010	A	11.2	0
011	B	8.0	1
012	B	9.2	0
013	B	9.9	0
014	B	10.0	0
015	B	10.6	0
016	B	10.6	0
017	B	11.3	0
018	B	11.8	0
019	B	12.9	0
020	B	13.0	0

Example Code using R

The analysis will be based on a Tobit analysis of variance with \(Y\), rounded to 1 decimal places, as dependent variable and study group as a fixed covariate. A normally distributed error term will be used. Values will be left censored at the value 8.0.

Several R functions and packages are presented.

censReg

The censReg function from the censReg package performs tobit models for left and right censored. The model is estimated by Maximum Likelihood (ML) assuming a Gaussian (normal) distribution of the error term. The maximization of the likelihood function is done by function maxLik of the maxLik package. The optimization method can be changed.

res_censreg = censReg(Y ~ ARM, 
                      left = 8.0, 
                      data = dat_used)
summary(res_censreg)


Call:
censReg(formula = Y ~ ARM, left = 8, data = dat_used)

Observations:
         Total  Left-censored     Uncensored Right-censored 
            20              5             15              0 

Coefficients:
            Estimate Std. error t value Pr(> t)    
(Intercept)   8.8323     0.5918  14.925 < 2e-16 ***
ARMB          1.8225     0.8061   2.261 0.02376 *  
logSigma      0.5491     0.1947   2.819 0.00481 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Newton-Raphson maximisation, 4 iterations
Return code 1: gradient close to zero (gradtol)
Log-likelihood: -34.3154 on 3 Df

# Difference between groups (Wald CIs)
round(res_censreg$estimate[2], 3)

 ARMB 
1.823

round(stats::confint(res_censreg, level = 0.95)[2,], 3)

 2.5 % 97.5 % 
 0.243  3.402

The output provides an estimate of difference between groups A and B (B-A), namely 1.8225 (se=0.8061). The presented p-value is a two-sided p-value based on the Z-test. The output also provides an estimate for \(log(\sigma) = 0.5491\). Wald based confidence intervals can be obtained by the stats::confint function.

survreg

Using the survreg function from the survival package a tobit model can be fit. For more information, refer to the survival package.

res_survreg = survreg(Surv(Y, 1-CENS, type="left") ~ ARM,
                      dist = "gaussian",
                      data = dat_used)
summary(res_survreg)


Call:
survreg(formula = Surv(Y, 1 - CENS, type = "left") ~ ARM, data = dat_used, 
    dist = "gaussian")
            Value Std. Error     z      p
(Intercept) 8.832      0.592 14.92 <2e-16
ARMB        1.823      0.806  2.26 0.0238
Log(scale)  0.549      0.195  2.82 0.0048

Scale= 1.73 

Gaussian distribution
Loglik(model)= -34.3   Loglik(intercept only)= -36.7
    Chisq= 4.72 on 1 degrees of freedom, p= 0.03 
Number of Newton-Raphson Iterations: 4 
n= 20

# Least square means by group
lsm = emmeans(res_survreg, specs = trt.vs.ctrl ~ ARM)
lsm$emmeans

 ARM emmean    SE df lower.CL upper.CL
 A     8.83 0.592 17     7.58     10.1
 B    10.65 0.552 17     9.49     11.8

Confidence level used: 0.95

# Difference between groups
lsm_contrast = broom::tidy(lsm$contrasts, conf.int=TRUE, conf.level=0.95)
gt(lsm_contrast) %>%
  fmt_number(decimals = 3)

term	contrast	null.value	estimate	std.error	df	conf.low	conf.high	statistic	p.value
ARM	B - A	0.000	1.823	0.806	17.000	0.122	3.523	2.261	0.037

# Wald-based CIs
round(stats::confint(res_survreg, level = 0.95)[2,], 3)

 2.5 % 97.5 % 
 0.243  3.402

The output provides an estimate of difference between groups A and B (B-A), namely 1.823 (se=0.806). The presented p-value is a two-sided p-value based on the Z-test. The output also provides an estimate for \(log(\sigma) = 0.549\). Using the emmeans package/function least square means and contrast can be easily obtained. The confidence intervals and p-value is based on the t-test using emmeans. Wald based confidence intervals can be obtained by the stats::confint function.

vglm

The VGAM package provides functions for fitting vector generalized linear and additive models (VGLMs and VGAMs). This package centers on the iteratively reweighted least squares (IRLS) algorithm. The vglm function offers the possibility to fit a tobit model.

res_vglm = vglm(Y ~ ARM,
                tobit(Lower = 8.0),
                data = dat_used)
summary(res_vglm)

Call:
vglm(formula = Y ~ ARM, family = tobit(Lower = 8), data = dat_used)

Coefficients: 
              Estimate Std. Error z value Pr(>|z|)    
(Intercept):1   8.8323     0.5727  15.422  < 2e-16 ***
(Intercept):2   0.5491     0.1807   3.039  0.00238 ** 
ARMB            1.8226     0.7942   2.295  0.02173 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Names of linear predictors: mu, loglink(sd)

Log-likelihood: -34.3154 on 37 degrees of freedom

Number of Fisher scoring iterations: 7 

No Hauck-Donner effect found in any of the estimates

# Difference between groups
round(res_vglm@coefficients[3], 3)

 ARMB 
1.823

round(confintvglm(res_vglm, level = 0.95)[3,], 3)

 2.5 % 97.5 % 
 0.266  3.379

The output provides an estimate of difference between groups A and B (B-A), namely 1.823 (se=0.794). The presented p-value is a two-sided p-value based on the Z-test. Note that point estimate for the difference (and associated SE) are slightly different from the results obtained by censReg and tobit due to the difference in estimation procedure used. Wald based confidence intervals can be obtained by the confintvglm function. The \((Intercept):2\) in the model output is an estimate for \(log(\sigma)\).

Discussion

The results from the censReg::censReg and survival::survreg are similar. The survival::survreg allows for easy incorporation with the emmeans package (note: be aware that the standard approach with emmeans is based on the t-test and not the Z-test).

The VGAM::vglm approach provides slightly different results. This difference comes from the fact that a iteratively reweighted least squares (IRLS) algorithm is used for estimation.

Reference

Breen, R. (1996). Regression models. SAGE Publications, Inc., https://doi.org/10.4135/9781412985611

Tobin, James (1958). “Estimation of Relationships for Limited Dependent Variables”. Econometrica. 26 (1): 24-36. doi:10.2307/1907382

Session Info

─ Session info ───────────────────────────────────────────────────────────────
 setting  value
 version  R version 4.4.2 (2024-10-31)
 os       macOS Sequoia 15.3.2
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       Europe/London
 date     2025-03-19
 pandoc   3.2 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/aarch64/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────
 ! package      * version  date (UTC) lib source
 P backports      1.5.0    2024-05-23 [?] CRAN (R 4.4.0)
 P bdsmatrix      1.3-7    2024-03-02 [?] CRAN (R 4.4.0)
 P BiocManager    1.30.25  2024-08-28 [?] CRAN (R 4.4.1)
 P broom        * 1.0.7    2024-09-26 [?] CRAN (R 4.4.1)
 P censReg      * 0.5-38   2024-05-20 [?] CRAN (R 4.4.0)
 P cli            3.6.3    2024-06-21 [?] CRAN (R 4.4.0)
 P coda           0.19-4.1 2024-01-31 [?] CRAN (R 4.4.0)
   codetools      0.2-20   2024-03-31 [2] CRAN (R 4.4.2)
 P collapse       2.0.18   2024-11-23 [?] CRAN (R 4.4.1)
 P digest         0.6.37   2024-08-19 [?] CRAN (R 4.4.1)
 P dplyr        * 1.1.4    2023-11-17 [?] CRAN (R 4.4.0)
 P emmeans      * 1.10.4   2024-08-21 [?] CRAN (R 4.4.1)
 P estimability   1.5.1    2024-05-12 [?] CRAN (R 4.4.0)
 P evaluate       1.0.0    2024-09-17 [?] CRAN (R 4.4.1)
 P fansi          1.0.6    2023-12-08 [?] CRAN (R 4.4.0)
 P fastmap        1.2.0    2024-05-15 [?] CRAN (R 4.4.0)
 P Formula        1.2-5    2023-02-24 [?] CRAN (R 4.4.0)
 P generics       0.1.3    2022-07-05 [?] CRAN (R 4.4.0)
 P glmmML         1.1.7    2024-09-20 [?] CRAN (R 4.4.1)
 P glue           1.8.0    2024-09-30 [?] CRAN (R 4.4.1)
 P gt           * 0.11.1   2024-10-04 [?] CRAN (R 4.4.1)
 P htmltools      0.5.8.1  2024-04-04 [?] CRAN (R 4.4.0)
 P htmlwidgets    1.6.4    2023-12-06 [?] CRAN (R 4.4.0)
 P jsonlite       1.8.9    2024-09-20 [?] CRAN (R 4.4.1)
 P knitr          1.48     2024-07-07 [?] CRAN (R 4.4.0)
   lattice        0.22-6   2024-03-20 [2] CRAN (R 4.4.2)
 P lifecycle      1.0.4    2023-11-07 [?] CRAN (R 4.4.0)
 P lmtest         0.9-40   2022-03-21 [?] CRAN (R 4.4.0)
 P magrittr       2.0.3    2022-03-30 [?] CRAN (R 4.4.0)
   MASS           7.3-61   2024-06-13 [2] CRAN (R 4.4.2)
   Matrix         1.7-1    2024-10-18 [2] CRAN (R 4.4.2)
 P maxLik       * 1.5-2.1  2024-03-24 [?] CRAN (R 4.4.0)
 P miscTools    * 0.6-28   2023-05-03 [?] CRAN (R 4.4.0)
 P multcomp       1.4-26   2024-07-18 [?] CRAN (R 4.4.0)
 P mvtnorm        1.3-1    2024-09-03 [?] CRAN (R 4.4.1)
   nlme           3.1-166  2024-08-14 [2] CRAN (R 4.4.2)
 P pillar         1.9.0    2023-03-22 [?] CRAN (R 4.4.0)
 P pkgconfig      2.0.3    2019-09-22 [?] CRAN (R 4.4.0)
 P plm            2.6-4    2024-04-01 [?] CRAN (R 4.4.0)
 P purrr          1.0.2    2023-08-10 [?] CRAN (R 4.4.0)
 P R6             2.5.1    2021-08-19 [?] CRAN (R 4.4.0)
 P rbibutils      2.3      2024-10-04 [?] CRAN (R 4.4.1)
 P Rcpp           1.0.13   2024-07-17 [?] CRAN (R 4.4.0)
 P Rdpack         2.6.1    2024-08-06 [?] CRAN (R 4.4.0)
   renv           1.0.10   2024-10-05 [1] CRAN (R 4.4.1)
 P rlang          1.1.4    2024-06-04 [?] CRAN (R 4.4.0)
 P rmarkdown      2.28     2024-08-17 [?] CRAN (R 4.4.0)
 P rstudioapi     0.16.0   2024-03-24 [?] CRAN (R 4.4.0)
 P sandwich       3.1-1    2024-09-15 [?] CRAN (R 4.4.1)
 P sass           0.4.9    2024-03-15 [?] CRAN (R 4.4.0)
 P sessioninfo    1.2.2    2021-12-06 [?] CRAN (R 4.4.0)
 P survival     * 3.7-0    2024-06-05 [?] CRAN (R 4.4.0)
 P TH.data        1.1-2    2023-04-17 [?] CRAN (R 4.4.0)
 P tibble         3.2.1    2023-03-20 [?] CRAN (R 4.4.0)
 P tidyr          1.3.1    2024-01-24 [?] CRAN (R 4.4.0)
 P tidyselect     1.2.1    2024-03-11 [?] CRAN (R 4.4.0)
 P utf8           1.2.4    2023-10-22 [?] CRAN (R 4.4.0)
 P vctrs          0.6.5    2023-12-01 [?] CRAN (R 4.4.0)
 P VGAM         * 1.1-12   2024-09-18 [?] CRAN (R 4.4.1)
 P withr          3.0.1    2024-07-31 [?] CRAN (R 4.4.0)
 P xfun           0.48     2024-10-03 [?] CRAN (R 4.4.1)
 P xml2           1.3.6    2023-12-04 [?] CRAN (R 4.4.0)
 P xtable         1.8-4    2019-04-21 [?] CRAN (R 4.4.0)
 P yaml           2.3.10   2024-07-26 [?] CRAN (R 4.4.0)
 P zoo            1.8-12   2023-04-13 [?] CRAN (R 4.4.0)

 [1] /Users/christinafillmore/Documents/GitHub/CAMIS/renv/library/macos/R-4.4/aarch64-apple-darwin20
 [2] /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library

 P ── Loaded and on-disk path mismatch.

──────────────────────────────────────────────────────────────────────────────