Quantile Regressions in Stan: Part II

stan
quantile
Author

Sean Pinkney

Published

October 18, 2022

Part I showed the basic Bayesian quantile regression using the asymmetric Laplace distribution and augmented scheme. In this post I’m going to show an approximate likelihood, score method proposed in Wu and Narisetty (2021).

First, why not use the asymmetric Laplace?

The ALD (asymmetric Laplace distribution) is great. It’s fast and gives point results comparable to the frequentist estimator. However, Yang, Wang, and He (2016) showed that the asymptotic variance of the posterior distribution is “incorrect”. It does not correspond to the frequentist interval. A reason is that the ALD and pinball loss function, \(\rho_\tau(u) = \frac{|u| + (2\tau - 1)u}{2}\), is maximized at the standard frequentist quantile but does not correspond to a true data generating likelihood. In the aforementioned paper, they propose a correction to the interval after sampling. I might show this in another post but post-processing is annoying. The score method asymptotically approaches the correct estimates with the correct interval.

Quantile regression with score

The score method includes the regressors into the objective function as \[ s_\tau(\beta) = \sum_{i=1}^{n} x_i \psi_\tau(y_i - x_i^{\top}\beta). \] where \(\psi_\tau(u) = \tau - I(u < 0)\).

Warning

Warning! It seems that \(\psi\) doesn’t work for me! What does work is using \(\rho_\tau = \frac{|u| + (2\tau - 1)u}{2}\) as before. They mention that the loss is the same but then use new notation indicating that it’s different. Anyway, this may be what they intended.

Wu and Narisetty (2021) consider the following “working” likelihood \[ \mathcal{L}(y \mid x,\, \beta) = C \exp\bigg(-\frac{1}{2n} s_\tau(\beta)^{\top} W s_\tau(\beta) \bigg), \] where \(W\) is a \(p \times p\) positive definite weight matrix, and C is a constant free of \(\beta\). The quadratic form and the given objective function implies that the quantile estimator is maximized at the typical quantile estimate.

They then choose a weight matrix which gives the correct posterior covariance as

\[ W = \frac{n}{\tau (1 - \tau)}\bigg(\sum_{i=1}^n x_i x_i^{\top} \bigg). \] In fact, they also show how this can incorporate multiple \(\tau\)’s (I’ll show this in part III) which allows more flexible modeling options, such as a squared exponential kernel that shares information across \(\beta\) quantile estimates which are near each other. The \(W\) matrix for multiple quantiles is given as

\[ W = (Q \otimes G )^{-1}, \text{ where } Q = (\min(\tau_i, \tau_j) - \tau_i \tau_j)_{ij}, \; G = \frac{1}{n}\sum_{i=1}^n x_i x_i^{\top}. \] A nice property of \(W\) is that it depends all on user input data so can be constructed outside of the MCMC iterations.

Score function

The following Stan model implements the single \(\tau\) code. I’ll simulate some data and compare to the asymmetric Laplace:

Show the code
functions{
  vector q_loss(real q, vector u){
    return 0.5 * (abs(u) + (2 * q - 1) * u);
  }
    
  vector score(real q, vector y, matrix x, vector beta) {
    return transpose(x) * q_loss(q, y - x * beta );
  }
    
  matrix make_w (matrix x, real q) {
    int N = rows(x);
    int P = cols(x);
    real alpha = 1 / (q * (1 - q));
    matrix[P, P] x_out = inverse_spd(crossprod(x));

    return alpha * x_out * N;
  }
}
data {
  int N;                 // Number of observation
  int P;                 // Number of predictors
  real q;
  vector[N] y;           // Response variable sorted
  matrix[N, P] x;
}
transformed data {
  matrix[P, P] W = make_w(x, q);
}
parameters {
  vector[P] beta;
}
model {
  vector[P] score_vec = score(q, y, x, beta);

  beta ~ normal(0, 4);

  target += -quad_form(W, score_vec) / (2 * N);
}

Let’s simulate some data and see how well it performs.

Show the code
set.seed(12312)
library(quantreg)
N     <- 1000
x     <- runif(N, max=10)
alpha <- -1
beta  <- 2
y     <- alpha + beta * x + rnorm(N, sd = .5 * x)
q     <- 0.05

# frequentist estimate
out_freq <- quantreg::rq(y ~ x, tau = q)

out_score <- score_qr$sample(
  data = list(N = N,
              P = 2,
              q = q,
              y = y,
              x = as.matrix(data.frame(alpha = 1, x = x))),
  seed = 12123123, 
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
    refresh = 0,
  show_messages = FALSE
)
Running MCMC with 4 parallel chains...

Chain 1 finished in 0.3 seconds.
Chain 2 finished in 0.3 seconds.
Chain 3 finished in 0.3 seconds.
Chain 4 finished in 0.3 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.3 seconds.
Total execution time: 0.5 seconds.
Show the code
out_asym_laplace <- asym_laplace$sample(
  data = list(N = N,
              P = 2,
              q = q,
              y = y,
              x = as.matrix(data.frame(alpha = 1, x = x))),
  seed = 12123123, 
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
    refresh = 0,
  show_messages = FALSE
)
Running MCMC with 4 parallel chains...

Chain 1 finished in 0.2 seconds.
Chain 2 finished in 0.2 seconds.
Chain 3 finished in 0.2 seconds.
Chain 4 finished in 0.2 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.2 seconds.
Total execution time: 0.4 seconds.

Score

Show the code
out_score$summary()
# A tibble: 3 × 10
  variable    mean  median     sd    mad      q5     q95  rhat ess_bulk ess_tail
  <chr>      <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <dbl>    <dbl>    <dbl>
1 lp__     -832.   -831.   0.965  0.636  -833.   -831.    1.00     657.     804.
2 beta[1]    -1.42   -1.41 0.164  0.165    -1.69   -1.15  1.01     438.     417.
3 beta[2]     1.30    1.30 0.0260 0.0261    1.26    1.34  1.01     444.     420.

Asymmetric Laplace

Show the code
out_asym_laplace$summary(c("beta", "sigma"))
# A tibble: 3 × 10
  variable   mean median      sd     mad     q5    q95  rhat ess_bulk ess_tail
  <chr>     <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 beta[1]  -1.07  -1.06  0.0811  0.0816  -1.21  -0.957  1.01     777.     862.
2 beta[2]   1.24   1.24  0.0181  0.0181   1.21   1.27   1.00     790.     809.
3 sigma     0.248  0.248 0.00825 0.00825  0.235  0.262  1.01     979.    1046.

Quantreg

Show the code
summary(out_freq, se = "boot")

Call: quantreg::rq(formula = y ~ x, tau = q)

tau: [1] 0.05

Coefficients:
            Value    Std. Error t value  Pr(>|t|)
(Intercept) -1.03935  0.11713   -8.87352  0.00000
x            1.23539  0.03639   33.95156  0.00000

Modified Score

We see that the frequentist estimate from quantreg and the asymmetric Laplace outputs are similar. The score method under-estimates the intercept where the credible interval doesn’t even include -1 and the coefficient for \(x\) seems to be over-estimated. Noticing this behavior, let’s see about adding sigma the standard deviation to the estimation.

Let’s re-estimate the above with the modification.

Show the code
functions{
  vector q_loss(real q, vector u){
    return 0.5 * (abs(u) + (2 * q - 1) * u);
  }
    
  vector score(real q, vector y, matrix x, vector beta) {
    return transpose(x) * q_loss(q, y - x * beta);
  }

  matrix make_w (matrix x, real q) {
    int N = rows(x);
    int P = cols(x);
    real alpha = 1 / (q * (1 - q));
    matrix[P, P] x_out = crossprod(x);

   return alpha * inverse_spd(x_out) * N;

  }
}
data {
  int N;                 // Number of observation
  int P;                 // Number of predictors
  real q;
  vector[N] y;           // Response variable sorted
  matrix[N, P] x;
}
transformed data {
  matrix[P, P] W = make_w(x, q);
}
parameters {
  vector[P] beta;
  real<lower=0> sigma;
}
model {
  vector[P] score_vec = score( q, y, x, beta ) / sigma ;

  beta ~ normal(0, 4);

  target += -quad_form(W, score_vec) / (2 * N) - N * log(sigma);
}
Show the code
out_score_modified <- score_qr_modified$sample(
  data = list(N = N,
              P = 2,
              q = q,
              y = y,
              x = as.matrix(data.frame(alpha = 1, x = x))),
  seed = 12123123, 
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
    refresh = 0,
  show_messages = FALSE
)
Running MCMC with 4 parallel chains...

Chain 1 finished in 0.4 seconds.
Chain 2 finished in 0.4 seconds.
Chain 3 finished in 0.4 seconds.
Chain 4 finished in 0.5 seconds.

All 4 chains finished successfully.
Mean chain execution time: 0.4 seconds.
Total execution time: 0.6 seconds.
Show the code
out_score_modified$summary() 
# A tibble: 4 × 10
  variable    mean  median     sd    mad      q5     q95  rhat ess_bulk ess_tail
  <chr>      <dbl>   <dbl>  <dbl>  <dbl>   <dbl>   <dbl> <dbl>    <dbl>    <dbl>
1 lp__     -755.   -755.   1.25   1.02   -758.   -754.    1.01     717.     972.
2 beta[1]    -1.42   -1.40 0.222  0.231    -1.80   -1.08  1.00     631.     784.
3 beta[2]     1.30    1.30 0.0346 0.0359    1.25    1.36  1.00     639.     685.
4 sigma       1.29    1.29 0.0303 0.0302    1.24    1.34  1.01     793.     873.

The estimates are basically the same but the standard errors increased. We’ll see that the increased in standard errors makes more reasonable posterior intervals.

Simulated data with N = 10000

Score

Show the code
out_score$summary()
# A tibble: 3 × 10
  variable      mean    median      sd     mad        q5      q95  rhat ess_bulk
  <chr>        <dbl>     <dbl>   <dbl>   <dbl>     <dbl>    <dbl> <dbl>    <dbl>
1 lp__     -9621.    -9620.    1.05    0.815   -9623.    -9.62e+3  1.01     716.
2 beta[1]     -0.915    -0.913 0.0367  0.0353     -0.976 -8.56e-1  1.00     501.
3 beta[2]      1.15      1.15  0.00684 0.00655     1.14   1.16e+0  1.01     477.
# … with 1 more variable: ess_tail <dbl>

Modified Score

Show the code
out_score_modified$summary()
# A tibble: 4 × 10
  variable      mean    median      sd     mad       q5       q95  rhat ess_bulk
  <chr>        <dbl>     <dbl>   <dbl>   <dbl>    <dbl>     <dbl> <dbl>    <dbl>
1 lp__     -8273.    -8273.    1.23    1.02    -8275.   -8272.     1.00     911.
2 beta[1]     -0.922    -0.917 0.0515  0.0499     -1.02    -0.845  1.00     615.
3 beta[2]      1.15      1.15  0.00920 0.00930     1.14     1.17   1.00     695.
4 sigma        1.39      1.39  0.00997 0.0102      1.37     1.40   1.00    1264.
# … with 1 more variable: ess_tail <dbl>

Quantreg

Show the code
print(summary(out_freq, se = "iid"))

Call: quantreg::rq(formula = y ~ x, tau = q)

tau: [1] 0.05

Coefficients:
            Value     Std. Error t value   Pr(>|t|) 
(Intercept)  -0.99120   0.02167  -45.74656   0.00000
x             1.16937   0.00374  312.48513   0.00000

Asymmetric Laplace

Show the code
out_asym_laplace$summary("beta")
# A tibble: 2 × 10
  variable   mean median      sd     mad    q5    q95  rhat ess_bulk ess_tail
  <chr>     <dbl>  <dbl>   <dbl>   <dbl> <dbl>  <dbl> <dbl>    <dbl>    <dbl>
1 beta[1]  -0.997 -0.995 0.0147  0.0138  -1.02 -0.974  1.00    1332.    1112.
2 beta[2]   1.17   1.17  0.00593 0.00581  1.16  1.18   1.00    1273.    1336.

It does appear to be better. Let’s see how this does on the ImmunogG data.

Test on ImmunogG data

We’ll use the same quantile of \(5\%\) to test.

Show the code
library(Brq)
data("ImmunogG")
q <- 0.05
dat <- data.frame(y = ImmunogG$IgG, 
                  alpha = 1, 
                  x = ImmunogG$Age, 
                  xsq = ImmunogG$Age^2)

out_score <- score_qr$sample(
  data = list(N = nrow(ImmunogG),
              P = 3,
              q = q,
              y = ImmunogG$IgG,
              x = as.matrix(dat[, 2:4])),
  seed = 12123123, 
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
    refresh = 0,
  show_messages = FALSE
)

out_score_mod <- score_qr_modified$sample(
  data = list(N = nrow(ImmunogG),
              P = 3,
              q = q,
              y = ImmunogG$IgG,
              x = as.matrix(dat[, 2:4])),
  seed = 12123123, 
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
    refresh = 0,
  show_messages = FALSE
)

out_al <- asym_laplace$sample(
  data = list(N = nrow(ImmunogG),
              P = 3,
              q = q,
              y = ImmunogG$IgG,
              x = as.matrix(dat[, 2:4])),
  seed = 12123123, 
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
    refresh = 0,
  show_messages = FALSE
)

out_freq <- rq(y ~ x + xsq, 
               data = dat,
               tau = q)

Results

Orginal score:

Show the code
out_score$summary()
# A tibble: 4 × 10
  variable    mean  median     sd    mad      q5      q95  rhat ess_bulk
  <chr>      <dbl>   <dbl>  <dbl>  <dbl>   <dbl>    <dbl> <dbl>    <dbl>
1 lp__     -89.4   -89.1   1.23   1.07   -91.9   -88.0     1.01     483.
2 beta[1]    0.670   0.702 0.301  0.307    0.135   1.12    1.02     283.
3 beta[2]    1.21    1.20  0.244  0.238    0.803   1.60    1.02     256.
4 beta[3]   -0.138  -0.137 0.0395 0.0378  -0.201  -0.0733  1.01     268.
# … with 1 more variable: ess_tail <dbl>

Modified score:

Show the code
out_score_mod$summary()
# A tibble: 5 × 10
  variable    mean  median     sd    mad      q5      q95  rhat ess_bulk
  <chr>      <dbl>   <dbl>  <dbl>  <dbl>   <dbl>    <dbl> <dbl>    <dbl>
1 lp__     -72.6   -72.2   1.55   1.38   -75.6   -70.7     1.01     603.
2 beta[1]    0.686   0.709 0.249  0.254    0.254   1.06    1.02     334.
3 beta[2]    1.21    1.20  0.198  0.192    0.878   1.52    1.02     274.
4 beta[3]   -0.137  -0.137 0.0312 0.0297  -0.188  -0.0852  1.02     289.
5 sigma      0.775   0.775 0.0324 0.0324   0.724   0.830   1.00     760.
# … with 1 more variable: ess_tail <dbl>

Quantreg:

Show the code
summary(out_freq, se = "iid")

Call: rq(formula = y ~ x + xsq, tau = q, data = dat)

tau: [1] 0.05

Coefficients:
            Value    Std. Error t value  Pr(>|t|)
(Intercept)  0.76514  0.30477    2.51058  0.01259
x            1.19143  0.24792    4.80577  0.00000
xsq         -0.13371  0.04037   -3.31187  0.00104

Asymmetric Laplace:

Show the code
out_al$summary()
# A tibble: 5 × 10
  variable     mean   median      sd     mad       q5      q95  rhat ess_bulk
  <chr>       <dbl>    <dbl>   <dbl>   <dbl>    <dbl>    <dbl> <dbl>    <dbl>
1 lp__     -670.    -670.    1.50    1.24    -673.    -669.     1.01     585.
2 beta[1]     0.587    0.601 0.246   0.271      0.179    0.962  1.02     385.
3 beta[2]     1.31     1.31  0.199   0.210      1.01     1.64   1.02     336.
4 beta[3]    -0.156   -0.155 0.0338  0.0333    -0.212   -0.104  1.01     373.
5 sigma       0.165    0.165 0.00958 0.00897    0.149    0.181  1.01     645.
# … with 1 more variable: ess_tail <dbl>

Interestingly, the results of the score method lie between the frequentist and asymmetric Laplace methods.

Conclusion

The score method is promising but it doesn’t seem like one would use the asymmetric Laplace due to the ad-hoc nature of the intercept. One thing the score method has is incorporating multipled quantiles in a principled way that allows you to put priors on the estimates that correspond to linking them. It also doesn’t need to re-run the data through each quantile. In the next post, part III, I’ll show this. If you have a bunch of different quantiles, lots of data, and want to understand the correlations across parameters then the score method may be right for you.

References

Wu, Teng, and Naveen N. Narisetty. 2021. Bayesian Multiple Quantile Regression for Linear Models Using a Score Likelihood.” Bayesian Analysis 16 (3): 875–903. https://doi.org/10.1214/20-BA1217.
Yang, Yunwen, Huixia Judy Wang, and Xuming He. 2016. “Posterior Inference in Bayesian Quantile Regression with Asymmetric Laplace Likelihood.” International Statistical Review 84 (3): 327–44. https://doi.org/https://doi.org/10.1111/insr.12114.