Part I showed basic Bayesian quantile regression using the asymmetric Laplace distribution and its augmented scheme. In this post I’m going to show the approximate-likelihood score method proposed in Wu and Narisetty (2021).
First, why not use the asymmetric Laplace?
The ALD (asymmetric Laplace distribution) is great. It’s fast and gives point estimates comparable to the frequentist estimator. However, Yang, Wang, and He (2016) showed that the asymptotic variance of the posterior distribution is “incorrect”: it does not correspond to the frequentist interval. One reason is that the ALD likelihood, built from the pinball loss \(\rho_\tau(u) = \frac{|u| + (2\tau - 1)u}{2}\), is maximized exactly at the standard frequentist quantile estimate but does not correspond to a true data-generating likelihood. In the aforementioned paper they propose a correction to the interval after sampling. I might show this in another post, but post-processing is annoying. The score method asymptotically yields the correct estimates with the correct interval.
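As a quick numerical check (my own sketch, not from either paper), the empirical pinball loss is minimized at the sample \(\tau\)-quantile:
Show the code
# pinball (check) loss
rho <- function(u, tau) 0.5 * (abs(u) + (2 * tau - 1) * u)

set.seed(1)
z <- rnorm(500)
tau <- 0.05
# the minimizer of the empirical loss approximately matches the sample quantile
opt <- optimize(function(m) sum(rho(z - m, tau)), interval = range(z))
c(minimizer = opt$minimum, sample_quantile = unname(quantile(z, tau)))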
Quantile regression with score
The score method builds the regressors into the objective function through the score \[ s_\tau(\beta) = \sum_{i=1}^{n} x_i \psi_\tau(y_i - x_i^{\top}\beta), \] where \(\psi_\tau(u) = \tau - I(u < 0)\).
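As a minimal sketch (my own R translation of this definition, not code from the post or the paper):
Show the code
# psi_tau(u) = tau - 1(u < 0), applied elementwise
psi <- function(u, tau) tau - (u < 0)

# s_tau(beta) = sum_i x_i * psi_tau(y_i - x_i' beta) = X' psi(y - X beta)
score_fn <- function(beta, y, X, tau) {
  drop(crossprod(X, psi(y - X %*% beta, tau)))
}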
Wu and Narisetty (2021) consider the following “working” likelihood \[ \mathcal{L}(y \mid x,\, \beta) = C \exp\bigg(-\frac{1}{2n} s_\tau(\beta)^{\top} W s_\tau(\beta) \bigg), \] where \(W\) is a \(p \times p\) positive definite weight matrix and \(C\) is a constant free of \(\beta\). Since the quadratic form is non-negative and equals zero when the score is zero, this working likelihood is maximized at the usual frequentist quantile regression estimate.
They then choose a weight matrix which gives the correct posterior covariance as
\[ W = \frac{n}{\tau (1 - \tau)}\bigg(\sum_{i=1}^n x_i x_i^{\top} \bigg)^{-1}. \] In fact, they also show how this can incorporate multiple \(\tau\)’s (I’ll show this in part III), which allows more flexible modeling options, such as a squared exponential kernel that shares information across the \(\beta\) estimates of nearby quantiles. The \(W\) matrix for multiple quantiles is given as
\[ W = (Q \otimes G )^{-1}, \text{ where } Q = (\min(\tau_i, \tau_j) - \tau_i \tau_j)_{ij}, \; G = \frac{1}{n}\sum_{i=1}^n x_i x_i^{\top}. \] A nice property of \(W\) is that it depends only on the data, so it can be constructed once outside of the MCMC iterations.
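As a rough sketch (again my own R, reusing the hypothetical `score_fn` from above), constructing \(W\) for a single \(\tau\) and evaluating the working log-likelihood could look like:
Show the code
# W = n / (tau * (1 - tau)) * (X'X)^{-1}; depends only on the data
make_w <- function(X, tau) {
  nrow(X) / (tau * (1 - tau)) * solve(crossprod(X))
}

# log of the working likelihood, up to the constant C
working_loglik <- function(beta, y, X, tau, W) {
  s <- score_fn(beta, y, X, tau)
  -drop(t(s) %*% W %*% s) / (2 * nrow(X))
}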
Score function
The following Stan model implements the single-\(\tau\) case. I’ll simulate some data and compare to the asymmetric Laplace:
Show the code
functions {
  // elementwise pinball (check) loss: 0.5 * (|u| + (2q - 1) * u)
  vector q_loss(real q, vector u) {
    return 0.5 * (abs(u) + (2 * q - 1) * u);
  }
  // aggregate over observations: X' * q_loss(y - X * beta)
  vector score(real q, vector y, matrix x, vector beta) {
    return transpose(x) * q_loss(q, y - x * beta);
  }
  // W = N / (q * (1 - q)) * (X'X)^{-1}
  matrix make_w(matrix x, real q) {
    int N = rows(x);
    int P = cols(x);
    real alpha = 1 / (q * (1 - q));
    matrix[P, P] x_out = inverse_spd(crossprod(x));
    return alpha * x_out * N;
  }
}
data {
  int N;          // number of observations
  int P;          // number of predictors
  real q;         // quantile level tau
  vector[N] y;    // response variable
  matrix[N, P] x; // design matrix
}
transformed data {
  // W depends only on the data, so it is built once
  matrix[P, P] W = make_w(x, q);
}
parameters {
  vector[P] beta;
}
model {
  vector[P] score_vec = score(q, y, x, beta);
  beta ~ normal(0, 4);
  // working log-likelihood: -s' W s / (2N)
  target += -quad_form(W, score_vec) / (2 * N);
}
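The compilation step isn’t shown in the post. I’m assuming the Stan program above is compiled with cmdstanr along these lines (the file name is hypothetical; `asym_laplace` is presumably the ALD model from part I, compiled the same way):
Show the code
library(cmdstanr)

# compile the score model (hypothetical file name)
score_qr <- cmdstan_model("score_qr.stan")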
Let’s simulate some data and see how well it performs.
Show the code
set.seed(12312)
library(quantreg)
N <- 1000
x <- runif(N, max = 10)
alpha <- -1
beta <- 2
y <- alpha + beta * x + rnorm(N, sd = .5 * x)
q <- 0.05
# frequentist estimate
out_freq <- quantreg::rq(y ~ x, tau = q)
out_score <- score_qr$sample(
  data = list(N = N,
              P = 2,
              q = q,
              y = y,
              x = as.matrix(data.frame(alpha = 1, x = x))),
  seed = 12123123,
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
  refresh = 0,
  show_messages = FALSE
)
Running MCMC with 4 parallel chains...
Chain 1 finished in 0.3 seconds.
Chain 2 finished in 0.3 seconds.
Chain 3 finished in 0.3 seconds.
Chain 4 finished in 0.3 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.3 seconds.
Total execution time: 0.5 seconds.
Show the code
out_asym_laplace <- asym_laplace$sample(
  data = list(N = N,
              P = 2,
              q = q,
              y = y,
              x = as.matrix(data.frame(alpha = 1, x = x))),
  seed = 12123123,
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
  refresh = 0,
  show_messages = FALSE
)
Running MCMC with 4 parallel chains...
Chain 1 finished in 0.2 seconds.
Chain 2 finished in 0.2 seconds.
Chain 3 finished in 0.2 seconds.
Chain 4 finished in 0.2 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.2 seconds.
Total execution time: 0.4 seconds.
Score
Show the code
out_score$summary()
# A tibble: 3 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -832. -831. 0.965 0.636 -833. -831. 1.00 657. 804.
2 beta[1] -1.42 -1.41 0.164 0.165 -1.69 -1.15 1.01 438. 417.
3 beta[2] 1.30 1.30 0.0260 0.0261 1.26 1.34 1.01 444. 420.
Asymmetric Laplace
Show the code
$summary(c("beta", "sigma")) out_asym_laplace
# A tibble: 3 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 beta[1] -1.07 -1.06 0.0811 0.0816 -1.21 -0.957 1.01 777. 862.
2 beta[2] 1.24 1.24 0.0181 0.0181 1.21 1.27 1.00 790. 809.
3 sigma 0.248 0.248 0.00825 0.00825 0.235 0.262 1.01 979. 1046.
Quantreg
Show the code
summary(out_freq, se = "boot")
Call: quantreg::rq(formula = y ~ x, tau = q)
tau: [1] 0.05
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -1.03935 0.11713 -8.87352 0.00000
x 1.23539 0.03639 33.95156 0.00000
Modified Score
We see that the frequentist estimate from quantreg and the asymmetric Laplace outputs are similar. The score method under-estimates the intercept (its credible interval doesn’t even include -1), and the coefficient on \(x\) appears over-estimated. Noticing this behavior, let’s see about adding a scale parameter, sigma, to the estimation.
Let’s re-estimate the above with the modification.
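Reading off the Stan code below, the modification divides the score by \(\sigma\) and adds a \(-n \log \sigma\) term, so the modified working log-likelihood (my notation, inferred from the code rather than taken from the paper) is \[ \log \mathcal{L}(y \mid x,\, \beta,\, \sigma) = -\frac{1}{2n\sigma^2} s_\tau(\beta)^{\top} W s_\tau(\beta) - n \log \sigma + \text{const}. \]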
Show the code
functions {
  // elementwise pinball (check) loss
  vector q_loss(real q, vector u) {
    return 0.5 * (abs(u) + (2 * q - 1) * u);
  }
  vector score(real q, vector y, matrix x, vector beta) {
    return transpose(x) * q_loss(q, y - x * beta);
  }
  // W = N / (q * (1 - q)) * (X'X)^{-1}
  matrix make_w(matrix x, real q) {
    int N = rows(x);
    int P = cols(x);
    real alpha = 1 / (q * (1 - q));
    matrix[P, P] x_out = crossprod(x);
    return alpha * inverse_spd(x_out) * N;
  }
}
data {
  int N;          // number of observations
  int P;          // number of predictors
  real q;         // quantile level tau
  vector[N] y;    // response variable
  matrix[N, P] x; // design matrix
}
transformed data {
  matrix[P, P] W = make_w(x, q);
}
parameters {
  vector[P] beta;
  real<lower=0> sigma;
}
model {
  // scale the score by sigma; -N * log(sigma) is the matching normalization
  vector[P] score_vec = score(q, y, x, beta) / sigma;
  beta ~ normal(0, 4);
  target += -quad_form(W, score_vec) / (2 * N) - N * log(sigma);
}
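As before, I’m assuming the modified program is compiled with cmdstanr (hypothetical file name):
Show the code
score_qr_modified <- cmdstan_model("score_qr_modified.stan")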
Show the code
out_score_modified <- score_qr_modified$sample(
  data = list(N = N,
              P = 2,
              q = q,
              y = y,
              x = as.matrix(data.frame(alpha = 1, x = x))),
  seed = 12123123,
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
  refresh = 0,
  show_messages = FALSE
)
Running MCMC with 4 parallel chains...
Chain 1 finished in 0.4 seconds.
Chain 2 finished in 0.4 seconds.
Chain 3 finished in 0.4 seconds.
Chain 4 finished in 0.5 seconds.
All 4 chains finished successfully.
Mean chain execution time: 0.4 seconds.
Total execution time: 0.6 seconds.
Show the code
out_score_modified$summary()
# A tibble: 4 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -755. -755. 1.25 1.02 -758. -754. 1.01 717. 972.
2 beta[1] -1.42 -1.40 0.222 0.231 -1.80 -1.08 1.00 631. 784.
3 beta[2] 1.30 1.30 0.0346 0.0359 1.25 1.36 1.00 639. 685.
4 sigma 1.29 1.29 0.0303 0.0302 1.24 1.34 1.01 793. 873.
The estimates are basically the same but the standard errors increased. We’ll see that the increase in standard errors makes for more reasonable posterior intervals.
Simulated data with N = 10000
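The refit step isn’t shown in the post; presumably it’s the same simulation and sampling calls as above, re-run with N = 10000, along the lines of:
Show the code
# regenerate the simulated data at the larger sample size
N <- 10000
x <- runif(N, max = 10)
y <- alpha + beta * x + rnorm(N, sd = .5 * x)
# ... then re-run quantreg::rq() and the three Stan fits exactly as above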
Score
Show the code
out_score$summary()
# A tibble: 3 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -9621. -9620. 1.05 0.815 -9623. -9.62e+3 1.01 716.
2 beta[1] -0.915 -0.913 0.0367 0.0353 -0.976 -8.56e-1 1.00 501.
3 beta[2] 1.15 1.15 0.00684 0.00655 1.14 1.16e+0 1.01 477.
# … with 1 more variable: ess_tail <dbl>
Modified Score
Show the code
out_score_modified$summary()
# A tibble: 4 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -8273. -8273. 1.23 1.02 -8275. -8272. 1.00 911.
2 beta[1] -0.922 -0.917 0.0515 0.0499 -1.02 -0.845 1.00 615.
3 beta[2] 1.15 1.15 0.00920 0.00930 1.14 1.17 1.00 695.
4 sigma 1.39 1.39 0.00997 0.0102 1.37 1.40 1.00 1264.
# … with 1 more variable: ess_tail <dbl>
Quantreg
Show the code
print(summary(out_freq, se = "iid"))
Call: quantreg::rq(formula = y ~ x, tau = q)
tau: [1] 0.05
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) -0.99120 0.02167 -45.74656 0.00000
x 1.16937 0.00374 312.48513 0.00000
Asymmetric Laplace
Show the code
$summary("beta") out_asym_laplace
# A tibble: 2 × 10
variable mean median sd mad q5 q95 rhat ess_bulk ess_tail
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 beta[1] -0.997 -0.995 0.0147 0.0138 -1.02 -0.974 1.00 1332. 1112.
2 beta[2] 1.17 1.17 0.00593 0.00581 1.16 1.18 1.00 1273. 1336.
With more data the score estimates do appear better. Let’s see how this does on the ImmunogG data.
Test on ImmunogG data
We’ll use the same quantile of \(5\%\) to test.
Show the code
library(Brq)
data("ImmunogG")
q <- 0.05
dat <- data.frame(y = ImmunogG$IgG,
                  alpha = 1,
                  x = ImmunogG$Age,
                  xsq = ImmunogG$Age^2)
out_score <- score_qr$sample(
  data = list(N = nrow(ImmunogG),
              P = 3,
              q = q,
              y = ImmunogG$IgG,
              x = as.matrix(dat[, 2:4])),
  seed = 12123123,
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
  refresh = 0,
  show_messages = FALSE
)
out_score_mod <- score_qr_modified$sample(
  data = list(N = nrow(ImmunogG),
              P = 3,
              q = q,
              y = ImmunogG$IgG,
              x = as.matrix(dat[, 2:4])),
  seed = 12123123,
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
  refresh = 0,
  show_messages = FALSE
)
out_al <- asym_laplace$sample(
  data = list(N = nrow(ImmunogG),
              P = 3,
              q = q,
              y = ImmunogG$IgG,
              x = as.matrix(dat[, 2:4])),
  seed = 12123123,
  parallel_chains = 4,
  iter_warmup = 500,
  iter_sampling = 500,
  refresh = 0,
  show_messages = FALSE
)
out_freq <- rq(y ~ x + xsq,
               data = dat,
               tau = q)
Results
Original score:
Show the code
out_score$summary()
# A tibble: 4 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -89.4 -89.1 1.23 1.07 -91.9 -88.0 1.01 483.
2 beta[1] 0.670 0.702 0.301 0.307 0.135 1.12 1.02 283.
3 beta[2] 1.21 1.20 0.244 0.238 0.803 1.60 1.02 256.
4 beta[3] -0.138 -0.137 0.0395 0.0378 -0.201 -0.0733 1.01 268.
# … with 1 more variable: ess_tail <dbl>
Modified score:
Show the code
out_score_mod$summary()
# A tibble: 5 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -72.6 -72.2 1.55 1.38 -75.6 -70.7 1.01 603.
2 beta[1] 0.686 0.709 0.249 0.254 0.254 1.06 1.02 334.
3 beta[2] 1.21 1.20 0.198 0.192 0.878 1.52 1.02 274.
4 beta[3] -0.137 -0.137 0.0312 0.0297 -0.188 -0.0852 1.02 289.
5 sigma 0.775 0.775 0.0324 0.0324 0.724 0.830 1.00 760.
# … with 1 more variable: ess_tail <dbl>
Quantreg:
Show the code
summary(out_freq, se = "iid")
Call: rq(formula = y ~ x + xsq, tau = q, data = dat)
tau: [1] 0.05
Coefficients:
Value Std. Error t value Pr(>|t|)
(Intercept) 0.76514 0.30477 2.51058 0.01259
x 1.19143 0.24792 4.80577 0.00000
xsq -0.13371 0.04037 -3.31187 0.00104
Asymmetric Laplace:
Show the code
out_al$summary()
# A tibble: 5 × 10
variable mean median sd mad q5 q95 rhat ess_bulk
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 lp__ -670. -670. 1.50 1.24 -673. -669. 1.01 585.
2 beta[1] 0.587 0.601 0.246 0.271 0.179 0.962 1.02 385.
3 beta[2] 1.31 1.31 0.199 0.210 1.01 1.64 1.02 336.
4 beta[3] -0.156 -0.155 0.0338 0.0333 -0.212 -0.104 1.01 373.
5 sigma 0.165 0.165 0.00958 0.00897 0.149 0.181 1.01 645.
# … with 1 more variable: ess_tail <dbl>
Interestingly, the results of the score method lie between the frequentist and asymmetric Laplace methods.
Conclusion
The score method is promising, but it doesn’t seem like one would use it over the asymmetric Laplace, given the ad-hoc modification needed to get reasonable intervals for the intercept. One thing the score method does offer is incorporating multiple quantiles in a principled way, which lets you put priors that link the estimates across quantiles. It also doesn’t need to re-run the data through the model separately for each quantile. In the next post, part III, I’ll show this. If you have a bunch of different quantiles, lots of data, and want to understand the correlations across parameters, then the score method may be right for you.