The learning objectives of this document are:
This is a key concept in scientific data analysis. Briefly, the regression problem consists of looking for a function that “best explains” a collection of response values in terms of some other feature values. It is closely related to (though not the same as) interpolation, which requires the fitted function to pass exactly through the observed points.
Suppose you have carried out an experiment leading to a table filled with numerical values. As usual, each row represents a sample or replicate in our experiment, and each column represents one of the distinct measurements (features) taken on each sample.
Let \(Y\) be a distinguished feature (response variable) that we want to explain as a function of another set of features: \(X_1, \ldots, X_m\).
In statistics, carrying out regression means finding the random process that generated the data in \(Y\), using the set of features as input. Random means that the process is subject to some level of randomness; consequently, unlike the case of deterministic functions, we do not expect to always get the same outputs from identical inputs.
In general, this problem is very hard. In practice, to make life simple, we will search in a relatively small family of such processes (distributions).
We need to clarify a few things before daring to work on any regression problem at all. At least, we need to specify:
We will also make a technical assumption: all samples in our dataset have been drawn independently from the same random process (the i.i.d. assumption).
It is common for the distributions to be taken from a family governed by a few parameters (for instance, the Gaussian, governed by its mean and variance, or the Poisson, governed by its rate).
Then finding the right solution is equivalent to finding the right parameters.
For each candidate distribution we can compute something called the likelihood of the data given the distribution, or likelihood for short. Given the response data \(Y = \{y_1,...,y_n\}\), and assuming the \(y_i\) have been produced by the random process encoded by the PMF or PDF \(f_{\theta}\) (where \(\theta\) represents the parameters that govern the distribution), the likelihood of the data is given by:
\[\mathcal{L}(\theta) = f_{\theta}(y_1)f_{\theta}(y_2)\ldots f_{\theta}(y_n) = \prod_{i=1}^n f_\theta(y_i)\] We will be looking for the \(\hat{\theta}\) that maximizes \(\mathcal{L}(\theta)\). In a more compact notation: \[\hat{\theta} = \textrm{argmax}_{\theta} \; \mathcal{L}(\theta)\]
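As a minimal sketch of this idea (using a made-up vector of counts, not data from the examples below), we can maximize a one-parameter Poisson likelihood numerically and check the result against the sample mean, which is the known closed-form maximum likelihood estimate of the Poisson rate:
# Toy example: maximum likelihood estimation of a single Poisson rate
y_toy <- c(2, 3, 0, 4, 1, 2, 3)                                # illustrative counts
loglik <- function(lambda) sum(dpois(y_toy, lambda, log = TRUE))
optimize(loglik, interval = c(1e-6, 20), maximum = TRUE)$maximum
mean(y_toy)                                                    # closed-form MLE: the sample mean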
### Likelihood: with covariates

But what if we want to make each draw depend on a set of covariates, the covariates we have been talking about before? In the context of generalized linear models, the trick is that the covariates are hidden inside the parameters of the distribution.
Suppose our dataset has response vector \(Y=(y_1, \ldots, y_n)\) and a covariate vector \(X=(x_1, \ldots, x_n)\). If we assume that \(Y\) has been generated by distributions of a family governed by a parameter \(\theta\), we will see \(\theta\) as being dependent on \(X\) in a linear fashion \(\theta = \alpha X + \beta\).
This has an important implication: the specific \(\theta\) used to draw each \(y_i\) (say \(\theta_i\)) depends on the covariate of this sample in a linear fashion: \(\theta_i = \alpha x_i + \beta\). Now we want to find \(\hat\alpha\) and \(\hat\beta\) such that:
\[ \mathcal{L} (\alpha, \beta) = f_{\theta_1}(y_1) f_{\theta_2}(y_2) \ldots f_{\theta_n}(y_n) = \prod_{i=1}^n f_{\theta_i}(y_i)\] \[(\hat\alpha, \hat\beta) = \textrm{argmax}_{(\alpha, \beta)} \; \mathcal{L}(\alpha, \beta) \]
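Before turning to real datasets, here is a minimal sketch (with simulated data and the noise level \(\sigma\) fixed at an assumed value of 1) of what this program looks like in R: write the log-likelihood as a function of \((\alpha, \beta)\) and hand it to a general-purpose optimizer.
# Sketch: Gaussian response whose mean depends linearly on one covariate
set.seed(1)
x_sim <- runif(50, 0, 10)                            # simulated covariate
y_sim <- 2 * x_sim + 5 + rnorm(50, sd = 1)           # true alpha = 2, beta = 5
negloglik <- function(par) {                         # par = c(alpha, beta); sigma fixed at 1
  mu <- par[1] * x_sim + par[2]
  -sum(dnorm(y_sim, mean = mu, sd = 1, log = TRUE))
}
optim(c(0, 0), negloglik)$par                        # should be close to c(2, 5)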
To illustrate linear regression from different angles, we will resort once more to the gapminder dataset.
This time we want to understand the statistical association between the GDP per capita and life expectancy across different countries for the year 2007.
Let’s retrieve and preprocess our data:
library(tidyverse)
gapminder <- as.data.frame(gapminder::gapminder)
gapminder_2007 <- gapminder %>% filter(year == 2007)
gapminder_sub <- gapminder_2007[, c("country", "lifeExp", "gdpPercap")]
gapminder_sub$log10_gdpPercap <- log10(gapminder_sub$gdpPercap)
head(gapminder_sub)
## country lifeExp gdpPercap log10_gdpPercap
## 1 Afghanistan 43.828 974.5803 2.988818
## 2 Albania 76.423 5937.0295 3.773569
## 3 Algeria 72.301 6223.3675 3.794025
## 4 Angola 42.731 4797.2313 3.680991
## 5 Argentina 75.320 12779.3796 4.106510
## 6 Australia 81.235 34435.3674 4.537005
Recall that our goal is to compute a vector \(\hat\beta=(\hat\beta_1, \hat\beta_2)\) such that if \(A\in\mathbb{R}^{n\times m}\) is the matrix of covariates across samples and \(b\in\mathbb{R}^n\) is the vector of independent terms (response variable), \(A\beta\) is a reasonable representation of what has been observed as \(b\).
We define the matrix of coefficients \(A\) and the independent term \(b\).
num_rows <- nrow(gapminder_sub)
A <- matrix(
c(gapminder_sub$log10_gdpPercap, rep(1, num_rows)),
nrow = num_rows,
ncol = 2,
byrow = FALSE
)
b <- as.matrix(gapminder_sub$lifeExp)
Notice that \(A\) already incorporates a column of ones, which yields an intercept term after multiplying by the unknown vector of linear coefficients.
We already know (and understand the geometric meaning behind it; see the tutorial on orthogonal projections) that at least one sensible way to proceed is to compute the least-squares solution \(\hat{x}\) (our \(\hat\beta\) from above) with the following recipe:
\[\hat{x} = (A^tA)^{-1} A^t b.\]
Let’s do the computation:
library(matlib)
xhat <- inv(t(A) %*% A) %*% t(A) %*% b
xhat
## [,1]
## [1,] 16.585212
## [2,] 4.949584
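As a cross-check (not needed for the argument, just reassuring), base R's lm() fits the same ordinary least-squares model and should reproduce these numbers, with the intercept reported first and the slope on log10_gdpPercap second:
# Cross-check with lm(): same ordinary least-squares fit
fit_lm <- lm(lifeExp ~ log10_gdpPercap, data = gapminder_sub)
coef(fit_lm)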
The most compact, straightforward (and lazy) way to solve this problem in R is by means of the built-in functions for GLM regression:
glm(lifeExp ~ log10_gdpPercap + 1, family=gaussian, data=gapminder_sub)
##
## Call: glm(formula = lifeExp ~ log10_gdpPercap + 1, family = gaussian,
## data = gapminder_sub)
##
## Coefficients:
## (Intercept) log10_gdpPercap
## 4.95 16.59
##
## Degrees of Freedom: 141 Total (i.e. Null); 140 Residual
## Null Deviance: 20550
## Residual Deviance: 7102 AIC: 964.5
There are already two remarkable things in this short expression:

- the formula "lifeExp ~ log10_gdpPercap + 1" employed as first argument
- the "family=gaussian" argument
Let’s frame our problem as an optimization problem, based on minimizing the function that measures the sum of squares of the residuals:
Let’s define the function that given the parameters of the model (slope and intercept) gives the sum of squares of the residuals:
squared_error <- function(x){
  # x = c(slope, intercept); returns the sum of squared residuals (A x - b)^t (A x - b)
  v <- c(x[1], x[2])
  C <- A %*% v - b
  t(C) %*% C
}
# test the function at some point
a <- c(0,1)
squared_error(a)
## [,1]
## [1,] 639243
What does the sum of squares of the residuals have to do with the previous discussion in the first place?
The squared error arises naturally from the expression of the log-likelihood, where the distributions employed are Gaussian:
\[f_{\mu, \sigma}(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2}\] and the linear dependence with the covariates is incorporated to the \(\mu\):
\[\mu_i = A_i\beta\] where \(A_i\) represents the \(i\)-th row of \(A\), i.e., the covariate values corresponding to the \(i\)-th sample.
Let’s compute the likelihood function:
\[\mathcal{L}(\beta) = \prod_{i=1}^n f_{\mu_i, \sigma}(y_i) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2} \left(\frac{y_i-\mu_i}{\sigma}\right)^2}\] Let’s compute the log version: \[ \begin{align} \log\mathcal{L}(\beta) & = & -n\log\sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu_i)^2 \\ & = & k - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-A_i\beta)^2 = k - \frac{1}{2\sigma^2}(A\beta - y)^t(A\beta - y) \end{align} \] where \(k\) is a constant term which does not depend on \(\beta\).
The only part of the expression that depends on \(\beta\) involves the squared error function, and it enters with a negative sign: the log-likelihood is maximized exactly when the squared error is minimized.
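To make the equivalence concrete, the sketch below evaluates the Gaussian log-likelihood directly with dnorm(), fixing \(\sigma\) at an arbitrary value (the choice of \(\sigma\) rescales the log-likelihood but does not change which \(\beta\) maximizes it), and checks that maximizing it lands essentially on the least-squares solution:
# Gaussian log-likelihood of the linear model; sigma fixed at an arbitrary value
gaussian_loglik <- function(beta, sigma = 5) {
  mu <- A %*% beta
  sum(dnorm(b, mean = mu, sd = sigma, log = TRUE))
}
optim(c(5, 5), function(beta) -gaussian_loglik(beta))$par   # maximize (optim minimizes)
c(xhat)                                                     # least-squares solution, for comparison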
For a given function, there are packages that allow us to compute its gradient numerically at a given point. For instance:
library(rootSolve)
x0 <- c(15, 1) # some example
grad <- gradient(squared_error, x0)
grad
## [,1] [,2]
## [1,] -10654.8 -2806.086
However, interestingly, we can provide a closed form for the gradient of this specific objective function (see Exercises for Session 13-14).
Let’s implement it as a function and evaluate it at the previously computed “xhat”:
grad_squared_error <- function(x){
  # closed-form gradient of the squared error: 2 (A x - b)^t A
  v <- c(x[1], x[2])
  2 * t(A %*% v - b) %*% A
}
grad_squared_error(xhat)
## [,1] [,2]
## [1,] 0.5773928 0.1504183
squared_error(xhat)
## [,1]
## [1,] 7101.712
We can use our gradient function implementation to find the root of the gradient numerically:
library(rootSolve)
res <- multiroot(grad_squared_error, c(5,5), atol=c(1e-6,1e-6), maxiter=1e4)
xhat2 <- res$root
print(xhat2)
## [1] 16.585064 4.949612
print(res$f.root)
## [,1] [,2]
## [1,] 1.990862e-08 5.505825e-09
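An alternative route, equally valid here, is to skip the gradient altogether and minimize the squared error function directly with a general-purpose optimizer; this should land on the same solution:
# Minimize the squared error directly; drop() turns the 1x1 matrix into a scalar
res_optim <- optim(c(5, 5), function(x) drop(squared_error(x)))
res_optim$par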
To represent graphically the function we have just optimized, it is convenient to set up a grid of inputs and add a new column with the outputs of our function:
xs <- seq(10, 20, by=0.05)
ys <- seq(0, 10, by=0.05)
griddf <- expand.grid(alpha=xs, beta=ys)
griddf$value <- apply(griddf, 1, function(x) {squared_error(x)})
Let’s visualize the function values around “xhat” as a 2D contour plot filled with colors:
se_gg <- griddf %>%
ggplot(aes(alpha, beta, z=value)) +
geom_contour_filled(bins=100, show.legend=FALSE) +
geom_point(aes(x=xhat[1], y=xhat[2]), colour="red") +
annotate("text", x=xhat[1], y=xhat[2], hjust = 1.2, vjust=1.2, label = "xhat", size = 5, color = "red") +
coord_fixed(ratio = 1) +
labs(x="alpha", y="beta") +
ggtitle("Squared residuals function: colormap plot")
se_gg
Let’s visualize the function using just the 2D level sets:
se_gg <- griddf %>%
ggplot(aes(alpha, beta, z=value)) +
geom_contour(bins=1000, show.legend=FALSE) +
geom_point(aes(x=xhat[1], y=xhat[2]), colour="red") +
annotate("text", x=xhat[1], y=xhat[2], hjust = 1.2, vjust=1.2, label = "xhat", size = 5, color = "black") +
coord_fixed(ratio = 1) +
labs(x="alpha", y="beta") +
ggtitle("Squared residuals function: contour plots")
se_gg
Let’s take advantage of our knowledge to compute the Hessian matrix in closed form (see also Exercises for Session 13-14):
hessian <- 2 * t(A) %*% A
hessian
## [,1] [,2]
## [1,] 4074.076 1062.668
## [2,] 1062.668 284.000
The eigenvalues and eigenvectors of the Hessian tell us about:

- eigenvalues: how open or closed the branches of the vertical cross-sections of the graph of the function are near “xhat” in the directions of the eigenvectors; these vertical cross-sections are all parabolas, and the graph of the function in this specific example is known as an elliptic paraboloid.
- eigenvectors: the orthogonal directions signaling the parabolic cross-sections with maximal and minimal branch openness.
eigen <- eigen(hessian)
eigenvectors <- eigen$vectors
eigenvectors
## [,1] [,2]
## [1,] -0.9675283 0.2527627
## [2,] -0.2527627 -0.9675283
eigenvalues <- eigen$values
eigenvalues
## [1] 4351.694003 6.382378
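As a quick numerical illustration of what the eigenpairs mean: moving one unit away from “xhat” along an eigenvector \(v_j\) increases the (quadratic) squared error by approximately \(\lambda_j/2\) (exactly so if the gradient at “xhat” were exactly zero), so the first eigenvector marks the steepest direction and the second the flattest:
# Increase of the squared error one unit away from xhat along each eigenvector
sapply(1:2, function(j) {
  v <- eigenvectors[, j]
  drop(squared_error(xhat + v) - squared_error(xhat))
})
eigenvalues / 2   # expected increases, up to the tiny residual gradient at xhat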
Let’s see where those vectors are sitting in the contour plot.
se_gg <- griddf %>%
ggplot(aes(alpha, beta, z=value)) +
geom_contour(bins=200, show.legend=FALSE) +
geom_point(aes(x=xhat[1], y=xhat[2]), colour="red") +
geom_segment(aes(x=xhat[1], y=xhat[2],
xend=xhat[1] + eigenvectors[1,1], yend=xhat[2] + eigenvectors[2,1]),
arrow = arrow(length = unit(0.2, "cm"))) +
geom_segment(aes(x=xhat[1], y=xhat[2],
xend=xhat[1] + eigenvectors[1,2], yend=xhat[2] + eigenvectors[2,2]),
arrow = arrow(length = unit(0.2, "cm"))) +
coord_fixed(ratio = 1) +
annotate("text", x=xhat[1], y=xhat[2], hjust = 1.2, vjust=2.0, label = "xhat", size = 5, color = "black") +
labs(x="alpha", y="beta") +
ggtitle("Squared residuals function: contour plots")
se_gg
To illustrate logistic regression we will resort to the Breast Cancer Wisconsin Diagnostic Dataset from the UCI Machine Learning Repository, imported with dslabs.
library(dslabs)
brca <- as.data.frame(dslabs::brca)
brca_sub <- brca[, c("x.radius_mean", "x.texture_mean", "y")]
brca_sub$y <- brca_sub$y == "M"   # recode the diagnosis as TRUE (malignant) / FALSE (benign)
We want to understand the statistical association between malignancy of the tumor (yes or no) and a set of potential predictors derived from biopsies.
res <- glm(y ~ x.radius_mean + x.texture_mean, family=binomial, data=brca_sub)
res$coefficients
## (Intercept) x.radius_mean x.texture_mean
## -19.849417 1.057102 0.218141
In logistic regression we assume that the binary response \(y_i\) is drawn from a Bernoulli distribution with a probability \(p_i\) that depends on the covariates.
More specifically, if \(A\in\mathbb{R}^{n\times m}\) is the matrix that contains the values of the \(m\) covariates for the \(n\) samples, and \(A_i\) denotes the \(i\)-th row, then
\[p_i = \frac{e^{A_i{\bf \beta}}}{1 + e^{A_i{\bf \beta}}}\]
where \({\beta} = (\beta_1,\ldots,\beta_m)\) is the vector of unknown parameters of the model.
There is a helpful alternative way to stress this relationship:
\[\log\left(\frac{p_i}{1 - p_i}\right) = A_i{\beta}\]
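To see this mapping in action, here is a small sanity check (the design matrix A_brca below is built only for this illustration): recover the fitted probabilities by hand from the estimated coefficients with the inverse-logit function plogis(), and compare them with what glm reports:
# Fitted probabilities recovered by hand from the logistic coefficients
beta_hat <- coef(res)
A_brca <- cbind(1, brca_sub$x.radius_mean, brca_sub$x.texture_mean)  # intercept column + covariates
p_manual <- plogis(drop(A_brca %*% beta_hat))        # exp(A beta) / (1 + exp(A beta))
p_glm <- predict(res, type = "response")
max(abs(p_manual - p_glm))                           # should be essentially zero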
What the function glm does boils down to computing the parameter \(\hat\beta\) that maximizes the following likelihood:
\[ \begin{align} \mathcal{L}(\beta) & = & \prod_{i=1}^n \left\{ y_ip_i + (1 - y_i)(1 - p_i)\right\} \\ & = & \prod_{i=1}^n \left\{ y_i \frac{e^{A_i{\bf \beta}}}{1 + e^{A_i{\bf \beta}}} + (1 - y_i)(1 - \frac{e^{A_i{\bf \beta}}}{1 + e^{A_i{\bf \beta}}})\right\} \end{align} \] or its log version
\[ \begin{align} \log\mathcal{L}(\beta) & = & \sum_{i=1}^n \log\left\{ y_ip_i + (1 - y_i)(1 - p_i)\right\} \\ & = & \sum_{i=1}^n y_i\log p_i + (1 - y_i)\log(1 - p_i). \end{align} \]
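A quick check (purely for reassurance) that plugging the fitted probabilities into this log-likelihood reproduces the value reported by logLik():
# Log-likelihood evaluated at the fitted probabilities
p_hat <- fitted(res)                  # fitted probabilities, same as predict(res, type = "response")
y_num <- as.numeric(brca_sub$y)       # TRUE/FALSE -> 1/0
sum(y_num * log(p_hat) + (1 - y_num) * log(1 - p_hat))
logLik(res)                           # should agree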
To illustrate Poisson regression we will resort to the US Contagious Diseases dataset imported with dslabs. This dataset portrays yearly counts for Hepatitis A, Measles, Mumps, Pertussis, Polio, Rubella, and Smallpox for US states.
library(dslabs)
diseases <- as.data.frame(dslabs::us_contagious_diseases)
diseases <- diseases %>% filter(state == "Arizona") %>% filter(disease == "Measles")
diseases$log10_population <- log10(diseases$population)
We would like to analyze the trends of measles cases reported in Arizona through time but would also like to understand the influence of the population size.
res <- glm(count ~ log10_population, family=poisson, data=diseases)
res$coefficients
## (Intercept) log10_population
## 16.439380 -1.445008
In Poisson regression we assume that the count response variable \(n_i\) is drawn from a Poisson distribution with a mean parameter \(\lambda_i\) that depends on the covariates.
More specifically, if \(A\in\mathbb{R}^{n\times m}\) is the matrix that contains the values of the \(m\) covariates for the \(n\) samples, and \(A_i\) denotes the \(i\)-th row, then
\[\lambda_i = e^{A_i \beta}\]
where \({\beta} = (\beta_1,\ldots,\beta_m)\) is the vector of unknown parameters of the model.
What the function glm does boils down to computing the parameter \(\hat\beta\) that maximizes the following likelihood:
\[\mathcal{L}(\beta) = \prod_{i=1}^n \frac{\lambda^{n_i}_i}{n_i!}e^{-\lambda_i}\] or its log version
\[ \begin{align} \log\mathcal{L}(\beta) &=& \sum_{i=1}^n \left\{-\lambda_i + n_i\log\lambda_i - \log(n_i!)\right\} \\ &=& k + \sum_{i=1}^n \left\{-e^{A_i\beta} + n_i A_i\beta\right\} \end{align} \] where \(k\) is a constant term which does not depend on \(\beta\).
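As before, a quick check: the fitted means \(\hat\lambda_i = e^{A_i\hat\beta}\) can be recovered by hand from the coefficients, and plugging them into the full Poisson log-likelihood (including the \(\log(n_i!)\) term, via dpois()) reproduces the value reported by logLik(); the model frame is used so that any rows dropped by glm are handled consistently:
# Fitted Poisson means recovered by hand, and the corresponding log-likelihood
beta_hat <- coef(res)
mf <- model.frame(res)                                   # rows actually used in the fit
lambda_hat <- exp(beta_hat[1] + beta_hat[2] * mf$log10_population)
max(abs(lambda_hat - fitted(res)))                       # should be essentially zero
sum(dpois(mf$count, lambda_hat, log = TRUE))             # full Poisson log-likelihood
logLik(res)                                              # should agree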
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=es_ES.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Madrid
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dslabs_0.7.6 rootSolve_1.8.2.4 matlib_0.9.6 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0 knitr_1.44 BiocStyle_2.28.1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 utf8_1.2.3 generics_0.1.3 stringi_1.7.12 hms_1.1.3 digest_0.6.33 magrittr_2.0.3 rgl_1.2.1 evaluate_0.22 grid_4.3.2 timechange_0.2.0 bookdown_0.35 fastmap_1.1.1 jsonlite_1.8.7 BiocManager_1.30.22 fansi_1.0.5 viridisLite_0.4.2 scales_1.2.1 isoband_0.2.7 jquerylib_0.1.4 abind_1.4-5 cli_3.6.1 rlang_1.1.1 munsell_0.5.0 base64enc_0.1-3 withr_2.5.1 cachem_1.0.8 yaml_2.3.7 tools_4.3.2 tzdb_0.4.0 colorspace_2.1-0 vctrs_0.6.4 R6_2.5.1 lifecycle_1.0.3 car_3.1-2 htmlwidgets_1.6.2 MASS_7.3-60 pkgconfig_2.0.3 pillar_1.9.0 bslib_0.5.1 gtable_0.3.4 glue_1.6.2 xfun_0.40 tidyselect_1.2.0 rstudioapi_0.15.0 farver_2.1.1 xtable_1.8-4 htmltools_0.5.6.1 gapminder_1.0.0 labeling_0.4.3 carData_3.0-5 rmarkdown_2.25 compiler_4.3.2