The learning objectives of this document are:
This is a key concept in scientific data analysis. Briefly, the regression problem consists of looking for a function that “best explains” a collection of response values in terms of some other feature values. It is closely related to (though not the same as) interpolation, which requires the fitted function to pass exactly through the observed points.
Suppose you have carried out an experiment leading to a table filled with numerical values. As usual, each row represents a sample or replicate in our experiment, and each column represents one of the distinct measurements (features) taken on each sample.
Let \(Y\) be a distinguished feature (response variable) that we want to explain as a function of another set of features: \(X_1, \ldots, X_m\).
In statistics, carrying out regression means finding the random process that generated the data in \(Y\), using the set of features as input. Random means that the process is subject to some level of randomness; consequently, unlike the case of deterministic functions, we do not expect to always get the same outputs from identical inputs.
In general, this problem is very hard. In practice, to make life simple, we will search in a relatively small family of such processes (distributions).
We need to clarify a few things before daring to work on any regression problem at all. At least, we need to specify:
We will also make a technical assumption: all samples in our dataset have been drawn independently from the same random process (the i.i.d. assumption).
It is common for the distributions to be taken from a family governed by a few parameters (for instance, the Gaussian, governed by its mean and variance, or the Poisson, governed by its rate).
Then finding the right solution is equivalent to finding the right parameters.
For each candidate distribution we can compute something called the likelihood of the data given the distribution, or likelihood for short. Given the response data \(Y = \{y_1,...,y_n\}\), and assuming the \(y_i\) have been produced by the random process encoded by the PMF or PDF \(f_{\theta}\) (where \(\theta\) represents the parameters that govern the distribution), the likelihood of the data is given by:
\[\mathcal{L}(\theta) = f_{\theta}(y_1)f_{\theta}(y_2)\ldots f_{\theta}(y_n) = \prod_{i=1}^n f_\theta(y_i)\] We will be looking for the \(\hat{\theta}\) that maximizes \(\mathcal{L}(\theta)\). In a more compact notation: \[\hat{\theta} = \textrm{argmax}_{\theta} \; \mathcal{L}(\theta)\]
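As a minimal sketch of this idea (using a made-up vector of counts, not data from the examples below), we can maximize a one-parameter Poisson likelihood numerically and check the result against the sample mean, which is the known closed-form maximum likelihood estimate of the Poisson rate:
# Toy example: maximum likelihood estimation of a single Poisson rate
y_toy <- c(2, 3, 0, 4, 1, 2, 3)                                # illustrative counts
loglik <- function(lambda) sum(dpois(y_toy, lambda, log = TRUE))
optimize(loglik, interval = c(1e-6, 20), maximum = TRUE)$maximum
mean(y_toy)                                                    # closed-form MLE: the sample mean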
### Likelihood: with covariates

But what if we want to make each draw depend on a set of covariates, the covariates we have been talking about before? In the context of generalized linear models, the trick is that the covariates are hidden inside the parameters of the distribution.
Suppose our dataset has response vector \(Y=(y_1, \ldots, y_n)\) and a covariate vector \(X=(x_1, \ldots, x_n)\). If we assume that \(Y\) has been generated by distributions of a family governed by a parameter \(\theta\), we will see \(\theta\) as being dependent on \(X\) in a linear fashion \(\theta = \alpha X + \beta\).
This has an important implication: the specific \(\theta\) used to draw each \(y_i\) (say \(\theta_i\)) depends on the covariate of this sample in a linear fashion: \(\theta_i = \alpha x_i + \beta\). Now we want to find \(\hat\alpha\) and \(\hat\beta\) such that:
\[ \mathcal{L} (\alpha, \beta) = f_{\theta_1}(y_1) f_{\theta_2}(y_2) \ldots f_{\theta_n}(y_n) = \prod_{i=1}^n f_{\theta_i}(y_i)\] \[(\hat\alpha, \hat\beta) = \textrm{argmax}_{(\alpha, \beta)} \; \mathcal{L}(\alpha, \beta) \]
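Before turning to real datasets, here is a minimal sketch (with simulated data and the noise level \(\sigma\) fixed at an assumed value of 1) of what this program looks like in R: write the log-likelihood as a function of \((\alpha, \beta)\) and hand it to a general-purpose optimizer.
# Sketch: Gaussian response whose mean depends linearly on one covariate
set.seed(1)
x_sim <- runif(50, 0, 10)                            # simulated covariate
y_sim <- 2 * x_sim + 5 + rnorm(50, sd = 1)           # true alpha = 2, beta = 5
negloglik <- function(par) {                         # par = c(alpha, beta); sigma fixed at 1
  mu <- par[1] * x_sim + par[2]
  -sum(dnorm(y_sim, mean = mu, sd = 1, log = TRUE))
}
optim(c(0, 0), negloglik)$par                        # should be close to c(2, 5)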
To illustrate linear regression from different angles, we will resort once more to the gapminder dataset.
This time we want to understand the statistical association between the GDP per capita and life expectancy across different countries for the year 2007.
Let’s retrieve and preprocess our data:
library(tidyverse)
gapminder <- as.data.frame(gapminder::gapminder)
gapminder_2007 <- gapminder %>% filter(year == 2007)
gapminder_sub <- gapminder_2007[, c("country", "lifeExp", "gdpPercap")]
gapminder_sub$log10_gdpPercap <- log10(gapminder_sub$gdpPercap)
head(gapminder_sub)
## country lifeExp gdpPercap log10_gdpPercap
## 1 Afghanistan 43.828 974.5803 2.988818
## 2 Albania 76.423 5937.0295 3.773569
## 3 Algeria 72.301 6223.3675 3.794025
## 4 Angola 42.731 4797.2313 3.680991
## 5 Argentina 75.320 12779.3796 4.106510
## 6 Australia 81.235 34435.3674 4.537005
Recall that our goal is to compute a vector \(\hat\beta=(\hat\beta_1, \hat\beta_2)\) such that if \(A\in\mathbb{R}^{n\times m}\) is the matrix of covariates across samples and \(b\in\mathbb{R}^n\) is the vector of independent terms (response variable), \(A\beta\) is a reasonable representation of what has been observed as \(b\).
We define the matrix of coefficients \(A\) and the independent term \(b\).
num_rows <- nrow(gapminder_sub)
A <- matrix(
c(gapminder_sub$log10_gdpPercap, rep(1, num_rows)),
nrow = num_rows,
ncol = 2,
byrow = FALSE
)
b <- as.matrix(gapminder_sub$lifeExp)
Notice that \(A\) already incorporates a column of ones, which yields an intercept term after multiplying by the unknown vector of linear coefficients.
We already know (and understand the geometric meaning behind it; see the tutorial on orthogonal projections) that at least one sensible way to proceed is to compute the least-squares solution \(\hat{x}\) (our \(\hat\beta\) from above) with the following recipe:
\[\hat{x} = (A^tA)^{-1} A^t b.\]
Let’s do the computation:
library(matlib)
xhat <- inv(t(A) %*% A) %*% t(A) %*% b
xhat
## [,1]
## [1,] 16.585212
## [2,] 4.949584
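As a cross-check (not needed for the argument, just reassuring), base R's lm() fits the same ordinary least-squares model and should reproduce these numbers, with the intercept reported first and the slope on log10_gdpPercap second:
# Cross-check with lm(): same ordinary least-squares fit
fit_lm <- lm(lifeExp ~ log10_gdpPercap, data = gapminder_sub)
coef(fit_lm)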
The most compact, straightforward (and lazy) way to solve this problem in R is by means of the built-in functions for GLM regression:
glm(lifeExp ~ log10_gdpPercap + 1, family=gaussian, data=gapminder_sub)
##
## Call: glm(formula = lifeExp ~ log10_gdpPercap + 1, family = gaussian,
## data = gapminder_sub)
##
## Coefficients:
## (Intercept) log10_gdpPercap
## 4.95 16.59
##
## Degrees of Freedom: 141 Total (i.e. Null); 140 Residual
## Null Deviance: 20550
## Residual Deviance: 7102 AIC: 964.5
There are already two remarkable things in this short expression:

- the formula "lifeExp ~ log10_gdpPercap + 1" employed as first argument
- the "family=gaussian" argument
Let’s frame our problem as an optimization problem, based on minimizing the function that measures the sum of squares of the residuals:
Let’s define the function that given the parameters of the model (slope and intercept) gives the sum of squares of the residuals:
squared_error <- function(x){
  # x = c(slope, intercept); returns the sum of squared residuals (A x - b)^t (A x - b)
  v <- c(x[1], x[2])
  C <- A %*% v - b
  t(C) %*% C
}
# test the function at some point
a <- c(0,1)
squared_error(a)
## [,1]
## [1,] 639243
What does the sum of squares of the residuals have to do with the previous discussion in the first place?
The squared error arises naturally from the expression of the log-likelihood, where the distributions employed are Gaussian:
\[f_{\mu, \sigma}(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2} \left(\frac{x-\mu}{\sigma}\right)^2}\] and the linear dependence with the covariates is incorporated to the \(\mu\):
\[\mu_i = A_i\beta\] where \(A_i\) represents the \(i\)-th row of \(A\), i.e., the covariate values corresponding to the \(i\)-th sample.
Let’s compute the likelihood function:
\[\mathcal{L}(\beta) = \prod_{i=1}^n f_{\mu_i, \sigma}(y_i) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2} \left(\frac{y_i-\mu_i}{\sigma}\right)^2}\] Let’s compute the log version: \[ \begin{align} \log\mathcal{L}(\beta) & = & -n\log\sqrt{2\pi\sigma^2} - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu_i)^2 \\ & = & k - \frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-A_i\beta)^2 = k - \frac{1}{2\sigma^2}(A\beta - y)^t(A\beta - y) \end{align} \] where \(k\) is a constant term which does not depend on \(\beta\).
The only part of the expression that depends on \(\beta\) involves the squared error function, and it enters with a negative sign: the log-likelihood is maximized exactly when the squared error is minimized.
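To make the equivalence concrete, the sketch below evaluates the Gaussian log-likelihood directly with dnorm(), fixing \(\sigma\) at an arbitrary value (the choice of \(\sigma\) rescales the log-likelihood but does not change which \(\beta\) maximizes it), and checks that maximizing it lands essentially on the least-squares solution:
# Gaussian log-likelihood of the linear model; sigma fixed at an arbitrary value
gaussian_loglik <- function(beta, sigma = 5) {
  mu <- A %*% beta
  sum(dnorm(b, mean = mu, sd = sigma, log = TRUE))
}
optim(c(5, 5), function(beta) -gaussian_loglik(beta))$par   # maximize (optim minimizes)
c(xhat)                                                     # least-squares solution, for comparison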
For a given function, there are packages that allow us to compute its gradient numerically at a given point. For instance:
library(rootSolve)
x0 <- c(15, 1) # some example
grad <- gradient(squared_error, x0)
grad
## [,1] [,2]
## [1,] -10654.8 -2806.086
However, interestingly, we can provide a closed form for the gradient of this specific objective function (see Exercises for Session 13-14).
Let’s implement it as a function and evaluate it at the previously computed “xhat”:
grad_squared_error <- function(x){
  # closed-form gradient of the squared error: 2 (A x - b)^t A
  v <- c(x[1], x[2])
  2 * t(A %*% v - b) %*% A
}
grad_squared_error(xhat)
## [,1] [,2]
## [1,] 0.5773928 0.1504183
squared_error(xhat)
## [,1]
## [1,] 7101.712
We can use our gradient function implementation to find the root of the gradient numerically:
library(rootSolve)
res <- multiroot(grad_squared_error, c(5,5), atol=c(1e-6,1e-6), maxiter=1e4)
xhat2 <- res$root
print(xhat2)
## [1] 16.585064 4.949612
print(res$f.root)
## [,1] [,2]
## [1,] 1.990862e-08 5.505825e-09
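An alternative route, equally valid here, is to skip the gradient altogether and minimize the squared error function directly with a general-purpose optimizer; this should land on the same solution:
# Minimize the squared error directly; drop() turns the 1x1 matrix into a scalar
res_optim <- optim(c(5, 5), function(x) drop(squared_error(x)))
res_optim$par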
To represent graphically the function we have just optimized, it is convenient to set up a grid of inputs and add a new column with the outputs of our function:
xs <- seq(10, 20, by=0.05)
ys <- seq(0, 10, by=0.05)
griddf <- expand.grid(alpha=xs, beta=ys)
griddf$value <- apply(griddf, 1, function(x) {squared_error(x)})
Let’s visualize the function values around “xhat” as a 2D contour plot filled with colors:
se_gg <- griddf %>%
ggplot(aes(alpha, beta, z=value)) +
geom_contour_filled(bins=100, show.legend=FALSE) +
geom_point(aes(x=xhat[1], y=xhat[2]), colour="red") +
annotate("text", x=xhat[1], y=xhat[2], hjust = 1.2, vjust=1.2, label = "xhat", size = 5, color = "red") +
coord_fixed(ratio = 1) +
labs(x="alpha", y="beta") +
ggtitle("Squared residuals function: colormap plot")
se_gg
Let’s visualize the function using just the 2D level sets:
se_gg <- griddf %>%
ggplot(aes(alpha, beta, z=value)) +
geom_contour(bins=1000, show.legend=FALSE) +
geom_point(aes(x=xhat[1], y=xhat[2]), colour="red") +
annotate("text", x=xhat[1], y=xhat[2], hjust = 1.2, vjust=1.2, label = "xhat", size = 5, color = "black") +
coord_fixed(ratio = 1) +
labs(x="alpha", y="beta") +
ggtitle("Squared residuals function: contour plots")
se_gg
Let’s take advantage of our knowledge to compute the Hessian matrix in closed form (see also Exercises for Session 13-14):
hessian <- 2 * t(A) %*% A
hessian
## [,1] [,2]
## [1,] 4074.076 1062.668
## [2,] 1062.668 284.000
The eigenvalues and eigenvectors of the Hessian tell us about:

- eigenvalues: how open or closed the branches of the vertical cross-sections of the graph of the function are near “xhat” in the directions of the eigenvectors; these vertical cross-sections are all parabolas, and the graph of the function in this specific example is known as an elliptic paraboloid.
- eigenvectors: the orthogonal directions signaling the parabolic cross-sections with maximal and minimal branch openness.
eigen <- eigen(hessian)
eigenvectors <- eigen$vectors
eigenvectors
## [,1] [,2]
## [1,] -0.9675283 0.2527627
## [2,] -0.2527627 -0.9675283
eigenvalues <- eigen$values
eigenvalues
## [1] 4351.694003 6.382378
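As a quick numerical illustration of what the eigenpairs mean: moving one unit away from “xhat” along an eigenvector \(v_j\) increases the (quadratic) squared error by approximately \(\lambda_j/2\) (exactly so if the gradient at “xhat” were exactly zero), so the first eigenvector marks the steepest direction and the second the flattest:
# Increase of the squared error one unit away from xhat along each eigenvector
sapply(1:2, function(j) {
  v <- eigenvectors[, j]
  drop(squared_error(xhat + v) - squared_error(xhat))
})
eigenvalues / 2   # expected increases, up to the tiny residual gradient at xhat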
Let’s see where those vectors are sitting in the contour plot.
se_gg <- griddf %>%
ggplot(aes(alpha, beta, z=value)) +
geom_contour(bins=200, show.legend=FALSE) +
geom_point(aes(x=xhat[1], y=xhat[2]), colour="red") +
geom_segment(aes(x=xhat[1], y=xhat[2],
xend=xhat[1] + eigenvectors[1,1], yend=xhat[2] + eigenvectors[2,1]),
arrow = arrow(length = unit(0.2, "cm"))) +
geom_segment(aes(x=xhat[1], y=xhat[2],
xend=xhat[1] + eigenvectors[1,2], yend=xhat[2] + eigenvectors[2,2]),
arrow = arrow(length = unit(0.2, "cm"))) +
coord_fixed(ratio = 1) +
annotate("text", x=xhat[1], y=xhat[2], hjust = 1.2, vjust=2.0, label = "xhat", size = 5, color = "black") +
labs(x="alpha", y="beta") +
ggtitle("Squared residuals function: contour plots")
se_gg
To illustrate logistic regression we will resort to the Breast Cancer Wisconsin Diagnostic Dataset from the UCI Machine Learning Repository, imported with dslabs.
library(dslabs)
brca <- as.data.frame(dslabs::brca)
brca_sub <- brca[, c("x.radius_mean", "x.texture_mean", "y")]
brca_sub$y <- brca_sub$y == "M"   # recode the diagnosis as TRUE (malignant) / FALSE (benign)
We want to understand the statistical association between malignancy of the tumor (yes or no) and a set of potential predictors derived from biopsies.
res <- glm(y ~ x.radius_mean + x.texture_mean, family=binomial, data=brca_sub)
res$coefficients
## (Intercept) x.radius_mean x.texture_mean
## -19.849417 1.057102 0.218141
In logistic regression we assume that the binary response \(y_i\) is drawn from a Bernoulli distribution with a probability \(p_i\) that depends on the covariates.
More specifically, if \(A\in\mathbb{R}^{n\times m}\) is the matrix that contains the values of the \(m\) covariates for the \(n\) samples, and \(A_i\) denotes the \(i\)-th row, then
\[p_i = \frac{e^{A_i{\bf \beta}}}{1 + e^{A_i{\bf \beta}}}\]
where \({\beta} = (\beta_1,\ldots,\beta_m)\) is the vector of unknown parameters of the model.
There is a helpful alternative way to stress this relationship:
\[\log\left(\frac{p_i}{1 - p_i}\right) = A_i{\beta}\]
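To see this mapping in action, here is a small sanity check (the design matrix A_brca below is built only for this illustration): recover the fitted probabilities by hand from the estimated coefficients with the inverse-logit function plogis(), and compare them with what glm reports:
# Fitted probabilities recovered by hand from the logistic coefficients
beta_hat <- coef(res)
A_brca <- cbind(1, brca_sub$x.radius_mean, brca_sub$x.texture_mean)  # intercept column + covariates
p_manual <- plogis(drop(A_brca %*% beta_hat))        # exp(A beta) / (1 + exp(A beta))
p_glm <- predict(res, type = "response")
max(abs(p_manual - p_glm))                           # should be essentially zero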
What the function glm does boils down to computing the parameter \(\hat\beta\) that maximizes the following likelihood:
\[ \begin{align} \mathcal{L}(\beta) & = & \prod_{i=1}^n \left\{ y_ip_i + (1 - y_i)(1 - p_i)\right\} \\ & = & \prod_{i=1}^n \left\{ y_i \frac{e^{A_i{\bf \beta}}}{1 + e^{A_i{\bf \beta}}} + (1 - y_i)(1 - \frac{e^{A_i{\bf \beta}}}{1 + e^{A_i{\bf \beta}}})\right\} \end{align} \] or its log version
\[ \begin{align} \log\mathcal{L}(\beta) & = & \sum_{i=1}^n \log\left\{ y_ip_i + (1 - y_i)(1 - p_i)\right\} \\ & = & \sum_{i=1}^n y_i\log p_i + (1 - y_i)\log(1 - p_i). \end{align} \]
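A quick check (purely for reassurance) that plugging the fitted probabilities into this log-likelihood reproduces the value reported by logLik():
# Log-likelihood evaluated at the fitted probabilities
p_hat <- fitted(res)                  # fitted probabilities, same as predict(res, type = "response")
y_num <- as.numeric(brca_sub$y)       # TRUE/FALSE -> 1/0
sum(y_num * log(p_hat) + (1 - y_num) * log(1 - p_hat))
logLik(res)                           # should agree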
To illustrate Poisson regression we will resort to the US Contagious Diseases dataset imported with dslabs. This dataset portrays yearly counts for Hepatitis A, Measles, Mumps, Pertussis, Polio, Rubella, and Smallpox for US states.
library(dslabs)
diseases <- as.data.frame(dslabs::us_contagious_diseases)
diseases <- diseases %>% filter(state == "Arizona") %>% filter(disease == "Measles")
diseases$log10_population <- log10(diseases$population)
We would like to analyze the trends of measles cases reported in Arizona through time but would also like to understand the influence of the population size.
res <- glm(count ~ log10_population, family=poisson, data=diseases)
res$coefficients
## (Intercept) log10_population
## 16.439380 -1.445008
In Poisson regression we assume that the count response variable \(n_i\) is drawn from a Poisson distribution with a mean parameter \(\lambda_i\) that depends on the covariates.
More specifically, if \(A\in\mathbb{R}^{n\times m}\) is the matrix that contains the values of the \(m\) covariates for the \(n\) samples, and \(A_i\) denotes the \(i\)-th row, then
\[\lambda_i = e^{A_i \beta}\]
where \({\beta} = (\beta_1,\ldots,\beta_m)\) is the vector of unknown parameters of the model.
What the function glm does boils down to computing the parameter \(\hat\beta\) that maximizes the following likelihood:
\[\mathcal{L}(\beta) = \prod_{i=1}^n \frac{\lambda^{n_i}_i}{n_i!}e^{-\lambda_i}\] or its log version
\[ \begin{align} \log\mathcal{L}(\beta) &=& \sum_{i=1}^n \left\{-\lambda_i + n_i\log\lambda_i - \log(n_i!)\right\} \\ &=& k + \sum_{i=1}^n \left\{-e^{A_i\beta} + n_i A_i\beta\right\} \end{align} \] where \(k\) is a constant term which does not depend on \(\beta\).
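As before, a quick check: the fitted means \(\hat\lambda_i = e^{A_i\hat\beta}\) can be recovered by hand from the coefficients, and plugging them into the full Poisson log-likelihood (including the \(\log(n_i!)\) term, via dpois()) reproduces the value reported by logLik(); the model frame is used so that any rows dropped by glm are handled consistently:
# Fitted Poisson means recovered by hand, and the corresponding log-likelihood
beta_hat <- coef(res)
mf <- model.frame(res)                                   # rows actually used in the fit
lambda_hat <- exp(beta_hat[1] + beta_hat[2] * mf$log10_population)
max(abs(lambda_hat - fitted(res)))                       # should be essentially zero
sum(dpois(mf$count, lambda_hat, log = TRUE))             # full Poisson log-likelihood
logLik(res)                                              # should agree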
sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 18.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8 LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=es_ES.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Madrid
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dslabs_0.7.6 rootSolve_1.8.2.4 matlib_0.9.6 lubridate_1.9.3 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3 purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.4 tidyverse_2.0.0 knitr_1.44 BiocStyle_2.28.1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 utf8_1.2.3 generics_0.1.3 stringi_1.7.12 hms_1.1.3 digest_0.6.33 magrittr_2.0.3 rgl_1.2.1 evaluate_0.22 grid_4.3.2 timechange_0.2.0 bookdown_0.35 fastmap_1.1.1 jsonlite_1.8.7 BiocManager_1.30.22 fansi_1.0.5 viridisLite_0.4.2 scales_1.2.1 isoband_0.2.7 jquerylib_0.1.4 abind_1.4-5 cli_3.6.1 rlang_1.1.1 munsell_0.5.0 base64enc_0.1-3 withr_2.5.1 cachem_1.0.8 yaml_2.3.7 tools_4.3.2 tzdb_0.4.0 colorspace_2.1-0 vctrs_0.6.4 R6_2.5.1 lifecycle_1.0.3 car_3.1-2 htmlwidgets_1.6.2 MASS_7.3-60 pkgconfig_2.0.3 pillar_1.9.0 bslib_0.5.1 gtable_0.3.4 glue_1.6.2 xfun_0.40 tidyselect_1.2.0 rstudioapi_0.15.0 farver_2.1.1 xtable_1.8-4 htmltools_0.5.6.1 gapminder_1.0.0 labeling_0.4.3 carData_3.0-5 rmarkdown_2.25 compiler_4.3.2