In this hands-on session, we will review the concept of projections onto vector subspaces. Specifically, we aim to find the best approximate solution of a system of linear equations \(Ax = b\) that has no exact solution, derive and apply the least squares formula in R, and use it, together with the tidyverse, to fit a regression line on the gapminder dataset.
This hands-on session comprises some challenging concepts. We believe that all students at your level are able to grasp them with time and dedication, but it is also to be expected if you don't finish this document or don't understand everything in the classroom. You will keep this document to study after the class, and we will be available to answer any questions.
We want to find the best approximation to the solution of a system of linear equations \(Ax = b\) that has no exact solution in the traditional sense. We can approach this problem from two different perspectives, which Gilbert Strang refers to as the "row and column pictures" of a system of equations.
Let us illustrate the problem with the following system of equations:
\[ \begin{cases} x=6\\x+y=0\\x+2y=0 \end{cases} \]
Which can be represented in matrix form as:
\[ Ax = b \]
\[ \begin{bmatrix}1&0 \\1&1 \\1&2 \\\end{bmatrix} \begin{bmatrix}x \\y \\\end{bmatrix} = \begin{bmatrix}6 \\0 \\0 \\\end{bmatrix} \]
Where \(A\) is the coefficient matrix, \(x\) is the vector that contains the unknowns and \(b\) is the vector that contains the constants.
Geometrically, the row picture tells us that a system of linear equations has no solution when the lines specified by the equations do not all meet in a single point.
This is typically the case in systems of linear equations with more equations than unknowns. We can think of these as systems with "tall and thin" matrices (more rows than columns).
The term "row picture" reflects the fact that each row of the system represents an equation, and each linear equation in the plane typically has not just one solution but a whole straight line of them.
Understanding \(Ax = b\) using the column picture is the ultimate goal of linear algebra. We do not think of equations and unknowns anymore. Instead, we represent the equations as a matrix-vector multiplication:
\[ Ax = b \]
We can think of the vector \(b\) as a linear combination of the column vectors of \(A\). The scalars of this linear combination are the elements of the vector \(x\).
Considering our system of linear equations we have:
\[ \begin{bmatrix}1&0 \\1&1 \\1&2 \\\end{bmatrix} \begin{bmatrix}x_1 \\x_2 \\\end{bmatrix} = \begin{bmatrix}6 \\0 \\0 \\\end{bmatrix} \]
\[ x_1 \begin{bmatrix}1 \\1 \\1 \\\end{bmatrix} + x_2 \begin{bmatrix}0 \\1 \\2 \\\end{bmatrix} = \begin{bmatrix}6 \\0 \\0 \\\end{bmatrix} \] *Remember that the column space of \(A\) is the vector space generated by the columns of \(A\) considered as vectors.
Thus, \(Ax = b\) has no exact solution whenever \(b\) is not contained in \(C(A)\). In other words, there are no scalars \(x_1\) and \(x_2\) that specify a linear combination of the columns \(a_1\) and \(a_2\) that produces \(b\).
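To see the column picture in action, here is a small sketch in R (the vector \(x\) below is an arbitrary choice made for illustration): multiplying \(A\) by \(x\) gives exactly the linear combination \(x_1 a_1 + x_2 a_2\) of its columns.
# Coefficient matrix A; its columns a1 and a2 span C(A)
A <- matrix(c(1, 1, 1, 0, 1, 2), nrow = 3, ncol = 2, byrow = FALSE)
x <- c(2, -1)                    # arbitrary scalars x1 and x2
A %*% x                          # matrix-vector product
x[1] * A[, 1] + x[2] * A[, 2]    # the same vector, written as a combination of the columns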
Exercise
To geometrically visualize this, enter the shiny app associated with this tutorial by clicking the following link or copying it into your browser:
https://massonix.shinyapps.io/projections_onto_subspaces/
Then, in the left hand side of the app enter the matrix \(A\) and the vector \(b\). The app will render a visualization of \(C(A)\) as a plane in \(\mathbb{R}^3\), and \(b\) as a vector in \(\mathbb{R}^3\), which need not be contained in the plane.
As we have seen above, in our system of linear equations there is in general no vector \(x\) that provides an exact solution of \(Ax = b\). However, we can provide "the best solution we can" by finding \(\widehat{x}\) such that the vector \(p = \widehat{x}_1 a_1 + \cdots + \widehat{x}_n a_n\) is as close as possible to the vector \(b\).
To make the "as close as possible" more precise, we will consider the closest vector to \(b\) that belongs to the column space \(C(A)\) by moving perpendicularly with respect to \(C(A)\), i.e., \(p=p(b)\) will be the vector obtained with an orthogonal projection of \(b\) onto \(C(A)\). You can visualize this in the shiny app.
As \(p(b)\) is in \(C(A)\), we can find a vector \(\widehat{x}\) that solves the new system:
\[ A\widehat{x} = p(b) \]
We will first find \(\widehat{x}\), and then we will calculate \(p(b)\), that is the projection of \(b\) onto the column space \(C(A)\).
We can also calculate the projection matrix \(P\) that projects any vector onto the column space \(C(A)\).
We have seen that, if \(p(b)\) is the projection of \(b\) onto the column space \(C(A)\), then \(p(b)-b\) belongs to the orthogonal complement of \(C(A)\), which is the same as the null space of the transpose of \(A\), i.e., \(N(A^t)\) (see Chapter 5 of the course notes).
\[ p(b) - b = A\widehat{x}-b \hspace{0.1cm} \textrm{ belongs to } C(A)^\perp = N(A^t) \] means that:
\[ A^t(A\widehat{x}-b)=\vec{0} \]
Let us distribute \(A^t\):
\[ A^t A\widehat{x}-A^tb = 0 \hspace{0.1cm} \Longrightarrow \hspace{0.1cm} A^t A\widehat{x} = A^tb \]
We know that the symmetric matrix \(A^tA\) is invertible if the columns of \(A\) are linearly independent. If that's the case, we can multiply both sides by \((A^tA)^{-1}\) and isolate \(\widehat{x}\):
\[ (A^t A)^{-1}( A^t A)\widehat{x} = (A^t A)^{-1} A^tb \] \[ \textrm{Id }\widehat{x} = (A^t A)^{-1} A^tb \]
\[ \widehat{x} = (A^t A)^{-1} A^tb \]
Let us apply this formula to the example above. To make computations easier, we will use the operations in R that we learned in the previous hands-on session. First, we compute \(A^tA\):
A <- matrix(c(1, 1, 1, 0, 1, 2), nrow = 3, ncol = 2, byrow = FALSE)
AT_A <- t(A) %*% A
AT_A
## [,1] [,2]
## [1,] 3 3
## [2,] 3 5
\[ A^tA = \begin{bmatrix}1&1&1 \\0&1&2 \\\end{bmatrix} \begin{bmatrix}1&0 \\1&1 \\1&2 \\\end{bmatrix} = \begin{bmatrix}3&3 \\3&5 \\\end{bmatrix} \]
Now we find the inverse \((A^t A)^{-1}\):
library(matlib)
library(MASS)
AT_A_inv <- inv(AT_A)
AT_A_inv <- fractions(AT_A_inv)
AT_A_inv
## [,1] [,2]
## [1,] 5/6 -1/2
## [2,] -1/2 1/2
\[ (A^t A)^{-1} = \begin{bmatrix} 5/6 & -1/2\\ -1/2 & 1/2 \end{bmatrix} \]
Finally, we can solve for \(\widehat{x} = (A^t A)^{-1} A^tb\):
b <- matrix(c(6, 0, 0), nrow = 3)
x_hat <- AT_A_inv %*% t(A) %*% b
x_hat
## [,1]
## [1,] 5
## [2,] -3
\[ \widehat{x} = (A^t A)^{-1} A^tb = \begin{bmatrix} 5/6 & -1/2\\ -1/2 & 1/2 \end{bmatrix} \begin{bmatrix}1&1&1 \\0&1&2 \\\end{bmatrix} \begin{bmatrix}6 \\0 \\0 \\\end{bmatrix} = \begin{bmatrix} 5 \\ -3 \end{bmatrix} \]
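As a side note (an alternative not used in this document), for larger problems it is preferable to solve the normal equations \(A^tA\widehat{x} = A^tb\) directly rather than forming the inverse explicitly; base R's solve() does exactly that:
# Solve A^t A x_hat = A^t b without computing (A^t A)^{-1}
x_hat_alt <- solve(t(A) %*% A, t(A) %*% b)
x_hat_alt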
Let us find \(p(b)\), which is the projection of \(b\) onto \(C(A)\):
\[ p(b) = A \widehat{x} = A (A^t A)^{-1} A^tb \]
p <- A %*% x_hat
p
## [,1]
## [1,] 5
## [2,] 2
## [3,] -1
\[ p(b) = \begin{bmatrix}1&0 \\1&1 \\1&2 \\\end{bmatrix} \begin{bmatrix}5 \\-3 \\\end{bmatrix} = \begin{bmatrix}5 \\2 \\-1 \\\end{bmatrix} \]
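As a quick numerical check of the derivation above, the error vector \(p(b) - b\) should belong to \(N(A^t)\), i.e. \(A^t(p - b) = \vec{0}\):
# The residual p - b is orthogonal to both columns of A
t(A) %*% (p - b)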
Now we can pursue one more step to find the matrix \(P\) encoding the orthogonal projection onto the column space \(C(A)\), which describes a linear transformation that takes any vector and projects it onto \(C(A)\).
If we look at what happened with \(b\), it is easy to spot the matrix that did the job: \(P = A (A^t A)^{-1} A^t\).
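Here is a short sketch (using the \(A\), \(b\) and \(p\) computed above, with base R's solve() instead of matlib::inv()) that builds \(P\) explicitly, confirms that it sends \(b\) to \(p\), and numerically checks the two properties listed next:
# Projection matrix onto C(A)
P <- A %*% solve(t(A) %*% A) %*% t(A)
P %*% b               # same as p = (5, 2, -1)
all.equal(t(P), P)    # the transpose of P equals P
all.equal(P %*% P, P) # applying P twice equals applying it once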
Projection matrices have two interesting properties:
\[ P^t = P \]
\[ P^2 = P \]
Can you guess why?
Hint: write \(P = A(A^tA)^{-1}A^t\) and compute \(P^t\) and \(P^2\) directly from this expression.
In this second part of the session, we move our focus from matrices to dataframes. Remember that dataframes are two-dimensional (tabular) and heterogeneous data structures (they can handle numerics, characters and logicals together). Intuitively, you can think of them as spreadsheets.
Here, we will introduce the tidyverse. As defined by the authors: "The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures".
You can learn more about the tidyverse in this video and this link. There is an entire book on how to use this ecosystem, which at its core aims to simplify the data analysis life cycle:
(Image obtained from this link)
In particular, we will focus on two crucial steps, transform and visualize, which are accomplished with the dplyr and ggplot2 packages, respectively.
To introduce these packages, we will work with the iconic gapminder dataset, which contains information on population size, per-capita gross domestic product (GDP) and life expectancy for each country across several years.
Let us start by loading it into memory:
library(tidyverse)
gapminder <- as.data.frame(gapminder::gapminder)
head(gapminder)
## country continent year lifeExp pop gdpPercap
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
## 2 Afghanistan Asia 1957 30.332 9240934 820.8530
## 3 Afghanistan Asia 1962 31.997 10267083 853.1007
## 4 Afghanistan Asia 1967 34.020 11537966 836.1971
## 5 Afghanistan Asia 1972 36.088 13079460 739.9811
## 6 Afghanistan Asia 1977 38.438 14880372 786.1134
tail(gapminder)
## country continent year lifeExp pop gdpPercap
## 1699 Zimbabwe Africa 1982 60.363 7636524 788.8550
## 1700 Zimbabwe Africa 1987 62.351 9216418 706.1573
## 1701 Zimbabwe Africa 1992 60.377 10704340 693.4208
## 1702 Zimbabwe Africa 1997 46.809 11404948 792.4500
## 1703 Zimbabwe Africa 2002 39.989 11926563 672.0386
## 1704 Zimbabwe Africa 2007 43.487 12311143 469.7093
dplyr is centered around five verbs:

filter(): get rows that fulfill a specific condition.
select(): select specific columns.
arrange(): sort rows in a certain order.
mutate(): create new variables from existing ones.
summarise(): collapse many values down to a single one.

All these functions can be used together with the group_by() function to apply the verbs grouped by a categorical variable.

In addition, dplyr aims to maximize code readability. To this end, it uses the pipe operator (%>%) from the magrittr package to create so-called pipelines.
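As a minimal illustration (a made-up snippet, not from the original text), x %>% f(y) is just another way of writing f(x, y), so a pipeline reads left to right instead of inside out:
# The pipe passes its left-hand side as the first argument of the call on its right
identical(
  gapminder %>% filter(year == 2002),
  filter(gapminder, year == 2002)
)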
(Image obtained from this link)
You can learn more about dplyr in this link.
Example pipeline:
"Starting with the gapminder dataset, filter all rows from 2002, and then select the continent and population. With that, group the rows by continent and summarise the dataset to the mean population for each continent. Finally, arrange the results in descending order".
pop_by_continent <- gapminder %>%
filter(year == 2002) %>%
select(continent, pop) %>%
group_by(continent) %>%
summarise(mean_pop = mean(pop)) %>%
arrange(desc(mean_pop))
pop_by_continent
## # A tibble: 5 × 2
## continent mean_pop
## <fct> <dbl>
## 1 Asia 109145521.
## 2 Americas 33990910.
## 3 Europe 19274129.
## 4 Africa 16033152.
## 5 Oceania 11727414.
As we can see, dplyr makes it easy to translate a thought into clear code.
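For contrast, here is a base R sketch (not part of the original text) that computes the same summary without dplyr; it works, but the intent is harder to read at a glance:
# Base R equivalent of the dplyr pipeline above
gap_2002 <- gapminder[gapminder$year == 2002, c("continent", "pop")]
means <- aggregate(pop ~ continent, data = gap_2002, FUN = mean)
means[order(means$pop, decreasing = TRUE), ]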
The input to ggplot2 is a tidy dataset, which is defined by the following properties: each variable forms a column, each observation forms a row, and each value forms its own cell.
The underlying philosophy of ggplot2 is the grammar of graphics. This grammar entails that plots are formed with several components, whose relationships are structured in a layered hierarchy.
This hierarchy is better visualized as a pyramid:
(Image obtained from this link)
The important concept is the layered hierarchy, which means that a plot is built incrementally: each component is added as a layer on top of the previous ones.
The main layers, in order, are the following: data, aesthetics, geometries, facets, statistics, coordinates and theme.
As an example, using the same dataset as in the previous example, we can plot the relationship between life expectancy and per-capita GDP for each country using the geom_point() and geom_smooth() functions:
gapminder_2002 <- gapminder %>%
filter(year == 2002)
gapminder_2002 %>%
ggplot(aes(x = gdpPercap, y = lifeExp, color = continent, size = pop)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE, color = "black") +
scale_x_log10() +
labs(x = "Income", y = "Life Expectancy") +
theme_classic()
Exercise:
Use the tools that we have seen until now to find, for the year 2002, the least squares regression line that predicts life expectancy from the (log-transformed) per-capita GDP, and to recreate the previous plot with that line drawn on top.
Hereafter, we will develop the reasoning and solution for this exercise. You are all highly encouraged to try the problem on your own and then compare the results. Otherwise, you can follow our explanation, and try to code each chunk by yourself before looking at ours.
Solution
First, we will inspect the structure of the dataframe. Then, we will subset it to keep only the variables of interest (country, life expectancy and income) and visualize the first 3 rows with the head() function:
# Visualize the structure of the dataframe
str(gapminder::gapminder)
## tibble [1,704 Ć 6] (S3: tbl_df/tbl/data.frame)
## $ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ year : int [1:1704] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
## $ lifeExp : num [1:1704] 28.8 30.3 32 34 36.1 ...
## $ pop : int [1:1704] 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num [1:1704] 779 821 853 836 740 ...
# Character subsetting
gapminder_sub <- gapminder_2002[, c("country", "lifeExp", "gdpPercap")]
head(gapminder_sub, 3)
## country lifeExp gdpPercap
## 1 Afghanistan 42.129 726.7341
## 2 Albania 75.651 4604.2117
## 3 Algeria 70.994 5288.0404
Note that gdpPercap in the previous scatter plot is on a logarithmic scale. Let us log-transform it:
gapminder_sub$log10_gdpPercap <- log10(gapminder_sub$gdpPercap)
head(gapminder_sub, 3)
## country lifeExp gdpPercap log10_gdpPercap
## 1 Afghanistan 42.129 726.7341 2.861376
## 2 Albania 75.651 4604.2117 3.663155
## 3 Algeria 70.994 5288.0404 3.723295
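Equivalently, the new column could be added with the mutate() verb introduced earlier (an alternative to the base R assignment above, producing the same result):
# dplyr equivalent of the base R column assignment
gapminder_sub <- gapminder_sub %>%
  mutate(log10_gdpPercap = log10(gdpPercap))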
Our purpose is to fit a line in the form of
\[ \textrm{lifeExp} = m \cdot \log_{10}(\textrm{gdpPercap}) + n, \]
where lifeExp (life expectancy) is the dependent (or response) variable, gdpPercap (per-capita Gross domestic product) is the independent (or explanatory) variable, \(m\) is the slope and \(n\) is the \(y\)-intercept.
Hence, if we represent each of the rows of the previous head() output in this form, we would get the following toy system of linear equations:
\[ \begin{cases} 2.86m+n=42.13\\3.66m+n=75.65\\3.72m+n=70.99 \end{cases} \]
which, from the column perspective, we would express as \(Ax = b\):
\[ \begin{bmatrix}2.86&1 \\3.66&1 \\3.72&1 \\\end{bmatrix} \begin{bmatrix}m \\n \\\end{bmatrix} = \begin{bmatrix}42.13 \\75.65 \\70.99 \\\end{bmatrix} \]
In our real problem, we have \(142\) equations for only two unknowns (slope and \(y\)-intercept). Thus, the coefficient matrix is \(142 \times 2\). As you will have guessed, to find the least squares regression line we will need to use orthogonal projections to find \(m\) and \(n\).
Challenge:
To get a sense of how \(b\), \(p\) and \(p-b\) translate in the context of a least squares problem try the shiny app.
To solve for \(m\) and \(n\), we will first create the matrix \(A\) and the vector \(b\). Note that the first column of \(A\) contains the log-transformed gdpPercap values, the second column is all ones, and the vector \(b\) contains the lifeExp values:
num_rows <- nrow(gapminder_sub)
A <- matrix(
c(gapminder_sub$log10_gdpPercap, rep(1, num_rows)),
nrow = num_rows,
ncol = 2,
byrow = FALSE
)
head(A)
## [,1] [,2]
## [1,] 2.861376 1
## [2,] 3.663155 1
## [3,] 3.723295 1
## [4,] 3.442995 1
## [5,] 3.944366 1
## [6,] 4.486965 1
b <- as.matrix(gapminder_sub$lifeExp)
head(b)
## [,1]
## [1,] 42.129
## [2,] 75.651
## [3,] 70.994
## [4,] 41.003
## [5,] 74.340
## [6,] 80.370
We can solve the problem with the following equation (seen above):
\[ \widehat{x} = (A^t A)^{-1} A^tb \]
AT_A <- t(A) %*% A
AT_A_inv <- inv(AT_A)
x_hat <- AT_A_inv %*% t(A) %*% b
m <- x_hat[1, 1]
n <- x_hat[2, 1]
out <- str_c("The slope is ", round(m, 4), " and the y-intercept is ", round(n, 4))
print(out)
## [1] "The slope is 17.5394 and the y-intercept is 1.2807"
Finally, we can recreate the plot. This time, you should manually provide the \(m\) and \(n\) that you calculated yourselves.
gapminder_2002 %>%
ggplot(aes(gdpPercap, lifeExp, color = continent, size = pop)) +
geom_point() +
geom_abline(slope = m, intercept = n, color = "black") +
scale_x_log10() +
labs(x = "Income", y = "Life Expectancy") +
theme_classic()
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=es_ES.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=es_ES.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=es_ES.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=es_ES.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Madrid
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.0 dplyr_1.1.3
## [5] purrr_1.0.2 readr_2.1.4 tidyr_1.3.0 tibble_3.2.1
## [9] tidyverse_2.0.0 MASS_7.3-60 matlib_0.9.6 ggplot2_3.4.4
## [13] BiocStyle_2.28.1
##
## loaded via a namespace (and not attached):
## [1] sass_0.4.7 utf8_1.2.3 generics_0.1.3
## [4] lattice_0.21-9 stringi_1.7.12 hms_1.1.3
## [7] digest_0.6.33 magrittr_2.0.3 rgl_1.2.1
## [10] timechange_0.2.0 evaluate_0.22 grid_4.3.1
## [13] bookdown_0.35 fastmap_1.1.1 Matrix_1.6-1.1
## [16] jsonlite_1.8.7 BiocManager_1.30.22 mgcv_1.9-0
## [19] fansi_1.0.5 scales_1.2.1 jquerylib_0.1.4
## [22] abind_1.4-5 cli_3.6.1 rlang_1.1.1
## [25] splines_4.3.1 munsell_0.5.0 base64enc_0.1-3
## [28] withr_2.5.1 cachem_1.0.8 yaml_2.3.7
## [31] tools_4.3.1 tzdb_0.4.0 colorspace_2.1-0
## [34] vctrs_0.6.4 R6_2.5.1 lifecycle_1.0.3
## [37] car_3.1-2 htmlwidgets_1.6.2 pkgconfig_2.0.3
## [40] pillar_1.9.0 bslib_0.5.1 gtable_0.3.4
## [43] glue_1.6.2 xfun_0.40 tidyselect_1.2.0
## [46] rstudioapi_0.15.0 knitr_1.44 farver_2.1.1
## [49] xtable_1.8-4 nlme_3.1-163 gapminder_1.0.0
## [52] htmltools_0.5.6.1 labeling_0.4.3 rmarkdown_2.25
## [55] carData_3.0-5 compiler_4.3.1