9. Orthogonal Projections and Their Applications#
9.1. Overview#
Orthogonal projection is a cornerstone of vector space methods, with many diverse applications.
These include
Least squares projection, also known as linear regression
Conditional expectations for multivariate normal (Gaussian) distributions
Gram–Schmidt orthogonalization
QR decomposition
Orthogonal polynomials
etc
In this lecture, we focus on
key ideas
least squares regression
We’ll require the following imports:
import numpy as np
from scipy.linalg import qr
9.1.1. Further Reading#
For background and foundational concepts, see our lecture on linear algebra.
For more proofs and greater theoretical detail, see A Primer in Econometric Theory.
For a complete set of proofs in a general setting, see, for example, [Roman, 2005].
For an advanced treatment of projection in the context of least squares prediction, see this book chapter.
9.2. Key Definitions#
Assume $x, z \in \mathbb{R}^n$.
Define $\langle x, z \rangle = \sum_i x_i z_i$.
Recall $\| x \|^2 = \langle x, x \rangle$.
The law of cosines states that $\langle x, z \rangle = \| x \| \| z \| \cos(\theta)$, where $\theta$ is the angle between the vectors $x$ and $z$.
When $\langle x, z \rangle = 0$, then $\cos(\theta) = 0$ and $x$ and $z$ are said to be orthogonal, and we write $x \perp z$.

For a linear subspace $S \subset \mathbb{R}^n$, we call $x \in \mathbb{R}^n$ orthogonal to $S$ if $x \perp z$ for all $z \in S$, and write $x \perp S$.

The orthogonal complement of linear subspace $S \subset \mathbb{R}^n$ is the set $S^{\perp} := \{ x \in \mathbb{R}^n \,:\, x \perp S \}$.

$S^{\perp}$ is itself a linear subspace of $\mathbb{R}^n$.
To see this, fix $x, y \in S^{\perp}$ and $\alpha, \beta \in \mathbb{R}$. Observe that if $z \in S$, then

$$
\langle \alpha x + \beta y, z \rangle
= \alpha \langle x, z \rangle + \beta \langle y, z \rangle
= \alpha \times 0 + \beta \times 0
= 0
$$

Hence $\alpha x + \beta y \in S^{\perp}$, as was to be shown.

A set of vectors $\{ x_1, \ldots, x_k \} \subset \mathbb{R}^n$ is called an orthogonal set if $x_i \perp x_j$ whenever $i \neq j$.
If $\{ x_1, \ldots, x_k \}$ is an orthogonal set, then the Pythagorean Law states that

$$
\| x_1 + \cdots + x_k \|^2 = \| x_1 \|^2 + \cdots + \| x_k \|^2
$$

For example, when $k = 2$, $x_1 \perp x_2$ implies

$$
\| x_1 + x_2 \|^2
= \langle x_1 + x_2, x_1 + x_2 \rangle
= \langle x_1, x_1 \rangle + 2 \langle x_1, x_2 \rangle + \langle x_2, x_2 \rangle
= \| x_1 \|^2 + \| x_2 \|^2
$$
9.2.1. Linear Independence vs Orthogonality#
If $X \subset \mathbb{R}^n$ is an orthogonal set and $0 \notin X$, then $X$ is linearly independent.
Proving this is a nice exercise.
While the converse is not true, a kind of partial converse holds, as we’ll see below.
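Before moving on, here is a quick numerical illustration of the definitions above. This is a minimal sketch; the two vectors are arbitrary choices made here for illustration.

import numpy as np

# Two orthogonal vectors in R^3 (arbitrary illustrative choice)
x1 = np.array([1.0, 2.0, 0.0])
x2 = np.array([-2.0, 1.0, 3.0])

print(np.dot(x1, x2))         # inner product is 0, so x1 and x2 are orthogonal

# Pythagorean Law: ||x1 + x2||^2 = ||x1||^2 + ||x2||^2
lhs = np.sum((x1 + x2)**2)
rhs = np.sum(x1**2) + np.sum(x2**2)
print(np.allclose(lhs, rhs))  # True

# A nonzero orthogonal set is linearly independent: full column rank
print(np.linalg.matrix_rank(np.column_stack((x1, x2))))  # 2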
9.3. The Orthogonal Projection Theorem#
What vector within a linear subspace of $\mathbb{R}^n$ best approximates a given vector in $\mathbb{R}^n$?
The next theorem answers this question.
Theorem (OPT) Given $y \in \mathbb{R}^n$ and linear subspace $S \subset \mathbb{R}^n$, there exists a unique solution to the minimization problem

$$
\hat y := \operatorname*{argmin}_{z \in S} \| y - z \|
$$

The minimizer $\hat y$ is the unique vector in $\mathbb{R}^n$ that satisfies
$\hat y \in S$
$y - \hat y \perp S$
The vector $\hat y$ is called the orthogonal projection of $y$ onto $S$.
The next figure provides some intuition

9.3.1. Proof of Sufficiency#
We’ll omit the full proof.
But we will prove sufficiency of the asserted conditions.
To this end, let $y \in \mathbb{R}^n$ and let $S$ be a linear subspace of $\mathbb{R}^n$.
Let $\hat y$ be a vector in $\mathbb{R}^n$ such that $\hat y \in S$ and $y - \hat y \perp S$.
Let $z$ be any other point in $S$ and use the fact that $S$ is a subspace to deduce

$$
\| y - z \|^2
= \| (y - \hat y) + (\hat y - z) \|^2
= \| y - \hat y \|^2 + \| \hat y - z \|^2
$$

(The cross term vanishes because $\hat y - z \in S$ while $y - \hat y \perp S$.)
Hence $\| y - z \| \geq \| y - \hat y \|$, which completes the proof.
9.3.2. Orthogonal Projection as a Mapping#
For a linear space $Y$ and a fixed linear subspace $S$, we have a functional relationship

$$
y \in Y \;\mapsto\; \text{its orthogonal projection } \hat y \in S
$$

By the OPT, this is a well-defined mapping or operator from $\mathbb{R}^n$ to $\mathbb{R}^n$.
In what follows we denote this operator by a matrix $P$.
$P y$ represents the projection $\hat y$.
This is sometimes expressed as $\hat E_S y = P y$, where $\hat E$ denotes a wide-sense expectations operator and the subscript $S$ indicates that we are projecting $y$ onto the linear subspace $S$.
The operator $P$ is called the orthogonal projection mapping onto $S$.

It is immediate from the OPT that for any $y \in \mathbb{R}^n$

1. $P y \in S$ and
2. $y - P y \perp S$

From this, we can deduce additional useful properties, such as

1. $\| y \|^2 = \| P y \|^2 + \| y - P y \|^2$ and
2. $\| P y \| \leq \| y \|$

For example, to prove 1, observe that $y = P y + (y - P y)$ and apply the Pythagorean law.
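For a concrete check of these two properties, here is a small sketch using a rank-one projector onto the span of a single unit vector (the vectors below are arbitrary illustrative choices):

import numpy as np

u = np.array([1.0, 2.0, 2.0])
u = u / np.linalg.norm(u)          # unit vector spanning a one-dimensional subspace S
P = np.outer(u, u)                 # orthogonal projection onto S

y = np.array([3.0, -1.0, 2.0])
Py = P @ y

# Property 1: ||y||^2 = ||Py||^2 + ||y - Py||^2
print(np.allclose(np.sum(y**2), np.sum(Py**2) + np.sum((y - Py)**2)))  # True

# Property 2: ||Py|| <= ||y||
print(np.linalg.norm(Py) <= np.linalg.norm(y))                          # True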
9.3.2.1. Orthogonal Complement#
Let $S \subset \mathbb{R}^n$.
The orthogonal complement of $S$ is the linear subspace $S^{\perp}$ that satisfies $x_1 \perp x_2$ for every $x_1 \in S$ and $x_2 \in S^{\perp}$.
Let $Y$ be a linear space with linear subspace $S$ and its orthogonal complement $S^{\perp}$.
We write

$$
Y = S \oplus S^{\perp}
$$

to indicate that for every $y \in Y$ there is a unique $x_1 \in S$ and a unique $x_2 \in S^{\perp}$ such that $y = x_1 + x_2$.
Moreover, $x_1 = \hat E_S y$ and $x_2 = y - \hat E_S y$.
This amounts to another version of the OPT:
Theorem. If $S$ is a linear subspace of $\mathbb{R}^n$, $\hat E_S y = P y$ and $\hat E_{S^{\perp}} y = M y$, then

$$
P y \perp M y
\quad \text{and} \quad
y = P y + M y
\quad \text{for all } y \in \mathbb{R}^n
$$
The next figure illustrates

9.4. Orthonormal Basis#
An orthogonal set of vectors $O \subset \mathbb{R}^n$ is called an orthonormal set if $\| u \| = 1$ for all $u \in O$.
Let $S$ be a linear subspace of $\mathbb{R}^n$ and let $O \subset S$.
If $O$ is orthonormal and $\operatorname{span} O = S$, then $O$ is called an orthonormal basis of $S$. ($O$ is necessarily a basis of $S$, since an orthogonal set that excludes the zero vector is linearly independent.)
One example of an orthonormal set is the canonical basis $\{ e_1, \ldots, e_n \}$, which forms an orthonormal basis of $\mathbb{R}^n$, where $e_i$ is the $i$-th unit vector.
If $\{ u_1, \ldots, u_k \}$ is an orthonormal basis of linear subspace $S$, then

$$
x = \sum_{i=1}^k \langle x, u_i \rangle u_i
\quad \text{for all} \quad
x \in S
$$

To see this, observe that since $x \in \operatorname{span}\{ u_1, \ldots, u_k \}$, we can find scalars $\alpha_1, \ldots, \alpha_k$ that verify

$$
x = \sum_{j=1}^k \alpha_j u_j
\qquad (9.1)
$$

Taking the inner product with respect to $u_i$ gives

$$
\langle x, u_i \rangle
= \sum_{j=1}^k \alpha_j \langle u_j, u_i \rangle
= \alpha_i
$$
Combining this result with (9.1) verifies the claim.
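Here is a small numerical check of this expansion, a sketch using an orthonormal pair and a vector in their span (all chosen here for illustration):

import numpy as np

# An orthonormal pair in R^3 spanning a two-dimensional subspace S
u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)

x = 2.0 * u1 - 3.0 * u2            # a point in S

# Reconstruct x from its coordinates <x, u_i> in the orthonormal basis
x_rebuilt = np.dot(x, u1) * u1 + np.dot(x, u2) * u2
print(np.allclose(x, x_rebuilt))   # True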
9.4.1. Projection onto an Orthonormal Basis#
When a subspace onto which we project is orthonormal, computing the projection simplifies:
Theorem If $\{ u_1, \ldots, u_k \}$ is an orthonormal basis for $S$, then

$$
P y = \sum_{i=1}^k \langle y, u_i \rangle u_i,
\quad \forall \; y \in \mathbb{R}^n
$$

Proof: Fix $y \in \mathbb{R}^n$ and let $P y$ be defined as above.
Clearly, $P y \in S$.
We claim that $y - P y \perp S$ also holds.
It suffices to show that $y - P y$ is orthogonal to every basis vector $u_j$.
This is true because

$$
\left\langle y - \sum_{i=1}^k \langle y, u_i \rangle u_i, \, u_j \right\rangle
= \langle y, u_j \rangle - \sum_{i=1}^k \langle y, u_i \rangle \langle u_i, u_j \rangle
= 0
$$

(Why is this sufficient to establish the claim that $y - P y \perp S$?)
9.5. Projection Via Matrix Algebra#
Let $S$ be a linear subspace of $\mathbb{R}^n$ and let $y \in \mathbb{R}^n$.
We want to compute the matrix $P$ that verifies $\hat E_S y = P y$.
Evidently $P y$ is a linear function from $y \in \mathbb{R}^n$ to $P y \in \mathbb{R}^n$.
This reference is useful.
Theorem. Let the columns of the $n \times k$ matrix $X$ form a basis of $S$. Then

$$
P = X (X' X)^{-1} X'
$$

Proof: Given arbitrary $y \in \mathbb{R}^n$ and $P = X (X' X)^{-1} X'$, our claim is that

1. $P y \in S$, and
2. $y - P y \perp S$

Claim 1 is true because

$$
P y = X (X' X)^{-1} X' y = X a
\quad \text{when} \quad
a := (X' X)^{-1} X' y
$$

An expression of the form $X a$ is precisely a linear combination of the columns of $X$ and hence an element of $S$.
Claim 2 is equivalent to the statement

$$
y - X (X' X)^{-1} X' y \, \perp \, X b
\quad \text{for all} \quad
b \in \mathbb{R}^k
$$

To verify this, notice that if $b \in \mathbb{R}^k$, then

$$
(X b)' [y - X (X' X)^{-1} X' y]
= b' [X' y - X' y]
= 0
$$
The proof is now complete.
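Both claims in the proof are easy to confirm numerically. The following minimal sketch uses the same $X$ and $y$ as the exercises at the end of the lecture:

import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, -6.0],
              [2.0, 2.0]])
y = np.array([1.0, 3.0, -3.0])

P = X @ np.linalg.inv(X.T @ X) @ X.T
Py = P @ y

# Claim 1: Py lies in span(X), i.e. Py = Xa with a = (X'X)^{-1} X'y
a = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(Py, X @ a))           # True

# Claim 2: y - Py is orthogonal to every column of X
print(np.allclose(X.T @ (y - Py), 0))   # True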
9.5.1. Starting with the Basis#
It is common in applications to start with an $n \times k$ matrix $X$ with linearly independent columns and let

$$
S := \operatorname{span} X := \operatorname{span} \{ \operatorname{col}_1 X, \ldots, \operatorname{col}_k X \}
$$

Then the columns of $X$ form a basis of $S$.
From the preceding theorem, $P y = X (X' X)^{-1} X' y$ projects $y$ onto $S$.
In this context, $P$ is often called the projection matrix.
The matrix $M := I - P$ satisfies $M y = \hat E_{S^{\perp}} y$ and is sometimes called the annihilator matrix.
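Continuing the sketch above, the annihilator maps every element of $\operatorname{span} X$ to the zero vector:

import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, -6.0],
              [2.0, 2.0]])

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(3) - P

print(np.allclose(M @ X, 0))           # M annihilates the columns of X
print(np.allclose(P + M, np.eye(3)))   # P and M sum to the identity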
9.5.2. The Orthonormal Case#
Suppose that $U$ is an $n \times k$ matrix with orthonormal columns.
Let $u_i := \operatorname{col}_i U$ for each $i$, let $S := \operatorname{span} U$ and let $y \in \mathbb{R}^n$.
We know that the projection of $y$ onto $S$ is $P y = U (U' U)^{-1} U' y$.
Since $U$ has orthonormal columns, we have $U' U = I$.
Hence

$$
P y = U U' y = \sum_{i=1}^k \langle u_i, y \rangle u_i
$$
We have recovered our earlier result about projecting onto the span of an orthonormal basis.
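A quick numerical confirmation, using an orthonormal matrix built here for illustration:

import numpy as np

# A 3 x 2 matrix U with orthonormal columns
U = np.column_stack((np.array([1.0, 0.0, 0.0]),
                     np.array([0.0, 1.0, 1.0]) / np.sqrt(2)))
y = np.array([1.0, 2.0, 3.0])

print(np.allclose(U.T @ U, np.eye(2)))   # U'U = I

# The general formula and the orthonormal shortcut agree
P_general = U @ np.linalg.inv(U.T @ U) @ U.T
print(np.allclose(P_general @ y, U @ U.T @ y))  # True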
9.5.3. Application: Overdetermined Systems of Equations#
Let $y \in \mathbb{R}^n$ and let $X$ be an $n \times k$ matrix with linearly independent columns.
Given $X$ and $y$, we seek $b \in \mathbb{R}^k$ that satisfies the system of linear equations $X b = y$.
If $n > k$ (more equations than unknowns), then the system is said to be overdetermined.
Intuitively, we may not be able to find a $b$ that satisfies all $n$ equations.
The best approach here is to
Accept that an exact solution may not exist.
Look instead for an approximate solution.
By approximate solution, we mean a $b \in \mathbb{R}^k$ such that $X b$ is as close to $y$ as possible.
The next theorem shows that a best approximation is well defined and unique.
The proof uses the OPT.
Theorem The unique minimizer of $\| y - X b \|$ over $b \in \mathbb{R}^k$ is

$$
\hat \beta := (X' X)^{-1} X' y
$$

Proof: Note that

$$
X \hat \beta = X (X' X)^{-1} X' y = P y
$$

Since $P y$ is the orthogonal projection of $y$ onto $\operatorname{span}(X)$, we have

$$
\| y - P y \| \leq \| y - z \|
\quad \text{for any } z \in \operatorname{span}(X)
$$

Because $X b \in \operatorname{span}(X)$ for any $b \in \mathbb{R}^k$, this means

$$
\| y - X \hat \beta \| \leq \| y - X b \|
\quad \text{for any } b \in \mathbb{R}^k
$$
This is what we aimed to show.
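In practice, $\hat \beta$ is usually computed with a dedicated least squares routine rather than by forming $(X' X)^{-1}$ explicitly. Here is a brief sketch comparing the two, with $X$ and $y$ chosen for illustration:

import numpy as np

X = np.array([[1.0, 0.0],
              [0.0, -6.0],
              [2.0, 2.0]])
y = np.array([1.0, 3.0, -3.0])

# Closed-form solution from the theorem
beta_formula = np.linalg.solve(X.T @ X, X.T @ y)

# Numerically preferred routine
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_formula, beta_lstsq))  # True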
9.6. Least Squares Regression#
Let’s apply the theory of orthogonal projection to least squares regression.
This approach provides insights about many geometric properties of linear regression.
We treat only some examples.
9.6.1. Squared Risk Measures#
Given pairs $(x, y) \in \mathbb{R}^K \times \mathbb{R}$, consider choosing $f \colon \mathbb{R}^K \to \mathbb{R}$ to minimize the risk

$$
R(f) := \mathbb{E} \, [(y - f(x))^2]
$$

If probabilities, and hence $\mathbb{E}$, are unknown, then this problem cannot be solved directly.
However, if a sample is available, we can estimate the risk with the empirical risk:

$$
\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{n=1}^N (y_n - f(x_n))^2
$$
Minimizing this expression is called empirical risk minimization.
The set $\mathcal{F}$ is sometimes called the hypothesis space.
The theory of statistical learning tells us that to prevent overfitting we should take the set $\mathcal{F}$ to be relatively simple.
If we let $\mathcal{F}$ be the class of linear functions, the problem becomes

$$
\min_{b \in \mathbb{R}^K} \; \sum_{n=1}^N (y_n - b' x_n)^2
$$
This is the sample linear least squares problem.
9.6.2. Solution#
Define the matrices

$$
y :=
\begin{pmatrix}
    y_1 \\
    y_2 \\
    \vdots \\
    y_N
\end{pmatrix},
\quad
x_n :=
\begin{pmatrix}
    x_{n1} \\
    x_{n2} \\
    \vdots \\
    x_{nK}
\end{pmatrix}
\quad (n\text{-th observation on all regressors})
$$

and

$$
X :=
\begin{pmatrix}
    x_1' \\
    x_2' \\
    \vdots \\
    x_N'
\end{pmatrix}
$$

We assume throughout that $N > K$ and $X$ is full column rank.
If you work through the algebra, you will be able to verify that

$$
\| y - X b \|^2 = \sum_{n=1}^N (y_n - b' x_n)^2
$$

Since monotone transforms don't affect minimizers, we have

$$
\operatorname*{argmin}_{b \in \mathbb{R}^K} \sum_{n=1}^N (y_n - b' x_n)^2
= \operatorname*{argmin}_{b \in \mathbb{R}^K} \| y - X b \|
$$

By our results about overdetermined linear systems of equations, the solution is

$$
\hat \beta := (X' X)^{-1} X' y
$$

Let $P$ and $M$ be the projection and annihilator associated with $X$:

$$
P := X (X' X)^{-1} X'
\quad \text{and} \quad
M := I - P
$$

The vector of fitted values is

$$
\hat y := X \hat \beta = P y
$$

The vector of residuals is

$$
\hat u := y - \hat y = y - P y = M y
$$
Here are some more standard definitions:
The total sum of squares is $\text{TSS} := \| y \|^2$.
The sum of squared residuals is $\text{SSR} := \| \hat u \|^2$.
The explained sum of squares is $\text{ESS} := \| \hat y \|^2$.
These quantities satisfy

$$
\text{TSS} = \text{ESS} + \text{SSR}
$$
We can prove this easily using the OPT.
From the OPT we have $y = \hat y + \hat u$ and $\hat u \perp \hat y$.
Applying the Pythagorean law completes the proof.
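The decomposition is also easy to confirm numerically. The following sketch uses simulated data; the sample size, coefficients and seed are arbitrary choices for illustration:

import numpy as np

np.random.seed(0)
N, K = 50, 3
X = np.random.randn(N, K)
y = X @ np.array([1.0, -2.0, 0.5]) + np.random.randn(N)

P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = P @ y             # fitted values
u_hat = y - y_hat         # residuals

TSS = np.sum(y**2)
ESS = np.sum(y_hat**2)
SSR = np.sum(u_hat**2)
print(np.allclose(TSS, ESS + SSR))   # True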
9.7. Orthogonalization and Decomposition#
Let’s return to the connection between linear independence and orthogonality touched on above.
A result of much interest is a famous algorithm for constructing orthonormal sets from linearly independent sets.
The next section gives details.
9.7.1. Gram-Schmidt Orthogonalization#
Theorem For each linearly independent set $\{ x_1, \ldots, x_k \} \subset \mathbb{R}^n$, there exists an orthonormal set $\{ u_1, \ldots, u_k \}$ with

$$
\operatorname{span} \{ x_1, \ldots, x_i \}
= \operatorname{span} \{ u_1, \ldots, u_i \}
\quad \text{for} \quad
i = 1, \ldots, k
$$

The Gram-Schmidt orthogonalization procedure constructs an orthogonal set $\{ u_1, u_2, \ldots, u_k \}$.
One description of this procedure is as follows:
For $i = 1, \ldots, k$, form $S_i := \operatorname{span}\{ x_1, \ldots, x_i \}$ and its orthogonal complement $S_i^{\perp}$.
Set $v_1 = x_1$.
For $i \geq 2$, set $v_i := \hat E_{S_{i-1}^{\perp}} x_i$ and $u_i := v_i / \| v_i \|$.
The sequence $u_1, \ldots, u_k$ has the stated properties.
A Gram-Schmidt orthogonalization construction is a key idea behind the Kalman filter described in A First Look at the Kalman filter.
In some exercises below, you are asked to implement this algorithm and test it using projection.
9.7.2. QR Decomposition#
The following result uses the preceding algorithm to produce a useful decomposition.
Theorem If $X$ is $n \times k$ with linearly independent columns, then there exists a factorization $X = Q R$ where
$R$ is $k \times k$, upper triangular, and nonsingular
$Q$ is $n \times k$ with orthonormal columns
Proof sketch: Let
$x_j := \operatorname{col}_j (X)$
$\{ u_1, \ldots, u_k \}$ be orthonormal with the same span as $\{ x_1, \ldots, x_k \}$ (to be constructed using Gram–Schmidt)
$Q$ be formed from the columns $u_i$
Since $x_j \in \operatorname{span}\{ u_1, \ldots, u_j \}$, we have

$$
x_j = \sum_{i=1}^j \langle u_i, x_j \rangle u_i
\quad \text{for } j = 1, \ldots, k
$$

Some rearranging gives $X = Q R$.
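The qr routine imported at the start of the lecture computes exactly such a factorization. A quick sanity check, with an $X$ chosen for illustration:

import numpy as np
from scipy.linalg import qr

X = np.array([[1.0, 0.0],
              [0.0, -6.0],
              [2.0, 2.0]])

Q, R = qr(X, mode='economic')   # Q is n x k with orthonormal columns, R is k x k upper triangular

print(np.allclose(Q @ R, X))              # X = QR
print(np.allclose(Q.T @ Q, np.eye(2)))    # orthonormal columns
print(np.allclose(R, np.triu(R)))         # upper triangular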
9.7.3. Linear Regression via QR Decomposition#
For matrices $X$ and $y$ that overdetermine $\beta$ in the linear equation system $y = X \beta$, we found the least squares approximator $\hat \beta = (X' X)^{-1} X' y$.
Using the QR decomposition $X = Q R$ gives

$$
\begin{aligned}
\hat \beta
& = (R' Q' Q R)^{-1} R' Q' y \\
& = (R' R)^{-1} R' Q' y \\
& = R^{-1} (R')^{-1} R' Q' y
  = R^{-1} Q' y
\end{aligned}
$$

Numerical routines would in this case use the alternative form $R \hat \beta = Q' y$ and back substitution.
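Here is a brief sketch of that calculation, reusing the illustrative $X$ and $y$ from above and using scipy.linalg.solve_triangular (not imported at the top of the lecture) for the back substitution step:

import numpy as np
from scipy.linalg import qr, solve_triangular

X = np.array([[1.0, 0.0],
              [0.0, -6.0],
              [2.0, 2.0]])
y = np.array([1.0, 3.0, -3.0])

# Solution via the normal equations
beta_direct = np.linalg.solve(X.T @ X, X.T @ y)

# QR-based solution: solve R beta = Q'y by back substitution
Q, R = qr(X, mode='economic')
beta_qr = solve_triangular(R, Q.T @ y)

print(np.allclose(beta_direct, beta_qr))  # True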
9.8. Exercises#
Exercise 9.1
Show that, for any linear subspace $S \subset \mathbb{R}^n$, $S \cap S^{\perp} = \{ 0 \}$.
Solution to Exercise 9.1
If $x \in S$ and $x \in S^{\perp}$, then we have in particular that $\langle x, x \rangle = 0$, and hence $x = 0$.
Exercise 9.2
Let $P = X (X' X)^{-1} X'$ and let $M = I - P$. Show that $P$ and $M$ are both idempotent and symmetric. Can you give any intuition as to why they should be idempotent?
Solution to Exercise 9.2
Symmetry and idempotence of $M$ and $P$ can be established using standard rules for matrix algebra. The intuition behind idempotence of $M$ and $P$ is that both are orthogonal projections: after a point is projected into a given subspace, applying the projection again makes no difference, since the projected point is already the closest point in the subspace to itself.
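These properties can also be checked numerically. A small sketch, where X is an arbitrary full-column-rank matrix generated for illustration:

import numpy as np

np.random.seed(1)
X = np.random.randn(5, 2)            # arbitrary full-column-rank matrix

P = X @ np.linalg.inv(X.T @ X) @ X.T
M = np.eye(5) - P

print(np.allclose(P, P.T), np.allclose(M, M.T))       # symmetry
print(np.allclose(P @ P, P), np.allclose(M @ M, M))   # idempotence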
Exercise 9.3
Using Gram-Schmidt orthogonalization, produce a linear projection of $y$ onto the column space of $X$ and verify this using the projection matrix $P := X (X' X)^{-1} X'$ and also using QR decomposition, where

$$
y :=
\begin{pmatrix}
    1 \\
    3 \\
    -3
\end{pmatrix},
\quad
X :=
\begin{pmatrix}
    1 & 0 \\
    0 & -6 \\
    2 & 2
\end{pmatrix}
$$
Solution to Exercise 9.3
Here’s a function that computes the orthonormal vectors using the GS algorithm given in the lecture
def gram_schmidt(X):
    """
    Implements Gram-Schmidt orthogonalization.

    Parameters
    ----------
    X : an n x k array with linearly independent columns

    Returns
    -------
    U : an n x k array with orthonormal columns
    """

    # Set up
    n, k = X.shape
    U = np.empty((n, k))
    I = np.eye(n)

    # The first col of U is just the normalized first col of X
    v1 = X[:, 0]
    U[:, 0] = v1 / np.sqrt(np.sum(v1 * v1))

    for i in range(1, k):
        # Set up
        b = X[:, i]        # The vector we're going to project
        Z = X[:, 0:i]      # The first i columns of X

        # Project onto the orthogonal complement of the col span of Z
        M = I - Z @ np.linalg.inv(Z.T @ Z) @ Z.T
        u = M @ b

        # Normalize
        U[:, i] = u / np.sqrt(np.sum(u * u))

    return U
Here are the arrays we’ll work with
y = [1, 3, -3]
X = [[1, 0],
[0, -6],
[2, 2]]
X, y = [np.asarray(z) for z in (X, y)]
First, let's try projection of $y$ onto the column space of $X$ using the ordinary matrix expression:
Py1 = X @ np.linalg.inv(X.T @ X) @ X.T @ y
Py1
array([-0.56521739, 3.26086957, -2.2173913 ])
Now let's do the same using an orthonormal basis created from our gram_schmidt function:
U = gram_schmidt(X)
U
array([[ 0.4472136 , -0.13187609],
[ 0. , -0.98907071],
[ 0.89442719, 0.06593805]])
Py2 = U @ U.T @ y
Py2
array([-0.56521739, 3.26086957, -2.2173913 ])
This is the same answer. So far so good. Finally, let’s try the same thing but with the basis obtained via QR decomposition:
Q, R = qr(X, mode='economic')
Q
array([[-0.4472136 , -0.13187609],
[-0. , -0.98907071],
[-0.89442719, 0.06593805]])
Py3 = Q @ Q.T @ y
Py3
array([-0.56521739, 3.26086957, -2.2173913 ])
Again, we obtain the same answer.