9. Orthogonal Projections and Their Applications#

9.1. Overview#

Orthogonal projection is a cornerstone of vector space methods, with many diverse applications.

These include

  • Least squares projection, also known as linear regression

  • Conditional expectations for multivariate normal (Gaussian) distributions

  • Gram–Schmidt orthogonalization

  • QR decomposition

  • Orthogonal polynomials

  • etc

In this lecture, we focus on

  • key ideas

  • least squares regression

We’ll require the following imports:

import numpy as np
from scipy.linalg import qr

9.1.1. Further Reading#

For background and foundational concepts, see our lecture on linear algebra.

For more proofs and greater theoretical detail, see A Primer in Econometric Theory.

For a complete set of proofs in a general setting, see, for example, [Roman, 2005].

For an advanced treatment of projection in the context of least squares prediction, see this book chapter.

9.2. Key Definitions#

Assume $x, z \in \mathbb{R}^n$.

Define $\langle x, z \rangle := \sum_i x_i z_i$.

Recall $\|x\|^2 = \langle x, x \rangle$.

The law of cosines states that $\langle x, z \rangle = \|x\| \|z\| \cos(\theta)$, where $\theta$ is the angle between the vectors $x$ and $z$.

When $\langle x, z \rangle = 0$, then $\cos(\theta) = 0$ and $x$ and $z$ are said to be orthogonal, and we write $x \perp z$.

_images/orth_proj_def1.png
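As a quick numerical check of these definitions, here is a small NumPy illustration (the particular vectors are an arbitrary choice made for this example):

import numpy as np

# Two vectors in R^3 whose inner product happens to be zero
x = np.array([1.0, 2.0, 0.0])
z = np.array([-2.0, 1.0, 5.0])

inner = x @ z                    # <x, z> = sum_i x_i z_i
norm_x = np.sqrt(x @ x)          # ||x|| = sqrt(<x, x>)

print(inner)                                   # 0.0, so x is orthogonal to z
print(np.isclose(norm_x, np.linalg.norm(x)))   # True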

For a linear subspace $S \subset \mathbb{R}^n$, we call $x \in \mathbb{R}^n$ orthogonal to $S$ if $x \perp z$ for all $z \in S$, and write $x \perp S$.

_images/orth_proj_def2.png

The orthogonal complement of linear subspace $S \subset \mathbb{R}^n$ is the set $S^{\perp} := \{x \in \mathbb{R}^n \,:\, x \perp S\}$.

_images/orth_proj_def3.png

$S^{\perp}$ is a linear subspace of $\mathbb{R}^n$

  • To see this, fix $x, y \in S^{\perp}$ and $\alpha, \beta \in \mathbb{R}$.

  • Observe that if $z \in S$, then

$$\langle \alpha x + \beta y, z \rangle = \alpha \langle x, z \rangle + \beta \langle y, z \rangle = \alpha \times 0 + \beta \times 0 = 0$$

  • Hence $\alpha x + \beta y \in S^{\perp}$, as was to be shown

A set of vectors $\{x_1, \ldots, x_k\} \subset \mathbb{R}^n$ is called an orthogonal set if $x_i \perp x_j$ whenever $i \neq j$.

If $\{x_1, \ldots, x_k\}$ is an orthogonal set, then the Pythagorean Law states that

$$\| x_1 + \cdots + x_k \|^2 = \| x_1 \|^2 + \cdots + \| x_k \|^2$$

For example, when $k = 2$, $x_1 \perp x_2$ implies

$$\| x_1 + x_2 \|^2 = \langle x_1 + x_2, x_1 + x_2 \rangle = \langle x_1, x_1 \rangle + 2 \langle x_2, x_1 \rangle + \langle x_2, x_2 \rangle = \| x_1 \|^2 + \| x_2 \|^2$$
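We can confirm the $k = 2$ case numerically with the same kind of illustrative vectors used above:

import numpy as np

x1 = np.array([1.0, 2.0, 0.0])
x2 = np.array([-2.0, 1.0, 5.0])     # orthogonal to x1: x1 @ x2 == 0

lhs = np.linalg.norm(x1 + x2)**2
rhs = np.linalg.norm(x1)**2 + np.linalg.norm(x2)**2
print(np.isclose(lhs, rhs))         # True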

9.2.1. Linear Independence vs Orthogonality#

If $X \subset \mathbb{R}^n$ is an orthogonal set and $0 \notin X$, then $X$ is linearly independent.

Proving this is a nice exercise.

While the converse is not true, a kind of partial converse holds, as we’ll see below.

9.3. The Orthogonal Projection Theorem#

What vector within a linear subspace of $\mathbb{R}^n$ best approximates a given vector in $\mathbb{R}^n$?

The next theorem answers this question.

Theorem (OPT) Given $y \in \mathbb{R}^n$ and linear subspace $S \subset \mathbb{R}^n$, there exists a unique solution to the minimization problem

$$\hat{y} := \operatorname*{argmin}_{z \in S} \| y - z \|$$

The minimizer $\hat{y}$ is the unique vector in $\mathbb{R}^n$ that satisfies

  • $\hat{y} \in S$

  • $y - \hat{y} \perp S$

The vector $\hat{y}$ is called the orthogonal projection of $y$ onto $S$.

The next figure provides some intuition

_images/orth_proj_thm1.png

9.3.1. Proof of Sufficiency#

We’ll omit the full proof.

But we will prove sufficiency of the asserted conditions.

To this end, let $y \in \mathbb{R}^n$ and let $S$ be a linear subspace of $\mathbb{R}^n$.

Let $\hat{y}$ be a vector in $\mathbb{R}^n$ such that $\hat{y} \in S$ and $y - \hat{y} \perp S$.

Let $z$ be any other point in $S$ and use the fact that $S$ is a linear subspace to deduce

$$\| y - z \|^2 = \| (y - \hat{y}) + (\hat{y} - z) \|^2 = \| y - \hat{y} \|^2 + \| \hat{y} - z \|^2$$

(The second equality uses the Pythagorean law, since $\hat{y} - z \in S$ while $y - \hat{y} \perp S$.)

Hence $\| y - z \| \geq \| y - \hat{y} \|$, which completes the proof.

9.3.2. Orthogonal Projection as a Mapping#

For a linear space $Y$ and a fixed linear subspace $S$, we have a functional relationship

$$y \in Y \; \mapsto \; \text{its orthogonal projection } \hat{y} \in S$$

By the OPT, this is a well-defined mapping or operator from $\mathbb{R}^n$ to $\mathbb{R}^n$.

In what follows we denote this operator by a matrix $P$

  • $Py$ represents the projection $\hat{y}$.

  • This is sometimes expressed as $\hat{E}_S y = Py$, where $\hat{E}$ denotes a wide-sense expectations operator and the subscript $S$ indicates that we are projecting $y$ onto the linear subspace $S$.

The operator $P$ is called the orthogonal projection mapping onto $S$.

_images/orth_proj_thm2.png

It is immediate from the OPT that for any $y \in \mathbb{R}^n$

  1. $Py \in S$ and

  2. $y - Py \perp S$

From this, we can deduce additional useful properties, such as

  1. $\| y \|^2 = \| Py \|^2 + \| y - Py \|^2$ and

  2. $\| Py \| \leq \| y \|$

For example, to prove 1, observe that $y = Py + (y - Py)$ and apply the Pythagorean law.

9.3.2.1. Orthogonal Complement#

Let $S \subset \mathbb{R}^n$.

The orthogonal complement of $S$ is the linear subspace $S^{\perp}$ that satisfies $x_1 \perp x_2$ for every $x_1 \in S$ and $x_2 \in S^{\perp}$.

Let $Y$ be a linear space with linear subspace $S$ and its orthogonal complement $S^{\perp}$.

We write

$$Y = S \oplus S^{\perp}$$

to indicate that for every $y \in Y$ there is a unique $x_1 \in S$ and a unique $x_2 \in S^{\perp}$ such that $y = x_1 + x_2$.

Moreover, $x_1 = \hat{E}_S y$ and $x_2 = y - \hat{E}_S y$.

This amounts to another version of the OPT:

Theorem. If $S$ is a linear subspace of $\mathbb{R}^n$, $\hat{E}_S y = Py$ and $\hat{E}_{S^{\perp}} y = My$, then

$$Py \perp My \quad \text{and} \quad y = Py + My \quad \text{for all } y \in \mathbb{R}^n$$

The next figure illustrates

_images/orth_proj_thm3.png

9.4. Orthonormal Basis#

An orthogonal set of vectors $O \subset \mathbb{R}^n$ is called an orthonormal set if $\| u \| = 1$ for all $u \in O$.

Let $S$ be a linear subspace of $\mathbb{R}^n$ and let $O \subset S$.

If $O$ is orthonormal and $\operatorname{span} O = S$, then $O$ is called an orthonormal basis of $S$.

$O$ is necessarily a basis of $S$ (being independent by orthogonality and the fact that no element is the zero vector).

One example of an orthonormal set is the canonical basis $\{e_1, \ldots, e_n\}$ that forms an orthonormal basis of $\mathbb{R}^n$, where $e_i$ is the $i$-th unit vector.

If $\{u_1, \ldots, u_k\}$ is an orthonormal basis of linear subspace $S$, then

$$x = \sum_{i=1}^k \langle x, u_i \rangle u_i \quad \text{for all } x \in S$$

To see this, observe that since $x \in \operatorname{span}\{u_1, \ldots, u_k\}$, we can find scalars $\alpha_1, \ldots, \alpha_k$ that verify

$$x = \sum_{j=1}^k \alpha_j u_j \tag{9.1}$$

Taking the inner product with respect to $u_i$ gives

$$\langle x, u_i \rangle = \sum_{j=1}^k \alpha_j \langle u_j, u_i \rangle = \alpha_i$$

Combining this result with (9.1) verifies the claim.
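Here is a small numerical illustration of this expansion, using an orthonormal pair of vectors chosen purely for the example:

import numpy as np

# An orthonormal basis of a two-dimensional subspace S of R^3
u1 = np.array([1.0, 0.0, 0.0])
u2 = np.array([0.0, 1.0, 1.0]) / np.sqrt(2)

# Any x in S = span{u1, u2} equals the sum of <x, u_i> u_i
x = 2.0 * u1 - 3.0 * u2
reconstruction = (x @ u1) * u1 + (x @ u2) * u2
print(np.allclose(x, reconstruction))   # True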

9.4.1. Projection onto an Orthonormal Basis#

When the subspace onto which we project has an orthonormal basis, computing the projection simplifies:

Theorem If $\{u_1, \ldots, u_k\}$ is an orthonormal basis for $S$, then

$$Py = \sum_{i=1}^k \langle y, u_i \rangle u_i, \quad \forall \; y \in \mathbb{R}^n \tag{9.2}$$

Proof: Fix $y \in \mathbb{R}^n$ and let $Py$ be defined as in (9.2).

Clearly, $Py \in S$.

We claim that $y - Py \perp S$ also holds.

It suffices to show that $y - Py \perp u_j$ for any basis vector $u_j$.

This is true because

$$\left\langle y - \sum_{i=1}^k \langle y, u_i \rangle u_i, \, u_j \right\rangle = \langle y, u_j \rangle - \sum_{i=1}^k \langle y, u_i \rangle \langle u_i, u_j \rangle = 0$$

(Why is this sufficient to establish the claim that $y - Py \perp S$?)

9.5. Projection Via Matrix Algebra#

Let $S$ be a linear subspace of $\mathbb{R}^n$ and let $y \in \mathbb{R}^n$.

We want to compute the matrix $P$ that verifies

$$\hat{E}_S y = Py$$

Evidently $Py$ is a linear function from $y \in \mathbb{R}^n$ to $Py \in \mathbb{R}^n$.

This reference is useful.

Theorem. Let the columns of $n \times k$ matrix $X$ form a basis of $S$. Then

$$P = X (X'X)^{-1} X'$$

Proof: Given arbitrary $y \in \mathbb{R}^n$ and $P = X (X'X)^{-1} X'$, our claim is that

  1. $Py \in S$, and

  2. $y - Py \perp S$

Claim 1 is true because

$$Py = X (X'X)^{-1} X' y = X a \quad \text{when} \quad a := (X'X)^{-1} X' y$$

An expression of the form $Xa$ is precisely a linear combination of the columns of $X$ and hence an element of $S$.

Claim 2 is equivalent to the statement

$$y - X(X'X)^{-1}X'y \; \perp \; Xb \quad \text{for all } b \in \mathbb{R}^k$$

To verify this, notice that if $b \in \mathbb{R}^k$, then

$$(Xb)'\left[y - X(X'X)^{-1}X'y\right] = b'\left[X'y - X'y\right] = 0$$

The proof is now complete.
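The following sketch checks both claims numerically for a small illustrative $X$ and $y$ (the numbers are arbitrary choices for this example):

import numpy as np

# X has linearly independent columns; S is its column space
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 5.0])

P = X @ np.linalg.inv(X.T @ X) @ X.T    # P = X (X'X)^{-1} X'
Py = P @ y

# Claim 1: Py lies in S, i.e. Py = Xa with a = (X'X)^{-1} X'y
a = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(Py, X @ a))           # True

# Claim 2: the residual y - Py is orthogonal to every column of X
print(np.allclose(X.T @ (y - Py), 0))   # True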

9.5.1. Starting with the Basis#

It is common in applications to start with $n \times k$ matrix $X$ with linearly independent columns and let

$$S := \operatorname{span} X := \operatorname{span}\{\operatorname{col}_1 X, \ldots, \operatorname{col}_k X\}$$

Then the columns of X form a basis of S.

From the preceding theorem, $P = X (X'X)^{-1} X'$ projects $y$ onto $S$.

In this context, P is often called the projection matrix

  • The matrix $M = I - P$ satisfies $My = \hat{E}_{S^{\perp}} y$ and is sometimes called the annihilator matrix.

9.5.2. The Orthonormal Case#

Suppose that $U$ is $n \times k$ with orthonormal columns.

Let $u_i := \operatorname{col}_i U$ for each $i$, let $S := \operatorname{span} U$ and let $y \in \mathbb{R}^n$.

We know that the projection of $y$ onto $S$ is

$$Py = U (U'U)^{-1} U' y$$

Since $U$ has orthonormal columns, we have $U'U = I$.

Hence

$$Py = UU'y = \sum_{i=1}^k \langle u_i, y \rangle u_i$$

We have recovered our earlier result about projecting onto the span of an orthonormal basis.
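As a quick check, we can build a matrix with orthonormal columns (here via the qr routine imported at the start of the lecture, applied to an illustrative $X$) and confirm that the general formula collapses to $UU'$:

import numpy as np
from scipy.linalg import qr

# Orthonormal columns spanning the same space as an illustrative X
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
U, _ = qr(X, mode='economic')    # columns of U are orthonormal

P_general = U @ np.linalg.inv(U.T @ U) @ U.T   # U (U'U)^{-1} U'
P_simple = U @ U.T                             # simplification when U'U = I
print(np.allclose(P_general, P_simple))        # True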

9.5.3. Application: Overdetermined Systems of Equations#

Let $y \in \mathbb{R}^n$ and let $X$ be $n \times k$ with linearly independent columns.

Given $X$ and $y$, we seek $b \in \mathbb{R}^k$ that satisfies the system of linear equations $Xb = y$.

If $n > k$ (more equations than unknowns), then the system is said to be overdetermined.

Intuitively, we may not be able to find a b that satisfies all n equations.

The best approach here is to

  • Accept that an exact solution may not exist.

  • Look instead for an approximate solution.

By approximate solution, we mean a $b \in \mathbb{R}^k$ such that $Xb$ is close to $y$.

The next theorem shows that a best approximation is well defined and unique.

The proof uses the OPT.

Theorem The unique minimizer of $\| y - Xb \|$ over $b \in \mathbb{R}^k$ is

$$\hat{\beta} := (X'X)^{-1}X'y$$

Proof: Note that

$$X \hat{\beta} = X (X'X)^{-1}X'y = Py$$

Since $Py$ is the orthogonal projection onto $\operatorname{span}(X)$ we have

$$\| y - Py \| \leq \| y - z \| \quad \text{for any } z \in \operatorname{span}(X)$$

Because $Xb \in \operatorname{span}(X)$

$$\| y - X\hat{\beta} \| \leq \| y - Xb \| \quad \text{for any } b \in \mathbb{R}^k$$

This is what we aimed to show.
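Here is a minimal numerical sketch, using an arbitrary small system, showing that the formula agrees with NumPy's least squares routine:

import numpy as np

# An overdetermined system: 3 equations, 2 unknowns (illustrative data)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 5.0])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)       # (X'X)^{-1} X'y
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(beta_hat, beta_lstsq))           # True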

9.6. Least Squares Regression#

Let’s apply the theory of orthogonal projection to least squares regression.

This approach provides insights about many geometric properties of linear regression.

We treat only some examples.

9.6.1. Squared Risk Measures#

Given pairs $(x, y) \in \mathbb{R}^K \times \mathbb{R}$, consider choosing $f : \mathbb{R}^K \to \mathbb{R}$ to minimize the risk

$$R(f) := \mathbb{E}\left[(y - f(x))^2\right]$$

If probabilities and hence $\mathbb{E}$ are unknown, we cannot solve this problem directly.

However, if a sample is available, we can estimate the risk with the empirical risk:

$$\min_{f \in \mathcal{F}} \frac{1}{N} \sum_{n=1}^N (y_n - f(x_n))^2$$

Minimizing this expression is called empirical risk minimization.

The set $\mathcal{F}$ is sometimes called the hypothesis space.

The theory of statistical learning tells us that to prevent overfitting we should take the set $\mathcal{F}$ to be relatively simple.

If we let $\mathcal{F}$ be the class of linear functions, the problem is

$$\min_{b \in \mathbb{R}^K} \sum_{n=1}^N (y_n - b'x_n)^2$$

This is the sample linear least squares problem.

9.6.2. Solution#

Define the matrices

$$y := \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{pmatrix},
\quad
x_n := \begin{pmatrix} x_{n1} \\ x_{n2} \\ \vdots \\ x_{nK} \end{pmatrix}
= \text{$n$-th obs on all regressors}$$

and

$$X := \begin{pmatrix} x_1' \\ x_2' \\ \vdots \\ x_N' \end{pmatrix}
:=
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1K} \\
x_{21} & x_{22} & \cdots & x_{2K} \\
\vdots & \vdots &        & \vdots \\
x_{N1} & x_{N2} & \cdots & x_{NK}
\end{pmatrix}$$

We assume throughout that N>K and X is full column rank.

If you work through the algebra, you will be able to verify that $\| y - Xb \|^2 = \sum_{n=1}^N (y_n - b'x_n)^2$.

Since monotone transforms don't affect minimizers, we have

$$\operatorname*{argmin}_{b \in \mathbb{R}^K} \sum_{n=1}^N (y_n - b'x_n)^2
= \operatorname*{argmin}_{b \in \mathbb{R}^K} \| y - Xb \|$$

By our results about overdetermined linear systems of equations, the solution is

$$\hat{\beta} := (X'X)^{-1}X'y$$

Let $P$ and $M$ be the projection and annihilator associated with $X$:

$$P := X(X'X)^{-1}X' \quad \text{and} \quad M := I - P$$

The vector of fitted values is

$$\hat{y} := X\hat{\beta} = Py$$

The vector of residuals is

$$\hat{u} := y - \hat{y} = y - Py = My$$

Here are some more standard definitions:

  • The total sum of squares is $\text{TSS} := \| y \|^2$.

  • The sum of squared residuals is $\text{SSR} := \| \hat{u} \|^2$.

  • The explained sum of squares is $\text{ESS} := \| \hat{y} \|^2$.

TSS = ESS + SSR

We can prove this easily using the OPT.

From the OPT we have $y = \hat{y} + \hat{u}$ and $\hat{u} \perp \hat{y}$.

Applying the Pythagorean law completes the proof.
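A short simulation (with made-up data, purely for illustration) confirms the decomposition:

import numpy as np

# Simulated regression data (illustrative only)
np.random.seed(42)
N, K = 50, 3
X = np.random.randn(N, K)
y = X @ np.array([1.0, -2.0, 0.5]) + np.random.randn(N)

P = X @ np.linalg.inv(X.T @ X) @ X.T
y_hat = P @ y                  # fitted values
u_hat = y - y_hat              # residuals

TSS = np.sum(y**2)
ESS = np.sum(y_hat**2)
SSR = np.sum(u_hat**2)
print(np.isclose(TSS, ESS + SSR))   # True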

9.7. Orthogonalization and Decomposition#

Let’s return to the connection between linear independence and orthogonality touched on above.

A result of much interest is a famous algorithm for constructing orthonormal sets from linearly independent sets.

The next section gives details.

9.7.1. Gram-Schmidt Orthogonalization#

Theorem For each linearly independent set $\{x_1, \ldots, x_k\} \subset \mathbb{R}^n$, there exists an orthonormal set $\{u_1, \ldots, u_k\}$ with

$$\operatorname{span}\{x_1, \ldots, x_i\} = \operatorname{span}\{u_1, \ldots, u_i\} \quad \text{for } i = 1, \ldots, k$$

The Gram-Schmidt orthogonalization procedure constructs such an orthonormal set $\{u_1, u_2, \ldots, u_k\}$.

One description of this procedure is as follows:

  • For $i = 1, \ldots, k$, form $S_i := \operatorname{span}\{x_1, \ldots, x_i\}$ and $S_i^{\perp}$

  • Set $v_1 = x_1$

  • For $i \geq 2$ set $v_i := \hat{E}_{S_{i-1}^{\perp}} x_i$ and $u_i := v_i / \| v_i \|$

The sequence $u_1, \ldots, u_k$ has the stated properties.
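A minimal sketch of this procedure in NumPy might look as follows (the function name gram_schmidt and the equivalent "subtract the projection onto the previous $u$'s" formulation are our own choices for this illustration):

import numpy as np

def gram_schmidt(X):
    """
    Return U whose columns form an orthonormal set with the same span
    as the (linearly independent) columns of X.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    U = np.empty((n, k))
    for i in range(k):
        v = X[:, i].copy()
        # Remove the component lying in span{u_1, ..., u_{i-1}},
        # which leaves the projection onto its orthogonal complement
        for j in range(i):
            v -= (U[:, j] @ X[:, i]) * U[:, j]
        U[:, i] = v / np.linalg.norm(v)
    return U

# Quick check on an illustrative matrix
U = gram_schmidt(np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
print(np.allclose(U.T @ U, np.eye(2)))   # columns are orthonormal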

A Gram-Schmidt orthogonalization construction is a key idea behind the Kalman filter described in A First Look at the Kalman filter.

In some exercises below, you are asked to implement this algorithm and test it using projection.

9.7.2. QR Decomposition#

The following result uses the preceding algorithm to produce a useful decomposition.

Theorem If $X$ is $n \times k$ with linearly independent columns, then there exists a factorization $X = QR$ where

  • $R$ is $k \times k$, upper triangular, and nonsingular

  • $Q$ is $n \times k$ with orthonormal columns

Proof sketch: Let

  • $x_j := \operatorname{col}_j(X)$

  • $\{u_1, \ldots, u_k\}$ be orthonormal with the same span as $\{x_1, \ldots, x_k\}$ (to be constructed using Gram–Schmidt)

  • $Q$ be formed from the columns $u_i$

Since $x_j \in \operatorname{span}\{u_1, \ldots, u_j\}$, we have

$$x_j = \sum_{i=1}^j \langle u_i, x_j \rangle u_i \quad \text{for } j = 1, \ldots, k$$

Some rearranging gives X=QR.
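We can obtain such a factorization with the qr routine imported at the start of the lecture (the matrix below is an arbitrary illustration; mode='economic' requests the $n \times k$ form of $Q$):

import numpy as np
from scipy.linalg import qr

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

Q, R = qr(X, mode='economic')            # Q is n x k, R is k x k

print(np.allclose(X, Q @ R))             # X = QR
print(np.allclose(Q.T @ Q, np.eye(2)))   # Q has orthonormal columns
print(np.allclose(R, np.triu(R)))        # R is upper triangular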

9.7.3. Linear Regression via QR Decomposition#

For matrices $X$ and $y$ that overdetermine $\beta$ in the linear equation system $y = X\beta$, we found the least squares approximator $\hat{\beta} = (X'X)^{-1}X'y$.

Using the QR decomposition $X = QR$ gives

$$\hat{\beta} = (R'Q'QR)^{-1} R'Q'y = (R'R)^{-1} R'Q'y = R^{-1}(R')^{-1} R'Q'y = R^{-1}Q'y$$

Numerical routines would in this case use the alternative form $R\hat{\beta} = Q'y$ and back substitution.
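The following sketch compares the two routes on simulated data (the data are made up for this example; scipy.linalg.solve_triangular handles the back substitution step):

import numpy as np
from scipy.linalg import qr, solve_triangular

# Illustrative regression data
np.random.seed(0)
N, K = 50, 3
X = np.random.randn(N, K)
y = X @ np.array([1.0, -2.0, 0.5]) + np.random.randn(N)

# Least squares via the normal equations
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least squares via QR: solve R beta = Q'y by back substitution
Q, R = qr(X, mode='economic')
beta_qr = solve_triangular(R, Q.T @ y)

print(np.allclose(beta_normal, beta_qr))   # True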

9.8. Exercises#

Exercise 9.1

Show that, for any linear subspace $S \subset \mathbb{R}^n$, $S \cap S^{\perp} = \{0\}$.

Exercise 9.2

Let $P = X(X'X)^{-1}X'$ and let $M = I - P$. Show that $P$ and $M$ are both idempotent and symmetric. Can you give any intuition as to why they should be idempotent?

Exercise 9.3

Using Gram-Schmidt orthogonalization, produce a linear projection of $y$ onto the column space of $X$ and verify this using the projection matrix $P := X(X'X)^{-1}X'$ and also using QR decomposition, where:

$$y := \begin{pmatrix} 1 \\ 3 \\ -3 \end{pmatrix},$$

and

$$X := \begin{pmatrix} 1 & 0 \\ 0 & -6 \\ 2 & 2 \end{pmatrix}$$