2025-10-24 Beyond Linear Models

  • Assumptions of linear models

  • Look at the data!

  • Partial derivatives

  • Loss functions

[1]:
using LinearAlgebra
using Plots
using Polynomials
default(lw=4, ms=5, legendfontsize=12, xtickfontsize=12, ytickfontsize=12)

# Here's our Vandermonde matrix again
function vander(x, k=nothing)
    if isnothing(k)
        k = length(x)
    end
    m = length(x)
    V = ones(m, k)
    for j in 2:k
        V[:, j] = V[:, j-1] .* x
    end
    V
end

# With Chebyshev polynomials
function vander_chebyshev(x, n=nothing)
    if isnothing(n)
        n = length(x) # Square by default
    end
    m = length(x)
    T = ones(m, n)
    if n > 1
        T[:, 2] = x
    end
    for k in 3:n
        # Chebyshev recurrence: T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x)
        T[:, k] = 2 * x .* T[:, k-1] - T[:, k-2]
    end
    T
end

# And for piecewise constant interpolation
function interp_nearest(x, s)
    A = zeros(length(s), length(x))
    for (i, t) in enumerate(s)
        loc = nothing
        dist = Inf
        for (j, u) in enumerate(x)
            if abs(t - u) < dist
                loc = j
                dist = abs(t - u)
            end
        end
        A[i, loc] = 1
    end
    A
end

# And our "bad" function
runge(x) = 1 / (1 + 10*x^2)

# And a utility for points distributed via cos
CosRange(a, b, n) = (a + b)/2 .+ (b - a)/2 * cos.(LinRange(-pi, 0, n))

# And a helper for looking at conditioning
vcond(mat, points, nmax) = [cond(mat(points(-1, 1, n))) for n in 2:nmax]
[1]:
vcond (generic function with 1 method)

Why ‘linear’

We are currently working with algorithms that express the regression as a linear function of the model parameters. That is, we search for coefficients \(c = \left[ c_1, c_2, \dots \right]^T\) such that

\[V \left( x \right) c \approx y\]

where the left-hand side is linear in \(c\). In different notation, we are searching for a predictive model

\[f \left( x_i, c \right) \approx y_i, \forall \left( x_i, y_i \right)\]

that is linear in \(c\).
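
As a concrete sketch, such a linear-in-\(c\) model can be fit with the helpers defined above; the number of sample points and the basis size below are arbitrary choices for illustration.

[ ]:
# A minimal sketch: fit a linear-in-c model using the Chebyshev Vandermonde matrix above.
# The sample size (20) and basis size (10) are arbitrary choices.
x = LinRange(-1, 1, 20)
y = runge.(x) + 0.05 * randn(length(x))   # noisy samples of the Runge function
V = vander_chebyshev(x, 10)               # rows are the basis evaluated at each x_i
c = V \ y                                 # least-squares coefficients, so V(x) c ≈ y
scatter(x, y, label="data")
plot!(x, V * c, label="Chebyshev fit")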

Assumptions

So far, we have been using the following assumptions

  1. The independent variables \(x\) are error-free

  2. The prediction (or “response”) \(f \left( x, c \right)\) is linear in \(c\)

  3. The noise in the measurements \(y\) is independent (uncorrelated)

  4. The noise in the measurements \(y\) has constant variance

In practice, any of these assumptions may fail to hold, and relaxing them leads to more sophisticated methods.
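
For example, here is a minimal sketch of synthetic data that violates assumption 4: the noise standard deviation grows with \(\left\lvert x \right\rvert\), so its variance is not constant.

[ ]:
# A minimal sketch: data violating assumption 4 (constant variance).
# The noise standard deviation grows with |x| (heteroscedastic noise).
x_h = LinRange(-1, 1, 50)
y_h = runge.(x_h) + 0.2 * abs.(x_h) .* randn(length(x_h))
scatter(x_h, y_h, label="heteroscedastic samples")
plot!(runge, -1, 1, label="runge(x)")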

Anscombe’s quartet

[Figure: Anscombe’s quartet, four datasets with nearly identical summary statistics (means, variances, correlation, and least-squares fit) but strikingly different distributions.]

Loss functions

The error in a single prediction \(f \left( x_i, c \right)\) of an observation \(\left( x_i, y_i \right)\) is often measured as

\[\frac{1}{2} \left( f \left( x_i, c \right) - y_i \right)^2\]

which turns out to have a statistical interpretation when the noise is normally distributed.

It is natural to define the error over the entire data set as

\[\begin{split} \begin{align} L \left( c; x, y \right) &= \sum_i \frac{1}{2} \left( f \left( x_i, c \right) - y_i \right)^2\\ &= \frac{1}{2} \left\lvert \left\lvert f \left( x, c \right) - y \right\rvert \right\rvert^2 \end{align}\end{split}\]

where I’ve used the notation \(f \left( x, c \right)\) to mean the vector obtained by gathering all of the outputs \(f \left( x_i, c \right)\). This function \(L\) is called the “loss function”, and it is the key to relaxing the assumptions above.
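
As a minimal sketch, assuming the linear model \(f \left( x, c \right) = V \left( x \right) c\) that we use below, both the vector form and the term-by-term form of the loss can be written directly; the demo arrays are arbitrary.

[ ]:
# A minimal sketch of the loss function for a linear model f(x, c) = V c.
# Both forms compute the same number; norm comes from LinearAlgebra.
loss(c, V, y) = 0.5 * norm(V * c - y)^2
loss_terms(c, V, y) = 0.5 * sum((V[i, :]' * c - y[i])^2 for i in 1:length(y))

Vdemo = vander_chebyshev(LinRange(-1, 1, 5), 3)
cdemo = [1.0, 0.5, -0.25]
ydemo = randn(5)
loss(cdemo, Vdemo, ydemo) ≈ loss_terms(cdemo, Vdemo, ydemo)   # should be true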

Gradient of scalar-valued functions

Let’s step back from optimization and consider how to differentiate a function of several variables. Let \(f \left( \mathbf{x} \right)\) be a function of a vector \(\mathbf{x}\). For example,

\[f \left( \mathbf{x} \right) = x_1^2 + \sin \left( x_2 \right) e^{3 x_3}\]

We can evaluate the partial derivatives by differentiating with respect to each component \(x_i\) separately (holding the others constant) and collecting the results in a row vector,

\[\begin{split} \begin{align} \frac{\partial f}{\partial \mathbf{x}} &= \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \frac{\partial f}{\partial x_3} \end{bmatrix}\\ &= \begin{bmatrix} 2 x_1 & \cos \left( x_2 \right) e^{3 x_3} & 3 \sin \left( x_2 \right) e^{3 x_3} \end{bmatrix} \end{align}\end{split}\]
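
We can sanity-check this hand-computed gradient against a centered finite-difference approximation; the test point and step size in this sketch are arbitrary.

[ ]:
# A minimal sketch: compare the analytic gradient with centered finite differences.
f_scalar(x) = x[1]^2 + sin(x[2]) * exp(3 * x[3])
grad_scalar(x) = [2 * x[1], cos(x[2]) * exp(3 * x[3]), 3 * sin(x[2]) * exp(3 * x[3])]

x0 = [0.5, 0.3, 0.1]
h = 1e-6
unit(j, n) = [i == j for i in 1:n]   # j-th standard basis vector of length n
fd = [(f_scalar(x0 + h * unit(j, 3)) - f_scalar(x0 - h * unit(j, 3))) / (2h) for j in 1:3]
grad_scalar(x0) ≈ fd   # agrees up to finite-difference error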

Gradient of vector-valued functions

Now let’s consider a vector-valued function \(\mathbf{f} \left( \mathbf{x} \right)\). For example,

\[\begin{split}\mathbf{f} \left( \mathbf{x} \right) = \begin{bmatrix} x_1^2 + \sin \left( x_2 \right) e^{3 x_3}\\ x_1 x_2^2 / x_3 \end{bmatrix}\end{split}\]

and write the derivative (the Jacobian) as a matrix

\[\begin{split} \begin{align} \frac{\partial \mathbf{f}}{\partial \mathbf{x}} &= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \frac{\partial f_1}{\partial x_3}\\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \frac{\partial f_2}{\partial x_3} \end{bmatrix}\\ &= \begin{bmatrix} 2 x_1 & \cos \left( x_2 \right) e^{3 x_3} & 3 \sin \left( x_2 \right) e^{3 x_3}\\ x_2^2 / x_3 & 2 x_1 x_2 / x_3 & -x_1 x_2^2 / x_3^2 \end{bmatrix} \end{align}\end{split}\]
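
The same finite-difference check works column by column for this Jacobian; again, the test point and step size in this sketch are arbitrary.

[ ]:
# A minimal sketch: compare the analytic Jacobian with centered finite differences.
f_vec(x) = [x[1]^2 + sin(x[2]) * exp(3 * x[3]),
            x[1] * x[2]^2 / x[3]]
jac_vec(x) = [2*x[1]       cos(x[2])*exp(3*x[3])  3*sin(x[2])*exp(3*x[3]);
              x[2]^2/x[3]  2*x[1]*x[2]/x[3]       (-x[1]*x[2]^2/x[3]^2)]

x0 = [0.5, 0.3, 0.1]
h = 1e-6
unit(j, n) = [i == j for i in 1:n]
Jfd = hcat([(f_vec(x0 + h * unit(j, 3)) - f_vec(x0 - h * unit(j, 3))) / (2h) for j in 1:3]...)
jac_vec(x0) ≈ Jfd   # agrees up to finite-difference error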

Here is a handy resource on partial derivatives for matrices and vectors: https://explained.ai/matrix-calculus/index.html#sec3

Derivative of a dot product

(Hang in there; I promise that we’re building to a point!)

Let \(f \left( \mathbf{x} \right) = \mathbf{y}^T \mathbf{x} = \sum_i y_i x_i\) and compute the derivative

\[\frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} y_1 & y_2 & \cdots \end{bmatrix} = \mathbf{y}^T\]

Note that \(\mathbf{y}^T \mathbf{x} = \mathbf{x}^T \mathbf{y}\), so applying the result above to both factors (the product rule) gives

\[\frac{\partial \left\lvert \left\lvert \mathbf{x} \right\rvert \right\rvert^2}{\partial \mathbf{x}} = \frac{\partial \mathbf{x}^T \mathbf{x}}{\partial \mathbf{x}} = 2 \mathbf{x}^T\]

Also,

\[\frac{\partial \left\lvert \left\lvert \mathbf{x} - \mathbf{y} \right\rvert \right\rvert^2}{\partial \mathbf{x}} = \frac{\partial (\mathbf{x} - \mathbf{y})^T (\mathbf{x} - \mathbf{y})}{\partial \mathbf{x}} = 2 (\mathbf{x} - \mathbf{y})^T\]
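
A quick numeric check of the last identity, sketched with arbitrary random vectors:

[ ]:
# A minimal sketch: verify d||x - y||^2 / dx = 2 (x - y)^T by finite differences.
xv = randn(4)
yv = randn(4)
g(x) = sum((x - yv).^2)              # ||x - yv||^2
h = 1e-6
unit(j, n) = [i == j for i in 1:n]
fd = [(g(xv + h * unit(j, 4)) - g(xv - h * unit(j, 4))) / (2h) for j in 1:4]
fd ≈ 2 * (xv - yv)   # should be true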

Variational notation

It’s convenient to express derivatives in terms of how they act on an infinitesimal perturbation. So we might write

\[\delta f = \frac{\partial f}{\partial x} \delta x\]

(It is common to use \(\delta x\) or \(dx\) for these infinitesimals.) This makes differentiating inner products look like the ordinary product rule

\[\delta \left( \mathbf{x}^T \mathbf{y} \right) = \left( \delta \mathbf{x} \right)^T \mathbf{y} + \mathbf{x}^T \left( \delta \mathbf{y} \right)\]

A powerful example of variational notation is differentiating a matrix inverse

\[0 = \delta I = \delta \left( A^{-1} A \right) = \left( \delta A^{-1} \right) A + A^{-1} \left( \delta A \right)\]

and thus

\[\delta A^{-1} = - A^{-1} \left( \delta A \right) A^{-1}\]
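
Here is a minimal numeric sanity check of this formula, using an arbitrary random matrix and a small random perturbation.

[ ]:
# A minimal sketch: δ(A^{-1}) ≈ -A^{-1} (δA) A^{-1} up to higher-order terms.
A = randn(4, 4)                      # assumed well-conditioned; re-run if unlucky
dA = 1e-6 * randn(4, 4)
lhs = inv(A + dA) - inv(A)           # actual change in the inverse
rhs = -inv(A) * dA * inv(A)          # first-order (variational) prediction
isapprox(lhs, rhs; rtol=1e-3)        # should be true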

Practice

  1. Differentiate \(f \left( x \right) = A x\) with respect to \(x\)

  2. Differentiate \(f \left( x \right) = A x\) with respect to \(A\)

Optimization

Ok, now we can start putting pieces together.

Given data \(\left( x, y \right)\) and loss function \(L \left( c; x, y \right)\), we wish to find the coefficients \(c\) that minimize the loss, thus yielding the “best predictor” (in a sense that can be made statistically precise). I.e.,

\[\bar{c} = \arg \min_c L \left( c; x, y \right)\]

It is usually desirable to design models such that the loss function is differentiable with respect to the coefficients \(c\), because this allows the use of more efficient optimization methods.

Recall that our forward model is given in terms of the Vandermonde matrix,

\[f \left( x, c \right) = V \left( x \right) c\]

and thus

\[\frac{\partial f}{\partial c} = V \left( x \right)\]

Derivative of loss function

We can now differentiate our loss function

\[L \left( c; x, y \right) = \frac{1}{2} \left\lvert \left\lvert f \left( x, c \right) - y \right\rvert \right\rvert^2 = \frac{1}{2} \sum_i \left( f \left( x_i, c \right) - y_i \right)^2\]

term-by-term as

\[\begin{split} \begin{align} \nabla_c L \left( c; x, y \right) = \frac{\partial L \left( c; x, y \right)}{\partial c} &= \sum_i \left( f \left( x_i, c \right) - y_i \right) \frac{\partial f \left( x_i, c \right)}{\partial c} \\ &= \sum_i \left( f \left( x_i, c \right) - y_i \right) V \left( x_i \right) \end{align}\end{split}\]

where \(V \left( x_i \right)\) is the \(i\)th row of \(V \left( x \right)\).

Alternative derivative

Alternatively, we can take a more linear-algebraic approach and write the same expression as

\[\begin{split} \begin{align} \nabla_c L \left( c; x, y \right) &= \left( f \left( x, c \right) - y \right)^T V \left( x \right) \\ &= \left( V \left( x \right) c - y \right)^T V \left( x \right) \\ &= \left[ V \left( x \right)^T \left( V \left( x \right) c - y \right) \right]^T \end{align}\end{split}\]

where the last line rewrites the row vector as the transpose of the matrix-vector product \(V \left( x \right)^T \left( V \left( x \right) c - y \right)\), which is the form we evaluate in code.
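
As a minimal sketch reusing the helpers from the first cell (the sizes are arbitrary), we can confirm that the two forms of the gradient agree with each other and with a finite-difference approximation.

[ ]:
# A minimal sketch: both expressions for ∇_c L agree, and match finite differences.
x = LinRange(-1, 1, 20)
y = runge.(x) + 0.05 * randn(length(x))
V = vander_chebyshev(x, 6)
c = randn(6)

loss_c(c) = 0.5 * norm(V * c - y)^2
grad_matrix = V' * (V * c - y)                                         # V(x)^T (V(x) c - y)
grad_rows = sum((V[i, :]' * c - y[i]) * V[i, :] for i in 1:length(y))  # term-by-term sum

h = 1e-6
unit(j, n) = [i == j for i in 1:n]
fd = [(loss_c(c + h * unit(j, 6)) - loss_c(c - h * unit(j, 6))) / (2h) for j in 1:6]
(grad_matrix ≈ grad_rows, grad_rows ≈ fd)   # both should be true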

A necessary condition for the loss function to be minimized is that \(\nabla_c L \left( c; x, y \right) = 0\).

  • Is the condition sufficient for general \(f \left( x, c \right)\)?

  • Is the condition sufficient for the linear model \(f \left( x, c \right) = V \left( x \right) c\)?

  • Have we seen this sort of equation before?