2025-10-24 Beyond Linear Models

  • Assumptions of linear models

  • Look at the data!

  • Partial derivatives

  • Loss functions

[1]:
using LinearAlgebra
using Plots
using Polynomials
default(lw=4, ms=5, legendfontsize=12, xtickfontsize=12, ytickfontsize=12)

# Here's our Vandermonde matrix again
function vander(x, k=nothing)
    if isnothing(k)
        k = length(x)
    end
    m = length(x)
    V = ones(m, k)
    for j in 2:k
        V[:, j] = V[:, j-1] .* x
    end
    V
end

# With Chebyshev polynomials
function vander_chebyshev(x, n=nothing)
    if isnothing(n)
        n = length(x) # Square by default
    end
    m = length(x)
    T = ones(m, n)
    if n > 1
        T[:, 2] = x
    end
    for k in 3:n
        # Chebyshev recurrence: T_k(x) = 2 x T_{k-1}(x) - T_{k-2}(x)
        T[:, k] = 2 * x .* T[:, k-1] - T[:, k-2]
    end
    T
end

# And for piecewise constant interpolation
function interp_nearest(x, s)
    A = zeros(length(s), length(x))
    for (i, t) in enumerate(s)
        loc = nothing
        dist = Inf
        for (j, u) in enumerate(x)
            if abs(t - u) < dist
                loc = j
                dist = abs(t - u)
            end
        end
        A[i, loc] = 1
    end
    A
end

# And our "bad" function
runge(x) = 1 / (1 + 10*x^2)

# And a utility for points distributed via cos
CosRange(a, b, n) = (a + b)/2 .+ (b - a)/2 * cos.(LinRange(-pi, 0, n))

# And a helper for looking at conditioning
vcond(mat, points, nmax) = [cond(mat(points(-1, 1, n))) for n in 2:nmax]
[1]:
vcond (generic function with 1 method)

Why ‘linear’

We are currently working with algorithms that express the regression as a linear function of the model parameters. That is, we search for coefficients \(c = \left[ c_1, c_2, \dots \right]^T\) such that

\[V \left( x \right) c \approx y\]

where the left-hand side is linear in \(c\). In different notation, we are searching for a predictive model

\[f \left( x_i, c \right) \approx y_i, \forall \left( x_i, y_i \right)\]

that is linear in \(c\).
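
As a concrete sketch, such a linear-in-\(c\) model can be fit with the helpers defined above; the number of sample points and the basis size below are arbitrary choices for illustration.

[ ]:
# A minimal sketch: fit a linear-in-c model using the Chebyshev Vandermonde matrix above.
# The sample size (20) and basis size (10) are arbitrary choices.
x = LinRange(-1, 1, 20)
y = runge.(x) + 0.05 * randn(length(x))   # noisy samples of the Runge function
V = vander_chebyshev(x, 10)               # rows are the basis evaluated at each x_i
c = V \ y                                 # least-squares coefficients, so V(x) c ≈ y
scatter(x, y, label="data")
plot!(x, V * c, label="Chebyshev fit")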

Assumptions

So far, we have been using the following assumptions

  1. The independent variables \(x\) are error-free

  2. The prediction (or “response”) \(f \left( x, c \right)\) is linear in \(c\)

  3. The noise in the measurements \(y\) is independent (uncorrelated)

  4. The noise in the measurements \(y\) has constant variance

In practice, any of these assumptions may fail to hold, and relaxing them leads to more sophisticated methods.
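
For example, here is a minimal sketch of synthetic data that violates assumption 4: the noise standard deviation grows with \(\left\lvert x \right\rvert\), so its variance is not constant.

[ ]:
# A minimal sketch: data violating assumption 4 (constant variance).
# The noise standard deviation grows with |x| (heteroscedastic noise).
x_h = LinRange(-1, 1, 50)
y_h = runge.(x_h) + 0.2 * abs.(x_h) .* randn(length(x_h))
scatter(x_h, y_h, label="heteroscedastic samples")
plot!(runge, -1, 1, label="runge(x)")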

Anscombe’s quartet

[Figure: Anscombe’s quartet, four datasets with nearly identical summary statistics (means, variances, correlation, and least-squares fit) but strikingly different distributions.]

Loss functions

The error in a single prediction \(f \left( x_i, c \right)\) of an observation \(\left( x_i, y_i \right)\) is often measured as

\[\frac{1}{2} \left( f \left( x_i, c \right) - y_i \right)^2\]

which turns out to have a statistical interpretation when the noise is normally distributed.

It is natural to define the error over the entire data set as

\[\begin{split} \begin{align} L \left( c; x, y \right) &= \sum_i \frac{1}{2} \left( f \left( x_i, c \right) - y_i \right)^2\\ &= \frac{1}{2} \left\lvert \left\lvert f \left( x, c \right) - y \right\rvert \right\rvert^2 \end{align}\end{split}\]

where I’ve used the notation \(f \left( x, c \right)\) to mean the vector obtained by gathering all of the outputs \(f \left( x_i, c \right)\). This function \(L\) is called the “loss function”, and it is the key to relaxing the assumptions above.
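
As a minimal sketch, assuming the linear model \(f \left( x, c \right) = V \left( x \right) c\) that we use below, both the vector form and the term-by-term form of the loss can be written directly; the demo arrays are arbitrary.

[ ]:
# A minimal sketch of the loss function for a linear model f(x, c) = V c.
# Both forms compute the same number; norm comes from LinearAlgebra.
loss(c, V, y) = 0.5 * norm(V * c - y)^2
loss_terms(c, V, y) = 0.5 * sum((V[i, :]' * c - y[i])^2 for i in 1:length(y))

Vdemo = vander_chebyshev(LinRange(-1, 1, 5), 3)
cdemo = [1.0, 0.5, -0.25]
ydemo = randn(5)
loss(cdemo, Vdemo, ydemo) ≈ loss_terms(cdemo, Vdemo, ydemo)   # should be true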

Gradient of scalar-valued functions

Let’s step back from optimization and consider how to differentiate a function of several variables. Let \(f \left( \mathbf{x} \right)\) be a function of a vector \(\mathbf{x}\). For example,

\[f \left( \mathbf{x} \right) = x_1^2 + \sin \left( x_2 \right) e^{3 x_3}\]

We can evaluate the partial derivatives by differentiating with respect to each component \(x_i\) separately (holding the others constant) and collecting the results in a row vector,

\[\begin{split} \begin{align} \frac{\partial f}{\partial \mathbf{x}} &= \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \frac{\partial f}{\partial x_3} \end{bmatrix}\\ &= \begin{bmatrix} 2 x_1 & \cos \left( x_2 \right) e^{3 x_3} & 3 \sin \left( x_2 \right) e^{3 x_3} \end{bmatrix} \end{align}\end{split}\]
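
We can sanity-check this hand-computed gradient against a centered finite-difference approximation; the test point and step size in this sketch are arbitrary.

[ ]:
# A minimal sketch: compare the analytic gradient with centered finite differences.
f_scalar(x) = x[1]^2 + sin(x[2]) * exp(3 * x[3])
grad_scalar(x) = [2 * x[1], cos(x[2]) * exp(3 * x[3]), 3 * sin(x[2]) * exp(3 * x[3])]

x0 = [0.5, 0.3, 0.1]
h = 1e-6
unit(j, n) = [i == j for i in 1:n]   # j-th standard basis vector of length n
fd = [(f_scalar(x0 + h * unit(j, 3)) - f_scalar(x0 - h * unit(j, 3))) / (2h) for j in 1:3]
grad_scalar(x0) ≈ fd   # agrees up to finite-difference error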

Gradient of vector-valued functions

Now let’s consider a vector-valued function \(\mathbf{f} \left( \mathbf{x} \right)\). For example,

\[\begin{split}\mathbf{f} \left( \mathbf{x} \right) = \begin{bmatrix} x_1^2 + \sin \left( x_2 \right) e^{3 x_3}\\ x_1 x_2^2 / x_3 \end{bmatrix}\end{split}\]

and write the derivative (the Jacobian) as a matrix

\[\begin{split} \begin{align} \frac{\partial \mathbf{f}}{\partial \mathbf{x}} &= \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \frac{\partial f_1}{\partial x_3}\\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \frac{\partial f_2}{\partial x_3} \end{bmatrix}\\ &= \begin{bmatrix} 2 x_1 & \cos \left( x_2 \right) e^{3 x_3} & 3 \sin \left( x_2 \right) e^{3 x_3}\\ x_2^2 / x_3 & 2 x_1 x_2 / x_3 & -x_1 x_2^2 / x_3^2 \end{bmatrix} \end{align}\end{split}\]
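
The same finite-difference check works column by column for this Jacobian; again, the test point and step size in this sketch are arbitrary.

[ ]:
# A minimal sketch: compare the analytic Jacobian with centered finite differences.
f_vec(x) = [x[1]^2 + sin(x[2]) * exp(3 * x[3]),
            x[1] * x[2]^2 / x[3]]
jac_vec(x) = [2*x[1]       cos(x[2])*exp(3*x[3])  3*sin(x[2])*exp(3*x[3]);
              x[2]^2/x[3]  2*x[1]*x[2]/x[3]       (-x[1]*x[2]^2/x[3]^2)]

x0 = [0.5, 0.3, 0.1]
h = 1e-6
unit(j, n) = [i == j for i in 1:n]
Jfd = hcat([(f_vec(x0 + h * unit(j, 3)) - f_vec(x0 - h * unit(j, 3))) / (2h) for j in 1:3]...)
jac_vec(x0) ≈ Jfd   # agrees up to finite-difference error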

Here is a handy resource on partial derivatives for matrices and vectors: https://explained.ai/matrix-calculus/index.html#sec3

Derivative of a dot product

(Hang in there; I promise that we’re building to a point!)

Let \(f \left( \mathbf{x} \right) = \mathbf{y}^T \mathbf{x} = \sum_i y_i x_i\) and compute the derivative

\[\frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} y_1 & y_2 & \cdots \end{bmatrix} = \mathbf{y}^T\]

Note that \(\mathbf{y}^T \mathbf{x} = \mathbf{x}^T \mathbf{y}\), so applying the result above to both factors (the product rule) gives

\[\frac{\partial \left\lvert \left\lvert \mathbf{x} \right\rvert \right\rvert^2}{\partial \mathbf{x}} = \frac{\partial \mathbf{x}^T \mathbf{x}}{\partial \mathbf{x}} = 2 \mathbf{x}^T\]

Also,

\[\frac{\partial \left\lvert \left\lvert \mathbf{x} - \mathbf{y} \right\rvert \right\rvert^2}{\partial \mathbf{x}} = \frac{\partial (\mathbf{x} - \mathbf{y})^T (\mathbf{x} - \mathbf{y})}{\partial \mathbf{x}} = 2 (\mathbf{x} - \mathbf{y})^T\]
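
A quick numeric check of the last identity, sketched with arbitrary random vectors:

[ ]:
# A minimal sketch: verify d||x - y||^2 / dx = 2 (x - y)^T by finite differences.
xv = randn(4)
yv = randn(4)
g(x) = sum((x - yv).^2)              # ||x - yv||^2
h = 1e-6
unit(j, n) = [i == j for i in 1:n]
fd = [(g(xv + h * unit(j, 4)) - g(xv - h * unit(j, 4))) / (2h) for j in 1:4]
fd ≈ 2 * (xv - yv)   # should be true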

Variational notation

It’s convenient to express derivatives in terms of how they act on an infinitesimal perturbation. So we might write

\[\delta f = \frac{\partial f}{\partial x} \delta x\]

(It is common to use \(\delta x\) or \(dx\) for these infinitesimals.) This makes differentiating inner products look like the ordinary product rule

\[\delta \left( \mathbf{x}^T \mathbf{y} \right) = \left( \delta \mathbf{x} \right)^T \mathbf{y} + \mathbf{x}^T \left( \delta \mathbf{y} \right)\]

A powerful example of variational notation is differentiating a matrix inverse

\[0 = \delta I = \delta \left( A^{-1} A \right) = \left( \delta A^{-1} \right) A + A^{-1} \left( \delta A \right)\]

and thus

\[\delta A^{-1} = - A^{-1} \left( \delta A \right) A^{-1}\]
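
Here is a minimal numeric sanity check of this formula, using an arbitrary random matrix and a small random perturbation.

[ ]:
# A minimal sketch: δ(A^{-1}) ≈ -A^{-1} (δA) A^{-1} up to higher-order terms.
A = randn(4, 4)                      # assumed well-conditioned; re-run if unlucky
dA = 1e-6 * randn(4, 4)
lhs = inv(A + dA) - inv(A)           # actual change in the inverse
rhs = -inv(A) * dA * inv(A)          # first-order (variational) prediction
isapprox(lhs, rhs; rtol=1e-3)        # should be true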

Practice

  1. Differentiate \(f \left( x \right) = A x\) with respect to \(x\)

  2. Differentiate \(f \left( x \right) = A x\) with respect to \(A\)

Optimization

Ok, now we can start putting pieces together.

Given data \(\left( x, y \right)\) and loss function \(L \left( c; x, y \right)\), we wish to find the coefficients \(c\) that minimize the loss, thus yielding the “best predictor” (in a sense that can be made statistically precise). I.e.,

\[\bar{c} = \arg \min_c L \left( c; x, y \right)\]

It is usually desirable to design models such that the loss function is differentiable with respect to the coefficients \(c\), because this allows the use of more efficient optimization methods.

Recall that our forward model is given in terms of the Vandermonde matrix,

\[f \left( x, c \right) = V \left( x \right) c\]

and thus

\[\frac{\partial f}{\partial c} = V \left( x \right)\]

Derivative of loss function

We can now differentiate our loss function

\[L \left( c; x, y \right) = \frac{1}{2} \left\lvert \left\lvert f \left( x, c \right) - y \right\rvert \right\rvert^2 = \frac{1}{2} \sum_i \left( f \left( x_i, c \right) - y_i \right)^2\]

term-by-term as

\[\begin{split} \begin{align} \nabla_c L \left( c; x, y \right) = \frac{\partial L \left( c; x, y \right)}{\partial c} &= \sum_i \left( f \left( x_i, c \right) - y_i \right) \frac{\partial f \left( x_i, c \right)}{\partial c} \\ &= \sum_i \left( f \left( x_i, c \right) - y_i \right) V \left( x_i \right) \end{align}\end{split}\]

where \(V \left( x_i \right)\) is the \(i\)th row of \(V \left( x \right)\).

Alternative derivative

Alternatively, we can take a more linear-algebraic approach and write the same expression as

\[\begin{split} \begin{align} \nabla_c L \left( c; x, y \right) &= \left( f \left( x, c \right) - y \right)^T V \left( x \right) \\ &= \left( V \left( x \right) c - y \right)^T V \left( x \right) \\ &= \left[ V \left( x \right)^T \left( V \left( x \right) c - y \right) \right]^T \end{align}\end{split}\]

where the last line rewrites the row vector as the transpose of the matrix-vector product \(V \left( x \right)^T \left( V \left( x \right) c - y \right)\), which is the form we evaluate in code.
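
As a minimal sketch reusing the helpers from the first cell (the sizes are arbitrary), we can confirm that the two forms of the gradient agree with each other and with a finite-difference approximation.

[ ]:
# A minimal sketch: both expressions for ∇_c L agree, and match finite differences.
x = LinRange(-1, 1, 20)
y = runge.(x) + 0.05 * randn(length(x))
V = vander_chebyshev(x, 6)
c = randn(6)

loss_c(c) = 0.5 * norm(V * c - y)^2
grad_matrix = V' * (V * c - y)                                         # V(x)^T (V(x) c - y)
grad_rows = sum((V[i, :]' * c - y[i]) * V[i, :] for i in 1:length(y))  # term-by-term sum

h = 1e-6
unit(j, n) = [i == j for i in 1:n]
fd = [(loss_c(c + h * unit(j, 6)) - loss_c(c - h * unit(j, 6))) / (2h) for j in 1:6]
(grad_matrix ≈ grad_rows, grad_rows ≈ fd)   # both should be true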

A necessary condition for the loss function to be minimized is that \(\nabla_c L \left( c; x, y \right) = 0\).

  • Is the condition sufficient for general \(f \left( x, c \right)\)?

  • Is the condition sufficient for the linear model \(f \left( x, c \right) = V \left( x \right) c\)?

  • Have we seen this sort of equation before?