2025-10-24 Beyond Linear Models¶
Assumptions of linear models
Look at the data!
Partial derivatives
Loss functions
[1]:
using LinearAlgebra
using Plots
using Polynomials
default(lw=4, ms=5, legendfontsize=12, xtickfontsize=12, ytickfontsize=12)
# Here's our Vandermonde matrix again
function vander(x, k=nothing)
    if isnothing(k)
        k = length(x)
    end
    m = length(x)
    V = ones(m, k)
    for j in 2:k
        V[:, j] = V[:, j-1] .* x
    end
    V
end
# With Chebyshev polynomials
function vander_chebyshev(x, n=nothing)
    if isnothing(n)
        n = length(x) # Square by default
    end
    m = length(x)
    T = ones(m, n)
    if n > 1
        T[:, 2] = x
    end
    for k in 3:n
        #T[:, k] = x .* T[:, k-1]
        T[:, k] = 2 * x .* T[:, k-1] - T[:, k-2]
    end
    T
end
# And for piecewise constant interpolation
function interp_nearest(x, s)
    A = zeros(length(s), length(x))
    for (i, t) in enumerate(s)
        loc = nothing
        dist = Inf
        for (j, u) in enumerate(x)
            if abs(t - u) < dist
                loc = j
                dist = abs(t - u)
            end
        end
        A[i, loc] = 1
    end
    A
end
# And our "bad" function
runge(x) = 1 / (1 + 10*x^2)
# And a utility for points distributed via cos
CosRange(a, b, n) = (a + b)/2 .+ (b - a)/2 * cos.(LinRange(-pi, 0, n))
# And a helper for looking at conditioning
vcond(mat, points, nmax) = [cond(mat(points(-1, 1, n))) for n in 2:nmax]
[1]:
vcond (generic function with 1 method)
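As a quick check of these helpers, here is a sketch comparing the conditioning of the monomial and Chebyshev Vandermonde matrices on cosine-spaced points (the cutoff nmax = 20 is an arbitrary choice):

# Sketch: conditioning of the monomial vs Chebyshev Vandermonde matrices
# on cosine-spaced points (nmax = 20 is arbitrary).
nmax = 20
plot(2:nmax, vcond(vander, CosRange, nmax), yscale=:log10, label="monomial / CosRange")
plot!(2:nmax, vcond(vander_chebyshev, CosRange, nmax), label="Chebyshev / CosRange")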
Why ‘linear’¶
We are currently working with algorithms that express the regression as a linear function of the model parameters. That is, we search for coefficients \(c = \left[ c_1, c_2, \dots \right]^T\) such that

\[ V \left( x \right) c \approx y, \]

where the left hand side is linear in \(c\). In different notation, we are searching for a predictive model

\[ f \left( x, c \right) \approx y \text{ for all } \left( x, y \right) \]

that is linear in \(c\).
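For concreteness, here is a minimal sketch of fitting such a model by linear least squares using the helpers above (the number of points, basis size, and noise level are arbitrary illustrative choices):

# Sketch: a model that is linear in c, fit by least squares.
x = LinRange(-1, 1, 50)                  # sample points (arbitrary)
y = runge.(x) + 0.05 * randn(length(x))  # noisy observations (illustrative noise level)
V = vander_chebyshev(x, 10)              # 10 Chebyshev basis functions (arbitrary)
c = V \ y                                # least-squares solve of V c ≈ y
scatter(x, y, label="data")
plot!(x, V * c, label="fit (linear in c)")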
Assumptions¶
So far, we have been using the following assumptions
The independent variables \(x\) are error-free
The prediction (or “response”) \(f \left( x, c \right)\) is linear in \(c\)
The noise in the measurements \(y\) is independent (uncorrelated)
The noise in the measurements \(y\) has constant variance
There are reasons why all of these assumptions may be undesirable in practice, leading to more complicated methods.
Anscombe’s quartet¶
Anscombe’s quartet is four small data sets that share nearly identical summary statistics (means, variances, correlations, and least-squares regression lines) yet look completely different when plotted. It is the classic argument for looking at the data before trusting a fitted model.
Loss functions¶
The error in a single prediction \(f \left( x_i, c \right)\) of an observation \(\left( x_i, y_i \right)\) is often measured as

\[ \frac{1}{2} \big( f \left( x_i, c \right) - y_i \big)^2, \]

which turns out to have a statistical interpretation when the noise is normally distributed.
It is natural to define the error over the entire data set as

\[ L \left( c; x, y \right) = \frac{1}{2} \sum_i \big( f \left( x_i, c \right) - y_i \big)^2 = \frac{1}{2} \lVert f \left( x, c \right) - y \rVert^2, \]

where I’ve used the notation \(f \left( x, c \right)\) to mean the vector resulting from gathering all of the outputs \(f \left( x_i, c \right)\). The function \(L\) is called the “loss function” and is the key to relaxing the above assumptions.
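In code, a sketch of this loss for the linear-in-\(c\) Chebyshev model above (the name loss and the basis choice are illustrative):

# Sketch: squared-error loss for the linear-in-c model f(x, c) = V(x) c,
# using the Chebyshev basis defined above.
loss(c, x, y) = 0.5 * norm(vander_chebyshev(x, length(c)) * c - y)^2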
Gradient of scalar-valued functions¶
Let’s step back from optimization and consider how to differentiate a function of several variables. Let \(f \left( \mathbf{x} \right)\) be a scalar-valued function of a vector \(\mathbf{x}\). For example,

\[ f \left( \mathbf{x} \right) = x_1^2 + \sin(x_2) e^{3 x_3} . \]

We can evaluate the partial derivative by differentiating with respect to each component \(x_i\) separately (holding the others constant), and collect the results in a (row) vector,

\[ \frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} & \frac{\partial f}{\partial x_3} \end{bmatrix} = \begin{bmatrix} 2 x_1 & \cos(x_2) e^{3 x_3} & 3 \sin(x_2) e^{3 x_3} \end{bmatrix} . \]
Gradient of vector-valued functions¶
Now let’s consider a vector-valued function \(\mathbf{f} \left( \mathbf{x} \right)\). For example,

\[ \mathbf{f} \left( \mathbf{x} \right) = \begin{bmatrix} x_1^2 + \sin(x_2) e^{3 x_3} \\ x_1 x_2^2 / x_3 \end{bmatrix}, \]

and write the derivative as a matrix (the Jacobian), with one row per output and one column per input,

\[ \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \frac{\partial f_1}{\partial x_3} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \frac{\partial f_2}{\partial x_3} \end{bmatrix} = \begin{bmatrix} 2 x_1 & \cos(x_2) e^{3 x_3} & 3 \sin(x_2) e^{3 x_3} \\ x_2^2 / x_3 & 2 x_1 x_2 / x_3 & - x_1 x_2^2 / x_3^2 \end{bmatrix} . \]
Here is a handy resource on partial derivatives for matrices and vectors: https://explained.ai/matrix-calculus/index.html#sec3
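To check the Jacobian above, here is a finite-difference sketch (the test point x0 and step size h are arbitrary):

# Sketch: finite-difference check of the Jacobian of the example above.
f(xv) = [xv[1]^2 + sin(xv[2])*exp(3*xv[3]),
         xv[1]*xv[2]^2/xv[3]]
df(xv) = [2*xv[1]  cos(xv[2])*exp(3*xv[3])  3*sin(xv[2])*exp(3*xv[3]);
          xv[2]^2/xv[3]  2*xv[1]*xv[2]/xv[3]  -xv[1]*xv[2]^2/xv[3]^2]
x0 = [1.1, 0.7, 0.3]                     # arbitrary test point
h = 1e-6                                 # arbitrary step size
J_fd = hcat([(f(x0 + h*(1:3 .== j)) - f(x0 - h*(1:3 .== j))) / (2h) for j in 1:3]...)
maximum(abs.(J_fd - df(x0)))             # should be small (≈ 1e-8 or less)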
Derivative of a dot product¶
(Hang in there; I promise that we’re building to a point!)
Let \(f \left( \mathbf{x} \right) = \mathbf{y}^T \mathbf{x} = \sum_i y_i x_i\) and compute the derivative

\[ \frac{\partial f}{\partial \mathbf{x}} = \begin{bmatrix} y_1 & y_2 & \cdots \end{bmatrix} = \mathbf{y}^T . \]

Note that \(\mathbf{y}^T \mathbf{x} = \mathbf{x}^T \mathbf{y}\) and we have the product rule

\[ \frac{\partial \lVert \mathbf{x} \rVert^2}{\partial \mathbf{x}} = \frac{\partial \left( \mathbf{x}^T \mathbf{x} \right)}{\partial \mathbf{x}} = 2 \mathbf{x}^T . \]

Also,

\[ \frac{\partial \lVert \mathbf{x} - \mathbf{y} \rVert^2}{\partial \mathbf{x}} = \frac{\partial \left( \mathbf{x} - \mathbf{y} \right)^T \left( \mathbf{x} - \mathbf{y} \right)}{\partial \mathbf{x}} = 2 \left( \mathbf{x} - \mathbf{y} \right)^T . \]
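A quick numerical sketch of the last identity (random vectors, forward differences, arbitrary step size):

# Sketch: finite-difference check of d/du ||u - v||^2 = 2 (u - v)^T.
u, v = rand(5), rand(5)
h = 1e-8                                 # arbitrary step size
fd = [(norm(u + h*(1:5 .== j) - v)^2 - norm(u - v)^2) / h for j in 1:5]
maximum(abs.(fd - 2*(u - v)))            # forward difference, so error ~ h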
Variational notation¶
It’s convenient to express derivatives in terms of how they act on an infinitesimal perturbation. So we might write

\[ \delta f = \frac{\partial f}{\partial x} \, \delta x . \]

(It is common to use \(\delta x\) or \(dx\) for these infinitesimals.) This makes inner products look like a normal product rule

\[ \delta \left( \mathbf{x}^T \mathbf{y} \right) = \left( \delta \mathbf{x} \right)^T \mathbf{y} + \mathbf{x}^T \left( \delta \mathbf{y} \right) . \]

A powerful example of variational notation is differentiating a matrix inverse

\[ 0 = \delta I = \delta \left( A^{-1} A \right) = \left( \delta A^{-1} \right) A + A^{-1} \left( \delta A \right), \]

and thus

\[ \delta A^{-1} = - A^{-1} \left( \delta A \right) A^{-1} . \]
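A quick numerical sketch of this identity (random matrix shifted to be well conditioned, arbitrary perturbation size):

# Sketch: check that inv(A + dA) - inv(A) ≈ -inv(A) * dA * inv(A).
A = rand(4, 4) + 4I                      # shift to keep A well conditioned (arbitrary)
dA = 1e-6 * rand(4, 4)                   # small perturbation
lhs = inv(A + dA) - inv(A)
rhs = -inv(A) * dA * inv(A)
maximum(abs.(lhs - rhs))                 # should be O(norm(dA)^2), i.e., tiny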
Practice¶
Differentiate \(f \left( x \right) = A x\) with respect to \(x\)
Differentiate \(f \left( x \right) = A x\) with respect to \(A\)
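Hint (a sketch using variational notation; try the exercises before reading):

\[ \delta f = \delta \left( A x \right) = \left( \delta A \right) x + A \, \delta x . \]

Holding \(A\) fixed gives \(\frac{\partial f}{\partial x} = A\); holding \(x\) fixed, the derivative with respect to \(A\) is the linear map \(\delta A \mapsto \left( \delta A \right) x\).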
Optimization¶
Ok, now we can start putting pieces together.
Given data \(\left( x, y \right)\) and loss function \(L \left( c; x, y \right)\), we wish to find the coefficients \(c\) that minimize the loss, thus yielding the “best predictor” (in a sense that can be made statistically precise). I.e.,

\[ \bar{c} = \arg\min_c L \left( c; x, y \right) . \]

It is usually desirable to design models such that the loss function is differentiable with respect to the coefficients \(c\), because this allows the use of more efficient optimization methods.
Recall that our forward model is given in terms of the Vandermonde matrix,

\[ f \left( x, c \right) = V \left( x \right) c, \]

and thus

\[ \frac{\partial f}{\partial c} = V \left( x \right) . \]
Derivative of loss function¶
We can now differentiate our loss function

\[ L \left( c; x, y \right) = \frac{1}{2} \lVert f \left( x, c \right) - y \rVert^2 = \frac{1}{2} \sum_i \big( f \left( x_i, c \right) - y_i \big)^2 \]

term-by-term as

\[ \nabla_c L \left( c; x, y \right) = \frac{\partial L \left( c; x, y \right)}{\partial c} = \sum_i \big( f \left( x_i, c \right) - y_i \big) \frac{\partial f \left( x_i, c \right)}{\partial c} = \sum_i \big( f \left( x_i, c \right) - y_i \big) V \left( x_i \right), \]

where \(V \left( x_i \right)\) is the \(i\)th row of \(V \left( x \right)\).
Alternative derivative¶
Alternatively, we can take a more linear algebraic approach to write the same expression as

\[ \nabla_c L \left( c; x, y \right) = \big( f \left( x, c \right) - y \big)^T V \left( x \right) = \big( V \left( x \right) c - y \big)^T V \left( x \right), \]

or, transposed into a column vector, \(V \left( x \right)^T \big( V \left( x \right) c - y \big)\).
A necessary condition for the loss function to be minimized is that \(\nabla_c L \left( c; x, y \right) = 0\).
Is the condition sufficient for general \(f \left( x, c \right)\)?
Is the condition sufficient for the linear model \(f \left( x, c \right) = V \left( x \right) c\)?
Have we seen this sort of equation before?
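To connect this back to the code, here is a sketch (Chebyshev basis, made-up noisy samples of runge, arbitrary sizes) showing that the gradient derived above vanishes at the least-squares solution:

# Sketch: the gradient derived above vanishes at the least-squares solution.
x = LinRange(-1, 1, 40)
y = runge.(x) + 0.1 * randn(length(x))   # made-up noisy data
V = vander_chebyshev(x, 8)               # 8 basis functions (arbitrary)
gradL(c) = V' * (V * c - y)              # column form of the gradient above
c_ls = V \ y                             # least-squares solution
norm(gradL(c_ls))                        # ≈ 0 (up to rounding)
norm(gradL(zero(c_ls)))                  # generally far from 0
# Setting gradL(c) = 0 gives V' * V * c = V' * y, i.e., the normal equations.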