{ "cells": [ { "cell_type": "markdown", "id": "83abd682-9c99-44f7-bb58-2e26065fe1d4", "metadata": {}, "source": [ "# 2025-10-24 Beyond Linear Models\n", "\n", "* Assumptions of linear models\n", "\n", "* Look at the data!\n", "\n", "* Partial derivatives\n", "\n", "* Loss functions" ] }, { "cell_type": "code", "execution_count": 1, "id": "fb51563e-29a8-4295-a043-d1cb872a4937", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "vcond (generic function with 1 method)" ] }, "execution_count": 1, "metadata": {}, "output_type": "execute_result" } ], "source": [ "using LinearAlgebra\n", "using Plots\n", "using Polynomials\n", "default(lw=4, ms=5, legendfontsize=12, xtickfontsize=12, ytickfontsize=12)\n", "\n", "# Here's our Vandermonde matrix again\n", "function vander(x, k=nothing)\n", " if isnothing(k)\n", " k = length(x)\n", " end\n", " m = length(x)\n", " V = ones(m, k)\n", " for j in 2:k\n", " V[:, j] = V[:, j-1] .* x\n", " end\n", " V\n", "end\n", "\n", "# With Chebyshev polynomials\n", "function vander_chebyshev(x, n=nothing)\n", " if isnothing(n)\n", " n = length(x) # Square by default\n", " end\n", " m = length(x)\n", " T = ones(m, n)\n", " if n > 1\n", " T[:, 2] = x\n", " end\n", " for k in 3:n\n", " #T[:, k] = x .* T[:, k-1]\n", " T[:, k] = 2 * x .* T[:,k-1] - T[:, k-2]\n", " end\n", " T\n", "end\n", "\n", "# And for piecewise constant interpolation\n", "function interp_nearest(x, s)\n", " A = zeros(length(s), length(x))\n", " for (i, t) in enumerate(s)\n", " loc = nothing\n", " dist = Inf\n", " for (j, u) in enumerate(x)\n", " if abs(t - u) < dist\n", " loc = j\n", " dist = abs(t - u)\n", " end\n", " end\n", " A[i, loc] = 1\n", " end\n", " A\n", "end\n", "\n", "# And our \"bad\" function\n", "runge(x) = 1 / (1 + 10*x^2)\n", "\n", "# And a utility for points distributed via cos\n", "CosRange(a, b, n) = (a + b)/2 .+ (b - a)/2 * cos.(LinRange(-pi, 0, n))\n", "\n", "# And a helper for looking at conditioning\n", "vcond(mat, points, nmax) = 
[cond(mat(points(-1, 1, n))) for n in 2:nmax]" ] }, { "cell_type": "markdown", "id": "45858524-97ea-4e00-90e2-e6a00ca9c548", "metadata": {}, "source": [ "## Why 'linear'\n", "\n", "We are currently working with algorithms that express the regression as a linear function of the model parameters.\n", "That is, we search for coefficients $c = \\left[ c_1, c_2, \\dots \\right]^T$ such that\n", "\n", "$$ V \\left( x \\right) c \\approx y $$\n", "\n", "where the left hand side is linear in $c$.\n", "In different notation, we are searching for a predictive model\n", "\n", "$$ f \\left( x_i, c \\right) \\approx y_i, \\forall \\left( x_i, y_i \\right) $$\n", "\n", "that is linear in $c$." ] }, { "cell_type": "markdown", "id": "f0b5b8f1-6951-4c7e-9a50-d3ca056c7f62", "metadata": {}, "source": [ "## Assumptions\n", "\n", "So far, we have been using the following assumptions\n", "\n", "1) The independent variables $x$ are error-free\n", "\n", "2) The prediction (or \"response\") $f \\left( x, c \\right)$ is linear in $c$\n", "\n", "3) The noise in the measurements $y$ is independent (uncorrelated)\n", "\n", "4) The noise in the measurements $y$ has constant variance\n", "\n", "There are reasons why all of these assumptions may be undesirable in practice, leading to more complicated methods." 
] }, { "attachments": { "506a8e4e-99f6-4c46-ac86-99d701c25847.png": { "image/png": "" } }, "cell_type": "markdown", "id": "3015e975-a26a-4c86-a2d1-a149c290b789", "metadata": {}, "source": [ "## [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet)\n", "\n", "![image.png](attachment:506a8e4e-99f6-4c46-ac86-99d701c25847.png)" ] }, { "cell_type": "markdown", "id": "5fdc6e2f-54a0-4991-b694-703f91b7093e", "metadata": {}, "source": [ "## Loss functions\n", "\n", "The error in a single prediction $f \\left( x_i, c \\right)$ of an observation $\\left( x_i, y_i \\right)$ is often measured as\n", "\n", "$$ \\frac{1}{2} \\left( f \\left( x_i, c \\right) - y_i \\right)^2 $$\n", "\n", "which turns out to have a statistical interpretation when the noise is normally distributed.\n", "\n", "It is natural to define the error over the entire data set as\n", "\n", "$$ \\begin{align}\n", "L \\left( c; x, y \\right) &= \\sum_i \\frac{1}{2} \\left( f \\left( x_i, c \\right) - y_i \\right)^2\\\\\n", " &= \\frac{1}{2} \\left\\lvert \\left\\lvert f \\left( x, c \\right) - y \\right\\rvert \\right\\rvert^2\n", "\\end{align} $$\n", "\n", "where I've used the notation $f \\left( x, c \\right)$ to mean the vector resulting from gathering all of the outputs $f \\left( x_i, c \\right)$.\n", "The function is called the \"loss function\" and is the key to relaxing the above assumptions." 
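, "\n", "\n", "As a rough sketch of the definitions above, the snippet below evaluates the loss both as a sum over observations and as a squared norm. The data `xdata` and `ydata`, the straight-line `model`, and the names `loss_sum` and `loss` are made up for illustration; they are assumptions, not part of the code used elsewhere in this notebook.\n", "\n", "```julia\n", "using LinearAlgebra\n", "\n", "# Hypothetical data lying exactly on the line y = 1 + 2x (illustration only)\n", "xdata = [0.0, 1.0, 2.0, 3.0]\n", "ydata = [1.0, 3.0, 5.0, 7.0]\n", "\n", "# Hypothetical model f(x, c) = c[1] + c[2] x, which is linear in c\n", "model(x, c) = c[1] .+ c[2] .* x\n", "\n", "# Loss written as a sum over observations\n", "loss_sum(c, x, y) = sum(0.5 * (model(xi, c) - yi)^2 for (xi, yi) in zip(x, y))\n", "\n", "# Same loss written as half the squared 2-norm of the residual vector\n", "loss(c, x, y) = 0.5 * norm(model(x, c) - y)^2\n", "\n", "loss([1.0, 2.0], xdata, ydata)      # zero: the model reproduces the data\n", "loss([0.0, 2.0], xdata, ydata)      # 2.0: every residual is -1\n", "loss_sum([0.0, 2.0], xdata, ydata)  # agrees with the norm form\n", "```\n", "\n", "Because `model` is linear in `c`, this loss is a quadratic function of `c`."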
] }, { "cell_type": "markdown", "id": "5751cc01-52a2-4bcb-b418-83a6139a327a", "metadata": {}, "source": [ "## Gradient of scalar-valued functions\n", "\n", "Let's step back from optimization and consider how to differentiate a function of several variables.\n", "Let $f \\left( \\mathbf{x} \\right)$ be a function of a vector $\\mathbf{x}$.\n", "For example,\n", "\n", "$$ f \\left( \\mathbf{x} \\right) = x_1^2 + \\sin \\left( x_2 \\right) e^{3 x_3} $$\n", "\n", "We can evaluate the gradient by taking the partial derivative with respect to each component $x_i$ separately (holding the others constant) and collecting the results in a row vector,\n", "\n", "$$ \\begin{align}\n", "\\frac{\\partial f}{\\partial \\mathbf{x}} &= \\begin{bmatrix} \\frac{\\partial f}{\\partial x_1} & \\frac{\\partial f}{\\partial x_2} & \\frac{\\partial f}{\\partial x_3} \\end{bmatrix}\\\\\n", " &= \\begin{bmatrix} 2 x_1 & \\cos \\left( x_2 \\right) e^{3 x_3} & 3 \\sin \\left( x_2 \\right) e^{3 x_3} \\end{bmatrix}\n", "\\end{align} $$" ] }, { "cell_type": "markdown", "id": "0aed0522-7557-4659-8181-79cff6a90e9e", "metadata": {}, "source": [ "## Gradient of vector-valued functions\n", "\n", "Now let's consider a vector-valued function $\\mathbf{f} \\left( \\mathbf{x} \\right)$.\n", "For example, take\n", "\n", "$$ \\mathbf{f} \\left( \\mathbf{x} \\right) = \\begin{bmatrix} x_1^2 + \\sin \\left( x_2 \\right) e^{3 x_3}\\\\ x_1 x_2^2 / x_3 \\end{bmatrix} $$\n", "\n", "and write its derivative as a matrix (the Jacobian), with one row for each component of $\\mathbf{f}$,\n", "\n", "$$ \\begin{align}\n", "\\frac{\\partial \\mathbf{f}}{\\partial \\mathbf{x}} &= \\begin{bmatrix} \\frac{\\partial f_1}{\\partial x_1} & \\frac{\\partial f_1}{\\partial x_2} & \\frac{\\partial f_1}{\\partial x_3}\\\\ \\frac{\\partial f_2}{\\partial x_1} & \\frac{\\partial f_2}{\\partial x_2} & \\frac{\\partial f_2}{\\partial x_3} \\end{bmatrix}\\\\\n", " &= \\begin{bmatrix} 2 x_1 & \\cos \\left( x_2 \\right) e^{3 x_3} & 3 \\sin \\left( x_2 \\right) e^{3 x_3}\\\\ x_2^2 / x_3 & 2 x_1 x_2 / x_3 & -x_1 x_2^2 
/ x_3^2 \\end{bmatrix}\n", "\\end{align} $$\n", "\n", "Here is a handy resource on partial derivatives for matrices and vectors: https://explained.ai/matrix-calculus/index.html#sec3" ] }, { "cell_type": "markdown", "id": "60ac1e57-7708-4142-bec2-cc5fc1af60df", "metadata": {}, "source": [ "## Derivative of a dot product\n", "\n", "(Hang in there; I promise that we're building to a point!)\n", "\n", "Let $f \\left( \\mathbf{x} \\right) = \\mathbf{y}^T \\mathbf{x} = \\sum_i y_i x_i$ and compute the derivative\n", "\n", "$$ \\frac{\\partial f}{\\partial \\mathbf{x}} = \\begin{bmatrix} y_1 & y_2 & \\cdots \\end{bmatrix} = \\mathbf{y}^T $$\n", "\n", "Note that $\\mathbf{y}^T \\mathbf{x} = \\mathbf{x}^T \\mathbf{y}$, so applying the product rule to both factors gives\n", "\n", "$$ \\frac{\\partial \\left\\lvert \\left\\lvert \\mathbf{x} \\right\\rvert \\right\\rvert^2}{\\partial \\mathbf{x}} = \\frac{\\partial \\mathbf{x}^T \\mathbf{x}}{\\partial \\mathbf{x}} = 2 \\mathbf{x}^T $$\n", "\n", "Similarly, with $\\mathbf{y}$ held constant,\n", "\n", "$$ \\frac{\\partial \\left\\lvert \\left\\lvert \\mathbf{x} - \\mathbf{y} \\right\\rvert \\right\\rvert^2}{\\partial \\mathbf{x}} = \\frac{\\partial (\\mathbf{x} - \\mathbf{y})^T (\\mathbf{x} - \\mathbf{y})}{\\partial \\mathbf{x}} = 2 (\\mathbf{x} - \\mathbf{y})^T $$" ] }, { "cell_type": "markdown", "id": "38f6af7b-d716-49c3-9afc-7c3dda9ff9c6", "metadata": {}, "source": [ "## Variational notation\n", "\n", "It's convenient to express derivatives in terms of how they act on an infinitesimal perturbation.\n", "So we might write\n", "\n", "$$ \\delta f = \\frac{\\partial f}{\\partial x} \\delta x $$\n", "\n", "(It is common to use $\\delta x$ or $dx$ for these infinitesimals.)\n", "This makes the derivative of an inner product look like the ordinary product rule\n", "\n", "$$ \\delta \\left( \\mathbf{x}^T \\mathbf{y} \\right) = \\left( \\delta \\mathbf{x} \\right)^T \\mathbf{y} + \\mathbf{x}^T \\left( \\delta \\mathbf{y} \\right) $$\n", "\n", "A powerful example of variational notation is differentiating a matrix 
inverse\n", "\n", "$$ 0 = \\delta I = \\delta \\left( A^{-1} A \\right) = \\left( \\delta A^{-1} \\right) A + A^{-1} \\left( \\delta A \\right) $$\n", "\n", "and thus\n", "\n", "$$ \\delta A^{-1} = - A^{-1} \\left( \\delta A \\right) A^{-1} $$" ] }, { "cell_type": "markdown", "id": "40d0e547-ba19-40ec-a254-226e41d35156", "metadata": {}, "source": [ "## Practice\n", "\n", "1) Differentiate $f \\left( x \\right) = A x$ with respect to $x$\n", "\n", "2) Differentiate $f \\left( x \\right) = A x$ with respect to $A$" ] }, { "cell_type": "markdown", "id": "326aed8a-0fde-403a-9e8f-b57c93115dbf", "metadata": {}, "source": [ "## Optimization\n", "\n", "Ok, now we can start putting pieces together.\n", "\n", "Given data $\\left( x, y \\right)$ and loss function $L \\left( c; x, y \\right)$, we wish to find the coefficients $c$ that minimize the loss, thus yielding the \"best predictor\" (in a sense that can be made statistically precise). I.e.,\n", "\n", "$$ \\bar{c} = \\arg \\min_c L \\left( c; x, y \\right) $$\n", "\n", "It is usually desirable to design models such that the loss function is differentiable with respect to the coefficients $c$, because this allows the use of more efficient optimization methods.\n", "\n", "Recall that our forward model is given in terms of the Vandermonde matrix,\n", "\n", "$$ f \\left( x, c \\right) = V \\left( x \\right) c $$\n", "\n", "and thus\n", "\n", "$$ \\frac{\\partial f}{\\partial c} = V \\left( x \\right) $$" ] }, { "cell_type": "markdown", "id": "2d207ca9-9396-450a-bd08-7ed375264d4c", "metadata": {}, "source": [ "## Derivative of loss function\n", "\n", "We can now differentiate our loss function\n", "\n", "$$ L \\left( c; x, y \\right) = \\frac{1}{2} \\left\\lvert \\left\\lvert f \\left( x, c \\right) - y \\right\\rvert \\right\\rvert^2 = \\frac{1}{2} \\sum_i \\left( f \\left( x_i, c \\right) - y_i \\right)^2 $$\n", "\n", "term-by-term as\n", "\n", "$$ \\begin{align}\n", "\\nabla_c L \\left( c; x, y \\right) = \\frac{\\partial 
L \\left( c; x, y \\right)}{\\partial c}\n", " &= \\sum_i \\left( f \\left( x_i, c \\right) - y_i \\right) \\frac{\\partial f \\left( x_i, c \\right)}{\\partial c} \\\\\n", " &= \\sum_i \\left( f \\left( x_i, c \\right) - y_i \\right) V \\left( x_i \\right)\n", "\\end{align} $$\n", "\n", "where $V \\left( x_i \\right)$ is the $i$th row of $V \\left( x \\right)$." ] }, { "cell_type": "markdown", "id": "cbb4e81c-ec49-46a4-b43d-9ae43d7142a0", "metadata": {}, "source": [ "## Alternative derivative\n", "\n", "Alternatively, we can take a more linear algebraic approach to write the same expression as\n", "\n", "$$ \\begin{align}\n", "\\nabla_c L \\left( c; x, y \\right)\n", " &= \\left( f \\left( x, c \\right) - y \\right)^T V \\left( x \\right) \\\\\n", " &= \\left( V \\left( x \\right) c - y \\right)^T V \\left( x \\right)\n", "\\end{align} $$\n", "\n", "which is a row vector. Transposing gives the equivalent column vector\n", "\n", "$$ \\left( \\nabla_c L \\left( c; x, y \\right) \\right)^T = V \\left( x \\right)^T \\left( V \\left( x \\right) c - y \\right) $$\n", "\n", "A necessary condition for the loss function to be minimized is that $\\nabla_c L \\left( c; x, y \\right) = 0$.\n", "\n", "* Is the condition sufficient for general $f \\left( x, c \\right)$?\n", "\n", "* Is the condition sufficient for the linear model $f \\left( x, c \\right) = V \\left( x \\right) c$?\n", "\n", "* Have we seen this sort of equation before?" ] } ], "metadata": { "kernelspec": { "display_name": "Julia 1.11.6", "language": "julia", "name": "julia-1.11" }, "language_info": { "file_extension": ".jl", "mimetype": "application/julia", "name": "julia", "version": "1.11.6" } }, "nbformat": 4, "nbformat_minor": 5 }