2025-11-03 Optimization

  • Differentiation

  • Second order (Newton type) optimization

  • Project discussion

[1]:
using LinearAlgebra
using Plots
using Polynomials
default(lw=4, ms=5, legendfontsize=12, xtickfontsize=12, ytickfontsize=12)

# Here's our Vandermonde matrix again
function vander(x, k=nothing)
    if isnothing(k)
        k = length(x)
    end
    m = length(x)
    V = ones(m, k)
    for j in 2:k
        V[:, j] = V[:, j-1] .* x
    end
    V
end

# With Chebyshev polynomials
function vander_chebyshev(x, n=nothing)
    if isnothing(n)
        n = length(x) # Square by default
    end
    m = length(x)
    T = ones(m, n)
    if n > 1
        T[:, 2] = x
    end
    for k in 3:n
        #T[:, k] = x .* T[:, k-1]
        T[:, k] = 2 * x .* T[:,k-1] - T[:, k-2]
    end
    T
end

# And our "bad" function
runge(x) = 1 / (1 + 10*x^2)
runge_noisy(x, sigma) = runge.(x) + randn(size(x)) * sigma

# And our gradient descent algorithm
function grad_descent(loss, grad, c0; gamma=1e-3, tol=1e-5)
    """Minimize loss(c) via gradient descent with initial guess c0
    using learning rate gamma.  Declares convergence when gradient
    is less than tol or after 500 steps.
    """
    c = copy(c0)
    chist = [copy(c)]
    lhist = [loss(c)]
    for it in 1:500
        g = grad(c)
        c -= gamma * g
        push!(chist, copy(c))
        push!(lhist, loss(c))
        if norm(g) < tol
            break
        end
    end
    (c, hcat(chist...), lhist)
end

# And our function for a finite difference while picking h automatically
function diff_wp(f, x; eps=1e-8)
    """Diff using Walker and Pernice (1998) choice of step"""
    h = eps * (1 + abs(x))
    (f(x+h) - f(x)) / h
end
[1]:
diff_wp (generic function with 1 method)
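
As a quick smoke test of these pieces (the degree and learning rate below are guesses, not values from the lecture), one can fit a Chebyshev model to noisy Runge data by gradient descent on the least-squares loss:

x = LinRange(-1, 1, 50)
y = runge_noisy(x, 0.1)
V = vander_chebyshev(x, 8)               # degree-7 Chebyshev model, linear in c
loss(c) = 0.5 * norm(V * c - y)^2        # least-squares loss
grad(c) = V' * (V * c - y)               # its gradient
c, chist, lhist = grad_descent(loss, grad, zeros(8); gamma=1e-2)
plot(lhist, yscale=:log10, xlabel="iteration", ylabel="loss", label="gradient descent")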

Hand-coded derivatives

With (mild) abuse of notation, the expression

\[\frac{df}{dx} = f' \left( x \right)\]

is equivalent to

\[df = f' \left( x \right) dx\]
[2]:
function f(x)
    y = x
    for _ in 1:2
        a = y^π
        b = cos(a)
        c = log(y)
        y = b * c
    end
    y
end

f(1.9), diff_wp(f, 1.9)
[2]:
(-1.5346823414986814, -34.032439961925064)
[3]:
function df(x, dx)
    y = x
    dy = dx
    for _ in 1:2
        a = y^π
        da = π * y^(π - 1) * dy
        b = cos(a)
        db = -sin(a) * da
        c = log(y)
        dc = dy / y
        y = b * c
        dy = db * c + b * dc
    end
    y, dy
end

df(1.9, 1)
[3]:
(-1.5346823414986814, -34.032419599140475)
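
The bookkeeping in df can be automated by overloading arithmetic to carry the pair (value, derivative) through the computation. Here is a minimal forward-mode sketch: a toy Dual type covering only the operations f uses (a real tool such as ForwardDiff.jl covers the whole language).

struct Dual              # a value together with its derivative
    val::Float64
    dot::Float64
end
Base.:^(a::Dual, p::Real) = Dual(a.val^p, p * a.val^(p - 1) * a.dot)
Base.cos(a::Dual) = Dual(cos(a.val), -sin(a.val) * a.dot)
Base.log(a::Dual) = Dual(log(a.val), a.dot / a.val)
Base.:*(a::Dual, b::Dual) = Dual(a.val * b.val, a.val * b.dot + a.dot * b.val)

f(Dual(1.9, 1.0))        # should reproduce df(1.9, 1)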

Forward vs reverse mode

We can differentiate a composition \(h \left( g \left( f \left( x \right) \right) \right)\) as

\[\begin{split} \begin{align} \operatorname{d} h &= h' \operatorname{d} g \\ \operatorname{d} g &= g' \operatorname{d} f \\ \operatorname{d} f &= f' \operatorname{d} x \end{align}\end{split}\]

What we’ve done above is called “forward mode”, and amounts to placing the parentheses in the chain rule like

\[\operatorname{d} h = \frac{dh}{dg} \left( \frac{dg}{df} \left( \frac{df}{dx} \operatorname{d} x \right) \right)\]

This expression means the same thing if we rearrange the parentheses,

\[\operatorname{d} h = \left( \left( \left( \frac{dh}{dg} \right) \frac{dg}{df} \right) \frac{df}{dx} \right) \operatorname{d} x\]

Reverse mode example

Let’s do an example to better understand.

\[\underbrace{\bar x}_{\frac{dh}{dx}} = \underbrace{\bar g \frac{dg}{df}}_{\bar f} \frac{df}{dx}\]
[4]:
function g(x)
    a = x^π
    b = cos(a)
    c = log(x)
    y = b * c
    y
end

(g(1.9), diff_wp(g, 1.4))  # note: value at x = 1.9, finite-difference derivative at x = 1.4
[4]:
(0.2155134138380423, -1.2559760384500684)
[5]:
function gback(x, y_)
    a = x^π
    b = cos(a)
    c = log(x)
    y = b * c
    # backward pass
    c_ = y_ * b
    b_ = c * y_
    a_ = -sin(a) * b_
    x_ = 1/x * c_ + π * x^(π - 1) * a_
end

gback(1.4, 1)
[5]:
-1.2559761698835525

Automatic differentiation

See also Enzyme.jl

[6]:
using Pkg
pkg"add Zygote"
import Zygote
   Resolving package versions...
  No Changes to `~/.julia/environments/v1.11/Project.toml`
  No Changes to `~/.julia/environments/v1.11/Manifest.toml`
[7]:
Zygote.gradient(g, 1.4)
[7]:
(-1.2559761698835525,)
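
Zygote is not restricted to single expressions; it also differentiates through the loop in f defined earlier, and the result should match the hand-coded df(1.9, 1):

Zygote.gradient(f, 1.9)   # compare with df(1.9, 1)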

But how?

It’s cool that Zygote works, but how does it actually work?

[8]:
square(x) = x^2
# Let's look at the LLVM bitcode here
@code_llvm square(1.5)
; Function Signature: square(Float64)
;  @ In[8]:1 within `square`
define double @julia_square_11421(double %"x::Float64") #0 {
top:
; ┌ @ intfuncs.jl:370 within `literal_pow`
; │┌ @ float.jl:493 within `*`
    %0 = fmul double %"x::Float64", %"x::Float64"
    ret double %0
; └└
}
[9]:
# And here is the LLVM bitcode of Zygote's derivative
@code_llvm Zygote.gradient(square, 1.5)
; Function Signature: gradient(typeof(Main.square), Float64)
;  @ /home/jeremy/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:152 within `gradient`
define [1 x double] @julia_gradient_11636(double %"args[1]::Float64") #0 {
top:
;  @ /home/jeremy/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:154 within `gradient`
; ┌ @ /home/jeremy/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:97 within `#88`
; │┌ @ In[8]:1 within `square`
; ││┌ @ /home/jeremy/.julia/packages/Zygote/55SqB/src/compiler/chainrules.jl:222 within `ZBack`
; │││┌ @ /home/jeremy/.julia/packages/Zygote/55SqB/src/lib/number.jl:12 within `literal_pow_pullback`
; ││││┌ @ promotion.jl:430 within `*` @ float.jl:493
       %0 = fmul double %"args[1]::Float64", 2.000000e+00
; └└└└└
;  @ /home/jeremy/.julia/packages/Zygote/55SqB/src/compiler/interface.jl:155 within `gradient`
  %"new::Tuple2.unbox.fca.0.insert" = insertvalue [1 x double] zeroinitializer, double %0, 0
  ret [1 x double] %"new::Tuple2.unbox.fca.0.insert"
}
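
The entire reverse pass for square has been compiled down to a single multiply by 2. One way to peek at the pieces Zygote composes is pullback, which returns the primal value together with a closure that applies the chain rule backwards:

y, back = Zygote.pullback(square, 1.5)
y, back(1.0)    # primal value, and the gradients w.r.t. the inputs seeded with dy = 1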

Types of algorithmic differentiation

  • Source transformation: Fortran code in, Fortran code out

    • Duplicates compiler features, usually incomplete language coverage

    • Produces efficient code

  • Operator overloading: C++ types

    • Hard to vectorize

    • Loops are effectively unrolled/inefficient

  • Just-in-time compilation: tightly coupled with compiler

    • JIT lag

    • Needs dynamic language features (JAX) or tight integration with compiler (Zygote, Enzyme)

    • Some sharp bits

Forward or reverse mode

Pick forward or reverse mode depending on the ‘shape’ of your function (a small illustration follows the list below).

  • One input, many outputs: use forward mode

• “One input” can also mean looking along a single direction (a directional derivative)

  • Many inputs, one output: use reverse mode

    • Will need to traverse execution backwards (“tape”)

    • Hierarchical checkpointing

• Roughly square (similar numbers of inputs and outputs)? Forward mode is usually a bit more efficient
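
A small illustration of this heuristic (the helper names below are made up): for a function with many inputs and one output, a single reverse pass yields the whole gradient, while forward mode needs one pass per input direction.

h(c) = sum(sin.(c))                       # many inputs, one output
c0 = randn(20)
grad_reverse = Zygote.gradient(h, c0)[1]  # one reverse pass gives all 20 components

# Forward mode propagates one direction at a time; a finite difference stands in here.
directional(h, c, v; eps=1e-8) = (h(c + eps * v) - h(c)) / eps
e(j, n) = (v = zeros(n); v[j] = 1; v)     # j-th coordinate direction
grad_forward = [directional(h, c0, e(j, 20)) for j in 1:20]   # 20 passes for 20 inputs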

Differentiating an algorithm?

  • Consider an input \(c\) and output \(x\) such that \(f \left( x, c \right) = 0\).

  • Consider an input \(A\) and output \(\lambda\) such that \(A x = \lambda x\) for some nonzero vector \(x\)

  • Consider an input buffer and an output sha256(buffer)
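
These are all cases where the output is defined only implicitly by the input. For the first case, differentiating \(f \left( x, c \right) = 0\) with respect to \(c\) gives a linear solve rather than a literal differentiation of the algorithm (e.g., the Newton loop) that computed \(x\):

\[\frac{\partial f}{\partial x} \frac{dx}{dc} + \frac{\partial f}{\partial c} = 0 \quad \Longrightarrow \quad \frac{dx}{dc} = - \left( \frac{\partial f}{\partial x} \right)^{-1} \frac{\partial f}{\partial c}\]

The eigenvalue case has an analogous implicit formula, while sha256 is a reminder that not every algorithm has a meaningful derivative at all.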

Ill-conditioned optimization

\[L \left( c; x, y \right) = \frac 1 2 \lVert \underbrace{f \left( x, c \right) - y}_{r \left( c \right)} \rVert_{C^{-1}}^2\]

Gradient of \(L\) requires the Jacobian \(J\) of the model \(f\).

\[g \left( c \right) = \nabla_c L = r^T \underbrace{\nabla_c f}_{J}\]

We can solve \(g \left( c \right) = 0\) using a Newton method

\[g \left( c + \delta c \right) = g \left( c \right) + \underbrace{\nabla_c g}_{H} \delta c + \mathcal{O} \left( \left( \delta c \right)^2 \right)\]

The Hessian requires the second derivative of \(f\), which can cause problems

\[H = J^T J + r^T \left( \nabla_c J \right)\]

Consider: if the Jacobian (first derivative) already has far more terms than the loss function itself, that growth compounds in the Hessian.

Newton-like methods for optimization

Solve

\[H \delta c = - g \left( c \right)\]

Update \(c \leftarrow c + \gamma \delta c\) using a line search or trust region.
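
A minimal sketch of such a step, assuming the caller supplies residual(c) and jacobian(c) (both names hypothetical): with lambda = 0 this is a Gauss-Newton step, and lambda > 0 gives the Levenberg-Marquardt blend toward gradient descent mentioned in the outlook below.

# One Newton-like step for minimizing 1/2 * norm(residual(c))^2
function newtonish_step(residual, jacobian, c; lambda=0, gamma=1)
    r = residual(c)
    J = jacobian(c)
    g = J' * r               # gradient of the loss
    H = J' * J + lambda * I  # Gauss-Newton Hessian, dropping the second-derivative term
    dc = -(H \ g)
    c + gamma * dc           # gamma chosen by a line search or trust region
end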

Outlook

  • The optimization problem can be solved using a Newton method. It can be onerous to implement the needed derivatives.

  • The Gauss-Newton method is often more practical than Newton while being faster than gradient descent, though it lacks robustness.

  • The Levenberg-Marquardt method provides a sort of middle-ground between Gauss-Newton and gradient descent.

  • Many globalization techniques are used for models that possess many local minima.

  • One pervasive approach is stochastic gradient descent, where small batches (e.g., 1 or 10 or 20) are selected randomly from the corpus of observations, and a step of gradient descent is applied to that reduced set of observations. This helps to escape saddle points and weak local minima (see the sketch after this list).

  • Among expressive models \(f \left( x, c \right)\), some may converge much more easily than others. Having a good optimization algorithm is essential for nonlinear regression with complicated models, especially those with many parameters \(c\).

  • Classification is a very similar problem to regression, but the observations \(y\) are discrete, thus

    • models \(f \left( x, c \right)\) must have discrete output

    • the least squares loss function is not appropriate

  • Why momentum really works: https://distill.pub/2017/momentum/
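
Here is a minimal sketch of the mini-batch idea described above for a linear least-squares model V * c ≈ y (the names V, y and the batch size are illustrative, not from the lecture):

using Random

function sgd(V, y, c0; batch=10, gamma=1e-2, epochs=100)
    c = copy(c0)
    m = length(y)
    for _ in 1:epochs
        for idx in Iterators.partition(Random.shuffle(1:m), batch)
            Vb, yb = V[idx, :], y[idx]
            c -= gamma * (Vb' * (Vb * c - yb))   # gradient on the mini-batch only
        end
    end
    c
end

# e.g. c = sgd(V, y, zeros(size(V, 2)))  # with a design matrix V and observations y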