2025-09-08 Newton Methods

  • Newton’s method via Taylor series

  • Convergence theory for fixed point methods

  • Derive Newton’s method via convergence theory

  • Newton methods in computing culture

  • Breaking Newton’s method

See also FNC

Bisection

[1]:
using Plots
default(lw=4, ms=5, legendfontsize=12, xtickfontsize=12, ytickfontsize=12)

# Bisection function
hasroot(f, a, b) = f(a) * f(b) < 0
function bisect_iter(f, a, b, tol)
    hist = Float64[]
    while abs(b - a) > tol
        mid = (a + b) / 2
        push!(hist, mid)
        if hasroot(f, a, mid)
            b = mid
        else
            a = mid
        end
    end
    hist
end

# And a target function
f(x) = cos(x) - x
[1]:
f (generic function with 1 method)

Convergence classes

A convergent rootfinding algorithm produces a sequence of approximations \(x_k\) such that

\[\lim_{k \rightarrow \infty} x_k = x_*\]

where \(f \left( x_* \right) = 0\). For analysis, it is convenient to define the errors \(e_k = x_k - x_*\). We say that an iterative algorithm is \(q\)-linearly convergent (https://en.wikipedia.org/wiki/Rate_of_convergence#Q-convergence) if

\[\lim_{k \rightarrow \infty} \lvert e_{k + 1} \rvert / \lvert e_k \rvert = \rho < 1\]

(\(q\) stands for “quotient”). A smaller convergence factor \(\rho\) represents faster convergence.

A slightly weaker condition, \(r\)-linear convergence (https://en.wikipedia.org/wiki/Rate_of_convergence#R-convergence) or simply linear convergence, is that

\[\lvert e_k \rvert \leq \epsilon_k\]

for all sufficiently large \(k\), where the sequence \(\left\lbrace \epsilon_k \right\rbrace\) converges \(q\)-linearly to 0.

[2]:
# Let's define endpoints and find the root
a = -1
b = 3
hist = bisect_iter(f, a, b, 1e-10)
r = hist[end] # What are we trusting?

hist = hist[1:end-1]
scatter(abs.(hist .- r), yscale=:log10, label="\$x_k\$")

ks = 1:length(hist)
plot!(ks, (b - a) * (.5 .^ ks), xlabel="\$k\$", ylabel="\$e_k\$", label = "~\$2^k\$")
[2]:
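The error obeys the \(r\)-linear envelope \(\left( b - a \right) 2^{-k}\) plotted above, but the step-to-step quotient is erratic. A minimal sketch (reusing hist and r from the cell above) makes the distinction between \(q\)- and \(r\)-linear convergence concrete:

# Sketch: q-ratios |e_{k+1}| / |e_k| for the bisection history above.
# They bounce around (and can exceed 1 for some steps), so bisection is
# r-linearly, but not q-linearly, convergent.
err = abs.(hist .- r)
@show extrema(err[2:end] ./ err[1:end-1])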

Newton-Raphson method

Much of numerical analysis reduces to Taylor series: the approximation

\[f \left( x \right) = f \left( x_0 \right) + f' \left( x_0 \right) \left( x - x_0 \right) + f'' \left( x_0 \right) \left( x - x_0 \right)^2 / 2 + \cdots\]

centered on some reference point \(x_0\).

In numerical computation, it is exceedingly rare to look beyond the first-order approximation

\[\tilde{f}_{x_0} \left( x \right) = f \left( x_0 \right) + f' \left( x_0 \right) \left( x - x_0 \right)\]

Since \(\tilde{f}_{x_0} \left( x \right)\) is a linear function, we can explicitly compute the unique solution of \(\tilde{f}_{x_0} \left( x \right) = 0\) as

\[x = x_0 - \frac{f \left( x_0 \right)}{f' \left( x_0 \right)}\]

This is Newton’s Method (aka Newton-Raphson or Newton-Raphson-Simpson) for finding the roots of differentiable functions.

An implementation

[3]:
using Printf

# Newton's method
function newton(f, fp, x0; tol=1e-8, verbose=false)
    x = x0
    for k in 1:100 # max number of iterations
        fx = f(x)
        fpx = fp(x)
        if verbose
            println("k=$k, x_k=$(@sprintf("%.16f", x)), f_k=$(@sprintf("%.16e", fx))")
        end
        if abs(fx) < tol
            return x, fx, k
        end
        x = x - fx / fpx
    end
end

# And a target function
f(x)  =  cos(x) - x
fp(x) = -sin(x) - 1

# Let's plot our target function
plot(f, xlims=[-2, 2], label="\$cos(x) - x\$")

# And find the root
newton(f, fp, 1; tol=1e-15, verbose=true)
k=1, x_k=1.0000000000000000, f_k=-4.5969769413186023e-01
k=2, x_k=0.7503638678402439, f_k=-1.8923073822117442e-02
k=3, x_k=0.7391128909113617, f_k=-4.6455898990771516e-05
k=4, x_k=0.7390851333852840, f_k=-2.8472058044570758e-10
k=5, x_k=0.7390851332151607, f_k=0.0000000000000000e+00
[3]:
(0.7390851332151607, 0.0, 5)

That’s fast convergence!

  • 10 digits of accuracy in 4 iterations

  • How is this convergence test different than the one for bisection?

  • How can Newton’s method break down?

\[x_{k + 1} = x_k - \frac{f \left( x_k \right)}{f' \left( x_k \right)}\]
[4]:
# What if we start with a poor initial guess?
# Note that convergence like before doesn't kick in until around k=20
newton(f, fp, -pi/2+.1; verbose=true)
k=1, x_k=-1.4707963267948965, f_k=1.5706297434417247e+00
k=2, x_k=312.9170549232223948, f_k=-3.1259435002533314e+02
k=3, x_k=-5529.9275427528937144, f_k=5.5306763919178247e+03
k=4, x_k=10868.9459369702435652, f_k=-1.0868376227850798e+04
k=5, x_k=-50136.7073225235581049, f_k=5.0135707777419018e+04
k=6, x_k=-1468.7903453577164328, f_k=1.4688859787856973e+03
k=7, x_k=-732.6603742863298976, f_k=7.3187610943622951e+02
k=8, x_k=-281.0038172368655864, f_k=2.8083589138981239e+02
k=9, x_k=-139.5817486699348819, f_k=1.3979912372811944e+02
k=10, x_k=5706.8561562109989609, f_k=-5.7070086597647050e+03
k=11, x_k=2836.5648158265976235, f_k=-2.8375220962814674e+03
k=12, x_k=635.5038791777569713, f_k=-6.3488396511814790e+02
k=13, x_k=279.7607629875441830, f_k=-2.8074814643252921e+02
k=14, x_k=-53.8070079658355667, f_k=5.2885920826340048e+01
k=15, x_k=-15.7419606182276794, f_k=1.4742538472479499e+01
k=16, x_k=-1.4840596284124175, f_k=1.5706876106501757e+00
k=17, x_k=416.3331582875110257, f_k=-4.1640522744068772e+02
k=18, x_k=207.8594910282616013, f_k=-2.0698889108021109e+02
k=19, x_k=69.1262088526432592, f_k=-6.8126271241735495e+01
k=20, x_k=1.7525179991632598, f_k=-1.9332411629172330e+00
k=21, x_k=0.7778731690296721, f_k=-6.5465483571587102e-02
k=22, x_k=0.7394040200105153, f_k=-5.3373035134818281e-04
k=23, x_k=0.7390851556610822, f_k=-3.7565764610114627e-08
k=24, x_k=0.7390851332151608, f_k=-2.2204460492503131e-16
[4]:
(0.7390851332151608, -2.220446049250313e-16, 24)

Convergence of fixed-point methods (mean value theorem)

Consider the iteration

\[x_{k + 1} = g \left( x_k \right)\]

where \(g\) is a continuously differentiable function. Suppose that there exists a fixed point \(x_* = g \left( x_* \right)\). By the mean value theorem, we have that

\[x_{k + 1} - x_* = g \left( x_k \right) - g \left( x_* \right) = g' \left( c_k \right) \left( x_k - x_* \right)\]

for some \(c_k\) between \(x_k\) and \(x_*\).

Taking absolute values,

\[\left\lvert e_{k + 1} \right\rvert = \left\lvert g' \left( c_k \right) \right\rvert \left\lvert e_k \right\rvert\]

which converges to zero if \(\left\lvert g' \left( c_k \right) \right\rvert \leq \rho < 1\) for all \(k\).
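As a small illustration (a sketch, not part of the original notes): for \(g \left( x \right) = \cos \left( x \right)\), the iterates starting from \(x_0 = 1\) all stay in \(\left[ \cos \left( 1 \right), 1 \right]\), where \(\left\lvert g' \left( x \right) \right\rvert = \left\lvert \sin \left( x \right) \right\rvert \leq \sin \left( 1 \right) < 1\), so each step contracts the error.

# Sketch: check the mean value theorem bound |e_{k+1}| = |g'(c_k)| |e_k| <= rho |e_k|
# for g(x) = cos(x); the names gm and xstar_ref are local to this sketch.
gm(x) = cos(x)
xstar_ref = 0.7390851332151607      # the root found by Newton above
rho = sin(1.0)                      # upper bound for |g'(c_k)| on [cos(1), 1]
xs = [1.0]
for k in 1:5
    push!(xs, gm(xs[end]))          # x_{k+1} = g(x_k)
end
err = abs.(xs .- xstar_ref)
@show err[2:end] ./ err[1:end-1]    # each ratio is |g'(c_k)| and stays below rho
@show rho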


Convergence of fixed-point methods (Taylor series)

Consider the iteration

\[x_{k + 1} = g \left( x_k \right)\]

where \(g\) is a continuously differentiable function. Suppose there exists a fixed point \(x_* = g \left( x_* \right)\). Then there is a Taylor series expansion about \(x_*\),

\[g \left( x_k \right) = g \left( x_* \right) + g' \left( x_* \right) \left( x_k - x_* \right) + \mathcal{O} \left( \left( x_k - x_* \right)^2 \right)\]

and thus

\[x_{k + 1} - x_* = g \left( x_k \right) - g \left( x_* \right) = g' \left( x_* \right) \left( x_k - x_* \right) + \mathcal{O} \left( \left( x_k - x_* \right)^2 \right)\]

In terms of the error \(e_k = x_k - x_*\),

\[\left\lvert \frac{e_{k + 1}}{e_k} \right\rvert = \left\lvert g' \left( x_* \right) \right\rvert + \mathcal{O} \left( e_k \right)\]

Big \(\mathcal{O}\) notation (https://en.wikipedia.org/wiki/Big_O_notation)

Limit \(n \rightarrow \infty\)

We’d say an algorithm costs \(\mathcal{O} \left( n^2 \right)\) if its running time on input of size \(n\) is less than \(c n^2\) for some constant \(c\) and sufficiently large \(n\).

Sometimes we write \(\text{cost} \left( \text{algorithm}, n \right) = \mathcal{O} \left( n^2 \right)\) or (preferably) \(\text{cost} \left( \text{algorithm}, n \right) \in \mathcal{O} \left( n^2 \right)\).

Note that \(\mathcal{O} \left( \log \left( n \right) \right) \subset \mathcal{O} \left( n \right) \subset \mathcal{O} \left( n \log \left( n \right) \right) \subset \mathcal{O} \left( n^2 \right) \subset \cdots\) so it’s correct to say “binary search is in \(\mathcal{O} \left( n^2 \right)\)”, even though a sharper statement is also true.

We say the algorithm is in \(\Theta \left( n^2 \right)\) (“big theta”) if

\[c_1 n^2 < \text{cost} \left( \text{algorithm} \right) < c_2 n^2\]

for some positive constants \(c_1, c_2\) and sufficiently large \(n\).

Limit \(h \rightarrow 0\)

In numerical analysis, we often work with a small real parameter, and the definitions take the limit as that parameter goes to zero. So we say a term in an expression is in \(\mathcal{O} \left( h^2 \right)\) if

\[\lim_{h \rightarrow 0} \frac{\text{term} \left( h \right)}{h^2} < \infty\]

Big \(\mathcal{O}\) terms can be manipulated as

\[h \mathcal{O} \left( h^k \right) = \mathcal{O} \left( h^{k + 1} \right)\]
\[\mathcal{O} \left( h^k \right) / h = \mathcal{O} \left( h^{k - 1} \right)\]
\[c \mathcal{O} \left( h^k \right) = \mathcal{O} \left( h^k \right)\]
\[\mathcal{O} \left( h^k \right) - \mathcal{O} \left( h^k \right) = ?\]
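As a small illustration (a sketch, not part of the original notes): the first-order Taylor remainder of \(\cos\) is in \(\mathcal{O} \left( h^2 \right)\), so \(\text{term} \left( h \right) / h^2\) stays bounded as \(h \rightarrow 0\).

# Sketch: the remainder cos(x0 + h) - (cos(x0) - sin(x0) h) is O(h^2),
# so dividing by h^2 gives a bounded ratio (approaching -cos(x0)/2 here).
x0 = 0.5
remainder(h) = cos(x0 + h) - (cos(x0) - sin(x0) * h)
for h in [1e-1, 1e-2, 1e-3, 1e-4]
    println("h = $h   remainder(h) / h^2 = $(remainder(h) / h^2)")
end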

Fixed point iteration example

We wanted to solve \(\cos \left( x \right) - x = 0\), which occurs precisely when \(x\) is a fixed point of \(g \left( x \right) = \cos \left( x \right)\).

[5]:
# Let's find the root
xstar, _ = newton(f, fp, 1.)

# And reformulate our target function as a fixed point iteration
# Note: 0 = cos(x) - x => x = cos(x)
g(x)  = cos(x)
gp(x) = -sin(x)

# What is our root and the slope of g there?
@show xstar
@show gp(xstar)

plot([x->x, g], xlims=(-2, 3), label=["\$x\$" "\$cos (x)\$"])
scatter!([xstar], [xstar], label="\$x_*\$")
xstar = 0.739085133385284
gp(xstar) = -0.6736120293089505
[5]:
[6]:
# Let's plot the iterations of the fixed point method as it approaches x_*
function fixed_point(g, x, n)
    xk = [x]
    for k in 1:n
        x = g(x)
        append!(xk, x)
    end
    xk
end

xk = fixed_point(g, 2., 15)
plot!(xk, g.(xk), seriestype=:path, marker=:auto, label="\$x_k\$")
[6]:

Verifying fixed point convergence theory

\[\left\lvert \frac{e_{k + 1}}{e_k} \right\rvert \rightarrow \left\lvert g' \left( x_* \right) \right\rvert\]
[7]:
# The ratio of errors should approach g'(x_*)
# First, what is g'(x_*)
@show gp(xstar)

# And then look at the ratio
ek = xk .- xstar
println("\nratios e_k+1 / e_k =")
(ek[2:end] ./ ek[1:end-1])
gp(xstar) = -0.6736120293089505

ratios e_k+1 / e_k =
[7]:
15-element Vector{Float64}:
 -0.9161855415615605
 -0.15197657010596488
 -0.734870205299266
 -0.624132525531327
 -0.7026257933893496
 -0.6523498121376077
 -0.6870971782336925
 -0.664168570025122
 -0.6798044680427148
 -0.6693659427636027
 -0.6764378047956165
 -0.6716930541785153
 -0.6748976495459512
 -0.6727427617641084
 -0.6741962236114177
[8]:
# And lets put it on a log-linear plot to visualize it
scatter(abs.(ek), yscale=:log10, xlabel="\$k\$", ylabel="\$e_k\$", label="\$e_k\$")
plot!(k -> abs(gp(xstar))^k, label="\$|g'|^k\$")
[8]:

Plotting Newton convergence

[9]:
# Ok, this is for fixed points, but what about Newton?!?
# Let's look at the history of our iterates
function newton_hist(f, fp, x0; tol=1e-12)
    x = x0
    hist = []
    for k in 1:100 # max number of iterations
        fx = f(x)
        fpx = fp(x)
        push!(hist, [x fx fpx])
        if abs(fx) < tol
            return vcat(hist...)
        end
        x = x - fx / fpx
    end
end
[9]:
newton_hist (generic function with 1 method)
[10]:
# Let's plot the error
xk = newton_hist(f, fp, 1.97)
@show x_star = xk[end,1]

plot(xk[1:end-1,1] .- x_star, yscale=:log10, marker=:auto, xlabel="\$k\$", ylabel="\$e_k\$", label="\$e_k\$")
x_star = xk[end, 1] = 0.7390851332151607
[10]:

Convergence class

Is Newton’s method

  1. \(q\)-linearly convergent

  2. \(r\)-linearly convergent

  3. neither

Bringing fixed point and Newton together

Let’s try to bridge the gap and bring these two ideas together.

Formulations are not unique (constants)

If \(x_* = g \left( x_* \right)\) then \(g \left( x_* \right) - x_* = 0\).

Consider the fixed point iteration

\[x = x + c \left( g \left( x \right) - x \right)\]

for any constant \(c \neq 0\). Let \(g_2 \left( x \right) = x + c \left( g \left( x \right) - x \right)\). Can we choose \(c\) to make \(\left\lvert g'_2 \left( x_* \right) \right\rvert\) small?

\[g'_2 \left( x \right) = 1 + c \left( g' \left( x \right) - 1 \right)\]
[11]:
# This looks like Newton, but with a constant c in place of -1/f'(x)...
c = .5
g2(x) = x + c * (cos(x) - x)
g2p(x) = 1 + c * (-sin(x) - 1)

@show g2p(xstar)
plot([x -> x, g, g2], ylims=(-5, 5), label=["\$x\$" "\$g(x)\$" "\$g_2(x)\$"])
g2p(xstar) = 0.16319398534552476
[11]:
[12]:
# Let's check the convergence
xk = fixed_point(g2, 1., 10)
xk .- xstar
[12]:
11-element Vector{Float64}:
 0.26091486661471597
 0.03106601954878585
 0.004893162344945079
 0.0007941171212053622
 0.00012947850276123773
 2.112687301181193e-5
 3.4475537732392425e-6
 5.62475483634195e-7
 9.16501970982253e-8
 1.4814399151852342e-8
 2.2752605355336186e-9
[13]:
# The ratio of errors should approach g'(x_*)
# First, what is g'(x_*)
@show g2p(xstar)

# And then look at the ratio
ek = xk .- xstar
println("\nratios e_k+1 / e_k =")
(ek[2:end] ./ ek[1:end-1])
g2p(xstar) = 0.16319398534552476

ratios e_k+1 / e_k =
[13]:
10-element Vector{Float64}:
 0.11906573186824182
 0.15750850659386512
 0.1622911861949014
 0.16304711144460268
 0.16316896288776617
 0.1631833433803352
 0.1631520552341395
 0.16294078544733492
 0.16164066876992242
 0.15358439530428933

Formulations are not unique (functions)

If \(x_* = g \left( x_* \right)\) then \(g \left( x_* \right) - x_* = 0\).

Consider the fixed point iteration

\[x = x + h \left( x \right) \left( g \left( x \right) - x \right)\]

for any smooth \(h \left( x \right) \neq 0\). Let \(g_3 \left( x \right) = x + h \left( x \right) \left( g \left( x \right) - x \right)\). Can we choose \(h \left( x \right)\) to make \(\left\lvert g'_3 \left( x_* \right) \right\rvert\) small?

\[g'_3 \left( x \right) = 1 + h' \left( x \right) \left( g \left( x \right) - x \right) + h \left( x \right) \left( g' \left( x \right) - 1 \right)\]
[14]:
# Let's try something that looks more like Newton
h(x) = -1 / (gp(x) - 1)
g3(x) = x + h(x) * (g(x) - x)
plot([x -> x, cos, g2, g3], ylims=(-5, 5), label=["\$x\$" "\$cos (x)\$" "\$g_2 (x)\$" "\$g_3 (x)\$"])
[14]:
  • We don’t know \(g' \left( x_* \right)\) in advance because we don’t know \(x_*\)

  • This method converges fast

  • We just re-derived Newton’s method

Newton’s method - a new derivation

A rootfinding problem \(f \left( x \right) = 0\) can be converted to a fixed point problem

\[x = x + f \left( x \right) =: g \left( x \right)\]

but there is no guarantee that \(g' \left( x_* \right) = 1 + f' \left( x_* \right)\) will have magnitude less than 1.

Problem-specific algebraic manipulation can be used to make \(\left\lvert g' \left( x_* \right) \right\rvert\) small.

\(x = x + h \left( x \right) f \left( x \right)\) is a valid formulation for any \(h \left( x \right)\) bounded away from \(0\).

Can we choose \(h \left( x \right)\) such that

\[g' \left( x \right) = 1 + h' \left( x \right) f \left( x \right) + h \left( x \right) f' \left( x \right) = 0\]

when \(f \left( x \right) = 0\)?

Then, we have

\[g' \left( x \right) = 1 + h \left( x \right) f' \left( x \right) = 0\]

or

\[h \left( x \right) = \frac{-1}{f' \left( x \right)}\]

In other words,

\[x_{k + 1} = x_k + \frac{-1}{f' \left( x_k \right)} f \left( x_k \right) = x_k + h \left( x_k \right) f \left( x_k \right)\]

which is our Newton method.
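A minimal sketch of this identification (the names g_newton, fpp, and g_newton_prime are added here, not from the notes): write Newton's update as the fixed point map \(g \left( x \right) = x - f \left( x \right) / f' \left( x \right)\), whose derivative \(g' \left( x \right) = f \left( x \right) f'' \left( x \right) / f' \left( x \right)^2\) vanishes at a simple root.

# Sketch: Newton as a fixed point iteration x_{k+1} = g_newton(x_k).
g_newton(x) = x - f(x) / fp(x)
fpp(x) = -cos(x)                              # f''(x) for f(x) = cos(x) - x
g_newton_prime(x) = f(x) * fpp(x) / fp(x)^2   # derivative of g_newton
@show g_newton_prime(xstar)                   # ~0 at the root: better than linear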

Quadratic convergence

\[\left\lvert \frac{e_{k + 1}}{e_k} \right\rvert \rightarrow \left\lvert g' \left( x_* \right) \right\rvert\]
  • What does it mean that \(g' \left( x_* \right) = 0\)?

  • It turns out that Newton’s method has locally quadratic convergence to simple roots (checked numerically in the sketch after this list),

\[\lim_{k \rightarrow \infty} \frac{\left\lvert e_{k + 1} \right\rvert}{\left\lvert e_k \right\rvert^2} < \infty\]
  • “The number of correct digits doubles each iteration”

  • Now that we know how to make a good guess accurate, the hard part is getting a good guess
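
A sketch (reusing newton_hist, f, and fp from above) that checks the quadratic-convergence claim numerically: the quotients \(\lvert e_{k + 1} \rvert / \lvert e_k \rvert\) shrink toward zero, while \(\lvert e_{k + 1} \rvert / \lvert e_k \rvert^2\) settles near the standard asymptotic constant \(\left\lvert f'' \left( x_* \right) / \left( 2 f' \left( x_* \right) \right) \right\rvert\).

# Sketch: verify quadratic convergence with the Newton iterates from newton_hist.
xk = newton_hist(f, fp, 1.97)
x_star = xk[end, 1]                          # best available estimate of the root
err = abs.(xk[1:end-1, 1] .- x_star)
@show err[2:end] ./ err[1:end-1]             # q-ratios shrink toward 0
@show err[2:end] ./ err[1:end-1] .^ 2        # quadratic ratios approach a constant
@show abs(-cos(x_star) / (2 * fp(x_star)))   # |f''(x_*) / (2 f'(x_*))| for f = cos(x) - x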