2025-09-08 Newton Methods

  • Newton’s method via Taylor series

  • Convergence theory for fixed point methods

  • Derive Newton’s method via convergence theory

  • Newton methods in computing culture

  • Breaking Newton’s method

See also FNC

Bisection

[1]:
using Plots
default(lw=4, ms=5, legendfontsize=12, xtickfontsize=12, ytickfontsize=12)

# Bisection function
hasroot(f, a, b) = f(a) * f(b) < 0
function bisect_iter(f, a, b, tol)
    hist = Float64[]
    while abs(b - a) > tol
        mid = (a + b) / 2
        push!(hist, mid)
        if hasroot(f, a, mid)
            b = mid
        else
            a = mid
        end
    end
    hist
end

# And a target function
f(x) = cos(x) - x
[1]:
f (generic function with 1 method)

Convergence classes

A convergent rootfinding algorithm produces a sequence of approximations \(x_k\) such that

\[\lim_{k \rightarrow \infty} x_k = x_*\]

where \(f \left( x_* \right) = 0\). For analysis, it is convenient to define the errors \(e_k = x_k - x_*\). We say that an iterative algorithm is \(q\)-linearly convergent (https://en.wikipedia.org/wiki/Rate_of_convergence#Q-convergence) if

\[\lim_{k \rightarrow \infty} \lvert e_{k + 1} \rvert / \lvert e_k \rvert = \rho < 1\]

(\(q\) stands for “quotient”). A smaller convergence factor \(\rho\) represents faster convergence.

A slightly weaker condition, \(r\)-linear convergence (https://en.wikipedia.org/wiki/Rate_of_convergence#R-convergence) or simply linear convergence, is that

\[\lvert e_k \rvert \leq \epsilon_k\]

for all sufficiently large \(k\), where the sequence \(\left\lbrace \epsilon_k \right\rbrace\) converges \(q\)-linearly to 0.

[2]:
# Let's define endpoints and find the root
a = -1
b = 3
hist = bisect_iter(f, a, b, 1e-10)
r = hist[end] # What are we trusting?

hist = hist[1:end-1]
scatter(abs.(hist .- r), yscale=:log10, label="\$x_k\$")

ks = 1:length(hist)
plot!(ks, (b - a) * (.5 .^ ks), xlabel="\$k\$", ylabel="\$e_k\$", label = "~\$2^k\$")
[2]:
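The error obeys the \(r\)-linear envelope \(\left( b - a \right) 2^{-k}\) plotted above, but the step-to-step quotient is erratic. A minimal sketch (reusing hist and r from the cell above) makes the distinction between \(q\)- and \(r\)-linear convergence concrete:

# Sketch: q-ratios |e_{k+1}| / |e_k| for the bisection history above.
# They bounce around (and can exceed 1 for some steps), so bisection is
# r-linearly, but not q-linearly, convergent.
err = abs.(hist .- r)
@show extrema(err[2:end] ./ err[1:end-1])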

Newton-Raphson method

Much of numerical analysis reduces to Taylor series: the approximation

\[f \left( x \right) = f \left( x_0 \right) + f' \left( x_0 \right) \left( x - x_0 \right) + f'' \left( x_0 \right) \left( x - x_0 \right)^2 / 2 + \cdots\]

centered on some reference point \(x_0\).

In numerical computation, it is exceedingly rare to look beyond the first-order approximation

\[\tilde{f}_{x_0} \left( x \right) = f \left( x_0 \right) + f' \left( x_0 \right) \left( x - x_0 \right)\]

Since \(\tilde{f}_{x_0} \left( x \right)\) is a linear function, we can explicitly compute the unique solution of \(\tilde{f}_{x_0} \left( x \right) = 0\) as

\[x = x_0 - \frac{f \left( x_0 \right)}{f' \left( x_0 \right)}\]

This is Newton’s Method (aka Newton-Raphson or Newton-Raphson-Simpson) for finding the roots of differentiable functions.

An implementation

[3]:
using Printf

# Newton's method
function newton(f, fp, x0; tol=1e-8, verbose=false)
    x = x0
    for k in 1:100 # max number of iterations
        fx = f(x)
        fpx = fp(x)
        if verbose
            println("k=$k, x_k=$(@sprintf("%.16f", x)), f_k=$(@sprintf("%.16e", fx))")
        end
        if abs(fx) < tol
            return x, fx, k
        end
        x = x - fx / fpx
    end
end

# And a target function
f(x)  =  cos(x) - x
fp(x) = -sin(x) - 1

# Let's plot our target function
plot(f, xlims=[-2, 2], label="\$cos(x) - x\$")

# And find the root
newton(f, fp, 1; tol=1e-15, verbose=true)
k=1, x_k=1.0000000000000000, f_k=-4.5969769413186023e-01
k=2, x_k=0.7503638678402439, f_k=-1.8923073822117442e-02
k=3, x_k=0.7391128909113617, f_k=-4.6455898990771516e-05
k=4, x_k=0.7390851333852840, f_k=-2.8472058044570758e-10
k=5, x_k=0.7390851332151607, f_k=0.0000000000000000e+00
[3]:
(0.7390851332151607, 0.0, 5)

That’s fast convergence!

  • 10 digits of accuracy in 4 iterations

  • How is this convergence test different than the one for bisection?

  • How can Newton’s method break down?

\[x_{k + 1} = x_k - \frac{f \left( x_k \right)}{f' \left( x_k \right)}\]
[4]:
# What if we start with a poor initial guess?
# Note that convergence like before doesn't kick in until around k=20
newton(f, fp, -pi/2+.1; verbose=true)
k=1, x_k=-1.4707963267948965, f_k=1.5706297434417247e+00
k=2, x_k=312.9170549232223948, f_k=-3.1259435002533314e+02
k=3, x_k=-5529.9275427528937144, f_k=5.5306763919178247e+03
k=4, x_k=10868.9459369702435652, f_k=-1.0868376227850798e+04
k=5, x_k=-50136.7073225235581049, f_k=5.0135707777419018e+04
k=6, x_k=-1468.7903453577164328, f_k=1.4688859787856973e+03
k=7, x_k=-732.6603742863298976, f_k=7.3187610943622951e+02
k=8, x_k=-281.0038172368655864, f_k=2.8083589138981239e+02
k=9, x_k=-139.5817486699348819, f_k=1.3979912372811944e+02
k=10, x_k=5706.8561562109989609, f_k=-5.7070086597647050e+03
k=11, x_k=2836.5648158265976235, f_k=-2.8375220962814674e+03
k=12, x_k=635.5038791777569713, f_k=-6.3488396511814790e+02
k=13, x_k=279.7607629875441830, f_k=-2.8074814643252921e+02
k=14, x_k=-53.8070079658355667, f_k=5.2885920826340048e+01
k=15, x_k=-15.7419606182276794, f_k=1.4742538472479499e+01
k=16, x_k=-1.4840596284124175, f_k=1.5706876106501757e+00
k=17, x_k=416.3331582875110257, f_k=-4.1640522744068772e+02
k=18, x_k=207.8594910282616013, f_k=-2.0698889108021109e+02
k=19, x_k=69.1262088526432592, f_k=-6.8126271241735495e+01
k=20, x_k=1.7525179991632598, f_k=-1.9332411629172330e+00
k=21, x_k=0.7778731690296721, f_k=-6.5465483571587102e-02
k=22, x_k=0.7394040200105153, f_k=-5.3373035134818281e-04
k=23, x_k=0.7390851556610822, f_k=-3.7565764610114627e-08
k=24, x_k=0.7390851332151608, f_k=-2.2204460492503131e-16
[4]:
(0.7390851332151608, -2.220446049250313e-16, 24)

Convergence of fixed-point methods (mean value theorem)

Consider the iteration

\[x_{k + 1} = g \left( x_k \right)\]

where \(g\) is a continuously differentiable function. Suppose that there exists a fixed point \(x_* = g \left( x_* \right)\). By the mean value theorem, we have that

\[x_{k + 1} - x_* = g \left( x_k \right) - g \left( x_* \right) = g' \left( c_k \right) \left( x_k - x_* \right)\]

for some \(c_k\) between \(x_k\) and \(x_*\).

Taking absolute values,

\[\left\lvert e_{k + 1} \right\rvert = \left\lvert g' \left( c_k \right) \right\rvert \left\lvert e_k \right\rvert\]

which converges to zero if \(\left\lvert g' \left( c_k \right) \right\rvert \leq \rho < 1\) for all \(k\).
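As a small illustration (a sketch, not part of the original notes): for \(g \left( x \right) = \cos \left( x \right)\), the iterates starting from \(x_0 = 1\) all stay in \(\left[ \cos \left( 1 \right), 1 \right]\), where \(\left\lvert g' \left( x \right) \right\rvert = \left\lvert \sin \left( x \right) \right\rvert \leq \sin \left( 1 \right) < 1\), so each step contracts the error.

# Sketch: check the mean value theorem bound |e_{k+1}| = |g'(c_k)| |e_k| <= rho |e_k|
# for g(x) = cos(x); the names gm and xstar_ref are local to this sketch.
gm(x) = cos(x)
xstar_ref = 0.7390851332151607      # the root found by Newton above
rho = sin(1.0)                      # upper bound for |g'(c_k)| on [cos(1), 1]
xs = [1.0]
for k in 1:5
    push!(xs, gm(xs[end]))          # x_{k+1} = g(x_k)
end
err = abs.(xs .- xstar_ref)
@show err[2:end] ./ err[1:end-1]    # each ratio is |g'(c_k)| and stays below rho
@show rho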


Convergence of fixed-point methods (Taylor series)

Consider the iteration

\[x_{k + 1} = g \left( x_k \right)\]

where \(g\) is a continuously differentiable function. Suppose there exists a fixed point \(x_* = g \left( x_* \right)\). Then there is a Taylor series expansion about \(x_*\),

\[g \left( x_k \right) = g \left( x_* \right) + g' \left( x_* \right) \left( x_k - x_* \right) + \mathcal{O} \left( \left( x_k - x_* \right)^2 \right)\]

and thus

\[x_{k + 1} - x_* = g \left( x_k \right) - g \left( x_* \right) = g' \left( x_* \right) \left( x_k - x_* \right) + \mathcal{O} \left( \left( x_k - x_* \right)^2 \right)\]

In terms of the error \(e_k = x_k - x_*\),

\[\left\lvert \frac{e_{k + 1}}{e_k} \right\rvert = \left\lvert g' \left( x_* \right) \right\rvert + \mathcal{O} \left( e_k \right)\]

Big \(\mathcal{O}\) notation (https://en.wikipedia.org/wiki/Big_O_notation)

Limit \(n \rightarrow \infty\)

We’d say an algorithm costs \(\mathcal{O} \left( n^2 \right)\) if its running time on input of size \(n\) is less than \(c n^2\) for some constant \(c\) and sufficiently large \(n\).

Sometimes we write \(\text{cost} \left( \text{algorithm}, n \right) = \mathcal{O} \left( n^2 \right)\) or (preferably) \(\text{cost} \left( \text{algorithm}, n \right) \in \mathcal{O} \left( n^2 \right)\).

Note that \(\mathcal{O} \left( \log \left( n \right) \right) \subset \mathcal{O} \left( n \right) \subset \mathcal{O} \left( n \log \left( n \right) \right) \subset \mathcal{O} \left( n^2 \right) \subset \cdots\) so it’s correct to say “binary search is in \(\mathcal{O} \left( n^2 \right)\)”, even though a sharper statement is also true.

We say the algorithm is in \(\Theta \left( n^2 \right)\) (“big theta”) if

\[c_1 n^2 < \text{cost} \left( \text{algorithm} \right) < c_2 n^2\]

for some positive constants \(c_1, c_2\) and sufficiently large \(n\).

Limit \(h \rightarrow 0\)

In numerical analysis, we often work with a small real parameter, and the definitions take the limit as that parameter goes to zero. So we say a term in an expression is in \(\mathcal{O} \left( h^2 \right)\) if

\[\lim_{h \rightarrow 0} \frac{\text{term} \left( h \right)}{h^2} < \infty\]

Big \(\mathcal{O}\) terms can be manipulated as

\[h \mathcal{O} \left( h^k \right) = \mathcal{O} \left( h^{k + 1} \right)\]
\[\mathcal{O} \left( h^k \right) / h = \mathcal{O} \left( h^{k - 1} \right)\]
\[c \mathcal{O} \left( h^k \right) = \mathcal{O} \left( h^k \right)\]
\[\mathcal{O} \left( h^k \right) - \mathcal{O} \left( h^k \right) = ?\]
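As a small illustration (a sketch, not part of the original notes): the first-order Taylor remainder of \(\cos\) is in \(\mathcal{O} \left( h^2 \right)\), so \(\text{term} \left( h \right) / h^2\) stays bounded as \(h \rightarrow 0\).

# Sketch: the remainder cos(x0 + h) - (cos(x0) - sin(x0) h) is O(h^2),
# so dividing by h^2 gives a bounded ratio (approaching -cos(x0)/2 here).
x0 = 0.5
remainder(h) = cos(x0 + h) - (cos(x0) - sin(x0) * h)
for h in [1e-1, 1e-2, 1e-3, 1e-4]
    println("h = $h   remainder(h) / h^2 = $(remainder(h) / h^2)")
end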

Fixed point iteration example

We wanted to solve \(\cos \left( x \right) - x = 0\), which occurs precisely when \(x\) is a fixed point of \(g \left( x \right) = \cos \left( x \right)\).

[5]:
# Let's find the root
xstar, _ = newton(f, fp, 1.)

# And reformulate our target function as a fixed point iteration
# Note: 0 = cos(x) - x => x = cos(x)
g(x)  = cos(x)
gp(x) = -sin(x)

# What is our root and the slope of g there?
@show xstar
@show gp(xstar)

plot([x->x, g], xlims=(-2, 3), label=["\$x\$" "\$cos (x)\$"])
scatter!([xstar], [xstar], label="\$x_*\$")
xstar = 0.739085133385284
gp(xstar) = -0.6736120293089505
[5]:
[6]:
# Let's plot the iterations of the fixed point method as it approaches x_*
function fixed_point(g, x, n)
    xk = [x]
    for k in 1:n
        x = g(x)
        append!(xk, x)
    end
    xk
end

xk = fixed_point(g, 2., 15)
plot!(xk, g.(xk), seriestype=:path, marker=:auto, label="\$x_k\$")
[6]:

Verifying fixed point convergence theory

\[\left\lvert \frac{e_{k + 1}}{e_k} \right\rvert \rightarrow \left\lvert g' \left( x_* \right) \right\rvert\]
[7]:
# The ratio of errors should approach g'(x_*)
# First, what is g'(x_*)
@show gp(xstar)

# And then look at the ratio
ek = xk .- xstar
println("\nratios e_k+1 / e_k =")
(ek[2:end] ./ ek[1:end-1])
gp(xstar) = -0.6736120293089505

ratios e_k+1 / e_k =
[7]:
15-element Vector{Float64}:
 -0.9161855415615605
 -0.15197657010596488
 -0.734870205299266
 -0.624132525531327
 -0.7026257933893496
 -0.6523498121376077
 -0.6870971782336925
 -0.664168570025122
 -0.6798044680427148
 -0.6693659427636027
 -0.6764378047956165
 -0.6716930541785153
 -0.6748976495459512
 -0.6727427617641084
 -0.6741962236114177
[8]:
# And lets put it on a log-linear plot to visualize it
scatter(abs.(ek), yscale=:log10, xlabel="\$k\$", ylabel="\$e_k\$", label="\$e_k\$")
plot!(k -> abs(gp(xstar))^k, label="\$|g'|^k\$")
[8]:

Plotting Newton convergence

[9]:
# Ok, this is for fixed points, but what about Newton?!?
# Let's look at the history of our iterates
function newton_hist(f, fp, x0; tol=1e-12)
    x = x0
    hist = []
    for k in 1:100 # max number of iterations
        fx = f(x)
        fpx = fp(x)
        push!(hist, [x fx fpx])
        if abs(fx) < tol
            return vcat(hist...)
        end
        x = x - fx / fpx
    end
end
[9]:
newton_hist (generic function with 1 method)
[10]:
# Let's plot the error
xk = newton_hist(f, fp, 1.97)
@show x_star = xk[end,1]

plot(xk[1:end-1,1] .- x_star, yscale=:log10, marker=:auto, xlabel="\$k\$", ylabel="\$e_k\$", label="\$e_k\$")
x_star = xk[end, 1] = 0.7390851332151607
[10]:

Convergence class

Is Newton’s method

  1. \(q\)-linearly convergent

  2. \(r\)-linearly convergent

  3. neither

Bringing fixed point and Newton together

Let’s try to bridge the gap and bring these two ideas together.

Formulations are not unique (constants)

If \(x_* = g \left( x_* \right)\) then \(g \left( x_* \right) - x_* = 0\).

Consider the fixed point iteration

\[x = x + c \left( g \left( x \right) - x \right)\]

for any constant \(c \neq 0\). Let \(g_2 \left( x \right) = x + c \left( g \left( x \right) - x \right)\). Can we choose \(c\) to make \(\left\lvert g'_2 \left( x_* \right) \right\rvert\) small?

\[g'_2 \left( x \right) = 1 + c \left( g' \left( x \right) - 1 \right)\]
[11]:
# This looks like Newton, but with a constant c in place of -1/f'(x)...
c = .5
g2(x) = x + c * (cos(x) - x)
g2p(x) = 1 + c * (-sin(x) - 1)

@show g2p(xstar)
plot([x -> x, g, g2], ylims=(-5, 5), label=["\$x\$" "\$g(x)\$" "\$g_2(x)\$"])
g2p(xstar) = 0.16319398534552476
[11]:
[12]:
# Let's check the convergence
xk = fixed_point(g2, 1., 10)
xk .- xstar
[12]:
11-element Vector{Float64}:
 0.26091486661471597
 0.03106601954878585
 0.004893162344945079
 0.0007941171212053622
 0.00012947850276123773
 2.112687301181193e-5
 3.4475537732392425e-6
 5.62475483634195e-7
 9.16501970982253e-8
 1.4814399151852342e-8
 2.2752605355336186e-9
[13]:
# The ratio of errors should approach g'(x_*)
# First, what is g'(x_*)
@show g2p(xstar)

# And then look at the ratio
ek = xk .- xstar
println("\nratios e_k+1 / e_k =")
(ek[2:end] ./ ek[1:end-1])
g2p(xstar) = 0.16319398534552476

ratios e_k+1 / e_k =
[13]:
10-element Vector{Float64}:
 0.11906573186824182
 0.15750850659386512
 0.1622911861949014
 0.16304711144460268
 0.16316896288776617
 0.1631833433803352
 0.1631520552341395
 0.16294078544733492
 0.16164066876992242
 0.15358439530428933

Formulations are not unique (functions)

If \(x_* = g \left( x_* \right)\) then \(g \left( x_* \right) - x_* = 0\).

Consider the fixed point iteration

\[x = x + h \left( x \right) \left( g \left( x \right) - x \right)\]

for any smooth \(h \left( x \right) \neq 0\). Let \(g_3 \left( x \right) = x + h \left( x \right) \left( g \left( x \right) - x \right)\). Can we choose \(h \left( x \right)\) to make \(\left\lvert g'_3 \left( x_* \right) \right\rvert\) small?

\[g'_3 \left( x \right) = 1 + h' \left( x \right) \left( g \left( x \right) - x \right) + h \left( x \right) \left( g' \left( x \right) - 1 \right)\]
[14]:
# Let's try something that looks more like Newton
h(x) = -1 / (gp(x) - 1)
g3(x) = x + h(x) * (g(x) - x)
plot([x -> x, cos, g2, g3], ylims=(-5, 5), label=["\$x\$" "\$cos (x)\$" "\$g_2 (x)\$" "\$g_3 (x)\$"])
[14]:
  • We don’t know \(g' \left( x_* \right)\) in advance because we don’t know \(x_*\)

  • This method converges fast

  • We just re-derived Newton’s method

Newton’s method - a new derivation

A rootfinding problem \(f \left( x \right) = 0\) can be converted to a fixed point problem

\[x = x + f \left( x \right) =: g \left( x \right)\]

but there is no guarantee that \(g' \left( x_* \right) = 1 + f' \left( x_* \right)\) will have magnitude less than 1.

Problem-specific algebraic manipulation can be used to make \(\left\lvert g' \left( x_* \right) \right\rvert\) small.

\(x = x + h \left( x \right) f \left( x \right)\) is a valid formulation for any \(h \left( x \right)\) bounded away from \(0\).

Can we choose \(h \left( x \right)\) such that

\[g' \left( x \right) = 1 + h' \left( x \right) f \left( x \right) + h \left( x \right) f' \left( x \right) = 0\]

when \(f \left( x \right) = 0\)?

Then, we have

\[g' \left( x \right) = 1 + h \left( x \right) f' \left( x \right) = 0\]

or

\[h \left( x \right) = \frac{-1}{f' \left( x \right)}\]

In other words,

\[x_{k + 1} = x_k + \frac{-1}{f' \left( x_k \right)} f \left( x_k \right) = x_k + h \left( x_k \right) f \left( x_k \right)\]

which is our Newton method.
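A minimal sketch of this identification (the names g_newton, fpp, and g_newton_prime are added here, not from the notes): write Newton's update as the fixed point map \(g \left( x \right) = x - f \left( x \right) / f' \left( x \right)\), whose derivative \(g' \left( x \right) = f \left( x \right) f'' \left( x \right) / f' \left( x \right)^2\) vanishes at a simple root.

# Sketch: Newton as a fixed point iteration x_{k+1} = g_newton(x_k).
g_newton(x) = x - f(x) / fp(x)
fpp(x) = -cos(x)                              # f''(x) for f(x) = cos(x) - x
g_newton_prime(x) = f(x) * fpp(x) / fp(x)^2   # derivative of g_newton
@show g_newton_prime(xstar)                   # ~0 at the root: better than linear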

Quadratic convergence

\[\left\lvert \frac{e_{k + 1}}{e_k} \right\rvert \rightarrow \left\lvert g' \left( x_* \right) \right\rvert\]
  • What does it mean that \(g' \left( x_* \right) = 0\)?

  • It turns out that Newton’s method has locally quadratic convergence to simple roots (checked numerically in the sketch after this list),

\[\lim_{k \rightarrow \infty} \frac{\left\lvert e_{k + 1} \right\rvert}{\left\lvert e_k \right\rvert^2} < \infty\]
  • “The number of correct digits doubles each iteration”

  • Now that we know how to make a good guess accurate, the hard part is getting a good guess
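
A sketch (reusing newton_hist, f, and fp from above) that checks the quadratic-convergence claim numerically: the quotients \(\lvert e_{k + 1} \rvert / \lvert e_k \rvert\) shrink toward zero, while \(\lvert e_{k + 1} \rvert / \lvert e_k \rvert^2\) settles near the standard asymptotic constant \(\left\lvert f'' \left( x_* \right) / \left( 2 f' \left( x_* \right) \right) \right\rvert\).

# Sketch: verify quadratic convergence with the Newton iterates from newton_hist.
xk = newton_hist(f, fp, 1.97)
x_star = xk[end, 1]                          # best available estimate of the root
err = abs.(xk[1:end-1, 1] .- x_star)
@show err[2:end] ./ err[1:end-1]             # q-ratios shrink toward 0
@show err[2:end] ./ err[1:end-1] .^ 2        # quadratic ratios approach a constant
@show abs(-cos(x_star) / (2 * fp(x_star)))   # |f''(x_*) / (2 f'(x_*))| for f = cos(x) - x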