Limited-memory BFGS Method
Slide: Starting with the classical Newton's method, we will examine its limitations and the improvements introduced by quasi-Newton methods. Finally, we will delve into the L-BFGS algorithm, which is particularly suited for large-scale problems. Inspired by Dr. Zhepei Wang's Lecture "Numerical Optimization for Robotics".
Introduction
Unconstrained Optimization
Consider a smooth and twice-differentiable unconstrained optimization problem:
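\[
\min_{\mathbf{x}\in\mathbb{R}^n} f(\mathbf{x}).
\]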
Descent methods provide an iterative solution:
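\[
\mathbf{x}^{k+1} = \mathbf{x}^k + \alpha^k\,\mathbf{d}^k,
\]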
where \(\mathbf d^k\) is the direction, and \(\alpha^k\) is the step size.
Newton's Method
By second-order Taylor expansion,
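\[
f(\mathbf{x}^k+\mathbf{d}) \approx f(\mathbf{x}^k) + \nabla f(\mathbf{x}^k)^\top\mathbf{d} + \frac{1}{2}\,\mathbf{d}^\top\nabla^2 f(\mathbf{x}^k)\,\mathbf{d}.
\]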
Minimizing this quadratic approximation over \(\mathbf{d}\) yields the Newton direction. For \(\nabla^2f(\mathbf{x}^k)\succ0\), it is the solution of:
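\[
\nabla^2 f(\mathbf{x}^k)\,\mathbf{d}^k = -\nabla f(\mathbf{x}^k), \qquad \mathbf{d}^k = -\nabla^2 f(\mathbf{x}^k)^{-1}\nabla f(\mathbf{x}^k).
\]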
Damped Newton Method
For \(\nabla^2f(\mathbf{x}^k) \nsucc 0\), the direction \(\mathbf{d}^k\) cannot be directly solved from \(\nabla^2f(\mathbf{x}^k) \mathbf{d}^k = -\nabla f(\mathbf{x}^k)\). In such cases, a positive definite (PD) matrix \(\mathbf{M}^k\) must be constructed to approximate the Hessian.
If the function is convex, \(\nabla^2f(\mathbf{x}^k)\) may be singular. Adding a regularization term ensures positive definiteness:
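\[
\mathbf{M}^k = \nabla^2 f(\mathbf{x}^k) + \lambda\,\mathbf{I} \succ 0.
\]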
The parameter \(\lambda > 0\) starts small and is increased until the Cholesky decomposition of \(\mathbf{M}^k\) succeeds.
If the function is nonconvex, \(\nabla^2f(\mathbf{x}^k)\) may be indefinite. To handle this, a Bunch-Kaufman factorization \(\nabla^2f(\mathbf{x}^k)=\mathbf{L}\mathbf{D}\mathbf{L}^\top\) is computed, and the block-diagonal factor \(\mathbf{D}\) is modified into a PD \(\tilde{\mathbf{D}}\) (e.g., by pushing its eigenvalues away from zero), giving \(\mathbf{M}^k=\mathbf{L}\tilde{\mathbf{D}}\mathbf{L}^\top\).
Finally, direction is solved from \(\mathbf{M}^k\mathbf{d}^k=-\nabla f(\mathbf{x}^k)\).
Practical Newton Method
In addition to the modified direction, we select \(\alpha^k\) by backtracking line search to ensure sufficient decrease of the objective function, satisfying the Armijo condition:
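\[
f(\mathbf{x}^k + \alpha\,\mathbf{d}^k) \le f(\mathbf{x}^k) + c_1\,\alpha\,\nabla f(\mathbf{x}^k)^\top\mathbf{d}^k,
\]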
where \(c_1 \in (0, 1)\) is a small constant.
Quasi-Newton Methods
Newton's Method: Limitations
- High Cost: Computing the Hessian and its inverse requires \(\mathcal{O}(n^3)\) operations, impractical for large problems.
- Indefinite Hessian: In nonconvex cases, the quadratic model may be unbounded below, and the Newton step may move toward saddle points.
- Ill-Conditioning: Poorly conditioned Hessians amplify errors and hinder convergence.
- Inaccurate Model: Local quadratic approximations may fail for complex functions, causing inefficiency or divergence.
Quasi-Newton Approximation
Newton approximation:
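\[
\nabla^2 f(\mathbf{x}^k)\,\mathbf{d}^k = -\nabla f(\mathbf{x}^k)
\]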
Quasi-Newton approximation:
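\[
\mathbf{B}^k\,\mathbf{d}^k = -\nabla f(\mathbf{x}^k), \qquad \mathbf{B}^k \approx \nabla^2 f(\mathbf{x}^k)
\]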
The matrix \(\mathbf{B}^k\) should:
- Avoid full second-order derivatives.
- Admit an efficient (ideally closed-form) solution of the linear system \(\mathbf{B}^k\mathbf{d}^k=-\mathbf{g}^k\).
- Retain first-order curvature information.
- Preserve the descent direction.
BFGS Method
Descent Direction
The search direction \(\mathbf{d}^k\) should make an acute angle with the negative gradient: \(-(\mathbf{g}^k)^\top\mathbf{d}^k > 0\).
\(\mathbf{B}^k\) must be PD so that for all nonzero \(\mathbf{g}^k\), \((\mathbf{g}^k)^\top(\mathbf{B}^k)^{-1}\mathbf{g}^k>0\), i.e., \(\mathbf{d}^k=-(\mathbf{B}^k)^{-1}\mathbf{g}^k\) is always a descent direction.
Curvature Information
At the point \(\mathbf{x}^{k+1}\), the gradient is \(\mathbf{g}^{k+1}\). We want \(\mathbf{B}^{k+1}\) to satisfy:
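\[
\mathbf{B}^{k+1}\mathbf{s}^k = \mathbf{y}^k \quad \text{(the secant condition)}, \qquad \mathbf{s}^k = \mathbf{x}^{k+1}-\mathbf{x}^k,\;\; \mathbf{y}^k = \mathbf{g}^{k+1}-\mathbf{g}^k.
\]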
The Optimal \(\mathbf{B}^{k+1}\)?
Infinitely many \(\mathbf{B}^{k+1}\) satisfy the secant condition; how do we choose the best one? We define the following weighted least-squares problem:
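One standard formulation (following the variational derivation in Nocedal and Wright; the weighted norm below is an assumed form, as the lecture's exact expression is not reproduced here) is
\[
\mathbf{B}^{k+1} = \arg\min_{\mathbf{B}} \left\|\mathbf{W}^{1/2}\left(\mathbf{B}-\mathbf{B}^k\right)\mathbf{W}^{1/2}\right\|_F^2 \quad \text{s.t.}\quad \mathbf{B}=\mathbf{B}^\top,\;\; \mathbf{B}\mathbf{s}^k=\mathbf{y}^k.
\]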
In BFGS, the weight matrix is selected to be:
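A common choice in this derivation is based on the average Hessian along the step, which satisfies the secant relation by Taylor's theorem and makes the update invariant to scaling:
\[
\bar{\mathbf{G}}^k = \int_0^1 \nabla^2 f\!\left(\mathbf{x}^k + \tau\,\mathbf{s}^k\right)\mathrm{d}\tau, \qquad \bar{\mathbf{G}}^k\,\mathbf{s}^k = \mathbf{y}^k.
\]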
Closed-form Solution for \(\mathbf{B}^{k+1}\)
To derive the optimal \(\mathbf{B}^{k+1}\), we construct the Lagrangian function:
Taking the derivative of the Lagrangian with respect to \(\mathbf{B}\) and setting it to zero gives:
Rearranging the terms, we express \(\mathbf{B}\) as:
Substituting this expression into the secant condition \(\mathbf{B}\mathbf{s}^{k} = \mathbf{y}^{k}\), we obtain:
Solving for \(\boldsymbol{\Lambda}\), we find:
Finally, substituting \(\boldsymbol{\Lambda}\) back, the closed-form solution for \(\mathbf{B}^{k+1}\) is:
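\[
\mathbf{B}^{k+1} = \mathbf{B}^k + \frac{\mathbf{y}^k(\mathbf{y}^k)^\top}{(\mathbf{y}^k)^\top\mathbf{s}^k} - \frac{\mathbf{B}^k\mathbf{s}^k(\mathbf{s}^k)^\top\mathbf{B}^k}{(\mathbf{s}^k)^\top\mathbf{B}^k\mathbf{s}^k}.
\]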
Broyden-Fletcher-Goldfarb-Shanno (BFGS)
Given the initial value \(\mathbf{B}^0=\mathbf{I}\), this update is applied iteratively,
with \(\mathbf{s}^{k} = \mathbf{x}^{k+1} - \mathbf{x}^k\) and \(\mathbf{y}^{k} = \mathbf{g}^{k+1} - \mathbf{g}^k\).
For computational efficiency, we often work with the inverse of \(\mathbf{B}^k\) directly. The iterative update for \((\mathbf{B}^k)^{-1}\) is given by:
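\[
(\mathbf{B}^{k+1})^{-1} = \left(\mathbf{I} - \frac{\mathbf{s}^k(\mathbf{y}^k)^\top}{(\mathbf{y}^k)^\top\mathbf{s}^k}\right)(\mathbf{B}^k)^{-1}\left(\mathbf{I} - \frac{\mathbf{y}^k(\mathbf{s}^k)^\top}{(\mathbf{y}^k)^\top\mathbf{s}^k}\right) + \frac{\mathbf{s}^k(\mathbf{s}^k)^\top}{(\mathbf{y}^k)^\top\mathbf{s}^k}.
\]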
Guaranteeing PD of \(\mathbf{B}^{k+1}\)
To ensure that \(\mathbf{B}^{k+1}\) remains positive definite (PD), the following curvature condition must hold:
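\[
(\mathbf{y}^k)^\top\mathbf{s}^k > 0.
\]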
For any nonzero vector \(\mathbf{z}\), using the Cauchy-Schwarz inequality:
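\[
\mathbf{z}^\top\mathbf{B}^{k+1}\mathbf{z} = \mathbf{z}^\top\mathbf{B}^k\mathbf{z} - \frac{\left((\mathbf{s}^k)^\top\mathbf{B}^k\mathbf{z}\right)^2}{(\mathbf{s}^k)^\top\mathbf{B}^k\mathbf{s}^k} + \frac{\left(\mathbf{z}^\top\mathbf{y}^k\right)^2}{(\mathbf{y}^k)^\top\mathbf{s}^k} \ge \frac{\left(\mathbf{z}^\top\mathbf{y}^k\right)^2}{(\mathbf{y}^k)^\top\mathbf{s}^k} \ge 0,
\]
since Cauchy-Schwarz in the \(\mathbf{B}^k\)-inner product gives \(((\mathbf{s}^k)^\top\mathbf{B}^k\mathbf{z})^2 \le (\mathbf{z}^\top\mathbf{B}^k\mathbf{z})\,((\mathbf{s}^k)^\top\mathbf{B}^k\mathbf{s}^k)\).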
Equalities hold only when \(\mathbf{z}^\top\mathbf{y}^{k} = 0\) and \(\mathbf{z} \parallel \mathbf{s}^{k}\). Given that \((\mathbf{y}^{k})^\top\mathbf{s}^{k} > 0\), these conditions cannot hold simultaneously. Therefore, if \(\mathbf{B}^k \succ 0\), it follows that \(\mathbf{B}^{k+1} \succ 0\).
Guaranteeing \((\mathbf{y}^{k})^\top\mathbf{s}^{k} > 0\)
The Armijo condition (AC) alone cannot guarantee \((\mathbf{y}^{k})^\top\mathbf{s}^{k} > 0\); we also need the curvature condition (CC):
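\[
\nabla f(\mathbf{x}^k + \alpha\,\mathbf{d}^k)^\top\mathbf{d}^k \ge c_2\,\nabla f(\mathbf{x}^k)^\top\mathbf{d}^k, \qquad c_1 < c_2 < 1.
\]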
Typically, \(c_1=10^{-4}, c_2=0.9\).
L-BFGS Method
Lewis-Overton Line Search
The Lewis-Overton line search is a bracketing line search designed for quasi-Newton methods; it enforces the weak Wolfe conditions (AC and CC) and remains robust even for nonsmooth objectives. A Python sketch follows the pseudocode below:
- Given search direction \(\mathbf{d}^k\), current point \(\mathbf{x}^k\) and gradient \(\mathbf{g}^k\)
- Initialize trial step \(\alpha:=1\), \(\alpha_l:=0\), \(\alpha_r:=+\infty\)
- Repeat
- Update bounds:
- If \(\text{AC}(\alpha)\) fails, set \(\alpha_r := \alpha\)
- Else if \(\text{CC}(\alpha)\) fails, set \(\alpha_l := \alpha\)
- Otherwise, return \(\alpha\)
- Update \(\alpha\):
- If \(\alpha_r < +\infty\), set \(\alpha := \text{CubicInterpolate}(\alpha_l, \alpha_r)\) or \(\alpha := (\alpha_l + \alpha_r) / 2\)
- Otherwise, set \(\alpha := \text{CubicExtrapolate}(\alpha_l, \alpha_r)\) or \(\alpha := 2\alpha_l\)
- Ensure \(\alpha \in [\alpha_{\min}, \alpha_{\max}]\)
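A minimal Python (NumPy) sketch of the procedure above, using bisection and doubling in place of the cubic interpolation and extrapolation steps; the function name, default constants, and the callables `f` and `grad` are illustrative choices, not part of the original lecture:

```python
import numpy as np

def lewis_overton_line_search(f, grad, x, d, g, c1=1e-4, c2=0.9,
                              alpha_min=1e-16, alpha_max=1e16, max_iters=64):
    """Weak-Wolfe bracketing line search (illustrative sketch)."""
    fx = f(x)
    dg = g.dot(d)                      # directional derivative; < 0 for descent
    alpha, alpha_l, alpha_r = 1.0, 0.0, np.inf
    for _ in range(max_iters):
        x_trial = x + alpha * d
        if f(x_trial) > fx + c1 * alpha * dg:      # Armijo condition fails
            alpha_r = alpha                        # step too long; shrink bracket
        elif grad(x_trial).dot(d) < c2 * dg:       # curvature condition fails
            alpha_l = alpha                        # step too short; grow bracket
        else:
            return alpha                           # both AC and CC hold
        # Bisection replaces cubic interpolation; doubling replaces extrapolation.
        alpha = 0.5 * (alpha_l + alpha_r) if alpha_r < np.inf else 2.0 * alpha_l
        alpha = float(np.clip(alpha, alpha_min, alpha_max))
    return alpha                                   # fallback after max_iters
```

Cubic interpolation on \((\alpha_l, \alpha_r)\) typically accepts a step in fewer trials; bisection is the simple fallback shown here.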
Cautious Update
Sometimes, when the line search is inexact or the function is poorly conditioned, \((\mathbf{y}^{k})^\top \mathbf{s}^{k} > 0\) cannot be guaranteed. To ensure numerical stability and maintain a PD Hessian approximation, L-BFGS employs a cautious update strategy:
Skip update condition: If the curvature condition \((\mathbf{y}^k)^\top\mathbf{s}^k > \epsilon \|\mathbf{s}^k\|^2\) is not satisfied, where \(\epsilon\) is a small positive constant (e.g., \(10^{-6}\)), skip the update for this iteration: \(\mathbf{B}^{k+1} = \mathbf{B}^k\).
Powell's Damping: If the curvature condition \((\mathbf{y}^k)^\top \mathbf{s}^k \geq \eta\, (\mathbf{s}^k)^\top \mathbf{B}^k \mathbf{s}^k\) is not satisfied, where \(\eta\) is a small positive constant (e.g., \(0.2\) or \(0.25\)), replace \(\mathbf{y}^k\) with the damped vector \(\tilde{\mathbf{y}}^k = \theta\,\mathbf{y}^k + (1-\theta)\,\mathbf{B}^k\mathbf{s}^k\), where \(\theta = \frac{(1-\eta)\,(\mathbf{s}^k)^\top\mathbf{B}^k\mathbf{s}^k}{(\mathbf{s}^k)^\top\mathbf{B}^k\mathbf{s}^k - (\mathbf{y}^k)^\top\mathbf{s}^k}\), so that \((\tilde{\mathbf{y}}^k)^\top\mathbf{s}^k = \eta\,(\mathbf{s}^k)^\top\mathbf{B}^k\mathbf{s}^k > 0\).
Cautious updates are guaranteed to produce iterates that converge to a critical point if the function has bounded sublevel sets and a Lipschitz continuous gradient.
Two-Loop Recursion
L-BFGS uses a two-loop recursion to compute the search direction without explicitly forming the Hessian approximation. The algorithm maintains a history of the most recent \(m\) pairs \(\{(\mathbf{s}^i, \mathbf{y}^i)\}_{i=k-m}^{k-1}\), where typically \(m\) is between 5 and 20.
- Initialize an array \(\mathcal{A}\) of length \(m\) and set \(\mathbf{d}:=\mathbf{g}^k\)
- For \(i=k-1,k-2,\dots,k-m\):
- \(\mathcal{A}^{i+m-k} := \langle\mathbf{s}^i,\mathbf{d}\rangle/\langle \mathbf{s}^i,\mathbf{y}^i\rangle\)
- \(\mathbf{d}:=\mathbf{d}-\mathcal{A}^{i+m-k}\mathbf{y}^i\)
- \(\mathbf{d}:=\mathbf{d}\cdot\langle \mathbf{s}^{k-1},\mathbf{y}^{k-1}\rangle/\langle \mathbf{y}^{k-1},\mathbf{y}^{k-1}\rangle\)
- For \(i=k-m,k-m+1,\dots,k-1\):
- \(a:=\langle \mathbf{y}^i,\mathbf{d}\rangle/\langle \mathbf{s}^i,\mathbf{y}^i\rangle\)
- \(\mathbf{d}:=\mathbf{d}+\mathbf{s}^i(\mathcal{A}^{i+m-k}-a)\)
- Return the search direction \(\mathbf{d}^k := -\mathbf{d}\)
This approach reduces the storage requirement from \(\mathcal{O}(n^2)\) to \(\mathcal{O}(mn)\) and the computational cost per iteration from \(\mathcal{O}(n^2)\) to \(\mathcal{O}(mn)\).
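The recursion translates almost line-for-line into Python (NumPy). A sketch, assuming the history lists store the \(m\) most recent pairs oldest-first; names are illustrative:

```python
import numpy as np

def two_loop_direction(g, s_hist, y_hist):
    """Two-loop recursion: returns d^k = -(B^k)^{-1} g without forming B^k.

    s_hist, y_hist: the m most recent pairs s^i, y^i, oldest first.
    """
    d = g.copy()
    m = len(s_hist)
    alpha = np.empty(m)
    rho = [1.0 / y.dot(s) for s, y in zip(s_hist, y_hist)]
    for i in reversed(range(m)):           # first loop: newest to oldest
        alpha[i] = rho[i] * s_hist[i].dot(d)
        d -= alpha[i] * y_hist[i]
    if m > 0:                              # initial scaling gamma = s^T y / y^T y
        d *= s_hist[-1].dot(y_hist[-1]) / y_hist[-1].dot(y_hist[-1])
    for i in range(m):                     # second loop: oldest to newest
        beta = rho[i] * y_hist[i].dot(d)
        d += (alpha[i] - beta) * s_hist[i]
    return -d                              # negate so the result is a descent direction
```

With an empty history the recursion reduces to steepest descent, \(\mathbf{d}^k = -\mathbf{g}^k\).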
Algorithm Summary
The complete L-BFGS algorithm with cautious update and Lewis-Overton line search:
- Initialize \(\mathbf{x}^0,\mathbf{g}^0:=\nabla f(\mathbf{x}^0)\), choose \(m\)
- For \(k = 0, 1, 2, \ldots\) until convergence:
- Compute the search direction \(\mathbf{d}^k\) using the L-BFGS two-loop recursion
- Find step size \(\alpha^k\) using Lewis-Overton line search
- Update: \(\mathbf{x}^{k+1} = \mathbf{x}^k + \alpha^k \mathbf{d}^k\)
- Compute \(\mathbf{s}^k = \mathbf{x}^{k+1} - \mathbf{x}^k\), \(\mathbf{y}^k = \nabla f(\mathbf{x}^{k+1}) - \nabla f(\mathbf{x}^k)\)
- Cautiously update the history: store \((\mathbf{s}^k, \mathbf{y}^k)\) only if the curvature test passes, discarding the oldest pair when more than \(m\) are stored (a sketch of the complete loop follows below)
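Putting the pieces together, a compact sketch of the loop above, reusing the hypothetical `two_loop_direction` and `lewis_overton_line_search` helpers from the earlier sketches and the skip-update form of the cautious rule:

```python
import numpy as np

def lbfgs(f, grad, x0, m=8, eps=1e-6, g_tol=1e-8, max_iters=1000):
    """L-BFGS loop (sketch); relies on the two helper sketches above."""
    x = np.asarray(x0, dtype=float).copy()
    g = grad(x)
    s_hist, y_hist = [], []
    for _ in range(max_iters):
        if np.linalg.norm(g) < g_tol:          # convergence test on the gradient
            break
        d = two_loop_direction(g, s_hist, y_hist)
        alpha = lewis_overton_line_search(f, grad, x, d, g)
        x_new = x + alpha * d
        g_new = grad(x_new)
        s, y = x_new - x, g_new - g
        # Cautious update: store the pair only if the curvature test passes.
        if y.dot(s) > eps * s.dot(s):
            s_hist.append(s); y_hist.append(y)
            if len(s_hist) > m:                # keep only the m most recent pairs
                s_hist.pop(0); y_hist.pop(0)
        x, g = x_new, g_new
    return x
```

For example, `lbfgs(lambda x: (x**2).sum(), lambda x: 2*x, np.ones(100))` converges to the origin in a few iterations.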
Open Source Implementation
- https://github.com/chokkan/liblbfgs
- https://github.com/hjmshi/PyTorch-LBFGS
- https://github.com/ZJU-FAST-Lab/LBFGS-Lite
- https://github.com/yixuan/LBFGSpp