Over time, I found myself checking the Wikipedia page for Conjugate Gradient over and over again, and finally got tired of it. So here is a complete, self-contained and hopefully correct derivation of the method, including non-standard inner products and preconditioning, up to the Conjugate Residuals variations. The alternate Lanczos formulation can be found in the notes for Krylov methods.
When a matrix \(A\) is positive definite, the solution to \(Ax = b\) is given by the unique minimizer of the following convex quadratic form:
\[E(x) = \half x^T A x - b^T x\]The gradient descent algorithm proceeds by minimizing \(E\) along successive descent directions \(p_k\), given in this case by the negative gradient \(r_k = b - A x_k\) of \(E\):
\[x_{k+1} = x_k + \alpha_k p_k\]where \(p_k = r_k\) is known as the steepest descent direction. \(\alpha_k\) is found by the following line-search procedure, a one-dimensional minimization problem (convex in this case):
\[\alpha_k = \argmin{\alpha \in \RR} \quad f_k(\alpha) = E\block{x_k + \alpha p_k}\]Stationarity conditions give:
\[\begin{align} \dd f_k(\alpha) = 0 &\iff \block{x_k + \alpha p_k}^T A p_k - b^Tp_k = 0 \\ & \iff \quad \alpha\, p_k^T A p_k = \block{b - A x_k}^T p_k \\ & \iff \quad \alpha = \frac{\block{b - A x_k}^T p_k}{p_k^T A p_k} \end{align}\]Geometrically, the line-search can also be interpreted in terms of the \(A\)-projection of \(x_k\) onto \(p_k\):
\[x_{k+1} = x_k - p_k \frac{p_k^TAx_k}{p_k^TAp_k} \quad + \quad p_k \frac{p_k^T b}{p_k^TAp_k}\]At each iteration, the \(A\)-projection of \(x_k\) along \(p_k\) is replaced with that of the solution \(\inv{A}b = x^\star\) to obtain \(x_{k+1}\). Consequently, an interesting choice for the descent directions \(\block{p_k}_k\) is to use mutually \(A\)-orthogonal (“\(A\)-conjugate”, or simply “conjugate”) directions, so that the solution is found in at most \(n\) steps in exact arithmetic. This is precisely the idea behind the Conjugate Gradient (CG) method.
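For concreteness, here is a minimal NumPy sketch of this steepest-descent iteration with exact line search (the function name and stopping rule are mine, for illustration only):

```python
import numpy as np

def steepest_descent(A, b, x0=None, tol=1e-10, maxiter=1000):
    """Minimize E(x) = 1/2 x^T A x - b^T x for symmetric positive definite A,
    following the negative gradient r = b - A x with the exact line search
    alpha = (b - A x)^T p / (p^T A p) derived above."""
    x = np.zeros_like(b, dtype=float) if x0 is None else x0.astype(float)
    for _ in range(maxiter):
        r = b - A @ x                    # negative gradient of E at x
        if np.linalg.norm(r) < tol:
            break
        p = r                            # steepest descent direction p_k = r_k
        alpha = (r @ p) / (p @ (A @ p))  # exact line search along p_k
        x += alpha * p
    return x
```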
The CG algorithm chooses \(p_{k+1}\) by \(A\)-orthogonalizing the residual \(r_{k+1}\) against all previous directions \(p_i\), so that the family \(\block{p_k}_k\) is \(A\)-orthogonal:
\[p_{k+1} = r_{k+1} - \sum_{i=1}^k p_i \frac{p_i^TAr_{k+1}}{p_i^T A p_i}\]We let \(\beta_{ik} = \frac{p_i^TAr_k}{p_i^T A p_i}\) if \(i < k\), \(\beta_{ii} = 1\) and \(\beta_{ik} = 0\) if \(i > k\), so that the recurrence relation becomes \(r_{k+1} = \sum_{i=1}^{k+1} p_i \beta_{i,k+1}\) or, in matrix form:
\[R_k = P_k B_k\]where the matrix \(B_k = \block{\beta_{ij}}_{i, j \leq k}\) is upper-triangular with unit diagonal (hence invertible). Since the \(p_i\) are mutually \(A\)-orthogonal, the geometric interpretation above (or simply an induction over the update formula for \(x_k\)) shows that the \(A\)-projection of \(x_{k+1}\) onto the previous directions \(\block{p_i}_{i\leq k} = P_k\) gives:
\[P_k^T A x_{k+1} = P_k^T b\]Moving everything to one side, we obtain:
\[\underbrace{P_k^T}_{B_k^{-T} R_k^T} r_{k+1} = 0\]which implies the following important property:
\[R_k^T r_{k+1} = 0\]That is, the \(r_k\) are mutually orthogonal. Before moving on, we also note that \(p_k^T r_k = \norm{r_k}^2\), since \(p_k\) differs from \(r_k\) only by a combination of previous directions, all of which are orthogonal to \(r_k\). At this point, let us rewrite the \(r_k\) update equations in block form, to get a clearer view of the situation. From each \(r_k - r_{k+1} = \alpha_k A p_k\), we obtain:
\[R_{k+1} \underbrace{\mat{ 1& \\ -1 & 1\\ & -1 & 1 \\ & & \ddots & \ddots \\ & & & -1 & 1 \\ & & & & -1 \\ }}_{\underline{D}_k} = A P_k \block{\alpha_k}_k\]Projecting onto \(R_k\), and introducing \(\phi_k = \norm{r_k}^2\), gives:
\[\begin{align} R_k^T R_{k+1} \underline{D}_k &= R_k^T A P_k \block{\alpha_k}_k \\ \mat{ \diag\block{\phi_k} & 0_k} \underline{D}_k &= R_k^T A P_k \block{\alpha_k}_k \\ \diag\block{\phi_k} D_k &= R_k^T A P_k \block{\alpha_k}_k \\ \end{align}\]where \(D_k\) is the upper \(k\times k\) block of \(\underline{D}_k\), a lower bidiagonal matrix. More precisely:
\[\diag\block{\phi_k} D_k = \mat{ \phi_1 & \\ -\phi_2 & \phi_2\\ & -\phi_3 & \phi_3 \\ & & \ddots & \ddots \\ & & & -\phi_k & \phi_k \\ }\]From this, we finally get:
\[r_i^T A p_j \alpha_j = \cases{ \phi_i & if $i = j$ \\ -\phi_i & if $j = i - 1$ \\ 0 & otherwise }\]Since the \(\alpha_j\) are nonzero (unless we are finished) and \(A\) is symmetric, we get \(p_i^T A r_{k+1} = 0\) for \(i < k\), and the update for \(p_{k+1}\) reduces to:
\[p_{k+1} = r_{k+1} - p_k \frac{p_k^TAr_{k+1}}{p_k^T A p_k}\]This is called a short-recurrence relation: each \(p_{k+1}\) only needs to be conjugated against the previous direction. Some alternative formulas before finishing:
\[\begin{align} \beta_k &= \frac{p_k^TAr_{k+1}}{p_k^T A p_k} \\ &= -\frac{\phi_{k+1}}{\alpha_k p_k^T A p_k } = -\frac{\phi_{k+1}}{r_k^T p_k} \\ &= -\frac{\phi_{k+1}}{\phi_k} \\ \\ \alpha_k &= \frac{r_k^T p_k}{p_k^T A p_k} \\ &= \frac{\phi_k}{p_k^T A p_k} \\ \end{align}\]While the initial search direction \(p_0\) is set to the initial residual \(r_0\), any other initial direction could be chosen. Some references mention, however, that the speed of convergence can be affected (i.e. linear instead of superlinear) when \(p_0\) is not chosen to be the initial residual.
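Putting the pieces together, a minimal NumPy sketch of the resulting CG iteration, using the short recurrence and the \(\alpha_k\), \(\beta_k\) formulas above, could look like this (the function name and stopping rule are mine):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, maxiter=None):
    """Plain CG for symmetric positive definite A, written with the short
    recurrence: phi_k = ||r_k||^2, alpha_k = phi_k / (p_k^T A p_k) and
    p_{k+1} = r_{k+1} + (phi_{k+1} / phi_k) p_k."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x
    p = r.copy()                      # p_0 = r_0
    phi = r @ r                       # phi_k = ||r_k||^2
    for _ in range(maxiter or n):
        Ap = A @ p                    # single product with A per iteration
        alpha = phi / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap               # r_{k+1} = r_k - alpha_k A p_k
        phi_next = r @ r
        if np.sqrt(phi_next) < tol:
            break
        p = r + (phi_next / phi) * p  # conjugation against the previous direction only
        phi = phi_next
    return x
```

In exact arithmetic the residual vanishes after at most \(n\) iterations, consistent with the conjugacy argument above.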
The normalized \(r_k\) provide an orthogonal basis for the Krylov subspace \(\Krylov_k(A, b)\) and can be related to the Lanczos vectors. If \(p_0 = r_0\), then the \(p_k\) vectors also provide an \(A\)-orthogonal basis for \(\Krylov_k(A, b)\).
Let \(A\) be self-adjoint and positive definite with respect to an inner product \(M\), that is:
\[\inner{Ax, y}_M = \inner{x, Ay}_M \quad \text{and} \quad \inner{x, Ax}_M > 0 \quad \text{for all } x \neq 0\]
(or simply: let \(MA\) be symmetric positive definite). The following quadratic form then has a positive definite Hessian \(MA\), and is thus convex:
\[\begin{align} E_M(x) &= \half \inner{x, A x}_M - \inner{x, b}_M \\ &= \half x^TMAx - x^TMb \\ \end{align}\]So again, we look for its minimum. We apply the CG algorithm to the modified system \(MA x = Mb\), letting \(z_k = M r_k\) with \(r_k = b - Ax_k\) as before, and denoting the \(MA\)-conjugate directions by \(q_k\). We again obtain the following conditions for step length and conjugation: \[\alpha_k = \frac{\inner{q_k, r_k}_M}{\inner{q_k, A q_k}_M} = \frac{q_k^T M r_k}{q_k^T M A q_k}, \qquad q_i^T M A q_j = 0 \quad \text{for } i \neq j\]
Now we are facing a choice regarding what should span the \(q_k\): either the residuals \(r_k\) (i.e. the gradient of \(E_M\) with respect to \(M\)), or the preconditioned residuals \(z_k\) (i.e. the gradient of \(E_M\) with respect to the standard inner product). The second choice gives vanilla CG applied to the preconditioned system \(MAx = Mb\). Using the gradient with respect to \(M\) instead provides an extra degree of freedom in the algorithm: intuitively, gradient descent follows directions that are orthogonal to the level-sets in order to minimize the function, and orthogonality depends on the metric. Different metrics produce different trajectories across the level-sets, hopefully converging faster towards the solution. Since the second case was already described above, let us now consider the first case.
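The resulting iteration can be sketched as follows (a minimal NumPy version; the name metric_cg and the stopping rule are mine, and \(MA\) is assumed symmetric positive definite):

```python
import numpy as np

def metric_cg(A, M, b, x0=None, tol=1e-10, maxiter=None):
    """Standard CG with every inner product taken in the M-metric.
    One product with A and one with M per iteration."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                     # residual of the original system
    Mr = M @ r
    q = r.copy()                      # first descent direction q_0 = r_0
    phi = r @ Mr                      # phi_k = <r_k, r_k>_M
    for _ in range(maxiter or n):
        Aq = A @ q                    # one product with A ...
        MAq = M @ Aq                  # ... and one with M per iteration
        alpha = phi / (q @ MAq)       # line search in the M-metric
        x += alpha * q
        r -= alpha * Aq
        Mr -= alpha * MAq
        if np.linalg.norm(r) < tol:
            break
        phi_next = r @ Mr
        q = r + (phi_next / phi) * q  # conjugation against the previous direction only
        phi = phi_next
    return x
```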
which is exactly the standard CG algorithm using \(M\) for all inner products, as one could expect. At each iteration of the algorithm, the residuals \(r_k\) stay mutually \(M\)-orthogonal and the directions \(q_k\) mutually \(MA\)-conjugate, at the cost of one extra multiplication by \(M\) compared to vanilla CG.
The above requires that \(MA\) be symmetric, which can happen even when \(M\) is indefinite, or not even symmetric. However, \(r_k\) is not guaranteed to be a descent direction of the quadratic form when \(M\) is not positive definite: it might happen that the quadratic form simply stagnates along \(r_{k+1} \neq 0\):
\[r_{k+1}^T z_{k+1} = r_{k+1}^T M r_{k+1} = 0\]Should this happen, the subsequent direction \(q_{k+1}\) will also be a direction of stagnation, since \(Q_k^T M r_{k+1} = 0\) always holds (where \(Q_k\) collects the previous directions), resulting in:
\[q_{k+1}^T M r_{k+1} = 0\]At this point, the algorithm stops making any progress, since \(\alpha_{k+1}\) is zero and \(r_{k+2} = r_{k+1}\). It seems necessary and sufficient that the symmetric part of the metric be positive definite for the above breakdown scenario to be impossible.
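To make the breakdown concrete, here is a tiny constructed example where \(MA\) is symmetric but \(M\) is indefinite, and stagnation occurs at the very first step:

```python
import numpy as np

# A is symmetric positive definite and MA = diag(1, -2) is symmetric,
# but the metric M itself is indefinite.
A = np.diag([1.0, 2.0])
M = np.diag([1.0, -1.0])
b = np.array([1.0, 1.0])

x = np.zeros(2)
r = b - A @ x                    # r_0 = (1, 1) is nonzero ...
q = r.copy()
phi = r @ (M @ r)                # ... yet r_0^T M r_0 = 1 - 1 = 0
alpha = phi / (q @ (M @ A @ q))
print(phi, alpha)                # both vanish: the iterate never moves
```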
It is now straightforward to derive most classical symmetric Krylov methods as instances of the conjugate gradient algorithm for a suitable choice of metric \(M\).
The matrix \(\inv{B}A\) is trivially self-adjoint and positive definite with respect to an inner product \(B\), and we immediately obtain the Preconditioned Conjugate Gradient algorithm. The \(p_k\) will be \(A\)-conjugate, and the energy norm will be minimized along the way, but the residuals of the preconditioned system will now be \(B\)-conjugate, hopefully following a faster path towards the solution.
It is generally more convenient to use the residual for the original system instead, and introduce an extra variable \(z = \inv{B} r\) as follows:
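A minimal NumPy sketch of the resulting iteration (the name preconditioned_cg and the callable solve_B, which applies \(\inv{B}\), are mine):

```python
import numpy as np

def preconditioned_cg(A, solve_B, b, x0=None, tol=1e-10, maxiter=None):
    """CG on B^{-1} A with metric B, written in terms of the residual r of
    the original system and the extra variable z = B^{-1} r.
    Assumes A and B symmetric positive definite; solve_B(v) applies B^{-1}."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x
    z = solve_B(r)                    # z_0 = B^{-1} r_0
    q = z.copy()
    phi = r @ z                       # phi_k = r_k^T B^{-1} r_k
    for _ in range(maxiter or n):
        Aq = A @ q
        alpha = phi / (q @ Aq)        # the directions q_k are A-conjugate
        x += alpha * q
        r -= alpha * Aq
        if np.linalg.norm(r) < tol:
            break
        z = solve_B(r)                # one solve with B per iteration
        phi_next = r @ z
        q = z + (phi_next / phi) * q  # conjugation against the previous direction only
        phi = phi_next
    return x
```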
Intuitively, the metric \(B\) should ease displacements where the curvature is low, and damp them where the curvature is high. By assigning a high cost to highly-curved directions, the gradient in these directions becomes small (you don’t have to move much to decrease the energy, due to the high associated cost), and conversely. In this respect, the diagonal metric \(B = \diag(A)\) approximates the curvature along each axis (similar to the Levenberg-Marquardt algorithm), and damps the corresponding displacements accordingly.
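For instance, the Jacobi metric \(B = \diag(A)\) plugs into the hypothetical preconditioned_cg sketch above as:

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((5, 5))
A = G @ G.T + 5.0 * np.eye(5)     # a small symmetric positive definite test matrix
b = rng.standard_normal(5)

d = np.diag(A)                    # B = diag(A): the Jacobi metric
x = preconditioned_cg(A, lambda v: v / d, b)
print(np.linalg.norm(A @ x - b))  # close to zero
```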
The Conjugate Residuals (CR) method is exactly CG on \(A\) with the non-standard metric \(M = A^T\). The residual norm is minimized, and the residuals are \(A^T\)-conjugate.
This algorithm yields the same iterates as MINRES (and is cheaper to compute). \(Ap\) can be obtained from \(Ar\) at each iteration to keep only one multiplication by \(A\) per iteration.
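A minimal NumPy sketch of this Conjugate Residuals iteration, with \(Ap\) updated from \(Ar\) so that the loop contains a single product with \(A\) (the function name is mine; \(A\) is assumed symmetric):

```python
import numpy as np

def conjugate_residuals(A, b, x0=None, tol=1e-10, maxiter=None):
    """CG on A with metric M = A^T (= A for symmetric A): Conjugate Residuals.
    phi_k = r_k^T A r_k, and A p is updated from A r so that only one
    product with A is needed per iteration."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x
    p = r.copy()
    Ar = A @ r
    Ap = Ar.copy()
    phi = r @ Ar                  # phi_k = <r_k, r_k>_A
    for _ in range(maxiter or n):
        alpha = phi / (Ap @ Ap)   # line search in the A-metric
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        Ar = A @ r                # the only product with A in the loop
        phi_next = r @ Ar
        beta = phi_next / phi
        p = r + beta * p
        Ap = Ar + beta * Ap       # A p_{k+1} obtained without another product
        phi = phi_next
    return x
```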
The preconditioned variant is the one found on Wikipedia: this time, CG is applied to \(\inv{B}A\) with metric \(A^T\). The \(\inv{B}\)-norm of the residual is minimized, and the residuals are again \(A^T\)-conjugate.
\(Ap\) can be obtained from \(Ar\) at each iteration to keep only one multiplication by \(A\) per iteration.
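A minimal NumPy sketch of this preconditioned variant (the name preconditioned_cr and the callable solve_B, applying \(\inv{B}\), are mine); each iteration uses one product with \(A\) and one solve with \(B\):

```python
import numpy as np

def preconditioned_cr(A, solve_B, b, x0=None, tol=1e-10, maxiter=None):
    """CG on B^{-1} A with metric A^T: preconditioned Conjugate Residuals.
    Assumes A symmetric and B symmetric positive definite; z = B^{-1} r."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                    # residual of the original system
    z = solve_B(r)                   # residual of the preconditioned system
    q = z.copy()
    Az = A @ z
    Aq = Az.copy()
    phi = z @ Az                     # phi_k = z_k^T A z_k
    for _ in range(maxiter or n):
        Bi_Aq = solve_B(Aq)          # the only solve with B in the loop
        alpha = phi / (Aq @ Bi_Aq)   # line search for the B^{-1}-norm of the residual
        x += alpha * q
        r -= alpha * Aq
        z -= alpha * Bi_Aq
        if np.linalg.norm(r) < tol:
            break
        Az = A @ z                   # the only product with A in the loop
        phi_next = z @ Az
        beta = phi_next / phi
        q = z + beta * q
        Aq = Az + beta * Aq          # A q_{k+1} from A z_{k+1}, as noted above
        phi = phi_next
    return x
```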
One further variant is CG on \(\inv{B}A\) using the metric \(A^TB\). In particular, \(A\) does not need to be symmetric; only \(A^TB\) has to be positive definite. The norm of the residual is minimized, and the residuals are \(A^TB\)-conjugate.
Using an extra variable \(z = \inv{B} r\):
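Here is a minimal NumPy sketch of this formulation (the name nonsymmetric_cr and the callable solve_B are mine); it assumes \(A^TB\) symmetric positive definite, which is my reading of the condition above:

```python
import numpy as np

def nonsymmetric_cr(A, solve_B, b, x0=None, tol=1e-10, maxiter=None):
    """CG on B^{-1} A with metric A^T B, written with z = B^{-1} r.
    A need not be symmetric; A^T B is assumed symmetric positive definite."""
    n = len(b)
    x = np.zeros(n) if x0 is None else x0.astype(float)
    r = b - A @ x                    # residual of the original system
    z = solve_B(r)                   # z = B^{-1} r
    q = z.copy()
    Az = A @ z
    Aq = Az.copy()
    phi = Az @ r                     # phi_k = <z_k, z_k>_{A^T B} = (A z_k)^T r_k
    for _ in range(maxiter or n):
        alpha = phi / (Aq @ Aq)      # line search for the Euclidean norm of the residual
        x += alpha * q
        r -= alpha * Aq
        z -= alpha * solve_B(Aq)     # one solve with B per iteration
        if np.linalg.norm(r) < tol:
            break
        Az = A @ z                   # one product with A per iteration
        phi_next = Az @ r
        beta = phi_next / phi
        q = z + beta * q
        Aq = Az + beta * Aq
        phi = phi_next
    return x
```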
Any convex combination of \(I\) and \(A^T\) as a metric will minimize the corresponding weighted sum of the energy and the residual norm.