Solving Sparse Linear Systems Faster than Matrix Multiplication (2024)

research-article

Open access

Authors: Richard Peng and Santosh S. Vempala

Communications of the ACM, Volume 67, Issue 7

Pages 79 - 86

Published: 01 July 2024 Publication History

  • 0citation
  • 0
  • Downloads

Metrics

Total Citations0Total Downloads0

Last 12 Months0

Last 6 weeks0

  • Get Citation Alerts

    New Citation Alert added!

    This alert has been successfully added and will be sent to:

    You will be notified whenever a record that you have chosen has been cited.

    To manage your alert preferences, click on the button below.

    Manage my Alerts

    New Citation Alert!

    Please log in to your account

  • All formatsPDF

      • View Options
      • References
      • Media
      • Tables
      • Share

    Abstract

    Can linear systems be solved faster than matrix multiplication? While there has been remarkable progress for the special cases of graph-structured linear systems, in the general setting, the bit complexity of solving an n × n linear system Ax = b is Õ(nω), where ω is the matrix multiplication exponent. Improving on this has been an open problem even for sparse linear systems with poly(n) condition number.

    In this paper, we present an algorithm that solves linear systems with sparse coefficient matrices asymptotically faster than matrix multiplication for any ω > 2. This speedup holds for any input matrix A with o(nω−1 / log (κ(A))) non-zeros, where κ(A) is the condition number of A.

    Our algorithm can be viewed as an efficient, randomized implementation of the block Krylov method via recursive low displacement rank factorization. It is inspired by an algorithm of Eberly et al. for inverting matrices over finite fields. In our analysis of numerical stability, we develop matrix anti-concentration techniques to bound the smallest eigenvalue and the smallest gap in the eigenvalues of semi-random matrices.

    1. Introduction

    Solving a linear system Ax=b is a basic algorithmic problem with direct applications to scientific computing, engineering, and physics, and is at the core of algorithms for many other problems, including optimization, data science, and computational geometry. It has enjoyed an array of elegant approaches, from Cramer’s rule and Gaussian elimination to numerically stable iterative methods to more modern randomized variants based on random sampling8,19 and sketching.23 Despite much recent progress on faster solvers for graph-structured linear systems8,9,19 progress on the general case has been elusive.

    Most of the work in obtaining better running time bounds for linear systems solvers has focused on efficiently computing the inverse of A, or some factorization of it. Such operations are in turn closely related to the cost of matrix multiplication. Matrix inversion can be reduced to matrix multiplication via divide-and-conquer, and this reduction was shown to be stable when the word size for representing numbersb is increased by a factor of O(logn).4 The current best runtime of ω<2.38 follows a long line of work on faster matrix multiplication algorithms and is also the current best running time for solving Ax=b: when the input matrix/vector are integers, matrix multiplication based algorithms can obtain the exact rational solution using O(nω) word operations.20

    Methods for matrix inversion or factorization are often referred to as direct methods in the linear systems literature. This is in contrast to iterative methods, which gradually converge to the solution. Iterative methods have low space overhead, and therefore are widely used for solving large, sparse, linear systems that arise in scientific computing. Another reason for their popularity is that iterative methods are naturally suited to producing approximate solutions of desired accuracy in floating point arithmetic, the de facto method for representing real numbers. Perhaps the most famous iterative method is the Conjugate Gradient (CG) / Lanczos algorithm.16 It was introduced as an O(n·nnz) time algorithm under exact arithmetic, where nnz is the number of non-zeros in the input matrix. However, this bound only holds under the Real RAM model where words have unbounded precision. When taking bit sizes into account, it incurs an additional factor of n. Despite much progress in iterative techniques in the intervening decades, obtaining gains over matrix multiplication in the presence of round-off errors has remained an open question.

    The convergence and stability of iterative methods typically depend on some condition number of the input. When all intermediate steps are carried out using precision close to the condition number of A, the running time bounds of the CG algorithm, as well as other currently known iterative methods, depend polynomially on the condition number of the input matrix A. Formally, the condition number of a symmetric matrix A, κ(A), is the ratio between the maximum and minimum eigenvalues of A. Here the best known rate of convergence when all intermediate operations are restricted to bit-complexity O(log(κ(A))) is O(κ(A)log(1/ϵ)) iterations to achieve error ϵ. This is known to be tight if one restricts to matrix-vector multiplications in the intermediate steps.12,17 This means for moderately conditioned (e.g., with κ=poly(n)), sparse, systems, the best runtime bounds are still via direct methods, which are stable when O(log(1/κ)) words of precision are maintained in intermediate steps.4

    Many of the algorithms used in practice in scientific computing for solving linear systems involving large, sparse matrices are based on combining direct and iterative methods: we will briefly discuss these perspectives in Section1.3. In terms of asymptotic complexity, the practical successes of many such methods naturally lead to the question of whether one can provably do better than the O(min{nω,nnz·κ(A)}) time corresponding to the faster of direct or iterative methods. Somewhat surprisingly, despite the central role of this question in scientific computing and numerical analysis, as well as extensive studies of linear systems solvers, progress on it has been elusive. The continued lack of progress on this question has led to its use as a hardness assumption for showing conditional lower bounds for numerical primitives such as linear elasticity problems25 and positive linear programs.10 One formalization of such hardness is the Sparse Linear Equation Time Hypothesis (SLTH) from:10 SLTHkγ denotes the assumption that a sparse linear system with κnnz(A)k cannot be solved in time faster than nnz(A)γ to within relative error ϵ=n-10k. Here, improving over the smaller running time of both direct and iterative methods can be succinctly encapsulated as refuting SLTHkmin{1+k/2,ω}.c

    We provide a faster algorithm for solving sparse linear systems. Our formal result is the following (we use the form defined in:10 Linear Equation Approximation Problem, LEA).

    Given a matrix A with maximum dimension n, nnz(A) non-zeros (whose values fit into a single word), along with a parameter κ(A) such that κ(A)σmax(A)/σmin(A), a vector b and error requirement ϵ, we can compute, under fixed point arithmetic, in time

    O(max{nnz(A)ω2ω1n2,n5ω4ω+1}log2(κ/ϵ))

    a vector x such that

    AxΠAb22ϵΠAb22,

    where c is a fixed constant and ΠA is the projection operator onto the column space of A.

    Note that ΠAb2=ATb(ATA), and when A is square and full rank, it is just b2.

    The cross-over point for the two bounds is at nnz(A)=n3(ω-1)ω+1. In particular, for the sparse case with nnz(A)=O(n), and the bound of ω2.38, we get an exponent of

    max{2+ω2ω1,5ω4ω+1}<max{2.28,2.34}=2.34.

    As nnnz, this also translates to a running time of O(nnz5ω-4ω+1), which as 5ω-4ω+1=ω-(ω-2)2ω+1, refutes SLTHkω for constant values of k and any value of ω>2.

    We can parameterize the asymptotic gain over matrix multiplication for moderately sparse instances. Here we use the O~(·) notation to hide lower-order terms, specifically O~(f(n)) denotes O(f(n)·logc(f(n))) for some absolute constant c.

    For any matrix A with dimension at most n, O(nω-1-θ) non-zeros, and condition number nO(1), a linear system in A can be solved to accuracy n-O(1) in time O~(max{n5ω-4ω+1,nω-θ(ω-2)ω-1}).

    Here the cross-over point happens at θ=(ω-1)(ω-2)ω+1. Also, because 5ω-4ω+1=ω-(ω-2)2ω+1, we can also infer that for any 0<θω-2 and any ω>2, the runtime is o(nω), or asymptotically faster than matrix multiplication.

    1.1 Idea

    At a high level, our algorithm follows the block Krylov space method (see e.g., Chapter 6.12 of Saad16). This method is a multi-vector extension of the CG/Lanczos method, which in the single-vector setting is known to be problematic under round-off errors both in theory12 and in practice.16 Our algorithm starts with a set of s initial vectors, Bn×s, and forms a column space by multiplying these vectors by A repeatedly, m times. Formally, the block Krylov space matrix is

    K=[B|AB|A2B|...|Am1B].

    The core idea of Krylov space methods is to efficiently orthogonalize this column space. For this space to be spanning, block Krylov space methods typically choose s and m so that sm=n.

    The conjugate gradient algorithm can be viewed as an efficient implementation of the case s=1, m=n, with B set to b, the RHS of the input linear system. The block case with larger values of s was studied by Eberly, Giesbrecht, Giorgi, Storjohann, and Villard5 over finite fields, and they gave an O(n2.28) timed algorithm for computing the inverse of an O(n)-sparse matrix over a finite field.

    Our algorithm also leverages the top-level insight of the Eberly et al. results: the Gram matrix of the Krylov space matrix, is a block Hankel matrix. Solving linear systems in this Gram matrix, (AK)T(AK), lead to solvers for linear systems in A because

    ((AK)Τ(AK))1=(KΤAΤAK)1=K1A1AΤKΤ,

    so as long as A and K are both invertible, composing this on the left by K and on the right by AK gives

    A1=K((AK)Τ(AK))1ATKT.

    Eberly et al. viewed the Gram matrix as an m-by-m matrix containing s-by-s sized blocks, and critically leveraged the fact that the blocks along each anti-diagonal are identical:

    (AK)T(AK)=BTA2BBTA3BBTA4B...BTAm+1BBTA3BBTA4BBTA5B...BTAm+2BBTA4B...BTAm+1BBTA5B...BTAm+2BBTA6B...BTAm+3B.........BTAm+3B...BTA2mB

    Formally, the s-by-s inner product matrix formed from AiB and AjB is BTAi+jB, and depends only on i+j. So instead of m2 blocks each of size s×s, we are able to represent an n-by-n matrix with only about m blocks.

    Operations involving these m blocks of the Hankel matrix can be handled using O~(m) block operations. This is perhaps easiest seen for computing matrix-vector products using K. If we use {i} to denote the ith block of the Hankel matrix, and define

    H{i,j}=M(i+j)

    for a sequence of matrices M, we get that the ith block of the product Hx can be written in block-form as

    (Hx){i}=jH{i,j}x{j}=jM(i+j)x{j}.

    Note this is precisely the convolution of (a sub-interval) of M and x, with shifts indicated by i. Therefore, in matrix-vector multiplication (the “forward” direction), a speedup by a factor of about m is possible with fast convolution algorithms. The performance gains of the Eberly et al. algorithms5 can be viewed as being of a similar nature, albeit in the more difficult direction of solving linear systems. Specifically, they utilize algorithms for the Padé problem of computing a polynomial from the result of its convolution.1 Over finite fields, or under exact arithmetic, such algorithms for matrix Padé problems take O(mlogm) block operations,1 for a total of O~(sωm) operations.

    The overall time complexity is based on two opposing goals:

    1.

    Quickly generate the Krylov space: repeated multiplication by A allows us to generate AiB using O(ms·nnz)=O(n·nnz) arithmetic operations. Choosing a sparse B then allows us to compute BTAiB in O(n·s) arithmetic operations, for a total overhead of O(n2).

    2.

    Quickly invert the Hankel matrix. Each operation on an s-by-s block takes O(sω) time. Under the optimistic assumption of O~(m) block operations, the total is O~(m·sω).

    Under these assumptions, and the requirement of nms, the total cost becomes about O(n·nnz+m·sω), which is at most O(n·nnz) as long as m>nω-2ω-1. However, this runtime complexity is over finite fields, where numerical stability is not an issue. Over the reals, under round-off errors, one must contend with numerical errors without blowing up the bit complexity. This is a formidable challenge; indeed, as mentioned earlier, with exact arithmetic, the CG method takes time O(n·nnz), but this is misleading since the computation is effective only when the word sizes are increased by a factor of n (to about nlogκ words), which leads to an overall complexity of O(n2·nnz·logκ).

    1.2 Our Contributions

    Our algorithm can be viewed as the numerical generalization of the algorithms from.5 We work with real numbers of bounded precision, instead of entries over a finite field. The core of our approach can be summarized as follows.

    The block Krylov space method together with fast Hankel solvers can be made numerically stable using O~(mlog(κ)) words of precision.

    Doing so requires separately developing tools for two topics that have been extensively studied in mathematics:

    1.

    Obtain low numerical cost solvers for block Hankel/Toeplitz matrices. Many of the prior algorithms rely on algebraic identities that do not generalize to the block setting, and are often (experimentally) numerically unstable.7

    2.

    Develop matrix anti-concentration bounds for analyzing the word lengths of inverses of random Krylov spaces. This is to upper bound the probability of random matrices being in some set of small measure, which in our case is the set of nearly singular matrices. Previously, such bounds were known assuming the matrix entries are independent18,21 but Krylov matrices have correlated columns.

    Before we describe the difficulties and new tools needed, we first provide some intuition on why a factor m increase in word lengths may be the right answer by upper-bounding the magnitudes of entries in an m-step Krylov space. By rescaling, we may assume that the minimum singular value of A is at least 1/κ, and the maximum entry in A is at most κ. The maximum magnitude of (entries of) Amb is bounded by the maximum magnitude of A to the power of m, times a factor corresponding to the number of summands in the matrix product:

    ||Amb||nm|||A|||m||b||(nk)m||b||κO(m)||b||.

    Here the last inequality is via the assumption of κn. So by forming K via horizontally concatenating Aig for sparse Gaussian vectors g, we have with high probability that the maximum magnitude of an entry of K, and in turn AK, is at most κO(m). In other words, O(mlogκ) words in front of the decimal point is sufficient with high probability.

    Should such a bound of O(mlogκ) hold for all numbers that arise in the algorithm, including the matrix inversion steps, and the matrix B is sparse with O(n) entries, the cost of computing the block-Krylov matrices becomes O(mlogκ·ms·nnz), while the cost of the matrix inversion portion encounters an overhead of O(mlogκ), for a total of O~(m2sωlogκ). In the sparse case of nnz=O(n), and nms, this becomes:

    O(n2mlogκ+m2sωlogκ)=O(n2mlogκ+nωmω2logκ).

    (1)

    Due to the gap between n2 and nω, setting m appropriately gives improvement over nω when logκ=no(1).

    However, the magnitude of an entry in the inverse depends on the smallest magnitude, or in the matrix case, its minimum singular value. Bounding and propagating the min singular value, which intuitively corresponds to how close a matrix is to being degenerate, represents our main challenge. In exact/finite fields settings, non-degeneracies are certified via the Schwartz-Zippel lemma about polynomial roots. The numerical analog of this is more difficult — the Krylov space matrix K is asymmetric, even for a symmetric matrix A. It is much easier for an asymmetric matrix with correlated entries to be close to singular.

    Consider for example a two-banded, two-block matrix with all diagonal entries set to the same random variable α (see Figure1):

    Aij={1ifi=jandjn/2,αifi=j+1andjn/2,αifi=j+1andn/2<j,2ifi=j+1andn/2<j,0otherwise.

    Figure 1.

    Solving Sparse Linear Systems Faster than Matrix Multiplication (1)

    In the exact case, this matrix is full rank unless α=0, even over finite fields. On the other hand, its minimum singular value is close to 0 for all values of α. To see this, it’s useful to first make the following observation about the min singular value of one of the blocks.

    The minimum singular value of a matrix with 1s on the diagonal, α on the entries immediately below the diagonal, and 0 everywhere else is at most |α|-(n-1), due to the test vector [1;-α;α2;...;(-α)n-1].

    Then, in the top-left block, as long as |α|>3/2, the top left block has minimum singular value at most (2/3)n-1. On the other hand, rescaling the bottom-right block by 1/α to get 1s on the diagonal gives 2/α on the off-diagonal. So as long as |α|<3/2, this value is at least 4/3, which in turn implies a minimum singular value of at most (3/4)n-1 in the bottom right block. This means no matter what value α is set to, this matrix will always have a singular value that’s exponentially close to 0. Furthermore, the Gram matrix of this matrix also gives such a counter example to symmetric matrices with (non-linearly) correlated entries. Previous works on analyzing condition numbers of asymmetric matrices also encounter similar difficulties; a more detailed discussion of it can be found in Section 7 of Sankar et al.18

    In order to bound the bit complexity of all intermediate steps of the block Krylov algorithm by O~(m·logκ), we devise a more numerically stable algorithm for solving block Hankel matrices, as well as provide a new perturbation scheme to quickly generate a well-conditioned block Krylov space. Central to both of our key components is the close connection between condition number and bit complexity bounds.

    First, we give a more numerically stable solver for block Hankel/Toeplitz matrices. Fast solvers for Hankel (and closely related Toeplitz) matrices have been extensively studied in numerical analysis, with several recent developments on more stable algorithms.24 However, the notion of numerical stability studied in these algorithms is the variant where the number of bits of precision is fixed. Our attempts at converting these into asymptotic bounds yielded dependencies quadratic in the number of digits in the condition number, which in our setting translates to a prohibitive cost of O~(m2) (i.e., the overall cost would be higher than nω).

    Instead, we combine developments in recursive block Gaussian elimination4 with the low displacement rank representation of Hankel/Toeplitz matrices.7 Such representations allow us to implicitly express both the Hankel matrix and its inverse by displaced versions of low-rank matrices. This means the intermediate size of instances arising from recursion is O(s) times the dimension, for a total size of O(nlogn), giving a total of O~(nsω-1) arithmetic operations involving words of size O~(m). We provide a rigorous analysis of the accumulation of round-off errors similar to the analysis of recursive matrix multiplication based matrix inversion from.4

    Motivated by this close connection with the condition number of Hankel matrices, we then try to initialize with Krylov spaces of low condition number. Here we show that a sufficiently small perturbation suffices for producing a well conditioned overall matrix. In fact, the first step of our proof, showing that a small sparse random perturbation to A guarantees good separations between its eigenvalues, is a direct combination of bounds on eigenvalue separation of random Gaussians13 as well as min eigenvalue of random sparse matrices.11 This separation then ensures that the powers of A, A1,A2,...Am, are sufficiently distinguishable from each other. Such considerations also come up in the smoothed analysis of numerical algorithms.18

    The randomness of the Krylov matrix induced by the initial set of random vectors B is more difficult to analyze: each column of B affects m columns of the overall Krylov space matrix. In contrast, all existing analyses of lower bounds of singular values of possibly asymmetric random matrices18,21 rely on the randomness in the columns of matrices being independent. The dependence between columns necessitates analyzing singular values of random linear combinations of matrices, which we handle by adapting ϵ-net based proofs of anti-concentration bounds. Here we encounter an additional challenge in bounding the minimum singular value of the block Krylov matrix. We resolve this issue algorithmically: instead of picking a Krylov space that spans the entire n, we stop short by picking ms=n-O~(m) This set of extra columns significantly simplifies the proof of singular value lower bounds. This is similar in spirit to the analysis of the minimum singular value of a random matrix, which is easier for a non-square matrix.15 In the algorithm, the remaining columns are treated as a separate block that we handle via a Schur complement at the very end of the algorithm. Since this block is small, so is its overhead on the running time.

    1.3 History and Related Work

    Our algorithm has close connections with multiple lines of research on efficient solvers for sparse linear systems. The topic of efficiently solving linear systems has been extensively studied in computer science, applied mathematics and engineering. For example, in the Society of Industrial and Applied Mathematics News’ ‘top 10 algorithms of the 20th century’, three of them (Krylov space methods, matrix decompositions, and QR factorizations) are directly related to linear systems solvers.3

    At a high level, our algorithm is a hybrid linear systems solver. It combines iterative methods, namely block Krylov space methods, with direct methods that factorize the resulting Gram matrix of the Krylov space. Hybrid methods have their origins in the incomplete Cholesky method for speeding up elimination/factorization based direct solvers. A main goal of these methods is to reduce the Ω(n2) space needed to represent matrix factorizations/inverses. This high space requirement is often even more problematic than time requirements when handling large sparse matrices. Such reductions can occur in two ways: either by directly dropping entries from the (intermediate) matrices, or by providing more succinct representations of these matrices using additional structure.

    The main structure of our algorithm is based on the latter line of work on solvers for structured matrices. Such systems arise from physical processes where the interactions between objects have invariances (e.g., either by time or space differences). Examples of such structure include circulant matrices, Hankel/Toeplitz matrices and distances from n-body simulations.7 Many such algorithms require exact preservation of the structure in intermediate steps. As a result, many of these works develop algorithms over finite fields.

    More recently, there has been work on developing numerically stable variants of these algorithms for structured matrices, or more generally, matrices that are numerically close to being structured.24 However, these results only explicitly discussed in the entry-wise Hankel/Toeplitz case (which corresponds to block size s=1). Furthermore, because they rely on domain-decomposition techniques similar to fast multiple methods, they produce one bit of precision for each outer iteration loop. As the Krylov space matrix has condition number exp(Ω(m)), such methods would lead to another factor of m in the solve cost if invoked directly.

    Instead, our techniques for handling and bounding numerical errors are more closely related to recent developments in provably efficient sparse Cholesky factorizations.9 These methods generate efficient preconditioners using only the condition that intermediate steps of Gaussian elimination, known as Schur complements, have small representations. They avoid the explicit generation of the dense representations of Schur complements by treating them as operators, and apply randomized tools to directly sample/sketch the final succinct representations, which have much smaller size and algorithmic cost.

    On the other hand, previous works on sparse Cholesky factorizations required the input matrix to be decomposable into a sum of simple elements, often through additional combinatorial structure of the matrices. In particular, this line of work on combinatorial preconditioning was initiated through a focus on graph Laplacians, which are built from 2-by-2 matrix blocks corresponding to edges of undirected graphs.19 Since then, there have been substantial generalizations to the structures amenable to such approaches, notably to finite element matrices, and directed graphs/irreversible Markov chains. However, recent works have also shown that many classes of structures involving more than two variables are complete for general linear systems.25 Nonetheless, the prevalence of approximation errors in such algorithms led to the development of new ways to bound numerical round-off errors in algorithms, which will be critical to our elimination routine for block-Hankel matrices.

    Key to recent developments in combinatorial preconditioning is matrix concentration.22 Such bounds provide guarantees for (relative) eigenvalues of random sums of matrices. For generating preconditioners, such randomness arises from whether each element is kept or not, and a small condition number (which in turn implies a small number of outer iterations using the preconditioner) corresponds to a small deviation between the original and sampled matrices. In contrast, we introduce randomness in order to obtain block Krylov spaces whose minimum eigenvalue is large. As a result, the matrix tool we need is anti-concentration, which somewhat surprisingly is far less studied. Previous works on it are related to similar problems of numerical precision18,21 and address situations where the entries in the resulting matrix are independent. Our bound on the min singular value of the random Krylov space also yields a crude bound for a sum of rectangular random matrices.

    1.4 Subsequent Improvements and Extensions

    Nie14 gave a more general and tighter version of matrix anti-concentration that also works for square matrices, answering an open question we posed. For an m-step Krylov space instantiated using nm vectors, Nie’s bound reduces the middle term in our analysis from O(n2m3) to O(n2m), thus leading to a running time of nnz·n·m+nω·m2-ω, which matches the bound for finite fields. Moreovver, it does so without the padding step at the end. We elect to keep for this article our epsilon-net based analyses, and the padded algorithm required for it, both for completeness, and due to it being a more elementary approach toward the problem with a simpler proof.

    Faster matrix multiplication is an active area of work with recent progresses. Due to the dependence on fast multiplication in our algorithm, such improvements also lead to improvements in the running time of solving sparse systems as well.

    Our main result for solving sparse linear systems has also been extended to solving sparse regression, with faster than matrix multiplication bounds for sufficiently sparse matrices.6 The complexity of sparse linear programming remains an interesting open problem.

    2. Algorithm

    We describe the algorithm, as well as the running times of its main components in this section. To simplify the discussion, we assume the input matrix A is symmetric, and has poly(n) condition number. If it is asymmetric (but invertible), we implicitly apply the algorithm to ATA, using the identity A-1=(ATA)-1AT derived from (ATA)-1=A-1A-T. Also, recall from the discussion after Theorem1 that we use O~(·) to hide logarithmic terms in order to simplify runtimes.

    Before giving details of our algorithm, we first discuss what constitutes a linear system solver algorithm, specifically the equivalence between many such algorithms and linear operators.

    For an algorithm ALG that takes a matrix B as input, we say that ALG is linear if there is a matrix ZALG such that for any input B, the output of running the algorithm on B is the same as multiplying B by ZALG:

    ALGB=ZALGB.

    In this section, in particular in the pseudocode in Algorithm2, we use the name of the procedure, SOLVEA(b,δ), interchangeably with the operator corresponding to a linear algorithm that solves a system in A, on vector b, to error δ>0. In the more formal analysis, we will denote such corresponding linear operators using the symbol Z, with subscripts corresponding to the routine if appropriate.

    Figure 2.

    Solving Sparse Linear Systems Faster than Matrix Multiplication (2)

    This operator/matrix based analysis of algorithms was first introduced in the analysis of a recursive Chebyshev iteration by Spielman and Teng,19 with credits to the technique also attributed to V.Rokhlin. It has the advantage of simplifying the analyis of multiple iterations of such algorithms, as we can directly measure Frobenius norm differences between such operators and the exact ones that they approximate.

    Under this correspondence, the goal of producing an algorithm that solves Ax=b for any b as input becomes equivalent to producing a linear operator ZA that approximates A-1, and then running it on the input b. For convenience, we also let the solver take as input a matrix instead of a vector, in which case the output is the result of solves against each of the columns of the input matrix as the RHS.

    The high-level description of our algorithm is in Figure2.

    Some of the steps of the algorithm require care for efficiency as well as for tracking the number of words needed to represent the numbers. We assume a bound on bit complexity of O~(m) when κ=poly(n) in the brief description of costs in the outline of the steps below.

    We start by perturbing the input matrix, resulting in a symmetric positive definite matrix where all eigenvalues are separated by αA. Then we explicitly form a Krylov matrix from a sparse random Gaussian matrix, see Fig.3. For any vector u, we can compute Aiu from Ai-1u via a single matrix-vector multiplication in A. So computing each column of K requires O(nnz(A)) operations, each involving a length n vector with words of length O~(m). So we get the matrix K, as well as AK, in time

    O˜(nnz(A)nm).

    Figure 3.

    Solving Sparse Linear Systems Faster than Matrix Multiplication (3)

    To obtain a solver for AK, we instead solve its Gram matrix (AK)T(AK). Each block of KTK has the form (GS)TAiGS for some 2i2m, and can be computed by multiplying (GS)T and AiGS. As AiGS is an n-by-s matrix, each non-zero in GS leads to a cost of O(s) operations involving words of length O~(m). Then because we chose GS to have O~(m3) non-zeros per column, the total number of non-zeros in GS is about O~(s·m3)=O~(nm2). This leads to a total cost (across the m values of i) of:

    O˜(n2m3).

    The key step is then Step2, a block version of the Conjugate Gradient method. It will be implemented using a recursive data structure based on the notion of displacement rank.7 To get a sense of why a faster algorithm may be possible, note that there are only O(m) distinct blocks in the matrix (AK)T(AK). So a natural hope is to invert these blocks by themselves; the cost of (stable) matrix inversion,4 times the O~(m) numerical word complexity, would then give a total of

    O˜(m2sω)=O˜(m2(nm)ω)=O˜(nωm2ω).

    Of course, it does not suffice to solve these m s-by-s blocks independently. Instead, the full algorithm, as well as the SOLVEM operator, is built from efficiently convolving such s-by-s blocks with matrices using Fast Fourier Transforms. Such ideas can be traced back to the development of super-fast solvers for (entry-wise) Hankel/Toeplitz matrices.7

    Choosing s and m so that n=sm would then give the overall running time, assuming that we can bound the minimum singular value of K by exp(-O~(m)). This is a shortcoming of our analysis: we can only prove such a bound when n-smΩ(m). The underlying reason is that rectangular semi-random matrices can be analyzed using ϵ-nets, and thus are significantly easier to analyze than square matrices.

    This means we can only use m and s such that n-ms=Θ(m), and we need to pad K with n-ms columns to guarantee a full rank, invertible, matrix. To this end, we add Θ(m) dense Gaussian columns to K to form Q, and solve the system AQ, and its associated Gram matrix (AQ)T(AQ) instead. These matrices are shown in Figure4.

    Figure 4.

    Solving Sparse Linear Systems Faster than Matrix Multiplication (4)

    Since these additional columns are entry-wise i.i.d, the minimum singular value can be analyzed using existing tools18,21 namely lower bounding the inner product of a random vector against any vector. Thus, we can lower bound the minimum singular value of Q, and in turn AQ, by exp(-O~(m)).

    This bound in turn translates to a lower bound on the minimum eigenvalue of the Gram matrix of AQ, (AQ)T(AQ). Partitioning its entries by those from K and G gives four blocks: one (sm)-by-(sm) block corresponding to (AK)T(AK), one Θ(m)-by-Θ(m) block corresponding to (AG)T(AG), and then the cross terms. To solve this matrix, we apply block-Gaussian elimination, or equivalently, form the Schur complement onto the Θ(m)-by-Θ(m) corresponding to the columns in AG.

    To compute this Schur complement, it suffices to solve the top-left block (corresponding to (AK)T(AK)) against every column in the cross term. As there are at most Θ(m)<s columns, this solve cost comes out to less than O~(sωm) as well. We are then left with a Θ(m)-by-Θ(m) matrix, whose solve cost is a lower order term.

    So the final solver costs

    O˜(nnz(A)nm+n2m3+nωm2ω)

    which leads to the final running time by choosing m to balance the terms. This bound falls short of the ideal case given in Equation1 mainly due to the need for a denser B to the well-conditionedness of the Krylov space matrix. Instead of O(n) non-zeros total, or about O(m) per column, we need poly(m) non-zero variables per column to ensure the an exp(-O(m)) condition number of the block Krylov space matrix K. This in turn leads to a total cost of O(n·nnz·poly(m)) for computing the blocks of the Hankel matrix, and a worse trade off when summed against the nωmω-2 term.

    3. Acknowledgments

    Richard Peng was supported in part by NSF CAREER award 1846218/2330255, and Santosh Vempala by NSF awards AF-1909756 and AF-2007443. We thank Mark Giesbrecht for bringing to our attention the literature on block-Krylov space algorithms.

    Richard Peng ([emailprotected]), Carnegie Mellon University, Pittsburgh, PA, USA. Work here done while at Georgia Institute of Technology, Atlanta, GA, USA.

    Santosh S. Vempala ([emailprotected]), Georgia Institute of Technology, Atlanta, GA, USA.

    Footnotes

    a

    Currently, ω<2.371866.

    b

    We will be measuring bit-complexity under fixed-point arithmetic. Here the machine word size is on the order of the maximum number of digits of precision in A, and the total cost is measured by the number of word operations. The need to account for bit-complexity of the numbers naturally led to the notion of condition number.2 The logarithm of the condition number measures the additional number of words needed to store A-1 (and thus A-1b) compared to A. In particular, matrices with poly(n) condition number can be stored with a constant factor overhead in precision, and are numerically stable under standard floating point number representations.

    c

    The hardness results in Kyng et al.10 were based on SLTH1.51.99 under the Real RAM model in part due to the uncertain status of conjugate gradient in different models of computation.

    d

    Rounding up two decimals places in the exponent.

    References

    [1]

    Beckermann, B. and Labahn, G. A uniform approach for the fast computation of matrix-type Padé approximants. SIAM J. on Matrix Analysis and Applications 15, 3 (1994), 804–823.

    Digital Library

    [2]

    Blum, L. Computing over the reals: Where Turing meets Newton. Notices of the AMS 51, 9 (2004), 1024–1034.

    [3]

    Cipra, B.A. The best of the 20th century: Editors name top 10 algorithms. SIAM news 33, 4 (2000), 1–2; https://archive.siam.org/pdf/news/637.pdf.

    [4]

    Demmel, J., Dumitriu, I., Holtz, O., and Kleinberg, R. Fast matrix multiplication is stable. Numerische Mathematik 106, 2 (2007), 199–224.

    [5]

    Eberly, W. et al. Faster inversion and other black box matrix computations using efficient block projections. In Symbolic and Algebraic Computation, Intern. Symp., ISSAC 2007, Waterloo, Ontario, Canada, July 28 - August 1, 2007, Proceedings, 2007, 143–150.

    [6]

    Ghadiri, M., Peng, R., and Vempala, S.S. The bit complexity of efficient continuous optimization. InProceedings of the 64th Symp. on Foundations of Computer Science, 2023; https://arxiv.org/abs/2304.02124.

    [7]

    Gray, R.M. Toeplitz and Circulant Matrices: A Rev. now publishers inc, 2006.

    Digital Library

    [8]

    Koutis, I., L. Miller, G., and Peng, R. A fast solver for a class of linear systems. Communications of the ACM 55, 10 (Oct. 2012), 99–107.

    Digital Library

    [9]

    Kyng, R. Approximate Gaussian Elimination. PhD thesis, Yale University, 2017; http://rasmuskyng.com/rjkyng-dissertation.pdf.

    [10]

    Kyng, R., Wang, D., and Zhang, P. Packing LPs are hard to solve accurately, assuming linear equations are hard. Proceedings of the 2020 ACM-SIAM Symp. on Discrete Algorithms, SODA 2020, Salt Lake City, UT, USA, January 5-8, 2020. S. Chawla (ed). SIAM, 2020, 279–296; https://dl.acm.org/doi/pdf/10.5555/3381089.3381106.

    [11]

    Luh, K. and Vu, V. Sparse random matrices have simple spectrum. arXiv preprint arXiv:1802.03662, 2018.

    [12]

    Musco, C., Musco, C., and Sidford, A. Stability of the Lanczos method for matrix function approximation. In Proceedings of the Twenty-Ninth Annual ACM-SIAM Symp. on Discrete Algorithms, SODA 2018, New Orleans, LA, USA, January 7-10, 2018, 2018, 1605–1624; https://arxiv.org/abs/1708.07788.

    [13]

    Nguyen, H., Tao, T., and Vu, V. Random matrices: Tail bounds for gaps between eigenvalues. Probability Theory and Related Fields 167, 3-4, (2017), 777–816.

    [14]

    Nie, Z. Matrix anti-concentration inequalities with applications. In STOC ’22: 54th Annual ACM SIGACT Symp. on Theory of Computing, Rome, Italy, June 20 - 24, 2022. S. Leonardi and A. Gupta (eds). ACM, (2022), 568–581; https://arxiv.org/abs/2111.05553.

    Digital Library

    [15]

    Rudelson, M. and Vershynin, R. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the Intern. Congress of Mathematicians 2010 (ICM 2010) (In 4 Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II–IV: Invited Lectures. World Scientific, 2010, 1576–1602

    [16]

    Saad, Y. Iterative Methods for Sparse Linear Systems, 2nd edition. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 2003; http://www-users.cs.umn.edu/~saad/toc.pdf.

    Digital Library

    [17]

    Sachdeva, S. and Vishnoi, N.K. Faster algorithms via approximation theory. Theoretical Computer Science 9, 2 (2013), 125–210.

    [18]

    Sankar, A., Spielman, D.A., and Teng, S.-H. Smoothed analysis of the condition numbers and growth factors of matrices. SIAM J. on Matrix Analysis and Applications 28, 2 (2006), 446–476.

    Digital Library

    [19]

    Spielman, D. and Teng, S. Nearly linear time algorithms for preconditioning and solving symmetric, diagonally dominant linear systems. SIAM J. on Matrix Analysis and Applications 35, 3 (2014), 835–885; http://arxiv.org/abs/cs/0607105.

    Digital Library

    [20]

    Storjohann, A. The shifted number system for fast linear algebra on integer matrices. J. of Complexity 21, 4 (2005), 609–650; https://cs.uwaterloo.ca/~astorjoh/shifted.pdf.

    Digital Library

    [21]

    Tao, T. and Vu, V.H. Smooth analysis of the condition number and the least singular value. Math. Comput. 79, 272 (2010), 2333–2352; https://arxiv.org/abs/0805.3167.

    [22]

    Tropp, J.A. An introduction to matrix concentration inequalities. Found. Trends Mach. Learn. 8, 1-2, (2015), 1–230.

    Digital Library

    [23]

    Woodruff, D.P. Sketching as a tool for numerical linear algebra. Foundations and Trends in Theoretical Computer Science 10, 1-2, (2014), 1–157; https://arxiv.org/abs/1411.4357.

    Digital Library

    [24]

    Xi, Y., Xia, J., Cauley, S., and Balakrishnan, V. Superfast and stable structured solvers for toeplitz least squares via randomized sampling. SIAM J. on Matrix Analysis and Applications 35, 1 (2014), 44–72.

    Digital Library

    [25]

    Zhang, P. Hardness and Tractability For Structured Numerical Problems. PhD thesis, Georgia Institute of Technology, 2018.

    Index Terms

    1. Solving Sparse Linear Systems Faster than Matrix Multiplication

      1. Mathematics of computing

        1. Mathematical analysis

          1. Numerical analysis

            1. Computations on matrices

        2. Theory of computation

          1. Design and analysis of algorithms

            1. Approximation algorithms analysis

        Recommendations

        • Solving sparse linear systems faster than matrix multiplication

          SODA '21: Proceedings of the Thirty-Second Annual ACM-SIAM Symposium on Discrete Algorithms

          Can linear systems be solved faster than matrix multiplication? While there has been remarkable progress for the special cases of graph structured linear systems, in the general setting, the bit complexity of solving an n × n linear system Ax = b is Õ(n...

          Read More

        • Solving unsymmetric sparse systems of linear equations with PARDISO

          Special issue: Selected numerical algorithms

          Supernode partitioning for unsymmetric matrices together with complete block diagonal supernode pivoting and asynchronous computation can achieve high gigaflop rates for parallel sparse LU factorization on shared memory parallel computers. The progress ...

          Read More

        • On solving sparse symmetric linear systems whose definiteness is unknown

          Solving a large, sparse, symmetric linear system Ax=b iteratively must use appropriate methods. The conjugate gradient (CG) method can break down if A is indefinite while algorithms such as SYMMLQ and MINRES, though stable for indefinite systems, are ...

          Read More

        Comments

        Information & Contributors

        Information

        Published In

        Solving Sparse Linear Systems Faster than Matrix Multiplication (5)

        Communications of the ACM Volume 67, Issue 7

        July 2024

        82 pages

        ISSN:0001-0782

        EISSN:1557-7317

        DOI:10.1145/3676630

        • Editor:
        • James Larus

          Association for Computing Machinery, New York, NY

        Issue’s Table of Contents

        Copyright © 2024 Owner/Author.

        This work is licensed under a Creative Commons Attribution International 4.0 License.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        Published: 02 July 2024

        Online First: 01 July 2024

        Published inCACMVolume 67, Issue 7

        Check for updates

        Qualifiers

        • Research-article

        Funding Sources

        • NSF

        Contributors

        Solving Sparse Linear Systems Faster than Matrix Multiplication (6)

        Other Metrics

        View Article Metrics

        Bibliometrics & Citations

        Bibliometrics

        Article Metrics

        • Total Citations

        • Total Downloads

        • Downloads (Last 12 months)0
        • Downloads (Last 6 weeks)0

        Other Metrics

        View Author Metrics

        Citations

        View Options

        View options

        PDF

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader

        Digital Edition

        View this article in digital edition.

        Digital Edition

        HTML Format

        View this article in HTML Format.

        HTML Format

        Get Access

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        Get this Article

        Media

        Figures

        Other

        Tables

        Solving Sparse Linear Systems Faster than Matrix Multiplication (2024)

        FAQs

        What is the fastest method for matrix multiplication? ›

        Strassen's algorithm

        field operations. Unlike algorithms with faster asymptotic complexity, Strassen's algorithm is used in practice. The numerical stability is reduced compared to the naive algorithm, but it is faster in cases where n > 100 or so and appears in several libraries, such as BLAS.

        Why are sparse matrices faster? ›

        Unlike operations with full matrices, operations with sparse matrices do not perform unnecessary low-level arithmetic, such as zero-adds ( x+0 is always x ). The resulting efficiencies can lead to dramatic improvements in execution time for programs working with large amounts of sparse data.

        Is matrix multiplication computationally expensive? ›

        The evaluation of the product of two matrices can be very computationally expensive. The multiplication of two n×n matrices, using the “default” algorithm can take O(n3) field operations in the underlying field k.

        What is the time complexity of sparse matrix vector multiplication? ›

        Sparse-Matrix Vector Multiplication (SpMV)

        This problem is essentially a simple multiplication task where the worst case (dense matrix) has a complexity of O(N3).

        What is the fastest multiplication algorithm? ›

        For more than 35 years, the fastest known method for integer multiplication has been the Schönhage-Strassen algorithm running in time O(nlog nlog log n). Under certain restrictive conditions there is a corresponding Ω(nlog n) lower bound.

        What are the advantages of sparse matrices over normal matrices? ›

        One of the main benefits of using sparse matrices is that they can save a lot of memory and storage space, compared to dense matrices that store every entry. However, to take advantage of this benefit, choose an appropriate format to store and manipulate sparse matrices.

        What is the problem with sparse matrices? ›

        Representing a sparse matrix by a 2D array leads to wastage of lots of memory as zeroes in the matrix are of no use in most of the cases. So, instead of storing zeroes with non-zero elements, we only store non-zero elements. This means storing non-zero elements with triples- (Row, Column, value).

        When should you use sparse matrices? ›

        The concept of sparsity is useful in combinatorics and application areas such as network theory and numerical analysis, which typically have a low density of significant data or connections. Large sparse matrices often appear in scientific or engineering applications when solving partial differential equations.

        Is matrix multiplication faster on CPU or GPU? ›

        GPU computation is faster than CPU only in some typical scenarios. In other cases, computation in GPU can be slower than in CPU! CUDA is vastly used in Machine Learning and Deep Learning because of its particular goodness at parallel matrix multiplication and addition.

        Is matrix multiplication faster than for loop? ›

        Matrix operations may not be superior if the matrices are sparse. In that case a loop that avoids unnecessary operations can be much faster than a dense matrix operation that wastes a lot of time on multiplying and adding zeros.

        What is the most expensive math operation? ›

        Division is the most expensive arithmetic operation.

        Are sparse matrices faster? ›

        1 Answer. You'll easily see that the more bits are set in the right operand, the more computations are necessary to sum up the left operand. So, provided that your matrix multiplication boils down to machine code multiplications like this, a sparse matrix will multiply significantly faster than a dense matrix.

        What is the cost of sparse matrix multiplication? ›

        The cost of sparse matrix-vector multiplication scales linearly with the number of nonzero entries, as each entry is multiplied once by some entry in the vector.

        What is computational complexity of sparse matrices? ›

        The computational complexity of sparse matrix multiplication on AP is shown to be an O(nnz) where nnz is the number of nonzero elements. The AP is found to be especially efficient in binary sparse matrix multiplication. AP outperforms conventional solutions in power efficiency.

        What is the technique for fast multiplication? ›

        One of the best and easy multiplication tricks for large numbers is to find the tens of one of the numbers, and multiply with that quickly. Adding the remaining leftovers will be easier to calculate fully. E.g., 22 X 83 can be rewritten as (20 X 83) + (2 X 83) which gives us 1660 + 166 = 1826.

        Why is Matlab so fast in matrix multiplication? ›

        Additionally, because the program is run on many data elements and has a high arithmetic intensity, calculations can be used to hide memory access latency in place of big data caches. that's why MATLAB is so fast in matrix Multiplication.

        References

        Top Articles
        Latest Posts
        Article information

        Author: Dong Thiel

        Last Updated:

        Views: 5903

        Rating: 4.9 / 5 (59 voted)

        Reviews: 82% of readers found this page helpful

        Author information

        Name: Dong Thiel

        Birthday: 2001-07-14

        Address: 2865 Kasha Unions, West Corrinne, AK 05708-1071

        Phone: +3512198379449

        Job: Design Planner

        Hobby: Graffiti, Foreign language learning, Gambling, Metalworking, Rowing, Sculling, Sewing

        Introduction: My name is Dong Thiel, I am a brainy, happy, tasty, lively, splendid, talented, cooperative person who loves writing and wants to share my knowledge and understanding with you.