These are some nice quotes from “The Evolution of Mathematical Software”, the Turing Lecture article by 2021 Turing Award winner Jack J. Dongarra, which discusses algorithm and software development in the context of ever-improving hardware.
a large infrastructure of mathematical libraries […] that must be maintained, ported, and enhanced for many years to come if the value of the application codes that depend on it is to be preserved and extended. The software that encapsulates all this time, energy, and thought routinely outlasts the hardware it was originally designed to run on.
architectures are becoming progressively more complex […]. This leaves existing software unable to make efficient use of the increased processing power.
Since Dennard scaling ended around 2006 due to physical limits, the push has been toward multi-core architectures. Instead of getting improved performance for “free” through hardware improvements, software had to be adapted to parallel, multi-threaded architectures.
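As a toy illustration of that adaptation (my own example, not from the lecture): a serial reduction can be restructured into chunks that run on multiple cores. NumPy releases the GIL inside `np.sum`, so plain threads are enough here; the name `parallel_sum` and the chunk count are assumptions of this sketch.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def parallel_sum(x, workers=4):
    """Sum a large array by splitting it into chunks and reducing
    each chunk on its own thread. np.sum releases the GIL, so the
    chunks can actually execute on separate cores."""
    chunks = np.array_split(x, workers)          # uneven sizes are fine
    with ThreadPoolExecutor(workers) as ex:
        partials = ex.map(np.sum, chunks)        # one np.sum per chunk
        return sum(partials)                     # combine partial sums
```

The point is the structural change: the programmer, not the hardware, now has to expose the parallelism.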
The compute speed, memory bandwidth, and memory latency increase at different exponential rates, leading to an increasing gap between data movement speeds and computation speeds. […] Hiding communication costs is thus becoming increasingly more difficult. Instead of just relying on hardware caches, new algorithms must be designed to minimize and hide communication, sometimes at the expense of duplicating memory and computation.
recently – as the compute-speed-to-bandwidth ratio increases – algorithms have again been reformulated as communication avoiding.
By organizing the [matrix-multiplication] computation into blocks, we provide for full reuse of data while each block is held in cache or local memory, avoiding excessive movement of data, and giving a surface-to-volume effect for the ratio of data movement to arithmetic operations, that is, \(O(n^2)\) data movement to \(O(n^3)\) arithmetic operations.
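The blocking scheme described in this quote can be sketched in a few lines of NumPy (a minimal sketch; the block size `b` is a tunable assumption, ideally chosen so that three b×b tiles fit in cache):

```python
import numpy as np

def blocked_matmul(A, B, b=64):
    """Multiply two square matrices tile by tile. Each b-x-b tile is
    reused for b multiply-adds while it sits in cache, giving the
    surface-to-volume effect: O(n^2) data movement for O(n^3) flops."""
    n = A.shape[0]                     # assumes square n-x-n inputs
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                # slices clamp at the edges, so n need not divide by b
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C
```

The flop count is identical to the naive triple loop; only the order of the memory accesses changes.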
the bandwidth/compute imbalance will become even more pronounced. In such an environment, we must abandon the notion that knowing the number of operations for an algorithm is a good indicator of its ultimate performance. […] we must focus on data movement, synchronization points, and understanding of the nature of the interaction between threads and processes
we must examine the amount of data the algorithm accesses and choose one that can minimize accesses – we call this approach “communication avoiding”.
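A small illustration of counting accesses rather than operations (again my own example): two NumPy versions of the same elementwise computation perform identical flops, but one streams through memory with fresh temporaries at every step while the other reuses a single buffer.

```python
import numpy as np

def three_temporaries(x):
    """Each line allocates a new n-element array and sweeps memory
    again: same flops, roughly triple the memory traffic."""
    a = np.sin(x)
    b = a * 2.0
    return b + 1.0

def one_buffer(x):
    """Same arithmetic, but every step writes into one reused output
    buffer (x must be a float array for the shared dtype to match)."""
    out = np.empty_like(x)
    np.sin(x, out=out)
    np.multiply(out, 2.0, out=out)
    np.add(out, 1.0, out=out)
    return out
```

An operation-count analysis cannot distinguish these two functions; an access-count analysis can.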
Opting for higher complexity algorithms may be preferable if the operations better fit the hardware and transfer less data across the modern memory hierarchy and on-node interconnects.