
Can the Internet Fix GPU Performance Issues?
"GPU library performance can be very notchy -- runtime of batched torch.linalg.solve_ex() went up by over 10x going from 511x511 matrices to 512x512."
Legendary programmer John Carmack, now an AI researcher, posted on X about a PyTorch performance issue concerning torch.linalg.solve_ex. His main point is that GPU performance often falls off a cliff for unclear reasons when using GPU libraries (although I suspect the same is true when writing GPU kernels).
This particular post is notable because you would generally expect a power-of-two size to be roughly as fast as the next smaller size, but here 512x512 is far slower than 511x511.
There are at least two interesting questions to answer here:
- What was the cause of this particular performance issue?
- How informative were the replies on X?
Cause
torch.linalg.solve_ex(A, B) solves the linear system AX = B, where A is a square matrix (or batch of square matrices), and assumes that A is invertible. Conceptually it computes X = A.inverse() @ B, but via an LU factorization instead of an explicit inverse.
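To make the LU route concrete: rather than forming the inverse, a solver can factor A with partial pivoting (the job of the getrf routines that appear in the call path) and then do triangular solves. Below is a minimal pure-Python sketch of that idea for one small matrix and one right-hand-side vector. It is not PyTorch's implementation, and `lu_solve_sketch` is a hypothetical name used only for illustration.

```python
def lu_solve_sketch(a, b):
    """Solve a @ x = b for one square matrix `a` (list of rows) and vector `b`.

    Illustrative only: Gaussian elimination with partial pivoting.
    """
    n = len(a)
    # Work on copies so the caller's data is left untouched.
    lu = [row[:] for row in a]
    x = list(b)
    for k in range(n):
        # Partial pivoting: pick the row with the largest entry in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular")
        lu[k], lu[p] = lu[p], lu[k]
        x[k], x[p] = x[p], x[k]
        for i in range(k + 1, n):
            # Store the multiplier (an entry of L) in place of the eliminated value.
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
            # Apply the same elimination step to the right-hand side.
            x[i] -= lu[i][k] * x[k]
    # Back substitution with the upper-triangular factor U.
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


if __name__ == "__main__":
    # 2x + y = 3 and x + 3y = 5
    print(lu_solve_sketch([[2.0, 1.0], [1.0, 3.0]], [3.0, 5.0]))
```

The GPU libraries do the same factorization with far more care about blocking and parallelism, which is where the choice between a batched call and a loop of single-matrix calls comes in.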
The call stack for actually invoking the GPU library function looks something like this:
If we have cuSOLVER as the preferred linear algebra library, we have two main paths:
- If batch_size == 1 || m >= 512, we call lu_factor_looped_cusolver, where m is the size of our square matrix A.
- Otherwise we call lu_factor_batched_cublas.
Note the special behavior around a size of 512, where we switch from a single batched call to a looped call. This is likely the source of the performance cliff that Carmack ran into.
When we swap from a single call of cublasSgetrfBatched at (batch_size, 511, 511) to calling cusolverDnSgetrf in a loop at (batch_size, 512, 512) we introduce a performance discontinuity. Presumably this is done because looped cusolverDnSgetrf was measured to be faster than cublasSgetrfBatched on some particular GPU in the past. If our threshold is chosen well, we would expect no performance regression from doing this, but since this is a rough heuristic that ignores batch_size and GPU hardware it can only do so much.
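The shape of that heuristic is easy to mimic in a few lines. The sketch below is only a model of the condition quoted above, not PyTorch's code: `choose_lu_path` and `factorization_calls` are hypothetical helpers, and the returned strings simply reuse the function names mentioned earlier.

```python
LOOPED_THRESHOLD = 512  # matrix size at which the looped path takes over


def choose_lu_path(batch_size: int, m: int) -> str:
    """Name of the LU routine the size-based heuristic would pick (illustrative)."""
    if batch_size == 1 or m >= LOOPED_THRESHOLD:
        # One cusolverDnSgetrf call per batch element.
        return "lu_factor_looped_cusolver"
    # A single cublasSgetrfBatched call covering the whole batch.
    return "lu_factor_batched_cublas"


def factorization_calls(batch_size: int, m: int) -> int:
    """How many library calls the chosen path issues for one batch."""
    if choose_lu_path(batch_size, m) == "lu_factor_looped_cusolver":
        return batch_size
    return 1


if __name__ == "__main__":
    for m in (511, 512):
        print(m, choose_lu_path(64, m), factorization_calls(64, m))
```

Note that batch_size only matters when it equals 1. For any larger batch the decision depends on m alone, so a batch of 64 and a batch of 64,000 cross over at exactly the same matrix size.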
I was able to reproduce a performance gap (with MAGMA disabled, because it only crashed when I used it), but unfortunately not the cudaMalloc issue that Carmack reported. When running code in a loop, PyTorch has a caching allocator that will keep GPU memory around to avoid calling cudaMalloc since calling it will slow down your program. The looped version does do allocations for each batch element, but those should in theory hit the caching allocator and not slow down the program substantially.
Looking at torch.profiler recordings of the different sizes, we can see the single batched call for 511x511 takes 21 ms, while the looped version for 512x512 takes 69 ms (note that the time axis is not the same scale for these two):
In the absence of a good heuristic, making this user-controllable would be nice, but the user cannot currently choose directly between cuSOLVER and cuBLAS.
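If you want to check whether the same cliff shows up on your own GPU, a rough timing sketch along these lines should be enough. The batch size of 64, the iteration count, and the helper name `time_solve_ex` are my assumptions, not values from Carmack's post or from the measurements above. The only backend knob used is torch.backends.cuda.preferred_linalg_library, which selects cuSOLVER over MAGMA but still leaves the batched-versus-looped decision to the heuristic.

```python
import time

try:
    import torch
except ImportError:  # lets the file load on machines without PyTorch
    torch = None


def time_solve_ex(m, batch_size=64, iters=5, device=None):
    """Average seconds per batched torch.linalg.solve_ex call on m x m matrices."""
    if torch is None:
        raise RuntimeError("PyTorch is required for this benchmark")
    if device is None:
        device = "cuda" if torch.cuda.is_available() else "cpu"
    on_gpu = str(device).startswith("cuda")
    # Adding a scaled identity keeps the random matrices well away from singular.
    a = torch.randn(batch_size, m, m, device=device) + m * torch.eye(m, device=device)
    b = torch.randn(batch_size, m, 1, device=device)
    # Warm-up call: library initialization and first-time allocations.
    torch.linalg.solve_ex(a, b)
    if on_gpu:
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        torch.linalg.solve_ex(a, b)
    if on_gpu:
        # GPU work is asynchronous, so wait for it before stopping the clock.
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters


if __name__ == "__main__" and torch is not None:
    if torch.cuda.is_available():
        # Prefer cuSOLVER/cuBLAS over MAGMA for linear algebra calls.
        torch.backends.cuda.preferred_linalg_library("cusolver")
    for m in (511, 512):
        print(f"{m}x{m}: {time_solve_ex(m) * 1e3:.1f} ms per call")
```

Wall-clock timing like this is cruder than a torch.profiler trace, but it is enough to see whether 512x512 is disproportionately slower than 511x511 on a given machine.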
Comments
Although we don't have a repro of the original issue, let's assume that this is the root cause, and see how accurate the comments on the post are. I've tried to classify each comment into one of four categories:
- correct
- partially correct - in the right direction, but not specific enough
- incorrect - makes a claim or suggestion that is wrong in this case (like a true statement that doesn't matter here) or just wrong in general
- irrelevant - doesn't provide information related to resolving this performance issue
Here's the (noisy) count I got from reading through the replies to the original post as well as the clarification post:
- correct: 3
- partially correct: 3
- incorrect: 16
- irrelevant: 32
So out of 22 posts that are not irrelevant, 27% are correct or partially correct and 14% point out a likely root cause. Two of the correct posts come from the same person, Ivan Yashchuk, who reviewed the original PR that added this function 4 years ago.
Ivan filed 3 issues against PyTorch:
This issue has likely existed for a while, but thanks to the visibility of Carmack's post on X, it seems much more likely to get fixed now. The general problem of GPU performance being notchy, however, remains.