X Solve - Affine Layer

Can the Internet Fix GPU Performance Issues?

Written by Christopher Hesse — May 25th, 2026

"GPU library performance can be very notchy -- runtime of batched torch.linalg.solve_ex() went up by over 10x going from 511x511 matrices to 512x512."

— John Carmack — Apr 29, 2026

Legendary programmer John Carmack, now an AI researcher, posted about a PyTorch performance issue on X concerning torch.linalg.solve_ex. His main point is that GPU performance often falls off a cliff for unclear reasons when using GPU libraries (although I suspect the same is true when writing GPU kernels).

The reason this particular post is notable is that you would generally expect a power-of-two size to be roughly as fast as the next smaller size, but 512x512 is far slower than 511x511.

There are at least two interesting questions to try to answer here:

What was the cause of this particular performance issue?
How informative were the replies on X?

Cause

torch.linalg.solve_ex, added in PR 80073, is an "extended" or perhaps "expert" version of torch.linalg.solve that disables error checking by default to avoid synchronizing the device (which is generally bad for GPU performance).

torch.linalg.solve_ex(A, B) solves the linear system AX = B, where A is a square matrix (or batch of square matrices), and assumes that A is invertible. Specifically it computes X = A.inverse() @ B but supposedly in some cooler way.
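The "cooler way" is an LU factorization followed by triangular solves, which matches the call stack below ending in an LU factor routine. Here is a minimal pure-Python sketch of that idea for a single small matrix. It is not PyTorch's implementation; the helper names lu_factor, lu_solve, and solve are made up for illustration.

```python
def lu_factor(a):
    """Factor a square matrix as P A = L U with partial pivoting.

    Returns (lu, piv): lu stores U on and above the diagonal and the
    multipliers of unit-diagonal L below it; piv[i] is the original
    row index that ended up in row i.
    """
    n = len(a)
    lu = [row[:] for row in a]
    piv = list(range(n))
    for k in range(n):
        # Pick the row with the largest magnitude entry in column k.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0:
            raise ValueError("matrix is singular")
        if p != k:
            lu[k], lu[p] = lu[p], lu[k]
            piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            for j in range(k + 1, n):
                lu[i][j] -= lu[i][k] * lu[k][j]
    return lu, piv


def lu_solve(lu, piv, b):
    """Solve A x = b for one right-hand-side column b."""
    n = len(lu)
    x = [b[piv[i]] for i in range(n)]  # apply the row permutation
    for i in range(n):  # forward substitution with unit-diagonal L
        for j in range(i):
            x[i] -= lu[i][j] * x[j]
    for i in reversed(range(n)):  # back substitution with U
        for j in range(i + 1, n):
            x[i] -= lu[i][j] * x[j]
        x[i] /= lu[i][i]
    return x


def solve(a, b):
    """Solve A X = B, where B is a list of rows (n x k)."""
    lu, piv = lu_factor(a)  # factor once, reuse for every column of B
    n, k = len(b), len(b[0])
    cols = [lu_solve(lu, piv, [b[i][c] for i in range(n)]) for c in range(k)]
    return [[cols[c][i] for c in range(k)] for i in range(n)]


if __name__ == "__main__":
    print(solve([[2.0, 1.0], [1.0, 3.0]], [[3.0], [5.0]]))
```

The factorization is the expensive part, which is why the rest of this section focuses on which library routine performs it.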

The call stack for actually invoking the GPU library function looks something like this:

BatchLinearAlgebra.cpp:linalg_solve_ex forwards to _linalg_solve_ex
native_functions.yaml:_linalg_solve_ex is the operator declaration read by PyTorch's code generation, and the generated code ends up calling _linalg_solve_ex_out
BatchLinearAlgebra.cpp:_linalg_solve_ex_out calls linalg_lu_factor_ex_out
BatchLinearAlgebra.cpp:linalg_lu_factor_ex_out calls lu_factor_stub
CUDA BatchLinearAlgebra.cpp forwards that to lu_factor
CUDA BatchLinearAlgebra.cpp:lu_factor selects which implementation to use, in this case choosing among MAGMA, cuSOLVER, and cuBLAS

If we have cuSOLVER as the preferred linear algebra library, we have two main paths:

If batch_size == 1 || m >= 512 we call lu_factor_looped_cusolver where m is the size of our square matrix A:

BatchLinearAlgebraLib.cpp:lu_factor_looped_cusolver calls getrf ("GEneral TRiangular Factorization") in a loop
CUDASolver.cpp:getrf<float> calls cusolverDnSgetrf

Otherwise we call lu_factor_batched_cublas:

BatchLinearAlgebraLibBlas.cpp:lu_factor_batched_cublas calls apply_lu_factor_batched_cublas
BatchLinearAlgebraLibBlas.cpp:apply_lu_factor_batched_cublas calls getrfBatched
CUDABlas.cpp:getrfBatched<float> calls cublasSgetrfBatched

Note the special behavior around a size of 512, where we switch from a single batched call to a looped call. This is likely the source of the performance cliff that Carmack ran into.
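The branch described above can be modeled in a few lines of Python. This is only a sketch of the condition as quoted in this post, not the actual C++ source; the function choose_lu_path is hypothetical and simply returns the name of the path that would be taken.

```python
def choose_lu_path(batch_size: int, m: int) -> str:
    """Model of the cuSOLVER-preferred dispatch described in the text.

    m is the side length of the square matrix A. The 512 threshold and
    the two path names come from the post; everything else is illustrative.
    """
    if batch_size == 1 or m >= 512:
        return "lu_factor_looped_cusolver"  # one cusolverDnSgetrf call per matrix
    return "lu_factor_batched_cublas"  # a single cublasSgetrfBatched call


if __name__ == "__main__":
    for m in (510, 511, 512, 513):
        print(m, choose_lu_path(batch_size=64, m=m))
```

Note that for any batch_size above 1 the decision depends only on m, so a batch of a thousand 512x512 matrices takes the same looped path as a batch of two.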

When we swap from a single call of cublasSgetrfBatched at (batch_size, 511, 511) to calling cusolverDnSgetrf in a loop at (batch_size, 512, 512) we introduce a performance discontinuity. Presumably this is done because looped cusolverDnSgetrf was measured to be faster than cublasSgetrfBatched on some particular GPU in the past. If our threshold is chosen well, we would expect no performance regression from doing this, but since this is a rough heuristic that ignores batch_size and GPU hardware it can only do so much.

I was able to reproduce a performance gap (with MAGMA disabled, because it crashed whenever I used it), but unfortunately not the cudaMalloc issue that Carmack reported. PyTorch has a caching allocator that keeps freed GPU memory around, so code running in a loop can reuse it instead of calling cudaMalloc, which is slow. The looped version does allocate for each batch element, but those allocations should in theory hit the caching allocator and not slow down the program substantially.
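To make the caching argument concrete, here is a toy model of a caching allocator. It is a deliberately simplified sketch and not how PyTorch's allocator is implemented: the class ToyCachingAllocator, its slow_allocs counter (a stand-in for cudaMalloc calls), and the looped_workload function are all invented for illustration.

```python
from collections import defaultdict


class ToyCachingAllocator:
    """Keeps freed blocks in per-size free lists and reuses them."""

    def __init__(self):
        self.free_blocks = defaultdict(list)  # size -> cached block ids
        self.slow_allocs = 0  # stand-in for the number of cudaMalloc calls
        self._next_id = 0

    def malloc(self, size):
        if self.free_blocks[size]:
            # Cache hit: reuse a previously freed block, no slow call.
            return (self.free_blocks[size].pop(), size)
        # Cache miss: this is where a real allocator would hit the driver.
        self.slow_allocs += 1
        self._next_id += 1
        return (self._next_id, size)

    def free(self, block):
        block_id, size = block
        self.free_blocks[size].append(block_id)  # keep it for later reuse


def looped_workload(allocator, batch_size, workspace_bytes):
    """Allocate and free one workspace per batch element, like a looped call."""
    for _ in range(batch_size):
        workspace = allocator.malloc(workspace_bytes)
        allocator.free(workspace)


if __name__ == "__main__":
    allocator = ToyCachingAllocator()
    looped_workload(allocator, batch_size=100, workspace_bytes=4096)
    print("slow allocations:", allocator.slow_allocs)
```

In this model only the first iteration pays for a slow allocation, which is why per-element allocations alone would not be expected to explain a large slowdown.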

Looking at torch.profiler recordings of the different sizes, we can see the single batched call for 511x511 takes 21 ms, while the looped version for 512x512 takes 69 ms (note that the time axis is not the same scale for these two):

In the absence of a good heuristic, making this user-controllable would be nice, but the user cannot currently choose directly between cuSOLVER and cuBLAS.

Since cuBLAS and cuSOLVER are from the same company and do kind of the same thing for this function call, it's unclear why the library doesn't figure out internally which approach to use and avoid exposing this knob to the user (in this case, PyTorch).

Comments

Although we don't have a repro of the original issue and thus cannot know for sure, let's see how accurate the comments would be if the above was the root cause. I've tried to classify the comments into one of three categories:

correct
partially correct - in the right direction, but not specific enough, e.g. "sweep everything"
incorrect/irrelevant - provides no information related to this particular issue or makes a claim/suggestion that is wrong in this case (like a true statement that doesn't matter here) or wrong in general

Here's the (noisy) count I got from reading through the replies to the original post as well as the clarification post:

correct: 2
partially correct: 4
incorrect/irrelevant: 48

So out of 54 posts that were counted, about 11% are correct or partially correct and about 4% point out a likely root cause. Both of the "correct" posts come from the same person, Ivan Yashchuk, who reviewed the original PR that added this function 4 years ago.
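The percentages follow directly from the counts listed above; a short check, using only those three numbers:

```python
# Counts copied from the tally in this post.
counts = {"correct": 2, "partially correct": 4, "incorrect/irrelevant": 48}

total = sum(counts.values())
useful = counts["correct"] + counts["partially correct"]

share_useful = 100 * useful / total  # correct or partially correct
share_correct = 100 * counts["correct"] / total  # correct only

if __name__ == "__main__":
    print(f"total posts counted: {total}")
    print(f"correct or partially correct: {share_useful:.1f}%")
    print(f"correct: {share_correct:.1f}%")
```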

Ivan filed 3 issues against PyTorch:

This issue has likely existed for a while, but due to Carmack's post on X and its visibility, it seems much more likely to get fixed now. The general problem of GPU performance being notchy, however, remains.
