
"the reference code is in the default FP32, and given a tolerance threshold (1e-02)"

That's a huge tolerance, and it allows them to replace the "fp32" kernel with fp16 operations.



This means the results are useless. Did they even check the relative error at all?

Replacing float32 operations with float16 is also pointless. There is nothing to be gained by doing this, as it throws away the accuracy advantage of float32, which would be the single most important reason to use that version of the algorithm.
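
To put rough numbers on it: fp16 round-off is on the order of 1e-3 relative, fp32 round-off on the order of 1e-7, so a 1e-2 threshold waves both through and tells you nothing about whether fp32 accuracy was actually preserved. A quick sketch of what I mean, assuming the harness does something like torch.allclose with atol = rtol = 1e-2 (I have not checked the actual benchmark code):

    # Sketch, assuming the benchmark's check resembles torch.allclose with
    # atol = rtol = 1e-2. Sizes are arbitrary; the point is that a 1e-2
    # check cannot tell fp32 and fp16 arithmetic apart.
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    torch.manual_seed(0)
    n = 1024
    a = torch.rand(n, n, dtype=torch.float64, device=device)
    b = torch.rand(n, n, dtype=torch.float64, device=device)
    ref = a @ b  # float64 reference

    candidates = {
        "fp32": (a.float() @ b.float()).double(),
        "fp16": (a.half() @ b.half()).double(),  # needs a GPU or a recent PyTorch on CPU
    }
    for name, out in candidates.items():
        max_abs = (out - ref).abs().max().item()
        ok = torch.allclose(out, ref, atol=1e-2, rtol=1e-2)
        print(f"{name}: max abs error {max_abs:.2e}, passes the 1e-2 check: {ok}")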


I ran their matrix multiplication code from GitHub (https://github.com/ScalingIntelligence/good-kernels/blob/mai...) and got a mean squared error of approximately 0.056 for two 4096x4096 matrices containing random values between 0 and 1.

I think this error is large enough that referring to it as FP32 is misleading.
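
A measurement like that can be reproduced along these lines (a sketch; custom_matmul is a stand-in for however you end up binding the adapted kernel, it is not a name from the repo):

    # Sketch of the error measurement described above. `custom_matmul` is a
    # hypothetical Python binding to the adapted kernel; the repo does not
    # ship one under that name.
    import torch

    torch.manual_seed(0)
    n = 4096
    a = torch.rand(n, n, device="cuda")   # values in [0, 1)
    b = torch.rand(n, n, device="cuda")

    ref = a @ b                            # PyTorch fp32 matmul as reference
    out = custom_matmul(a, b)              # hypothetical binding to the kernel

    mse = ((out.float() - ref) ** 2).mean().item()
    print(f"MSE vs torch.matmul: {mse:.3e}")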

Also, the performance gains do not translate to my RTX 3060M GPU (3.8 GFLOPS vs PyTorch's 5.3), presumably because it lacks the optimized hardware for half precision.

But on the plus side, the single file was very easy to adapt and the code is quite readable. I have seen much uglier kernels.


That's an Ampere chip IIRC, so it should have tensor cores.


Yeah, it seems likely the underlying task here (one reasoning step away) was: replace as many fp32 operations as possible in this kernel with fp16. I'm not sure exactly how challenging a port like that is, but intuitively it seems a bit less impressive.

Maybe this intuition is wrong, but it would be great for the work to address it explicitly if so!


It only seems to have done that in a couple of places, like the MatMul. The softmax kernel (https://github.com/ScalingIntelligence/good-kernels/blob/mai...) seems to be entirely bog-standard, and the layernorm kernels are only slightly more interesting.


I looked at the softmax kernel and the cast that it does from a float* to a float4* is extremely brittle -- it's trivial to break by offsetting the input slightly.

A kernel in a standard library very likely could not employ a trick like this, which relies on the alignment of the input pointers. Certainly not without a fallback.
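
You can see the hazard from the calling side: a view offset by a single element shares its parent's storage but is no longer 16-byte aligned, so a kernel that blindly reinterprets float* as float4* would issue misaligned vector loads on it. A rough sketch of that from PyTorch:

    # Sketch: why a float* -> float4* reinterpretation (16-byte vector loads)
    # needs an alignment check. A sliced tensor shares its parent's storage,
    # so its data pointer can sit 4 bytes past a 16-byte boundary.
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    x = torch.rand(4097, device=device)

    aligned = x[:-1]   # starts at the original allocation, 16-byte aligned
    shifted = x[1:]    # same storage, offset by one float (4 bytes)

    for name, t in [("aligned", aligned), ("shifted", shifted)]:
        print(f"{name}: data_ptr % 16 = {t.data_ptr() % 16}")

A robust kernel would check the pointer alignment on the host side and fall back to a scalar-load path when the check fails.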


Why do you think it is a huge tolerance? (Just curious, since it is not clear to me whether it leads to too much of a reduction in numerical accuracy relative to the speedup.)


The point is that this amount of error is huge for fp32, but may be expected for fp16. So why compare against fp32 performance baselines? An algorithm that gives you the accuracy of fp16 should be compared to an fp16 baseline, against which this may not be a speedup at all; it is probably much slower.
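
If you want that number, PyTorch's own half-precision matmul is the fairer bar. A quick sketch of how to get it (assuming a CUDA GPU; sizes arbitrary):

    # Sketch: timing an fp16 baseline to compare against, instead of the
    # fp32 one. torch.utils.benchmark.Timer handles CUDA synchronization.
    import torch
    import torch.utils.benchmark as benchmark

    n = 4096
    a32 = torch.rand(n, n, device="cuda")
    b32 = torch.rand(n, n, device="cuda")
    a16, b16 = a32.half(), b32.half()

    for label, a, b in [("fp32 baseline", a32, b32), ("fp16 baseline", a16, b16)]:
        t = benchmark.Timer(stmt="a @ b", globals={"a": a, "b": b})
        print(label, t.timeit(100))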


My original question was to understand why it is considered a huge tolerance and what should be considered low tolerance. I suspect the paper's intention is not to compare apples and oranges. They are trying to optimize the fp32 baseline by sometimes resorting to fp16, as long as the resulting solution's numerical accuracy stays within the tolerance level. They are going for the "low-hanging fruit" type of optimization.



