Fun fact: if you say the right prayers to the Myelin Gods it will fuse straight through sage3 at D/DQ like it's seen it before, which of course it has.
That's TensorRT-LLM in its entirety at 1.2.0rc6, locked to run on Ubuntu or NixOS with full MPI and `nvshmem`, the DGX container Jensen's Desk edition (I know because I also rip apart and `autopatchelf` NGC containers for repackaging on Grace/SBSA).
It's... arduous. And the benefit is what, exactly? A very mixed collection of maintainers have asserted that software behavior is monotonic along a single axis, most of which they can't see, and we ran a solver over those guesses?
I think the future is collections of wheels that have been through a process the consumer regards as credible.
NVFP4 is the thing no one saw coming. I wasn't watching the MX process really, so I cast no judgements, but it's exactly what it sounds like: a serious compromise for resource-constrained settings. And it's in the silicon pipeline.
NVFP4 is, to put it mildly, a masterpiece: the UTF-8 of its domain, and in strikingly similar ways it is (1) general, (2) robust to gross misuse, and (3) not optional if success and cost both matter.
It's not a gap that can be closed by a process node or an architecture tweak: it's an order of magnitude where the polynomials that were killing you on the way up are now working for you.
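For the curious, the format itself is simple enough to fake in a few lines: E2M1 (4-bit) elements in blocks of 16, each block carrying an FP8 (E4M3) scale, with a per-tensor FP32 scale on top (MXFP4 uses blocks of 32 with power-of-two E8M0 scales, which is a big part of the accuracy argument people have below). A rough numerical sketch of the rounding only, ignoring the bit packing and the FP8 encoding of the scales:

```python
import numpy as np

# Representable magnitudes of the E2M1 (FP4) element format.
E2M1_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def nvfp4_fake_quant(x, block=16):
    """Numerically simulate NVFP4-style block-scaled rounding.

    Ignores bit packing, the FP8 (E4M3) encoding of the block scales,
    and the per-tensor FP32 scale -- this is the rounding, not the wire format.
    """
    xb = x.reshape(-1, block)
    # One scale per 16-element block, chosen so the block max lands on 6.0 (E2M1 max).
    scale = np.abs(xb).max(axis=1, keepdims=True) / E2M1_GRID[-1]
    scale = np.where(scale == 0, 1.0, scale)
    # Snap each scaled magnitude to the nearest representable E2M1 value.
    mag = np.abs(xb) / scale
    idx = np.abs(mag[..., None] - E2M1_GRID).argmin(axis=-1)
    return (np.sign(xb) * E2M1_GRID[idx] * scale).reshape(x.shape)

x = np.random.randn(4, 64).astype(np.float32)
print(f"mean abs rounding error: {np.abs(x - nvfp4_fake_quant(x)).mean():.4f}")
```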
sm_120 (what NVIDIA's quiet repos call CTA1) consumer gear does softmax attention and projection/MLP blockscaled GEMM at a bit over a petaflop at 300W and close to two (dense) at 600W.
This changes the whole game, and it's not clear anyone outside the lab even knows the new equilibrium points. It's nothing like Flash3 on Hopper: lotta stuff looks FLOPs bound, GDDR7 looks like a better deal than HBM3e. The DGX Spark is in no way deficient, it has ample memory bandwidth.
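To make the "FLOPs bound" claim concrete, here's the back-of-envelope roofline. The petaflop figure is from above; the GDDR7 bandwidth is my assumption for a top consumer sm_120 card, so treat the exact numbers as illustrative:

```python
# Back-of-envelope roofline for the "lotta stuff looks FLOPs bound" claim.
peak_flops = 1.0e15           # FLOP/s, dense block-scaled FP4 at ~300 W (from above)
mem_bw     = 1.8e12           # bytes/s, assumed GDDR7 bandwidth (my guess)

ridge = peak_flops / mem_bw   # arithmetic intensity needed to be compute-bound
print(f"ridge point: {ridge:.0f} FLOPs/byte")   # ~556

def gemm_intensity(n):
    """FLOPs per byte for an n x n x n GEMM with FP4 inputs and FP16 output,
    assuming each operand touches DRAM once (ideal reuse)."""
    flops = 2 * n**3
    bytes_moved = 2 * n**2 * 0.5 + n**2 * 2   # A, B at 0.5 B/elem; C at 2 B/elem
    return flops / bytes_moved

for n in (1024, 2048, 4096):
    print(n, f"{gemm_intensity(n):.0f} FLOPs/byte")   # already past the ridge at ~1k
```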
This has been in the pipe for something like five years and even if everyone else started at the beginning of the year when this was knowable, it would still be 12-18 months until tape out. And they haven't started.
Years Until Anyone Can Compete With NVIDIA is back up to the 2-5 it was 2-5 years ago.
This was supposed to be the year ROCm and the new Intel stuff became viable.
This reads like a badly done, sponsored hype video on YouTube.
So if we look at what NVIDIA has to say about NVFP4, it sure sounds impressive [1]. But look closely: that initial graph never compares FP8 and FP4 on the same hardware. They jump from H100 to B200 while implying a 5x gain from going with FP4, which it isn't. Accompanied by scary words like, if you use MXFP4, "Risk of noticeable accuracy drop compared to FP8".
Contrast that with what AMD has to say about the open MXFP4 approach, which is quite similar to NVFP4 [2]. Ohh, the horrors of getting 79.6 instead of 79.9 on GPQA Diamond when using MXFP4 instead of FP8.
Looking into NVFP4/NVIDIA vs MXFP4/AMD, the summary was that they seem to be pretty close once you include the MI355X, which leads in VRAM and throughput but trails slightly in accuracy -- and mixing in MXFP6 makes up for that.
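You can see the sleight of hand with a napkin decomposition. The spec numbers below are recalled from datasheets (dense, no sparsity), so treat them as assumptions, but the structure of the argument holds: most of the headline gain is the new silicon, and FP8 to FP4 on the same part is about 2x, not 5x.

```python
# Rough spec-sheet numbers (dense, no sparsity) -- assumptions recalled from
# datasheets, not measurements:
h100_fp8 = 0.99e15   # FLOP/s, H100 SXM FP8
b200_fp8 = 4.5e15    # FLOP/s, B200 FP8
b200_fp4 = 9.0e15    # FLOP/s, B200 FP4 (block-scaled)

headline = b200_fp4 / h100_fp8      # what a cross-hardware chart implies
hardware = b200_fp8 / h100_fp8      # generation jump at the SAME precision
fmt      = b200_fp4 / b200_fp8      # FP8 -> FP4 on the SAME hardware

print(f"headline {headline:.1f}x = hardware {hardware:.1f}x * format {fmt:.1f}x")
# -> roughly 9x overall = ~4.5x silicon * 2x format
```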
I'm a bit later in my career and I've been involved with modern machine learning for a long time which probably affects my views on this, but I can definitely relate to aspects of it.
I think there are a couple of good signals in what you've said, but also some stuff (at least by implication/phrasing) that I would be mindful of.
The reason why I think your head is fundamentally in a good place is that you seem to be shooting for an outcome where already high effort stays high, and with the assistance of the tools your ambition can increase. That's very much my aspiration with it, and I think that's been the play for motivated hackers forever: become as capable as possible as quickly as possible by using every effort and resource. Certainly in my lifetime I've seen things like widely distributed source code in the 90s, Google a little later, StackOverflow indexed by Google, the mega-grep when I did the FAANG thing, and now the language models. They're all related (and I think less impressive/concerning to people who remember pre-SEO Google, that was up there with any LLM on "magic box with reasonable code").
But we all have to self-police on this because with any source of code we don't understand, the abstraction almost always leaks, and it's a slippery slope: you get a little tired or busy or lazy, it slips a bit, next thing you know the diff or project or system is jeopardized, and you're throwing long shots that compound.
I'm sure the reviewers can make their own call about whether you're in an ok place in terms of whether you're making a sincere effort or if you've slipped into the low-integrity zone (LLVM people are serious people), just be mindful that if you want the most out of it and to be welcome on projects and teams generally, you have to keep the gap between ability and scope in a band: pushing hard enough to need the tools and reviewers generous with their time is good, it's how you improve, but go too far and everyone loses because you stop learning and they could have prompted the bot themselves.
There's nontrivial historical precedent for this exact playbook: when a new paradigm (Lisp machines and GOFAI search, GPU backprop, softmax self-attention) is scaling fast, a lot of promises get made, a lot of national security money gets involved, and AI Summer is just balmy.
But the next paradigm breakthrough is hard to forecast, and the current paradigm's asymptote is just as hard to predict, so it's +EV to say "tomorrow" and "forever".
When the second becomes clear before the first, you turk and expert-label like it's 1988 and pray that the next paradigm breakthrough is soon: you bridge the gap with labeled data and compute until it works, or you run out of money and the DoD guy stops taking your calls. AI Winter is cold.
And just like Game of Thrones, no one, I mean no one, not Altman, not Amodei, not Allah Most Blessed, knows when the seasons in A Song of Math and Grift will change.
Now imagine if someone combined Jia Tan patience with the swiss-cheese security of all our editor plugins, nifty shell userland stuff, and all that.
Developer stuff is arguably the least scrutinized thing that routinely runs as mega root.
I wish I could say that I audit every elisp, neovim, vscode plugin and every nifty modern replacement for some creaky GNU userland tool. But bat, zoxide, fzf, atuin, starship, viddy, and about 100 more? Nah, I get them from nixpkgs in the best case, and I've piped things to sh.
Write a better VSCode plugin for some terminal panel LLM gizmo, wait a year or two?
I'd like to "reclaim" both AI and machine learning as relatively emotionally neutral terms of art for useful software we have today or see a clearly articulated path towards.
Trying to get the most out of tools that sit somewhere between "the killer robots will eradicate humanity", "there goes my entire career", "fuck that guy with the skill I don't want to develop, let's take his career", and "I'm going to be so fucking rich if we can keep the wheels on this" is exhausting.
I don't think that's achievable with all the science fiction surrounding "AI" specifically. You wouldn't be "reclaiming" the term, you'd be conquering an established cultural principality of emotionally-resonant science fiction.
Which is, of course, the precise reason why stakeholders are so insistent on using "AI" and "LLM" interchangeably.
Personally I think the only reasonable way to get us out of that psycho-linguistic space is to just say "LLMs" and "LLM agents" when that's what we mean (am I leaving out some constellation of SotA technology? no, right?)
I personally regard posterior/score-gradient/flow-match style models as the most interesting thing going on right now, ranging from rich media diffusers (the extended `SDXL` family tree which is now MMDiT and other heavy transformer stuff rapidly absorbing all of 2024's `LLM` tune ups) all the way through to protein discovery and other medical applications (tomography, it's a huge world).
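For anyone who hasn't looked at that family since the DDPM days, the training objective most of it has converged on (flow matching / rectified flow) is almost embarrassingly small. A minimal sketch, with a toy MLP standing in for the MMDiT:

```python
import torch
import torch.nn as nn

# Minimal conditional flow-matching step (rectified-flow convention):
# interpolate noise -> data on a straight line and regress the model's
# predicted velocity onto (data - noise).
class TinyVelocityNet(nn.Module):
    def __init__(self, dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))
    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, t], dim=-1))

model = TinyVelocityNet()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x1 = torch.randn(64, 32)        # stand-in "data" batch
x0 = torch.randn_like(x1)       # noise
t = torch.rand(64, 1)           # uniform timesteps in [0, 1]
x_t = (1 - t) * x0 + t * x1     # straight-line interpolant
target_v = x1 - x0              # velocity of that interpolant

opt.zero_grad()
loss = ((model(x_t, t) - target_v) ** 2).mean()
loss.backward()
opt.step()
```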
LLMs are very useful, but they're into the asymptote of expert-labeling and other data-bounded stuff (God knows why the GB200-style Blackwell build-out is looking like a trillion bucks when Hopper is idle all over the world and we don't have a second Internet to pretrain a bigger RoPE/RMSNorm/GQA/MLA mixture GPT than the ones we already have).
The fast interconnect between nodes has applications in inference at scale (big KV caches and other semi-durable state, multi-node tensor parallelism on mega models).
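The KV-cache point is just arithmetic, which is why the interconnect matters. A sketch with hypothetical model shapes (pick your own layer/head counts):

```python
# Rough KV-cache sizing for one long-context request.
# The shapes are hypothetical, roughly a large GQA model -- assumptions, not a spec.
layers, kv_heads, head_dim = 80, 8, 128
seq_len, dtype_bytes = 128_000, 2        # 128k context, FP16/BF16 cache entries

# 2x for K and V, per layer, per KV head, per position.
bytes_per_request = 2 * layers * kv_heads * head_dim * seq_len * dtype_bytes
print(f"{bytes_per_request / 2**30:.1f} GiB per sequence")   # ~39 GiB
```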
But this article in particular is emphasizing extreme performance ambitions for columnar data processing with hardware acceleration. Relevant to many ML training scenarios, but also to other kinds of massive MapReduce-style (or at least MapReduce-scale) workloads. There are lots of applications for a "magic massive petabyte-plus DataFrame" (which I don't think is solved in the general case).
https://gist.github.com/b7r6/94f738f4e5d1a67d4632a8fbd18d347...
Faster than Turbo with no pre-distill.