They said they noticed themselves looking at their phone far less. Most likely that reduced usage is what saved the power, and the rest is just a typical spurious correlation with a bad theory attached to explain it.
They said they found themselves using their phone far less due to the grayscale, which would be the real thing extending battery life here. Or at least, this was what I assumed on reading.
That is likely. Another factor that came to mind is the GPU using less power due to simpler computations: you can store less data for grayscale, so you need to go over less pixel data to do effects etc. Whether the accessibility controls actually achieve this would be implementation dependent, I guess.
Even with the best GPU optimizations, most of the data will be processed in full color and then tossed through an extra pass at the end. More likely, all of it does.
I guess if one color of pixel were significantly less efficient, and that color were also overrepresented on the display, then MAYBE changing to grayscale would require slightly less power to display the same intensity. But I don't think that convoluted scenario is what this person was thinking.
No, not at all. There is a transformer obsession that is quite possibly not supported by the actual facts (CNNs can still do just as well: https://arxiv.org/abs/2310.16764), and CNNs definitely remain preferable for smaller and more specialized tasks (e.g. computer vision on medical data).
If you also get into more robust and/or specialized tasks (e.g. rotation invariant computer vision models, graph neural networks, models working on point-cloud data, etc) then transformers are also not obviously the right choice at all (or even usable in the first place). So plenty of other useful architectures out there.
Using transformers doesn't preclude keeping other tools in the toolbox.
What about DINOv2 and DINOv3, the 1B and 7B vision transformer models? This paper [1] suggests significant improvements over traditional YOLO-based object detection.
Indeed, there are even multiple attempts to use both self-attention and convolutions in novel architectures, and there is evidence this works very well and may have significant advantages over pure vision transformer models [1-2].
IMO there is little reason to think transformers are (even today) inherently the best architecture for any given deep learning application. Perhaps if a mega-corp poured all their resources into some convolutional transformer architecture, you'd get something better than just the current vision transformer (ViT) models, but, since so much optimization and training work has already gone into ViTs, and since we clearly still haven't maxed out their capacity, it makes sense to stick with them at scale.
That being said, ViTs are currently still clearly the best option if you want something trained on a near-entire-internet of image or video data.
Is there something I can read to get a better sense of what types of models are most suitable for which problems? All I hear about are transformers nowadays, but what are the types of problems for which transformers are the right architecture choice?
Just do some basic searches on e.g. Google Scholar for your task (e.g. "medical image segmentation", "point cloud segmentation", "graph neural networks", "timeseries classification", "forecasting") or task modification (e.g. "'rotation invariant' architecture") or whatever, sort by year, make sure to click on papers that have a large number of citations, and start reading. You will start to get a feel for domains or specific areas where transformers are and are not clearly the best models. Or just ask e.g. ChatGPT Thinking with search enabled about these kinds of things (and then verify the answer by going to the actual papers).
Also check HuggingFace and other model hubs and filter by task to see if any of these models are available in an easy-to-use format. But most research models will only be available on GitHub somewhere, and in general you are just deciding between a vision transformer and the latest convolutional model (usually a ConvNext vX for some X).
In practice, if you need to work with the kind of data that is found online, and don't have a highly specialized type of data or problem, then you do, today, almost always just want some pre-trained transformer.
But if you actually have to (pre)train a model from scratch on specialized data, in many cases you will not have enough data or resources to get the most out of a transformer, and often some kind of older / simpler convolutional model is going to give better performance at less cost. Sometimes in these cases you don't even want a deep learner at all, and classic ML or plain algorithms are far superior. A good example would be timeseries forecasting, where embarrassingly simple linear models blow overly-complicated and hugely expensive transformer models right out of the water (https://arxiv.org/abs/2205.13504).
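For a sense of how simple those linear baselines are, here is a rough sketch in PyTorch of the kind of "Linear" model that paper describes (the sizes are made up, and the actual DLinear variant adds a trend/seasonal decomposition on top):

    import torch
    import torch.nn as nn

    lookback, horizon, n_channels = 96, 24, 7   # arbitrary illustrative sizes

    class LinearForecaster(nn.Module):
        # One linear map from the lookback window to the forecast horizon,
        # shared across channels. That's the whole model.
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(lookback, horizon)

        def forward(self, x):                     # x: (batch, lookback, channels)
            return self.proj(x.transpose(1, 2)).transpose(1, 2)  # (batch, horizon, channels)

    y_hat = LinearForecaster()(torch.randn(32, lookback, n_channels))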
Oh, right, and unless TabPFNv2 (https://www.nature.com/articles/s41586-024-08328-6) makes sense for your use-case, you are still better off using boosted decision trees (e.g. XGBoost, LightGBM, or CatBoost) for tabular data.
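And the boosted-tree baseline for tabular data is similarly short; a hedged sketch with XGBoost below (the data, column count, and hyperparameters are just placeholders, and LightGBM / CatBoost have near-identical APIs):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from xgboost import XGBClassifier

    X = np.random.rand(1000, 20)                 # stand-in for your tabular features
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)    # stand-in for your labels

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
    model = XGBClassifier(n_estimators=300, learning_rate=0.05, max_depth=6)
    model.fit(X_tr, y_tr)
    print(model.score(X_te, y_te))               # sklearn-style accuracy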
Seconding this, the terms "Query" and "Value" are largely arbitrary and meaningless in practice. Look at how to implement this in PyTorch and you'll see these are just weight matrices that implement a projection of sorts, and self-attention is always just self_attention(x, x, x), or self_attention(x, x, y) in some cases, where x and y are outputs from previous layers.
Plus with different forms of attention, e.g. merged attention, and the research into why / how attention mechanisms might actually be working, the whole "they are motivated by key-value stores" thing starts to look really bogus. Really it is that the attention layer allows for modeling correlations and/or multiplicative interactions among a dimension-reduced representation.
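To make the "just weight matrices" point concrete, here is a rough single-head self-attention sketch in PyTorch (dimensions are arbitrary); nothing is being "queried" or "looked up", it's just three learned projections of the same input followed by a softmax-weighted sum:

    import torch
    import torch.nn as nn

    class TinySelfAttention(nn.Module):
        # "Query", "key" and "value" are just three learned linear projections
        # of the same input; the names carry no retrieval semantics.
        def __init__(self, d_model, d_head):
            super().__init__()
            self.W_q = nn.Linear(d_model, d_head, bias=False)
            self.W_k = nn.Linear(d_model, d_head, bias=False)
            self.W_v = nn.Linear(d_model, d_head, bias=False)

        def forward(self, x):                     # x: (batch, seq, d_model)
            q, k, v = self.W_q(x), self.W_k(x), self.W_v(x)
            scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5   # pairwise similarities
            return torch.softmax(scores, dim=-1) @ v                # (batch, seq, d_head)

    x = torch.randn(2, 16, 64)                    # output of some previous layer
    out = TinySelfAttention(64, 32)(x)            # i.e. self_attention(x, x, x)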
>the terms "Query" and "Value" are largely arbitrary and meaningless in practice
This is the most confusing thing about it imo. Those words all mean something but they're just more matrix multiplications. Nothing was being searched for.
Better resources will note the terms are just historical and not really relevant anymore, and just remain a naming convention for self-attention formulas. IMO it is harmful to learning and good pedagogy to say they are anything more than this, especially as we better understand the real thing they are doing is approximating feature-feature correlations / similarity matrices, or perhaps even more generally, just allow for multiplicative interactions (https://openreview.net/forum?id=rylnK6VtDH).
Definitely mostly just a practical thing IMO, especially with modern attention variants (sparse attention, FlashAttention, linear attention, merged attention, etc). Not sure it is even about hardware scarcity per se; it would just be really expensive in terms of both memory and FLOPs (and would not clearly increase model capacity) to use larger matrices.
Also, for the specific part where you, in code for encoder-decoder transformers, call attention as a(x, x, y) instead of the usual a(x, x, x) (what Alammar calls "encoder-decoder attention" in his diagram just before the "The Decoder Side" section), you have different matrix sizes, so dimension reduction is needed to make the matrix multiplications work out nicely too.
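In e.g. PyTorch this is literally the same module called with different arguments (note that torch's nn.MultiheadAttention takes (query, key, value), so the argument order below may differ from the a(x, x, y) notation above); the learned projections are what make the differently-sized inputs line up:

    import torch
    import torch.nn as nn

    attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
    enc = torch.randn(2, 50, 64)    # encoder output: 50 source positions
    dec = torch.randn(2, 10, 64)    # decoder states: 10 target positions

    self_out, _ = attn(enc, enc, enc)     # self-attention: all arguments are the same tensor
    cross_out, _ = attn(dec, enc, enc)    # "encoder-decoder attention": queries from the
                                          # decoder, keys/values from the encoder; the internal
                                          # projections map both onto embed_dim so the
                                          # differently-sized matmuls work out
    print(self_out.shape, cross_out.shape)   # (2, 50, 64) and (2, 10, 64)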
Also, in case you missed the recent big thread, fMRI has taught us almost nothing due to its serious limitations and various measurement and design issues in the field. IMO it is way too slow and clunky to ever yield insights into something as fast as linguistic thought.
This comment and the GP comment are why the term "causal model" is needed. LLMs are predictive* models of human language, but they are not causal models of language.
If you believe that some of human cognition is linguistic (even if e.g. inner monologue and spoken language are just the surface of deeper more unconscious processes), then, yes, we might say LLMs can predictively model some aspects of human cognition, but, again, they are certainly not causal models, and they are not predictive models of human cognition generally (as cognition is clearly far, far more than linguistic).
* I avoid calling LLMs "statistical" because they really aren't even that. They are not calibrated, and including a softmax and log-loss in things doesn't magically make your model statistical (especially since ad-hoc regularization methods, other loss functions and simplex mappings, e.g. sparsemax, often work better and then violate the assumptions that are needed to prove these things are behaving statistically). LLMs really are more accurately just doing (very, very fancy and impressive) curve/manifold-fitting.
They are not predictive models in the domains Chomsky investigated. LLMs make no predictions about, say, when non-surface quantifier scope should or should not be possible, or what should or shouldn’t be a wh-island. They are predictive in a sense that’s largely irrelevant to cognitive science. (Trying to guess what words might come after some other words isn’t a problem in cognitive science.)
"What should or shouldn’t be a wh-island" is literally a statement of "what words might come after some other words"! An LLM encodes billions of such statements, just unfortunately in a quantity and form that makes them incomprehensible to an unaided human. That part is strictly worse; but the LLM's statements model language well enough to generate it, and that part is strictly better.
As I read Norvig's essay, it's about that tradeoff, of whether a simple and comprehensible but inaccurate model shows more promise than a model that's incomprehensible except in statistical terms with the aid of a computer, but far more accurate. I understand there's a large group of people who think Norvig is wrong or incoherent; but when those people have no accomplishments except within the framework they themselves have constructed, what am I supposed to think?
Beyond that, if I have a model that tells me whether a sentence is valid, then I can always try different words until I find one that makes it valid. Any sufficiently good model is thus capable of generation. Chomsky never proposed anything capable of that; but that just means his models were bad, not that he was working on a different task.
As to the relationship between signals from biological neurons and ANN activations, I mean something like the paper linked below, whose authors write:
> Thus, even though the goal of contemporary AI is to improve model performance and not necessarily to build models of brain processing, this endeavor appears to be rapidly converging on architectures that might capture key aspects of language processing in the human mind and brain.
I emphasize again that I believe these results have been oversold in the popular press, but the idea that an ANN trained on brain output (including written language) might provide insight into the physical, causal structure of the brain is pretty mainstream now.
> "What should or shouldn’t be a wh-island" is literally a statement of "what words might come after some other words"!
This gets at the nub of the misunderstanding. Chomsky is interested in modeling the range of grammatical structures and associated interpretations possible in natural languages. The wh-island condition is a universal structural constraint that only indirectly (and only sometimes) has implications for which sequences of words are ‘valid’ in a particular language.
LLMs make no prediction at all as to whether or not natural languages should have wh-islands: they’ll happily learn languages with or without such constraints.
If you want a more concrete example of why wh-islands can’t be understood in terms of permissible or impermissible sequences of words, consider cases like
How often did you ask why John took out the trash?
The wh-island created by ‘why’ removes one of the in-principle possible interpretations (the embedded question reading where ‘how often’ associates with ‘took’), but the sequence of words is fine.
> Chomsky never proposed anything capable of that; but that just means his models were bad, not that he was working on a different task.
No, Chomsky really was working on a different task: a solution to the logical problem of language acquisition and a theory of the range of possible grammatical variation across human languages. There is no reason to think that a perfect theory in this domain would be of any particular help in generating plausible-looking text. From a cognitive point of view, text generation rather obviously involves the contribution of many non-linguistic cognitive systems which are not modeled (nor intended to be modeled) by a generative grammar.
>the paper linked below
This paper doesn’t make any claims that are obviously incompatible with anything that Chomsky has said. The fundamental finding is unsurprising: brains are sensitive to surprisal. The better your language model is at modeling whether or not a sequence of words is likely, the better you can predict the brain’s surprisal reactions. There are no implications for cognitive architecture. This ought to be clear from the fact that a number of different neural net architectures are able to achieve a good degree of success, according to the paper’s own lights.
> LLMs make no prediction at all as to whether or not natural languages should have wh-islands: they’ll happily learn languages with or without such constraints.
The human-designed architecture of an LLM makes no such prediction; but after training, the overall system including the learned weights absolutely does, or else it couldn't generate valid language. If you'd prefer to run in the opposite direction, then you can feed in sentences with correct and incorrect wh-movement, and you'll find the incorrect ones are much less probable.
That prediction is commingled with billions of other predictions, which collectively model natural language better than any machine ever constructed before. It seems like you're discounting it because it wasn't made by and can't be understood by an unaided human; but it's not like the physicists at the LHC are analyzing with paper and pencil, right?
> There is no reason to think that a perfect theory in this domain would be of any particular help in generating plausible-looking text.
Imagine that claim in human form--I'm an expert in the structure of the Japanese language, but I'm unable to hold a basic conversation. Would you not feel some doubt? So why aren't you doubting the model here? Of course it would have been outlandish to expect that of a model five years ago, but it isn't today.
I see your statement that Chomsky isn't attempting to model the "many non-linguistic cognitive systems", but those don't seem to cause the LLM any trouble. The statistical modelers have solved problem after problem that was previously considered impossible, and the practical applications of that are (for better or mostly worse) reshaping major aspects of society. Meanwhile, every conversation I've had with a Chomsky supporter seems to reduce to "he is deliberately choosing not to produce any result evaluable by a person who hasn't spent years studying his theories". I guess that's true, but that mostly just makes me regret what time I've already spent.
> The human-designed architecture of an LLM makes no such prediction; but after training, the overall system including the learned weights absolutely does, or else it couldn't generate valid language.
It makes a prediction about whatever language(s) are in the training data, but it doesn’t make any (substantial) predictions about general constraints on human languages. It really seems that you’re missing the absolutely fundamental goal of Chomsky’s research program here. Remember that whole “universal grammar” thingy?
> I'm an expert in the structure of the Japanese language, but I'm unable to hold a basic conversation. Would you not feel some doubt?
I expect anyone learning Japanese as a second language will get a chuckle out of this one. It’s in fact a common scenario. You can learn a lot about the grammar of a language, but conversation requires the ability to use that knowledge immediately and fluidly in a wide variety of situations. It is like the difference between “knowing how to solve a differential equation” and being able to answer 50 questions within an hour in a physics exam.
> I see your statement that Chomsky isn't attempting to model the "many non-linguistic cognitive systems", but those don't seem to cause the LLM any trouble.
Of course they don’t, because researchers creating LLMs are (in the vast majority of cases) not attempting to model any particular cognitive system; they have engineering goals, not scientific ones. You seem to be stuck in the view that Chomsky is somehow trying and completely failing to do the thing that LLMs do successfully. This certainly makes for a good straw man (if Chomsky had the same goals, then yeah, he never got anywhere), but it’s a misunderstanding of his research program.
> "he is deliberately choosing not to produce any result evaluable by a person who hasn't spent years studying his theories"
You could say this of many perfectly respectable fields. Andrew Wiles has not produced any result evaluable by me or by almost anyone else. It would certainly take me a lot more than “a few years” of study to evaluate his work.
I’m afraid there are no intellectual shortcuts. If you want to evaluate Chomsky’s work, you will have to at least read it, and maybe even think about it a bit too! It seems a bit churlish to whine about that. All you are being deprived of by opting out of this time investment is the opportunity to make informed criticisms of his work on the internet.
(The good news is that generative linguistics is actually pretty accessible, and one year of part time study would probably be enough to get the lay of the land.)
> Andrew Wiles has not produced any result evaluable by me or by almost anyone else.
Fermat wrote the theorem in the margin long before Wiles was born. There is no question that many people tried and failed to prove it. There is no question that Wiles succeeded, because the skill required to verify a proof is much less than the skill required to generate it. I haven't done so myself; but lots of other people have, and there is no dispute by any skilled person that his proof is correct. So I believe that Wiles has accomplished something significant.
I don't think Chomsky has any similar accomplishment. I roughly understand the grandiose final goal; I just see no evidence that he has made any progress towards it. Everything that I'd see as an interesting intermediate goal is dismissed as out of scope, especially when others achieve it. On the rare occasion that Chomsky has made externally intelligible predictions on the range of human language, they've been falsified anthropologically. I assume you followed the dispute on Pirahã, which I believe clarified that features like recursion were in fact optional, rendering the theory safely non-falsifiable again.
So what's his progress? Everything that I see turns inward, valuable only within the framework that he himself constructed. Anyone can build such a framework, so that's not an accomplishment. Convincing others to spend years of their lives on that framework is a sort of an achievement, but it's not a scientific one--homeopathy has many practitioners.
> I expect anyone learning Japanese as a second language will get a chuckle out of this one. It’s in fact a common scenario.
I think this view is just as wrong applied to a human as to a model. A beginning language student probably knows a lot more grammar rules than a native speaker, but their inability to converse doesn't come from their inability to quickly apply them. It comes from the fact that those rules capture only a small amount of the structure of natural language. You seem to acknowledge this yourself--if nothing Chomsky is working on would help a machine generate language, then it wouldn't help a human either. This also explains my teachers' usual advice to stop studying and converse as best I could, watch movies, etc.
Humans clearly learn language in a more structured way than LLMs do (since they don't need trillions of tokens), but they learn primarily from exposure, with partial structure but many exceptions. I don't think that's surprising, since most other things "designed" in an evolutionary manner have that same messy form. LLMs have succeeded spectacularly in modeling that, taking the usual definition in ML or other math for "modeling".
It's thus strange to me to see them dismissed as a source of insight into natural language. I guess most experts in LLMs are busy becoming billionaires right now; but if anything resembling Chomsky's universal grammar ever does get found to exist, then I'd guess it will be extracted computationally from models trained on corpora of different languages and not any human insight, in the same way that the Big Five personality traits fall out of a PCA.
> So what's his progress? Everything that I see turns inward, valuable only within the framework that he himself constructed.
It's really not true that the whole of generative linguistics is just some kind of self-referential parlor game. A lot of what we take for granted today as legitimate avenues of research in cognitive science were opened up as a direct consequence of Chomsky's critique of behaviorism and his insight that the mind is best understood as a computational system. Ironically, any respectable LLM will be perfectly happy to cover this in more detail if you probe it with some key terms like "behaviorism", "cognitive revolution" or "computational theory of mind".
> Pirahã
It's very unlikely that Everett's key claims about Pirahã are true (see e.g. https://dspace.mit.edu/bitstream/handle/1721.1/94631/Nevins-...). But anyway, the universality of recursive clausal embedding has never been a central issue in generative linguistics. Chomsky co-authored one speculative paper late in his career suggesting that recursion in some (vague) sense might be the core computational innovation responsible for the human language faculty. Everett latched on to that claim and the dispute went public, which has given a false impression of its overall centrality to the field.
> So what's his progress?
I don't see how we can discuss this question without getting into specifics, so let me try to push things in that direction. Here is a famous syntax paper by Chomsky: https://babel.ucsc.edu/~hank/On_WH-Movement.pdf It claims to achieve various things. Do you disagree, and if so, why?
> Japanese
A generative linguist studying Japanese wouldn't claim to be an expert on the structure of Japanese in your broad sense of the term. One thing to bear in mind is that generative linguistics is entirely opportunistic in its approach to individual languages. Generative linguists don't study Japanese because they give a fuck about Japanese as such (any more than physicists study balls rolling down inclined planes because balls and inclined planes are intrinsically fascinating). The aim is just to find data to distinguish competing hypotheses about the human language faculty, not to come to some kind of total understanding of Japanese (or whatever language).
> I guess most experts in LLMs are busy becoming billionaires right now; but if anything resembling Chomsky's universal grammar ever does get found to exist, then I'd guess it will be extracted computationally from models trained on corpora of different languages and not any human insight, in the same way that the Big Five personality traits fall out of a PCA.
This is a common pattern of argumentation. First, Chomsky's work is critically examined according to the highest possible scientific standards (every hypothesis must be strictly falsifiable, etc. etc.). Then when we finally get to see the concrete alternative proposal, it turns out to be nothing more than a promissory note.
> It's very unlikely that Everett's key claims about Pirahã are true
Everett achieved something unequivocally difficult--after twenty years of failed attempts by other missionaries, he was the first Westerner to learn Pirahã, living among the people and conversing with them in their language. In my view, that gives him significantly greater credibility than academics with no practical exposure to the language (and I assume you're aware of his response to the paper you linked).
I understand that to Chomsky's followers, Everett's achievement is meaningless, in the same way that LLMs saturating almost every prior benchmark in NLP is meaningless. But what achievements outside the "self-referential parlor game" are meaningful then? You must need something to ground yourself in outside reality, right?
> Then when we finally get to see the concrete alternative proposal, it turns out to be nothing more than a promissory note.
I'm certainly not claiming that statistical modeling has already achieved any significant insight into how physical structures in the brain map to an ability to generate language, and I don't think anyone else is either. We're just speculating that it might in future.
That seems a lot less grandiose to me than anything Chomsky has promised. In the present, that statistical modeling has delivered some pretty significant, strictly falsifiable, different but related achievements. Again, what does Chomsky's side have?
> I don't see how we can discuss this question without getting into specifics, so let me try to push things in that direction. Here is a famous syntax paper by Chomsky: https://babel.ucsc.edu/~hank/On_WH-Movement.pdf
And when I asked that before, you linked a sixty-page paper, with no further indication ("various things"?) of what you want to talk about. If you're trying to argue that Chomsky's theories are anything but a tarpit for a certain kind of intellectual curiosity, then I don't think that's helping.
Believe Everett if you want to, but it doesn’t make much difference to anything. Not every language has to exploit the option of recursive clausal embedding. The implications for generative linguistics are pretty minor. Yes, Everett responded to the paper I linked, and then there were further papers in the chain of responses (e.g. http://lingphil.mit.edu/papers/pesetsk/Nevins_Pesetsky_Rodri...).
> And when I asked that before, you linked a sixty-page paper, with no further indication ("various things"?) of what you want to talk about.
I was suggesting that we talk about the central claim of the paper (i.e. that the answer to question (50) is ‘yes’).
I don’t see how it’s reasonable to ask for something smaller than a paper if you want evidence that Chomsky’s research program has achieved some insight. That’s the space required to argue for a particular viewpoint rather than just state it.
In other words, if I concisely summarize Chomsky’s findings you’ll just dismiss them as bogus, and if I link to a paper arguing for a particular result, you’ll say it’s too long to read. So, essentially, you have decided not to engage with Chomsky’s work. That is a perfectly legitimate thing to do, but it does mean that you cannot make informed criticisms of it.
> So, essentially, you have decided not to engage with Chomsky’s work. That is a perfectly legitimate thing to do, but it does mean that you cannot make informed criticisms of it.
Any criticism that I'd make of homeopathy would be uninformed by the standards of a homeopath--I don't know which poison to use, or how many times to strike the bottle while I'm diluting it, or whatever else they think is important. But to their credit they're often willing to put their ideas to the external test (like with an RCT), and I know that evidence in aggregate shows no benefit. I'm therefore comfortable criticizing homeopathy despite my unfamiliarity with its internals.
I don't claim any qualifications to criticize the internals of Chomsky's linguistics, but I do feel qualified to observe the whole thing appears to be externally useless. It seems to reject the idea of falsifiable predictions entirely, and if one does get made and then falsified then "the implications for generative linguistics are pretty minor". After dominating academic linguistics for fifty years, it has never accomplished anything considered difficult outside the newly-created field. So why is this a place where society should expend more of its finite resources?
Hardy wrote his "Mathematician's Apology" to answer the corresponding question for his more ancient field, explicitly acknowledging the uselessness of many subfields but still defending them. He did that with a certain unease though, and his promises of uselessness also turned out to be mistaken--he repeatedly took number theory as his example, not knowing that in thirty years it would underlie modern cryptography. Chomsky's linguists seem to me like the opposite of that, shouting down anyone who questions them (he called Everett a "charlatan") while proudly delivering nothing to the society funding their work. So why would I want to join them?
>but I do feel qualified to observe the whole thing [Chomskyian linguistics] appears to be externally useless
Sure, Chomsky's work doesn't have practical applications. Most scientific work doesn't. It's just that, for obvious reasons, you tend to hear more about the work that does. You mention number theory. Number theory had existed for a lot longer than Chomskyan linguistics has now when Hardy chose it as an example of a field with no practical applications.
> seems to reject the idea of falsifiable predictions entirely,
As a former syntactician who's constructed lots of theories that turned out to be false, I can't really relate to this one. If you look through the generative linguistics literature you can find innumerable instances of promising ideas rejected on empirical grounds. Chomsky himself has revised or rejected his earlier work many times. A concrete example would be the theory of parasitic gaps presented in Concepts and Consequences (quickly falsified by the observation that parasitic gap dependencies are subject to island constraints).
The irony here is that generative syntax is actually a field with a brutal peer review culture and extremely high standards of publication. Actual syntax papers are full of detailed empirical argumentation. Here is one relatively short and accessible example chosen at random: http://www.skase.sk/Volumes/JTL03/04.pdf
>After dominating academic linguistics for fifty years, it has never accomplished anything considered difficult outside the newly-created field
What does this even mean? Has geology accomplished something considered difficult outside of geology? I don't really understand what standard you are trying to apply here.
> Sure, Chomsky's work doesn't have practical applications. Most scientific work doesn't.
> Has geology accomplished something considered difficult outside of geology?
Ask an oilfield services company? A structural engineer who needs a foundation? If that work were easy, then their geologists wouldn't get paid.
I could have just said "economically important", but that seemed too limiting to me. For example, computer-aided proofs were a controversial subfield of math, but I'd take their success on the four-color theorem (which came from outside their subfield and had resisted proof by other means) as evidence of their value, despite the lack of practical application for the result. I think that broader kind of success could justify further investment, but I also don't see that here.
> As a former syntactician who's constructed lots of theories that turned out to be false
I should clarify that I do see a concept of falsifiability at that level, of whether a grammar fits a set of examples of a language. That seems pretty close to math or CS to me. I don't see how that small number of examples is supposed to scale to an entire natural language or to anything about the human brain's capability for language, and I don't see any falsifiable attempt to make that connection. (I don't see much progress towards the loftiest goals from the statistical approach either, but their spectacular engineering results break that tie for me.)
Anyways, Merry Christmas if you're celebrating. I guess we're unlikely to be the ones to settle this dispute, but I appreciate the insight into the worldview.
I am not arguing that people should be paid public money to do Chomskyan linguistics. That is an entirely separate question from the question of whether or not Chomsky's key claims are true and whether his research program has made progress. Again, you will have to throw out the majority of science if you hold to the criterion that only work with practical applications has any value.
I also think that you continue to underestimate Chomsky's overall influence on cognitive science. If you think that post-cognitive-revolution cognitive science has achieved anything of note, then you ought to give Chomsky partial credit for that.
>I don't see how that small number of examples is supposed to scale to an entire natural language
Wide coverage generative grammars certainly exist, though they were never something that Chomsky himself was interested in. Here is one in a Chomskyan idiom: https://aclanthology.org/P19-1238.pdf
I'm still puzzled by your point about falsifiability. I haven't seen anything close to a falsifiable claim from people who are excited about the cognitive implications of LLMs. The argument is little more than "look at the cool stuff these things can do – surely brains must work a bit like this too!" Read almost anything by Chomsky and you'll find it's full of quite specific claims that can be empirically tested. I guess people get excited about the fact that the architecture of LLMs is superficially brain-like, but it's doubtful that this gets us any closer to an understanding of the relevant computations at the neural level.
> I'm surprised to see it viewed so negatively here, dismissed with no engagement with his specific arguments and examples.
I struggle to motivate myself to engage with it because it is unfortunately quite out of touch with (or just ignores) some core issues and the major advances in causal modeling and causal-modeling theory, i.e. Judea Pearl and do-calculus, structural equation modeling, counterfactuals, etc. [1].
It also, IMO, makes a (highly idiosyncratic) distinction between "statistical" (meaning, trained / fitted to data) and "probabilistic" models, that doesn't really hold up too well.
I.e. probabilistic models in quantum physics are "fit" too, in that the values of fundamental constants are determined by experimental data, but these "statistical" models are clearly causal models regardless. Even most quantum physical models can be argued to be causal, just the causality is probabilistic rather than absolute (i.e. A ==> B is fuzzy implication rather than absolute implication). It's only if you ask deliberately broad ontological questions (e.g. "Does the wave function cause X") that you actually run into the problem of quantum models being causal or not, but for most quantum physical experiments and phenomena generally, the models are still definitely causal at the level of the particles / waves / fields involved.
I don't want to engage much with the arguments because, in my opinion, it starts on the wrong foot by making an incoherent / unsound distinction, while also ignoring (or just being out of date with) the actual scientific and philosophical progress already made on these issues.
I would also say there is a whole literature on tradeoffs between explanation (descriptive models in the worst case, causal models in the best case) and prediction (models that accurately reproduce some phenomenon, regardless of whether they are based on a true description or causal model). There are also loads of examples of things that are perfectly deterministic and modeled by perfect "causal" models but which of course still defy human comprehension / intuition, in that the equations need to be run on computers for us to make sense of them (differential equation models, chaotic systems, etc). Or just more practically, we can learn to do all sorts of physical and mental skills, but of course we understand barely anything about the brain and how it works and co-ordinates with the body. But obviously such an understanding is mostly irrelevant for learning how to operate effectively in the world.
I.e. in practice, if the phenomenon is sufficiently complex, a causal model that also accurately models the system is likely to be too complex for us to "understand" anyway (or you just have identifiability issues so you can't decide between multiple different models; or you don't have the time / resources / measurement capacity to do all the experiments needed to solve the identifiability problem anyway), so there is almost always a tradeoff between accuracy and understanding. Understanding is a nice luxury, but in many cases not important, and in complex cases, probably not achievable at all. If you are coming from this perspective, the whole "quandary" of the essay seems just odd.
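(If the do-calculus / structural-equation jargon above is unfamiliar, here is a toy illustration of the observational vs. interventional distinction it formalizes; the structural equations and coefficients are entirely made up:)

    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000

    # Toy structural causal model with a confounder:
    #   Z -> X, Z -> Y, and X -> Y with a true causal effect of +1.0
    Z = rng.normal(size=n)
    X = 2.0 * Z + rng.normal(size=n)
    Y = 1.0 * X + 3.0 * Z + rng.normal(size=n)

    # Purely predictive / associational view: regress Y on X alone.
    assoc_slope = np.cov(X, Y)[0, 1] / np.var(X)      # ~2.2, biased by the confounder Z

    # Interventional view, do(X = x): cut the Z -> X edge by setting X ourselves.
    X_do = rng.normal(size=n)                         # X no longer depends on Z
    Y_do = 1.0 * X_do + 3.0 * Z + rng.normal(size=n)
    causal_slope = np.cov(X_do, Y_do)[0, 1] / np.var(X_do)   # ~1.0, the true effect

    print(assoc_slope, causal_slope)   # great for prediction vs. correct about intervention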
Unless and until neurologists find evidence of a universal grammar unit (or a biological Transformer, or whatever else) in the human connectome, I don't see how any of these models can be argued to be "causal" in the sense that they map closely to what's physically happening in the brain. That question seems so far beyond current human knowledge that any attempt at it now has about as much value as the ancient Greek philosophers' ideas on the subatomic structure of matter.
So in the meantime, Norvig et al. have built statistical models that can do stuff like predicting whether a given sequence of words is a valid English sentence. I can invent hundreds of novel sentences and run their model, checking each time whether their prediction agrees with my human judgement. If it doesn't, then their prediction has been falsified; but these models turned out to be quite accurate. That seems to me like clear evidence of some kind of progress.
You seem unimpressed with that work. So what do you think is better, and what falsifiable predictions has it made? If it doesn't make falsifiable predictions, then what makes you think it has value?
I feel like there's a significant contingent of quasi-scientists that have somehow managed to excuse their work from any objective metric by which to evaluate it. I believe that both Chomsky and Judea Pearl are among them. I don't think every human endeavor needs to make falsifiable predictions; but without that feedback, it's much easier to become untethered from any useful concept of reality.
I would think it was quite clear from my last two paragraphs that I agree causal models are generally not as important as people like Chomsky think, and that they are in general achievable only in incredibly narrow cases. Besides, all models are wrong, but some are useful.
> You seem unimpressed with that work
I didn't say anything about Norvig's work; I was saying the linked essay is bad. It is correct that Chomsky is wrong, but it is a bad essay because it tries to argue against Chomsky with a poorly-developed distinction while ignoring much stronger arguments and concepts that more clearly get at the issues. IMO the essay is also weirdly focused on language and language models, when this is a general issue about causal modeling and scientific and technological progress, so the narrow focus just weakens the whole argument.
Also, Judea Pearl is a philosopher, and do-calculus is just one way to think about and work with causality. Talking about falsifiability here is odd, and sounds almost to me like saying "logic is unfalsifiable" or "modeling the world mathematically is unfalsifiable". If you meant something like "the very concept of causality is incoherent", that would be the more appropriate criticism here, and more arguable.
I could iterate with an LLM and Lean, and generate an unlimited amount of logic (or any other kind of math). This math would be correct, but it would almost surely be useless. For this reason, neither computer programs nor grad students are rewarded simply for generating logically correct math. They're instead expected to prove a theorem that other people have tried and failed to prove, or perhaps to make a conjecture with a form not obvious to others. The former is clearly an achievement, and the latter is a falsifiable prediction.
I feel like Norvig is coming from that standpoint of solving problems well-known to be difficult. This has the benefit that it's relatively easy to reach consensus on what's difficult--you can't claim something's easy if you can't do it, and you can't claim it's hard if someone else can. This makes it harder to waste your life on an internally consistent but useless sidetrack, as you might even agree (?) Chomsky has.
You, Chomsky, and Pearl seem to reject that worldview, instead believing the path to an important truth lies entirely within your and your collaborators' own minds. I believe that's consistent with the ancient philosophers. Such beliefs seem to me halfway to religious faith, accepting external feedback on logical consistency, but rejecting external evidence on the utility of the path. That doesn't make them necessarily bad--lots of people have done things I consider good in service of religions I don't believe in--but it makes them pretty hard to argue with.
I'm not sure how you can square anything you said in your last paragraph with anything I said about all models being wrong, and causal modeling being extremely limited.
I had this exact reaction: the lack of any discussion of "causal modeling" makes the whole thing seem horribly out of touch with the real issues here. You can have explanatory and predictive models that are causal, or explanatory and predictive models that are non-causal, and that is the actual issue, not "explanation" vs. "prediction", which is not a tight enough distinction.
The section on Claude Code is very ambiguously and confusingly written. I think he meant that the agent runs on your computer (not the inference), and that this is in contrast to agents running "on a website" or in the cloud:
> I think OpenAI got this wrong because I think they focused their codex / agent efforts on cloud deployments in containers orchestrated from ChatGPT instead of localhost. [...] CC got this order of precedence correct and packaged it into a beautiful, minimal, compelling CLI form factor that changed what AI looks like - it's not just a website you go to like Google, it's a little spirit/ghost that "lives" on your computer. This is a new, distinct paradigm of interaction with an AI.
However, if so, this is definitely a distinction that needs to be made far more clearly.