
Interesting read, and some interesting ideas, but there's a problem with statements like these:

> Sean proposes that in the AI future, the specs will become the real code. That in two years, you'll be opening python files in your IDE with about the same frequency that, today, you might open up a hex editor to read assembly.

> It was uncomfortable at first. I had to learn to let go of reading every line of PR code. I still read the tests pretty carefully, but the specs became our source of truth for what was being built and why.

This doesn't make sense as long as LLMs are non-deterministic. The prompt could be perfect, but there's no way to guarantee that the LLM will turn it into a reasonable implementation.

With compilers, I don't need to crack open a hex editor on every build to check the assembly. The compiler is deterministic and well-understood, not to mention well-tested. Even if there's a bug in it, the bug will be deterministic and debuggable. LLMs are neither.



The fun part is that specs already are non-deterministic.

If you spend the time to write out requirements in English in a way that cannot be misinterpreted, you end up with a programming language.


Humans don't make mistakes nearly as much, the mistakes they do make are far more predictable (and easier to spot in code review), and they don't tend to make the kinds of catastrophic mistakes that could sink a business. LLMs, on the other hand, tend to make codebases rapidly deteriorate, since even very disciplined reviewers can miss the kinds of strange and unpredictable stuff an LLM will do. Redundant code isn't evident in a diff, and neither are things like tautological tests, or useless tests that mock everything and only actually test the mocks. Or they'll write a bunch of redundant code because they aggressively avoid code re-use unless you are very specific.

The real problem is just that they don't have brains, and can't think. They generate text that is optimized to look the most right, but not to be the most right. That means they're deceptive right off the bat. When a human is wrong, it usually looks wrong. When an LLM is wrong, it's generating the most correct looking thing it possibly could while still being wrong, with no consideration for actual correctness. It has no idea what "correctness" even means, or any ideas at all, because it's a computer doing matmul.

They are text-summarization, regurgitation, pattern-matching machines. They regurgitate summaries of things seen in their training data, and that training data was written by humans who can think. We just let ourselves get duped into believing the machine is where the thinking is coming from, and not the (likely uncompensated) author(s) whose work was regurgitated for you.


> Humans don't make mistakes nearly as much [...]

Yeah, I remember how in every large corporation the specs were perfectly interpreted and with no issues, at all. Humans are great at communication and understanding each other.

https://google.com/search?q=tree+swing+cartoon


I don't really see this argument playing out.

For all the time I've spent trying out Copilot: if I ask it to make a "todo app", it will make me a "todo app"; it will not hallucinate a "calculator app".

Duplicated code or tautological tests are not going to sink the business.

In my experience, I have seen far more problems and hours burned from the wrong application of DRY than from duplication.

I can say that in regulated environments, duplicated code with no abstractions is even preferred.


>The real problem is just that they don't have brains, and can't think.

That would have had more weight if you hadn't just described junior developer behavior beforehand.

"LLMs can't think" is anthropocentric cope. It's the old AI effect all over again - people would rather die than admit that there's very little practical difference between their own "thinking" and that of an AI chatbot.


> That would have had more weight if you haven't just described junior developer behavior beforehand.

Effectively saying that junior developers "don't have brains" is in very bad taste and offensively wrong.

> people would rather die than admit that there's very little practical difference between their own "thinking" and that of an AI chatbot.

Would you like to elaborate on this?

I was told that McDonald's employees would have been replaced by now, that self-driving cars would be driving the streets, and that new medicines would have been discovered.

It's been a couple of years since "AI" came out, and there's no singularity yet.


LLMs use the same type of "abstract thinking" process as humans. Which is why they can struggle with 6-digit multiplication (unlike computer code, very much like humans), but not with parsing out metaphors or describing what love is (unlike computer code, very much like humans). The capability profile of an LLM is amusingly humanlike.

Setting the bar for "AI" at "singularity" is a bit like setting requirements for "fusion" at "creating a star more powerful than the Sun". Very good for dismissing all existing fusion research, but not any good for actually understanding fusion.

If we had two humans, one with IQ 80 and another with IQ 120, we wouldn't say that one of them isn't "thinking". It's just that one of them is much worse at "thinking" than the other. Which is where a lot of LLMs are currently at. They are, for all intents and purposes, thinking. Are they any good at it though? Depends on what you want from them. Sometimes they're good enough, and sometimes they aren't.


> LLMs use the same type of "abstract thinking" process as humans

It's surprising you say that, considering we don't actually understand the mechanisms behind how humans think.

We do know that human brains are so good at patterns, they'll even see patterns and such that aren't actually there.

LLMs are a pile of statistics that can mimic human speech patterns if you don't tax them too hard. Anyone who thinks otherwise is just Clever Hans-ing themselves.


We understand the outcomes well enough. LLMs converge onto a similar process by being trained on human-made text. Is LLM reasoning a 1:1 replica of what the human brain does? No, but it does something very similar in function.

I see no reason to think that humans are anything more than "a pile of statistics that can mimic human speech patterns if you don't tax them too hard". Humans can get offended when you point it out though. It's too dismissive of their unique human gift of intelligence that a chatbot clearly doesn't have.


> We understand the outcomes well enough

We do not, in fact, "understand the outcomes well enough" lol.

I don't really care if you want to have an AI waifu or whatever. I'm pointing out that you're vastly underestimating the complexity behind human brains and cognition.

And that complex human brain of yours is attributing behaviors to a statistical model that the model does not, in fact, possess.


[flagged]


I consider both "LLMs can produce outcomes akin to those produced by human intelligence (in many but not all cases)" and "LLMs are intelligent" to be fairly defensible.

> I see no reason whatsoever to believe that what your wet meat brain is doing now is any different from what an LLM does.

I don't think this follows though. Birds and planes can both fly, but a bird and a plane are clearly not doing the same thing to achieve flight. Interestingly, both birds and planes excel at different aspects of flight. It seems at least plausible (imo likely) that there are meaningful differences in how intelligence is implemented in LLMs and humans, and that that might manifest as some aspects of intelligence being accessible to LLMs but not humans and vice versa.


> It seems at least plausible (imo likely) that there are meaningful differences in how intelligence is implemented in LLMs and humans

Intelligence isn’t "implemented" in an LLM at all. The model doesn’t carry a reasoning engine or a mental model of the world. It generates tokens by mathematically matching patterns: each new token is chosen to best fit the statistical patterns it learned from its training data and the immediate context you give it. In effect, it’s producing a compressed, context-aware summary of the most relevant pieces of its training data, one token at a time.

The training data is where the intelligence happened, and that's because it was generated by human brains.


There doesn't seem to be much consensus on defining what intelligence is. For the definitions of at least some reasonable people of sound mind, I think it is defensible to call them intelligent, even if I don't necessarily agree. I sometimes call them "intelligent" because many of the things they do seem to me like they should require intelligence.

That said, to whatever extent they're intelligent or not, by almost any definition of intelligence, I don't think they're achieving it through the same mechanism that humans do. That is my main argument. I think confident arguments that "LLMs think just like humans" are very bad, given that we clearly don't understand how humans achieve intelligence, and given the vastly different substrates and constraints that humans and LLMs are working with.


I guess to me: how is the ability to represent the statistical distribution of outcomes of almost any combination of scenarios, represented as textual data, not a form of world model?


I think you're looking at it too abstractly. An LLM isn't representing anything; it has a bag of numbers that some other algorithm produced for it. When you give it some numbers, it does matrix operations on them in order to randomly select a token from a softmax distribution, one at a time, until the EOS token is generated.
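For illustration only, here's a toy sketch of that decoding loop (the model_forward function and EOS_ID below are made-up stand-ins for this example, not any real library's API):

    # Toy sketch of the decoding loop described above: feed the context through the
    # model, turn the resulting scores into a softmax distribution, draw one token,
    # repeat until an end-of-sequence token appears.
    import numpy as np

    EOS_ID = 0  # hypothetical end-of-sequence token id

    def softmax(logits, temperature=1.0):
        z = (logits - logits.max()) / temperature
        p = np.exp(z)
        return p / p.sum()

    def generate(model_forward, prompt_ids, max_tokens=100, temperature=1.0, rng=None):
        rng = rng or np.random.default_rng()
        ids = list(prompt_ids)
        for _ in range(max_tokens):
            logits = model_forward(ids)                      # matrix math over the whole context
            probs = softmax(np.asarray(logits, dtype=float), temperature)
            next_id = int(rng.choice(len(probs), p=probs))   # random draw, one token at a time
            ids.append(next_id)
            if next_id == EOS_ID:
                break
        return ids

    # Usage with a dummy "model" that returns uniform scores over a 50-token vocabulary:
    print(generate(lambda ids: np.zeros(50), [1, 2, 3], max_tokens=5))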

If they don't have any training data that covers a particular concept, they can't map it onto a world model and make predictions about that concept based on an understanding of the world and how it works. This video (https://www.youtube.com/watch?v=160F8F8mXlo) illustrates it pretty well. These things may or may not end up being fixed in the models, but that's only because they've been further trained with the specific examples. Brains have world models. Cats see a cup of water, and they know exactly what will happen when you tip it over (and you can bet they're gonna do it).


That video is a poor and misunderstood analysis of an old version of ChatGPT.

Analyzing the image-generation failure modes of the DALL-E family of models isn't really helpful in understanding whether the invoking LLM has a robust world model or not.


The point of my sharing the video was to use the full glass of wine as an example of how generative AI models doing inference lack a true world model. The example is just as relevant now as it was then, and it applies to inference done by LLMs and SD models in the same way. Nothing has fundamentally changed in how these models work. Getting better at edge cases doesn't give them a world model.


That's the point though. Look at any end-to-end image model. Currently I think Nano Banana (Gemini 2.5 Flash) is probably the best in prod. (It looks like ChatGPT has regressed its image pipeline right now with GPT-5, but I'm not sure.)

SD models have a much higher propensity to fixate on proximal, in-distribution solutions because of the way they de-noise.

For example, you can ask Nano Banana for a "completely full wine glass in zero g", which I'm pretty sure is way more out of distribution, and the model does a reasonable job of approximating what that might look like.


That's a fairly bad example. They don't have any trouble taking unrelated things and sticking them together. A world model isn't required to take two unrelated things and stick them together. If I ask it to put a frog on the moon, it can know what frogs look like and what the moon looks like, and put the frog on the moon.

But what it won't be able to do, which does require a world model, is put a frog on the moon, and be able to imagine what that frog's body would look like on the moon in the vacuum of space as it dies a horrible death.


Your example is a good one. The frog won't work because, ethically, the model won't readily show a dead frog, BUT if you ask nano-banana for:

"Create an image of what a watermelon would look like after being teleported to the surface of the moon for 30 seconds."

You'll see a burst frozen melon usually.


> "We don't fully understand how a bird works, and thus: "wind tunnel" is useless, Wright brothers are utter fools, what their crude mechanical contraptions are doing isn't actually flight, and heavier than air flight is obviously unattainable."

Completely false equivalency. Back then we did in fact completely understand "how a bird works" and how the physics of flight work. The problem of getting man-made flying vehicles off the ground was mostly about not having good enough materials to build one (plus some economics-related issues).

Whereas in case of AI, we are very far from even slightly understanding how our brains work, how the actual thinking happens.


One of the Wright brothers' achievements was to realize that the published tables of flight physics were wrong, and to carefully redo them with their own wind tunnel until they had a correct model from which to design a flying vehicle: https://humansofdata.atlan.com/2019/07/historical-humans-of-...


Ok, that's pretty cool. I didn't know that, thanks!


We have a good definition of flight, we don't have a good definition of intelligence.


"Anthropocentric cope >:(" is one of the funniest things I've read this week, so genuinely thank you for that.

"LLMs think like people do" is the equivalent of flat earth theory or UFO bros.

Flerfers run on ignorance, misunderstanding and oppositional defiant disorder. You can easily prove the earth is round in quite a lot of ways (the Greeks did it) but the flerfers either don't know them or refuse to apply them.

There are quite a lot of reasons to believe brains work differently than LLMs (and ways to prove it) you just don't know them or refuse to believe them.

It's neat tech, and I use them. They're just wayyyyyyyy overhyped and we don't need to anthropomorphize them lol


This is wrong on so many levels. I feel like this is what I would have said if I never took a neuroscience class, or actually used an LLM for any real work beyond just poking around ChatGPT from time to time between TED talks.


There is no actual object-level argument in your reply, making it pretty useless. I’m left trying to infer what you might be talking about, and frankly it’s not obvious to me.

For example, what relevance is neuroscience here? Artificial neural nets and real brains are entirely different substrates. The “neural net” part is a misnomer. We shouldn’t expect them to work the same way.

What's relevant is the psychology literature. Do artificial minds behave like real minds? In many ways they do: LLMs exhibit the same sorts of fallacies and biases as human minds. Not exactly 1:1, but surprisingly close.


I didn't say brains and ANNs are the same, in fact I am making quite the opposite argument here.

LLMs exhibit these biases and fallacies because they regurgitate the biases and fallacies that were written by the humans that produced their training data.


Maybe. That's not an obvious conclusion in the strong sense that you mean it here. If you train an LLM on transcripts of multiplying very large numbers, machine-generated and perfectly accurate transcripts, the LLM still exhibits the same sorts of mental math errors that people make.

Math, logical reasoning, etc. are cultural knowledge, not architecturally built-in. These biases and fallacies arise because of how we process higher-order concepts via language-like mechanisms. It should not be surprising that LLMs, which mimic human-like natural language abilities (at the culture/learned level of abstraction, if not the computational substrate), exhibit the same sorts of errors.


Living in Silicon Valley, I see MANY self-driving cars on the road right now. At a stop light the other day, I was between three of them, none with a human inside.

It is so weird when people pull self-driving cars out as some kind of counterexample. Just because something doesn't happen on the most optimistic time scale doesn't mean it isn't happening. These things happen slowly, and then all at once.


15 years ago they said truck drivers would be obsolete in 1-2 years. They are still not obsolete, and they aren't on track to be any time soon, either.


So… COBOL?


Specs are ambiguous but not necessarily non-deterministic.

The same entity interpreting the spec in exactly the same way will resolve the ambiguities the same way each time.

Human and current AI interpretation of specs is a non-deterministic process. But if we wanted to build a deterministic AI, we could.


> But, if we wanted to build a deterministic AI we could.

Is this bold proposal backed by any theory?


Given that they all use pseudo-random (and not actually random) numbers, they are "deterministic" in the sense that given a fixed seed, they will produce a fixed result...

But perhaps that's not what was meant by deterministic. Something like an understandable process producing an answer rather than a pile of linear algebra?


I was thinking the exact same thing: if you don't change the weights and use an identical "temperature" etc., the same prompt will yield the same output. Under the hood it's still deterministic code running on a deterministic machine.


This is incorrect. Temperature would need to be zero to get the same result.


This is not correct. If the algorithm is deterministic and the random seed is the same, the temperature can be anything and you'll get the same result.

Same as using a seed to get the same map generated in Dwarf Fortress.
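To make that concrete, here's a toy demonstration with NumPy (not an actual LLM, and real inference stacks add their own nondeterminism from batching and floating-point reduction order): with a fixed seed, sampling at a nonzero temperature is reproducible.

    # Sampling from a temperature-scaled distribution is still reproducible
    # when the pseudo-random seed is fixed.
    import numpy as np

    def sample_tokens(seed, n=10, temperature=0.8):
        rng = np.random.default_rng(seed)
        logits = np.array([2.0, 1.0, 0.5, 0.1])   # fixed scores for a tiny 4-token "vocab"
        probs = np.exp(logits / temperature)
        probs /= probs.sum()
        return rng.choice(len(probs), size=n, p=probs)

    print(sample_tokens(seed=42))
    print(sample_tokens(seed=42))   # identical output: same seed, same draws
    print(sample_tokens(seed=7))    # different seed, (almost certainly) different draws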


Thanks for replying, this is what I thought as well.


You’re right - TIL


You can just change your definition of "AI". Back in the 60s the pinnacle of AI was things like automatic symbolic integration and these would certainly be completely deterministic. Nowadays people associate "AI" with stuff like LLMs and diffusion etc. that have randomness included in to make them seem "organic", but it doesn't have to be that way.

I actually think a large part of people's amazement with the current breed of AI is the random aspect. It's long been known that random numbers are cool (see Knuth volume 2, in particular where he says randomness makes computer-generated graphics and music more "appealing"). Unfortunately, being amazed by graphics and music (and now text) output is one thing; making logical decisions with real consequences is quite another.


In 2025, 99% of people are talking about LLMs or stable diffusion.


So then your question boils down to "how could X, which I've defined as Y, be Z?"


The cutting edge of AI research is LLMs. Those aren't deterministic.

You can build an "AI" with whatever you want, but context matters and we live in 2025, not 1985.


> if we wanted to build a deterministic AI we could

What's holding us back from building a deterministic generative AI?

I probably don't understand enough, but I assumed that the non-determinism was inherent in current LLM technology.


Not really; code, even in high-level languages, is always lower-level than English, just for computer-nonsense reasons. Example: "read a CSV file and add a column containing the multiple of the price and quantity columns".

That's about 20 words. Show me the programming language that can express that entire feature in 20 words. Even very English-like languages like Python or Kotlin might just about manage it; if you're working in something else like C++, then no.

In practice, this spec will expand into changes to your dependency lists (and therefore you must know which library is used for CSV parsing in your language; the AI knows this stuff better than you), then some file handling, error handling if the file doesn't exist, maybe some UI like flags or other configuration, working out what the column names are, writing the loop, saving the result back out, and writing unit tests. Any reasonable programmer will produce a very similar PR given this spec, but the diff will be much larger than the spec.


> Show me the programming language that can express that entire feature in 20 words.

In python:

    import pandas
    mycsv = pandas.read_csv("/path/to/input.csv")
    mycsv['total_cost'] = mycsv.price*mycsv.quantity
Not only is this shorter, but it contains all of the critical information that you left out of your English prompt: where is the CSV? What are the input columns named? What is the output column named? What do you want to do with the output?

I also find it easier to read than your english prompt.


> `mycsv = pandas.read_csv("/path/to/input.csv")`

You have to count the words in the functions you call to get the correct length of the implementation, which in this case is far, far more than 20 words. read_csv has more than 20 arguments; you can't even write the function definition in under 20 words.

Otherwise, I can run every program by importing one function (or an object with a single method, or what have you) and just running that function. That is obviously a stupid way to count.


I really can't tell if this is meant as a joke.

Anyway, I just wrote what I, personally, would type in a normal work day to accomplish this coding task.


It isn't a joke; you need the Kolmogorov complexity of the code that implements the feature, which has nothing to do with the fact that you're using someone else's solution. You may not have to think about all the code needed to parse a CSV, but someone did, and that's a cost of the feature whether you want to think about it or not.

Again, if someone else writes a 100,000-line function for you and wraps it in a "do_the_thing()" method, calling it is still calling a 100,000-line function: the computer still has to run those lines, and if something goes wrong, SOMEONE has to go digging in it. Ignoring the costs you don't pay is ridiculous.


We are comparing between a) asking an LLM to write code to parse a csv and b) writing code to parse a csv.

In both cases, they'll use a csv library, and a bajillion items of lower-level code. Application code is always standing on the shoulders of giants. Nobody is going to manually write assembly or machine code to parse a csv.

The original contention, which I was refuting, is that it's quicker and easier to use an LLM to write the python than it is to just write the python.

Kolmogorov complexity seems pretty irrelevant to this question.


You actually have to count the number of bytes in the generated machine code to get the real count


Ok but how much physical space do those bytes take up? Need to measure them.


>"read a CSV file and add a column containing the multiple of the price and quantity columns"

This is an underspecification if you want to reliably and repeatably produce similar code.

The biggest difference is that some developers will read the whole CSV into memory before doing the computations. In practice, the difference between those implementations is huge.

Another big difference is how you represent the price field. If you parse prices as floats and the quantity is big enough, you'll end up with errors. Even if the quantity is small, you'll have to deal with rounding in your new column.

You didn't even specify the name of the new column, so the name is going to be different every time you run the LLM.

What happens if you run this on a file the program has already been run on?

And these are just a few of the reasonable ways of fitting that spec while producing wildly different programs. A spec that has a good chance of producing a reasonably similar program each time looks more like:

“Read input.csv (UTF-8, comma-delimited, header row). Read it line by line, do not load the entire file into memory. Parse the price and quantity columns as numbers, stripping currency symbols and thousands separators; interpret decimals using a dot (.). Treat blanks as null and leave the result null for such rows. Compute per-row line_total = round(Decimal(price) * Decimal(quantity), 2). Append line_total as the last column (name the column "Total") without reordering existing columns, and write to output.csv, preserving quoting and delimiter. Do not overwrite existing columns. Do not evaluate or emit spreadsheet formulas.”

And even then you couldn't just check this in and expect the same code to be generated each time; you'd need a large test suite just to constrain the LLM. And even then the LLM would still occasionally find ways to generate code that passes the tests but does things you don't want it to.
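To give a sense of how much code even that spec implies, here's a minimal sketch of a conforming implementation (my own illustration, not generated output and not anyone's code from this thread), assuming the input columns are literally named "price" and "quantity":

    # Sketch of the spec above: stream input.csv row by row, compute the total
    # with Decimal math, append it as a "Total" column, write output.csv, and
    # refuse to overwrite an existing column.
    import csv
    import re
    from decimal import Decimal, ROUND_HALF_UP, InvalidOperation

    def parse_number(raw):
        """Strip currency symbols and thousands separators; blanks become None."""
        if raw is None or raw.strip() == "":
            return None
        cleaned = re.sub(r"[^0-9.\-]", "", raw)
        try:
            return Decimal(cleaned)
        except InvalidOperation:
            return None

    with open("input.csv", newline="", encoding="utf-8") as src, \
         open("output.csv", "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or [])
        if "Total" in fieldnames:
            raise SystemExit("column 'Total' already exists; refusing to overwrite")
        writer = csv.DictWriter(dst, fieldnames=fieldnames + ["Total"])
        writer.writeheader()
        for row in reader:  # streamed; the whole file is never held in memory
            price = parse_number(row.get("price"))
            quantity = parse_number(row.get("quantity"))
            if price is None or quantity is None:
                row["Total"] = ""  # leave the result null for blank inputs
            else:
                total = (price * quantity).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
                row["Total"] = str(total)
            writer.writerow(row)

Even this sketch glosses over parts of the spec (quote preservation, the formula rule), which is rather the point: the spec is already most of the way to a program.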


But why would I want to reliably produce similar code? The under-specification is deliberate. Maybe I don't care about the name of the column as long as it's reasonable.

How to represent prices: same. This is computer nonsense. There's one right way to do it, the LLM knows that way, it should do it.

How to do it scalably: same. If the file is named the agent can just look at its size to decide on the best implementation.

Your alternative spec is too detailed and has many details that can be easily inferred by the AI, like defaulting to UTF-8 and comma delimited. This is my point. There are many possible implementations in code, some better and some worse, and we shouldn't need to spell out all that detail in English when so much of it is just about implementation quality.


>But why would I want to reliably produce similar code?

If you're doing a one-shot CSV, then an LLM or a custom program is the wrong way to do it. Any spreadsheet editor can do this task instantly with 4 symbols.

Assuming you want a repeatable process you need to define that repeatable process with enough specificity to make it repeatable and reliable.

You can do this in a formal language created for the purpose, or you can invent your own English-like specification language.

You can create a very loose specification and let someone else, a programmer or an LLM define the reliable, repeatable process for you. If you go with a junior programmer or an LLM though, you have to verify that the process they designed is actually reliable and repeatable. Many times it won't be and you'll need to make changes.

It's easier to write a few lines of python than to go through that process--unless you don't already know how to program, in which case you can't verify the output anyway.

That's not to say that I don't see beneficial use cases for AI, this just isn't one of them.

>This is my point. There are many possible implementations in code, some better and some worse, and we shouldn't need to spell out all that detail in English when so much of it is just about implementation quality.

If you don't actually care about implementation quality or correctness, sure. But you should, and LLMs cannot reliably pick the correct implementation details. They aren't even close to being able to do that.

The only people who are able to produce working software with LLMs are writing very, very detailed specifications, to the point where they aren't operating at a much higher level than Python.

Btw I had a Claude Sonnet 4 agent try your prompt.

It produced a 90 line python file in 7 minutes that reads the entire file into memory, performs floating point multiplication, doesn't correctly display the money values, and would crash if the price column ever had any currency symbols.


> I had a Claude Sonnet 4 agent try your prompt. It produced a 90 line python file in 7 minutes that reads the entire file into memory, performs floating point multiplication, doesn't correctly display the money values, and would crash if the price column ever had any currency symbols.

OK, that ups the stakes :)

I'm working on my own agent at the moment and gave it this task. I first had it generate a 10M-row CSV with randomized product code, price, and quantity columns.

It has two modes: fast and high quality. In fast mode I gave it the task "add to products.csv a column containing the multiple of the price and quantity columns". In 1m21s it wrote an AWK script that processed the file in a streaming manner and used it to add the column, with a backup file. So the solution did scale but it didn't avoid the other edge cases.

Then I tried the higher quality mode with the slightly generalized prompt "write a program that adds a column to a CSV file containing the multiple of the price and quantity columns". In this mode it generates a spec from the task, then reviews its own spec looking for potential bugs and edge cases, then incorporates its own feedback to update the spec, then implements the spec (all in separate contexts). This is with GPT-5.

The spec it settled on takes into account all those edge cases and many more, e.g. it thought about byte order marks, non-float math, safe overwrite, scientific notation, column name collisions, exit codes to use and more. It considered dealing with currency symbols but decided to put that out of scope (I could have edited the spec to override its decision here, but didn't). Time elapsed:

1. Generating the test set, 1m 9sec

2. Fast mode, 1m 21sec (it lost time due to a header quoting issue it then had to circle back and fix)

3. Quality mode, 48sec on initial spec, 2m on reviewing the spec, 1m 30sec on updating the spec (first attempt to patch in place failed, it tried again by replacing the entire file), 4m on implementing the spec - this includes time in which it tested its own solution and reviewed the output.

I believe the results to be correct and the program to tackle not only all the edge cases raised in this thread but others too. And yet the prompt was no more complex than the one I gave originally, and the results are higher quality than I'd have bothered to write myself.

I don't know which agent you used, but right now we're not constrained by model intelligence. Claude is a smart model; I'm sure it could have done the same, but the workflows the agents are implementing are full of very low-hanging fruit.


Your spec isn’t actually a spec because it doesn’t produce the same software between runs.

The prompt is fantasy; all the "computer stuff" is reality. The computer stuff is the process that is actually running. If it's not possible to look at your prompt and know fairly accurately what the final process is going to look like, you are not operating at a higher level of abstraction; you are asking a genie to do your work for you, and maybe it gets it right.

Your prompt produces a spec: the actual code. Now that code is the spec, but you need to spend the time reading it well enough to understand what the spec actually is, since you didn't write it.

Then you need to go through the new spec and make sure you're happy with all of the decisions the LLM made. Do they make sense? Are there any requirements you need that it missed? Do you actually need to handle all of the edge cases it did handle?

>many more

The resulting code is almost certainly over-engineered if it's handling "many more": byte order marks, name collisions, etc. What you should do is settle on the column names beforehand.

This is a very common issue with junior developers. I call it "what-if driven development". Which, again, is why the only people having success with LLM coding are writing highly detailed specs that are very close to a programming language, or are generating something small, like a function at a time.


Full code with prebuilt libraries/packages/components will be the winning setup.


Iverson languages could do that quite succinctly.


"The prompt could be perfect, but there's no way to guarantee that the LLM will turn it into a reasonable implementation."

I think it is worse than that. The prompt, written in natural language, is by its very nature vague and incomplete, which is great if you are aiming for creative artistry. I am also really happy that we are able to search for dates using phrases like "get me something close to a weekend, but not on Tuesdays" on a booking website instead of picking dates from a dropdown box.

However, if natural language were the right tool for software requirements, software engineering would have been a solved problem long ago. We got rightfully excited about LLMs, but now we are trying to solve every problem with them. IMO, for requirements specification, the situation is similar to earlier efforts using formal systems and full verification, but at the exact opposite end. Similar to formal software verification, I expect this phase to end up as a partially failed experiment that will teach us new ways to think about software development. It will create real value in some domains and be totally abandoned in others. Interesting times...


“This doesn't make sense as long as LLMs are non-deterministic.”

I think this is a logical error. Non-determinism is orthogonal to probability of being correct. LLMs can remain non-deterministic while being made more and more reliable. I think “guarantee” is not a meaningful standard because a) I don’t think there can be such a thing as a perfect prompt, and b) humans do not meet that standard today.


> With compilers, I don't need to crack open a hex editor on every build to check the assembly.

The tooling is better than just cracking open the assembly but in some areas people do effectively do this, usually to check for vectorization of hot loops, since various things can mean a compiler fails to do it. I used to use Intel VTune to do this in the HPC scientific world.


We'd also have to pretend that anyone has ever been any good at writing descriptive, detailed, clear, and precise specs or documentation. That might be a skillset that appears in the workforce, but absolutely not in 2 years. A technical writer who deeply understands software engineering so they can prompt correctly, but who is happy not actually looking at code and just goes along with whatever the agent generates? I don't buy it.

This seems like a typical case of an engineer forgetting that people aren't machines.


I agree this whole spec based approach is misguided. Code is the spec.


> This doesn't make sense as long as LLMs are non-deterministic.

I think we will find ways around this, because humans are also non-deterministic. So what do we do? We review our code, test it, etc. LLMs could do a lot more of that. E.g., they could maintain and run extensive testing, among other ways to validate that behavior matches the spec.
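As a toy illustration of what that could look like (the line_total function below is a hypothetical stand-in for generated code, not anything from the article): behavioral tests can pin individual spec clauses, so a regenerated implementation is validated against the spec's behavior rather than its exact code.

    # Hypothetical behavioral tests pinning two spec clauses: blank inputs yield
    # a null total, and totals are Decimal values rounded to cents.
    from decimal import Decimal

    def line_total(price, quantity):
        """Stand-in for whatever implementation was generated."""
        if price is None or quantity is None:
            return None
        return (Decimal(price) * Decimal(quantity)).quantize(Decimal("0.01"))

    def test_blank_inputs_give_null_total():
        assert line_total(None, "3") is None
        assert line_total("9.99", None) is None

    def test_totals_are_rounded_to_cents():
        assert line_total("2.50", "4") == Decimal("10.00")
        assert line_total("0.10", "3") == Decimal("0.30")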


If you're reviewing the code, then you're no longer "opening python files with the same frequency that you open up a hex editor to read assembly".


This. Even with junior devs, implementation is always more or less deterministic (based on one's abilities/skills/aptitude). With AI models, you get totally different implementations even when specifically given clear directions via prompt.


Neither are humans, so this argument doesn't really stand.


> Neither are humans, so this argument doesn't really stand.

Even when we give a spec to a human and tell them to implement it, we scrutinize and test the code they produce. We don't just hand over a spec and blindly accept the result. And that's despite the fact that humans have a lot more common sense, and the ability to ask questions when a requirement is ambiguous.


Not only that but they’re lossy. A hex representation is strictly more information as long as comments are included or generated.


sounds like a good nudge to make tests better


If the tests are written by the AI, who watches the watchers? :-)


But the people writing that already knew it. So why are they writing this kind of stuff? What the fuck is even going on?


> there's no way to guarantee that the LLM will turn it into a reasonable implementation.

There's also no way to guarantee that you're not going to get hit by a meteor strike tomorrow. It doesn't have to be provably deterministic at a computer-science-PhD level for people without PhDs to say, eh, it's fine. Okay, it's not deterministic. What does that mean in practice? Given the same spec.md file, at the layer of abstraction where we're no longer writing code by hand, who cares, because of a lack of determinism, whether the variable for the filename object is called filename or fname or file or name, as long as the code is doing something reasonable? If it works, if it passes tests, if we presume that the stochastic parrot is going to parrot out its training data sufficiently closely each time, why is it important?

As far as compilers being deterministic, there's a fascinating detail we ran into with Ksplice: they're not. They're only deterministic enough that we trust them to be fine. There was a bug we kept tripping, back in roughly 2006, where GCC would swap the registers used for a variable, resulting in the Ksplice patch being larger than it had to be, to include handling the register swap as well. The bug has since been fixed, exposing the details of why it was choosing different registers, but unfortunately I don't remember enough of them. So don't believe me if you don't want to, but the point is: we trust the C compiler, given a function that takes in variables a, b, c, d, that a, b, c, and d will be mapped to r0, r1, r2, and r3. We don't actually care what order that mapping goes in, so long as it works.

So the leap, that some have made and others have not, is that LLMs aren't going to randomly flip out and delete all your data. Which is funny, because that's actually happened on Replit. Despite that, despite the fact that LLMs still hallucinate total bullshit and go off the rails, some people trust LLMs enough to convert a spec to working code. Personally, I think we're not there yet, and won't be while GPU time isn't free. (Arguably it already is, because anybody can just start typing into chat.com, but that's propped up by VC funding. That isn't infinite, so we'll have to see where we're at in a couple of years.)

That addresses the determinism part. The other part that was raised is debuggability. Again, I don't think we're at a place where we can get rid of generated code any time soon, and as long as code is being generated, we can debug it using traditional techniques. As far as debugging LLMs themselves, it's not zero. It's not mainstream yet, but it's an active area of research. We can abliterate models and fine-tune them (or whatever) to answer "how do you make cocaine", counter to their training. So they're not total black boxes.

Thus, even if traditional software development dies off, the new field is LLM creation and editing. As with new technologies, porn picks it up first: Llama and other downloadable models (they're not open source: https://www.downloadableisnotopensource.org/ ) have been fine-tuned or whatever to generate adult content, despite being trained not to. So that's new jobs being created in a new field.


What does "it works" mean to you? For me, that'd be deterministic behavior, and your description about brute forcing LLMs to the desired result through a feedback loop with tests is just that. I mean, sure, if something gives the same result 100% of the time, or 90% of the time, or fuck it, even 80-50% of the time, that's all deterministic in the end, isn't it?

The interesting thing is, for something to be deterministic that thing doesn't need to be defined first. I'd guess we can get an understanding of day/night-cycles without understanding anything about the solar system. In that same vein your Ksplice GCC bug doesn't sound nondeterministic. What did you choose to do in the case of the observed Ksplice behavior? Did you debug and help with the patch, or did you just pick another compiler? It seems that somebody did the investigation to bring GCC closer to the "same result 100% of the time", and I truly have to thank that person.

But here we are and LLMs and the "90% of the time"-approach are praised as the next abstraction in programming, and I just don't get it. The feedback loop is hailed as the new runtime, whereas it should be build time only. LLMs take advantage of the solid foundations we built and provide an NLP-interface on top - to produce code, and do that fast. That's not abstraction in the sense of programming, like Assembly/C++/Blender, but rather abstraction in the sense of distance, like PC/Network/Cloud. We use these "abstractions in distance" to widen reach, design impact and shift responsibilities.


Having been writing a lot of AWS CDK/IAC code lately, I'm looking at this as the "spec" being the infrastructure code and the implementation being the deployed services based on the infrastructure code.

It would be an absolute clown show if AWS could take the same infrastructure code and perform the deployment of the services differently each time, i.e. non-deterministically. There are already all kinds of external variables other than the infra code which can affect the deployment, such as existing deployed services which sometimes need to be (manually) destroyed for the new deployment to succeed.



