1.) Grab many GBs of text (books, etc).
2.) For each word, for each of the next $N words, store the distance from the current word and increment the count for that word pair/distance.
3.) For each word, store the most frequent following word at each distance up to $N. [a]
4.) Create a prediction algorithm that determines the next word (or set of words) to output from any user input. Basically this would compare word pair/distance counts and find the most probable next set of words (a rough sketch follows below).
How close would this be to GPT 2?
[a] You could go one step further and store multiple words for each distance, ordered by frequency.
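For concreteness, here is a minimal sketch of steps 2–4 in Python, assuming a toy corpus, a window of $N = 4, and a simple voting scheme over (word, distance) counts for step 4. The function names and the scoring rule are illustrative, not a definitive implementation:

```python
# Minimal sketch of steps 2-4: count (word, distance) -> following-word
# frequencies, then vote on the next word. N = 4 is an assumed window size.
from collections import Counter, defaultdict

N = 4  # "$N" in the steps above

def build_table(words, n=N):
    table = defaultdict(Counter)
    for i, w in enumerate(words):
        for d in range(1, n + 1):
            if i + d < len(words):
                table[(w, d)][words[i + d]] += 1
    return table

def predict_next(context, table, n=N):
    # Step 4: score candidates by summing counts over every (word, distance)
    # pair that the tail of the context implies for the next position.
    votes = Counter()
    for d, w in enumerate(reversed(context[-n:]), start=1):
        votes.update(table.get((w, d), {}))
    return votes.most_common(1)[0][0] if votes else None

corpus = "the king ordered the traitor beheaded and the king left the hall".split()
table = build_table(corpus)
print(predict_next("the king ordered the".split(), table))  # -> 'traitor'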
The scaling is brutal. If you have a 20k-word vocabulary and want to do 3-grams, you need a 20000^3 matrix of elements (8 trillion), most of which is going to be empty.
GPT and friends cheat by not modeling each word separately, but a high-dimensional “embedding” (just a vector, if you find the new vocabulary silly). The embedding places similar words near each other in this space, as in the famous king - man + woman ≈ queen example. So even if your training set has never seen “The Queen ordered the traitor <blank>”, it might have previously seen “The King ordered the traitor beheaded”. The vector representation lets the model use words that represent similar concepts without concrete examples.
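As a quick illustration (not from the comment above), the analogy can be checked with off-the-shelf pretrained vectors; this assumes gensim is installed and can download the glove-wiki-gigaword-50 vectors:

```python
# Illustration of the "king - man + woman ≈ queen" arithmetic with pretrained
# GloVe vectors via gensim's downloader (small ~66 MB model).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically appears at or near the top of this list.
```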
Essentially you just count every n-gram that actually appears in the corpus, and "fill in the blanks" for all the zeros with some simple rules for smoothing out the probabilities.
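One of the simplest such rules is add-one (Laplace) smoothing; a minimal sketch over bigrams with a toy corpus (the helper name is illustrative):

```python
# Add-one (Laplace) smoothing over bigram counts: every unseen pair still
# gets a small nonzero probability.
from collections import Counter

def bigram_prob(w1, w2, bigrams, unigrams, vocab_size):
    # P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V)
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

corpus = "the king ordered the traitor beheaded".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)

print(bigram_prob("the", "king", bigrams, unigrams, V))      # seen pair
print(bigram_prob("the", "beheaded", bigrams, unigrams, V))  # unseen pair, still > 0
```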
There is some recent work [0] that explores this idea, scaling up n-gram models substantially while using word2vec vectors to capture similarity. It is used to compute something the authors call the Creativity Index [1].
Claude Shannon was interested in this kind of thing and had a paper on the entropy per letter or word of English. He also has a section in his famous "A Mathematical Theory of Communication" with experiments using the conditional probability of the next word based on the previous n=1,2 words from a few books. I wonder if the conditional entropy approaches zero as n increases, assuming ergodicity. But the number of entries in the conditional probability table blows up exponentially. The trick of combining multiple n=1 statistics at different distances sounds interesting, and reminds me a bit of contrastive prediction ML methods.
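For reference, the n=1 conditional entropy H(next word | previous word) can be estimated directly from bigram counts; a toy sketch (the corpus file path is hypothetical, and this is not Shannon's original procedure):

```python
# Rough estimate of H(next word | previous word) from bigram counts,
# in bits per word.
import math
from collections import Counter

def conditional_entropy(words):
    bigrams = Counter(zip(words, words[1:]))
    unigrams = Counter(words[:-1])  # only words that have a successor
    total = sum(bigrams.values())
    h = 0.0
    for (w1, w2), c in bigrams.items():
        p_pair = c / total         # P(w1, w2)
        p_cond = c / unigrams[w1]  # P(w2 | w1)
        h -= p_pair * math.log2(p_cond)
    return h

corpus = open("book.txt").read().lower().split()  # hypothetical corpus file
print(conditional_entropy(corpus))
```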
Anyway, the experiments in Shannon's paper sound similar to what you describe, but with less data and shorter distances, so they should give some idea of how it would look:
From the text:
* 5. First-order word approximation. Rather than continue with tetragram, ..., n-gram structure it is easier and better to jump at this point to word units. Here words are chosen independently but with their appropriate frequencies.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
6. Second-order word approximation. The word transition probabilities are correct but no further structure is included.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
*
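Shannon's second-order word approximation is easy to reproduce by sampling each next word from bigram transition probabilities estimated from a corpus; a rough sketch, again with a hypothetical corpus file:

```python
# Second-order word approximation: sample each next word from the bigram
# transition probabilities of a corpus.
import random
from collections import Counter, defaultdict

def second_order_sample(words, length=25, seed=0):
    rng = random.Random(seed)
    transitions = defaultdict(Counter)
    for w1, w2 in zip(words, words[1:]):
        transitions[w1][w2] += 1
    out = [rng.choice(words)]
    for _ in range(length - 1):
        followers = transitions.get(out[-1])
        if not followers:
            break
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(out)

corpus = open("book.txt").read().upper().split()  # hypothetical corpus file
print(second_order_sample(corpus))
```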
This is pretty close to how language models worked in the 90s-2000s. Deep language models -- even GPT 2 -- are much, much better. On the other hand, n-gram language models are "surprisingly good" even for small n.
Pretty sure this wouldn't produce anything useful; it would generate incoherent gibberish that looks and sounds like English but makes no sense. It ignores perhaps the most important element of LLMs: the attention mechanism.
Everything has meaning in precise relation to its frequency of co-occurrence with every other thing.
I, too, have been mulling this. Word to word, paragraph to paragraph. Even letter to letter.
Also, what if you processed text in signal space? I keep wondering if that's possible. Then you'd get it all at once rather than in windows. Use a derivative of change for every page, so the phase space covers the signal end to end.
The problem is that for any reasonable value of N (>100) you will need prohibitive amounts of storage, and the table will be extremely sparse. And you won't capture any interactions between positions N-99 and N-98.
Transformers do that fairly well and are pretty efficient in training.
Aren't they technically the same? GPT picks the next token given the state of the current context, based on probabilities and a random factor. That is mathematically equivalent to a Markov chain, isn't it?
Markov chains don't account for the full history. While all LLMs do have a context length, this is more a practical limitation based on resources than anything inherent in the model.