1.) Grab many GBs of text (books, etc).
2.) For each word, for each of the next $N words, store the distance from the current word and increment the count for that word pair/distance.
3.) For each word, store the most frequent following word at each distance up to $N. [a]
4.) Create a prediction algorithm that determines the next word (or set of words) to output from any user input. Basically this would compare word pair/distance counts and find the most probable next set of words (a rough sketch follows below).
How close would this be to GPT 2?
[a] You could go one step further and store multiple words for each distance, ordered by frequency.
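For concreteness, here is a minimal sketch of steps 2–4 in Python, assuming a toy corpus, a window of $N = 4, and a simple voting scheme over (word, distance) counts for step 4. The function names and the scoring rule are illustrative, not a definitive implementation:

```python
# Minimal sketch of steps 2-4: count (word, distance) -> following-word
# frequencies, then vote on the next word. N = 4 is an assumed window size.
from collections import Counter, defaultdict

N = 4  # "$N" in the steps above

def build_table(words, n=N):
    table = defaultdict(Counter)
    for i, w in enumerate(words):
        for d in range(1, n + 1):
            if i + d < len(words):
                table[(w, d)][words[i + d]] += 1
    return table

def predict_next(context, table, n=N):
    # Step 4: score candidates by summing counts over every (word, distance)
    # pair that the tail of the context implies for the next position.
    votes = Counter()
    for d, w in enumerate(reversed(context[-n:]), start=1):
        votes.update(table.get((w, d), {}))
    return votes.most_common(1)[0][0] if votes else None

corpus = "the king ordered the traitor beheaded and the king left the hall".split()
table = build_table(corpus)
print(predict_next("the king ordered the".split(), table))  # -> 'traitor'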
The scaling is brutal. If you have a 20k-word vocabulary and want to do 3-grams, you need a 20000^3 matrix of elements (8 trillion), most of which is going to be empty.
GPT and friends cheat by not modeling each word separately, but a high-dimensional “embedding” (just a vector, if you find the new vocabulary silly). The embedding places similar words near each other in this space, as in the famous king - man + woman ≈ queen example. So even if your training set has never seen “The Queen ordered the traitor <blank>”, it might have previously seen “The King ordered the traitor beheaded”. The vector representation lets the model use words that represent similar concepts without concrete examples.
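As a quick illustration (not from the comment above), the analogy can be checked with off-the-shelf pretrained vectors; this assumes gensim is installed and can download the glove-wiki-gigaword-50 vectors:

```python
# Illustration of the "king - man + woman ≈ queen" arithmetic with pretrained
# GloVe vectors via gensim's downloader (small ~66 MB model).
import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-50")
print(vectors.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
# 'queen' typically appears at or near the top of this list.
```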
Essentially you just count every n-gram that actually appears in the corpus, and "fill in the blanks" for all the zeros with some simple rules for smoothing out the probabilities.
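One of the simplest such rules is add-one (Laplace) smoothing; a minimal sketch over bigrams with a toy corpus (the helper name is illustrative):

```python
# Add-one (Laplace) smoothing over bigram counts: every unseen pair still
# gets a small nonzero probability.
from collections import Counter

def bigram_prob(w1, w2, bigrams, unigrams, vocab_size):
    # P(w2 | w1) = (count(w1 w2) + 1) / (count(w1) + V)
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

corpus = "the king ordered the traitor beheaded".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
V = len(unigrams)

print(bigram_prob("the", "king", bigrams, unigrams, V))      # seen pair
print(bigram_prob("the", "beheaded", bigrams, unigrams, V))  # unseen pair, still > 0
```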
There is some recent work [0] that explores this idea, scaling up n-gram models substantially while using word2vec vectors to capture similarity. It is used to compute something the authors call the Creativity Index [1].
Claude Shannon was interested in this kind of thing and had a paper on the entropy per letter or word of English. He also has a section in his famous "A Mathematical Theory of Communication" with experiments using the conditional probability of the next word based on the previous n=1,2 words from a few books. I wonder if the conditional entropy approaches zero as n increases, assuming ergodicity. But the number of entries in the conditional probability table blows up exponentially. The trick of combining multiple n=1 statistics at different distances sounds interesting, and reminds me a bit of contrastive prediction ML methods.
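For reference, the n=1 conditional entropy H(next word | previous word) can be estimated directly from bigram counts; a toy sketch (the corpus file path is hypothetical, and this is not Shannon's original procedure):

```python
# Rough estimate of H(next word | previous word) from bigram counts,
# in bits per word.
import math
from collections import Counter

def conditional_entropy(words):
    bigrams = Counter(zip(words, words[1:]))
    unigrams = Counter(words[:-1])  # only words that have a successor
    total = sum(bigrams.values())
    h = 0.0
    for (w1, w2), c in bigrams.items():
        p_pair = c / total         # P(w1, w2)
        p_cond = c / unigrams[w1]  # P(w2 | w1)
        h -= p_pair * math.log2(p_cond)
    return h

corpus = open("book.txt").read().lower().split()  # hypothetical corpus file
print(conditional_entropy(corpus))
```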
Anyway, the experiments in Shannon's paper sound similar to what you describe, but with less data and shorter distances, so they should give some idea of how it would look:
From the text:
* 5. First-order word approximation. Rather than continue with tetragram, ..., n-gram structure it is easier and better to jump at this point to word units. Here words are chosen independently but with their appropriate frequencies.
REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
6. Second-order word approximation. The word transition probabilities are correct but no further structure is included.
THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED
*
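Shannon's second-order word approximation is easy to reproduce by sampling each next word from bigram transition probabilities estimated from a corpus; a rough sketch, again with a hypothetical corpus file:

```python
# Second-order word approximation: sample each next word from the bigram
# transition probabilities of a corpus.
import random
from collections import Counter, defaultdict

def second_order_sample(words, length=25, seed=0):
    rng = random.Random(seed)
    transitions = defaultdict(Counter)
    for w1, w2 in zip(words, words[1:]):
        transitions[w1][w2] += 1
    out = [rng.choice(words)]
    for _ in range(length - 1):
        followers = transitions.get(out[-1])
        if not followers:
            break
        choices, weights = zip(*followers.items())
        out.append(rng.choices(choices, weights=weights, k=1)[0])
    return " ".join(out)

corpus = open("book.txt").read().upper().split()  # hypothetical corpus file
print(second_order_sample(corpus))
```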
This is pretty close to how language models worked in the 90s-2000s. Deep language models -- even GPT 2 -- are much, much better. On the other hand, n-gram language models are "surprisingly good" even for small n.
Pretty sure this wouldn't produce anything useful; it would generate incoherent gibberish that looks and sounds like English but makes no sense. It ignores perhaps the most important element of LLMs: the attention mechanism.
Everything has meaning in precise relation to its frequency of co-occurrence with every other thing.
I, too, have been mulling this. Word to word, paragraph to paragraph. Even letter to letter.
Also, what if you processed text in signal space? I keep wondering if that's possible. Then you'd get it all at once rather than in windows. Use a derivative of change for every page, so the phase space covers the signal end to end.
The problem is that for any reasonable value of N (>100) you will need prohibitive amounts of storage, and the table will be extremely sparse. And you won't capture any interactions between positions N-99 and N-98.
Transformers do that fairly well and are pretty efficient in training.
Aren't they technically the same? GPT picks the next token given the state of the current context, based on probabilities and a random factor. That is mathematically equivalent to a Markov chain, isn't it?
Markov chains don't account for the full history. While all LLMs do have a context length, this is more a practical limitation based on resources than anything inherent in the model.