Yeah, I liked the way he explained this in his State of GPT talk (even if it might not be 100% literally accurate): each token has roughly the same amount of “computation” behind it, so if you want the model to do something more computationally complex, letting it use more tokens (“show your working”, etc.) yields better results because it can “do more computation” in a sense.
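
A rough back-of-the-envelope sketch of that intuition (not from the talk, just the common ~2 × parameter-count FLOPs-per-token rule of thumb for a decoder-only transformer's forward pass; the model size and token counts below are made up for illustration):

```python
# Each generated token costs roughly the same forward-pass compute,
# approximated here as ~2 FLOPs per model parameter per token,
# regardless of how "hard" that particular token is to produce.
n_params = 7e9                       # hypothetical 7B-parameter model
flops_per_token = 2 * n_params       # rough per-token forward-pass cost

terse_tokens = 5                     # bare answer, e.g. just "42"
verbose_tokens = 200                 # answer that "shows its working"

terse_compute = flops_per_token * terse_tokens
verbose_compute = flops_per_token * verbose_tokens

# The verbose answer gets ~40x more total compute for the same question,
# which is one way to frame why chain-of-thought style prompting helps.
print(verbose_compute / terse_compute)
```

So under this framing, “show your working” is effectively a way of letting the model spend more total compute on the problem, since the per-token budget is fixed.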