Cloudflare has announced a new feature that automatically serves websites in Markdown to AI agents. With Markdown for Agents, the company is responding to the growing share of web traffic that comes from AI crawlers and agents: plain Markdown carries the same content as a full HTML page in far fewer tokens, which makes it cheaper and faster for LLM-based agents to consume.
A few things come to mind
A token is basically a linguistic unit: a word, a piece of a word, or a short phrase.
LLMs don’t parse text word-by-word because that would miss a lot of idiomatic meaning and other context. “Dave shot a hole in one at the golf course” might, in principle, be parsed as “{Dave} {shot} {a hole in one} {at the golf course}”.
They use NLP techniques to “tokenize” text, meaning they parse it into individual tokens, so depending on the tokenizer I suppose there could be slight variations in how a given text gets tokenized.
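To make that concrete, here is a minimal sketch using OpenAI’s open-source tiktoken library. cl100k_base is just one published encoding; other tokenizers will split the same text differently:

```python
# Minimal tokenization sketch (pip install tiktoken).
import tiktoken

# cl100k_base is one of OpenAI's published encodings; other models
# and other vendors use different vocabularies and split differently.
enc = tiktoken.get_encoding("cl100k_base")

text = "Dave shot a hole in one at the golf course"
ids = enc.encode(text)                   # list of integer token IDs
pieces = [enc.decode([i]) for i in ids]  # the text chunk each ID maps back to
print(ids)
print(pieces)
```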
Then the LLM runs the token stream through layers of attention heads (matrix operations over the tokens’ vector representations) in order to assess the probabilistic relationships between the tokens, and uses that process to generate a response via next-token prediction.
It’s a bit more complex than that, of course: tensor calculus, billions of weighted parameters, stacks of layers each with their own hidden dimensions, plus matmuls, masks, softmax, and dropout. There’s also the “context window”, which is how many tokens the model can process at a time. But that’s the gist of it.
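For anyone curious what the matmul/mask/softmax part looks like, here’s a toy, self-contained sketch of scaled dot-product attention with a causal mask. Random stand-in weights and made-up sizes, not a real model:

```python
# Toy scaled dot-product attention with a causal mask (numpy only).
# A sketch of the core computation: real LLMs use learned weights,
# many heads per layer, and dozens of stacked layers.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 8                        # 5 tokens, 8-dim vectors (made-up sizes)
X = rng.normal(size=(seq_len, d))        # stand-in token embeddings

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # stand-in learned weights
Q, K, V = X @ Wq, X @ Wk, X @ Wv         # queries, keys, values

scores = Q @ K.T / np.sqrt(d)            # pairwise token-to-token affinities
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                   # causal mask: a token can't attend to the future
attn = softmax(scores)                   # each row sums to 1: attention weights
out = attn @ V                           # each token gets a weighted mix of values
print(out.shape)                         # (5, 8): one updated vector per token
```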
But a token is just the basic unit that gets run through those processes.
Here’s an OpenAI page that allows you to enter text and see how it gets tokenized:
https://platform.openai.com/tokenizer
A token is the name for the base unit of text that an LLM works with; it has always been that way. The LLM does not work with characters directly: they are grouped into chunks, often smaller than a whole word, and this stream of tokens is what the LLM actually processes. This is also why LLMs have such trouble with spelling questions like “how many Rs in raspberry?”: they never see the individual letters in the first place, so they do not know.
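You can watch this happen with a tokenizer library. A minimal sketch, assuming OpenAI’s open-source tiktoken package and its cl100k_base encoding (other encodings will split differently):

```python
# Sketch: what "raspberry" looks like to a model (tiktoken, cl100k_base).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["raspberry", " raspberry"]:  # a leading space changes the split
    ids = enc.encode(word)
    print(repr(word), "->", [enc.decode([i]) for i in ids])
# The model receives chunk IDs, not the letters r-a-s-p-b-e-r-r-y,
# so counting Rs means reasoning about spelling it never directly sees.
```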
No, LLMs do not all tokenize the same way. Different tokenizers are (or at least once were) one of the major ways they differed from each other. A simple tokenizer might split words into one token per syllable, but I believe modern ones use statistical subword schemes like byte-pair encoding (BPE) instead; there’s a quick comparison sketched below.
My understanding is very basic and out-of-date.
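For what it’s worth, the differences are easy to see side by side. A sketch using two encodings that ship with tiktoken; the exact splits and token counts depend on the encoding:

```python
# Sketch: the same sentence under two different encodings that tiktoken ships.
import tiktoken

text = "Dave shot a hole in one at the golf course"
for name in ["gpt2", "cl100k_base"]:  # an older vs. a newer OpenAI encoding
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    print(f"{name}: {len(ids)} tokens -> {[enc.decode([i]) for i in ids]}")
```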