Cloudflare has announced a new feature that automatically serves websites in Markdown to AI agents. With Markdown for Agents, the company is responding to
A token is the word for the base unit of text that an LLM works with. It’s always been that way. The LLM does not directly work with characters; they are collected together into chunks less than a word and this stream of tokens is what the LLM is processing. This is also why the LLMs have such trouble with spelling questions like “how many Rs in raspberry?” — they do not see the individual letters in the first place so they do not know.
No, the LLMs do not all tokenize the same way. Different tokenizers are (or at least were once) one of the major ways they differed from each other. A simple tokenizer might split words up into one token per syllable but I think they’ve gotten much more complicated than that, now.
A token is the word for the base unit of text that an LLM works with. It’s always been that way. The LLM does not directly work with characters; they are collected together into chunks less than a word and this stream of tokens is what the LLM is processing. This is also why the LLMs have such trouble with spelling questions like “how many Rs in raspberry?” — they do not see the individual letters in the first place so they do not know.
No, the LLMs do not all tokenize the same way. Different tokenizers are (or at least were once) one of the major ways they differed from each other. A simple tokenizer might split words up into one token per syllable but I think they’ve gotten much more complicated than that, now.
My understanding is very basic and out-of-date.