Cloudflare has announced a new feature that automatically serves websites in Markdown to AI agents. With Markdown for Agents, the company is responding to the growing share of web traffic that comes from AI crawlers and agents: plain Markdown carries the same content as a full HTML page in far fewer tokens, which makes it cheaper and faster for LLM-based agents to consume.
A few things come to mind
A token is basically a linguistic unit: a word, a piece of a word, or a short phrase.
LLMs don’t parse text word-by-word because that would miss a lot of idiomatic meaning and other context. “Dave shot a hole in one at the golf course” might, in principle, be parsed as “{Dave} {shot} {a hole in one} {at the golf course}”.
They use NLP techniques to “tokenize” text, meaning they parse it into individual tokens, so depending on the tokenizer I suppose there could be slight variations in how a given text gets tokenized.
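To make that concrete, here is a minimal sketch using OpenAI’s open-source tiktoken library. cl100k_base is just one published encoding; other tokenizers will split the same text differently:

```python
# Minimal tokenization sketch (pip install tiktoken).
import tiktoken

# cl100k_base is one of OpenAI's published encodings; other models
# and other vendors use different vocabularies and split differently.
enc = tiktoken.get_encoding("cl100k_base")

text = "Dave shot a hole in one at the golf course"
ids = enc.encode(text)                   # list of integer token IDs
pieces = [enc.decode([i]) for i in ids]  # the text chunk each ID maps back to
print(ids)
print(pieces)
```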
Then the LLM runs the token stream through layers of attention heads (matrix operations over the tokens’ vector representations) in order to assess the probabilistic relationships between the tokens, and uses that process to generate a response via next-token prediction.
It’s a bit more complex than that, of course: tensor calculus, billions of weighted parameters, stacks of layers each with their own hidden dimensions, plus matmuls, masks, softmax, and dropout. There’s also the “context window”, which is how many tokens the model can process at a time. But that’s the gist of it.
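For anyone curious what the matmul/mask/softmax part looks like, here’s a toy, self-contained sketch of scaled dot-product attention with a causal mask. Random stand-in weights and made-up sizes, not a real model:

```python
# Toy scaled dot-product attention with a causal mask (numpy only).
# A sketch of the core computation: real LLMs use learned weights,
# many heads per layer, and dozens of stacked layers.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d = 5, 8                        # 5 tokens, 8-dim vectors (made-up sizes)
X = rng.normal(size=(seq_len, d))        # stand-in token embeddings

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # stand-in learned weights
Q, K, V = X @ Wq, X @ Wk, X @ Wv         # queries, keys, values

scores = Q @ K.T / np.sqrt(d)            # pairwise token-to-token affinities
mask = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[mask] = -np.inf                   # causal mask: a token can't attend to the future
attn = softmax(scores)                   # each row sums to 1: attention weights
out = attn @ V                           # each token gets a weighted mix of values
print(out.shape)                         # (5, 8): one updated vector per token
```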
But a token is just the basic unit that gets run through those processes.
Here’s an OpenAI page that allows you to enter text and see how it gets tokenized:
https://platform.openai.com/tokenizer
A token is the name for the base unit of text that an LLM works with; it has always been that way. The LLM does not work with characters directly: they are grouped into chunks, often smaller than a whole word, and this stream of tokens is what the LLM actually processes. This is also why LLMs have such trouble with spelling questions like “how many Rs in raspberry?”: they never see the individual letters in the first place, so they do not know.
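You can watch this happen with a tokenizer library. A minimal sketch, assuming OpenAI’s open-source tiktoken package and its cl100k_base encoding (other encodings will split differently):

```python
# Sketch: what "raspberry" looks like to a model (tiktoken, cl100k_base).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
for word in ["raspberry", " raspberry"]:  # a leading space changes the split
    ids = enc.encode(word)
    print(repr(word), "->", [enc.decode([i]) for i in ids])
# The model receives chunk IDs, not the letters r-a-s-p-b-e-r-r-y,
# so counting Rs means reasoning about spelling it never directly sees.
```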
No, LLMs do not all tokenize the same way. Different tokenizers are (or at least once were) one of the major ways they differed from each other. A simple tokenizer might split words into one token per syllable, but I believe modern ones use statistical subword schemes like byte-pair encoding (BPE) instead; there’s a quick comparison sketched below.
My understanding is very basic and out-of-date.
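For what it’s worth, the differences are easy to see side by side. A sketch using two encodings that ship with tiktoken; the exact splits and token counts depend on the encoding:

```python
# Sketch: the same sentence under two different encodings that tiktoken ships.
import tiktoken

text = "Dave shot a hole in one at the golf course"
for name in ["gpt2", "cl100k_base"]:  # an older vs. a newer OpenAI encoding
    enc = tiktoken.get_encoding(name)
    ids = enc.encode(text)
    print(f"{name}: {len(ids)} tokens -> {[enc.decode([i]) for i in ids]}")
```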