AI costs spike as subscriptions hit pricing wall — firms turn towards Chinese LLMs, open-source models to extend budget

sanitation@lemmy.today · 1 month ago

setsubyou@lemmy.world · 1 month ago

Nowadays agents like Claude Code can run autonomously for hours just given a goal description. It doesn’t take a lot of human effort at all to set up a bunch of sessions, and these companies don’t limit how many instances you run in parallel. Agents can also spawn sub-agents that run in parallel if a task calls for parallelization. Whether all this produces good results is a different story, especially if you don’t put enough effort into the goal description. But burning tokens as such is not difficult.

Even workflows where you’re just chatting with an agent can burn a lot of tokens. When you’re chatting with an LLM, the entire history becomes part of the input each time you send something. This also applies to tool calls: if the agent decides to read 20 files before it can work on your request, a file gets added to the history 20 times, and each time the entire growing history is sent back as input to drive the agent’s next step.
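To put rough numbers on that, here is a minimal sketch of the arithmetic. The function name and the token counts (a 500-token starting prompt, 2,000 tokens per file) are made-up values for illustration, not measurements from any real agent. It also assumes one file read per step, and it ignores output tokens and any prompt caching a provider may offer, which can lower the actual bill.

```python
def cumulative_input_tokens(initial_tokens: int, tokens_per_file: int, num_files: int) -> int:
    """Total input tokens sent across all steps when every step resends the full history.

    Hypothetical model: each step sends the whole history, then one file's
    contents are appended to it as the tool result.
    """
    history = initial_tokens
    total = 0
    for _ in range(num_files):
        total += history            # the whole history goes in as input for this step
        history += tokens_per_file  # the file that was read is appended to the history
    total += history                # one more request after the last file has been read
    return total


if __name__ == "__main__":
    # Illustrative numbers only: 500-token prompt, 20 files of 2,000 tokens each.
    total = cumulative_input_tokens(500, 2_000, 20)
    final_history = 500 + 20 * 2_000
    print(total)          # 430500 input tokens sent in total
    print(final_history)  # 40500 tokens of actual content in the history
```

Under these assumptions, about 40k tokens of actual content turn into about 430k input tokens, because the total grows roughly with the square of the number of steps rather than linearly.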

Coding is more affected by this than many other applications. Even a new conversation tends to start with the agent gathering a bunch of source code files, and the response to a task is not a single block of text but a sequence of tool calls: make edits across files, build, run tests, react to test failures, and so on. To the human it’s one prompt, but in reality it’s a back-and-forth between the LLM and the harness with a quickly growing history.