  • Sometimes. As a tool, not as an outsourced human, an oracle, or the transcendent companion that con artists like Altman are trying to sell.

    See how grounded this interview is, from a company whose model was trained on peanuts compared to ChatGPT, and takes even less to run:

    …In 2025, with the launch of Manus and Claude Code, we realized that coding and agentic functions are more useful. They contribute more economically and significantly improve people’s efficiency. We are no longer putting simple chat at the top of our priorities. Instead, we are exploring more on the coding side and the agent side. We observe the trend and do many experiments on it.

    https://www.chinatalk.media/p/the-zai-playbook

    They talk about how the next release will be very small/lightweight and more task-focused, and how important gaining efficiency through architecture (rather than scaling up) is now. They even touch on how their own models are starting to be useful utilities in their workflows, specifically not miraculous worker replacements.

  • vLLM is a bit better with parallelization. All the KV cache sits in a single “pool”, and it uses as many slots as will fit. If it gets a bunch of short requests, it does many in parallel. If it gets a long-context request, it kinda just does that one.

    You still have to specify a maximum context though, and it is best to set that as low as possible.

    …The catch is that it’s quite VRAM-inefficient. But it can split over multiple cards reasonably well, better than llama.cpp can, depending on your PCIe speeds.
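    Roughly what that looks like with vLLM’s Python API; the model name and numbers below are placeholders, not tuned for MI50s:

    ```python
    # Rough vLLM sketch: one shared KV-cache pool, continuous batching across requests.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
        max_model_len=8192,                # keep the max context as low as you can get away with
        tensor_parallel_size=2,            # split across two cards; needs decent PCIe speeds
        gpu_memory_utilization=0.90,       # fraction of VRAM the weights + KV pool may claim
    )

    sampling = SamplingParams(max_tokens=256, temperature=0.7)

    # A pile of short prompts gets batched together; one long-context prompt
    # will mostly have the pool to itself.
    prompts = ["Summarize this paragraph: ...", "Translate this sentence to French: ..."]
    for out in llm.generate(prompts, sampling):
        print(out.outputs[0].text)
    ```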

    You might try TabbyAPI with exl2 quants as well. It’s very good with parallel calls, though I’m not sure how well it supports MI50s.


    Another thing to tweak is batch size. If you are actually making a bunch of 47K-context calls, you can increase the prompt-processing batch size a ton to load the MI50 better and get it to process the prompt faster.
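    In vLLM terms that knob is roughly max_num_batched_tokens (together with chunked prefill); a sketch with made-up numbers:

    ```python
    # Sketch: raise the prefill batch so a ~47K-token prompt keeps the GPU loaded
    # instead of trickling through in small chunks. Numbers are illustrative, not tuned.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",  # placeholder model
        max_model_len=49152,               # room for the long prompts
        enable_chunked_prefill=True,       # prefill in chunks so decode isn't starved
        max_num_batched_tokens=8192,       # tokens processed per step; raise to load the GPU harder
    )

    long_prompt = "<your 47K-token document here>"
    print(llm.generate([long_prompt], SamplingParams(max_tokens=128))[0].outputs[0].text)
    ```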


    EDIT: Also, now that I think about it, I’m pretty sure ollama is really dumb with parallelization. Does it even support paged attention batching?

    The llama.cpp server should be much better, e.g. it uses less VRAM for each of the “slots” it can utilize.
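    For reference, hitting the server with concurrent requests is all it takes to keep those slots busy; a quick sketch against llama-server’s OpenAI-compatible endpoint (default port, placeholder questions):

    ```python
    # Fire a few requests at a running llama-server in parallel; it spreads them
    # across its slots. Assumes the default OpenAI-compatible endpoint on port 8080.
    import concurrent.futures
    import requests

    URL = "http://localhost:8080/v1/chat/completions"

    def ask(question: str) -> str:
        # The "model" field is omitted; llama-server answers with whatever it has loaded.
        resp = requests.post(URL, json={
            "messages": [{"role": "user", "content": question}],
            "max_tokens": 128,
        }, timeout=300)
        resp.raise_for_status()
        return resp.json()["choices"][0]["message"]["content"]

    questions = [
        "What is paged attention?",
        "Explain the KV cache in one sentence.",
        "Why does long context eat VRAM?",
    ]
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(questions)) as pool:
        for answer in pool.map(ask, questions):
            print(answer)
    ```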

  • Lulz.

    It’s an interesting coding exercise, though. Trying to (for example) OCR all the documents, or generate a relations graph between the documents or concepts, is a great intro to language modeling (which is not prompt engineering, as most seem to think).

    If you’re a reporter or something, it’s also the obvious way to comb through the documents looking for clues that could actually make headlines. I dunno what techniques they use at big outlets, though.
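    A toy version of the relations-graph idea, assuming the documents are already OCR’d into text files; scikit-learn and networkx here are just convenient stand-ins, not whatever newsrooms actually use:

    ```python
    # Link documents whose TF-IDF vectors look similar, then see which ones are
    # most connected. Assumes the PDFs were already OCR'd into ocr_output/*.txt.
    from pathlib import Path

    import networkx as nx
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    docs = {p.name: p.read_text(errors="ignore") for p in Path("ocr_output").glob("*.txt")}
    names = list(docs)

    vectors = TfidfVectorizer(stop_words="english", max_features=50_000).fit_transform(list(docs.values()))
    sims = cosine_similarity(vectors)

    graph = nx.Graph()
    graph.add_nodes_from(names)
    for i, a in enumerate(names):
        for j in range(i + 1, len(names)):
            if sims[i, j] > 0.3:  # arbitrary similarity threshold; tune it
                graph.add_edge(a, names[j], weight=float(sims[i, j]))

    # The most-connected documents are a decent place to start reading.
    print(sorted(graph.degree, key=lambda kv: kv[1], reverse=True)[:10])
    ```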

  • Meme finetunes are nothing new.

    As an example, there are DPO datasets with positive/negative example pairs intended to train LLMs to respond politely and helpfully rather than like the negative example. Some of them use toxic comments plucked from the web as the negatives.

    And the immediate community thought was “…What if I reversed them?”
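    Mechanically it’s a one-column swap, since a DPO pair is just (prompt, chosen, rejected). The dataset name below is made up:

    ```python
    # Swap the preferred/dispreferred responses in a DPO-style dataset.
    # The dataset name is a placeholder; real sets use the same chosen/rejected layout.
    from datasets import load_dataset

    ds = load_dataset("some-org/politeness-dpo", split="train")  # hypothetical dataset

    def reverse_preference(example):
        # What was "chosen" becomes "rejected", and vice versa.
        return {"chosen": example["rejected"], "rejected": example["chosen"]}

    reversed_ds = ds.map(reverse_preference)
    ```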