• 4 Posts
  • 1.9K Comments
Joined 2 years ago
cake
Cake day: March 22nd, 2024

help-circle

  • brucethemoose@lemmy.worldtoSelfhosted@lemmy.worldSelfhosted & AI
    link
    fedilink
    English
    arrow-up
    1
    ·
    edit-2
    40 minutes ago

    I have a single 3090!

    That’s the dream GPU, these days.

    And I have 128GB CPU RAM. So the best model I can run is MiMo 2.5 (a 300B model) at around 10 tokens/sec, using hybrid CPU inference.

    …But that’s the worst-case scenario, for speed. It’s an IQ3_KT quant (a high quality “trellis” quantization type, but very slow on CPU), with a gigantic model that barely fits in my RAM+VRAM combined, with no DFlash or any kind of speculative decoding turned on. I could tune it to be much faster, but I mostly just want “max quality, fast enough to read as it streams, barely fits in memory” for this model.

    For speed, or prompts with lots of thinking or context (like agenic use), I just run Qwen 3.6 27B now. That would fit in your 3090 no matter how much CPU RAM you have, but you have to be smart about the backend and quantization you pick. If you just use Ollama, it’s gonna tell you it won’t fit, or use some horrible default that spits out garbage.


    …This is what I meant to emphasize.

    It’s not just the hardware. You kinda have to be part developer, part enthusiast to even follow this stuff, it up optimally, and keep it up-to-date. If you just try to Google “best LLM for 3090,” you will get absolute garbage.


  • brucethemoose@lemmy.worldtoSelfhosted@lemmy.worldSelfhosted & AI
    link
    fedilink
    English
    arrow-up
    2
    ·
    edit-2
    4 hours ago

    You don’t even need Claude anymore. GLM 5.2 API is good enough for 95% of the same things and vastly cheaper.

    MiMo 2.5 Pro and Kimi are also very good. And then there’s Cerebras API if you just want simple things done quick.

    The thing with self hosting, while awesome, is that it requires a lot of hardware and considerable time investment for what’s essentially a “base tier model,” or at best one step down for what’s still a very cheap API. I still love it, especially the privacy and control aspect, but you aren’t running Claude at home unless you’ve got a threadripper or server hardware collecting dust.

    …Hence I can understand why people don’t pursue it. Especially since a cursory Google search will lead you to trying the Deepseek distillation on Ollama (which is awful).



  • brucethemoose@lemmy.worldtoSelfhosted@lemmy.worldSelfhosted & AI
    link
    fedilink
    English
    arrow-up
    3
    ·
    edit-2
    1 day ago

    Oh, both! Yeah. I didn’t even think of that, but [AIT]/[AIP] as separate tags makes a lot of sense.

    I’d like being able to filter by either, actually.

    I guess two tags runs the risk of “rules too complex for some to follow,” but that’s more of a moderation load question. I have no say in that, heh.



  • brucethemoose@lemmy.worldtoSelfhosted@lemmy.worldSelfhosted & AI
    link
    fedilink
    English
    arrow-up
    2
    ·
    edit-2
    1 day ago

    For what it’s worth, I asked my self-hosted LLM (MiMo 2.5, no network access outside my desktop), and it came with [AIT] (AI-Topic).

    …I think that’s my favorite so far. [AIP] would work too.

    I feel like that “obfuscates” the tag enough to blunt impulse downvotes in /new and feeds, without being deceptive or anything.


  • brucethemoose@lemmy.worldtoSelfhosted@lemmy.worldSelfhosted & AI
    link
    fedilink
    English
    arrow-up
    9
    ·
    edit-2
    1 day ago

    I’m not consistent about it yet, but because of exactly this, I’m trying to differentiate the two when I talk.

    Responsible automation? I use ML or machine learning.

    The grift consuming the world? A Tech Bro? “AI”

    I think one of the saddest things is the conflation between the two, like you can’t even talk about one without invoking the other. Or it opening up that whole ethical debate, when you’re just talking about, like, a 100M transcription model trained by one research in some university on a potato.



  • brucethemoose@lemmy.worldtoSelfhosted@lemmy.worldSelfhosted & AI
    link
    fedilink
    English
    arrow-up
    7
    ·
    edit-2
    1 day ago

    TBD indeed. But it will effectively ‘downrank’ posts and their visibility, maybe into the negative vote range. I’ve seen highly negative scores across the board in more machine-learning focused subs, and that’s without a tag that catches the eye so easily.

    I think even modifying the acronym could make a difference, though (as I ninja edited).


  • brucethemoose@lemmy.worldtoSelfhosted@lemmy.worldSelfhosted & AI
    link
    fedilink
    English
    arrow-up
    7
    ·
    edit-2
    1 day ago

    Also:

    Anything with an [AI] tag, first thing in the title, will have a drive-by downvote issue.

    Not sure how to deal with that, or if its even a concern.


    EDIT:

    Maybe it should be something else that’s not such a loaded keyword?

    [ML] for Machine Learning? [SAI]? [LAI]?

    I’ve been messing with ‘AI’ for a decade, and even I hate what the term has come to represent.



  • brucethemoose@lemmy.worldtoSelfhosted@lemmy.worldSelfhosted & AI
    link
    fedilink
    English
    arrow-up
    69
    ·
    edit-2
    1 day ago

    +1

    Home-AI oriented channels like Reddit’s localllama are filled with self promotion garbage, and more will trickle here over time… I’m not even against self promo or heavy coding assistance, but 9-times-out-of-10, the linked repo is nonsense, or straight-up fraudulent. And being obviously vibe-coded is a common tell.

    Good to get ahead of this.

    Also, +1 on supressing driveby insults. If the post is tagged up front, there’s no need. That being said, it should be okay for users to call out an obvious grift, or a “nonsense repo” that’s actually pure slop.




  • And issue is it needs to be a specific platform.

    From a game developer’s perspective (who isn’t a pro linux dev or anything), they can support a platform. They support Windows 10. Or Windows 11. They can support stock Ubuntu. They can support a SteamOS image.

    They cannot specifically support your personalized Arch config.

    Linux’s fragmentation has always been an issue in this regard, as they can’t legally support thousands of different possible system configurations.


    HOWEVER,

    I think supporting Proton + SteamOS would be very reasonable for a dev. That is a specific platform, its codebase and infrastructure can stay unified with the Windows version, and support for that would practically mean support in other Linux distros.

    And SteamOS by itself is getting big.


  • It…

    Well, it’s not just about “too much ruthless efficiency.”

    Zuckerburg fired a lot of brainpower in their ML divisions. He torched their landmark in the “AI” landscape, THE thing that literally set the standard for open weights LLMs (Llama), and hired a bunch of Tech Bro narcissists in their place, because he’s doesn’t have the first clue about how LLMs work, nor how important Llama and its wake is.

    They were on track to be at the center of open weights LLMs, setting the standard for the whole world, but Zuck, and only Zuck, blew it up because he is an ignorant coward. He ran, like he does from every bandwagon he jumps on at the first bump.

    Meta still develops PyTorch and a few non-text models openly, but we’ll see how long that lasts.


    Aside: I’m sick of everyone pretending like these ultra-wealthy industry runners are somehow geniuses. Like Bezos swallowing the “space datacenter” scam whole, Google’s leadership consciously sabotaging their core product, or gestures at Elon Musk’s Twitter page.

    They’re making terrible choices. They clearly do not have the scientific/technical background to make them. Their decisions aren’t “hard” like the headlines make it out to be; even for their own pure self interest, they’re irrational.


  • A 3060?

    Exllama/TabbyAPI is still worth looking at if you are trying to run a model purely in GPU RAM. It’s easily the most VRAM efficient backend, it just doesn’t support CPU offloading (which is useful for MoEs if you have considerable spare CPU RAM) and more optimized for 4xxx and up Nvidia cards.

    And TabbyAPI has a docker container you can use. Look for “exl3” models on huggingface.


  • If you’re using docker anyway, and “fast” pure GPU models, you might try a vllm container while you’re at it.

    It should be much faster than even llama.cpp, albeit at the cost of context length, and it supports some exotic 4-bit quantization like SPQA.

    Same with TabbyAPI. It’s quantization is SOTA, though it does not support CPU offloading, and it’s speed is somewhere between vllm and llama.cpp.


  • Mostly, yeah.

    Sometimes it’s better to “cut it close,” with (for instance) a 27B model that’s nearly OOMing your VRAM fully offloaded, but you know will be fine in regular use without too many programs open.

    In my case, with MiMo 2.5, it fills both my CPU and GPU RAM rather completely, so it’s best to set a static value so I don’t swap CPU RAM, and don’t OOM on the GPU either.