Do you host your own AI?

SuspiciousCarrot78@aussie.zone · 1 month ago

Do you host your own AI?

algernon@lemmy.ml · 1 month ago

Yes. My Actual Intelligence lives in my head, and runs mostly on coffee.

portifornia@piefed.social · 1 month ago

Just coffee?!? That’s cool.

Mine runs on:

coffee
spite
tortilla chips
& shame

searabbit@piefed.social · 1 month ago

If that’s not already on a shirt it should be

algernon@lemmy.ml · 1 month ago

Mostly on coffee, not exclusively. Noticable amounts of spite & tortilla chips are also present, yes, but… no shame.

portifornia@piefed.social · 1 month ago

Nice!

Diurnambule@jlai.lu · 1 month ago

I replace tortilla by “raclette” but that cultural.

tal@lemmy.today · 1 month ago

Do you get many hallucinations?

algernon@lemmy.ml · 1 month ago

Only when I’m deprived of coffee.

boonhet@sopuli.xyz · 1 month ago

Would flowers work instead?

algernon@lemmy.ml · 1 month ago

No. I’m not dead yet.

SuspiciousCarrot78@aussie.zone · edit-2 1 month ago

I’ll make sure to send you flowers, Algernon lol

curbstickle_lw@lemmy.world · 1 month ago

@[email protected] this comment is not (directly) for you, I just want it in context.

Before you report someone for breaking rule 1, please look a the context. Specifically, the username someone may be replying to.

SuspiciousCarrot78@aussie.zone · edit-2 1 month ago

LOL.

https://en.wikipedia.org/wiki/Flowers_for_Algernon

Looks like someone got big mad over a harmless, good natured and on topic joke. You love to see it.

Sorry they wasted your time.

curbstickle_lw@lemmy.world · 1 month ago

Eh, its fine. Certainly better than the “I don’t like this so I’m going to report it” approach.

GreenCrunch@piefed.blahaj.zone · 1 month ago

critical security bug: if coffee is taken away my head hurts :(

zitrone 🍋@europe.pub · 1 month ago

As we know AI stands for “An Indian”, so if you’re not from India, its actually impossible to self host.

Well, unless you manage to trap one in your basement, but that would violate human rights and hopefully also break the laws of your country.

SuspiciousCarrot78@aussie.zone · edit-2 1 month ago

You may be confusing Indians with gremlins (AGI). Which might explain ChatGPTs obsession with gremlins

ButteredBread@sh.itjust.works · edit-2 1 day ago

deleted by creator

thenextguy@lemmy.world · 1 month ago

With sufficient coffee, mine shows considerable artifice.

SuspiciousCarrot78@aussie.zone · 1 month ago

Plastic flowers.

brucethemoose@lemmy.world · edit-2 1 month ago

An aside for anyone reading this:

https://sleepingrobots.com/dreams/stop-using-ollama/

And that barely scratches the surface. Please.

Use anything but Ollama. Even APIs.

SuspiciousCarrot78@aussie.zone · 1 month ago

Llama.cpp or death!

tristynalxander@mander.xyz · edit-2 8 days ago

deleted by creator

BlackLaZoR@lemmy.world · edit-2 1 month ago

I use LMStudio, because it has quality of life improvements like nice GUI and huggingface search engine. Also they have Vulkan backend that at least on 7900XTX is ~10% faster than rocm (on LLama 3 8b Q4_0 it gets 115Tokens/s vs 105 on rocm)

brucethemoose@lemmy.world · edit-2 1 month ago

Or exllama! Vllm, sglang, Lorax. Koboldcpp, Aphrodite, text-generation-webui, LM Studio, powerinfer, ktransformers, mlc-LLM, really whatever floats your boat. Just not ollama, specifically.

plasma8726@lemmy.today · 1 month ago

Thanks for this link. Because of this article, I had claude stand up a llama.cpp container next to my already running ollama container. It ran side by side tests with the same model and parameters, and the results blew ollama out of the water. I’m in the process of moving hermes and openwebgui over to the llama.cpp instance to see how it goes day to day.

brucethemoose@lemmy.world · edit-2 1 month ago

If you’re using docker anyway, and “fast” pure GPU models, you might try a vllm container while you’re at it.

It should be much faster than even llama.cpp, albeit at the cost of context length, and it supports some exotic 4-bit quantization like SPQA.

Same with TabbyAPI. It’s quantization is SOTA, though it does not support CPU offloading, and it’s speed is somewhere between vllm and llama.cpp.

plasma8726@lemmy.today · 1 month ago

Thanks! I’ll look into this. I’m a bit limited at 12GB of VRAM right now.

brucethemoose@lemmy.world · edit-2 1 month ago

A 3060?

Exllama/TabbyAPI is still worth looking at if you are trying to run a model purely in GPU RAM. It’s easily the most VRAM efficient backend, it just doesn’t support CPU offloading (which is useful for MoEs if you have considerable spare CPU RAM) and more optimized for 4xxx and up Nvidia cards.

And TabbyAPI has a docker container you can use. Look for “exl3” models on huggingface.

pinball_wizard@lemmy.zip · 1 month ago

I agree that the concerns listed there are smells, and I wasn’t aware of some of the options listed there.

Thank you for sharing this!

vagabond@lemmy.dbzer0.com · 1 month ago

Didn’t know this. Going to switch this weekend, thanks for sharing this!

SchwertImStein@lemmy.dbzer0.com · 1 month ago

thank you

Kroko@feddit.online · 29 days ago

Thanks. Good to know

comrademiao@piefed.social · edit-2 1 month ago

looks like extreme nitpicking without any real issues beyond some VC funding a FOSS issues.

//whyre you spamming the comment to everyone? its quite alarmist actually

brucethemoose@lemmy.world · edit-2 1 month ago

I completely disagree.

Frankly, I find the description “VC funding a FOSS” offensive. They aren’t funding the engine. I’ve been messing with LLM inference engines since 2022, and Ollama is the worst I’ve seen in the community.

They misname models for SEO. They leech off llama.cpp while deliberately hiding attribution yet redirecting GH support requests there. They sometimes make their own GGUFs+forked releases which are broken and incompatibile with upstream llama.cpp, just so they can get a release out a day ahead for hype, even though it doesn’t really work and they’ll never upstream one line. They set a default context size thats basically unusable, they screw up chat templates and deep internal code with no obvious indicators, they release suboptimal quants without iMatrix, they gate you into their internal quantization repo and model card format, they hide model downloads on your hard drive, they mess with standard APIs for no good reason other than to mess up other backends. I could go on and on.

And if that’s all fine, they’re enshittifying the app with closed code, and pointers to cloud models.

They GIVE LLM inference a bad name, by making it a terrible quality engine that happens to show up in search as the “default.” Hence the comments below of people being unimpressed with local inference. And they sap attention from actual llama.cpp devs, without contributing a single dime. Everyone in the localllama communtity hates their guts, and that’s not even getting into the interpersonal drama they’ve stirred.

They are a leech that’s a net drag to the whole community, that we can’t get rid of because they’re attention grifters. And they’ve gotten worse and worse over time.

It’s more morale to use any cloud API over Ollama, in my eyes. They’re a grift.

EDIT: And, to be clear, I’m not against VC funded downstream stuff.

LM Studio is good! Even though it’s closed source.

Tons of downstream projects are great.

frongt@lemmy.zip · 1 month ago

Yes. Openwebui/ollama for LLM, comfyui for stable diffusion. I just dick around with it as a toy.

mesa@piefed.social · edit-2 1 month ago

Same. Its somewhat useful on some very small scripting or tasks…but its mostly just to try out a new model or two. Its not really useful for anything big.

I will have to say…even my tiny models are about as good as Chatgpt/Claude/etc… which makes me think about how much people are spending on tokens regularly. I was able to get the same kind of python script started with my local tiny model that was comparable to the newest Claude code offerings.

Lettuce eat lettuce@lemmy.ml · 1 month ago

What local models have you been using? And what hardware are you running them on? I’ve been playing with local LLMs a bit for exactly your use case.

I have zero interest in vibe coding or full agentic workflows. But having a local LLM generate a Bash script to help me automate parts of my home lab infrastructure would be nice.

Die4Ever@retrolemmy.com · 1 month ago

What are your hardware specs?

Lettuce eat lettuce@lemmy.ml · 1 month ago

Ryzen 7 5800 X3D Radeon RX 9070XT 32GB of DDR4 system memory.

OhVenus_Baby@lemmy.ml · 1 month ago

How hard does it push this setup? How far can you scale up your own models on this hardware?

Shimitar@downonthestreet.eu · 1 month ago

I was put off by ComfyUI, seems awfully complex. How is your experience?

Any suggestions to start? I have Fooocus installed now

Honse@lemmy.dbzer0.com · 1 month ago

It is difficult to understand in the beginning but has great support for premade workflows. It even saves the workflow into its output images so you can drag and drop them into the webui to duplicate the setup that generated the image. Use the internet to get premade workflows and mess around with them to see what the options do and you’ll slowly learn how it works. If you don’t care about precise control over the generations or understanding how image generators works then just use something else more all-in-one.

de_lancre@lemmy.world · edit-2 23 days ago

deleted by creator

D_Air1@lemmy.ml · 1 month ago

Yeah, I’m using qwen 31b a3b on an amd 9070xt requires a bit of cpu offloading, but still plenty fast. Using it wall llama.cpp. Combine that with some mcp’s such as ddg-search to make it truly useful by actually being able to search online.

I mostly use it for small tedious tasks with well defined inputs and outputs. For example when hyprland recently changed from their own configuration language to lua. At first I started going line by line translating my config to the new lua language until I realized oh wait this is exactly the type of thing that ML is useful for. Going from the well defined hyprland configuration language to their also well defined lua syntax. It banged it out in less than a minute with only a single mistake which I easily fixed. The mistake it made was that it forgot to translate the comments to lua. It did it in less than a minute and worked first try. Where as I had made several typos and gotten a few lines wrong when I was doing it by hand.

Not to say that I couldn’t do it. I would have gotten it done in about half an hour, but less than a minute is a lot faster.

I also used it to transform a bunch of unstructured data into json data, so that I could then use purpose built tools like jq to parse that. If I’m having trouble finding certain information. I’ll ask it to find me some resources to look at.

Basically small well defined tasks and parsing data is what I use it for and it seems to be pretty good at that.

What I don’t like is the way companies try to market it to people. I don’t believe people should be trying to summarize emails or messages from loved ones, writing essays or any other creative tasks for the most part. Translating is okay. I don’t expect a machine to be able to decide things for me or to be some filter between me and others.

slazer2au@lemmy.world · 1 month ago

Nope.

fluxx@mander.xyz · 1 month ago

I do, but I am becoming increasingly more disappointed as time goes on. Not just self hosted, llms in general. They sometimes help, but they mislead so many times and waste time that you don’t even notice. I think that’s the trap. When you succeed at a task, you become impressed but don’t notice how many times it failed doing a simple task. And as soon as you scratch the surface, you see how you would have done it differently and perhaps in a better way. Even just googling is bad. It does research for you, but it has no critical thinking and can’t decide what is better from the results it gets (other than google ranking) so it often leads you to think it did as good as you would, when it’s nowhere near as good. Every time I did the googling myself after it did, I did it much better. And I mean MUCH better. Ask it to find the app, it misses the most important ones, hallucinates a bunch, for ex. I found this to be the case with frontier models as well.

Self hosting has its benefits, but seeing how the ecosystem looks right now, concluding this is a huge bubble is inevitable. It reminds me of crypto so much. It looks rich and plentiful, but as soon as you dig a mm under the surface - nobody has tested it, it’s got a critical bug, it is overblown and there are issues with no response. No docs, no info, no nothing. For the biggest thing in technology in history, it is awfully hollow. I don’t mean it in a condescending way, in fact community is enthusiastic and very helpful, it’s just that it doesn’t live up to what most would expect.

A caveat I need to mention is I have not used it for coding - I have an irrational fear and resistance towards it, being a programmer. I just won’t touch it, even if it means the end of my career. I’m trying to be grown-up about it, but so far, I dont want to use it, for good and bad reasons.

Domi@lemmy.secnd.me · 1 month ago

Yes, I got a Strix Halo machine before the RAM price hike and use it to run all my ML stuff on it.

Currently using llama-swap with llama.cpp/ComfyUI and opencode/Open WebUI as frontend.

I’m running Qwen3.6-27b, Voxtral Mini 4b, Piper and Qwen Image. Also, some embedding and reranking models.

I use them for:

Tagging and classification of my documents in Paperless
Home Assistant (voice assistant)
Translations (both text and image)
Transcriptions
Some light coding and debugging
Avatar/Backdrop generation for DnD sessions

SuspiciousCarrot78@aussie.zone · 1 month ago

What sort of tok/s are you getting on the strix?

Domi@lemmy.secnd.me · 1 month ago

About 200 t/s prompt processing and 10-20 t/s with MTP.

Greatly depends on the task, predictable things like code generates at 18-20 t/s. Creative writing more like 10-17 t/s.

SuspiciousCarrot78@aussie.zone · 1 month ago

Damn - I thought strix would do a bit better than that, for how much it costs.

robber@lemmy.ml · 1 month ago

Given the 27b is a dense model, I think the numbers are quite ok. Curious about the quant tho.

The cool thing about the strix is its large unified memory, but it lacks memory bandwith for compute intensive workloads. Something like Qwen3.5-122b MoE with only like 12b active parameters might run at twice the speed if it fits the configuration.

Domi@lemmy.secnd.me · 1 month ago

Curious about the quant tho.

Q8 from unsloth.

Something like Qwen3.5-122b

My go to model for knowledge. Definitely much faster at Q5 but it lacks the tool calling quality of the Qwen3.6 models. Really hoping we see a Qwen3.6-122b soon…

robber@lemmy.ml · 1 month ago

In case you missed the Ornith 1.0 release (Qwen and Gemma RL finetunes for agentic / coding workloads), they look interesting to bridge the gap until we see larger 3.6 models or a 3.7 release. I didn’t test them yet but according to benchmarks, the 35b MoE seems to be more or less on par with Qwen3.6 27b dense, while ofc a lot faster.

SuspiciousCarrot78@aussie.zone · 1 month ago

Yeah. Though I think theres a new strix out soon (Medusa? Gorgon? Something like that).

Its a bit like my P40. On paper, it has 24GB. But that 24gb is capped at 400GB/s and the ai compute is what…Pascal era?

AI = Good, fast, cheap - pick 2

robber@lemmy.ml · 1 month ago

Well compared to the strix, 400GB/s is not that bad, I think with fast system RAM and expert offloading you could squeeze quite something out of it when running stuff in the 100b-a10b regions.

Your bigger problem is going to be future software support.

PetteriPano@lemmy.world · 1 month ago

Running qwen3.6 27b through llama.cpp.

It’s about as capable as sonnet 3.5.

I use it for light scripting, but real coding is done by cloud models.

I’m also using it as the brain for my Hermes agent. It sends me digests of news, subreddits, chats that I’d like to read but don’t have time for. It does a great job researching things on the web for me, too.

SuspiciousCarrot78@aussie.zone · 1 month ago

Do you mean Sonnet 4.5?

I don’t have the rig to run it at real speeds but I’ve played with it over API. Seems pretty good.

PetteriPano@lemmy.world · 1 month ago

No, it needs a lot more babysitting than 4.5 does. 3.5 was on the same level of mistakes, at least on the quants I have to use.

PapaSkwat@lemmy.wtf · edit-2 1 month ago

That’s a great model and it’s the one I use too.

Strider@lemmy.world · 1 month ago

No. I still have no use for it and everything I use is automated without at a far lower footprint.

atzanteol@sh.itjust.works · 1 month ago

I’ve tried a few times but with only 8gig of vram it’s simply not worth it.

Franconian_Nomad@feddit.org · 1 month ago

Have you tried qwen3.5-9b? It’s pretty solid for its size.

atzanteol@sh.itjust.works · 1 month ago

Yeah, it’s “good for its size” but it’s just too flaky for me to use for any significant coding.

Franconian_Nomad@feddit.org · 1 month ago

Yeah, I wouldn’t use it for coding. It’s a bit dumb unfortunately.

brucethemoose@lemmy.world · 1 month ago

How much CPU RAM do you have?

atzanteol@sh.itjust.works · 1 month ago

64G. But CPU inference is painfully slow.

brucethemoose@lemmy.world · edit-2 1 month ago

Not anymore. Not with hybrid offloading, where the GPU handles dense tensors and the CPU only runs the sparse MoEs. I’m running a 300B model on a single 3090, and its faster than I can read.

You just need to use the right framework, and the right model.

I’d suggest trying ik_llama.cpp and a MoE like one of these: https://huggingface.co/models?other=ik_llama.cpp&sort=modified&search=35B

And speculative decoding like DFlash or MTP (which you can also get specific models for).

EDIT: Wrong link.

atzanteol@sh.itjust.works · 1 month ago

I’ll check that out - speed isn’t my biggest issue so much as coding performance… The qwen 3.5 model I was using can write code, but it’s… Meh? Like sometimes it doesn’t even compile.

I did try tweaking llama.cpp to do some cpu offloading and it does seem to allow for much larger contexts at a modest performance loss. I’ll check out larger models.

brucethemoose@lemmy.world · edit-2 1 month ago

CPU offloading is too slow unless you use a hybrid MoE model, with the --n-cpu-moe parameter, specifically.

This only offloads “sparse” parts of the model to the CPU, which take up a lot of RAM but are very compute-lite to run. In practice, thats most of the size of modern MoE LLMs.

robber@lemmy.ml · 1 month ago

Since implementation of the --fit parameter and its relatives, and --fit on becoming the default, llama.cpp intelligently decides what to offload. For me, it made --n-cpu-moe obsolete.

brucethemoose@lemmy.world · edit-2 1 month ago

Mostly, yeah.

Sometimes it’s better to “cut it close,” with (for instance) a 27B model that’s nearly OOMing your VRAM fully offloaded, but you know will be fine in regular use without too many programs open.

In my case, with MiMo 2.5, it fills both my CPU and GPU RAM rather completely, so it’s best to set a static value so I don’t swap CPU RAM, and don’t OOM on the GPU either.

Terrasque@infosec.pub · 29 days ago

Try qwen3.6-35b-a3b with a lightweight harness like pi.dev

Having it be able to run commands and try to compile or run the code and see the output helps especially on the “doesn’t compile” part of things

atzanteol@sh.itjust.works · edit-2 28 days ago

Yeah - I’ve been playing around more with the Qwen3-Coder-30B-A3B-Instruct MoE model and it’s still quite… Meh. I’ve been using llama.cpp and I’ve tried a bunch of tuning. It works and performs well enough (15t/s) but the output is just garbage. I can do some simple coding but I’m finding I’m fighting with it more than if I just wrote the code myself. Maybe I just have standards that are too high. Claude Opus 3.7 is just in an entirely different league…

Terrasque@infosec.pub · 27 days ago

When you run it, do you use unsloth’s recommended settings for coding?

https://unsloth.ai/docs/models/qwen3.6

Also have preserve thinking on, it helps it stay consistent in multi turn work.

Which model version you’re using can also affect results, usually unsloth’s ones are good.

With all that said, it’s of course a small model so it’s not a super coder. The 27b is better (I’d guess 25-35% better), but of course still a small model so…

So it’ll maybe not be good enough still, but should give it the chance to let it do the best it can :)

Nednarb44@lemmy.world · 1 month ago

I do, I use ollama. I mostly just tinker, but I use with with home assistant for a quasi Alexa like experience with the voice assistant, I use it for summarizing some YouTube transcripts in too lazy to read/watch, and I’ve tried to see how capable it is with coding.

diminou@lemmy.zip · 1 month ago

Can you elaborate on what you are using exactly with home assistant ? And is English your primary language in that context ?

Trying to do something similar, English not primary and its a bit… Harder than it seems. Can’t figure out if it is because I’m not using English or something else. (3060 12GB BTW)

Nednarb44@lemmy.world · 1 month ago

English is my primary, so that does make it easier. I use it for general conversion things, like asking it questions about the Titanic or making up a new story or something. It doesn’t work as well as I’d like yet, but like I said, it’s just an other thing for me to mess around with and change.

Steve@startrek.website · 1 month ago

I recently gave it a try with qwen3.5 and deepseek coder v2. I have a RTX3090 and these are the largest models that can run comfortably on it.

Conclusion, they are both fucking useless. Free tier claude runs circles.

e0qdk@reddthat.com · 1 month ago

If you just pulled the default version of qwen3.5 from ollama’s repo you downloaded a mediocre one that only uses ~6GB.

Check ollama show qwen3.5 and see if you get something like this in the result:

  Model
    architecture        qwen35    
    parameters          9.7B      
    context length      262144    
    embedding length    4096      
    quantization        Q4_K_M

This is the default version I got when I first tried using ollama without any experience. It worked, but it’s a heavily quantized, lower parameter version of the model – i.e. it’s pretty dumb – compared to what you can actually run on your hardware.

Steve@startrek.website · 1 month ago

I will check it later. I loaded whichever one cluade suggested lol

SuspiciousCarrot78@aussie.zone · 1 month ago

Yeah :(

Were not there yet on consumer rigs.

brucethemoose@lemmy.world · 1 month ago

Did you serve them with ollama?

It’s basically broken, if you did. Try the same models over API, and you’ll see what I mean.

Steve@startrek.website · 1 month ago

Is there an alternative to ollama? The point was to run something locally.

brucethemoose@lemmy.world · edit-2 1 month ago

https://sleepingrobots.com/dreams/stop-using-ollama/

And that’s not even all of it. Basically they break models in many ways, and they’re slimey Tech Bros.

LM Studio is better, and easy.

If you’re on Nvidia, and want to run optimally, I would use the ik_llama.cpp fork. On AMD, regular llama.cpp. On a Mac, use an MLX runner (Like LM Studio) with an MLX quant (ideally an MLX-DWQ quant).

It’s all pretty technical, and… thats kinda the point. LLMs are just too performance sensitive and too finicky to not have a grasp of how they work. There is no “easy button” to run them without bad results, there can’t be.

But if you don’t have time for that and just want to see if it’s worth it, I’d suggest self hosing your own UI, and trying the dirt cheap APIs of models you can theoretically run on your setup. This will give you a “best case” taste of what they’re capable of.

brucethemoose@lemmy.world · edit-2 1 month ago

Oh, and I just saw you have a 3090.

To get more specific, you can actually run way better models than Qwen 3.5 and Deepseek coder (both of which are very obsolete now). The best that’s practical depends on how much CPU RAM you have, but at the minimum you can do Qwen 3.6 27B, with a more optimal quant like ones here: https://huggingface.co/ubergarm/Qwen3.6-27B-GGUF/tree/main

Or Gemma 31B QAT: https://huggingface.co/unsloth/gemma-4-31B-it-qat-GGUF

If you have 128GB CPU RAM, I can upload my custom MiMo 2.5 quant. That should “beat” the cheapest Claude, give or take.

If you have 64GB, I’d suggest a quantization of Step 3.7.

If you have 32GB or 48, I’m not sure. I’d need to look if any “small” MoE is actually better than Qwen 27B now.

Schiffsmädchenjunge@sh.itjust.works · 1 month ago

I’ve thought about it, but I actually could never think of anything I would do with it.

Meatwagon@lemmy.dbzer0.com · 1 month ago

I tried but I only have 16g of ram and it wouldn’t complete a thought alas

Alexander@sopuli.xyz · 1 month ago

Technically, TTS/STT are mostly MLs; I’m pretty sure many people run these. I have a setup but I’m better with buttons that with spoken words, and I listen to ambient sounds or music. I think some day I’ll make voice assistant for talking to while driving, but that’s not a trivial task hardware-wise, even if I used cloud LLM layer, which I won’t. Putting AI on baremetal sounds like an interesting project.

I have a homemade “local agent” that can actually “code” somewhat, I use it just to figure out how this thing works on the inside practically. Mostly useless otherwise (also I have GPU that’s older than AI, so it’s kind of fun technical task to run this stuff on pure RAM+swap). Feels like the whole hype is greatly overrated, but I appreciate a chance to learn something new anyway.