A 3060?
Exllama/TabbyAPI is still worth looking at if you are trying to run a model purely in GPU RAM. It’s easily the most VRAM-efficient backend; it just doesn’t support CPU offloading (which is useful for MoEs if you have considerable spare CPU RAM), and it’s more optimized for 4xxx and up Nvidia cards.
And TabbyAPI has a docker container you can use. Look for “exl3” models on huggingface.
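
If you want a quick sanity check on whether a given quant will fit purely in VRAM, here is a rough back-of-the-envelope sketch. The function names are made up for illustration, the 12 GiB default assumes the common 12 GB variant of the 3060, and the 2 GiB overhead for cache and activations is just a placeholder guess, not a measured number. It only counts the weights at a given bits-per-weight.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


def fits_in_vram(
    params_billion: float,
    bits_per_weight: float,
    vram_gib: float = 12.0,      # assumption: 12 GB RTX 3060
    overhead_gib: float = 2.0,   # assumption: placeholder for cache/activations
) -> bool:
    """True if weights plus the assumed overhead fit in the given VRAM."""
    return weights_gib(params_billion, bits_per_weight) + overhead_gib <= vram_gib


if __name__ == "__main__":
    # Hypothetical model sizes at 4.0 bits per weight.
    for size in (8, 14, 32):
        print(size, round(weights_gib(size, 4.0), 2), fits_in_vram(size, 4.0))
```

Real usage depends heavily on context length and cache settings, so treat the result as a first filter when browsing quants, not a guarantee.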





It…
Well, it’s not just about “too much ruthless efficiency.”
Zuckerberg fired a lot of brainpower in their ML divisions. He torched their landmark in the “AI” landscape, THE thing that literally set the standard for open weights LLMs (Llama), and hired a bunch of Tech Bro narcissists in their place, because he doesn’t have the first clue about how LLMs work, nor how important Llama and its wake were.
They were on track to be at the center of open weights LLMs, and Zuck, and only Zuck, blew it up because he is an ignorant coward. He ran at the first bump, like he does from every bandwagon he jumps on.
Meta still develops PyTorch openly, but we’ll see how long that lasts.
Aside: I’m sick of everyone pretending like these ultra-wealthy industry runners are somehow geniuses. Like Bezos swallowing the “space datacenter” scam whole, Google’s leadership consciously sabotaging their core product, or gestures at Elon Musk’s Twitter page.
They’re making terrible choices. They clearly do not have the technical background to make them. Their decisions aren’t “hard” like the headlines make them out to be; they’re objectively, obviously irrational, even judged purely by their own self-interest.