Could LowEnd LLM hosting be a thing?
If you are into AI and LLMs, you have certainly heard of the recent launch of Mixtral, an openly available LLM that is a serious competitor to GPT-4.
Anyone could download and self-host it, if only it didn't need far more GPU RAM than any affordable consumer card offers.
Hosting it on a rented GPU would also make no sense if you only need it a couple of times per day. It would be like renting a whole server when you only need a LAMP stack for a small blog.
And the upcoming usage-based providers (price per x consumed tokens) will get expensive if you use it a lot.
So could there be something similar to shared hosting for open-source LLMs? E.g. a server with a high-RAM GPU running one particular model, shared by 50+ users each paying a flat 7€ per month.
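To make the flat-fee versus per-token comparison concrete, here is a rough back-of-the-envelope sketch. The 50 users and 7€ come from the post; the per-token price is a made-up placeholder, not a quote from any real provider.

```python
def monthly_revenue(users: int, flat_fee_eur: float) -> float:
    """Total monthly income of one shared server."""
    return users * flat_fee_eur


def breakeven_tokens(flat_fee_eur: float, price_per_million_eur: float) -> float:
    """Tokens per month at which per-token billing costs as much as the flat fee."""
    return flat_fee_eur / price_per_million_eur * 1_000_000


USERS = 50                    # from the post
FLAT_FEE_EUR = 7.0            # from the post
PRICE_PER_MILLION_EUR = 0.50  # hypothetical per-token price, assumption only

print("Revenue per server per month (EUR):", monthly_revenue(USERS, FLAT_FEE_EUR))
print("Break-even tokens per user per month:",
      breakeven_tokens(FLAT_FEE_EUR, PRICE_PER_MILLION_EUR))
```

A user below the break-even volume would be cheaper off with a usage-based provider, and the revenue figure is what the host has to cover the GPU server with.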
Comments
In my opinion: yes.
I think that is already what a lot of the subscription-based services are actively doing, but with tokens (which just obfuscate the real high cost, as you said).
The hard part is sharing the GPU (stably and repeatably), billing appropriately, and explaining what someone is actually buying. It's easier on very expensive enterprise units, where you can easily sell fractional access.
@lentro …thoughts?
Why don't you use it as a service, then, if you only need to access it infrequently? I'd favor developing more customizations for as-a-service offerings over everybody hosting their own instance. It'll be difficult to balance the GPU usage.
I am using Google Colab, you can try it:
https://colab.research.google.com/gist/chigkim/5521120118fd7533a224b36a3167972f/mixtral.ipynb
Short answer - no.
The major companies are basically setting mountains of cash on fire in an attempt to gain market share. Copilot is rumoured to lose around 20-80 bucks per month per user (estimates vary).
For a single user, though, one can get away with a small model on a powerful CPU server, especially with the new Mixtral MoE at Q4 quantization.
You wouldn't need to do anything funky with the GPU: llama.cpp can handle concurrent requests.
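As a rough illustration of what "concurrent requests" could look like from the client side, here is a minimal sketch. It assumes a llama.cpp HTTP server is already running locally with several parallel slots enabled and that it exposes the `/completion` endpoint with `prompt` and `n_predict` fields; the URL, port, and worker count are assumptions, so check them against your llama.cpp version.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor

# Assumed address of a locally running llama.cpp server.
SERVER_URL = "http://localhost:8080/completion"


def http_post_json(url, payload):
    """POST a JSON payload and return the decoded JSON reply."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        return json.loads(response.read().decode("utf-8"))


def complete(prompt, post=http_post_json, n_predict=64):
    """Request one completion; `post` is injectable so it can be faked offline."""
    reply = post(SERVER_URL, {"prompt": prompt, "n_predict": n_predict})
    return reply.get("content", "")


def complete_many(prompts, post=http_post_json, max_workers=4):
    """Send several prompts at once; results come back in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: complete(p, post=post), prompts))
```

Calling `complete_many(["Hello", "What is a LAMP stack?"])` would send both prompts at the same time. Whether they are actually processed in parallel depends on how the server was started.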
Most LLMs are queue-based; you can share a GPU with x people, but that's it.
Maybe $20-30 per person, but with fair share.
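One way to picture a queue with fair share is round-robin over per-user queues, so a heavy user cannot starve everyone else. This is only an illustrative sketch; the class and method names are made up here and not taken from any existing serving stack.

```python
from collections import OrderedDict, deque


class FairShareQueue:
    """Round-robin scheduler: each user with pending work gets one turn at a time."""

    def __init__(self):
        # Maps user -> deque of pending requests; dict order is the turn order.
        self._queues = OrderedDict()

    def submit(self, user, request):
        """Add a request to the end of this user's own queue."""
        self._queues.setdefault(user, deque()).append(request)

    def next_request(self):
        """Return (user, request) for the user whose turn it is, or None if idle."""
        if not self._queues:
            return None
        user, pending = next(iter(self._queues.items()))
        request = pending.popleft()
        # Move the user to the back of the turn order if they still have work.
        del self._queues[user]
        if pending:
            self._queues[user] = pending
        return user, request


queue = FairShareQueue()
for prompt in ("a1", "a2", "a3"):
    queue.submit("alice", prompt)
queue.submit("bob", "b1")

# bob is served after alice's first request, not after all three of hers.
while (item := queue.next_request()) is not None:
    print(item)
```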
I thought he was asking about sharing (or slicing) the GPU 'permanently', the same way hosts do with vCPUs, rather than using a queue/token-based system, i.e. vGPU with NVIDIA. I re-read it and yeah, if it's running one specific model and people are sharing it, that is basically the same thing that already exists with subscriptions/tokens.
What would be the most efficient way to do it? In my understanding it would only make sense if all the users use the same model, because otherwise every user would always have to wait for their model to load.
It would be similar to all the users on a shared web host using the same LAMP stack, with the advantage of paying only a fraction of the price of running an individual stack.
I was wondering if that would make sense technically and cost-wise, that's why I asked. In my basic understanding the expensive factor is providing the necessary GPU RAM for a big model like Mixtral. The question is of course how many users it could serve in parallel.
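For the GPU RAM side of that question, a rough estimate is possible: the weights are loaded once and shared, while each parallel user needs their own KV cache. The sketch below uses figures that are not from this thread and should be treated as assumptions: about 46.7 billion parameters for Mixtral 8x7B, roughly 4.5 bits per weight for a Q4-style quantization, 32 layers with 8 KV heads of dimension 128, a 4096-token context per user, and a 48 GiB card.

```python
GIB = 2**30


def weights_gib(n_params, bits_per_weight):
    """Memory for the shared model weights, in GiB."""
    return n_params * bits_per_weight / 8 / GIB


def kv_cache_gib_per_user(context_tokens, n_layers, n_kv_heads, head_dim,
                          bytes_per_value=2):
    """Memory for one user's KV cache: keys and values for every layer and token."""
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return context_tokens * bytes_per_token / GIB


def max_parallel_users(gpu_gib, weights, kv_per_user):
    """How many full contexts fit next to the weights (memory only, no overhead)."""
    free = gpu_gib - weights
    return max(0, int(free // kv_per_user))


# All values below are assumptions for illustration, not measurements.
weights = weights_gib(n_params=46.7e9, bits_per_weight=4.5)
kv = kv_cache_gib_per_user(context_tokens=4096, n_layers=32,
                           n_kv_heads=8, head_dim=128)

print(f"weights: {weights:.1f} GiB, KV cache per user: {kv:.2f} GiB")
print("contexts that fit on a 48 GiB card:", max_parallel_users(48, weights, kv))
```

This is only an upper bound from memory. It ignores activation buffers and, more importantly, throughput: the number of users who get acceptable speed at the same moment will likely be lower than the number of contexts that fit.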