Could LowEnd LLM hosting be a thing?

If you are into AI and LLMs, you have certainly heard of the recent launch of Mixtral, an openly available LLM that is a serious competitor to GPT-4.

Anyone could download and self-host it, if only it didn't need far more GPU RAM than any affordable consumer card offers.

Hosting it on a rented GPU also makes no sense if you only need it a couple of times per day. It would be like renting a dedicated server when you only need a LAMP stack for a small blog.

And the upcoming usage-based providers (priced per x tokens consumed) get expensive if you use it a lot.

So could there be something similar to shared hosting for open-source LLMs? E.g. a server with a high-RAM GPU running a particular model, shared by 50+ users paying a €7 monthly flat fee.

Comments

  • crunchbits Hosting Provider

    @someTom said:
    If you are into AI and LLMs, you have certainly heard of the recent launch of Mixtral, an openly available LLM that is a serious competitor to GPT-4.

    Anyone could download and self-host it, if only it didn't need far more GPU RAM than any affordable consumer card offers.

    Hosting it on a rented GPU also makes no sense if you only need it a couple of times per day. It would be like renting a dedicated server when you only need a LAMP stack for a small blog.

    And the upcoming usage-based providers (priced per x tokens consumed) get expensive if you use it a lot.

    So could there be something similar to shared hosting for open-source LLMs? E.g. a server with a high-RAM GPU running a particular model, shared by 50+ users paying a €7 monthly flat fee.

    In my opinion: yes.
    I think that is already what a lot of the subscription-based services are actively doing, but with tokens (which, as you said, just obfuscate the real, high cost).

    The hard part is sharing the GPU (stably and repeatably), billing appropriately, and explaining what someone is actually buying. It's easier on very expensive enterprise units, where you can easily sell fractional access.

    Thanked by (1)someTom
  • vyas OGSenpai

    @lentro …thoughts?

    Thanked by (1)Not_Oles
  • Why don't you just use it as a service if you only need to access it infrequently? I'd rather see more customization developed for as-a-service offerings than everybody hosting their own instance. It will be difficult to balance the GPU usage.

  • havoc OGContent Writer

    Short answer - no.

    The major companies are basically setting mountains of cash on fire in an attempt to gain market share. Copilot is rumoured to lose around 20-80 bucks per month per user (estimates vary).

    For a single user, one can get away with a small model on a powerful CPU server, though, especially with the new Mixtral MoE at Q4 quantization.
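
    Roughly like this with llama-cpp-python, for example (the GGUF filename and the thread/context settings below are just placeholders; pick whatever fits your box):

        from llama_cpp import Llama

        # Placeholder path to a Q4-quantized Mixtral GGUF; any Q4 GGUF loads the same way.
        llm = Llama(
            model_path="./mixtral-8x7b-instruct.Q4_K_M.gguf",
            n_ctx=4096,      # context window
            n_threads=16,    # roughly match your physical core count
        )

        out = llm.create_chat_completion(
            messages=[{"role": "user", "content": "Give me a one-line summary of MoE models."}],
            max_tokens=128,
        )
        print(out["choices"][0]["message"]["content"])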

    @crunchbits said:

    The hard part is sharing the GPU (stably and repeatably)

    You wouldn't need to do anything funky with the GPU; llama.cpp can serve concurrent requests.
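
    Something like this, as a rough sketch (exact flag names and defaults vary between llama.cpp versions, so double-check against your build):

        # Assumed server launch on the shared box, with 4 parallel decoding slots:
        #   ./llama-server -m mixtral-8x7b-instruct.Q4_K_M.gguf -c 16384 -np 4 --port 8080
        import concurrent.futures
        import requests

        URL = "http://localhost:8080/v1/chat/completions"  # llama.cpp's OpenAI-compatible endpoint

        def ask(prompt: str) -> str:
            r = requests.post(URL, json={
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 128,
            })
            r.raise_for_status()
            return r.json()["choices"][0]["message"]["content"]

        # Four "users" asking at the same time; the server interleaves them across its slots.
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
            for answer in pool.map(ask, ["Q1", "Q2", "Q3", "Q4"]):
                print(answer)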

  • Neoon OGSenpai

    Most LLMs are queue-based; you can share a GPU with x people, but that's it.
    Maybe $20-30 per person, but with fair share.
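
    The "fair share" part is mostly just scheduling, something like this toy round-robin (purely illustrative):

        from collections import deque

        # One FIFO queue per user plus a rotation of users; the worker serves one
        # request per user per turn, so nobody hogs the GPU by flooding the queue.
        queues: dict[str, deque] = {}
        turn: deque = deque()  # round-robin order of users with pending work

        def submit(user: str, prompt: str) -> None:
            queues.setdefault(user, deque()).append(prompt)
            if user not in turn:
                turn.append(user)

        def next_request():
            while turn:
                user = turn.popleft()
                if queues[user]:
                    prompt = queues[user].popleft()
                    if queues[user]:          # still has work: back of the line
                        turn.append(user)
                    return user, prompt
            return None

        submit("alice", "write a haiku")
        submit("alice", "now a limerick")
        submit("bob", "explain MoE")
        print(next_request())  # ('alice', 'write a haiku')
        print(next_request())  # ('bob', 'explain MoE'); bob isn't stuck behind alice's backlog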

    Thanked by (1)someTom
  • crunchbits Hosting Provider

    @havoc said:
    You wouldn't need to do anything funky with the GPU; llama.cpp can serve concurrent requests.

    I thought he was asking about sharing (or slicing) the GPU 'permanently' the same way hosts do with vCPUs, rather than using a queue/token-based system, i.e. vGPU with NVIDIA. Re-reading it: yeah, if it's running one specific model and people are sharing it, that is basically the same thing that already exists with subscriptions/tokens.

  • @crunchbits said:
    I thought he was asking about sharing (or slicing) the GPU 'permanently' the same way hosts do with vCPUs, rather than using a queue/token-based system, i.e. vGPU with NVIDIA. Re-reading it: yeah, if it's running one specific model and people are sharing it, that is basically the same thing that already exists with subscriptions/tokens.

    What would be the most efficient way to do it? In my understanding, it would only make sense if all the users use the same model, because otherwise every user would constantly have to wait for their model to load.

    Similar to all the users on a shared web host using the same LAMP stack, with the advantage of paying only a fraction of the price of running an individual stack.

    I was wondering whether that would make sense technically and cost-wise; that's why I asked. In my basic understanding, the expensive factor is providing the necessary GPU RAM for a big model like Mixtral; the question, of course, is how many users it could serve in parallel.
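
    In my head, the "shared LAMP stack" version would be roughly this: one loaded model, one endpoint, and a per-user meter on top. A rough sketch with made-up numbers, assuming a single backend (like the llama.cpp server mentioned above) that reports token usage in its responses:

        import requests

        BACKEND = "http://localhost:8080/v1/chat/completions"  # the one shared model instance
        MONTHLY_TOKEN_CAP = 2_000_000        # made-up fair-use cap per flat-fee user
        used_tokens: dict[str, int] = {}     # per-user counter (in practice: a database)

        def chat(user: str, prompt: str) -> str:
            if used_tokens.get(user, 0) >= MONTHLY_TOKEN_CAP:
                raise RuntimeError(f"{user} is over this month's fair-use cap")
            r = requests.post(BACKEND, json={
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": 256,
            })
            r.raise_for_status()
            data = r.json()
            # OpenAI-style responses include a usage block; bill those tokens to this user.
            used_tokens[user] = used_tokens.get(user, 0) + data.get("usage", {}).get("total_tokens", 0)
            return data["choices"][0]["message"]["content"]

    How many users one GPU could serve in parallel then comes down to how many of those requests the backend can batch at once, which is exactly the part I can't judge.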
