Semi-private LLM Hosting?

rockinmusicgv · March 2024

Has anyone seen or heard of a "semi-private LLM" hosting provider? I'm hoping for a middle ground between renting a GPU-enabled server for $200/mo for a fully private LLM and sending all my data to OpenAI (or similar). I'm assuming one of the myriad of open source LLMs could be run in an environment like this - assuming they have a commercial license rider. Is anyone doing something like this?

I know Salad exists, and that seems like a good option for content generation since it has hourly billing, but in order to respond to an email or ticket, it would need to be on 24/7.

Calin · March 2024

Hello @rockinmusicgv, could you please tell me whether you would like a dedicated server with an Nvidia Tesla M10 GPU?

yucchun · March 2024

@rockinmusicgv said:
Has anyone seen or heard of a "semi-private LLM" hosting provider? I'm hoping for a middle ground between renting a GPU-enabled server for $200/mo for a fully private LLM and sending all my data to OpenAI (or similar). I'm assuming one of the myriad of open source LLMs could be run in an environment like this - assuming they have a commercial license rider. Is anyone doing something like this?

I know Salad exists, and that seems like a good option for content generation since it has hourly billing, but in order to respond to an email or ticket, it would need to be on 24/7.

How often do you think you'll use the service? And how good do you need the GPU to be? I have some providers in mind for hourly billing.
Salad's really cheap indeed.

crunchbits · March 2024

@rockinmusicgv said:
Has anyone seen or heard of a "semi-private LLM" hosting provider? I'm hoping for a middle ground between renting a GPU-enabled server for $200/mo for a fully private LLM and sending all my data to OpenAI (or similar). I'm assuming one of the myriad of open source LLMs could be run in an environment like this - assuming they have a commercial license rider. Is anyone doing something like this?

I know Salad exists, and that seems like a good option for content generation since it has hourly billing, but in order to respond to an email or ticket, it would need to be on 24/7.

This might be sort of what Groq (not Grok) does. I'm not sure about the 'sending data' part, though.

Otherwise I expect we're going to see providers start to acquire good deals on bulk GPU rentals and begin to offer "shared" services like this that are always on.

rockinmusicgv · March 2024

@yucchun said:
How often do you think you'll use the service? And how good do you need the GPU to be? I have some providers in mind for hourly billing.
Salad's really cheap indeed.

As I see it, Salad's use case is training plus content and image generation. They have reasonable prices, but it's hourly billing. After two years, it's more reasonable to buy a $7000 Mac than to use cloud GPU hosting. That's not something I'm actually against; I just don't have $7000 for a Mac with 200GB of RAM.
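A rough sketch of the buy-versus-rent arithmetic, assuming a flat monthly rental and ignoring power, resale value, and price changes. The function names are made up for illustration; the $7000 and $200/mo figures are the ones quoted in this thread.

```python
import math


def break_even_months(hardware_usd: float, monthly_rent_usd: float) -> int:
    """Whole months of renting after which buying would have cost the same or less."""
    if monthly_rent_usd <= 0:
        raise ValueError("monthly_rent_usd must be positive")
    return math.ceil(hardware_usd / monthly_rent_usd)


def rent_for_break_even(hardware_usd: float, months: int) -> float:
    """Monthly rent at which buying breaks even after exactly `months` months."""
    if months <= 0:
        raise ValueError("months must be positive")
    return hardware_usd / months


# $7000 of hardware against a $200/mo server: break-even at 35 months.
print(break_even_months(7000, 200))

# For a two-year break-even, the rental would have to cost roughly $292/mo.
print(round(rent_for_break_even(7000, 24), 2))
```

So the two-year figure only holds if the cloud bill is closer to $290/mo than $200/mo, which hourly billing on a 24/7 instance could plausibly reach.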

@crunchbits said:
This might be sort of what Groq (not Grok) does. I'm not sure about the 'sending data' part, though.

There are a lot of API providers like Groq. OpenAI, Mistral, and Anthropic all have their own API access. I want to avoid using the API services since they require sending the prompts, replies, and so on to their servers.

Neoon · March 2024

You could set up a pool where about 5 people pay for a GPU server and use it together.
That should bring the cost down to about $20-30 per month each.

Get one from @crunchbits and you're set.
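A quick sketch of the pool arithmetic, assuming the bill is split evenly among members. The helper names are hypothetical, and the $200/mo figure is the one from the first post rather than any provider's actual price.

```python
def per_person_cost(server_monthly_usd: float, members: int) -> float:
    """Even split of a monthly server bill across pool members."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return server_monthly_usd / members


def max_server_price(target_per_person_usd: float, members: int) -> float:
    """Highest monthly server price that keeps each member at or under the target."""
    if members < 1:
        raise ValueError("members must be at least 1")
    return target_per_person_usd * members


# A $200/mo server shared by 5 people costs each of them $40/mo.
print(per_person_cost(200, 5))

# To land at $20-30 per person with 5 people, the server has to cost $100-150/mo.
print(max_server_price(20, 5), max_server_price(30, 5))
```

An even split also means everyone shares the same GPU memory and queue, so the pool only works if the members' usage doesn't peak at the same time.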

rockinmusicgv · March 2024

I checked their site and it says they're out of stock on the 3090. Renting from a provider here (which means they have a link to a community) would be a plus. Hopefully they'll be in stock soon.

Neoon · March 2024

@rockinmusicgv said:
I checked their site and it says they're out of stock on the 3090. Renting from a provider here (which means they have a link to a community) would be a plus. Hopefully they'll be in stock soon.

Yea, the 3090 is the sweet spot right now. AI stuff is pretty fast on it, despite it already being over 3 years old.
They're mostly out of stock, though still cheap compared to other deals. It might be worth buying a GPU yourself, just paying for power, and sharing it.

crunchbits · March 2024

@rockinmusicgv said:
I checked their site and it says they're out of stock on the 3090. Renting from a provider here (which means they have a link to a community) would be a plus. Hopefully they'll be in stock soon.

We've got them, with a ton of 3090s inbound and being racked daily at the moment. Open a ticket/DM/e-mail/Discord and I'll get you taken care of. They show as out of stock because the plan is being improved (based on customer feedback) for the same price: more cores and double the RAM (don't ban me @bikegremlin).

@Neoon said:
Yea, the 3090 is the sweet spot right now. AI stuff is pretty fast on it, despite it already being over 3 years old.
They're mostly out of stock, though still cheap compared to other deals. It might be worth buying a GPU yourself, just paying for power, and sharing it.

I agree. Lots of requests for 4090s, but they're a lot more difficult to cool properly at full TDP (in bulk), and pricing on them is getting a bit out of hand. Anyone doing AV1 encoding will be better served by our other Ada lineup.

AuroraZero · March 2024

@crunchbits said: more cores and double the RAM (don't ban me @bikegremlin).

I am going to do it unless I get naked pics of your dog in my inbox within 24 hrs!!!

havoc · March 2024

No, there isn't really an in-between offering like that. All APIs require sending data, and any hosted solution that is available 24/7 requires paying for the full capacity (i.e. the $200), or else you self-host. The closest hybrid I can think of is Cloudflare GPU workers, but that is still basically an API.

I'd recommend finding an API provider you're comfortable with. Mistral is in France and should have pretty tight EU privacy rights.

Selfhosting LLMs is super interesting but frankly not particularly competitive on quality or cost.

@Neoon said: Yea, the 3090 is the sweet spot right now

Do check whether your use case allows AMD, though. Some LLMs are now compatible with AMD, and the 7900 XTX is a much newer/stronger card that also has 24GB. Obviously it depends on how CUDA-dependent you are.

crunchbits · March 2024

@havoc said:
Do check whether your use case allows AMD, though. Some LLMs are now compatible with AMD, and the 7900 XTX is a much newer/stronger card that also has 24GB. Obviously it depends on how CUDA-dependent you are.

We (as in, everyone interested in using this) need this to happen. Opening up AMD's newer lineups and/or their MI series stuff so they're actually productive (without an extremely tight niche/customized software stack) would be huge. It keeps inching closer...

havoc · March 2024

@crunchbits said:

@havoc said:
Do check whether your use case allows AMD, though. Some LLMs are now compatible with AMD, and the 7900 XTX is a much newer/stronger card that also has 24GB. Obviously it depends on how CUDA-dependent you are.

We (as in, everyone interested in using this) need this to happen. Opening up AMD's newer lineups and/or their MI series stuff so they're actually productive (without an extremely tight niche/customized software stack) would be huge. It keeps inching closer...

Well I bought 10k worth of AMD shares this morning so they had better make this stick...

@crunchbits said: without an extremely tight niche/customized software stack

Inference is already pretty point-and-shoot on the 79** cards, I think. Anything outside of that, whether in use case (e.g. training) or card, less so.

crunchbits · March 2024

@havoc said:
Well I bought 10k worth of AMD shares this morning so they had better make this stick...

Low End VPS, High End Portfolio.

@havoc said:
Inference is already pretty point-and-shoot on the 79** cards, I think. Anything outside of that, whether in use case (e.g. training) or card, less so.

Yeah, inference is a beast, and they were making good gains in 1:1 CUDA stuff (even if a little behind, the cost was in line). The next steps are either getting it all the way there or pushing a CUDA alternative that is just as easy, then somehow making it a viable customer-facing alternative and helping with the messaging.

rockinmusicgv · March 2024

@havoc said:
No, there isn't really an in-between offering like that. All APIs require sending data, and any hosted solution that is available 24/7 requires paying for the full capacity (i.e. the $200), or else you self-host. The closest hybrid I can think of is Cloudflare GPU workers, but that is still basically an API.

I'd recommend finding an API provider you're comfortable with. Mistral is in France and should have pretty tight EU privacy rights.

Selfhosting LLMs is super interesting but frankly not particularly competitive on quality or cost.

A cloud worker would be more in line with what I'm looking for, assuming the data can be encrypted in transit. One thing I don't like about the AI providers is that they have a vested interest in monitoring the output of their services. They don't want people using their services for fraud, and there have already been voice AI spear-phishing attacks, so they do have a good reason to keep tabs on people. The downside is that customer data (even if it's just a name) gets transmitted back and forth and viewed by the provider.

bikegremlin · March 2024

@crunchbits said:

We've got them, with a ton of 3090s inbound and being racked daily at the moment. Open a ticket/DM/e-mail/Discord and I'll get you taken care of. They show as out of stock because the plan is being improved (based on customer feedback) for the same price: more cores and double the RAM (don't ban me @bikegremlin).

The LES mod team and me when the word "double" is posted on the forum:
