Free hosted Llama 3.1 8B - TensorDock

lentro · August 2024

Hello LES!

If you've been looking to play around with AI endpoints recently but haven't had the chance, I've spun up a GPU cluster running Llama 3.1 8B for anyone who's interested.

Tech deets: Rust load balancer distributing load across 5 NVIDIA GPU servers running Llama inference servers. Max context length 2048 tokens.

The API is 100% OpenAI completions API compatible (let me know if you want streaming support) and free to use for now!

If there's enough interest, I can also set up a Llama 70B or hosted Mixtral cluster.

Example API integration in Python:

"""
An example OpenAI API client that connects to TensorDock's YAP
"""
from openai import OpenAI

# The key is a placeholder; base_url points the client at YAP instead of OpenAI.
client = OpenAI(
    api_key="dummy",
    base_url="https://yap.tensordock.com",
)

# Request a chat completion with a system prompt and one user message.
completion = client.chat.completions.create(
    model="Meta-Llama-3.1-8B-Instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a pirate who speaks in pirate-speak.",
        },
        {
            "role": "user",
            "content": "Explain LLMs to me in a single sentence.",
        },
    ],
    max_tokens=256,  # cap on the number of generated tokens
    temperature=0.7,
    top_p=0.7,
    frequency_penalty=1.0,
    seed=17,
)

# Print the text of the first returned choice.
output = completion.choices[0].message.content
print(output)

More details: https://blog.tensordock.com/blog/YAP
Rent affordable GPU servers: https://dashboard.tensordock.com/deploy

Not_Oles · August 2024

Hi @lentro!

Thanks for making TensorDock's YAP GPU cluster available here at LES!

Maybe I could ask five questions, please.

1. What's the practical difference between Llama 3.1 8B and the model I have been using, which is

   Name | Version | Max In | Max Out | Description
   models/gemini-1.5-pro-latest | 001 | 2097152 | 8192 | Mid-size multimodal model that supports up to 2 million tokens

2. Are there only three important token counts: max in, max out, and maximum context?
3. Does the "Max context length 2048 tokens" number you mentioned above refer to the same use of "token" as Google Gemini's "2 million tokens"?
4. I have been using https://github.com/eliben/gemini-cli, which is a command line interface to Google Gemini. Is there a command line AI interface which you could recommend that supports standard input, standard output, standard error, and pipes?
5. Can the command line interface which you recommend connect to both YAP and Google Gemini?

Thanks again for making YAP available here at LES! Thanks in advance for any help with my questions!

Best wishes!

BruhGamer12 · August 2024

@Not_Oles said: Maybe I could ask five questions, please.

I can help answer some if it lifts some of the burden off of @lentro

1) Llama 3.1 8B is not multimodal (i.e., it can't take images as input and generate text about them within the same model). It only accepts text input and produces text output. Llama 3.1 8B is also likely much, much smaller than Gemini 1.5 Pro, so it is likely not as smart at many tasks.

3) Essentially yes
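
To make the token counts a bit more concrete, here is a minimal sketch. It assumes, as is typical for this kind of model but not confirmed for YAP specifically, that the 2048-token context is shared between the prompt and the generated reply. The function name is made up for illustration and is not part of any API.

```python
CONTEXT_LENGTH = 2048  # "Max context length 2048 tokens" from the first post


def max_prompt_tokens(max_output_tokens: int,
                      context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many prompt tokens fit alongside the requested output.

    Assumption: prompt and reply share one context window.
    """
    if not 0 <= max_output_tokens <= context_length:
        raise ValueError("max_output_tokens must be between 0 and the context length")
    return context_length - max_output_tokens


# With max_tokens=256, as in the example client, this leaves 1792 prompt tokens.
print(max_prompt_tokens(256))
```

By contrast, the Gemini listing quoted above gives separate limits for input (2097152) and output (8192).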

havoc · August 2024

Nice one! Tried it & works as expected.

If you're looking to get additional load for publicity, have a look at OpenRouter... some of the models on there have $0 pricing. Doesn't look like anybody else has free 3.1 on there right now. You'd need to up the context length though - limiting a 128k model to 2k will raise eyebrows.

Think I've still got some credits w/ tensor... if this does ever switch to paid, could I use them against that?

vyas · August 2024

Nice one, @lentro

Otus9051 · August 2024

@havoc said: Doesn't look like anybody else has free 3.1 on there right now..

https://openrouter.ai/models/meta-llama/llama-3.1-8b-instruct:free maybe?

codelock · August 2024

Hi @lentro, long time since I saw you here, how's everything?

lentro · August 2024

@codelock -- hello! Glad to be back -- things have been really crazy over the past two years, but TensorDock work is finally getting less hectic with all the automation & new team members we have... life is finally settling down & I'll finally have more time to collect Linux ISOs soon

lentro · August 2024

@havoc said: Think I've still got some credits w/ tensor...if this does ever switch to paid, could i use it against that?

Yes of course! 70B might be paid, I don't think we'll charge for 8B though. Running it is so cheap if you own the GPUs... Please send me a DM if you do have credits though, we have been shifting database changes around & have not migrated super old accounts if you haven't been an active customer recently.

@havoc said: limiting a 128k model to 2k will raise eyebrows

We're running this at full FP16 precision on 24GB VRAM GPUs. 2k context gives us sufficient VRAM to batch & serve with low unit costs. Full 128k context requires larger-VRAM GPUs that are more expensive to own, so maybe we'd charge users who need 128k context? Anyway, that's the reasoning behind this decision.
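
A rough back-of-the-envelope sketch of that VRAM trade-off follows. The layer, head, and parameter counts are the published Llama 3.1 8B architecture values. The function names are illustrative, and the estimate ignores activations and serving overhead, so treat the numbers as approximate rather than as what the cluster actually measures.

```python
# Published Llama 3.1 8B architecture values; the rest is an estimate.
PARAMS = 8.03e9      # parameter count
NUM_LAYERS = 32
NUM_KV_HEADS = 8     # grouped-query attention
HEAD_DIM = 128
BYTES_FP16 = 2
GIB = 1024 ** 3


def weights_gib() -> float:
    """Memory for the FP16 weights alone."""
    return PARAMS * BYTES_FP16 / GIB


def kv_cache_gib(tokens: int) -> float:
    """FP16 KV cache for one sequence: keys and values for every layer and KV head."""
    bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_FP16
    return tokens * bytes_per_token / GIB


print(f"weights:         {weights_gib():.1f} GiB")
print(f"KV cache @ 2k:   {kv_cache_gib(2048):.2f} GiB per sequence")
print(f"KV cache @ 128k: {kv_cache_gib(131072):.2f} GiB per sequence")
```

Under these assumptions the weights take roughly 15 GiB, a 2k sequence adds 0.25 GiB, and a single 128k sequence adds 16 GiB. A 24GB card can therefore batch many 2k requests but cannot hold even one full-length 128k request next to the FP16 weights.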

lentro · August 2024

@BruhGamer12 // @Not_Oles

I think @BruhGamer12 hit the nail on the head. Realistically, Llama 8B is a small model and runs on (relatively) low-end hardware, which is why I can host it for free. The 70B and 405B Llama models are free to download but need much beefier hardware to run [2x H100 at full FP16 precision for 70B and 8x H100 at half that precision for 405B]. Those 2 models are probably much more comparable to Google Gemini in terms of logical reasoning.

The main selling point of Llama is the price. Because Meta provides weights for free, model hosting providers compete against each other on pricing. Even if we needed to charge for 8B, we could break even at just $0.07 per million tokens [assuming enough customers use it], about 15x cheaper than Google Gemini.
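
The pricing arithmetic can be sketched as follows. The $0.07 and 15x figures come from the post. The Gemini price is only what those two numbers imply, not a quoted list price, and the 10-million-token workload is a made-up example.

```python
LLAMA_8B_PRICE = 0.07  # USD per million tokens, break-even figure from the post
GEMINI_RATIO = 15      # "15x cheaper than Google Gemini"


def cost_usd(tokens: int, price_per_million: float) -> float:
    """Cost of processing `tokens` tokens at a per-million-token price."""
    return tokens / 1_000_000 * price_per_million


# Price implied by the two figures above, not an official Gemini price.
implied_gemini_price = LLAMA_8B_PRICE * GEMINI_RATIO

print(f"implied Gemini price: ${implied_gemini_price:.2f} per million tokens")
# Hypothetical workload of 10 million tokens.
print(f"10M tokens on 8B:     ${cost_usd(10_000_000, LLAMA_8B_PRICE):.2f}")
```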

Let me get back to you re: command line interface -- I have seen aichat support OpenAI-compatible APIs but I need to get it working on my local system first!
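
In the meantime, here is a minimal sketch of a pipe-friendly client. This is not aichat or any official tool; the `run` and `send_to_yap` names are hypothetical. It reuses the endpoint and model name from the first post, reads the prompt from standard input, writes the reply to standard output, and reports failures on standard error.

```python
import sys
from typing import Callable, TextIO


def send_to_yap(prompt: str) -> str:
    """Send one prompt to the endpoint from the first post and return the reply."""
    from openai import OpenAI  # lazy import: only needed for real requests

    client = OpenAI(api_key="dummy", base_url="https://yap.tensordock.com")
    completion = client.chat.completions.create(
        model="Meta-Llama-3.1-8B-Instruct",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=256,
    )
    return completion.choices[0].message.content


def run(stdin: TextIO, stdout: TextIO, stderr: TextIO,
        send: Callable[[str], str] = send_to_yap) -> int:
    """Read a prompt from stdin, write the reply to stdout, errors to stderr."""
    prompt = stdin.read().strip()
    if not prompt:
        print("error: empty prompt on standard input", file=stderr)
        return 2
    try:
        reply = send(prompt)
    except Exception as exc:  # report any request failure on stderr
        print(f"error: {exc}", file=stderr)
        return 1
    print(reply, file=stdout)
    return 0


# To use as a script, uncomment:
# sys.exit(run(sys.stdin, sys.stdout, sys.stderr))
```

Saved under a name of your choosing with the last line uncommented, this can sit in a pipeline such as `cat notes.txt | python yap_pipe.py > reply.txt`. The `send` parameter exists only so the request can be swapped out, for example for a different backend or for testing.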

BruhGamer12 · August 2024

@lentro said: We're running this at full FP16 precision on 24GB VRAM GPUs.

Why full 16 bit? 8-bit GGUF takes way less memory and is way faster, or if you are using only GPUs, do EXL2 6-bit or something for much faster speed - you don't get loss of quality until somewhere between 5 and 6 bit.
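
As a rough illustration of the memory side of this, the sketch below estimates weight size at nominal bit widths, using the published 8.03B parameter count for Llama 3.1 8B. Real quantized formats also store per-block scales, so actual files come out somewhat larger than these figures.

```python
PARAMS = 8.03e9  # published Llama 3.1 8B parameter count
GIB = 1024 ** 3


def quantized_weights_gib(bits_per_weight: float, params: float = PARAMS) -> float:
    """Approximate weight memory, ignoring quantization metadata overhead."""
    return params * bits_per_weight / 8 / GIB


for bits in (16, 8, 6, 5):
    print(f"{bits:>2}-bit: {quantized_weights_gib(bits):5.2f} GiB")
```

Halving the bit width halves the weight footprint, which is VRAM that could instead go to longer context or larger batches.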

lentro · August 2024

@BruhGamer12 said: loss of quality

Hmm, I will have to look into this but I am quite sure even at 8 bit there is measurable quality loss. Of course the question becomes does a quantized 70B model perform better than a, say, 34B model -- and I am sure yes, so it's a balance between quality & performance/cost... I'll look into running these at FP8.

BruhGamer12 · August 2024

@lentro said:

@BruhGamer12 said: loss of quality

Hmm, I will have to look into this but I am quite sure even at 8 bit there is measurable quality loss. Of course the question becomes does a quantized 70B model perform better than a, say, 34B model -- and I am sure yes, so it's a balance between quality & performance/cost... I'll look into running these at FP8.

You are def right, so thanks for that correction, but maybe the loss is less than you think (it is actually more than I would have thought, which surprised me!). This is with 70B Llama 3 using the new MMLU-Pro benchmark; here are the scores in the math section, for example. So you may have some headroom to reduce precision and still maintain quality. This is using GGUF quants btw. Math is where it would struggle the most, I think, but others like econ are the same between 8-bit and 16-bit.

FP16-Q2_KXXS..Correct: 336/1351, Score: 24.87%
FP16-Q2_K.....Correct: 436/1351, Score: 32.27%
FP16-Q4_K_M...Correct: 529/1351, Score: 39.16%
FP16-Q5_K_M...Correct: 543/1351, Score: 40.19%
FP16-Q6_K.....Correct: 547/1351, Score: 40.49%
FP16-Q8_0.....Correct: 532/1351, Score: 39.38%

For context full fp16 score is 54% for math - from huggingface https://huggingface.co/spaces/TIGER-Lab/MMLU-Pro
https://www.reddit.com/r/LocalLLaMA/comments/1ds6da5/mmlupro_all_category_test_results_for_llama_3_70b/
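
For anyone who wants to compare the rows, here is a small sketch that recomputes the percentages from the raw counts above and shows each quant's gap to the best one. The labels are the quoted row names with the "FP16-" prefix dropped.

```python
TOTAL = 1351  # math questions in the run quoted above

correct = {
    "Q2_KXXS": 336,
    "Q2_K": 436,
    "Q4_K_M": 529,
    "Q5_K_M": 543,
    "Q6_K": 547,
    "Q8_0": 532,
}

# Convert raw counts into percentage scores.
scores = {name: 100 * n / TOTAL for name, n in correct.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    gap = scores[best] - score
    print(f"{name:<8} {score:6.2f}%  ({gap:5.2f} points below {best})")
```

In this run Q8_0 lands 15 questions (about 1.1 points) below Q6_K, which suggests differences of that size between the upper quants may be within run-to-run noise.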

Thanks again for the free project tho. Do not mean to step on your toes at all! Whatever you do is awesome for the community!

vyas · August 2024

@ehab,

Here’s a chance for you to virtually try on as many pants as you like… or not. For example, you can try asking "Name 42 ways to use pants without wearing them even once".

Q. Write a short essay on "42 creative ways to use pants without wearing them"

Title: 42 Creative Ways to Utilize Pants Without Wearing Them

Introduction

Pants, a staple in many wardrobes, are versatile pieces of clothing that can serve purposes beyond wearing them on our bodies. This essay explores 42 creative and practical ways to utilize pants without wearing them even once, showcasing their hidden potential and value.

Home Decor

Hang pants as wall art for a unique touch to your interior design.
Fold pants and place them in a decorative basket or tray for a stylish display.

Cleaning Cloths

Use pants as rags to clean surfaces, wipe spills, or dust furniture.
Cut pants into smaller pieces for use as reusable cleaning wipes.

Pet Items

Turn pants into a cozy pet bed or blanket.
Use pants as a chew toy for teething puppies or kittens.

Gardening Tools

Cut pants into squares for use as gardening kneepads.
Utilize pants as protective coverings for plants during frost or cold weather.

Craft Supplies

Cut pants into strips for braiding or weaving projects.
Use pants as a canvas for painting or tie-dyeing.

Travel Accessories

Pack pants as a cushion or pillow during long trips.
Wrap pants around breakable items to protect them during travel.

Storage Solutions

Store pants in vacuum-sealed bags to save space.
Use pants as dividers in drawers or closets to keep items organized.

Gift Wrapping

Wrap gifts with pants as a creative alternative to paper.
Cut pants into strips and use them as ribbons or bows.

Emergency Preparedness

Keep pants in your car or emergency kit as a makeshift blanket or shelter.
Use pants to filter water in survival situations.

Fashion Statements

Wear pants as a scarf, headband, or belt.
Create a unique handbag or tote using pants as the primary material.

Playtime Fun

Use pants as a parachute or blanket during outdoor activities.
Cut pants into shapes for use as stencils or templates during arts and crafts.

Personal Care

Apply pants as a cold or hot compress for sore muscles.
Use pants as a barrier between your skin and a hot or cold pack.

Home Improvement

Cut pants into strips for use as sandpaper or emery cloth.
Use pants to clean paint brushes or rollers during home improvement projects.

Conclusion

This extensive list of 42 creative ways to utilize pants without wearing them showcases their hidden potential and value. From home decor to personal care, gardening to travel, pants can serve a multitude of purposes beyond their traditional use as clothing. By repurposing pants, we can extend their lifespan, reduce waste, and embrace a more sustainable lifestyle.

ehab · August 2024

@vyas charming as always. Made me smile.

adly · August 2024

Another option with various models, including this and Llama 3.1 70B is Perplexity Labs, though only free via the web interface.

havoc · August 2024

@adly said:
Another option with various models, including this and Llama 3.1 70B is Perplexity Labs, though only free via the web interface.

Their paid plan includes $5 worth of API credits as well...which is A LOT. Never come close to exhausting that

BruhGamer12 · August 2024

Another is openrouter - got free $5 with first ACH bank deposit on there.

Chievo · August 2024

A noob question: what do you use the AI for? For learning from data, or something else?

lentro · August 2024

@Chievo said:
A noob question: what do you use the AI for? For learning from data, or something else?

imho, I'm not too sure what people will use this API for

but personally, I just like having an AI that is more privacy-focused [we don't collect any data of course -- I trust myself more than anyone else]
