Install Llama on a GPU server
Busy testing the GPU servers per the @crunchbits thread, so I jotted down some notes on how to get a fresh Ubuntu server talking to a Llama model. Note this is on a 16 GB GPU - if you're on a smaller one you'll need to change the q8_0 part to q4_0 or even one of the q3_K variants.
Also note that here I'm downloading an fp16 model and converting it to a q8 GGUF. In practice you can skip those steps and just download ready-made quantized GGUF models from TheBloke's Hugging Face repos, i.e. you'd modify the download-model step to point to a quantized GGUF model and skip the generate and quantize steps after that.
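As a sketch of that shortcut (run it after the pip install below; the repo id and file name are illustrative - check the GGUF repo in a browser for the exact file name):
python3
from huggingface_hub import hf_hub_download
# pull a single ready-made quantized GGUF instead of the full fp16 snapshot
hf_hub_download(repo_id="TheBloke/Llama-2-13B-chat-GGUF", filename="llama-2-13b-chat.Q8_0.gguf", local_dir="/root/llama.cpp/models/")
quit()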
This assumes Ubuntu 22.04 - you may need to do stuff like install python3 if you're on a different distro
Check that we have a GPU
apt update && apt upgrade
apt install hwinfo -y
hwinfo --gfxcard --short
Set up nvidia driver and SDK
apt install nvidia-driver-535-server nvidia-dkms-535-server nvidia-cuda-toolkit -y
reboot
nvidia-smi
nvcc --version
Grab llama.cpp and build it
git clone https://github.com/ggerganov/llama.cpp
apt install cmake -y
cd llama.cpp
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=ON
cmake --build . --config Release
cd ..
Download a model
mkdir -p /root/llama.cpp/models/llama2-fp16
python3 -m pip install huggingface_hub
python3
from huggingface_hub import snapshot_download
snapshot_download(repo_id="TheBloke/Llama-2-13B-Chat-fp16", revision="main", local_dir="/root/llama.cpp/models/llama2-fp16/")
quit()
Generate GGUF file
python3 -m pip install gguf sentencepiece
python3 convert.py ./models/llama2-fp16/
Quantize it
cd ./build/bin
./quantize ../../models/llama2-fp16/ggml-model-f16.gguf ../../models/llama2-q8.gguf q8_0
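If you're on a smaller GPU, it's the same command with a smaller quantization type (per the note at the top), e.g.:
./quantize ../../models/llama2-fp16/ggml-model-f16.gguf ../../models/llama2-q4.gguf q4_0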
Run it
./main -m ../../models/llama2-q8.gguf -ngl 99 --color -p "Tell me a story about a unicorn!"
Tell me a story about a unicorn!
Once upon a time, in a far-off land of rolling hills and sparkling streams, there lived a beautiful unicorn named Luna. She had a shimmering coat of silver and white, and her horn was as bright as the stars in the night sky.
Luna lived a peaceful life, roaming the forests and meadows, and making friends with all the creatures she met. She loved to play with the butterflies and dance with the flowers, and she could make the most beautiful music with her horn.
One day, a wicked witch cast a spell on the land, causing all the plants and animals to become sick and tired. The unicorns were especially affected, and their beautiful coats became dull and lifeless.
Luna knew that she had to do something to save her friends and the land they lived in. She set out on a journey to find the witch and break her spell.
As she traveled through the forest, Luna met many creatures who were suffering from the witch's spell. She used her horn to heal them and bring them back to life. She also met a brave knight who had been searching for the witch for many years. Together, they journeyed on, determined to defeat the wicked witch and bring peace back to the land.
Finally, after many days of traveling, they came to the witch's castle. It was a dark and gloomy place, surrounded by a moat of swirling black water. But Luna was not afraid. She knew that her horn could break any spell, no matter how powerful.
She and the knight entered the castle, ready to face whatever dangers lay inside. As they made their way deeper into the castle, they came across the witch herself. She was a terrifying sight, with warts and a crooked nose, and a cackle that sent chills down your spine.
But Luna was not afraid. She raised her horn and pointed it at the witch, ready to break the spell. The witch laughed and tried to stop her, but Luna's horn was too powerful. With one blast of magic, the spell was broken, and the land was once again filled with light and life.
The creatures who had been turned to stone were returned to their true forms, and they cheered and celebrated as Luna and the knight emerged from the castle. The witch was banished from the land forever, and peace was restored.
And Luna, the little unicorn with the powerful horn, lived happily ever after, knowing that she had saved her homeland from the evil witch's spell. The end.
Comments
Looking good, finally someone not just using Stable Diffusion.
Are you considering benchmarking a few more models? I.e. comparing them at creating stories, code generation, code completion (I'm more interested in whether they can spot bugs, however).
Other than that, maybe try some 'uncensored' models. If it gets naughty you can just post the conclusion in this thread.
Fuck this 24/7 internet spew of trivia and celebrity bullshit.
Yep - the LLMs seem much more interesting to me in the long run & def the part I want to learn more about in the host-it-yourself context. This stuff is absolutely gonna change the world.
It's incredibly hard to benchmark them in any way that is meaningful tbh, so not planning to. I do try out a lot of different ones though because they do have very different vibes.
Code generation, yes - the Code Llama ones are pretty good at generating pieces. Code completion, no. Haven't figured out how to hook the Copilot extension into a local model for completion. Tried but failed thus far.
The local stuff is solid at explaining code already though. That code was generated by the same model too.
And also responds well to follow up questions.
For that I still end up using GPT-4 when I'm really stuck - especially not for coding but for Linux things, like if OpenCL is broken or whatever. Trying to make an effort to ask local models first though, so that I can learn their failure points better & develop an intuition for them.
Not super interested in that angle tbh. Uncensored - I reckon its importance is exaggerated. Tried asking one to generate a spicy story to see what the fuss is about, and it got the request and complied, but... just so damn bland. I could see them being good at stories though... Dungeons & Dragons style.
Thanks a lot for this tutorial! I had tried to play around with llama.cpp in the past, but after getting the main program compiled, I could never figure out how to get models. (And judging by your instructions, I never would have figured it out on my own anyway.)
Worked like a charm, and now I'm off generating my own unicorn stories.
Wohoo!
If you're looking for something more user friendly, this works well:
https://github.com/oobabooga/text-generation-webui
You can technically wget the files off Hugging Face too, but this way is cleaner as long as the repo is split into branches. Some are not, so downloading "main" ends up unnecessarily pulling all the quantization combinations. So it's always worth glancing at the repo in a browser first.
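To target one of those branches from Python, pass it as the revision argument (the repo and branch names here are just illustrative - check the repo page for the real ones):
from huggingface_hub import snapshot_download
# each quantization variant lives on its own branch in repos laid out this way
snapshot_download(repo_id="TheBloke/Llama-2-13B-chat-GPTQ", revision="gptq-4bit-32g-actorder_True", local_dir="/root/models/llama2-gptq/")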
I simply use the Text Generation WebUI (https://github.com/oobabooga/text-generation-webui) for downloading models and interacting with them :-)
The @crunchbits server works fine; I think it is best to use it on-demand atm, so I'm waiting for their hour-based pricing.
Yep - for GUI use it's probably the best. Started with that & still use it for chatbot stuff.
llama.cpp becomes more interesting if you want to use it for integration into coding projects (also potentially LangChain).
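E.g. a minimal sketch, assuming the llama-cpp-python bindings (a separate pip package, which needs to be built with cuBLAS enabled to actually use the GPU):
from llama_cpp import Llama
# load the quantized model from the steps above, offloading all layers to the GPU (same idea as -ngl 99)
llm = Llama(model_path="/root/llama.cpp/models/llama2-q8.gguf", n_gpu_layers=99)
out = llm("Tell me a story about a unicorn!", max_tokens=256)
print(out["choices"][0]["text"])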
A 16 GB GPU and models that are hundreds of gigabytes in size. I hope that one day we'll see more manageable and feasible requirements.