Agreed. The way they just dumped support for my card in some update with some vague reason also irked me (we need a newer rocm they said but my card works fine with all current rocm versions)
Also the way they’re now trying to sell cloud AI means their original local service is in competition to the product they sell.
I’m looking to use something new but I don’t know what yet.
If you are running big MoE models that need some CPU offloading, check out ik_llama.cpp. It’s specifically optimized for MoE hybrid inference, but the caveat is that its vulkan backend isn’t well tested. They will fix issues if you find any, though: https://github.com/ikawrakow/ik_llama.cpp/
mlc-llm also has a Vulcan runtime, but it’s one of the more… exotic LLM backends out there. I’d try the other ones first.
Thank you so much!! I have been putting it off because what I have works but a time will soon come when I’ll want to test new models.
I’m looking for a server but not many parallel calls because I would like to use as much context as I can. When making space for e.g. 4 threads, the context is split and thus 4x as small. With llama 3.1 8b I managed to get 47104 context on the 16GB card (though actually using that much is pretty slow). That’s with KV quant to 8b too. But sometimes I just need that much.
I’ve never tried the llama.cpp directly, thanks for the tip!
Kobold sounds good too but I have some scripts talking to it directly. I’ll read up on that too see if it can do that. I don’t have time now but I’ll do it in the coming days. Thank you!
This is the way.
…Except for ollama. It’s starting to enshittify and I would not recommend it.
Agreed. The way they just dumped support for my card in some update with some vague reason also irked me (we need a newer rocm they said but my card works fine with all current rocm versions)
Also the way they’re now trying to sell cloud AI means their original local service is in competition to the product they sell.
I’m looking to use something new but I don’t know what yet.
I’ll save you the searching!
For max speed when making parallel calls, vllm: https://hub.docker.com/r/btbtyler09/vllm-rocm-gcn5
Generally, the built in llama.cpp server is the best for GGUF models! It has a great built in web UI as well.
For a more one-click RP focused UI, and API server, kobold.cpp rocm is sublime: https://github.com/YellowRoseCx/koboldcpp-rocm/
If you are running big MoE models that need some CPU offloading, check out ik_llama.cpp. It’s specifically optimized for MoE hybrid inference, but the caveat is that its vulkan backend isn’t well tested. They will fix issues if you find any, though: https://github.com/ikawrakow/ik_llama.cpp/
mlc-llm also has a Vulcan runtime, but it’s one of the more… exotic LLM backends out there. I’d try the other ones first.
Thank you so much!! I have been putting it off because what I have works but a time will soon come when I’ll want to test new models.
I’m looking for a server but not many parallel calls because I would like to use as much context as I can. When making space for e.g. 4 threads, the context is split and thus 4x as small. With llama 3.1 8b I managed to get 47104 context on the 16GB card (though actually using that much is pretty slow). That’s with KV quant to 8b too. But sometimes I just need that much.
I’ve never tried the llama.cpp directly, thanks for the tip!
Kobold sounds good too but I have some scripts talking to it directly. I’ll read up on that too see if it can do that. I don’t have time now but I’ll do it in the coming days. Thank you!