Using Nvidia GPU can return junk in chatbox #300
Replies: 13 comments
-
That's a nifty GPU you have there. It should have enough VRAM to run granite-code:8b, where you might get better results, since the default for granite-code is 3b. This issue belongs in llama.cpp upstream: https://github.com/ggerganov/llama.cpp since that's the engine running this stuff.
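For reference, a minimal sketch of pulling the larger variant with ramalama, assuming the granite-code:8b tag resolves in the default registry (check `ramalama list` if it doesn't):

```
# Assumed tag name; swap in whatever the registry actually publishes for the 8b build.
ramalama run granite-code:8b
```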
-
Thanks! The card is a real workhorse! Wish it had a tad more VRAM. Great, I'll add the issue there and continue forward!
-
@bmahabirbu could you manually print out the llama-cli line that ramalama executes here? It could be useful for filing in llama.cpp. We should also link this issue with the llama.cpp one, so please file a llama.cpp issue as well.
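A quick sketch of how to capture that line, assuming the installed ramalama supports a --dryrun flag (if not, the live process list gives the same information):

```
# Print the underlying llama.cpp command without running the model
# (assumes this ramalama version has --dryrun).
ramalama --dryrun run granite-code

# Fallback: inspect the command line while a chat is in progress.
ps -eo args | grep [l]lama
```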
-
Your initial bug report said granite-code, but now the example says llama3... I'm guessing you first saw it in granite-code? This may not even be a llama.cpp issue; it may be a model issue, but I don't know enough about things to say.
-
Good catch, I checked llama3 and it does the same thing. I have a feeling it's something to do with the configuration for llama.cpp.
And here is granite-code:
When asked "what is 2+2" I get a response from both, so short responses work up until a point. I was digging around and came across this, which describes the issue as basically a config error. Supposedly it was fixed? However, it still doesn't work for me; I could be on an older build. I'll investigate this more.
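To rule ramalama out, one way to reproduce directly against llama.cpp, with a placeholder model path and the -ngl 50 offload from the original report (the -c value is just illustrative):

```
# Placeholder path; -ngl offloads layers to the GPU, -c sets the context window.
llama-cli -m /path/to/granite-code.gguf -ngl 50 -c 4096 -p "what is 2+2"
```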
-
One more interesting test would be to see if this happens when GPU-accelerated via Ollama... They also use llama.cpp as an engine/library but call it a little differently.
-
Btw, if you encounter llama.cpp weirdness in future, don't be afraid to open an issue in llama.cpp yourself @bmahabirbu... Me being a go-between is probably worse than you dealing directly with llama.cpp. But if you don't have an issue open there soon, I'll open it myself, no biggie :)
-
I tested this and didn't get the error!
I followed these steps (but using llama3): https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image
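For anyone retracing that test, the blog post boils down to roughly these two commands (the NVIDIA Container Toolkit has to be installed first; llama3 is substituted for the model used in the post):

```
# Start the official Ollama image with GPU access and a persistent model volume.
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Chat with the model inside the container; this is where junk output would show up.
docker exec -it ollama ollama run llama3
```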
-
I appreciate it! I wanted to do some digging first before moving forward. I'll keep working on it a little more, but if I don't come up with something I'll create the issue!
-
Hey @ericcurtin, great news! I got this issue fixed by switching the build to CMake and targeting my personal GPU. Building the images took forever! Related to #9848.
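Roughly what such a CMake build looks like, assuming a recent llama.cpp tree where the CUDA switch is GGML_CUDA (older trees used a different flag name) and an RTX 3080, which is compute capability 8.6:

```
# Configure with CUDA enabled, compiling only the 3080's architecture.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
# Release build; restricting the arch list is what keeps image build times tolerable.
cmake --build build --config Release -j
```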
-
Although now some prompts can stall for a couple of seconds. I think this is related to how llama.cpp handles the chatbox prompt: some models have a "user" role, but other models don't, so that can make things a bit weird. It seems like one chat template doesn't fit all models.
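If it is the template, llama-cli can override it per run; a sketch assuming the version in use has a built-in llama3 template and the -cnv conversation flag (the model path is a placeholder):

```
# Force a specific built-in chat template instead of the model's default,
# and run in conversation mode (-cnv) to mimic the chatbox behaviour.
llama-cli -m /path/to/model.gguf -ngl 50 --chat-template llama3 -cnv
```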
-
https://github.com/bmahabirbu/ramalama/blob/nv/container-images/ramalama/latest/Dockerfile_rhel_cuda Here's the new containerfile, just for reference. I'll follow the formatting from main and do a pull request soon.
-
When I run granite-code, every so often (and persistently) the model seems to break and return junk. I'm running it with the container
on an RTX 3080 (10 GB VRAM) using -ngl 50. Could it be that I've exceeded the GPU's VRAM?
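One quick way to check the VRAM theory while the model is loaded, assuming nvidia-smi is available on the host:

```
# Poll GPU memory once a second during a chat; if memory.used sits near 10240 MiB,
# layers or KV cache are likely spilling and a smaller -ngl is worth trying.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```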