Using Nvidia GPU can return junk in chatbox #300
Replies: 13 comments
-
That's a nifty GPU you have there. It should have enough VRAM to run granite-code:8b, where you might get better results, since the default for granite-code is 3b. This issue belongs in llama.cpp upstream: https://github.com/ggerganov/llama.cpp since that's the engine running this stuff.
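For reference, a minimal sketch of pulling the larger variant with ramalama, assuming the granite-code:8b tag resolves in the default registry (check `ramalama list` if it doesn't):

```
# Assumed tag name; swap in whatever the registry actually publishes for the 8b build.
ramalama run granite-code:8b
```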
-
Thanks! The card is a real workhorse! Wish it had a tad more VRAM. Great, I'll add the issue there and continue forward!
-
@bmahabirbu could you manually print out the llama-cli line that ramalama executes here? It could be useful for filing in llama.cpp. We should also link this issue with the llama.cpp one, so please file a llama.cpp issue as well.
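A quick sketch of how to capture that line, assuming the installed ramalama supports a --dryrun flag (if not, the live process list gives the same information):

```
# Print the underlying llama.cpp command without running the model
# (assumes this ramalama version has --dryrun).
ramalama --dryrun run granite-code

# Fallback: inspect the command line while a chat is in progress.
ps -eo args | grep [l]lama
```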
-
Your initial bug report said granite-code, but now the example says llama3... I'm guessing you first saw it in granite-code? This may not even be a llama.cpp issue; it may be a model issue, but I don't know enough about things to say.
-
Good catch, I checked llama3 and it does the same thing. I have a feeling it's something to do with the configuration for llama.cpp.
And here is granite-code:
When asked "what is 2+2" I get a response from both, so short responses work up until a point. I was digging around and came across this, which describes the issue as basically a config error. Supposedly it was fixed? However, it still doesn't work for me; I could be on an older build. I'll investigate this more.
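To rule ramalama out, one way to reproduce directly against llama.cpp, with a placeholder model path and the -ngl 50 offload from the original report (the -c value is just illustrative):

```
# Placeholder path; -ngl offloads layers to the GPU, -c sets the context window.
llama-cli -m /path/to/granite-code.gguf -ngl 50 -c 4096 -p "what is 2+2"
```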
-
One more interesting test would be to see if this happens when GPU-accelerated via Ollama... They also use llama.cpp as an engine/library but call it a little differently.
-
Btw, if you encounter llama.cpp weirdness in future, don't be afraid to open an issue in llama.cpp yourself @bmahabirbu... Me being a go-between is probably worse than you dealing directly with llama.cpp. But if you don't have an issue open there soon, I'll open it myself, no biggie :)
-
I tested this and didn't get the error!
I followed these steps (but using llama3): https://ollama.com/blog/ollama-is-now-available-as-an-official-docker-image
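For anyone retracing that test, the blog post boils down to roughly these two commands (the NVIDIA Container Toolkit has to be installed first; llama3 is substituted for the model used in the post):

```
# Start the official Ollama image with GPU access and a persistent model volume.
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Chat with the model inside the container; this is where junk output would show up.
docker exec -it ollama ollama run llama3
```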
-
I appreciate it! I wanted to do some digging first before moving forward. I'll keep working on it a little more, but if I don't come up with something I'll create the issue!
-
Hey @ericcurtin, great news! I got this issue fixed by switching the build to CMake and targeting my personal GPU. Building the images took forever! Related to #9848.
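Roughly what such a CMake build looks like, assuming a recent llama.cpp tree where the CUDA switch is GGML_CUDA (older trees used a different flag name) and an RTX 3080, which is compute capability 8.6:

```
# Configure with CUDA enabled, compiling only the 3080's architecture.
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=86
# Release build; restricting the arch list is what keeps image build times tolerable.
cmake --build build --config Release -j
```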
-
Although now some prompts can stall for a couple of seconds. I think this is related to how llama.cpp handles the chatbox prompt: some models have a "user" role, but other models don't, so that can make things a bit weird. It seems like one chat template doesn't fit all models.
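If it is the template, llama-cli can override it per run; a sketch assuming the version in use has a built-in llama3 template and the -cnv conversation flag (the model path is a placeholder):

```
# Force a specific built-in chat template instead of the model's default,
# and run in conversation mode (-cnv) to mimic the chatbox behaviour.
llama-cli -m /path/to/model.gguf -ngl 50 --chat-template llama3 -cnv
```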
-
https://github.com/bmahabirbu/ramalama/blob/nv/container-images/ramalama/latest/Dockerfile_rhel_cuda Here's the new containerfile, just for reference. I'll follow the formatting from main and do a pull request soon.
-
When I run granite-code, every so often (and persistently) the model seems to break and return junk. I'm running it with the container
on an RTX 3080 (10 GB VRAM) using -ngl 50. Could it be that I've exceeded the GPU's VRAM?
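One quick way to check the VRAM theory while the model is loaded, assuming nvidia-smi is available on the host:

```
# Poll GPU memory once a second during a chat; if memory.used sits near 10240 MiB,
# layers or KV cache are likely spilling and a smaller -ngl is worth trying.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 1
```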