[BUG]: When testing locally and looping runsync, it eventually stalls #40
Comments
It seems to be related to the input image. I'm passing the same image, converted to base64, for each iteration of my loop. This isn't a particularly big image, only about 300 KB. It works fine the first couple of times, but then I start seeing this error over and over again:
No idea why it works at first and then doesn't.. |
@vesper8 thanks for reporting this. Do you maybe have a repo / script that we can use to simulate exactly what you are doing? It would help us a lot to get straight into testing. At first glance, it sounds like a problem in ComfyUI itself, but to be sure, we will also test this. |
I don't have a repo that I can share.. but let me explain in greater detail what I'm doing and maybe that will help. I have a Windows machine on my private home network that has a powerful GPU. I set up runpod-worker-comfy there following the setup instructions and forwarded the ports so that I can access the UI, and the API, from any other machine on my home network. Then, from my MacBook I have a very basic Laravel command that sends the workflow and base64 image to the API running on my Windows machine. This works great for a few images, but if I repeatedly send more images it always ends up stalling with the error message above. It's as if the API is not in a ready state at some point and craps out. This isn't a problem when generating images that don't have an input image. I think overall the input image logic introduced in 2.0 could maybe be improved so that we could pass an absolute URL, such as an S3 URL, or maybe we could pass an image file directly instead of having to base64-encode it. Or maybe there could be a way to say "use this one image for all of these generations". I'm not sure.. just throwing out ideas. Maybe the first step is to understand why exactly the image upload works initially and then stops working when the load is too heavy. |
@vesper8 thanks for the detailed explanation. Do you wait in between requests until the previous request has been handled? Or do you send multiple requests at once? |
I use the /runsync endpoint and I don't send another HTTP request to the API until the first one has completed and I've gotten the image back from it. I even added a 1 second wait in between requests. |
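For anyone trying to reproduce this, the loop described above can be sketched roughly as follows. This is not the original Laravel command, only a Python approximation of it. The endpoint URL, the payload shape (`input.workflow` plus an `input.images` list with `name` and `image` keys), and all helper names are assumptions for illustration and should be checked against the worker's README.

```python
import base64
import json
import time
import urllib.request

# Assumed local endpoint; adjust host and port to your own setup.
RUNSYNC_URL = "http://localhost:8000/runsync"


def build_payload(workflow, image_bytes, image_name="input.png"):
    """Wrap a workflow and one base64-encoded input image (assumed payload shape)."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "input": {
            "workflow": workflow,
            "images": [{"name": image_name, "image": encoded}],
        }
    }


def post_json(url, payload, timeout=600):
    """POST a JSON body and return the decoded JSON response."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def run_sequentially(payloads, send, delay_seconds=1.0):
    """Send one request at a time, waiting for each response before the next."""
    results = []
    for index, payload in enumerate(payloads):
        started = time.monotonic()
        result = send(payload)  # blocks until the response has arrived
        elapsed = time.monotonic() - started
        print(f"request {index + 1} finished in {elapsed:.1f}s")
        results.append(result)
        if index < len(payloads) - 1:
            time.sleep(delay_seconds)  # pause between requests
    return results
```

A run of ten identical requests would then look like `run_sequentially([payload] * 10, lambda p: post_json(RUNSYNC_URL, p))`. The per-request timing makes it easier to see at which iteration the stall begins.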
@vesper8 ok thank you, this is enough information to actually start debugging. |
Thank you! I hope you can at least reproduce it easily. I've been working with it today and it continues to happen a lot.. I'm never able to do more than 3 images at a time. And when it stalls.. the UI at http://192.168.2.179:8188/ becomes unreachable and it seems the only thing to do is CTRL-C the docker instance and bring it back up. I tried enabling REFRESH_WORKER in the docker-compose.yml but that doesn't seem to have any effect.. is that only for running on Runpod, with no effect locally? It would be nice to have a similar flag for local testing: a way to start with a clean slate before processing the next image. Right now it's unclear whether my setup is running out of memory or something else is going on.. the log doesn't say much.. isn't there a way to enable more logging? Here are some more logs from a stall that just happened:
|
It seems like it's constantly loading models in and out of memory even though.. as it happens.. I'm using the same nodes and same models for the entire batch.. is keeping the same models loaded for the whole batch something that's possible, something that we can have control over? |
I'm having the same issue. Does anyone have a solution for this? |
Describe the bug
I'm testing the API locally before deploying to Runpod. I'm testing on a 4070 Super. When I make a single call to /runsync it will complete without fail every time.. and do a really nice job of it. But if I loop let's say 10 requests, it will always eventually stall. There is no more output in the terminal, and the fans keep on spinning.. it just gets stuck.. one would guess there might be some kind of memory leak. Or maybe it tries to load the same models again and again and the memory runs out. It's rather hard to debug I guess. This doesn't seem to happen if I'm generating small images but when they are larger images that take longer to generate, it happens without fail.
I should add that I'm using the same checkpoint, and doing the same operations in my loop, so it's not like I'm requesting it to load a different model repeatedly.
Is there a way to force a memory clean in between generations.. or maybe run with a higher level of verbosity?
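One thing that might be worth trying between generations is asking ComfyUI itself to unload models and free memory. The sketch below assumes the bundled ComfyUI build exposes a `POST /free` route that accepts `unload_models` and `free_memory` flags (recent ComfyUI versions do, but this has not been verified for this worker image), and that ComfyUI is reachable on port 8188 as in the local setup. The function names are hypothetical.

```python
import json
import urllib.request

# Assumed ComfyUI address; 8188 is the port used in the local setup.
COMFY_URL = "http://127.0.0.1:8188"


def build_free_request(base_url=COMFY_URL, unload_models=True, free_memory=True):
    """Build (but do not send) a POST request to the assumed /free route."""
    body = json.dumps(
        {"unload_models": unload_models, "free_memory": free_memory}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/free",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def free_comfy_memory(base_url=COMFY_URL, timeout=10):
    """Send the request and return the HTTP status code."""
    request = build_free_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling `free_comfy_memory()` after each completed /runsync call would show whether the stall is memory related. It would not help once ComfyUI has already become unreachable.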