Render connection problems on last msg send causes render task to freeze #585
Comments
Hello, Jan! cgru/afanasy/src/render/main.cpp Lines 166 to 176 in 11e78f2
A loop inside a loop is not needed. |
Hi Timur, here are the most common log entries that we found on frozen afrender clients:
and
Is there any way to get more verbose logging, especially on what is going on with our sockets? To me it would be interesting to get a log output every time RCVTIMEO or SNDTIMEO is reached (on both ends, afserver and afrender); I suspect this to be the cause of the problem we are trying to debug. I have a feeling that in our network we should increase RCVTIMEO and SNDTIMEO, but I am not sure by how much. Will this setting interfere with other timeouts like zombietime etc.? Also, please remind me where we can configure the delay that afrender waits between its connection attempts to the server. I think this should be longer than 120 secs because it may be that:
Cheers |
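For reference, SO_RCVTIMEO / SO_SNDTIMEO are the standard POSIX socket options behind these timeouts. A minimal sketch of how they are typically applied to a socket (generic socket code, not the actual CGRU implementation; the function name and timeout value are made up for the example):

```cpp
#include <sys/socket.h>
#include <sys/time.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Generic example: apply receive/send timeouts to an already connected socket.
// 'sfd' is a TCP socket descriptor; the timeout value is only illustrative.
static bool setSocketTimeouts(int sfd, int seconds)
{
    timeval tv;
    tv.tv_sec  = seconds;
    tv.tv_usec = 0;

    if (setsockopt(sfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)) != 0 ||
        setsockopt(sfd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) != 0)
    {
        fprintf(stderr, "setsockopt failed: %s\n", strerror(errno));
        return false;
    }
    return true;
}
```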
Hi, Oliver! |
You can monitor socket states with a command such as |
Hi Timur, Today we had this behaviour:
see this log:
Can you add a mechanism that detects such ghost tasks? Or is there already something in place that just does not work in this case? Cheers |
Hello, Oli! |
I just wanted to add that we're seeing the exact same thing. It tends to happen more often when the server is under heavy load and tasks are fast to exit. My theory is that the task starts and stops before the server acknowledges the starting of the task, but I have not yet been able to reproduce the issue in a non-production system. |
Hey guys, I am writing to update you (@timurhai @lithorus) with some statistics on how many of these "frozen tasks" we have. @lithorus In the following charts you can see the "amount of renders exited"; a 14 means that we exited 14 renders at that exact moment. Here are the exits of the last 90 days, and here a smaller-scaled view of the last 7 days. This little extra task-timeout-watcher script "works", since almost no extra wrangling is needed, but the underlying issue really is the biggest problem we are facing at the moment. The problem is evenly spread among all our renders and parsers.
@timurhai, our IT monitored the stats and couldn't find anything suspicious in the numbers or the behavior. I checked the code myself (I am not a C++ pro), and judging by the typical error message the problem may need to be solved somewhere around here: cgru/afanasy/src/libafanasy/name_afnet.cpp Lines 486 to 493 in 1fd2bc4
Since this returns false in our case, the render should exit itself anyway, but it seems like that does not work. cgru/afanasy/src/libafanasy/name_afnet.cpp Lines 308 to 314 in 1fd2bc4
If any of you (@lithorus @timurhai) has an idea how we could detect "frozen tasks" earlier (maybe directly in the Afanasy code ourselves) so that we do not have to wait 30 minutes or more, it would be very helpful. Thanks a lot, Best |
@eberrippe Good idea about the cron job, although 30 min is a bit long for many of our tasks. But it is good to have as a fallback; we might implement the same here as a stopgap solution. My theory is that it happens if there's packet loss and/or the server fails to process the "I'm done" request from the client, so the server still thinks the task is running. One way to detect it as well: if you try to pull the output log of the task, it will return something along the lines of "Render has no task". One way to solve it might be some kind of keep-alive check of the tasks between server and render node. |
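Such a keep-alive could be sketched as a per-task watchdog on the server side: stamp every running task whenever an update arrives and flag tasks that stay silent too long. This is a hypothetical illustration only; the class and method names below are invented for the example and are not CGRU's actual structures:

```cpp
#include <chrono>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-task keep-alive watchdog (illustration only).
// The server stamps each running task whenever an update arrives;
// tasks that have been silent for longer than a threshold are reported.
struct TaskWatchdog
{
    using Clock = std::chrono::steady_clock;

    void onTaskUpdate(const std::string & task_id)
    {
        m_last_seen[task_id] = Clock::now();
    }

    // Returns the ids of tasks with no update for longer than 'timeout'.
    std::vector<std::string> findStale(std::chrono::seconds timeout) const
    {
        std::vector<std::string> stale;
        const Clock::time_point now = Clock::now();
        for (const auto & entry : m_last_seen)
            if (now - entry.second > timeout)
                stale.push_back(entry.first);
        return stale;
    }

    std::map<std::string, Clock::time_point> m_last_seen;
};
```

A task flagged this way could then be re-queued or marked as an error instead of staying in the RUN state forever.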
@lithorus +1 on what you've mentioned; we think exactly the same. Sometimes the files are created correctly, but the render just won't close the socket completely. Hoping that Timur maybe has a solution to make the connection more failure-tolerant and error-proof. |
@lithorus And here is a very brief guide on how our script works at the moment:
(Sure you would need some kind of "if progress was done: delete the entry from the DB" stuff, but you get our basic approach.) Best |
Hello!
This means that this system call returns < 0:
And produces this system error message: Resource temporarily unavailable
What does the server log show at that time? Do you want the "stuck" render to exit, so that it becomes a zombie on the server and can later be started again? |
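For context on the "Resource temporarily unavailable" message above: with SO_RCVTIMEO set, a blocking recv() that hits the timeout returns -1 with errno set to EAGAIN/EWOULDBLOCK, and strerror(EAGAIN) produces exactly that text. A generic sketch of where such a timeout could be logged explicitly (not the actual CGRU code; the function name is made up):

```cpp
#include <sys/types.h>
#include <sys/socket.h>
#include <cerrno>
#include <cstdio>
#include <cstring>

// Generic sketch: read from a socket that has SO_RCVTIMEO set and log
// explicitly when the receive timeout (EAGAIN / EWOULDBLOCK) is reached.
static ssize_t readWithTimeoutLogging(int sfd, char * buffer, size_t size)
{
    ssize_t bytes = recv(sfd, buffer, size, 0);
    if (bytes < 0)
    {
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            fprintf(stderr, "RCVTIMEO reached on socket %d: %s\n",
                    sfd, strerror(errno)); // "Resource temporarily unavailable"
        else
            fprintf(stderr, "recv() failed on socket %d: %s\n",
                    sfd, strerror(errno));
    }
    return bytes;
}
```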
@timurhai The server does not recognize that the render has problems and shows both the render and the task as RUN. If we want to stop the job, pause the job, eject the task, or delete the job, nothing works. That's probably why the original task-progress timeout does not work either. Example: the render is RUN, the task is RUN. I see the runtime is 40 mins but the task progress is 0%. I click "delete job": nothing happens. I exit the render, and suddenly the job deletes itself. It seems almost like the commands are queued and cannot take effect until the job's task is released by exiting the render client.
This is our only option to make everything work again at the moment. That's why I am asking: is there any option to fix this issue? Thanks a lot |
So, does the server log show errors or anything unusual when renders get stuck?
Also you can try to use
Maybe the system error will then be different.
|
We have the same problem, and ours are:
However we have the issue on another server with default options. |
Can you provide server logs from when a render gets stuck? You can increase the pool heartbeat seconds for renders on a big farm (pool).
Here are our settings: we have 1000 render clients connecting, ~200 monitors (heartbeat 2.5 sec), ~NIMBY-Trays-Setters (heartbeat 5 sec), and ~5 get-cronjobs (heartbeat 5 sec). cgru/3.3.0/config_default.json
cgru/3.3.0/config.json
cgru/3.3.0/afanasy/config_default.json:
cgru/3.3.0/afanasy/config.json: <--- (Since this is not included anywhere, I suppose it is unused. 🤔 )
Our server output data is a little sensitive, because the project names are also written into the job names, but here is a slightly anonymized excerpt. It is hard to correlate the errors here with the non-replying renders, because we cannot find the exact time at which a render was causing issues:
If I understand you correctly, you think that the renders are kind of DDoS'ing our server. By raising the heartbeat (https://cgru.readthedocs.io/en/latest/afanasy/pools.html#heartbeat-sec) of the root pool to 2 or 3 sec, we would slow down the total number of connection attempts and reduce our network load? Thanks a lot |
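For a rough sense of scale (assuming the 1 second default heartbeat mentioned in the next comment): 1000 render clients account for on the order of 1000 server connections per second, on top of ~200 monitors at 2.5 seconds (about 80 per second) and the cron jobs. Raising the render heartbeat to 3 seconds would cut the render share to roughly 333 connections per second.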
Since pools appeared (v >= 3.0.0), there are such render parameters as |
I will try to set the heartbeat to 2-3 seconds on our different sites and see whether it changes the number of stuck renders my script catches. Nothing was set before, so the default of 1 sec was being used. But right, this might reduce the symptoms but does not solve the cause. I will keep you posted. |
Hey Timur, I don't know how familiar you are with the MQTT protocol and how it implements so-called Quality of Service (QoS) levels. I am pretty sure that if we could implement at least "QoS 1" (the server acknowledging reception of "task finished" to the client, and the client resending "task finished" if the server does not acknowledge it within a certain time), it would solve this very issue. What do you think? |
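Translated to this context, a "QoS 1" style exchange would look roughly like the sketch below: the render keeps the "task finished" message until the server explicitly acknowledges it, and resends it after a timeout. This is a hypothetical illustration with invented names, not a patch against the CGRU sources:

```cpp
#include <chrono>
#include <cstdio>
#include <functional>
#include <string>
#include <thread>

// Hypothetical "QoS 1" style delivery: keep resending "task finished"
// until the server acknowledges it or we run out of attempts.
// The send/ack callables stand in for the real transport and are
// assumptions made only for this example.
bool deliverUntilAcked(const std::string & msg,
                       const std::function<bool(const std::string &)> & send,
                       const std::function<bool()> & ackReceived,
                       int max_attempts)
{
    for (int attempt = 1; attempt <= max_attempts; attempt++)
    {
        if (send(msg) && ackReceived())
            return true; // server confirmed reception, message can be dropped

        fprintf(stderr, "No ack for \"%s\" (attempt %d/%d), retrying...\n",
                msg.c_str(), attempt, max_attempts);
        std::this_thread::sleep_for(std::chrono::seconds(5));
    }
    return false; // caller decides what to do next (e.g. reconnect, exit)
}
```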
Currently the render updates the server in a loop. But as far as I understand the problem, the render can't read any server answer (the stuck situation). You say that only an afrender restart solves the problem. |
I suppose an immediate render service restart would be the best solution if we can't resolve the problem at its root. Please keep in mind that sometimes other tasks start in parallel on the stuck render and render perfectly fine; these would be killed as well. |
no restarts please...we must fix the root cause and not patch the issue |
I think the first step is to be able to reproduce it in a reliable way. |
Hello! |
correct! the task is stuck and the client happily accepts new tasks and renders them as if nothing ever happened. |
Most of the time they are not even at 0%. That's how I usually detect them: they have started, but don't have a percentage yet. |
Hi Timur, do you remember our chat in the CGRU Telegram yesterday about our connection problems?
The problem is that the render does not notify the server that it finished the task. So the task stays at no progress, but the server can also not eject it via the "task no change timeout". We suppose it is because the socket on the render freezes somehow due to a connection issue. Our only way of fixing it is to exit the render client or restart the afserver.
We were checking the code, and we think we found the place where the issue we are running into originates.
What happens if the last output from the task was not sent successfully? We think there should be some kind of sleep and reconnection retry going on here.
cgru/afanasy/src/render/renderhost.cpp
Lines 274 to 287 in 11e78f2
We imagine a solution could be something like:
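(Illustrative sketch only; the function and constant names below are assumptions for the example, not the actual renderhost.cpp code.)

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Sketch of the idea: retry the final "task output" send with a short
// sleep in between, instead of giving up after a single failed attempt.
// 'trySend' stands in for whatever call actually sends the message.
static const int  kMaxAttempts   = 5;
static const auto kRetryInterval = std::chrono::seconds(10);

bool sendLastOutputWithRetry(bool (*trySend)())
{
    for (int attempt = 1; attempt <= kMaxAttempts; attempt++)
    {
        if (trySend())
            return true; // the last output reached the server

        fprintf(stderr, "Last output send failed (attempt %d/%d), sleeping before retry.\n",
                attempt, kMaxAttempts);
        std::this_thread::sleep_for(kRetryInterval);
    }
    // All attempts failed: let the caller decide what to do (e.g. exit the
    // render so the server turns it into a zombie) instead of hanging forever.
    return false;
}
```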
Or would this cause a total render lag?
Thanks a lot!