Skip to content
This repository has been archived by the owner on Jul 12, 2023. It is now read-only.

enable heartbeat after reconnection #21

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

gharia
Copy link

@gharia gharia commented Apr 13, 2020

If in the case for any reason, the kurento client gets disconnected from kurento server WS, it gets reconnected through default reconnection mechanism enabled in the client. However, after reconnection, it seems that it does not start the heartbeat(ping/pong). And hence I think (not 100% sure) after sometime kurento server removes resources associated (pipeline, media endpoints, etc).

See more discussion around this along with logs at https://groups.google.com/forum/#!searchin/kurento/Erasing$20old$20connection$20associated$20with/kurento/_MnIC3bIEYA/Oy6T7QXnAwAJ

What is the current behavior you want to change?

As per my understanding, kurento server monitors the resources and removes stale resources to save CPU. So in the case of reconnection, after the kurento client gets reconnected it does not start heartbeat. Which in turn makes kurento server to mark the associated resources as stale and remove them (This is my understanding maybe authors can confirm). This stops streaming. No issue has been filed fo this.

What is the new behavior provided by this change?

It should start the heartbeat after reconnection and probably the server should conserve and resume the old resources (pipeline, WebRtcEndpoint, RtpEndpoint, etc) associated with this client session.

How has this been tested?

I haven't tested it. Testing is difficult because reproducing/forcing reconnection is difficult. Any advice is helpful here.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature / enhancement (non-breaking change which improves the project)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • My change requires a change to the documentation
  • My change requires a change in other repository

Checklist

  • I have read the Contribution Guidelines
  • I have added an explanation of what the changes do and why they should be included
  • I have written new tests for the changes, as applicable, and have successfully run them locally

@jenkinskurento
Copy link
Contributor

Hi there, thanks for your Pull Request!

A Kurento member needs to verify that this patch is reasonable to test. In case it is, they should write a comment with the phrase test this please. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by Kurento members will still work. Regular contributors can be whitelisted to skip this step.

@j1elo j1elo self-assigned this Apr 21, 2020
@j1elo
Copy link
Member

j1elo commented Apr 21, 2020

Relevant execution flow analysis: https://groups.google.com/d/msg/kurento/_MnIC3bIEYA/vDqNt3KlAQAJ

@gharia , were you able to test with this change and see that the reconnection works and doesn't get dropped after 10 seconds? (i.e. that this change does indeed bring the desired behavior)

@j1elo
Copy link
Member

j1elo commented Apr 21, 2020

@micaelgallego do you think this was just a small typo? or maybe hiding some other behavior of the client reconnection?

@gharia
Copy link
Author

gharia commented Apr 21, 2020

@j1elo I was not able to test this one, but as mentioned in the google group discussion, I have seen this in our deployment. However, the disconnection happens intermittently, It occurs only in case of a crazy CPU spike due to the composite unit, which leads to delayed or no response to ping from Kurento client. So testing is difficult. The only way I see is to put a debug point in kurento server code which is responsible to send pong and see if the client disconnects after 10 seconds. And then resume the flow and check if it reconnects to the same kurento server and obtains the previously created media elements (pipelines, endpoints, etc) associated with that Kurento client. And I am not familiar with that part(Kurento Server code & local setup for execution from code).

@micaelgallego
Copy link
Member

@micaelgallego do you think this was just a small typo? or maybe hiding some other behavior of the client reconnection?

It seems a typo... but extensive testing is needed before accepting this PR

@j1elo
Copy link
Member

j1elo commented Apr 22, 2020

@gharia we're super focused right now with upcoming release of OpenVidu 2.13, so we might take some time until we can test this change.

However, if the issue is right now negatively affecting you in a daily basis, then maybe it is worth it that you just make that change and try in your deployment. Worst case would be that the issue still happens, just like it happens now... but best case would be that the issue is solved, so your deployment would actually work better :-)

@gharia
Copy link
Author

gharia commented Apr 22, 2020

@j1elo Sure sure, I will try that as well.

I am want to get rid of it. A strange thing I noticed just now. The CPU usage was normal, max going 18%, and suddenly I started getting a timeout exceptions

org.kurento.jsonrpc.JsonRpcException: [KurentoClient]  Timeout of 10000 milliseconds waiting from response to request {"id":983,"method":"create","params":{"type":"WebRtcEndpoint","constructorParams":{"mediaPipeline":"6705d62d-af2c-4088-9e47-972991dd4a4d_kurento.MediaPipeline"},"properties":{},"sessionId":"f01b599d-92c6-472e-9ad4-4ebc8515faaf"},"jsonrpc":"2.0"}
	at org.kurento.jsonrpc.client.AbstractJsonRpcClientWebSocket.internalSendRequestWebSocket(AbstractJsonRpcClientWebSocket.java:399)
	at org.kurento.jsonrpc.client.AbstractJsonRpcClientWebSocket$1.internalSendRequest(AbstractJsonRpcClientWebSocket.java:141)
	at org.kurento.jsonrpc.internal.JsonRpcRequestSenderHelper.sendRequest(JsonRpcRequestSenderHelper.java:75)
	at org.kurento.jsonrpc.internal.JsonRpcRequestSenderHelper.sendRequest(JsonRpcRequestSenderHelper.java:69)
	at org.kurento.jsonrpc.client.JsonRpcClient.sendRequest(JsonRpcClient.java:112)
	at org.kurento.client.internal.transport.jsonrpc.RomClientJsonRpcClient.sendRequest(RomClientJsonRpcClient.java:228)
	at org.kurento.client.internal.transport.jsonrpc.RomClientJsonRpcClient.create(RomClientJsonRpcClient.java:157)
	at org.kurento.client.internal.transport.jsonrpc.RomClientJsonRpcClient.create(RomClientJsonRpcClient.java:147)
	at org.kurento.client.internal.client.RomManager.create(RomManager.java:59)
	at org.kurento.client.internal.client.RomManager.createWithKurentoObject(RomManager.java:257)
	at org.kurento.client.AbstractBuilder.build(AbstractBuilder.java:65)
	at com.commentary.session.PresenterSessionImpl.startBroadcast(PresenterSessionImpl.java:53)
	at com.commentary.room.PresenterRoomImpl.startBroadcast(PresenterRoomImpl.java:189)
	at com.commentary.CallHandler.handlePresenter(CallHandler.java:198)
	at com.commentary.CallHandler.handleMessage(CallHandler.java:303)
	at com.commentary.CallHandler.handleTextMessage(CallHandler.java:349)
	at org.springframework.web.socket.handler.AbstractWebSocketHandler.handleMessage(AbstractWebSocketHandler.java:43)
	at org.springframework.web.socket.handler.WebSocketHandlerDecorator.handleMessage(WebSocketHandlerDecorator.java:75)
	at org.springframework.web.socket.handler.LoggingWebSocketHandlerDecorator.handleMessage(LoggingWebSocketHandlerDecorator.java:56)
	at org.springframework.web.socket.handler.ExceptionWebSocketHandlerDecorator.handleMessage(ExceptionWebSocketHandlerDecorator.java:58)
	at org.springframework.web.socket.adapter.standard.StandardWebSocketHandlerAdapter.handleTextMessage(StandardWebSocketHandlerAdapter.java:114)
	at org.springframework.web.socket.adapter.standard.StandardWebSocketHandlerAdapter.access$000(StandardWebSocketHandlerAdapter.java:43)
	at org.springframework.web.socket.adapter.standard.StandardWebSocketHandlerAdapter$3.onMessage(StandardWebSocketHandlerAdapter.java:85)
	at org.springframework.web.socket.adapter.standard.StandardWebSocketHandlerAdapter$3.onMessage(StandardWebSocketHandlerAdapter.java:82)
	at org.apache.tomcat.websocket.WsFrameBase.sendMessageText(WsFrameBase.java:395)
	at org.apache.tomcat.websocket.server.WsFrameServer.sendMessageText(WsFrameServer.java:119)
	at org.apache.tomcat.websocket.WsFrameBase.processDataText(WsFrameBase.java:495)
	at org.apache.tomcat.websocket.WsFrameBase.processData(WsFrameBase.java:294)
	at org.apache.tomcat.websocket.WsFrameBase.processInputBuffer(WsFrameBase.java:133)
	at org.apache.tomcat.websocket.server.WsFrameServer.onDataAvailable(WsFrameServer.java:82)
	at org.apache.tomcat.websocket.server.WsFrameServer.doOnDataAvailable(WsFrameServer.java:171)
	at org.apache.tomcat.websocket.server.WsFrameServer.notifyDataAvailable(WsFrameServer.java:151)
	at org.apache.tomcat.websocket.server.WsHttpUpgradeHandler.upgradeDispatch(WsHttpUpgradeHandler.java:148)
	at org.apache.coyote.http11.upgrade.UpgradeProcessorInternal.dispatch(UpgradeProcessorInternal.java:54)
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:59)
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:860)
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1591)
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61)
	at java.lang.Thread.run(Thread.java:748)
	Caused by: java.util.concurrent.TimeoutException: Waited 10000 milliseconds (plus 142377 nanoseconds delay) for com.google.common.util.concurrent.SettableFuture@364001eb[status=PENDING]
	at com.google.common.util.concurrent.AbstractFuture.get(AbstractFuture.java:494)
	at com.google.common.util.concurrent.AbstractFuture$TrustedFuture.get(AbstractFuture.java:97)
	at org.kurento.jsonrpc.client.AbstractJsonRpcClientWebSocket.internalSendRequestWebSocket(AbstractJsonRpcClientWebSocket.java:377)
	... 41 common frames omitted

I tried http GET x.x.x.x:8888/kurento from another machine in the same cluster, just to check if the Kurento's WebSocket is accessible or not. And I got a response.

So this is really kind of no clue what is happening. Why the client gives a timeout? And the timeout says 10000 milliseconds, but I have configured it to be 20000 as below

KurentoClient kurentoClient = new KurentoClientBuilder()
      .setKmsWsUri(kurentoAddress)
      .setConnectionTimeout(20000L)    
      .onConnectionFailed(() -> {
        log.error("Kurento connection status: connection attempt failed");
      })
      .onConnected(() -> {
        log.info("Kurento connection status: connected");
      })
      .onDisconnected(() -> {
        log.error("Kurento connection status: disconnected");
      })
      .onReconnecting(() -> {
        log.info("Kurento connection status: reconnecting...");
      })
      .onReconnected(sameServer -> {
        log.info("Kurento connection status: reconnected");
        log.info("Reconnected to same server: {}", sameServer);
      })
      .connect();

So it is not even respecting the 20000 milliseconds. And when this occurred, I was seeing heartbeat logs in kurento server. So it was connected

2020-04-22T13:15:59,616800 895 0x00007fc091213700    info KurentoWebSocketTransport WebSocketTransport.cpp:296 keepAliveSessions()  Keep alive f01b599d-92c6-472e-9ad4-4ebc8515faaf

I am kind of blank what to do here. Stabilizing it is really seems difficult. I appreciate any help/suggestions from you guys' experience.

@gharia
Copy link
Author

gharia commented Apr 22, 2020

FYI, I am using kurento-client version 6.13.0

@j1elo
Copy link
Member

j1elo commented Apr 22, 2020 via email

@gharia
Copy link
Author

gharia commented Apr 22, 2020

Okay, I will update the deployment tomorrow and get back to you. Even I entered into the shell of the docker image and did curl -i http://x.x.x.x:8888/kurento, to check if it is there any issue accessing the Kurento server WS from the pod. But I got a response

HTTP/1.1 426 Upgrade Required
Server: WebSocket++/0.7.0

So I was sure there is something going on with Kurento client. Thnks for suggestion. I will get back to you with the feedback after I do upgrade client version.

@j1elo
Copy link
Member

j1elo commented Apr 22, 2020 via email

@gharia
Copy link
Author

gharia commented May 8, 2020

@j1elo I am able to recreate the timeout issue, which eventually leads to disconnect and reconnect cycle. Basically I am testing SFU mode (removed Composite) on out staging. And on our broadcaster portal, when broadcaster hit the Start button, I send SDP offer to create a WebRTC connection for broadcasting. When I hit the Start/Stop button continuously, at some points (after a minute or 2) it starts giving request timeout exception. Attached Kurento logs when this happened for reference. kurento_timeout.log

And this is kind of related to reconnection. On the first disconnect, the client is able to reconnect and everything backs to normal. But on 2nd reconnect the kurento-client is not able to reconnect. And only solution is to restart the application (not kurento server). It can be related to not starting ping/pong after reconnection. I will be trying this PR branch today on our staging.

Any insights why it gives timeout? However, I was able to connect to kurento WebSocket using the curl command you gave and that from a k8s pod of the application server.

@gharia
Copy link
Author

gharia commented May 11, 2020

I created a new branch 6.13.2(https://github.com/gharia/kurento-java/tree/6.13.2) from tag 6.13.1 in my forked repo, added reconnection ping/pong fix, and some logs and created a new release 6.13.2. Generated artifacts and installed artifacts in Github in a different branch repository at https://github.com/gharia/kurento-java/tree/repository. I am using that repository in our application and did some testing on staging. I can confirm that it does restart ping/pong after reconnection.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants