CommunicationInstance: System.IO.IOException: Remote message handler throws an exception. #326

Open
TaviTruman opened this issue Jul 7, 2020 · 8 comments

Comments

@TaviTruman

The GraphEngine Client crashes when attempting to connect to a Graph Engine Server running inside a Service Fabric cluster.
image

  • I'm working with Microsoft support and the Service Fabric product team to resolve Load Balancer and Reverse Proxy configuration issues.
    Here is the offending source code:
    image

At line 28 in the source, we are just trying to connect to a TCP endpoint in the cluster; we can reach the GE service listening on the exposed SF listener. It looks like the client can connect, but the custom IMessagePassingEndpoint seems to fall down when trying to send or receive data.
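
To separate plain TCP reachability from the message-level failure described above, a small probe can confirm that the exposed endpoint accepts connections at all. This is only a sketch: `TcpProbe` and `CanConnectAsync` are hypothetical names, the host and port are whatever your listener exposes, and the probe does not speak the Trinity protocol, so a successful result says nothing about whether `RegisterClient` will succeed.

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical helper, not part of Graph Engine: checks Layer 4 reachability only.
public static class TcpProbe
{
    // Returns true if a TCP connection to host:port is established within the timeout.
    public static async Task<bool> CanConnectAsync(string host, int port, TimeSpan timeout)
    {
        using (var client = new TcpClient())
        {
            Task connect = client.ConnectAsync(host, port);
            Task finished = await Task.WhenAny(connect, Task.Delay(timeout));

            // Timed out: disposing the client on exit aborts the pending attempt.
            if (finished != connect)
                return false;

            try
            {
                await connect; // rethrows if the connection was refused or unreachable
                return client.Connected;
            }
            catch (SocketException)
            {
                return false;
            }
        }
    }
}
```

If the probe succeeds while the GE Client still throws, the problem is more likely in the request/response path than in basic load balancer pass-through.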

@TaviTruman

TaviTruman commented Jul 9, 2020

More data and information for research

System.IO.IOException
HResult=0x80131620
Message=Remote message handler throws an exception.
Source=Trinity.Core
StackTrace:
at Trinity.Storage.RemoteStorage._error_check(TrinityErrorCode err)
at Trinity.Storage.RemoteStorage._use_synclient(Func`2 func)
at Trinity.Storage.RemoteStorage.SendMessage(Byte* message, Int32 size, TrinityResponse& response)
at Trinity.Network.CommunicationModule.SendMessage(IMessagePassingEndpoint endpoint, Byte* buffer, Int32 size, TrinityResponse& response)
at Trinity.Storage.MessagePassingExtensionMethods.SendMessage[T](IMessagePassingEndpoint storage, Byte* message, Int32 size, TrinityResponse& response)
at Trinity.Client.TrinityClientModule.MessagePassingExtension.RegisterClient(IMessagePassingEndpoint storage, RegisterClientRequestWriter msg) in G:\IKW-GraphEngine\src\Modules\GraphEngine.Client\Trinity.Client\obj\Debug\netstandard2.0\GeneratedCode\Lib\Protocols.cs:line 279
at Trinity.Client.ClientMemoryCloud.RegisterClient() in G:\IKW-GraphEngine\src\Modules\GraphEngine.Client\Trinity.Client\ClientSide\ClientMemoryCloud.cs:line 48
at Trinity.Client.TrinityClient.RegisterClient() in G:\IKW-GraphEngine\src\Modules\GraphEngine.Client\Trinity.Client\ClientSide\TrinityClient.cs:line 93
at Trinity.Client.TrinityClient.StartPolling() in G:\IKW-GraphEngine\src\Modules\GraphEngine.Client\Trinity.Client\ClientSide\TrinityClient.cs:line 81
at Trinity.Network.CommunicationProtocolGroup._RaiseStartedEvent()
at Trinity.Network.CommunicationInstance._RaiseStartedEvents()
at Trinity.Network.CommunicationInstance.Start()

@TaviTruman

I'm uploading the trinity.log file

trinity-[07_08_2020_08_40_03_PM].log

@TaviTruman

TaviTruman commented Jul 9, 2020

I am able to reproduce the problem on both my local SF cluster and my Azure SF cluster. It looks like we aren't getting a connection to the remote IClientRegistry (memory cloud): the socket connects and we can send data, but we are unable to receive data.

@TaviTruman

TaviTruman commented Jul 18, 2020

With the SF Load Balancer configured to let TCP traffic flow through to the SF cluster, traffic seems to flow into the cluster but can't flow out.

image

image

This is incoming traffic from the Azure LB, via the health probe:

image

image

@TaviTruman

TaviTruman commented Jul 24, 2020

Okay, I've travelled way down the GE rabbit hole now and have an open ticket with Microsoft Service Fabric Support. We are making progress, folks. Actually connecting a GE Client to a GE Server running in an Azure Service Fabric cluster is a matter of configuring the Azure LB at Layer 4 to let TCP traffic pass through; once that is done properly, the GraphEngineClient API set is able to partially connect to the GE Service instance running in the SF cluster. I have come to understand and appreciate the brilliance of the Graph Engine networking stack: connecting to the server is a multi-step process, and the GE is dogged about keeping that connection in place.

Here's all you need to do to point the GE Client at your SF cluster:

image

FYI: I've been documenting the ins and outs of developing with the GE in SF and will publish them at my GitHub GraphEngine repository soon.

This is what I found most recently. Processing on the GE Client side makes this call into the GE Server:

image

The GE Client sets up client response registration with the server and gets ready to start polling the server before each RPC call, as well as before lower-level Graph Engine network infrastructure RPCs into the server; the GE is truly type-safe and distributed across memory cloud instances, even in the SF cluster. The call, however, fails on the GE Server side when running in an SF cluster; otherwise, the stuff just works.

image

This is bad: as a result, a true or complete connection is never made. In the meantime the GE Client keeps retrying the polling, and of course that fails too.

I've got another remote debugging session scheduled with Mike Wong from MS SF Support; this guy is great! I think we are getting down to the root cause of this thing, and then a fix can be applied.

@TaviTruman

Okay, so when the GE Client connects (TrinityClient.Start()) to my GE Server outside of the SF cluster, RegisterClientHandler is called first, before CheckInstanceCookie, on the server side. This call sequence is very important because RegisterClientHandler is the only method that adds the client cookie to m_client_storages:

```csharp
public override void RegisterClientHandler(RegisterClientRequestReader request, RegisterClientResponseWriter response)
{
    if (m_memorycloud != null)
    {
        response.PartitionCount = m_memorycloud.PartitionCount;

        if (ClientRegistry is null)
            ClientRegistry = m_memorycloud as IClientRegistry;

        var cstg = m_client_storages.AddOrUpdate(request.Cookie, _ =>
        {
            ClientIStorage stg = new ClientIStorage(m_memorycloud) { Pulse = DateTime.Now };

            if (ClientRegistry != null)
            {
                int new_id = ClientRegistry.RegisterClient(stg);
                stg.InstanceId = new_id;
            }

            return stg;
        }, (_, stg) => stg);
        response.InstanceId = cstg.InstanceId;
    }
}
```

When the GE Server is running in the SF cluster, RegisterClientHandler is not called first; instead, CheckInstanceCookie is called first:

```csharp
private ClientIStorage CheckInstanceCookie(int cookie, int instanceId)
{
    if (m_client_storages.TryGetValue(cookie, out var storage) && instanceId == storage.InstanceId) return storage;
    throw new ClientInstanceNotFoundException();
}
```

and m_client_storages is empty. So even though the GE Client has made the initial connection to the server, the order of method calls on the server side seems to be out of order; this suggests something might be off in the lower-level code, such as the MessageDispatcher and DispatcherProc code.
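
The ordering dependency can be reproduced in isolation with a minimal sketch. Everything here is a hypothetical stand-in (`RegistrySketch`, `ClientStorageStub`, `ClientNotRegisteredException`); it models only the cookie dictionary, not the real Trinity types or the message dispatcher, and it uses `GetOrAdd` where the handler above uses `AddOrUpdate`.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

// Hypothetical stand-in for ClientIStorage; only the instance id is modeled.
public sealed class ClientStorageStub
{
    public int InstanceId { get; set; }
}

// Hypothetical stand-in for ClientInstanceNotFoundException.
public sealed class ClientNotRegisteredException : Exception { }

public sealed class RegistrySketch
{
    private readonly ConcurrentDictionary<int, ClientStorageStub> _storages =
        new ConcurrentDictionary<int, ClientStorageStub>();
    private int _nextId;

    // Plays the role of RegisterClientHandler: the only place a cookie is added.
    public int Register(int cookie)
    {
        ClientStorageStub stg = _storages.GetOrAdd(cookie, _ => new ClientStorageStub
        {
            InstanceId = Interlocked.Increment(ref _nextId)
        });
        return stg.InstanceId;
    }

    // Plays the role of CheckInstanceCookie: throws unless Register ran first
    // for this cookie and the instance id matches.
    public ClientStorageStub Check(int cookie, int instanceId)
    {
        if (_storages.TryGetValue(cookie, out var stg) && stg.InstanceId == instanceId)
            return stg;
        throw new ClientNotRegisteredException();
    }
}
```

Calling `Check(cookie, id)` on a fresh `RegistrySketch` throws, while calling `Register(cookie)` first and passing the returned id to `Check` succeeds. That matches the symptom above: any request that reaches the cookie check before the registration message has been handled will fail.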

@TaviTruman

This code is called from a GE Client over Trinity Native (default) sockets when the GE Client is directed to connect to a GE Server running in an SF cluster:

image

On the server side, this code is called via DispatchMessage:

image

@TaviTruman

After working with the SF team for a few weeks, it is clear there is a deficiency in the Graph Engine TCP/IP stack; I've been able to narrow it down to code in the TCP layer. I'll come back here with updates on our continued development and testing as we perfect the GE/SF integration, specifically external GE Client connections to an SF/GE cluster.
