Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overlapping resolveServer() calls lead to a crash #437

Closed
KitsuneRal opened this issue Jan 7, 2021 · 2 comments
Closed

Overlapping resolveServer() calls lead to a crash #437

KitsuneRal opened this issue Jan 7, 2021 · 2 comments
Assignees
Labels
crash A crash occurs in the library code

Comments

@KitsuneRal
Copy link
Member

KitsuneRal commented Jan 7, 2021

Apparently, the jobs somehow get entangled with each other, with an immediate cause of the crash being the finished() handler finding that d->resolveJob is already null.

@KitsuneRal KitsuneRal added the crash A crash occurs in the library code label Jan 7, 2021
@KitsuneRal KitsuneRal self-assigned this Jan 7, 2021
@KitsuneRal KitsuneRal changed the title Closely following resolveServer() calls lead to a crash Overlapping resolveServer() calls lead to a crash Jan 7, 2021
@KitsuneRal
Copy link
Member Author

The sequence of events leading to the crash (quite reliably) turned out to be somewhat complicated.

  1. Connection::resolveServer() uses isJobRunning() to find out if the ongoing job should be abandon()ed. However, when a job waits for a retry isJobRunning() returns false because the job is not in Pending state. As a result, the previous job is not cancelled and proceeds to retry.
  2. At this point in time d->baseUrl is already set by the next resolveServer() invocation, and since the job constructs its request at the moment of making it (even if retrying), the new request goes to the new base URL.
  3. As a result, two jobs end up requesting the same server; whichever of them returns first, triggers the lambda handling its finished() signal. If the request was successful or returned 404, the handler calls setHomeserver() which resets d->resolverJob to nullptr.
  4. The other job returns, enters its finished() handler which tries to check the result using d->resolverJob, leading to a segmentation fault.

KitsuneRal added a commit that referenced this issue Jan 7, 2021
KitsuneRal added a commit that referenced this issue Jan 7, 2021
See #437 for the discussion.

(cherry picked from commit 6101971)
KitsuneRal added a commit that referenced this issue Jan 8, 2021
To be very clear what this function checks. See also #437.
KitsuneRal added a commit that referenced this issue Jan 8, 2021
@KitsuneRal
Copy link
Member Author

KitsuneRal commented Jan 8, 2021

The above exposed a few issues:

  1. isJobRunning() shouldn't return false during retry interval. The intention of this facility function is to help with cases where, e.g., the next iteration of a job is invoked from the handler of the previous iteration. A retry means that the previous iteration is not over, and none of its completion handlers (even finished()) was triggered. This is fixed in aa79040 (master, with renaming to isJobPending() in the next commit) and 12e00b2 (0.6.x).
  2. Explicitly resetting a QPointer<> smells of violating the separation of concerns, unless the intention is to explicitly dismiss the ownership/observation for some reason. Abandoning a job leads to resetting the pointer anyway - but only if it still points to that job, which is right. It was enough to not reset d->resolveJob in setHomeserver() for this crash not to happen (and it doesn't break anything, it's not even a pessimisation). This is fixed in 6101971 (master) and 41e57c9 (0.6.x)
  3. The next piece that would render the whole sequence impossible is late binding of base URL in jobs. There's no particular reason for that, except not having to store the network request (or at least the base URL) between retries. On the other hand, late binding in theory may lead to obscure effects around changing the homeserver but that's a corner case: setHomeserver() is only supposed to be called during initial stages of Connection lifecycle, and once the access token is obtained, the homeserver really shouldn't change (see Apply proper RAII to Connection  #439). Perhaps base URL should really be expected to be generally invariant, and the server resolution case be carefully special-cased (as it is by now).
  4. As a side note not directly affecting the issue, the handler in resolveServer() doesn't treat abandoning the job as a special case, leading to emission of resolveError() in such cases. This is not exactly correct; 6af9ae2 (master) and be00308 (0.6.x) fix that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
crash A crash occurs in the library code
Projects
Status: Version 0.6 - Released
Development

No branches or pull requests

1 participant