Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

analyze flaky TimerTest #345

Draft
wants to merge 51 commits into
base: master
Choose a base branch
from
Draft

analyze flaky TimerTest #345

wants to merge 51 commits into from

Conversation

chrjohn
Copy link
Member

@chrjohn chrjohn commented Dec 17, 2020

With additional logging we can see that there is no heartbeat sent for three seconds. Added generation of thread dump after two seconds...

<20201217-13:18:19, FIX.4.4:ISLD->TW, event> (Accepting session FIX.4.4:ISLD->TW from /127.0.0.1:50618)
<20201217-13:18:19, FIX.4.4:ISLD->TW, event> (Acceptor heartbeat set to 2 seconds)
<20201217-13:18:19, FIX.4.4:ISLD->TW, event> (Received logon)
<20201217-13:18:19, FIX.4.4:ISLD->TW, event> (Responding to Logon request)
<20201217-13:18:19, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.4�9=60�35=A�34=1�49=ISLD�52=20201217-13:18:19.871�56=TW�98=0�108=2�10=168�)
<20201217-13:18:19, FIX.4.4:TW->ISLD, incoming> (8=FIX.4.4�9=60�35=A�34=1�49=ISLD�52=20201217-13:18:19.871�56=TW�98=0�108=2�10=168�)
<20201217-13:18:19, FIX.4.4:TW->ISLD, event> (Received logon)
<20201217-13:18:22, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.4�9=58�35=1�34=2�49=ISLD�52=20201217-13:18:22.948�56=TW�112=TEST�10=201�)
<20201217-13:18:22, FIX.4.4:ISLD->TW, event> (Sent test request TEST)

@chrjohn chrjohn self-assigned this Dec 17, 2020
@chrjohn
Copy link
Member Author

chrjohn commented Dec 28, 2020

Error:  Tests run: 5, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 7.991 s <<< FAILURE! - in quickfix.SocketAcceptorTest
Error:  testQuickRestartOfAcceptor(quickfix.SocketAcceptorTest)  Time elapsed: 3.455 s  <<< FAILURE!
java.lang.AssertionError: Exactly 1 'QFJ Message Processor' thread(s) expected expected:<1> but was:<2>

@chrjohn
Copy link
Member Author

chrjohn commented Dec 30, 2020

Current findings:

  • IIRC I only observed this on MacOS.
  • Strange thing is that no Heartbeat is generated for either client or server but the TestRequest because of the missing Heartbeat is generated on client and server although both are triggered by the QFJ Timer thread.
2020-12-29T00:51:00.6306730Z <20201229-00:51:00, FIX.4.4:ISLD->TW, event> (Accepting session FIX.4.4:ISLD->TW from /127.0.0.1:50680)
2020-12-29T00:51:00.6307980Z <20201229-00:51:00, FIX.4.4:ISLD->TW, event> (Acceptor heartbeat set to 2 seconds)
2020-12-29T00:51:00.6309220Z <20201229-00:51:00, FIX.4.4:ISLD->TW, event> (Received logon)
2020-12-29T00:51:00.6310180Z <20201229-00:51:00, FIX.4.4:ISLD->TW, event> (Responding to Logon request)
2020-12-29T00:51:00.6311230Z <20201229-00:51:00, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.4�9=60�35=A�34=1�49=ISLD�52=20201229-00:51:00.284�56=TW�98=0�108=2�10=152�)
2020-12-29T00:51:00.6312240Z <20201229-00:51:00, FIX.4.4:TW->ISLD, incoming> (8=FIX.4.4�9=60�35=A�34=1�49=ISLD�52=20201229-00:51:00.284�56=TW�98=0�108=2�10=152�)
2020-12-29T00:51:00.6313170Z <20201229-00:51:00, FIX.4.4:TW->ISLD, event> (Received logon)

# nothing happening here.... for three seconds
# QFJ Timer should trigger generation of Heartbeat after 2 seconds but it doesn't
# but TestRequest after about 3 seconds is generated

2020-12-29T00:51:03.3023960Z <20201229-00:51:03, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.4�9=58�35=1�34=2�49=ISLD�52=20201229-00:51:03.301�56=TW�112=TEST�10=179�)
2020-12-29T00:51:03.3071120Z <20201229-00:51:03, FIX.4.4:ISLD->TW, event> (Sent test request TEST)
2020-12-29T00:51:03.3085660Z <20201229-00:51:03, FIX.4.4:TW->ISLD, outgoing> (8=FIX.4.4�9=58�35=1�34=2�49=TW�52=20201229-00:51:03.301�56=ISLD�112=TEST�10=179�)
2020-12-29T00:51:03.3089020Z <20201229-00:51:03, FIX.4.4:TW->ISLD, event> (Sent test request TEST)

Edit: sounds a little like https://www.quickfixj.org/jira/browse/QFJ-448 where Heartbeats are occassionally missing, especially with low Heartbeat intervals...

@chrjohn
Copy link
Member Author

chrjohn commented Jan 18, 2021

After some more failures it seems indeed to happen on MacOS only. The timing seems to be a bit off, causing the heartbeat check to return false (normally isHeartBeatNeeded() should be called once per second):

2020-12-31T22:09:00.0061350Z <20201231-22:09:00, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.4�9=60�35=A�34=1�49=ISLD�52=20201231-22:09:00.005�56=TW�98=0�108=2�10=143�)
2020-12-31T22:09:00.0062670Z XXX FIX.4.4:ISLD->TW isHeartBeatNeeded() = false
2020-12-31T22:09:00.0063950Z <20201231-22:09:00, FIX.4.4:TW->ISLD, incoming> (8=FIX.4.4�9=60�35=A�34=1�49=ISLD�52=20201231-22:09:00.005�56=TW�98=0�108=2�10=143�)
2020-12-31T22:09:00.0065060Z <20201231-22:09:00, FIX.4.4:TW->ISLD, event> (Received logon)
2020-12-31T22:09:00.0066220Z XXX FIX.4.4:TW->ISLD isHeartBeatNeeded() = false
# more than one second pause
2020-12-31T22:09:01.0259920Z XXX FIX.4.4:ISLD->TW isHeartBeatNeeded() = false
2020-12-31T22:09:01.0261070Z XXX FIX.4.4:TW->ISLD isHeartBeatNeeded() = false
# less than one second pause
2020-12-31T22:09:01.9287190Z XXX FIX.4.4:ISLD->TW isHeartBeatNeeded() = false
2020-12-31T22:09:01.9288290Z XXX FIX.4.4:TW->ISLD isHeartBeatNeeded() = false
# more than one second pause
2020-12-31T22:09:03.0341840Z <20201231-22:09:03, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.4�9=58�35=1�34=2�49=ISLD�52=20201231-22:09:03.029�56=TW�112=TEST�10=186�)
2020-12-31T22:09:03.0343040Z <20201231-22:09:03, FIX.4.4:ISLD->TW, event> (Sent test request TEST)
2020-12-31T22:09:03.0344100Z <20201231-22:09:03, FIX.4.4:TW->ISLD, outgoing> (8=FIX.4.4�9=58�35=1�34=2�49=TW�52=20201231-22:09:03.030�56=ISLD�112=TEST�10=178�)
2020-12-31T22:09:03.0345080Z <20201231-22:09:03, FIX.4.4:TW->ISLD, event> (Sent test request TEST)
2020-12-31T22:09:03.0346140Z <20201231-22:09:03, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.4�9=61�35=M�34=3�49=ISLD�52=20201231-22:09:03.030�56=TW�66=somelist�10=209�)
2020-12-31T22:09:03.0347180Z <20201231-22:09:03, FIX.4.4:TW->ISLD, incoming> (8=FIX.4.4�9=58�35=1�34=2�49=ISLD�52=20201231-22:09:03.029�56=TW�112=TEST�10=186�)
2020-12-31T22:09:03.0348230Z <20201231-22:09:03, FIX.4.4:ISLD->TW, incoming> (8=FIX.4.4�9=58�35=1�34=2�49=TW�52=20201231-22:09:03.030�56=ISLD�112=TEST�10=178�)
2020-12-31T22:09:03.0349280Z <20201231-22:09:03, FIX.4.4:TW->ISLD, incoming> (8=FIX.4.4�9=61�35=M�34=3�49=ISLD�52=20201231-22:09:03.030�56=TW�66=somelist�10=209�)
2020-12-31T22:09:03.0350330Z <20201231-22:09:03, FIX.4.4:ISLD->TW, outgoing> (8=FIX.4.4�9=58�35=0�34=4�49=ISLD�52=20201231-22:09:03.031�56=TW�112=TEST�10=180�)
2020-12-31T22:09:03.0351280Z XXX FIX.4.4:ISLD->TW isHeartBeatNeeded() = false

Maybe we should just ignore it, otherwise I am asking myself why it only happens on MacOS.

@the-thing
Copy link
Contributor

How often does it happen? I tried to recreate it, but I modified the YAML script slightly to run "TimerTest" only and after 5 attempts it passed every time on MacOS Java 8.

Either it takes more attempts to fail or it makes me think that some other test case does not properly close/recycle a resource on MacOS and interferes with this test.

@chrjohn
Copy link
Member Author

chrjohn commented Jan 19, 2021

In the complete test suite I think it failed about every two or three runs.
I can imagine that it has to do something with resources but to me it looks like some kind of OS-resource since the timer seems to fire very unreliably as if there is a higher CPU load.

@chrjohn
Copy link
Member Author

chrjohn commented Apr 18, 2021

The test uses a java.util.Timer which is sensitive to system clock changes (which seem to happen from time to time on virtualized systems). Will test if converting to use ScheduledExecutorService fixes the issues.

References:
https://stackoverflow.com/questions/409932/java-timer-vs-executorservice
https://stackoverflow.com/questions/17588167/what-should-timertask-scheduleatfixedrate-do-if-the-clock-changes
https://stackoverflow.com/questions/18803695/how-is-the-timer-class-in-java-sensitive-to-the-system-clock
https://bugs.java.com/bugdatabase/view_bug.do?bug_id=4290274

@chrjohn chrjohn marked this pull request as draft July 5, 2021 11:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants