Index out of range in ghc-events show and dependencies of ghc-events #109

Open
MangoIV opened this issue Aug 13, 2024 · 24 comments
@MangoIV commented Aug 13, 2024

Hi! I have had a couple of problems with eventlog2html and hs-speedscope recently, and they seem to stem either from the library or from the eventlog that the GHC RTS emits. I can reproduce this error fairly reliably (sometimes it doesn't happen) with the following program:

{-# LANGUAGE BlockArguments, NumericUnderscores #-}

module Main where

import Data.ByteString (ByteString)
import qualified Data.ByteString as BS
import Data.Foldable (for_)
import Data.Monoid (Endo (..))

largeEndo :: Endo ByteString
largeEndo = Endo \x -> BS.replicate 100 97 <> x

manyEndos :: Endo ByteString -> Endo ByteString
manyEndos = mconcat . replicate 10_000

main :: IO ()
main = for_ [1 .. 30] \_ ->
  BS.putStr $ appEndo (manyEndos largeEndo) mempty 

compile with
ghc -rtsopts -prof -fprof-late -O0 ./bla.hs
run with
./bla +RTS -hc -p -l-au

I have not tried to further reduce the example.

This happens on GHC 9.6, 9.8, and 9.10 at least, and, according to @TeofilC, on ghc-events HEAD as well as on ghc-events 0.19.0.1.

It affects not only ghc-events but also hs-speedscope and eventlog2html.

related issue on eventlog2html issue tracker:
mpickering/eventlog2html#136

@MangoIV (Author) commented Aug 13, 2024

I think we're failing in GHC.RTS.Events.Binary.getEvent :: EventParsers -> Get (Maybe Event), where we try to index into the parsers and, instead of getting an EventTypeNum within range, we get a really big one.
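As a sketch of that failure mode (hypothetical names, not the actual ghc-events internals): the parser table is indexed by EventTypeNum, and a corrupted stream can produce a number far outside the table, so a bounds check is the difference between a graceful parse error and an "index out of range" crash:

```haskell
import Data.Word (Word16)

-- Hypothetical stand-in for the table of per-event-type parsers that
-- getEvent indexes into (EventTypeNum is a Word16 in ghc-events).
type ParserTable a = [a]

-- Bounds-checked lookup: a corrupted stream can yield an EventTypeNum
-- far outside the table, so return Nothing instead of crashing.
lookupParser :: ParserTable a -> Word16 -> Maybe a
lookupParser table n
  | i < length table = Just (table !! i)
  | otherwise        = Nothing
  where
    i = fromIntegral n
```

With a three-entry table, `lookupParser table 1` succeeds while `lookupParser table 40000` returns Nothing rather than blowing up.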

@TeofilC (Contributor) commented Aug 13, 2024

Running ghc-events with an eventlog from the reproducer triggers an assertion failure at this line:

assert
  ((fromIntegral payloadLen :: Int) == sum
    [ 4
    , 8 -- ticks
    , 1 -- stack depth
    , fromIntegral profStackDepth * 4
    ])
  (return ())

It seems like the payloadLen is wrong for certain EVENT_PROF_SAMPLE_COST_CENTRE events.
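For reference, the size the assertion checks against can be written as a small helper (a sketch mirroring the assertion, not the actual ghc-events code):

```haskell
import Data.Word (Word8)

-- Expected payload size of an EVENT_PROF_SAMPLE_COST_CENTRE event,
-- per the assertion above: capset (4 bytes) + ticks (8) +
-- stack depth (1) + one 4-byte cost-centre id per stack frame.
expectedPayloadLen :: Word8 -> Int
expectedPayloadLen stackDepth =
  sum [ 4                            -- capset
      , 8                            -- ticks
      , 1                            -- stack depth
      , fromIntegral stackDepth * 4  -- cost-centre stack
      ]
```

So an empty stack gives 13 bytes and every frame adds 4 more; a payloadLen that disagrees with this means the surrounding bytes are not what the parser thinks they are.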

@MangoIV (Author) commented Aug 13, 2024

Yup, the last four parses before it blows up look like this:

evSpec: HeapProfSampleCostCentre {heapProfId = 0, heapProfResidency = 912, heapProfStackDepth = 1, heapProfStack = [78]}, etRef: 163
evSpec: ProfSampleCostCentre {profCapset = 369098752, profTicks = 4333765376, profStackDepth = 0, profCcsStack = []}, etRef: 167
evSpec: HeapProfSampleString {heapProfId = 0, heapProfResidency = 2029359963648, heapProfLabel = "\SO"}, etRef: 164
evSpec: CreateThread {thread = 22784}, etRef: 0

It looks like the parse of ProfSampleCostCentre introduces the corruption.

@MangoIV (Author) commented Aug 13, 2024

@TeofilC is it possible that this is the Ccs stack being too deep for the profStackDepth Word8? And then it overflows and we don't read the Vector to end?

@MangoIV (Author) commented Aug 13, 2024

That doesn't seem to be it; you're right, the payloadLen before that is already completely off.

@TeofilC (Contributor) commented Aug 13, 2024

Yes, on the GHC side we truncate to 255, which should be fine.
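A small sketch of why clamping to 255 is fine while a plain narrowing conversion would not be (illustrative only; clampDepth and naiveDepth are made-up names, not RTS code):

```haskell
import Data.Word (Word8)

-- Clamping the depth to 255 before narrowing (as the RTS does) keeps
-- the declared depth consistent with the frames actually written.
clampDepth :: Int -> Word8
clampDepth = fromIntegral . min 255

-- A plain narrowing conversion instead wraps modulo 256: a depth of
-- 300 would be declared as 44, so a parser would stop reading the
-- stack 256 frames too early and fall out of sync with the stream.
naiveDepth :: Int -> Word8
naiveDepth = fromIntegral
```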

@MangoIV (Author) commented Aug 13, 2024

So AFAIU this means that either the parsing of the header on the ghc-events side or the writing of the header on the GHC RTS side goes wrong. (At least the previous events look fine, so I'm not assuming that they're the ones introducing the corruption.)

@TeofilC (Contributor) commented Aug 13, 2024

Strangely enough I can't seem to reproduce this with -threaded. I get the impression that an EVENT_PROF_SAMPLE_COST_CENTRE and an EVENT_HEAP_PROF_SAMPLE_COST_CENTRE event being written at the same time is the cause of this. Yet, I'm not sure how that would be possible with the non-threaded runtime!

@MangoIV (Author) commented Aug 13, 2024

Also don’t they acquire the global eventBuf lock before writing?

@TeofilC (Contributor) commented Aug 13, 2024

I also can't reproduce if I only do heap profiling or time profiling with the non-threaded RTS.

So it seems like we need to be doing both with the non-threaded RTS.

I think it's highly likely that somehow we try to write both a heap sample and a time sample at the same time to the eventlog

@TeofilC (Contributor) commented Aug 13, 2024

Also don’t they acquire the global eventBuf lock before writing?

Ah but that only exists for the threaded RTS

@MangoIV (Author) commented Aug 13, 2024

I think it's highly likely that somehow we try to write both a heap sample and a time sample at the same time to the eventlog

and -threaded safeguards against that because it does proper locking, while the non-threaded RTS doesn't but is still somehow concurrent? That's weird

@TeofilC (Contributor) commented Aug 13, 2024

What seems to be happening is:

  1. we are running a heap profile
  2. we write half of a heap sample event
  3. the time profile timer triggers
  4. we pause what we are doing, and write a time sample event in the middle of our heap event
  5. we unpause and finish writing our heap event

So we end up with something garbled.

This story is backed up by putting a bunch of traces inside the eventlog printing functions in the RTS: that is the order of events they suggest.
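The five steps above can be simulated with a toy length-prefixed encoding (illustrative only, far simpler than the real eventlog format): splicing a second event into a half-written one makes a reader that trusts the declared length consume bytes from the wrong event.

```haskell
import Data.Word (Word8)

-- Toy length-prefixed event encoding: one length byte, then payload.
encode :: [Word8] -> [Word8]
encode payload = fromIntegral (length payload) : payload

-- Decode one event, returning its payload and the rest of the stream.
decodeOne :: [Word8] -> ([Word8], [Word8])
decodeOne (len : rest) = splitAt (fromIntegral len) rest
decodeOne []           = ([], [])

-- Steps 2-5 above: the time sample lands in the middle of the
-- half-written heap sample.
interleaved :: [Word8] -> [Word8] -> [Word8]
interleaved heapEvent timeEvent =
  firstHalf ++ encode timeEvent ++ secondHalf
  where
    enc = encode heapEvent
    (firstHalf, secondHalf) = splitAt (length enc `div` 2) enc
```

Decoding such a stream yields a "heap sample" whose payload contains bytes of the time sample, and leaves the reader misaligned for everything that follows, which matches the garbage seen above.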

@MangoIV (Author) commented Aug 13, 2024

So there must be a context switch somewhere in dumpCensus, even with the non-threaded RTS?

@MangoIV (Author) commented Aug 13, 2024

Is it possible that this is happening because the time profile is running asynchronously (see initTimer -> initTicker -> createAttachedOSThread with handle_tick -> handleProfTick -> traceProfSampleCostCentre)? Maybe there should be a lock for writing to the eventlog?

@MangoIV (Author) commented Aug 13, 2024

so maybe it would work if we'd just keep the ACQUIRE_LOCK stuff in existence, even in the non-threaded runtime? 👀
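The real fix would live in the RTS's C code, but in Haskell terms the idea of keeping the ACQUIRE_LOCK path even in the non-threaded runtime amounts to serialising all writers through one lock (a sketch with made-up names, not the RTS or ghc-events API):

```haskell
import Control.Concurrent.MVar (MVar, newMVar, withMVar)
import Data.IORef (IORef, newIORef, modifyIORef', readIORef)

-- A toy eventlog buffer plus a lock serialising writers, analogous to
-- what ACQUIRE_LOCK on the event-buffer mutex gives the threaded RTS.
data EventLog = EventLog
  { logBuf  :: IORef [String]
  , logLock :: MVar ()
  }

newEventLog :: IO EventLog
newEventLog = EventLog <$> newIORef [] <*> newMVar ()

-- Each event is appended whole while the lock is held, so a time
-- sample can never land inside a half-written heap sample.
postEvent :: EventLog -> String -> IO ()
postEvent lg ev = withMVar (logLock lg) $ \_ ->
  modifyIORef' (logBuf lg) (ev :)

readEvents :: EventLog -> IO [String]
readEvents = fmap reverse . readIORef . logBuf
```

The point of the design is that the append inside postEvent is atomic with respect to other writers; whether the non-threaded RTS can express the same thing cheaply (given its signal-handler-driven ticker) is exactly the open question here.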

@MangoIV (Author) commented Aug 14, 2024

@TeofilC I don't think that your observation is generally right. I was going to try whether I can completely circumvent the issue by using -threaded or by using only one of the two traces, but I cannot: with a more elaborate example, which is probably a bit too bulky to share, I still very often (~30% of the time) get this problem.

@mpickering (Contributor) commented Aug 14, 2024

Is there perhaps some similar issue where the initialisation events (cost centre definitions) are being posted to the output, and that is interrupted by a heap profile event before all of the definitions are dumped?

Does it happen if you are not using a profiled executable? (i.e., don't compile with -prof and use -hT -l)

@mpickering (Contributor) commented:

Another thing to try is a longer profiling interval (-i10); does that fix the issue?

@TeofilC (Contributor) commented Aug 14, 2024

Interesting, @MangoIV. It sounds like there are potentially multiple bugs. In your larger example, maybe you could try to find the last few events before the eventlog gets corrupted. That might help suggest which event is going wrong.

@TeofilC (Contributor) commented Aug 15, 2024

I've written up the bug we found here: https://gitlab.haskell.org/ghc/ghc/-/issues/25165
We should keep looking for the other issues, but I wanted to make sure we didn't forget about this one.

@MangoIV (Author) commented Aug 20, 2024

@mpickering

  • -i10 doesn't fix the issue; it did let one of the eventlogs through, so I guess it decreases the probability of failure, but most runs still fail
  • another thing to note is that I'm killing the application in question with a keyboard signal, so that may be something to consider as well
  • profiling-detail is late
  • it seems to happen a lot less often, if at all, with -hT -l

@mpickering (Contributor) commented:

This seems to confirm what @TeofilC suggested about time profiling events interrupting the writing of other events and leading to corruption

@MangoIV (Author) commented Aug 20, 2024

It also happens when only using -hy, no time profiling. Is time profiling still doing things when we only request heap profiling with -l?
